Oracle Cloud Infrastructure
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, timing, and pricing of any features or functionality described for Oracle’s products may change and remains at the sole discretion of Oracle Corporation.
This FAQ answers common questions about how Oracle achieves resilience and continuous availability of our core infrastructure services and hosting platform. Customers of Oracle Cloud might be interested in these answers for several reasons:
- They help customers practice due diligence when assessing Oracle’s hosting platform and services.
- Many of the answers discuss challenges and solutions that are fundamental to all cloud-scale systems, and so can inform the architecture and design of systems that customers want to build in the cloud.
Resilience and Continuous Availability of Oracle Cloud Infrastructure Services and Platform FAQ
Does Oracle distinguish different classes of service, such as critical services, continuously available services, or single-location services?
We don’t make such distinctions. Instead, we categorize our services by dependency level, availability scope, and data plane versus control plane. These categories are designed to provide various useful tradeoffs among availability, durability, performance, and convenience.
These levels might be considered layers or tiers in an architectural block diagram. Each layer may depend only on the layers below it.
From bottom to top:
- Core services: These services form the foundation of Oracle Cloud Infrastructure. They include Identity and Access Management (IAM), Key Management, Networking, Compute, Block Volumes, Object Storage, Telemetry, and several shared internal services. They are designed to have minimal dependencies, even among each other. (See later in this document for details about dependencies).
- IaaS: This layer provides more infrastructure-level functionality that is built on top of the core. Services in this layer include File Storage, Database, and Container Engine for Kubernetes.
- SaaS: This layer is rich software as a service that is built on lower layers.
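The layering rule above can be expressed as a simple mechanical check. The following Python sketch is purely illustrative; the layer numbers and service names are our own simplification, not Oracle's internal dependency model.

```python
# Illustrative sketch of the layering rule: "each layer may depend only
# on the layers below it" (here, the same layer or lower). The layer
# assignments are assumptions for the example.
LAYER = {"IAM": 0, "Networking": 0, "Compute": 0, "ObjectStorage": 0,
         "FileStorage": 1, "Database": 1, "ContainerEngine": 1,
         "SaaSApp": 2}

def layering_ok(deps):
    """deps maps each service to the services it calls; a dependency is
    allowed only if it points at the same layer or a lower one."""
    return all(LAYER[target] <= LAYER[svc]
               for svc, targets in deps.items() for target in targets)

assert layering_ok({"Database": ["Compute", "ObjectStorage"]})
assert not layering_ok({"Compute": ["Database"]})  # upward dependency
```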
To meet its goals for availability and durability, each service is assigned one of the following availability scopes:
- Availability domain local: Each availability domain contains one independent instance of the service. Such services offer high durability of stored data by using synchronous replication between replicas within the same availability domain (for details, see the section on fault domains later in this document). These services can tolerate an outage of a third or more of the infrastructure in the availability domain, depending on the nature of the service. Availability domain local services achieve this level of fault tolerance by using two different kinds of logical data center—logical groups of fault isolation and performance isolation—within the availability domain. For details, see the sections on fault domains and service cells later in this document. Finally, these services can continue to function as normal even if the availability domain can't communicate with any other availability domains. As a result, they tolerate the loss of other availability domains or the complete failure of the wide area network within the region.
- Multiple availability domain regional: Each region with multiple availability domains contains one independent instance of the service, with components located in each availability domain in that region. These services offer very high durability of stored data by using synchronous replication to multiple availability domains in the same region. These services can tolerate an outage of, or the inability to communicate with, any single availability domain in the region.
- Single availability domain regional: If a region contains only a single availability domain, the observable characteristics of a regional service match those of an availability domain local service, as described earlier. The distinction between an availability domain local service and a single availability domain regional service becomes relevant only when a single-availability-domain region is expanded by adding one or more availability domains. When that happens, each regional service automatically expands to make appropriate use of the new availability domains while remaining a single instance of the service. For example, the Object Storage data plane would expand to use the additional availability domains to improve the durability of existing data. In contrast, for availability domain local services, each new availability domain receives its own new and separate instance of each availability domain local service.
- Distributed across regions: A foundational principle of Oracle Cloud Infrastructure is that each region is as operationally independent from other regions as possible. The qualification "as possible" reflects the fact that regions must necessarily share at least some infrastructure, for example, the inter-region backbone network. Otherwise, we don't build close-coupling mechanisms between regions, such as transparent high availability or failover, that could cause problems that affect multiple regions simultaneously. Instead, we provide two mechanisms for distributing services across regions with loose coupling:
- Disaster recovery (DR): Enabling our customers to build systems with DR characteristics is a cornerstone of our approach and investment in cloud. Several core services already offer DR mechanisms—for example, Block Volumes inter-region backup and Object Storage inter-region copy. All of our services have DR functionality as high priority items in their roadmap.
- Inter-region subscriptions: We currently provide inter-region subscriptions only for IAM data. Conceptually, IAM data has a global scope. Customers can subscribe (opt-in) to a set of regions, and we automatically replicate the relevant IAM data and subsequent updates to the specified regions. To avoid close-coupling, replication is asynchronous and eventually consistent. Customers make modifications to their IAM data in a "home" region that they nominate. If the current home region becomes unavailable or unsuitable for some reason, a different region can be nominated.
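The opt-in replication model described above, with synchronous writes in a home region and asynchronous, eventually consistent propagation to subscribed regions, can be sketched as a queue that is drained in the background. Everything here (the class name, region names, and the queue mechanism) is an illustrative assumption, not the actual IAM implementation.

```python
from collections import deque

class IamReplicator:
    """Sketch: writes land synchronously in the nominated home region and
    are replicated asynchronously (eventual consistency) to subscribers."""

    def __init__(self, home, subscribed):
        self.home = home
        self.stores = {region: {} for region in [home, *subscribed]}
        self.pending = deque()  # updates not yet applied remotely

    def write(self, key, value):
        self.stores[self.home][key] = value  # synchronous, home region only
        self.pending.append((key, value))    # replicated later

    def replicate_once(self):
        """Drain one queued update to every region."""
        key, value = self.pending.popleft()
        for store in self.stores.values():
            store[key] = value

r = IamReplicator("us-ashburn-1", ["eu-frankfurt-1"])
r.write("group-policy", "v2")
# Before replication runs, only the home region sees the update.
r.replicate_once()
# After the queue drains, the subscribed region has converged.
```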
Control Plane Versus Data Plane
The data plane of a service is the collection of data-processing interfaces and components that implement the functionality of the service that is intended to be used by applications. For example, the virtual cloud network (VCN) data plane includes the network packet processing system, virtualized routers, and gateways, while the Block Volumes data plane includes the implementation of the iSCSI protocol and the fault-tolerant replicated storage system for volume data.
The control plane of a service is the set of APIs and components responsible for the following tasks:
- Handling customer requests to provision, reconfigure, scale up/down, or terminate resources
- Performing automated patching of large fleets, rapidly and safely
- Detecting failed, degraded, or misconfigured resources
- Performing automated repair, or paging human operators for assistance
- Collaborating with other control planes (for example, Compute, VCN, and Block Volumes are coupled during LaunchInstance)
- Managing unused capacity
- Coordinating with humans, for example, on arrival of new equipment, and during physical repair and maintenance
- Providing operational visibility and control
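Several of the tasks above (detecting failed or degraded resources, automated repair, and paging operators) amount to a reconciliation loop. The following Python sketch is a generic illustration; the resource model and the repair budget are invented for the example, not taken from any Oracle control plane.

```python
def reconcile(resources, repair, page_operator, max_auto_repairs=3):
    """One pass of a detect-and-repair loop over a fleet of resources.
    Repairs are automated up to a budget; beyond it, a human is paged."""
    repairs = 0
    for resource in resources:
        if resource["state"] == "healthy":
            continue
        if repairs < max_auto_repairs:
            repair(resource)          # automated repair path
            repairs += 1
        else:
            page_operator(resource)   # budget exhausted: escalate to a human
    return repairs

fleet = [{"id": "inst-1", "state": "healthy"},
         {"id": "inst-2", "state": "failed"}]
pages = []
fixed = reconcile(fleet, repair=lambda r: r.update(state="healthy"),
                  page_operator=pages.append)
```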
How does Oracle ensure that services are resilient and continuously available?
For all types of service, we use the same set of engineering principles to achieve resilience and availability, because the fundamental engineering challenges of building fault-tolerant, scalable, distributed systems are the same for all types of service.
To achieve resilience and continuous availability, it’s necessary to understand and then deal with all of the causes of unavailability—degraded performance and unhandled failures—in cloud-scale systems. There are a vast number of such causes, so we group them into categories according to their fundamental nature.
Traditionally, analysis of the availability of enterprise IT systems has focused on the category of hardware failure. However, for cloud systems, hardware failure is a relatively minor and well-understood problem. It's now relatively easy to avoid or mitigate most single points of hardware failure. For example, racks can have dual power feeds and associated power distribution units, and many components are hot-swappable. Large-scale hardware failure and loss is of course possible—for example, because of natural disasters. However, our experience, and reports in public post-mortems from other cloud vendors, shows that failure or loss of an entire data center happens extremely rarely, relative to the other causes of unavailability. Large-scale hardware failure must still be handled (for example, with disaster recovery and other mechanisms), but it is far from being the dominant availability problem.
The dominant causes of unavailability in cloud-scale systems are as follows:
- Software bugs
- Configuration errors
- Mistakes made by human operators
Note: The main lesson from the industry is that these three forms of human error are the biggest causes of unavailability by far. Although their frequency can be reduced by tools, automation, and training, they can't be eliminated. Therefore, they must be tackled as primary concerns in the architecture, design, and implementation of the system.
- Unacceptable variance in performance (latency or throughput) for any reason, including the following ones:
- Multitenant "noisy neighbors" (failure of QoS mechanisms)
- Inability to efficiently reject overload (accidental or malicious) while continuing to do useful work
- Distributed thrash, message storms, retry storms, and other expensive "emergent" interactions
- Cold-shock (empty caches) after power-cycle, particularly simultaneous power-cycle of multiple systems
- Overhead when scaling the system (for example, re-sharding)
- Failure to limit the "blast radius" (number of affected customers and systems) of any of the preceding issues
These challenges are universal—they are part of the "laws of physics" for cloud-scale distributed systems.
For each of the preceding categories, we use proven engineering strategies to tackle the problem. The most important of these are:
- Principles of architecture and system design
- New architectural concepts (which typically arise from applying the principles)
- Service engineering procedures
Principles of Architecture and System Design
Many of these principles exist, but we'll focus on those most relevant to resilience and availability.
To handle software bugs and mistakes by operators that have relatively localized effects, we follow the principles of recovery-oriented computing [1]. At a high level, this means that rather than trying to guarantee that we never have a problem (which is impossible to test), we focus on handling any problems unobtrusively, in a way that can be tested. In particular, we focus on minimizing mean time to recovery (MTTR), which is a combination of mean time to detect, mean time to diagnose, and mean time to mitigate.
Our aim is to recover so quickly that human users aren’t inconvenienced by the issue. The following points help us to achieve this goal:
- Quickly and automatically detect the symptoms of bugs and mistakes by operators, by pervasive use of assertions in code, and active monitoring and alarming at all levels.
- Package functionality into many separate fine-grained units of isolation (threads, processes, fibers, state machines, and so on) that are loosely coupled—that is, they don’t directly share memory that might become corrupted.
- On detecting the symptoms of a bug or a mistake by an operator, automatically restart the enclosing unit of isolation as quickly as possible. Restarting is a practical way to try to recover from an arbitrary failure, because it attempts to reestablish a state that has been well tested, and so restores invariants.
- If recovery at the fine-grained level of isolation doesn’t work (for example, assertions continue to fire at that level too frequently, causing spin-crashing), escalate to the next larger unit (process, runtime, host, logical data center, paging a human operator).
- Build mechanisms to enable a "system-wide undo," including versioning of all persistent state and configuration, in order to quickly identify and undo bad commits.
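The escalation ladder described above can be sketched as a small state machine: restart the smallest enclosing unit of isolation first, and move up one level when restarts at the current level keep failing. The unit names and the spin-crash threshold below are illustrative assumptions.

```python
# Illustrative escalation ladder, smallest unit of isolation first.
ESCALATION = ["thread", "process", "host", "logical data center", "operator page"]

def next_action(level, recent_restarts, spin_limit=3):
    """Return (unit to restart, new escalation level). If restarts at the
    current level keep failing (spin-crashing), escalate one level."""
    if recent_restarts >= spin_limit and level + 1 < len(ESCALATION):
        level += 1
    return ESCALATION[level], level

assert next_action(0, 0) == ("thread", 0)    # first failure: restart thread
assert next_action(0, 3) == ("process", 1)   # spin-crashing: escalate
```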
Minimizing the Effects of Issues
To deal with bugs and mistakes that might have broader effects, we build mechanisms to minimize the "blast radius" of any issues. That is, we focus on minimizing the number of customers, systems, or resources that are affected by any issues, including the particularly challenging issues of multitenant "noisy neighbors," offered overload, degraded capacity, and distributed thrash. We achieve this by using various isolation boundaries and change-management practices (see the following sections).
Architectural Concepts Arising from Design Principles
Many of these concepts exist, but we’ll focus on concepts for limiting the blast radius.
Placement Concepts Enshrined in Our Public API: Regions, Availability Domains, and Fault Domains
Because fault domains are relatively new, we’ll describe those in more detail.
Fault domains are used to limit the blast radius of problems that happen when a system is being actively changed—for example, deployments, patching, hypervisor restarts, and physical maintenance.
The guarantee is that, in a given availability domain, resources in at most one fault domain are being changed at any point in time. If something goes wrong with the change process, some or all of the resources in that fault domain might be unavailable for a while, but the other fault domains in the availability domain aren't affected. Each availability domain contains at least three fault domains, in order to allow quorum-based replication systems (for example, Oracle Data Guard) to be hosted with high availability within a single availability domain.
As a result, for a dominant category of availability problems—software bugs, configuration errors, mistakes by operators, and performance issues that occur during a change procedure—each fault domain acts as a separate logical data center within an availability domain.
Fault domains also protect against some kinds of localized hardware failure. The properties of fault domains guarantee that resources placed in different fault domains don't share any potential single points of hardware failure within the availability domain, to the greatest practical extent. For example, resources in different fault domains don't share the same "top-of-rack" network switch, because the standard design of such switches lacks redundancy.
However, the ability for fault domains to protect against problems in hardware or in the physical environment stops at that local level. In contrast to availability domains and regions, fault domains do not provide any large-scale physical isolation of infrastructure. In the rare case of a natural disaster or availability-domain-wide infrastructure failure, resources in multiple fault domains would likely be affected at the same time.
Our internal services use fault domains in the same way that customers should be using them. For example, the Block Volumes, Object Storage, and File Storage services store replicas of data in three separate fault domains. All components of all control planes and data planes are hosted in all three fault domains (or, in a multiple-availability-domain region, in multiple availability domains).
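The replica-placement rule above (replicas of one piece of data in three distinct fault domains, so no single change window or rack touches more than one replica) can be sketched as follows. The round-robin assignment is our simplification, not the actual placement algorithm of any Oracle service.

```python
def place_replicas(volume_ids, fault_domains=("FD-1", "FD-2", "FD-3")):
    """Assign each volume three replicas in three distinct fault domains,
    rotating the starting domain so load spreads evenly across domains."""
    placement = {}
    for i, volume in enumerate(volume_ids):
        placement[volume] = [fault_domains[(i + k) % len(fault_domains)]
                             for k in range(3)]
    return placement

p = place_replicas(["vol-a", "vol-b"])
assert len(set(p["vol-a"])) == 3  # no two replicas share a fault domain
```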
Service Cells
Service cells are used to limit the blast radius of issues that happen even when a system is not being actively changed. Such problems can arise because the workload of a multitenant cloud system can change in extreme ways at any time, and because complex partial failures can occur in any large distributed system at any time. These scenarios might trigger subtle hidden bugs or emergent performance issues.
In addition, service cells also limit the blast radius in some rare but challenging scenarios when the system is being actively changed. A classic example is when deployment to an individual fault domain appears successful—no errors or change in performance—but as soon as the second or final fault domain has been updated, new interactions within the system (at full cloud scale with production workload) cause a performance issue.
Note that the use of service cells is an architectural pattern, not a concept that is explicitly named in the Oracle Cloud API or SDK. Any multitenant system can use this architectural pattern; it doesn't require special support from the cloud platform.
Service cells work as follows:
- Each instance of the service (for example, in a particular region, or in a particular availability domain for availability domain local services) consists of multiple separate deployments of the service’s software stack. Each separate deployment is called a cell. Each cell is hosted on its own infrastructure as much as is practical. At minimum, cells do not share hosts or VMs.
- A service might start out with a handful of cells in each availability domain or region. As the service scales to meet increasing demand, more cells are added to maintain the limit on the size of the blast radius of any issues. A large, popular service might have many cells. In other words, cells provide n-to-m multiplexing of customer workloads to separate hosting environments—separate islands of resource isolation. Cells don’t have an obvious cardinality, such as exists for fault domains. (As mentioned earlier, an obvious choice for the cardinality of fault domains is three per availability domain, in order to enable quorum-based replication systems to be hosted with high availability in a single availability domain.)
- Each "natural unit" of a customer workload is assigned to a particular cell. The definition of "natural unit" depends on the nature of the particular service. For example, for our internal shared Workflow service (described later), the natural unit might be "all workflows in this availability domain or region for a particular control plane."
- In front of each group of cells is either a minimalistic routing layer or an API for discovering cell endpoints. For example, the Streaming/Messaging system has an API to discover the current data plane endpoint for a particular topic, and the internal Metadata store has a separate endpoint per cell. However, other cell-based services have a single data plane endpoint and a shared routing layer. The routing layer is a potential cause of correlated failure of multiple cells, but this is mitigated by keeping the routing layer extremely simple, predictable, and performant (no expensive operations), and provisioning it with a large amount of headroom capacity and sophisticated QoS quota and throttling mechanisms.
- Service owners can move a workload from one cell to another, as needed. Following are some example scenarios:
- To help avoid the multitenant "noisy neighbor" issue, by moving a heavy workload to a different cell so that the other users of the cell are not impacted.
- To help recover from an overload or a brown-out, perhaps caused by a distributed denial of service attack. We have quota and throttling mechanisms to defend against such attacks, but sometimes edge cases occur in which a particular use case (API, access pattern) is more strenuous for the service than the quota or throttling system currently understands. Cells provide a mechanism for short-term mitigation.
- To separate critical workloads into different cells, to significantly reduce the probability of correlated failure. For example, for our internal shared Workflow for control planes, the "critical core" control planes (for example, Platform, Compute, Networking, and Block Volumes) are each assigned to different cells, and thus have significantly less correlation of failure than they would if cells were not used, or they were assigned to the same cell.
Note: This use of cells reduces the need for customers to consider the internal dependencies of services in order to build resilient applications. Considering the dependency graph is still a good practice (more on that later in this document), but there’s less need for it when a decorrelation mechanism is already active.
The result is that each service cell is yet another kind of "logical data center"—a logical grouping of performance isolation and fault isolation—within a single availability domain or region.
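The cell pattern above—natural units of workload mapped to cells, with an operator-controlled escape hatch for moving a workload—can be sketched as a stable-hash router with an override table. The hashing scheme and the override mechanism are illustrative assumptions, not Oracle's routing implementation.

```python
import hashlib

class CellRouter:
    """Sketch: map 'natural units' of workload to cells by stable hashing,
    with an override table so an operator can pin a workload to a cell."""

    def __init__(self, num_cells):
        self.num_cells = num_cells
        self.overrides = {}  # workload id -> pinned cell

    def cell_for(self, workload_id):
        if workload_id in self.overrides:
            return self.overrides[workload_id]
        digest = hashlib.sha256(workload_id.encode()).hexdigest()
        return int(digest, 16) % self.num_cells

    def move(self, workload_id, cell):
        """Operator action: move a noisy or critical workload to a cell."""
        self.overrides[workload_id] = cell

router = CellRouter(num_cells=8)
home = router.cell_for("workflows:compute-control-plane")
# Isolate this workload from its noisy neighbors by pinning it elsewhere.
router.move("workflows:compute-control-plane", (home + 1) % 8)
```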
In summary, service cells and fault domains complement each other in the following ways:
- Fault domains protect against issues when a system is being actively changed.
- Service cells limit the blast radius when a system experiences potentially severe issues—whether or not it is being actively changed.
We combine the properties of fault domains and service cells into a unified strategy when we perform deployments and patching.
Service Engineering Procedures
Because both testing and operational excellence are critical to the reliability of cloud systems, we have a large number of engineering procedures. Following are some of the more important ones that leverage the concepts mentioned in the preceding section:
- We deploy services incrementally, with careful validation between steps, and reflexive rollback if anything surprising happens. In concrete terms, the process is as follows:
- In each availability domain, we deploy to one service cell at a time. For each cell, we deploy to one fault domain at a time, until we’ve completed all fault domains for that cell. Then, we progress to the next cell in that availability domain.
- After each step of the deployment (after each fault domain and cell), we validate that the change is functioning as intended—that is, we haven't degraded performance or introduced any errors, either internal or external. If anything appears to be wrong or unexpected, we reflexively roll back the change. We greatly emphasize preparation and testing, including automated testing, of rollback procedures, including changes that affect persistent state or schemas.
- In this manner, we deploy the change one availability domain at a time to each region. We deploy to all of the regions in a realm in such a way that we don’t concurrently modify any pair of regions that a customer might be using for primary and disaster recovery sites.
- We regularly verify that error-handling mechanisms and other mitigations work as expected and don’t make the problem worse at scale. Without such testing, it's common for error-handling mechanisms (such as retries, crash-recovery algorithms, and state machine reconfiguration algorithms) to have bugs, be too expensive, or interact in surprising ways, and so cause distributed thrash or other serious performance problems.
- We verify our ability to quickly and safely roll back to last-known good software and configuration, including persistent state and schema, as described previously.
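The deployment order described above can be sketched as nested loops—one cell at a time within an availability domain, one fault domain at a time within a cell—with validation after every step and a reflexive rollback on the first surprise. The callback signatures and names below are illustrative, not an actual deployment tool.

```python
def deploy(ads, cells_per_ad, fault_domains, apply, validate, rollback):
    """Deploy one fault domain at a time, within one cell at a time,
    within one availability domain at a time; roll back on surprise."""
    done = []
    for ad in ads:
        for cell in cells_per_ad[ad]:
            for fd in fault_domains:
                target = (ad, cell, fd)
                apply(target)
                if not validate(target):
                    for t in reversed(done + [target]):  # reflexive rollback
                        rollback(t)
                    return False
                done.append(target)
    return True

applied = []
ok = deploy(["AD-1"], {"AD-1": ["cell-0", "cell-1"]}, ["FD-1", "FD-2", "FD-3"],
            apply=applied.append, validate=lambda t: True,
            rollback=lambda t: None)
assert ok and len(applied) == 6  # 2 cells x 3 fault domains, in order
```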
In Oracle regions that contain multiple availability domains, are all critical services distributed across the availability domains?
Yes. In each region, all availability domains offer the same set of services.
How does Oracle, and its customers, avoid having a critical service that depends on a single logical data center?
In single-availability-domain regions, customers can use fault domains (logical groups with decorrelated failure modes between groups) to achieve most of the properties of separate "logical data centers." Customers can also use multiple regions for disaster recovery (DR).
In multiple-availability-domain regions, customers can use fault domains in the same way. Customers can also use a combination of availability domain local services, inter-availability-domain failover features (such as DBaaS with Data Guard), and regional services (Object Storage, Streaming) to achieve full HA across higher-level "logical data centers" (availability domains). Finally, customers can also use multiple regions for DR.
In all cases, customers can use the concept of service cells to further isolate even the most severe issues, such as distributed thrash.
How does Oracle conduct maintenance activities without making any critical service temporarily unavailable to any customer?
We achieve this via fault domains, service cells, and our operational procedures for incremental deployment and validation. See the discussion earlier in this document.
Are serverless platform services deployed across multiple logical data centers for higher availability?
Yes. All categories of services are deployed across multiple logical data centers—separate logical groupings of fault isolation and performance isolation—for resilience and continuous availability.
If resilience is not the default configuration, are customers offered the choice of a multiple-logical-data-center deployment (for example, a multiple-availability-domain or cross-region configuration)?
In single-availability-domain regions, we offer fault domains as the mechanism for “multiple logical data centers”, as discussed elsewhere in this document.
In multiple-availability-domain regions, we offer services and features that provide an even higher level of physical durability of synchronously replicated data (at a modest performance cost, because of the distance between availability domains in the region and the speed of light).
We do not offer automatic HA or failover mechanisms across regions, because doing so would create a close-coupling relationship between regions and incur the risk that multiple regions may experience problems at the same time. Instead, we enable various forms of asynchronous replication between regions, and offer a growing list of features, such as asynchronous copy and backup, to enable disaster recovery across regions.
How does Oracle help customers avoid correlated failure of applications caused by internal dependencies between the various infrastructure and platform services?
This is a complicated question, so to clarify, we’ll restate it in a couple of different ways:
- If a customer wants to use two Oracle services (service A and service B) and wants to build an application that is resilient if either of those services fails, then does the customer need to know whether service A internally depends on service B? Do internal dependencies lead to correlated failure to a significant degree? If so, then the customer might need to know about such internal dependencies, in order to decide which other uses to make of service A and service B—or whether to instead pull in an unrelated service C for those additional cases—when building their own resilience mechanisms at the application level.
- How does the customer best defend against any correlated failure of Oracle services?
The answer is in two parts.
We use architectural principles that significantly reduce correlated failure across dependent services. In some cases, this technique reduces the probability of correlated failure to such a degree that it can be ignored from the perspective of meeting an availability service level agreement (SLA).
In particular, we use service cells, as described earlier in this document. Cells help with this problem because if internal service A is affected by a problem in one of its dependencies, service B, then the problem with service B is very likely confined to a single cell. Other higher-level services—and the customer's own applications—that use service B are likely to be using other cells that are not affected. This is a probabilistic argument that varies with the number of cells, which is a hidden internal parameter that does change (increases), so no quantification or guarantee is given, beyond the standalone service SLAs of services A and B. But in practice, this can significantly decorrelate failures between services.
Many of our shared internal services—for example, the Workflow and Metadata services for control planes, and the Streaming/Messaging service—use service cells to decorrelate outages for the upstream services that use them.
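The probabilistic argument can be illustrated with a back-of-the-envelope calculation, under our own simplifying assumptions that consumers of a service are assigned to its cells uniformly and that a problem is confined to exactly one cell.

```python
def p_affected(num_cells):
    """Probability that a random consumer is mapped to the one affected
    cell, under the uniform-assignment assumption stated above."""
    return 1.0 / num_cells

# Doubling the number of cells halves any one consumer's exposure,
# which is why cell count grows as a service scales.
assert p_affected(4) == 0.25
assert p_affected(8) == 0.125
```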
The following guidance is high level because the low-level implementation and details of services can and do change. But for the key dimensions of compute, storage, networking, and authentication/authorization, we indicate the following dependencies.
For control planes, the common dependencies are as follows:
- The Identity/Platform data plane for authentication and authorization
- The audit tracking service
- Internal services that provide, for example, workflow, metadata storage, and logging
- Load balancers of various types
Some control planes obviously have service-specific dependencies. For example, the Compute control plane, when launching a bare metal or VM instance, depends on:
- Object Storage (to retrieve the specified operating system image)
- Block Volumes control plane (for provisioning and attaching the boot volume)
- Networking control plane (for provisioning and attaching VNICs)
For core service data planes, the general principle is that each data plane is intentionally designed to have minimal dependencies, in order to achieve high availability, fast time to diagnosis, and fast time to recovery. The results of that principle are as follows:
- The Networking data plane is self-contained.
- The Block Volumes data plane is self-contained.
- Compute bare metal and VM instances depend on the Block Volumes data plane (for boot volumes) and the Networking data plane.
- The Object Storage data plane depends on the Identity/Platform data plane for authentication and authorization (because of industry expectations). The Object Storage data plane does not depend on Block Volumes or File Storage.
- All services that support backup and restore depend on the Object Storage data plane for that feature.
For IaaS data planes, the general principle is to depend only on core or lower-level data planes (in order to avoid cyclic dependencies).
- Database multi-node RAC depends on the Networking data plane and the Block Volumes data plane.
- Container Engine for Kubernetes obviously depends on Kubernetes and its transitive dependencies (for example, etcd) and the Networking data plane.
- All support for backup and restore depends on the Object Storage data plane.
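The acyclicity that this principle preserves can be verified with a standard depth-first search. The graph below restates the dependencies listed in this section; the checker itself is a generic sketch, not an Oracle tool.

```python
def has_cycle(graph):
    """Detect a cycle in a dependency graph via depth-first search."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GRAY
        for dep in graph.get(node, []):
            if color.get(dep, WHITE) == GRAY:
                return True               # back edge: cycle found
            if color.get(dep, WHITE) == WHITE and dep in graph and visit(dep):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in graph)

graph = {
    "Database RAC": ["Networking", "Block Volumes"],
    "Container Engine": ["Networking"],
    "Block Volumes": [],
    "Networking": [],
}
assert not has_cycle(graph)  # depending only downward keeps the graph acyclic
```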
[1] A public research program out of Stanford and Berkeley, led by Armando Fox and Dave Patterson.