Authors:
Adit Sinha, Sid Telang
Acknowledgments:
Alex Lurye, Lauren Hughes
Last update:
March 2022
This page contains a set of recommendations for Google Cloud customers
requiring a highly available External Key Manager (EKM) service deployment
integrated with Cloud EKM.
Using Cloud EKM and the associated EKM service involves an explicit
risk tradeoff for customers between cloud workload reliability and data
protection controls. Encrypting data-at-rest in the cloud with off-cloud
encryption keys adds new failure modes that may result in the inaccessibility or
even loss of data stored in Google Cloud services. To address these,
incorporating high availability and fault tolerance into the design of the EKM
service is paramount.
These recommendations are aimed at those who develop and operate the EKM
service. If you are a customer of a supported partner, you might share some of
these responsibilities with the partner, depending on the design of their
product and how it integrates with Cloud EKM.
Background and terminology
External key manager solutions, and specifically Google Cloud's
Cloud EKM, allow cloud customers to use non-cloud resident key material
to control the access to their data stored in supported Google Cloud
services.
Cloud EKM introduces the ability to create and manage
Cloud KMS key resources with the EXTERNAL and EXTERNAL_VPC
protection levels. Keys with the EXTERNAL and EXTERNAL_VPC protection levels
are stored and managed in an external key management system. These
Cloud KMS resources, like Cloud KMS keys of other protection
levels, can be used to encrypt data-at-rest in supported Google Cloud services
using CMEK. Every cryptographic operation requested on such a
Cloud KMS resource results in a cryptographic operation on the external
key requested by Cloud KMS. The success of the former operation
critically depends on that of the latter.
Cloud KMS requests operations on external keys using a special-purpose
API that integrates with the external key management system. Throughout this
document we refer to a service that provides this API as an EKM service.
If an EKM service becomes unavailable, reads and writes from the data planes for
integrated Google Cloud services may fail. These failures surface in a
similar way as failures do when the dependent Cloud KMS key is in an
unusable state, for example, when it is disabled. The end-user facing error
message describes in detail the source of the error and a course of action.
Furthermore, Cloud KMS data access audit logs persist a record of these
error messages together with descriptive error types that can be
programmatically consumed. More information can be found in
the Cloud EKM error reference documentation.
Guiding principles
Google's
Site Reliability Engineering book
illustrates a number of high-level principles to guide the development and
maintenance of reliable systems. In this section, we highlight some of these
principles in the context of how your EKM service integrates with
Google Cloud. We apply these principles using three reference
architectures, and place primary importance on the following three high-level
reliability objectives:
- Low-latency, reliable network connectivity
- High availability
- Fast failure detection and mitigation
For each objective, we highlight factors that affect reliability and provide
recommendations for taking them into account in your EKM service's architecture.
Low-latency, reliable network connectivity
Cloud KMS connects to EKM services via either a
Virtual Private Cloud (VPC) or the public internet. VPC solutions
will often use hybrid connectivity to host the EKM
service in an on-premise datacenter. The connection between Google Cloud
and the datacenter must likewise be fast and reliable. When using the public
internet, stable, uninterrupted reachability and fast, reliable DNS resolution
are of the utmost importance. From the point of view of Google Cloud, any
interruption manifests as unavailability of the EKM service and the potential
inability to access EKM-protected data.
When a Google Cloud service's data plane communicates with the EKM
service, each EKM service-bound call has a defined timeout period (currently 150
ms). The timeout is measured from the Cloud KMS service in the
Google Cloud location of the Cloud KMS key. If the
Google Cloud location is a multi-region, then the timeout begins in the
region where Cloud KMS receives the request, which is typically where
the operation on the dependent CMEK-protected data resource occurred. This
timeout is adequate to allow an EKM service to handle requests in the nearby
Google Cloud region the requests originate from.
Note that this is a much shorter timeout than the 10s timeout that is common
across Google Cloud APIs. The reduced timeout helps prevent cascading
failures in downstream services that depend on the external key. Tail latency
issues that might normally cause a poor user experience in higher level
applications can actually manifest as failed accesses to the external key
resulting in the failure of the higher-level logical operation.
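To see how tail latency interacts with this timeout, the following sketch computes latency percentiles and the share of calls that would exceed the budget. It is an illustration only: the sample latencies and the helper names percentile and timeout_fraction are hypothetical, and only the 150 ms figure comes from the text above.

```python
import math

TIMEOUT_MS = 150.0  # per-call timeout cited above; described as the current value


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def timeout_fraction(samples, timeout_ms=TIMEOUT_MS):
    """Fraction of samples that would exceed the timeout."""
    if not samples:
        raise ValueError("samples must be non-empty")
    return sum(1 for s in samples if s > timeout_ms) / len(samples)


# Hypothetical round-trip latencies in milliseconds, as seen from Cloud KMS.
latencies_ms = [35, 40, 42, 45, 48, 50, 55, 60, 140, 210]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
failed = timeout_fraction(latencies_ms)

print(f"p50={p50} ms, p99={p99} ms, over budget={failed:.0%}")
```

In this sample the median is far below the timeout, yet one call in ten would still fail, which is why tail latency rather than average latency should be tracked against the budget.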
Recommendations
Minimize latency of round-trip communication with Cloud KMS.
Configure EKM services to serve requests as geographically near as possible
to the Google Cloud locations corresponding to the Cloud KMS
keys configured to use the EKM service. For more information, see
the best practices guide for choosing Google Cloud regions on the basis of
latency considerations and the
documentation on where Google Cloud regions and zones are located.
Use Cloud Interconnect when possible.
Cloud Interconnect creates a
highly available, low-latency connection between Google Cloud and your
datacenter via VPC and eliminates dependencies on the public
internet.
Deploy Google Cloud networking solutions in the region closest to
the EKM service, when necessary.
Ideally, Cloud KMS keys should be in the region nearest to the EKM
service. If there is a Google Cloud region that is closer to the EKM
service than the region holding the Cloud KMS keys, use
Google Cloud networking solutions, such as Cloud VPN, in the region closest
to the EKM service. This ensures that network traffic uses Google infrastructure
when possible and reduces dependence on the public internet.
Use Premium Tier networks for cases where EKM traffic transits through the
internet.
Premium Tier routes traffic through the internet using
Google's infrastructure where possible to improve reliability and reduce
latency.
High availability
In general, the reliability of a system is determined by that of its least
reliable component. The existence of a single point of failure in the EKM
service will reduce the availability of dependent Google Cloud resources
to that of the single point of failure. Such points of failure may live in
critical dependencies of the EKM service as well as the underlying compute and
network infrastructure.
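A back-of-the-envelope calculation makes this concrete. The sketch below assumes that component failures are independent, which real deployments rarely achieve in full, and the availability figures are made up for illustration.

```python
from math import prod


def serial_availability(component_availabilities):
    """Availability of a request path that needs every component to be up."""
    return prod(component_availabilities)


def replicated_availability(replica_availability, replica_count):
    """Availability when any one of several independent replicas suffices."""
    return 1.0 - (1.0 - replica_availability) ** replica_count


# Hypothetical figures: network path, EKM front end, and a single key store.
chain = serial_availability([0.9999, 0.999, 0.99])

# The same key store deployed as two independent replicas.
key_store_pair = replicated_availability(0.99, 2)
improved_chain = serial_availability([0.9999, 0.999, key_store_pair])

print(f"single key store:     {chain:.4f}")
print(f"replicated key store: {improved_chain:.4f}")
```

The end-to-end figure can never exceed that of the weakest component in the chain, so removing the single point of failure is what moves the overall number.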
Recommendations
Deploy replicas across independent failure domains.
We recommend deploying at least 2 replicas of the EKM service. EKM services
intended to be used with multiregional Google Cloud locations should
have at least 2 separate geographical locations with at least 2 replicas
each.
Take special care to ensure not only that each replica represents a
replicated data plane of the EKM service, but also that cross-replica
failure vectors are well understood, minimized, and hardened. For example:
- Production mutations, including server binary and configuration pushes,
should modify only one replica at a time, and should be carried out
under supervision, with tested rollbacks readily available.
- Cross-replica failure modes from the underlying infrastructure should be
well understood and minimized. For example, ensure that replicas depend on
independent and redundant power feeds.
Replicas should be resilient to single machine outages.
Each replica of the service itself should comprise at least 3 appliances,
machines, or VM hosts. This allows the system to serve traffic while one
machine is down for update while another has an unexpected outage (N+2
provisioning).
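One way to read N+2 provisioning is to size each replica for peak load and then add two machines. The sketch below follows that reading; the throughput numbers and the function name machines_needed are hypothetical.

```python
import math


def machines_needed(peak_qps, per_machine_qps, spares=2):
    """Machines required to serve peak load with `spares` machines unavailable."""
    if peak_qps < 0 or per_machine_qps <= 0:
        raise ValueError("peak_qps must be >= 0 and per_machine_qps must be > 0")
    # At least one machine must be serving even when the load is tiny.
    serving = max(1, math.ceil(peak_qps / per_machine_qps))
    return serving + spares


# Hypothetical replicas: one lightly loaded, one heavily loaded.
small_replica = machines_needed(peak_qps=400, per_machine_qps=500)
large_replica = machines_needed(peak_qps=1800, per_machine_qps=500)

print(small_replica, large_replica)
```

Even a lightly loaded replica comes out at three machines, which matches the minimum recommended above.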
Limit the "blast radius" of control plane issues.
The control plane (for example, key creation and deletion) of the EKM service
is characterized by requiring configuration or data to be replicated across
replicas. These operations are generally more complex, because they require
synchronization and affect all replicas. Issues can quickly propagate to
affect the entire system. Some strategies to reduce the impact of issues are to:
- Control propagation speed: By default, changes should propagate as
slowly as is acceptable for usability and security. However, there might
be exceptions to allow some changes, like permitting access to a key, to
propagate quickly so that a user can undo a mistake.
- Shard the system: If many users share the EKM, partition them into
logical shards that are completely independent, so that issues triggered
by a user in one shard cannot affect users in the others.
- Preview the effect of changes: If possible, allow users to see the
effect of changes before applying them. For example, when modifying a key
access policy, the EKM could report the number of recent requests that
would have been rejected under the new policy.
- Data canarying: First push data to a small subset of the system, and
only push everywhere if that subset remains healthy.
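The data canarying strategy can be sketched as a small rollout helper. This is a minimal illustration, not a description of any particular EKM product: canary_rollout, apply_change, and is_healthy are hypothetical names, and a real rollout would also wait for a soak period before judging canary health.

```python
def canary_rollout(targets, apply_change, is_healthy, canary_count=1):
    """Apply a change to a few targets first; continue only if they stay healthy.

    Returns the list of targets that received the change.
    """
    if canary_count < 1:
        raise ValueError("canary_count must be at least 1")
    canaries = list(targets[:canary_count])
    rest = list(targets[canary_count:])

    for target in canaries:
        apply_change(target)

    if not all(is_healthy(target) for target in canaries):
        # Stop here; the caller is expected to roll back the canaries.
        return canaries

    for target in rest:
        apply_change(target)
    return canaries + rest


# Hypothetical shards of an EKM deployment.
shards = ["shard-a", "shard-b", "shard-c", "shard-d", "shard-e"]

applied = []
full = canary_rollout(shards, applied.append, lambda shard: True)

halted_log = []
halted = canary_rollout(shards, halted_log.append, lambda shard: False)

print(full)
print(halted)
```

When the canary reports unhealthy, only the first shard has been touched, which bounds the blast radius of a bad change to that shard.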
Implement holistic health checks.
Create health checks that measure whether the full system is functioning.
For example, health checks that only validate network connectivity will not
be helpful in responding to many application-level issues. Ideally the
health check will closely mirror the dependencies for real traffic.
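As a rough sketch of a holistic check, the function below reports healthy only when every probe along the request path succeeds. All probe names are hypothetical, and the XOR round trip is merely a stand-in for wrapping and unwrapping data with a dedicated test key in a real EKM service.

```python
def check_health(probes):
    """Run every probe; healthy only if all succeed. Returns (healthy, failures)."""
    failures = []
    for name, probe in probes.items():
        try:
            if not probe():
                failures.append(name)
        except Exception as exc:  # a crashing probe counts as a failure
            failures.append(f"{name}: {exc}")
    return (not failures, failures)


def network_ok():
    return True  # stand-in for a connectivity probe


def key_store_ok():
    return True  # stand-in for reading a known test key


def crypto_round_trip_ok():
    # Stand-in for a wrap/unwrap round trip; XOR is NOT real cryptography.
    plaintext = b"health-check"
    wrapped = bytes(b ^ 0x5A for b in plaintext)
    return bytes(b ^ 0x5A for b in wrapped) == plaintext


def key_store_down():
    raise RuntimeError("unreachable")


healthy = check_health(
    {"network": network_ok, "key_store": key_store_ok, "crypto": crypto_round_trip_ok}
)
degraded = check_health(
    {"network": network_ok, "key_store": key_store_down, "crypto": crypto_round_trip_ok}
)

print(healthy)
print(degraded)
```

A check limited to the network probe would have reported the degraded case as healthy, even though real requests depending on the key store would fail.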
Set up failover across replicas.
Set up load balancing in your EKM service components so that it consumes
the above health checks, actively drains traffic from unhealthy replicas,
and safely fails over to healthy replicas.
Include safety mechanisms to manage overload and avoid cascading
failures.
Systems may become overloaded for a variety of reasons. For example, when
some replicas become unhealthy, traffic redirected to the healthy replicas
could overload them. When faced with more requests than it can serve, the
system should attempt to serve what it can safely and quickly, while
rejecting excess traffic.
This chapter
from the Site Reliability Engineering book contains more details and
recommendations for preventing overload.
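One simple form of load shedding is to cap the number of in-flight requests and reject the excess immediately. The sketch below illustrates that idea under assumptions: the class name LoadShedder, the limit, and the "rejected" return value are hypothetical, and a production service would typically return an explicit overload error to the caller.

```python
import threading


class LoadShedder:
    """Admit at most `max_in_flight` concurrent requests; reject the rest."""

    def __init__(self, max_in_flight):
        if max_in_flight < 1:
            raise ValueError("max_in_flight must be at least 1")
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        with self._lock:
            if self._in_flight >= self._max:
                return False
            self._in_flight += 1
            return True

    def release(self):
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1


def handle(shedder, work):
    """Run `work` if capacity allows; otherwise fail fast."""
    if not shedder.try_acquire():
        return "rejected"
    try:
        return work()
    finally:
        shedder.release()


shedder = LoadShedder(max_in_flight=2)
first = shedder.try_acquire()
second = shedder.try_acquire()
third = shedder.try_acquire()  # over the limit, so this one is shed
shed_result = handle(shedder, lambda: "served")
shedder.release()
served_result = handle(shedder, lambda: "served")

print(first, second, third, shed_result, served_result)
```

Rejecting quickly keeps the admitted requests within the latency budget instead of letting every request slow down together.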
Ensure a robust durability story.
Data in Google Cloud that is encrypted with an external key in the EKM
service is unrecoverable without the external key. Therefore key durability
is one of the central design requirements of the EKM service. The EKM
service should securely back up redundant copies of key material in multiple
physical locations. High value keys, such as roots of trust, should have
additional protection measures, such as offline backups. Deletion mechanisms
should proceed slowly enough to allow for recovery in cases of accidents and
bugs. See this chapter from
the Site Reliability Engineering book for more information.
Reference: Building Secure and Reliable Systems.
The EKM service is both a security-critical and reliability-critical system.
The above book delves into the interesting interplay between security and
reliability concerns and provides guiding principles for designing a service
that needs to address both.
Fast failure detection and mitigation
For every minute the EKM service suffers an outage, dependent Google Cloud
resources may be inaccessible, which can further increase the likelihood of a
cascading failure of other dependent components of your infrastructure.
Recommendations
Instrument the EKM service to report metrics that signal
reliability-threatening incidents.
Examples of important metrics include response error rates and response
latencies.
Set up operational practices for timely notification and mitigation of
incidents.
The effectiveness of such efforts should be quantified by tracking Mean Time
To Detect (MTTD) and Mean Time To Restore (MTTR) metrics, and defining
objectives measured in terms of these metrics.
This chapter in the Site Reliability Engineering book contains
recommendations on tracking outages.
Once these metrics are available, you can find patterns and deficiencies in
the current processes and systems for responding to incidents and address
them.
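The two metrics can be computed from a simple incident log. The sketch below uses hypothetical incident records and field names, and it assumes that both metrics are measured from the start of the outage; some teams measure time to restore from the moment of detection instead.

```python
from datetime import datetime


def mean_minutes(incidents, start_field, end_field):
    """Mean elapsed minutes between two timestamps across all incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total_seconds = sum(
        (incident[end_field] - incident[start_field]).total_seconds()
        for incident in incidents
    )
    return total_seconds / len(incidents) / 60.0


# Hypothetical incident records.
incidents = [
    {
        "started": datetime(2022, 3, 1, 10, 0),
        "detected": datetime(2022, 3, 1, 10, 4),
        "restored": datetime(2022, 3, 1, 10, 30),
    },
    {
        "started": datetime(2022, 3, 9, 14, 0),
        "detected": datetime(2022, 3, 9, 14, 10),
        "restored": datetime(2022, 3, 9, 14, 50),
    },
]

mttd = mean_minutes(incidents, "started", "detected")
mttr = mean_minutes(incidents, "started", "restored")

print(f"MTTD: {mttd} min, MTTR: {mttr} min")
```

Tracking these values over time shows whether changes to alerting and incident response are actually shortening outages.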
Reference Architectures
The following architectures describe a few ways to deploy the EKM service using
Google Cloud networking and load balancing products. Each architecture may
be deployed either by a customer or by a partner who operates the service for
multiple customers, depending on the desired operational model. The
architectures are only pertinent to those who are building and operating the
EKM.
Direct connection over Cloud VPN or Cloud Interconnect
In the above architecture, Cloud EKM accesses the EKM service located
in Oregon through hybrid connectivity in the us-west1
region without any intermediate load balancing in Google Cloud.
When possible, the Cloud EKM to EKM service connection should be
deployed using the 99.9% availability configuration for single region
applications.
If the connection to the on-premise datacenter uses the Internet, you should use
HA VPN.
The primary advantage of this architecture is that there are no intermediate
hops in Google Cloud, which reduces latency and potential bottlenecks.
Using this architecture with an EKM that is hosted across multiple data centers
requires having the load balancer in all data centers use the same (anycast) IP
address. A disadvantage is that load balancing and failover among data centers
are based solely on route availability. This often means that such decisions
will only be based on the state of the network as opposed to the whole state of
the EKM deployment.
Load balanced in a VPC in Google Cloud
In the above architecture, Cloud EKM accesses the EKM service
replicated between Oregon and California through hybrid connectivity
with layers of intermediate load balancing in the us-west1
region in Google Cloud.
The internal passthrough Network Load Balancer provides a single
IP address to which to send traffic using virtual networking. It
fails over to the backup datacenter based on actively health checking
the backends.
The VM instance group is necessary to proxy
traffic, because the internal load balancer cannot route traffic directly to
on-premise backends. One strategy to deploy the load balancer proxies is to run
an Nginx Docker image from Cloud Marketplace in instance groups. Nginx can be
used as a TCP load balancer.
Since this approach uses load balancers in Google Cloud, this could be
deployed without an on-premise load balancer. The Google Cloud load
balancers can connect directly to instances of the EKM service and balance load
among them. Eliminating the on-premise load balancer results in simpler
configuration but reduces flexibility available in the EKM service. For example,
an on-premise L7 load balancer could automatically retry requests in case one
EKM instance returns an error.
Load balanced from public internet in Google Cloud
In the above architecture, the EKM has replicas in on-premise sites in
California and Virginia. Each backend is represented in Google Cloud
using a hybrid connectivity network endpoint group (NEG).
The deployment uses an external proxy Network Load Balancer to forward
traffic directly to one of the replicas. Unlike the other approaches, which rely
on VPC networking, the TCP proxy has a public IP address, and
traffic comes from the public internet.
Each hybrid connectivity NEG may contain multiple IP addresses, which
allows the TCP proxy load balancer to balance traffic directly to instances of
the EKM service. An additional load balancer in the on-premise datacenter is not
necessary.
The TCP proxy load balancer is not tied to a specific region. It can direct
incoming traffic to the nearest healthy region, which makes it suitable for
multiregional Cloud KMS keys. However, the load balancer does not allow
configuration of primary and failover backends. Traffic would be distributed
evenly across multiple backends in a region.
Comparison
| | Direct connection | Load balanced in a VPC | Load balanced from public internet | Fully-Managed EKM provided by partner |
| --- | --- | --- | --- | --- |
| Public internet or VPC | VPC | VPC | Internet | Internet |
| Load balancer in Google Cloud | No | Yes | Yes | No |
| On-premise load balancer required | Yes | No | No | Yes (managed by partner) |
| Supports multiregional Cloud KMS locations | No | No | Yes | Yes |
| Recommended for | High throughput applications where the EKM service runs in a single site. | Most EKM services where you deploy your EKM. | When multiregional Cloud KMS keys are required. | Customers who are able to use a partner's EKM instead of deploying their own. |