Book of Nutanix Clusters
Nutanix Clusters on AWS
Nutanix Clusters on AWS provides on-demand clusters running in target cloud environments using bare metal resources. This allows for true on-demand capacity with the simplicity of the Nutanix platform you know. Once provisioned, the cluster appears like any traditional AHV cluster, just running in a cloud provider’s datacenters.
The solution is applicable to the configurations below (list may be incomplete, refer to documentation for a fully supported list):
Core Use Case(s):
- On-Demand / burst capacity
- Backup / DR
- Cloud Native
- Geo Expansion / DC consolidation
- App migration
Management Interface(s):
- Nutanix Clusters Portal - Provisioning
- Prism Central (PC) - Nutanix Management
- AWS Console - AWS Management
Supported Environment(s):
- AWS (EA)
- EC2 Metal Instance Types:
Upgrades:
- Part of AOS
Compatible Features:
- AOS Features
- AWS Services
Key terms / Constructs
The following key terms are used throughout this section:
- Nutanix Clusters Portal
- The Nutanix Clusters Portal is responsible for handling cluster provisioning requests and interacting with AWS and the provisioned hosts. It creates cluster specific details and handles the dynamic CloudFormation stack creation.
- Region
- A geographic landmass or area where multiple Availability Zones (sites) are located. A region can have two or more AZs. These can include regions like US-East-1 or US-West-1.
- Availability Zone (AZ)
- An AZ consists of one or more discrete datacenters interconnected by low-latency links. Each site has its own redundant power, cooling, network, etc. Compared to a traditional colo or datacenter, these would be considered more resilient, as an AZ can consist of multiple independent datacenters. These can include sites like US-East-1a or US-West-1a.
- VPC
- A logically isolated segment of the AWS cloud for tenants. Provides a mechanism to secure and isolate an environment from others. Can be exposed to the internet or other private network segments (other VPCs, or VPNs).
- S3
- Amazon’s object service which provides persistent object storage accessed via the S3 API. This is used for archival / restore.
- EBS
- Amazon’s volume / block service which provides persistent volumes that can be attached to EC2 instances.
- Cloud Formation Template (CFT)
- A Cloud Formation Template simplifies provisioning by allowing you to define a “stack” of resources and dependencies. This stack can then be provisioned as a whole instead of each individual resource.
At a high level, the Nutanix Clusters Portal is the main interface for provisioning Nutanix Clusters on AWS and interacting with AWS.
The provisioning process can be summarized with the following high-level steps:
- Create cluster in Nutanix Clusters Portal
- Deployment specific inputs (e.g. Region, AZ, Instance type, VPC/Subnets, etc.)
- The Nutanix Cluster Orchestrator creates associated resources
- Host agent in Nutanix AMI checks in with Nutanix Clusters on AWS
- Once all hosts are up, the cluster is created
The following shows a high-level overview of the Nutanix Clusters on AWS interaction:
Nutanix Clusters on AWS - Overview
The following shows a high-level overview of the inputs taken by the cluster orchestrator and some of the created resources:
Nutanix Clusters on AWS - Cluster Orchestrator Inputs
The following shows a high-level overview of a node in AWS:
Nutanix Clusters on AWS - Node Architecture
Given the hosts are bare metal, we have full control over storage and network resources, similar to a typical on-premises deployment. EBS volumes are used for the CVM and AHV host boot. NOTE: certain resources like EBS interaction run through the AWS Nitro card, which appears as an NVMe controller in the AHV host.
Nutanix Clusters on AWS uses a partition placement policy with 7 partitions by default. Hosts are striped across these partitions, which correspond to racks in Nutanix. This ensures you can have 1-2 full “rack” failures and still maintain availability.
The following shows a high-level overview of the partition placement strategy and host striping:
Nutanix Clusters on AWS - Partition Placement
In cases where multiple node types are leveraged (e.g. i3.metal and m5d.metal), each node type has its own 7 partitions across which its nodes are striped.
The following shows a high-level overview of the partition placement strategy and host striping when multiple instance types are used:
Nutanix Clusters on AWS - Partition Placement (Multi)
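The striping described above can be illustrated with a small sketch. This is only a model for reasoning about rack awareness, not how AWS or Nutanix actually places hosts: the round-robin assignment, the function names, and the host names are all assumptions made for illustration. The only values taken from the text are the default of 7 partitions and the rule that each instance type gets its own set of partitions.

```python
from collections import defaultdict

PARTITIONS_PER_TYPE = 7  # default partition count stated in the text


def stripe_hosts(hosts):
    """Assign each (host_name, instance_type) pair to a partition.

    Assumption: hosts of one instance type are placed round-robin across
    that type's own set of partitions.
    """
    next_index = defaultdict(int)  # running counter per instance type
    placement = {}
    for name, instance_type in hosts:
        partition = next_index[instance_type] % PARTITIONS_PER_TYPE
        placement[name] = (instance_type, partition)
        next_index[instance_type] += 1
    return placement


def hosts_lost(placement, failed):
    """Return the hosts sitting in any failed (instance_type, partition) pair."""
    return sorted(name for name, slot in placement.items() if slot in failed)


# Hypothetical cluster: nine i3.metal hosts and three m5d.metal hosts.
hosts = [(f"i3-node-{n}", "i3.metal") for n in range(9)]
hosts += [(f"m5d-node-{n}", "m5d.metal") for n in range(3)]
placement = stripe_hosts(hosts)

# Losing one i3.metal partition ("rack") only takes out the hosts in it.
print(hosts_lost(placement, {("i3.metal", 0)}))
```

With nine hosts and seven partitions, the eighth host wraps around to the first partition, so a single partition failure affects at most two hosts of that type in this example.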
Storage for Nutanix Clusters on AWS can be broken down into two core areas:
- Core / Active
- Hibernation
Core storage is the exact same as you’d expect on any Nutanix cluster, passing the “local” storage devices to the CVM to be leveraged by Stargate.
Given that the “local” storage is backed by the AWS instance store, which isn’t fully resilient in the event of a power outage / node failure, additional considerations must be taken into account.
For example, in a local Nutanix cluster in the event of a power outage or node failure, the storage is persisted on the local devices and will come back when the node / power comes back online. With the AWS instance store this is not the case: data on the instance store is lost when the instance stops or is terminated.
In most cases it is highly unlikely that a full AZ will lose power / go down; however, for sensitive workloads it is recommended to:
- Leverage a backup solution to persist to S3 or any durable storage
- Replicate data to another Nutanix cluster in a different AZ/Region/Cloud (on-prem or remote)
One unique capability of Nutanix Clusters on AWS is the ability to “hibernate” a cluster, allowing you to persist the data while spinning down the EC2 compute instances. This is useful when you don’t need the compute resources and don’t want to continue paying for them, but want to persist the data and be able to restore it at a later point.
When a cluster is hibernated, the data will be backed up from the cluster to S3. Once the data is backed up, the EC2 instances will be terminated. Upon a resume / restore, new EC2 instances will be provisioned and data will be loaded into the cluster from S3.
Networking can be broken down into a few core areas:
- Host / Cluster Networking
- Guest / UVM Networking
- WAN / L3 Networking
Native vs. Overlay
Instead of running our own overlay network, we decided to run natively on AWS subnets. This allows VMs running on the platform to natively communicate with AWS services with zero performance degradation.
Nutanix Clusters on AWS are provisioned into an AWS VPC. The following shows a high-level overview of an AWS VPC:
Nutanix Clusters on AWS - AWS VPC
New vs. Default VPC
AWS will create a default VPC/Subnet/etc. with a 172.31.0.0/16 IP scheme for each region.
It is recommended to create a new VPC with associated subnets, NAT/Internet Gateways, etc. that fits into your corporate IP scheme. This is important if you ever plan to extend networks between VPCs (VPC peering), or to your existing WAN. I treat this as I would any site on the WAN.
The hosts running on bare metal in AWS are traditional AHV hosts, and thus leverage the same OVS-based network stack.
The following shows a high-level overview of an AWS AHV host’s OVS stack:
Nutanix Clusters on AWS - OVS Architecture
The OVS stack is largely the same as on any AHV host, except for the addition of the L3 uplink bridge.
For UVM (Guest VM) networking, VPC subnets are used. A UVM network can be created during the cluster creation process or via the following steps:
From the AWS VPC dashboard, click on ‘subnets’ then click on ‘Create Subnet’ and input the network details:
Nutanix Clusters on AWS - OVS Architecture
NOTE: the CIDR block should be a subset of the VPC CIDR range.
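The subset requirement in the note above can be checked before creating the subnet. The following is a minimal sketch using Python's standard `ipaddress` module; the function name and the 10.0.0.0/16 example range are assumptions chosen for illustration, while 172.31.0.0/16 is the default VPC range mentioned earlier.

```python
import ipaddress


def is_within_vpc(subnet_cidr, vpc_cidr):
    """Return True if subnet_cidr is fully contained in vpc_cidr."""
    subnet = ipaddress.ip_network(subnet_cidr)
    vpc = ipaddress.ip_network(vpc_cidr)
    return subnet.subnet_of(vpc)


# Hypothetical UVM subnet inside a hypothetical VPC range.
print(is_within_vpc("10.0.1.0/24", "10.0.0.0/16"))
# A subnet from a different /16 does not fit in the same VPC.
print(is_within_vpc("10.1.0.0/24", "10.0.0.0/16"))
# A /20 carved out of the default VPC range.
print(is_within_vpc("172.31.16.0/20", "172.31.0.0/16"))
```

Note that `ip_network` raises `ValueError` when the CIDR has host bits set (for example `10.0.1.5/24`), which is a useful early check on typos.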
The subnet will inherit the route table from the VPC:
Nutanix Clusters on AWS - Route Table
In this case you can see that any traffic destined for the peered VPC will go over the VPC peering link and any external traffic will go over the internet gateway.
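A route table resolves a destination by picking the most specific matching route (longest prefix match). The sketch below models that lookup in plain Python. The CIDR ranges and the target names (`pcx-example`, `igw-example`) are hypothetical placeholders, not values from the figure.

```python
import ipaddress

# Hypothetical route table: (destination CIDR, target).
ROUTES = [
    ("10.0.0.0/16", "local"),        # the VPC's own range
    ("10.1.0.0/16", "pcx-example"),  # peered VPC via peering connection
    ("0.0.0.0/0", "igw-example"),    # everything else via internet gateway
]


def next_hop(destination_ip, routes=ROUTES):
    """Return the target of the most specific route matching destination_ip."""
    address = ipaddress.ip_address(destination_ip)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in routes
        if address in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None
    # The route with the longest prefix is the most specific one.
    return max(matches, key=lambda match: match[0].prefixlen)[1]


print(next_hop("10.1.4.20"))  # address in the peered VPC
print(next_hop("10.0.3.7"))   # address in the local VPC
print(next_hop("8.8.8.8"))    # external address
```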
Once complete, you will see the network is available in Prism.
WAN / L3 Networking
In most cases deployments will not be limited to AWS and will need to communicate with the external world (other VPCs, the Internet or the WAN).
For connecting VPCs (in the same or different regions), you can use VPC peering which allows you to tunnel between VPCs. NOTE: you will need to ensure you follow WAN IP scheme best practices and there are no CIDR range overlaps between VPCs / subnets.
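Overlapping ranges are easy to catch ahead of time. The following is a small sketch using the standard `ipaddress` module that reports every pair of overlapping CIDR blocks in an IP plan; the site names and ranges are made up for illustration.

```python
import ipaddress
from itertools import combinations


def find_overlaps(cidrs_by_name):
    """Return (name_a, name_b) pairs whose CIDR ranges overlap."""
    networks = {
        name: ipaddress.ip_network(cidr) for name, cidr in cidrs_by_name.items()
    }
    return [
        (name_a, name_b)
        for (name_a, net_a), (name_b, net_b) in combinations(networks.items(), 2)
        if net_a.overlaps(net_b)
    ]


# Hypothetical IP plan: two VPCs and an on-premises site.
plan = {
    "vpc-a": "10.0.0.0/16",
    "vpc-b": "10.1.0.0/16",
    "on-prem": "10.0.128.0/17",
}
print(find_overlaps(plan))
```

Here `on-prem` falls inside the `vpc-a` range, so that pair is reported and one of the two would need to be renumbered before peering or extending the WAN.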
The following shows a VPC peering connection between a VPC in the eu-west-1 and eu-west-2 regions:
Nutanix Clusters on AWS - VPC Peering
The route table for each VPC will then route traffic going to the other VPC over the peering connection (this will need to exist on both sides if communication needs to be bi-directional):
Nutanix Clusters on AWS - Route Table
For network expansion to on-premises / WAN, either a VPN gateway (tunnel) or AWS Direct Connect can be leveraged.
Given these resources are running in a cloud outside our full control, security, data encryption and compliance are critical considerations.
The recommendations can be summarized as follows:
- Enable data encryption
- Only use private subnets (no public IP assignment)
- Lock down security groups and allowed ports / IP CIDR blocks
- For more granular security, leverage Flow
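As one way to act on the security group recommendation, the allowed source ranges can be reviewed programmatically. The sketch below uses a simplified, hypothetical rule format (plain dictionaries, not the shape returned by the AWS API) and flags any rule whose source range is wider than a chosen threshold; the rule names, ports and threshold are assumptions for illustration.

```python
import ipaddress

# Hypothetical, simplified rule format (not the AWS API response shape).
RULES = [
    {"name": "prism", "port": 9440, "cidr": "10.0.0.0/16"},
    {"name": "ssh", "port": 22, "cidr": "0.0.0.0/0"},
    {"name": "https", "port": 443, "cidr": "192.168.10.0/24"},
]


def overly_open(rules, max_addresses=65536):
    """Return names of rules whose source range exceeds max_addresses."""
    flagged = []
    for rule in rules:
        network = ipaddress.ip_network(rule["cidr"])
        if network.num_addresses > max_addresses:
            flagged.append(rule["name"])
    return flagged


print(overly_open(RULES))
```

In this example only the rule open to 0.0.0.0/0 is flagged; a /16 source range sits exactly at the threshold and passes.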
Usage and Configuration
The following sections cover how to configure and leverage Nutanix Clusters on AWS.
The process can be summarized in the following high-level steps:
- Create AWS Account(s)
- Configure AWS network resources (if necessary)
- Provision cluster(s) via Nutanix Clusters Portal
- Leverage cluster resources once provisioning is complete
More to come!