Nutanix ROBO, Edge, and Tactical – Design – Part 7 – TDCs

Tactical Data Centers

Tactical data centers have unique characteristics that other ROBO and edge environments do not. They need to be deployed quickly, have a small footprint, be mobile, maintain data security if they are physically compromised, be resilient to physical damage and the elements, and be battery-backed. They also need to adhere to the strict security standards required by federal government and law-enforcement agencies.

First let’s look at some use cases, then we’ll dig into the technology required.

1) Rapid Emergency Response / Disaster Relief

When natural disasters occur, such as hurricanes, forest fires, and flooding, regular communication channels and systems are ineffective. Emergency personnel need reliable, real-time voice and data communications. These are provided by base stations that connect localized teams to backhaul networks, providing the services and connectivity required to save lives and rebuild infrastructure.

2) Mobile and short-lived deployments

Mobile and modular datacenters for short-lived deployments need to be stood up and started very quickly, provide applications and services immediately, and be resilient to component failure and environmental conditions. These get moved around a lot, jostled and battered. They cannot afford a service outage just because a server has components or connections shaken loose.

3) Field office / Forward Operating Base (FOB)

When operations and missions require a longer-term (but possibly mobile) base, an FOB will be set up. These will have applications and services that are centrally managed, updated, and deployed. There may also be data replication from FOBs to regional datacenters in Main Operating Bases (MOBs). The datacenter could be in a permanent structure with racks, but will most likely have a smaller-than-normal footprint.

4) In-theatre tactical environments

In-theatre tactical environments have a lot of similarities to mobile and short-lived deployments. However, they may also have GTFO (Get The **** Out) operational procedures. Those might include taking data with you, or erasing local systems. These systems may be in command posts or in vehicles.

With command posts, mobilization is normally not immediate, but rather done in several phases. First, troops would start with tent infrastructure, then generators and UPS units. Then local network switching, server, and storage infrastructure would go up. Next would be satellite and other RF links, followed by the running of cables to provide the local area network in the command post.

There has also been a shift from relying on dedicated proprietary data communications systems to ruggedized COTS (commercial off-the-shelf) systems that adhere to the required security specifications. This has changed the landscape of communications methods for classified data. The old method was:

In an FOB, use a locally wired LAN connected to another LAN via a point-to-point RF or secure satellite link. This meant that no Wi-Fi or cellular/LTE communications could be used for classified data.

In the new method, validated COTS hardware is used with layers of encryption at the hardware, network, and application levels. This increases the capability for situational awareness with sensor data and mobile communications. It also increases network availability by leveraging alternative network paths and avoiding a single point of failure.

A good example of this is in the United States, where the NSA has established a program called Commercial Solutions for Classified (CSfC).
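To illustrate the layered-encryption idea (and only to illustrate it; real CSfC solutions use independently validated hardware and network components, not application code like this), here is a minimal Python sketch of wrapping data in two independent encryption layers, so that compromising one layer still leaves the data protected:

```python
# Minimal sketch of two independent encryption layers (illustration only, not CSfC-validated).
from cryptography.fernet import Fernet

inner_key, outer_key = Fernet.generate_key(), Fernet.generate_key()  # independent keys per layer
inner, outer = Fernet(inner_key), Fernet(outer_key)

plaintext = b"sensor report: grid reference redacted"
wrapped = outer.encrypt(inner.encrypt(plaintext))   # layer 2 wraps layer 1

# Breaking the outer layer alone yields only the inner ciphertext, not the data.
recovered = inner.decrypt(outer.decrypt(wrapped))
assert recovered == plaintext
```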

The ultimate goal for in-theatre tactical environments is to be secure, quick, and agile, with a high level of communication performance and availability. I’ll go deeper into tactical network architecture in another blog post when I discuss C4/C5/C6ISR. Here is a quick link that has a good overview. Basically, it is an acronym that encompasses the technological capabilities of a platoon, company, division, etc.

C2 is Command and Control

C3 is Command and Control, and Communications

C4 is Command and Control, Communications, and Computers

C5 is Command and Control, Communications, Computers, and Cyber Defense

C6 is Command and Control, Communications, Computers, Cyber Defense and Combat Systems

ISR is Intelligence, Surveillance, and Reconnaissance 

Most modernized militaries have some aspect of C4ISR / C5ISR in theatre, and it can be achieved with a tactical datacenter. C6ISR is reserved for integrated naval weapons systems like AEGIS.


Now let’s look at the technology.

Nutanix has partnerships with a number of hardware vendors that operate in this space. Nutanix provides the software and the partner vendor provides the hardware. In the TDC space, a few partners stand out: KLAS, Crystal Group, and HPE.

I would be doing a disservice if I only mentioned the hyperconverged infrastructure in these solutions, because to make it work there is always a networking component. There may also be other supporting systems that need integration and connectivity. So let’s first look at what a logical diagram for the entire network topology might look like.

There are systems in place to get data from sensors. This data could be for surveillance, or for EMF radiation signatures, or a myriad of other things. The data can then be analyzed using an inference engine and provide immediate actionable information. That inference engine can operate within the TDC.

There are physical connections to each of the networks available to a site, but there are also management platforms and controllers for the connections. For instance, a platform used for managing the mesh networking and data synchronization of tactical radio endpoints could run on the TDC. Cellular base-station software could run in a VM on the TDC, or a virtual routing platform could connect radio, cellular, satellite, and point-to-point microwave links.

In the diagram I show 3 x TDCs and a main operating base. Some TDCs can communicate directly via point-to-point links, while others need to route through other TDCs. This allows for the replication and distribution of data, and uplinks that can provide alternative routes if some forms of communication are compromised. As vehicles or troops move, they can use their comms devices to access any of the available networks as a client, then route wherever needed, automatically.

It may seem complicated, but it’s really not, nor should it be. In the following diagram, I highlight the important part of this architecture as it relates to the Nutanix TDC.

In these TDCs you have your HCI infrastructure and the local switching to connect it. The local networking may include firewalls, routers, and other WAN connectivity and isolation devices. However, those are out of scope for the architectures I show here. Some networking devices can actually run as software appliances within the TDC, further reducing third-party hardware requirements.

The KLAS Voyager 2 and the HPE DX8000 come with networking built in, while the Crystal servers do not. There are benefits and drawbacks to each approach. With Crystal, it is assumed that you will utilize your own pre-existing switching, so nothing is provided. With the other two, the networking has a very small profile and does not hinder the setup, rack mounting, or overall space.


KLAS has the Voyager Tactical Data Center solution.

Overview of the KLAS Voyager TDC

  • Voyager 8 case and chassis with built-in UPS and AC/DC charge options
  • 4x TDC blades, each with
    • Xeon D CPU with 128GB RAM
    • NVMe storage for caching
    • 4x 2.5″ SATA SSDs
    • Voyager Ignition Key (VIK) for easy configuration changes
  • Voyager TDC switch with
    • 12x 10 Gbps ports with copper and fiber options
    • 121 Gbps backplane for line-speed processing on all ports simultaneously
    • 40 Gbps trunk for interconnection with third-party switches
    • Inter-VLAN routing in hardware at line rate
    • 1x 40 Gbps QSFP+ port for high-speed uplink; can also operate as 4x 10 Gbps SFP+ ports using the included breakout cable
    • Port mirroring and IPFIX
    • Ansible playbook management support
    • VIK for configuration and storage
  • Validated with Nutanix AOS on VMware ESXi and Nutanix AHV
  • Can be removed from the travel case and mounted in a regular 2- or 4-post rack

The KLAS Voyager system is very modular and the TDC is not the only solution they offer. You can mount routers, radios, and a whole host of other devices in the Voyager 8 series chassis. Also, the chassis depth is very short because of the orientation of the nodes and switching. This makes it light and very portable.

All hardware maintenance is done from the front of the chassis, including node/switch/cabling, etc. The chassis can also be removed from the protective hard-case to be mounted in a rack.


Crystal Group has a whole range of ruggedized servers and switches. 

The form factor used for Nutanix is the 2U rugged Carbon Fibre server.

The footprint is not going to be as small as the KLAS or HPE solution, but these things are indestructible. From the Crystal Group site:

“…withstands harsh environmental conditions, including shock and vibration, temperature extremes, sand/dust, sea spray/salt fog, and more.”

Stronger and lighter than their steel chassis counterparts, they minimize SWaP (Size, Weight and Power) while maximizing performance. The minimum rack height for a scalable cluster would be 7RU, with 3 x 2U nodes and 1 x 1U switch. However, if you wanted to minimize size, it would be possible to use only 3RU with a single-node cluster and a single switch.

The switching used is a ruggedized version of a Brocade ICX switch.

If you want a completely self-contained environment that can be maintained locally with no other hardware (such as a laptop), then it may make sense to add a KVM. Crystal provides a 1RU 8-port KVM that fits the bill.

Space will be required for cabling maintenance, because all cabling and power connectivity is in the rear. Drive removals are done from the front, which is normal. However, in regular operation these nodes would not need any maintenance for 10 years on average, so rear access is not a real concern.

Crystal Group Solution overview

Crystal Edge Networking


HPE has the DX8000 

The DX8000 series is essentially the same as the EdgeLine 8000 (EL8000), but specifically designed to run Nutanix AOS.

Here is the solution brief.

There are some specific things that I like about the DX8000, which I think are unique. With the PCI riser cards in each node, you can add network adapters to connect to external switching, or add GPUs for additional processing capability. If you want to keep things a bit more compact, you can add internal unmanaged switches to the chassis. This uses internal traces from the nodes (you will need internal network adapters) to the switch cards (as shown above). The switches then uplink to another switch for greater connectivity. I don’t think you can form a vPC (virtual port channel) with those uplinks, as they are pretty basic, so they would essentially be a single uplink from each internal switch.

Another interesting thing about the DX8000 is that its width is pretty small, so you could have two of these units side by side without using any more rack units. That would give you 8 x nodes in 5RU, which is pretty dense. For non-rack environments, the DX8000 can be bolted into a vehicle, or moved around fairly easily.

Nutanix ROBO, Edge, and Tactical – Design – Part 6 – MEC

MEC or Multi-access Edge Computing

MEC is a continuation of the idea of Edge. It was originally referred to as Mobile Edge Computing, but has since been renamed Multi-access Edge Computing. The main idea is that some data is processed locally and some data is passed to the cloud for processing. Devices generate data, which is put into data pipelines, and applications then apply business logic to that data.

There are “System level”, “Host level”, and “Network level” components. Another way of looking at them is as management, cluster/host, and network components.

The system level provides overarching, centrally controlled management of the applications. This allows MEC applications to be deployed and treated as functions for data streams. The location of a host or cluster would be considered a “Service Domain”. The network would be a combination of local and remote elements. Locally, data streams are aggregated and can then be processed by the functions in the local service domain, or they can be forwarded to another service domain to be processed. Those other service domains can be in another datacenter, or in the cloud.

The MEC applications might employ some level of AI for inferencing, applying a function to the real-time data collected. The models that the inferencing engines use are trained with a lot of data. So the data pipeline could continue to another service domain, which would have the machine-learning analytics capabilities for continuous training and improvement of the models. As those models get updated, they would then be pushed back to the inferencing engines.

Here is an example of this MEC architecture in use in a manufacturing context. Sensor data from PLCs, SCADA systems, etc., is collected. The data is monitored in real time and custom dashboards are presented locally. Based on the defined business logic, there may be alternate machine/controller profiles that need to be loaded if there are variances in the sensor data. For example, an increase in temperature may cause manufacturing process failures unless it is accounted for or addressed. The inferencing engine could do this automatically, or provide an indication on a dashboard suggesting preventative maintenance to avoid future impact. The output from the inferencing engines, as well as the raw data, can roll up to another service domain for longer retention, analysis, and global management of multiple edge sites.
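To make that flow concrete, here is a minimal Python sketch of the pattern (this is not the KPS API; the threshold, function names, and simulated sensor are invented for illustration). A local function infers on each reading with no WAN round trip, acts immediately, and rolls both the raw data and the output up to another service domain:

```python
import json
import random  # simulates a sensor feed for the example
import time

TEMP_LIMIT_C = 80.0  # hypothetical threshold standing in for a trained model

def read_sensor():
    """Simulate a PLC temperature reading; a real pipeline would poll Modbus/OPC UA."""
    return {"ts": time.time(), "temp_c": random.uniform(60.0, 95.0)}

def infer(reading):
    """Stand-in for the local inferencing engine: flag readings that predict failures."""
    return "load_cool_profile" if reading["temp_c"] > TEMP_LIMIT_C else "ok"

def forward_to_service_domain(record):
    """Stand-in for the uplink to another service domain for retention and retraining."""
    print("forwarding:", json.dumps(record))

for _ in range(5):
    r = read_sensor()
    action = infer(r)  # decided locally, no WAN round trip
    if action != "ok":
        print("local action:", action)  # e.g., load the alternate controller profile
    forward_to_service_domain({**r, "action": action})  # raw data + output roll up
```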


So how does Nutanix do this? 

Well, there are always a number of ways and components involved, but the simplest answer is Nutanix Karbon Platform Services, or KPS.

KPS provides a simple management plane (delivered via SaaS) in which you specify your service domains, data sources, functions, business logic, and machine learning. It works across different clouds and on-prem, with virtual machines, Kubernetes/containers, and even bare metal. It’s the glue for complex application architectures, both cloud native and traditional.

Some neat aspects of KPS:

  • KPS can provide management of Kubernetes clusters running in Nutanix Karbon, AWS, Azure, GCP and others.
  • GPUs are supported for inferencing at the edge.
  • A Terraform provider is available to automate the deployment of a KPS service domain in the public cloud.
  • KPS can manage service domains running on non-Nutanix VMware ESXi hosts.

Here is a graphic that provides a high-level overview.


If you want to dive a bit deeper into KPS, have a look at these articles and links:

Nutanix ROBO, Edge, and Tactical – Design – Part 5 – Edge overview

EDGE

I want to clarify what Edge Computing is and how it is distinct from general-purpose computing running at remote locations. The core idea behind Edge Computing is to decrease latency and increase service reliability. The reason this needs to be addressed is that we are in an interim technological stage in the evolution of computing.

Constraints are the cause of this, specifically network constraints. In the near future we will have ubiquitous computing that is available everywhere and capable of performing all functions required of it. Bandwidth will be far beyond any individual’s needs and will be in surplus for everyone, at near-zero cost to the consumer. 5G networks can provide (in theory) up to 20 Gbps downstream and 10 Gbps upstream using the Enhanced Mobile Broadband (eMBB) standard. Low-latency, high-bandwidth network communication across the world is ideal, but what about transit costs, security, and availability? It’s a big step on a long path to where we want to be.

Ubiquitous Computing (UbiComp) has many meanings, depending on the decade in which it is discussed. Some of its aspects have already come to fruition in day-to-day life. Smartphones, tablets, smart devices, IoT devices, voice assistants, cloud computing, SaaS/PaaS/XaaS; these are all manifestations of UbiComp. The three components are compute, network, and data. The location of each of these will dynamically change based on many factors.

Here are some of the challenges we face with ubiquitous computing:

  • Compute capacity is generally centralized.
  • Decentralized compute capacity is often limited to end-user devices
  • Network bandwidth and latency are most often best-effort with no SLA
  • Network availability and performance SLAs require a cost uplift with a steep curve
  • When a shared architecture is used, the integrity of the infrastructure/network/data security needs continuous auditing oversight and validation
  • In a peered computing model, there are no standards for secure multi-tenant guest compute across independent infrastructures
  • As compute dynamically moves, latency to data will fluctuate based on data gravity and dependencies 

This is a non-exhaustive list, but it illustrates that we have a long way to go. If ubiquitous computing was to be depicted as a world, it would be one of many islands on a stormy sea with bridges, boats, and hot air balloons connecting them. What Edge Computing does is allow for smaller islands to collect, process and interpret data without navigating the unreliable paths to the larger islands.

Let’s look at one of the above challenges and dig into it a bit deeper.

Compute capacity is generally centralized.

Whether this is in a datacenter or in a cloud, data needs to get onto a platform for processing. Some processing can be done at the desktop/end-user-device level, but it is often limited to single-user operations. In this example, I will show some of the constraints experienced when data is collected and processed, and the output needs to be consumed by others.

If data were collected by an individual, then you could use the example of a spreadsheet. I have outlined three examples with common Microsoft applications and services. 

I am not advocating the use of these for any workflow, but rather using them as illustration points because they are familiar applications for many people and can show data logging, processing and replication to the cloud.

(A) Excel on a desktop computer. Resources are limited to the local capabilities. Nothing is shared.

(B) Excel on a desktop with synchronization of a folder to Microsoft OneDrive. Resources are limited to the local capabilities, but can be synchronized to the cloud with virtually unlimited storage. Once the data is in the cloud, others can access it and manipulate / process / interpret it. There is a certain amount of latency between when it is written locally and when it is synced to the cloud and accessible by others.

(C) Excel Online. Data is written to the cloud in real time; however, the latency to write to the cloud is greater than to local storage. There is virtually no limit to the storage, and the local resource requirements can be quite low. Data is immediately available to others when written; however, if there are connectivity or bandwidth issues, then data loss can occur.

A person could use a desktop version of Excel on their computer to edit/create a document. If they kept a new copy for every version, then they could see the changes that occur over a period of time. This is the core idea of datalogging. Data is captured at time-based intervals, put into a table, and saved. If they wanted to capture multiple data sets and save them into discrete files, then the I/O and storage requirements would increase with the number of concurrent data sets being captured. There would be a point where local storage could not meet the capacity or I/O demands.
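As a minimal sketch of that datalogging pattern (all values are invented; a real logger would read instruments rather than a random number generator), the following Python captures samples at a fixed interval and enforces the fixed retention window discussed below, so local storage has a hard upper bound:

```python
import csv
import random
import time
from collections import deque

RETENTION_SAMPLES = 60   # fixed retention window: old rows age out automatically
INTERVAL_S = 1.0         # time-based capture interval

samples = deque(maxlen=RETENTION_SAMPLES)  # bounded buffer caps local storage use

def capture():
    """Simulate one interval's reading; real loggers would poll sensors or counters."""
    return {"ts": time.time(), "value": random.random()}

for _ in range(5):  # loop shortened for the example; a logger would run indefinitely
    samples.append(capture())
    time.sleep(INTERVAL_S)

# Any analytics must complete within the retention window, or data ages out mid-job.
with open("snapshot.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["ts", "value"])
    writer.writeheader()
    writer.writerows(samples)
```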

This is captured in scenario (A) with the below example.

Because the local resources are finite, continuously increasing capacity and performance requirements are not feasible. Some ways to address this are:

  • Provide shared storage to increase total capacity.
  • Limit the amount of data retention to a fixed period, thus ensuring that an upper limit threshold will not be surpassed if the data size per interval is consistent.

This provides some breathing room, but the next constraints to be hit are compute capacity and availability. If the amount of compute required to collect and process the data is beyond the capability of a single machine, then several machines need to do this. The most efficient way to do this is with virtual machines or containers on a cluster. Those methods ensure that the availability of data logging is not impacted by updates, maintenance, or component failure. Perhaps some predefined analytics also need to be conducted on the data before it ages out and is overwritten. The time it takes to run the analytics needs to be less than the data retention period, or the job will fail.

If more analytics need to be run beyond the data retention period or local storage, then the data needs to be replicated to another storage system with more capacity. The cloud is a good option for this with virtually unlimited capacity, where the constraint is just your budget.

This is represented below in example (B).

I’ll add a bit of explanation to the above graphs. The grey graphs are identical to those of the local desktop (example (A)), and the blue are for the sync with OneDrive (or the cloud in general). The cloud has virtually unlimited storage, so the available storage always stays at 100%. There will always be a base amount of latency going to the cloud, and it will increase as I/O increases, for three reasons. First, additional I/O means more bandwidth requirements, which may require buffering or queuing before a sync completes. Second, there are limits to I/O in the cloud when the same data is being accessed by multiple parties or systems; it may be better than a single local system, but it’s not meant for heavy I/O. Third, if the other parties or systems are not running in the cloud, then there is another latency hop to sync the data to them before it is accessible.

One way to remove latency from all these data syncs before analytics can be performed is to write directly to the cloud and use compute in the cloud to process the data and run analytics. This removes the need for local data retention and sync processes, but it puts more emphasis on the criticality of the network link. This is seen in example (C) below:

The overall latency is lower than that of (B) because of the removal of sync jobs, and the local storage requirements are very small because there is no local data retention. There will always be a base latency greater than that of local storage because of electromagnetic propagation constraints. Basically, the cloud is farther away than local storage, so writes take longer to reach it.
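The propagation constraint is easy to put numbers on. Light in fiber covers roughly 200 km per millisecond (about two-thirds of c), while a local NVMe write completes in tens of microseconds, so distance alone sets a latency floor long before any processing or queuing. The distances below are invented examples:

```python
# Back-of-envelope propagation latency (illustrative distances, real constant).
FIBER_KM_PER_MS = 200.0  # light in fiber travels ~200 km per millisecond

def min_rtt_ms(distance_km):
    """Minimum round-trip time imposed by physics alone."""
    return 2.0 * distance_km / FIBER_KM_PER_MS

print(min_rtt_ms(10))    # nearby edge site:  ~0.1 ms
print(min_rtt_ms(1000))  # regional cloud:    ~10 ms
print(min_rtt_ms(5000))  # distant region:    ~50 ms
```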


So now let’s bring this back to Edge Computing and the real world.

Example A with shared storage and a cluster is really an example of Edge Computing. Local storage, low latency, local compute, and the ability to collect and process data without reliance on external network SLAs.

Example B is also an example of Edge Computing, but with local constraints on compute and storage, hence the cloud being leveraged for storage and some processing.

Example C is an example of a data collector and proxy to the cloud. This is one of the most common models for IoT devices to simplify management.

All of these Edge Computing examples depict different requirements and scenarios that may even be intermixed. I’ll use a real world example in the Aerospace sector to showcase this.


“At last year’s Paris Air Show for example, Bombardier showcased its C Series jetliner that carries Pratt & Whitney’s Geared Turbo Fan (GTF) engine, which is fitted with 5,000 sensors that generate up to 10 GB of data per second. A single twin-engine aircraft with an average 12-hr. flight-time can produce up to 844 TB of data. In comparison, it was estimated that Facebook accumulates around 4 PB of data per day; but with an orderbook of more than 7,000 GTF engines, Pratt could potentially download zeta bytes of data once all their engines are in the field. It seems therefore, that the data generated by the aerospace industry alone could soon surpass the magnitude of the consumer Internet.” – Aviation Week


So how do these complex systems save all this data? They only keep it for short periods, then it’s deleted. During those short periods, all sorts of detailed analytics are run against it. Some data is processed immediately to provide instant feedback that dictates the response of the plane. Some data is processed and the output is then synced via satellite or long-range communications antenna to base stations for further review and processing. Some data is sent directly to another site for near-real-time processing with more powerful compute.
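A quick back-of-envelope check shows why the raw feed has to be processed at the edge. The generation rate comes from the quote above; the link speeds are rough assumptions for illustration only:

```python
# Why the raw GTF sensor feed cannot simply be backhauled (assumed link speeds).
GEN_MBPS = 10 * 8 * 1000  # ~10 GB/s of sensor data ~= 80,000 Mbps sustained

links_mbps = {
    "aero satcom link": 50,
    "LTE ground link": 100,
    "10 GbE gate link": 10_000,
}

for name, mbps in links_mbps.items():
    ratio = GEN_MBPS / mbps
    print(f"{name}: data is generated {ratio:,.0f}x faster than the link can carry")
```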

Edge Computing systems need to be capable of running different types of workloads, but also need to be easily manageable, resilient and modular. 

Nutanix addresses Edge Computing requirements with a number of features, such as centralized multi-cluster management with Prism Central, automated remote deployment with Foundation Central, and built-in replication with protection domains (all covered later in this series). Those are just a few of the features that help to support Edge Computing. In the following article from Steve Ginsberg (CIO Technology Analyst at GigaOm), the evolution of Edge Computing and its current state are also discussed.

https://www.nutanix.com/cxo/thought-leadership/edge-computing-redux

So far this addresses one part of the Edge equation: the local compute and storage that provide a low-latency, high-performance solution. The other parts are analytics and processing in the cloud, and management of the overall solution and application components. I will talk about those in the next post, which continues the Edge Computing discussion.

Nutanix ROBO, Edge, and Tactical – Design – Part 4 – Sizing

SIZING

When you are looking at sizing an environment there are several factors that need to be addressed and questions that should be considered:

1) What are the workloads that will run on the environment?

2) What does “fit for use” look like in terms of metrics (availability, manageability, performance, recoverability, security)?

3) What are the priorities if trade-offs need to be considered (i.e., fast, good, cheap)?

4) Create a requirements rubric that can be weighted against each candidate solution (see the sketch after this list).

5) Identify requirements that are “must-have”, vs “nice-to-have”.

6) If this is an existing workload, get historical consumption and performance data for the workload and its existing infrastructure.

7) If this is a new workload, model it based on recommended practices for that type of workload.

8) Determine the constraints for a solution (power, space, cooling, physical access, etc.).

9) Determine the hardware vendors being considered and compare their portfolios of viable solutions.

10) Determine the budget for this initiative and evaluate against solutions with a TCO (Total Cost of Ownership).
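Here is a minimal sketch of such a weighted rubric (point 4). The requirement names, weights, and scores are all invented; must-have requirements (point 5) are modeled as hard filters applied before any scoring:

```python
# Weighted requirements rubric (illustrative weights and scores).
weights = {"availability": 5, "manageability": 4, "performance": 3,
           "recoverability": 4, "security": 5}

candidates = {
    "solution_a": {"availability": 4, "manageability": 5, "performance": 3,
                   "recoverability": 4, "security": 4, "must_haves_met": True},
    "solution_b": {"availability": 5, "manageability": 3, "performance": 4,
                   "recoverability": 3, "security": 5, "must_haves_met": False},
}

for name, scores in candidates.items():
    if not scores["must_haves_met"]:      # must-haves filter before any weighting
        print(f"{name}: disqualified")
        continue
    total = sum(weights[req] * scores[req] for req in weights)
    print(f"{name}: {total} / {5 * sum(weights.values())}")
```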

This is not a comprehensive list, but it serves as a starting point for discussion. All too often, people will just look at source hardware specs and utilization, then try to replicate that in a final solution with a little space for growth. Although this may work for a quick-and-dirty estimate, it can miss the finer requirements and produce a functional but non-optimal solution.

With ROBO environments this can make a big difference because often they are replicated based on a certain configuration, or template. If that design is non-optimal, then the deficiency is multiplied by the number of sites.

Some key points to address:

  • What does the network look like and what is the available port type, count and bandwidth?
  • What do the traffic patterns look like? North-south? East-west? Traffic types? Network load?
  • How is the network logically segmented? 
  • What method is used to secure the environment? Network zone ACLs? Layer 3/7?
  • Is there value in microsegmentation for the workload?
  • Is the workload mobile, or does it need to have the ability to migrate automatically or manually?
  • Does the workload need to have a backup, BC/DR plan?
  • Is there currently a BC/DR plan in place? Does it meet the application owner requirements?
  • Can the BC/DR plan be improved? Would it cost more, or less? Can management complexity be reduced?
  • Is there application level data replication? Are there any infrastructure native alternatives that may be more beneficial for cost, performance, or manageability?
  • How resilient does the platform need to be? How many simultaneous, and serial failures for components does it need to protect against?
  • Does the platform need to be centrally managed?
  • Would there be benefit for deployment automation?

With ROBO environments the most common constraints are cost, manageability, and lifecycle management. An organization may have a set budget that they need to design against, which will limit options right out of the gate. This is where we can start looking at the trade-offs of designing against that budget and see if it is able to provide an optimal fit-for-use solution. If budget was increased, would other options that are more in line with the requirements be available?

To size against the most common constraints, let’s dig into them a bit:

Cost

There are hardware costs, software costs, operational costs, and provisioning costs.

  • Hardware will come from your vendor of choice. It’s all x86 anyway, so the differences often come down to non-technical aspects like client relationships, procurement vehicles, capex costs for comparable hardware, and vendor support.
  • Software costs cover the workload applications, hypervisor, guest OS, and infrastructure components.
  • Operational costs can be calculated as the number of individuals supporting the solution, multiplied by the number of hours spent on management, multiplied by the staff resource cost per hour (see the sketch after this list).
  • Provisioning costs are the operational costs required to stand up a new environment. Sometimes the provisioning staff will be different from the operational staff, so the cost per hour may change.
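Putting those four buckets together, here is a minimal cost-model sketch; every figure in it is invented for illustration:

```python
# Minimal cost model using the four buckets above (all figures invented).
def labor_cost(staff, hours_per_person, rate_per_hour):
    """people x management hours x loaded hourly rate"""
    return staff * hours_per_person * rate_per_hour

hardware = 120_000   # capex, vendor of choice
software = 45_000    # apps, hypervisor, guest OS, infrastructure components
ops_per_year = labor_cost(staff=2, hours_per_person=120, rate_per_hour=85)
provisioning = labor_cost(staff=1, hours_per_person=40, rate_per_hour=110)  # different team, different rate

tco_3yr = hardware + software + 3 * ops_per_year + provisioning
print(f"ops/year: ${ops_per_year:,}, 3-year TCO: ${tco_3yr:,}")
```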

Manageability

Some management solutions can manage multiple sites at no additional cost, while others will charge you for that capability. In a ROBO environment, it is good to understand this. Where the management plane resides is also important: a centralized management plane may be an issue if a site loses access to it, so local management capabilities may also be required.

Lifecycle Management

This includes updates, patches, expanding the environment, and removing end-of-life hardware. The expense for this is usually attributed to operational expense if it is a manual process. There may also be limited time available for it because staff are overworked, which results in fewer updates, lower performance, and higher risk of hitting software vulnerabilities or bugs. If it is automated, is there a cost for the platform that does this?

Here are some Nutanix solutions that work very well in ROBO environments.

Prism Central

Centralized management for multiple clusters / sites.

Prism Pro

A license tier on Prism Central that allows for playbook automation for operations, capacity planning, reporting analytics, and VM right-sizing. This will tell you when you are running out of resources, flag issues at any site, and create automated actions for remediation. Essentially a self-driving datacenter model for all your sites.

Nutanix Flow

This allows for microsegmentation, environment isolation at layer 2, layer 3 firewalling, identity-based security policies with Active Directory integration, overlay networking, IPsec VPN, NAT capability, and VPC constructs.

Nutanix Files

This provides scale-out file sharing for SMB and NFS. It gets rid of CIFS silos, Windows file shares that need constant patching and downtime, the associated security risks, and single points of failure. All Nutanix software tiers come with a free 1TB Files license.

Protection Domains

Perform async and near-sync replication to another site, with RPOs as low as 20 seconds.

Nutanix On-Prem Leap

Perform orchestrated DR with the ability to perform start-up order prioritization, and IP re-addressing.

Foundation Central

This provides the ability to deploy and configure sites via automation from a central location, without the need for an onsite resource (except for racking and cabling).

Here are some resources to help with sizing Nutanix for ROBO environments:

Deploy and Operate ROBO on Nutanix

Nutanix ROBO solution page

Nutanix Design Guide (See chapter 16, by Greg White)

Nutanix Sizer

Nutanix ROBO, Edge, and Tactical – Design – Part 3 – Licensing


LICENSING

This article is not only about Nutanix licensing, but also software licensing in general. A general rule with software licensing is that customers want to get as much functionality as possible to meet their requirements while paying as little as possible. Software vendors would like to make as much money off their customers as possible, while still providing value. Some models for licensing are:

One price, perpetual version license.

You buy it once and you own that version, with updates, forever. You buy version 1.0 of a product and have access to all updates of the 1.x series. However, you will need to rebuy the software at version 2.0 to get new features.

Multiple feature tier prices, perpetual version license.

You buy it once and you own that version and tier, with updates, forever. Feature sets may be labeled bronze, silver, gold, or standard, advanced, enterprise, etc. If you buy version 1.0 bronze, you can pay the difference to go to 1.0 silver or gold. However, if you want 2.0 silver, you need to rebuy the whole thing from scratch.

All features, All you can eat, perpetual version license.

This license lets you use the software as much as you want and make use of all the available features. This may be referred to as an ELA (Enterprise Licensing Agreement).

In some cases, you may get a post-paid bill depending on which features you use and the quantity of deployed instances. This is called a true-up.

The above licensing models are conducive to capex purchases performed every few years, depending on the software release cycle. As long as the software meets the user’s needs and updates are provided, there is little incentive to purchase a new version, unless additional features warrant the expense. One way for software vendors to increase revenue from customers that do not want to buy the new version is to sell support for a term. For instance, if version 1.x is out and so is 2.x, then perhaps the customer does not want to buy 2.0 yet. So they pay the software vendor a percentage of the purchase price to receive technical support and software updates for the 1.x series until it is end of life. This ends up being cheaper than purchasing 2.0, but only delays the capex expense. In the interim, the software vendor gets residual revenue, which is known as the “long tail”.

Another software licensing model is subscription-based licensing. There are a few different versions of this, but the two I will discuss are restrictive and non-restrictive. These aren’t necessarily the exact industry terms, but they convey the idea.

Restrictive subscription licensing

When the license term expires, all software functionality stops. Think of 30-day trial software, then extend that for longer terms.

Non-restrictive subscription licensing.

When the license term expires, all features and functions remain, but updates are no longer provided. Some software has separate support and subscription costs, so you may have one expire and not the other. For instance, if your support expires but not your subscription, then you can keep getting updates but no help with any issues. If your subscription expires but not your support, then you cannot update your software, but you can get assistance on issues from the software vendor. They may tell you that you “need to update”, which would be code for “renew your subscription”.

Nutanix licensing

Alright, now that you have a bit of a foundation in comparative software licensing models, I’ll describe how Nutanix does licensing and support.

Nutanix offers tiered, non-restrictive subscription licensing with integrated support, so there is just one price for licensing and support for a period of time. You can get licensing for 1-to-5-year terms, but will get the best bang for your buck on a 1-to-3-year contract. After your term is up, you can still use the software, but you can’t get updates or create support tickets. Renew your subscription and you get the latest version, updates, and support again.

As long as you have an active subscription, you can update to the latest version.

Now we look at the software tiers and how this all relates to ROBO environments.

Core AOS provides hyperconverged functionality and management with Prism. The tiers for AOS are Starter, Pro, and Ultimate. See here for a functionality comparison.

https://www.nutanix.com/products/software-options

There are also tiers for Prism (also called Starter, Pro, Ultimate). Prism Starter comes at no additional cost to all AOS versions. It allows for multi-cluster management and health monitoring and troubleshooting. Prism Pro adds operational insights, planning and automation. Prism Ultimate adds service level insights for applications, cost metering and chargeback.

License metering

The cost of a license depends on how it is metered. Nutanix has a few ways to do this, even though the license tiers may be the same.

  • Capacity Based Licensing (CBL) looks at the total cluster SSD capacity (in TiB) and the total core count. For example, let’s look at a 3-node cluster where each node has (see the worked sketch after this list):
    • 2 x 12-core CPUs (24 cores per node)
    • 2 x 1.92TB SSDs (3.84TB, or 3.49TiB – https://www.gbmb.org/tb-to-tib)

    So the licensing for the 3-node cluster would be metered at 11TiB of SSD and 72 cores.

  • Per-VM licensing, which is sometimes referred to as ROBO licensing, is a defined cost based on the number of VMs on the cluster. The amount of resources is not measured. This works well in situations where you require only a few VMs and want to give them the maximum resources possible. One limitation is that you are only allowed a maximum of 32GB of RAM per VM, which stops things like massive databases running on that licensing.
  • VDI core licensing uses concurrent users as the metering measure. This allows for a linear cost based on users rather than infrastructure resources. Another benefit is license mobility between sites, depending on where your users are. A 2-site architecture with 200 concurrent users and active/passive failover can have 200 x VDI core licenses at site 1 and 1 x VDI core license at site 2. During a DR event, you would then reallocate 199 licenses from site 1 to site 2. This makes licensing the DR capability of VDI environments very cost effective. VDI core licensing can be applied to Citrix, VMware Horizon, Nutanix Frame, or any other VDI solution.
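Here is the CBL math from the example above worked out in a few lines; the TB-to-TiB conversion factor is exact, and the round-up to a whole TiB mirrors the example:

```python
# Worked CBL metering math for the 3-node example above.
import math

NODES = 3
CORES_PER_CPU, CPUS_PER_NODE = 12, 2
SSD_TB_PER_DRIVE, DRIVES_PER_NODE = 1.92, 2

TIB_PER_TB = 1e12 / 2**40  # 1 TB (10^12 bytes) = ~0.9095 TiB

cores = NODES * CPUS_PER_NODE * CORES_PER_CPU
ssd_tib = NODES * DRIVES_PER_NODE * SSD_TB_PER_DRIVE * TIB_PER_TB

print(cores)               # 72 cores licensed
print(round(ssd_tib, 2))   # ~10.48 TiB raw
print(math.ceil(ssd_tib))  # metered as 11 TiB, matching the example
```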

Here is a link to the Nutanix Licensing Guide

Nutanix ROBO, Edge, and Tactical – Design – Part 2 – Node Types

NODE TYPES

In the Nutanix world, there are several main node types:

1) Hyperconverged (HC)

These are your standard nodes that provide a combination of storage and compute. In this node classification, you also have a sub-classification:

  • Light Compute – Usually lower-end processors; historically used for older Storage-Only nodes, like the NX-6035C-G5. You don’t see these anymore.
  • Single CPU – Used in very small environments that need to keep costs, power, and core licensing down. Examples are the NX-1175S-G7 and the NX-1120S. The “S” designation on the model indicates a single CPU.
  • Entry level – Lower-end dual CPU, but only a single SSD and limited memory scalability. These reduce core and SSD costs to a near minimum. An SSD failure will be seen as a CVM failure because there is no redundancy. An example is the NX-1065-G7.
  • Multi-node blocks – These cram as many nodes as possible into a small form factor. You get 4 nodes in 2U with something like the NX-3065-G7, or 2 nodes in 2U with the NX-8035-G7.
  • Slim profile blocks – These 1U servers have the same node density as the 8035 nodes, but with physical block separation. An example is the NX-3170-G7.
  • GPU nodes – These nodes have physical space and slots for GPU cards. An example is the NX-3155-G7.

2) Storage Nodes (SN) 

Any node can be configured as a storage node. The main difference is that memory and CPU resources are reduced, because only the CVM has to run on the node.

There are more node types for other workloads; a full list of current node types from Nutanix and OEM vendors can be viewed here: https://www.nutanix.com/products/hardware-platforms/specsheet

I am only focusing on the node types that are most often seen in ROBO environments. Therefore Storage Heavy, Compute Heavy, and Compute Only will not be discussed.

I’ve given examples of each ROBO node type using NX hardware. The same or equivalent hardware can also be obtained from other vendors. See the dynamic spec sheet above to find the OEM nodes that best fit your use case.

Some key node models that should be looked at from the OEMs are:

HPE DX8000

This ruggedized 5U 4-node block is fairly small and can be affixed to a vehicle or used in other tactical or edge environments.

HPE ProLiant DX8000 5U chassis, front cabling view: 1U blank, ProLiant DX910 1U server blades, chassis manager, and 1500W (AC/DC) power supplies.

DX8000 specs here.

KLAS Voyager 2 TDC

These nodes are ruggedized, small, come with their own switching, and can fit in an airline carry-on compartment. If installed in a rack, the solution takes 5U at short depth.

Voyager 2 TDC specs here

Lenovo HX 1021 

This node is 1U half width, so you can put 2 nodes in 1RU. A very small form factor indeed. This is great for maximizing space with a limited number of nodes.

HX 1021 Specs here:

Nutanix NX-1120S

Not an OEM node, but a recent addition to the Nutanix NX line-up, and worth mentioning. This tiny node is 1U, less than 1′ deep, and uses only about 120W. That’s the same power draw as a bright incandescent light bulb.

NX-1120S Specs here.

Nutanix ROBO, Edge, and Tactical – Design – Part 1 – Use Cases


USE CASES

ROBO can mean a lot of different things to people, depending on the use case. So I’m going to list some of those use cases and drill into them so that we can better understand how to design for them.

1) Remote Office Branch Office

This is the literal interpretation of ROBO, where a small amount of infrastructure is required for local workloads and a larger datacenter infrastructure is located somewhere else.

2) SMB

Small and Medium Business. This is a small infrastructure that may act as the only datacenter for a company. Costs and management overhead are primary factors. Some SMBs may have multiple sites, and require some datacenter-like functionality.

3) DR site

A DR site may need the capability to run all workloads from another site, or a subset of them, or possibly just act as a replication target.

4) Pop-up site (Edge)

On-demand, or pop-up sites need to be provisioned and deployed in a very short period of time, then torn down just as quickly. There is no room for config or operational error and the entire process must be streamlined.

5) Mobile (Edge)

This means physically mobile: terrestrial vehicles, aircraft, watercraft, spacecraft. These require physically hardened systems resistant to shock, vibration, temperature extremes, pressure, fire, water, and physical damage.

6) Industrial (Edge)

Hardened, resilient systems that need extremely high availability. These are often parts of critical systems and have no room for loss of availability. Examples are SCADA control and monitoring systems.

7) IoT (Edge)

Internet of Things device management, data analysis, and processing. Run applications that interpret data closer to the edge for quicker response times or higher security than sending the data to the cloud.

8) Tactical (Edge)

These are meant for rapid deployment and centralized management. They provide operational services for military, government, and emergency services. They need to be easily deployable, replicate data, ensure encryption, and be disposable in short order if required.

9) Small datacenter

This may be a small quantity of nodes, thus fitting the ROBO ideal, but running tier 1 workloads. These could be databases, GPU-backed artificial intelligence workloads, etc.

10) Dev / Test / Lab

Also not entirely ROBO, a dev/test/lab environment will often start small to replicate a production workload. Because of its often small initial footprint, it is included in this category.

In the next post we will look at node types from Nutanix and various OEMs to see what hardware can be used for these use cases.

Nutanix HCI Analyst Reports

Understanding HCI Analyst Reports (Gartner, Forrester) in 2020


There are a lot of different organizations that perform market research and analysis. Some of them are well known, and others not so much. The model is essentially this:

Information gathered by researchers > Reports created > Investors buy reports

There’s a bit more nuance to it though.

1) Each research organization will perform a detailed analysis of several market sectors. This will include a set of criteria that encompass the capabilities of one or more solutions from a number of companies. Each participating company needs to provide information back to the research organization to be included in the report. If a company does not provide the information requested or does not wish to participate, it will be excluded from that specific report.

Here are a bunch of well-known research organizations:

2) The reports generated by these companies are used in a number of ways. First and foremost, they are provided, at a price, to people or organizations that want insight into market information for investment purposes. The reports also directly affect the share value of the companies listed in them. The second use case is comparative analysis of prospective solutions for CIOs, managers, and general consumers. However, since the reports are only provided at a cost, the public-facing free information is not the entire report. So the movement, trends, and reasoning behind the rankings are not shown, and context can be lost without that supporting information.

Two very popular reports are the Gartner Magic Quadrant (MQ) and the Forrester Wave. These generally have a lot of sway and credibility in the industry. Nutanix has been in a number of different reports over the years, and the categories have changed with the industry. When Nutanix first started to get noticed, it was included in a converged infrastructure comparison, not hyperconverged, because the market was too small. As time went on and there were more competitors in the space, it made sense to have a separate HCI analysis. However, there were some flaws in the reporting, as some hardware vendors were using other software solutions to provide a complete solution, so they were getting double the recognition they should have. A solution to that is a software-only HCI comparison. The problem with that is that not all analyst organizations see eye to eye on which companies should be included. So you will see different players in the Gartner MQ vs. the Forrester Wave because of this.

Below are the latest 2020 HCI results for both. When all is said and done, the current market data shows a 2-horse race between Nutanix and VMware. Both have a large portfolio of solutions to tackle any workload; however, the implementation strategy is different. VMware has the software-defined datacenter (SDDC) with VMware Cloud Foundation (VCF), and Nutanix has AOS and Prism.

  • VCF includes ESXi, VSAN, NSX and the vRealize Suite
  • The Nutanix stack includes AOS (core HCI) and Prism (management, monitoring, automation, network virtualization).

Both companies also have an array of offerings that can be enabled to provide additional functionality.

Forrester Wave for Hyperconverged Infrastructure, Q3 2020

Gartner Magic Quadrant for Hyperconverged Infrastructure Software 

December 2020

What I am seeing over time is that Nutanix and VMware lead the market by all measures, regardless of the report. Nutanix has a higher ability to execute and a stronger current offering. VMware has a slightly higher completeness of vision, due to its foray into public cloud and its partnerships. However, that may change soon due to some leadership changes at Nutanix.

The other contenders address niche use cases, only participate in some regional markets, or only target SMBs. I think there will always be competition, but the magnitude of difference almost requires different categories to compare against. The less mature offerings are also being acquired, consolidated, and left in a state of flux in the current economy.

Get the reports from here:

https://www.nutanix.com/go/forrester-wave-2020

https://www.nutanix.com/go/gartner-magic-quadrant-for-hyperconverged-infrastructure