December 28, 2020|Design, ROBO
Here are the things that I am going to cover in this design series. As each blog is written, I will update the links below to the other pages. The idea here is to create a live design book that can be referenced and updated for different types of use cases. This first one is ROBO.
- Use cases
- Node types
- Tactical Data Centers
- Single and dual node
- Re-use existing hardware
- Re-use existing storage
- Competitive outlook
- External factors (radiation, heat, water, etc.)
- Designing for each use case
I want to draw a distinction between Edge Computing and general purpose computing running at remote locations. The core idea behind Edge Computing is to decrease latency and increase service reliability. This needs to be addressed because we are in an interim technological stage in the evolution of computing.
Constraints are the cause of this, specifically network constraints. In the near future we will have ubiquitous computing that is available everywhere and capable of performing all functions that are required of it. Bandwidth will be far beyond any individual's needs and in surplus to everyone at near-zero cost to the consumer. 5G networks can (in theory) provide up to 20 Gbps downstream and 10 Gbps upstream using the Enhanced Mobile Broadband (eMBB) standard. Low latency, high bandwidth network communication across the world is ideal, but what about transit costs, security, and availability? It's a big step on a long path to where we want to be.
Ubiquitous Computing (UbiComp) has many meanings, depending on the decade in which it is discussed. Some aspects of it have already come to fruition in day-to-day life. Smartphones, tablets, smart devices, IoT devices, voice assistants, cloud computing, SaaS/PaaS/XaaS; these are all manifestations of UbiComp. The three components are compute, network, and data. The location of each of these will dynamically change based on many factors.
Here are some of the challenges we face with ubiquitous computing:
- Compute capacity is generally centralized.
- Decentralized compute capacity is often limited to end-user devices.
- Network bandwidth and latency are most often best-effort with no SLA.
- Network availability and performance SLAs require a cost uplift with a steep curve.
- When a shared architecture is used, the integrity of the infrastructure/network/data security needs continuous auditing, oversight, and validation.
- In a peered computing model, there are no standards for secure multi-tenant guest compute across independent infrastructures.
- As compute dynamically moves, latency to data will fluctuate based on data gravity and dependencies.
This is a non-exhaustive list, but it illustrates that we have a long way to go. If ubiquitous computing were to be depicted as a world, it would be one of many islands on a stormy sea with bridges, boats, and hot air balloons connecting them. What Edge Computing does is allow the smaller islands to collect, process, and interpret data without navigating the unreliable paths to the larger islands.
Let’s look at one of the above challenges and dig into it a bit deeper.
Compute capacity is generally centralized.
Whether this is in a datacenter or in a cloud, data needs to get onto a platform for processing. Some processing can be done at the desktop/end-user device level, but it is often limited to single-user operations. In this example, I will show some of the constraints experienced when data is collected and processed, and the output needs to be consumed by others.
If data were collected by an individual, a spreadsheet is a natural example. I have outlined three scenarios using common Microsoft applications and services.
I am not advocating the use of these for any workflow, but rather using them as illustration points because they are familiar applications for many people and can show data logging, processing and replication to the cloud.
(A) Excel on a desktop computer. Resources are limited to the local capabilities. Nothing is shared.
(B) Excel on a desktop with synchronization of a folder to Microsoft OneDrive. Resources are limited to the local capabilities, but data can be synchronized to the cloud with virtually unlimited storage. Once the data is in the cloud, others can access it and manipulate / process / interpret it. There is a certain amount of latency between when data is written locally and when it is synced to the cloud and accessible by others.
(C) Excel Online. Data is written to the cloud in real-time, however the latency to write in the cloud is greater than it is on local storage. There is virtually no limit to the storage, and the local resource requirements can be quite low. Data is immediately available to others when written, however if there are connectivity or bandwidth issues, then data loss can occur.
A person could use a desktop version of Excel on their computer to edit/create a document. If they kept a new copy for every version, then they could see the changes that occur over a period of time. This is the core idea of datalogging: data is captured at time-based intervals, put into a table, and saved. If they wanted to capture multiple data sets and save them into discrete files, then the I/O and storage requirements would increase with the number of concurrent data sets being captured. At some point, local storage could not meet the capacity requirements or the I/O requirements.
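A minimal sketch of that kind of interval-based datalogging (the sensor read is a hypothetical stand-in for a real data source):

```python
import csv
import time

def read_sensor():
    # Hypothetical stand-in for whatever is actually producing the data.
    return {"temp_c": 21.5, "pressure_kpa": 101.3}

def log_samples(path, interval_s=1.0, samples=5):
    """Capture one row per interval and append it to a CSV table."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "temp_c", "pressure_kpa"])
        for _ in range(samples):
            reading = read_sensor()
            writer.writerow([time.time(), reading["temp_c"], reading["pressure_kpa"]])
            time.sleep(interval_s)

log_samples("sensor_log.csv", interval_s=0.01, samples=5)
```

Each concurrent data set would be another file like this one, which is where the I/O and capacity growth comes from.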
This is captured in scenario (A) with the below example.
Because the local resources are finite, continuously increasing capacity and performance requirements are not feasible. Some ways to address this are:
- Provide shared storage to increase total capacity.
- Limit the amount of data retention to a fixed period, thus ensuring that an upper limit threshold will not be surpassed if the data size per interval is consistent.
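The second option (a fixed retention period) can be as simple as pruning anything older than the window after each capture; a minimal sketch, assuming one data file per interval:

```python
import os
import time

RETENTION_S = 3600  # keep one hour of data; with a consistent data size per
                    # interval this caps both capacity and file count

def prune_old_files(directory, now=None):
    """Delete data files whose modification time falls outside the window."""
    now = now or time.time()
    removed = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and now - os.path.getmtime(path) > RETENTION_S:
            os.remove(path)
            removed.append(name)
    return removed
```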
This provides some breathing room, but the next constraints to be hit are compute capacity and availability. If the amount of compute required to collect and process the data is beyond the capability of a single machine, then several machines need to share the work. The most efficient way to do this is with virtual machines or containers on a cluster. Those methods ensure that the availability of data logging is not impacted by updates, maintenance, or component failure. Or perhaps some predefined analytics need to be conducted on the data before it ages out and is overwritten. The amount of time it takes to run the analytics on the data needs to be less than the data retention period so that the job won't fail.
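That last constraint amounts to a simple feasibility check: a pass over the retained data must finish before that data ages out. A rough sketch with illustrative numbers:

```python
def analytics_fit(retention_s, data_rate_mb_s, process_rate_mb_s, overhead_s=0.0):
    """Return True if one pass over a full retention window of data
    completes before that window ages out and is overwritten."""
    window_mb = retention_s * data_rate_mb_s        # data held at any time
    runtime_s = window_mb / process_rate_mb_s + overhead_s
    return runtime_s < retention_s

# One hour of retention at 2 MB/s; analytics processes 5 MB/s.
print(analytics_fit(3600, 2.0, 5.0))  # fits: 1440 s < 3600 s
```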
If more analytics need to be run beyond the data retention period or local storage, then the data needs to be replicated to another storage system with more capacity. The cloud is a good option for this with virtually unlimited capacity, where the constraint is just your budget.
This is represented below in example (B).
I’ll add a bit of explanation to the above graphs. The grey graphs are identical to those of the local desktop (example (A)), and the blue are for the sync with OneDrive (or the cloud in general). The cloud has virtually unlimited storage, so the available storage always stays at 100%. There will always be a base amount of latency going to the cloud, and it will increase as I/O increases, for three reasons. First, additional I/O means more bandwidth requirements, which may require buffering or queuing before a sync completes. Second, there are limits to I/O in the cloud when the same data is being accessed by multiple parties or systems. It may be better than a single local system, but it's not meant for heavy I/O. Third, if the other parties or systems are not running in the cloud, then there is another latency hop to sync the data to them before it is accessible.
One way to remove latency from all these data syncs before analytics can be performed, is to write directly to the cloud and use compute in the cloud to process the data and run analytics. This removes the need for local data retention and sync processes, but it does put more emphasis on the criticality of the network link. This is seen in example (C) below:
The overall latency is lower than that of (B) because of the removal of sync jobs, and the local storage requirements are very small because there is no local data retention. There will always be a base latency greater than that of local storage because of electromagnetic propagation constraints. Basically, the cloud is farther than local storage and therefore takes longer to write to it.
So now let's bring this back to Edge Computing and the real world.
Example A with shared storage and a cluster is really an example of Edge Computing. Local storage, low latency, local compute, and the ability to collect and process data without reliance on external network SLAs.
Example B is also an example of Edge Computing, but with local constraints on compute and storage; hence the cloud is leveraged for storage and some processing.
Example C is an example of a data collector and proxy to the cloud. This is one of the most common models for IoT devices to simplify management.
All of these Edge Computing examples depict different requirements and scenarios that may even be intermixed. I’ll use a real world example in the Aerospace sector to showcase this.
“At last year’s Paris Air Show for example, Bombardier showcased its C Series jetliner that carries Pratt & Whitney’s Geared Turbo Fan (GTF) engine, which is fitted with 5,000 sensors that generate up to 10 GB of data per second. A single twin-engine aircraft with an average 12-hr. flight-time can produce up to 844 TB of data. In comparison, it was estimated that Facebook accumulates around 4 PB of data per day; but with an orderbook of more than 7,000 GTF engines, Pratt could potentially download zeta bytes of data once all their engines are in the field. It seems therefore, that the data generated by the aerospace industry alone could soon surpass the magnitude of the consumer Internet.” – Aviation Week
So how do these complex systems save all this data? They only keep it for short periods, then it's deleted. In these short periods, all sorts of detailed analytics are run against it. Some data is processed immediately to provide instant (no latency) feedback and dictates the response of the plane. Some data is processed and the output is then synced via satellite or long range communications antenna to base stations for further review and processing. Some data is sent directly to another site for near real-time processing with more powerful compute.
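That tiering (instant local feedback, a short retention window, and a compact summary for uplink) can be sketched as a few small stages; the threshold, window size, and summary fields here are illustrative assumptions, not anything from a real avionics system:

```python
def process_reading(value, retained, alert_threshold=100.0, window=60):
    """Tier 1: instant local decision; Tier 2: short-lived retention."""
    alert = value > alert_threshold      # no-latency feedback path
    retained.append(value)
    if len(retained) > window:           # data ages out quickly
        retained.pop(0)
    return alert

def summarize_for_uplink(retained):
    """Tier 3: compact output synced to a base station, not the raw data."""
    return {"count": len(retained),
            "mean": sum(retained) / len(retained),
            "max": max(retained)}
```

The point of the summary stage is that what crosses the constrained link is orders of magnitude smaller than what was collected.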
Edge Computing systems need to be capable of running different types of workloads, but also need to be easily manageable, resilient and modular.
Nutanix addresses Edge Computing requirements with these following features:
- The platform is self-healing with no single point of failure.
- A single Scale-Out Prism Central management instance can manage up to 300 clusters or 1,000 nodes.
- Run both VMs and containers, managed by native Kubernetes, on a single platform.
- Built-in on-demand or event-driven automation.
- Easily extend automation to any 3rd-party platform or application via REST APIs.
- Built-in application lifecycle management for complex multi-tiered applications.
- Predictable performance and modular scaling.
Those are just a few of the features that help to support Edge Computing. In the following article from Steve Ginsberg (CIO Technology Analyst at GigaOm), the evolution of Edge Computing and its current state are also discussed.
So far this addresses one part of the Edge equation, which is the local compute and storage for providing a low latency high performance solution. The other parts are analytics and processing in the cloud, and management of the overall solution and application components. I will talk about those in the next blog which will continue the Edge Computing discussion.