Cloud software stacks, like those used at Amazon, Google, Facebook, Microsoft and other hyperscale operations, have radically altered how system architects design and build data centers. Out are big, gold-plated, integrated hardware systems like fault-tolerant x86 boxes, mainframes and enterprise storage arrays. In are vast arrays of interconnected commodity pizza-box servers with local disks, the Legos for building cloud-based applications. In this environment, size, power efficiency and cost are of prime importance, as sophisticated cloud management software distributes workloads across many systems, automatically moving them as necessary without disrupting running applications or their users. As the saying goes, in the cloud you treat systems like cattle, not pets. Servers are disposable.
Disposability also means economization in cost, space and power usage, which poses a threat to high-performance Intel processors and name-brand servers. But unlike in mobile processors, where Intel completely missed the boat and is still playing catch-up, the firm's reflexes in the data center, where it's already the dominant platform, have been much more agile, actually anticipating the needs of cloud system designers. While there is room for low-end, consumer-oriented processors to disrupt the market, early indications are that next-generation ARM-based products won't be the catalyst for change.
What’s different about the cloud?
Cloud hardware is entirely virtualized, and given the speed of today's processors, each system can handle perhaps dozens of workloads. There's no need to build servers with beefy, high-performance CPUs when a bunch of slower, more efficient chips can do the same job. Thus, it's long been thought that cloud designs would evolve to look more like a colony of ants than a herd of elephants. Applied to systems and processors, the cloud philosophy places a premium on low-power chips that can be densely packed, not on the highest performance per CPU socket.
The cloud design paradigm presented a challenge for Intel, which evolved its CPU architecture in the era of monolithic client-server applications, where workloads were confined to a single server and were, at best, multithreaded enough to use several processing cores simultaneously. Prior to the rise of cloud services, massively parallel software was the stuff of government and university research labs. Thus, Intel long focused its engineering prowess on maximizing performance at any cost. But the cloud changed the rules of system design by making compute density paramount: the relevant metrics are now performance per watt and performance per cubic inch, not raw horsepower. Much as RISC instruction sets exploited hardware bottlenecks in traditional CPUs with a new, streamlined architecture to become the predominant platform for Unix boxes in the 90s, hyperscale clouds seemed to leave an opening for alternative processors better aligned with the needs of virtualized workloads and tuned for density, efficiency and cost.
Since power efficiency and low cost are precisely the qualities of smartphones vis-à-vis PCs, ARM designs, which power virtually all of today's mobile devices, including the iPhone, iPad and Galaxy lineups, are the natural 'anti-Intel'. ARM servers have been tried before, notably by Calxeda, but neither the market nor the technology was ready for mass adoption, as evidenced by that firm's humbling demise a year ago. For servers, ARM's major drawbacks were an antiquated 32-bit instruction set and the lack of hardware support for virtualization, which together made ARM systems unsuitable for hosting cloud software stacks or VMware, the virtualization software of choice for enterprise IT. ARM addressed both limitations in its Cortex-A57 reference design. Although Apple introduced the first 64-bit ARM chip over a year ago in its iPhone 5s, commercially available ARM-64 products are only now coming to market. The first 64-bit ARM SoC designed specifically for server and embedded applications comes from Applied Micro, and with pre-production development kits made available this fall, it's finally possible to see how well an ARM-based system performs on actual server workloads.
Unfortunately for ARM and its design partners, Intel hasn't been standing still. It holds a substantial lead in process technology over any of the semiconductor foundries available to manufacture ARM-based designs, a lead that translates into fundamental performance advantages rooted in the physics of semiconductor operation. Case in point: Intel's recently released third-generation Xeon E5 server CPUs, which push an already capable product to new levels of performance, efficiency and cloud workload management. The superior overall performance of the E5 v3 is well documented; the open question is how it compares to next-generation ARM-based servers in dense, hyperscale cloud environments. The release of Applied Micro's kit provides the first clues.
Leave it to the Scientists: Analyzing ARM-64 on Real World Applications
With 64-bit APM development systems now available, the first independent tests were published this fall by physicists at CERN. The team of researchers benchmarked Applied Micro's X-Gene system, with an 8-core 64-bit ARM SoC, against two Intel-based systems: a conventional 8-core Xeon and a many-core Xeon Phi. Since the Xeon Phi is designed for highly parallelized workloads, the interesting result for data center designers is ARM's performance against the Xeon. As expected, in raw horsepower the Xeon mopped the floor with the ARM box, but things are closer when measuring performance per watt. However, the CERN group's results understate Intel's advantage, since they used a now-obsolete first-generation E5 Xeon (Sandy Bridge) in their tests, not the latest Haswell E5 v3. The table below presents key specifications for each of the processors:
| CPU family | CPU model | Cores | Threads/core | Freq. (GHz) | TDP (W) | Process (nm) | Launch date |
|---|---|---|---|---|---|---|---|
| Xeon E5 (v1) | E5-2650\* | 8 | 2 | 2.0 | 95 | 32 | Q1'12 |
| Xeon E5 v2 | E5-2650 v2 | 8 | 2 | 2.6 | 95 | 22 | Q3'13 |
| Xeon E5 v3 | E5-2630 v3 | 8 | 2 | 2.4 | 85 | 22 | Q3'14 |
| ARM64 A57 | APM Helix 1 | 8 | 1 | 2.4 | 42 | 40 | Q3'14 samples, 2015 prod. |

\* used in CERN tests
When comparing the basic parameters, a couple of things stand out:
- The two-generation process technology advantage Intel's latest parts enjoy over the APM device, which is fabricated at TSMC's foundry
- The APM device’s remarkably high power usage compared to mobile ARM processors, which typically run at 2.5 W (phones) to 5 W (tablets). Indeed, Intel now has a gen-3 E5 part that comes within 10 watts (23%) of the APM device.
A mere 2-to-1 power advantage over Intel’s latest v3 parts leaves the APM SoC severely lacking in performance efficiency, as the following benchmarks show. Of course, APM and other ARM builders could reduce power by moving to a more advanced process node, although it’s unclear whether the 20 nm TSMC process used by Apple would work for APM’s design. For now, the part is simply not competitive.
Extrapolating the CERN team’s test results into a comparison against Intel’s current-generation part is necessarily imprecise, but one way is to look at the relative performance differences among roughly equivalent Xeon products from the three E5 generations. SPEC benchmarks are widely available for all three, and for overall system performance the SPECint rate is the best measure. The following table shows data for each generation:
| Generation | CPU model | SPECint_rate2006 | Relative perf. |
|---|---|---|---|
| Xeon E5 (v1) | E5-2650 | 538 | 1.000 |
| Xeon E5 v2 | E5-2650 v2 | 683 | 1.270 |
| Xeon E5 v3 | E5-2630 v3 | 686 | 1.275 |
We can apply the relative performance measure to compare the APM device to an E5 v3 Xeon. CERN’s data shows the first-generation E5 delivering about 2.5 times the overall performance of the ARM SoC on a mix of tests (called CMS) that is representative of particle physics data analysis workloads. This implies that a current-generation Xeon of roughly the same specs as the part used in CERN’s analysis would have over three times the raw performance of the ARM device.
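As a back-of-the-envelope check, the extrapolation works out in a few lines of Python. The 2.5× figure is CERN's measured CMS ratio and the 1.275 scaling factor comes from the SPEC table above; the variable names are illustrative, not from any published analysis:

```python
# Extrapolate CERN's measured Xeon-vs-ARM ratio to the current Haswell part.
cern_v1_vs_arm = 2.5   # E5 (v1) vs APM X-Gene on CERN's CMS workload mix
v3_vs_v1 = 1.275       # E5 v3 relative to v1 baseline (SPECint_rate2006)

v3_vs_arm = cern_v1_vs_arm * v3_vs_v1
print(f"Estimated E5 v3 vs ARM raw performance: {v3_vs_arm:.2f}x")
# prints: Estimated E5 v3 vs ARM raw performance: 3.19x
```

Roughly 3.2×, hence "over three times" the ARM device's raw performance.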
Things tighten up when looking at compute efficiency, i.e. performance per watt. Here the APM system came within 10% of the Sandy Bridge Xeon; however, since the Haswell v3 is both faster and more efficient, the gap widens considerably, with the ARM device delivering only about 65% of the Xeon’s compute efficiency. The chart below summarizes the data.
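The efficiency gap can be sketched the same way, using the TDP figures from the spec table as a rough proxy for power draw (the CERN team measured actual consumption, so treat this only as an order-of-magnitude check):

```python
# Rough performance-per-watt comparison using TDP as a power proxy.
arm_vs_v1_eff = 0.90   # CERN: ARM within ~10% of the v1 Xeon's perf/watt
v3_vs_v1_perf = 1.275  # SPECint_rate2006 scaling, v3 vs v1
v1_tdp, v3_tdp = 95, 85  # watts, from the spec table above

# The v3 is both faster and lower-power, so its efficiency gain compounds.
v3_vs_v1_eff = v3_vs_v1_perf * (v1_tdp / v3_tdp)
arm_vs_v3_eff = arm_vs_v1_eff / v3_vs_v1_eff
print(f"ARM perf/watt vs E5 v3: {arm_vs_v3_eff:.0%}")
# prints: ARM perf/watt vs E5 v3: 63%
```

That lands within rounding of the ~65% figure cited above.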
Intel Still Rules the Cloud
The first 64-bit ARM server processors illustrate the nearly impossible task of competing with Intel in the data center, a lesson AMD learned the hard way. Intel’s tick-tock product development strategy means the steady delivery of improved performance, derived either from new, denser fabrication processes or from major circuit design changes. Although one can’t judge the merits of ARM’s 64-bit platform on a single implementation, the first results show no benefit for cloud infrastructure, where Haswell-generation Xeons deliver superior absolute and energy-adjusted performance while preserving the time-tested x86 instruction set.
Cloud builders seeking maximum density should stick with hyperscale 2U, quad-node x86 systems like those used for VMware’s EVO:RAIL, since it’s unlikely ARM systems will match their performance, density and flexibility any time soon.