Intel Still Rules the Cloud: Haswell Puts the Squeeze on ARM’s Server Plans

By Kurt Marko | December 9, 2014
Update: A rewritten and condensed version of this report appeared as a Forbes column, available here.

Cloud software stacks, like those used at Amazon, Google, Facebook, Microsoft and other hyperscale operations, have radically altered how system architects design and build data centers. Out are big, gold-plated integrated hardware systems like fault-tolerant x86 boxes, mainframes and enterprise storage arrays. In are vast arrays of interconnected commodity pizza-box servers with local disks, the Legos for building cloud-based applications. In this environment, size, power efficiency and cost are of prime importance, as sophisticated cloud management software distributes workloads across many systems, automatically moving them as necessary without disrupting running applications or their users. As the saying goes, in the cloud you treat systems like cattle, not pets. Servers are disposable.

Disposability also means economization in cost, space and power usage, which poses a threat to high-performance Intel processors and name-brand servers. But unlike with mobile processors, where Intel completely missed the boat and is still playing catchup, the firm's reflexes in the data center, where it's already the dominant platform, have been much more agile, actually anticipating the needs of cloud system designers. Yet while there is room for low-end, consumer-oriented processors to disrupt the market, early indications are that next-generation ARM-based products won't be the catalyst for change.

What’s different about the cloud?

Cloud hardware is entirely virtualized, which means that, given the speed of today's processors, each system can handle perhaps dozens of workloads. There's no need to build servers with beefy, high-performance CPUs when a bunch of slower, more efficient chips can do the same job. Thus, it's long been thought that cloud designs would evolve to look more like a colony of ants than a herd of elephants. When applied to systems and processors, the cloud philosophy places a premium on low-power chips that can be densely packed, not the highest performance per CPU socket.


The cloud design paradigm presented a challenge for Intel, which evolved its CPU architecture in the days of monolithic client-server applications, when workloads were confined to a single server and were at best multithreaded enough to use several processing cores simultaneously. Prior to the rise of cloud services, massively parallel software was the stuff of government and university research labs. Thus, Intel long focused its engineering prowess on maximizing performance at any cost. But the cloud changed the rules of system design by making compute density paramount. The relevant system metrics are now performance per watt and performance per cubic inch, not raw horsepower. Much as RISC instruction sets exploited the hardware bottlenecks of traditional CPUs with a new, streamlined architecture to become the predominant platform for Unix boxes in the '90s, hyperscale clouds seemed to leave an opening for alternative processors better aligned with the needs of virtualized workloads and tuned for density, efficiency and cost.

Since the qualities of power efficiency and cost are precisely those of smartphones vis-a-vis PCs, ARM designs, which power virtually all of today's mobile devices, including the iPhone, iPad and Galaxy lineup, are the natural 'anti-Intel'. Although ARM servers have been tried before, notably by Calxeda, neither the market nor the technology was ready for mass adoption, as evidenced by that firm's humbling demise a year ago. For servers, ARM's major drawbacks have been an antiquated 32-bit instruction set and the lack of hardware support for virtualization, which together made ARM systems unable to host cloud software stacks or VMware, the virtualization software of choice for enterprise IT. ARM addressed both of these limitations in its Cortex-A57 reference design. Although Apple introduced the first 64-bit ARM chip over a year ago in its iPhone 5s, commercially available ARM-64 products are just now coming to market. The first Cortex-A57 SoC specifically designed for server and embedded applications comes from Applied Micro. With pre-production development kits made available this fall, it's finally possible to see how well an ARM-based system performs on actual server workloads.


APM X-Gene Motherboard with 64-bit ARM CPU

Unfortunately for ARM and its design partners, Intel hasn't been standing still. It holds a substantial lead in process technology over any of the semiconductor foundries available to manufacture ARM-based designs, a lead that translates into fundamental performance advantages derived from the pure physics of semiconductor operation. Case in point: Intel's recently released third-generation E5-series Xeon server CPUs, which push an already capable product to new levels of performance, efficiency and cloud workload management. The superior overall performance of the E5 v3 is well documented; the open question is how it compares to next-generation ARM-based servers in dense, hyperscale cloud environments. The release of Applied Micro's kit provided the first clues.

Leave it to the Scientists: Analyzing ARM-64 on Real-World Applications

With 64-bit APM development systems now available, the first independent tests were published by physicists at CERN this fall. The team of researchers benchmarked Applied Micro's X-Gene system, with an 8-core Cortex-A57 SoC, against two Intel-based systems: a conventional 8-core Xeon and a many-core Xeon Phi. Since the Xeon Phi is designed for highly parallelized workloads, the interesting result for data center designers is ARM's performance against the Xeon. As expected, in raw horsepower the Xeon mopped the floor with the ARM box, but things are closer when measuring performance per watt. However, the CERN group's results understate Intel's advantage, since they used a now-obsolete first-generation E5 Xeon (Sandy Bridge) in their tests, not the latest Haswell E5 v3. The table below presents key specifications for each of the processors:

CPU family   CPU model      Cores  Threads/core  Freq. (GHz)  TDP (W)  Process (nm)  Launch date
Xeon E5      E5-2650 *      8      2             2.0          95       32            Q3'12
Xeon E5 v2   E5-2650 v2     8      2             2.6          95       22            Q3'13
Xeon E5 v3   E5-2630 v3     8      2             2.4          85       22            Q3'14
ARM64 A57    APM Helix 1 *  8      1             2.4          42       42            Q3'14 samples, 2015 prod.

* used in CERN tests

When comparing the basic parameters, a couple of things stand out:

  • The two-generation process-technology advantage enjoyed by Intel's latest parts over the APM device, which is built at TSMC's foundry
  • The APM device's remarkably high power usage compared to mobile ARM processors, which typically run at 2.5 W (phone) to 5 W (tablet). Indeed, Intel now has a gen-3 E5 part that comes within 10 watts (23%) of the APM device.

A mere 2-to-1 power advantage over Intel's latest v3 parts leaves the APM SoC severely lacking in performance efficiency, as we'll see in the following benchmarks. Of course, APM and other ARM builders could reduce power by moving to a more advanced process node, although it's unclear whether TSMC's 20 nm process used by Apple would work for APM's design. For now, it's just not competitive.

Extrapolating the CERN team's test results into a comparison against Intel's current-generation part is necessarily imprecise, but one way is to look at the relative performance differences between roughly equivalent Xeon products from the three E5 generations. SPEC benchmarks are widely available for all three, and for overall system performance the SPECint rate is the best measure. The following table shows data for each generation:

Generation   CPU model    SPECint2006_rate  Relative perf.
Xeon E5      E5-2650      538               1.0
Xeon E5 v2   E5-2650 v2   683               1.270
Xeon E5 v3   E5-2630 v3   686               1.275

CERN CMS Benchmarks

We can apply the relative performance measure to compare the APM device to an E5 v3 Xeon. CERN's data found the first-generation E5 has about 2.5 times the overall performance of the ARM SoC on a mix of tests (called CMS) that are representative of the workloads for particle physics data analysis. This means a current-generation Xeon of roughly the same specs as the part used in CERN's analysis would have over 3 times the raw performance of the ARM device.
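This extrapolation is simple ratio arithmetic; a quick sketch, using the SPECint rates from the table above and CERN's measured 2.5x CMS ratio, confirms the figure:

```python
# SPECint2006_rate results for roughly equivalent 8-core Xeons (see table above)
spec_rate = {"E5-2650": 538, "E5-2650 v2": 683, "E5-2630 v3": 686}

# Relative performance of the Haswell part vs. the Sandy Bridge part
v3_vs_v1 = spec_rate["E5-2630 v3"] / spec_rate["E5-2650"]  # ~1.275

# CERN measured the Sandy Bridge Xeon at ~2.5x the X-Gene on the CMS mix
xeon_v1_vs_arm = 2.5

# Estimated advantage of a current E5 v3 Xeon over the ARM SoC
xeon_v3_vs_arm = xeon_v1_vs_arm * v3_vs_v1
print(f"{xeon_v3_vs_arm:.1f}x")  # prints "3.2x", i.e. "over 3 times"
```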

Things tighten up when looking at compute efficiency, i.e. performance per watt. Here the APM system came within 10% of the Sandy Bridge Xeon; however, since the Haswell v3 is both faster and more efficient, the gap widens considerably, with the ARM device delivering only 65% of the Xeon's compute efficiency. The chart below summarizes the data.

Intel vs. ARM performance comparison
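The efficiency gap can be approximated from the TDP and SPECint numbers already given. This is a rough sketch only: TDP is an imperfect proxy for measured power draw, and 0.90 stands in for CERN's "within 10%" efficiency result:

```python
# TDPs in watts from the spec table; SPECint rates as before
tdp_v1, tdp_v3 = 95, 85
perf_v3_vs_v1 = 686 / 538  # Haswell vs. Sandy Bridge, ~1.275

# CERN: ARM perf-per-watt came within ~10% of the Sandy Bridge Xeon's
arm_eff_vs_v1 = 0.90

# Haswell is both faster and lower-TDP, so its perf/watt improves
v3_eff_vs_v1 = perf_v3_vs_v1 * (tdp_v1 / tdp_v3)  # ~1.43

# ARM efficiency relative to the Haswell part
arm_eff_vs_v3 = arm_eff_vs_v1 / v3_eff_vs_v1
print(f"{arm_eff_vs_v3:.0%}")  # ~63%, in line with the ~65% figure cited above
```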

Intel Still Rules the Cloud

The first 64-bit ARM server processors illustrate the nearly impossible task of competing with Intel in the data center, a lesson AMD learned the hard way. Intel's tick-tock product development strategy means the steady delivery of improved performance, derived either from new, denser fabrication processes or from major circuit design changes. Although one can't judge the merits of ARM's 64-bit platform on a single implementation, the first results show no benefit for cloud infrastructure, where Haswell-generation Xeons deliver superior absolute and energy-adjusted performance while preserving the time-tested x86 instruction set.

Cloud builders seeking maximum density should stick with hyperscale 2U quad-node x86 systems like those used for VMware's EVO:RAIL, since it's unlikely ARM systems will match their performance, density and flexibility any time soon.


5 thoughts on “Intel Still Rules the Cloud: Haswell Puts the Squeeze on ARM’s Server Plans”

  1. Pedro

    I think there are two important concerns missing here.
    1) The x86 product most closely matching the ARM part here is the Atom-based Avoton C2730, and it is missing from CERN’s analysis.
    2) The CERN people have a serious case of “confirmation bias”. Their agenda is obvious from their preamble, which I’ll summarise as:
    “We just got paid to build an ARM server box and we want our paychecks to continue”

  2. ARM64

    Pedro, the Avoton-based solution severely underperforms vs. the X-Gene 1 ARMv8 64-bit solution from APM, and for this reason no mega data center cloud players are designing it in. They are sticking with Xeon-based solutions until ARM64 platforms are qualified. Also, your so-called “confirmation bias” at CERN is misguided: they were not paid a penny to build ARM server platforms; they paid for all of these developments on their own to investigate alternatives to Intel’s monopoly and identify cost-effective options. Take a look at Moor Research’s white paper on the 32% TCO benefits achieved by HP’s new ProLiant m400 Moonshot microserver platform built on APM’s X-Gene 1 CPU vs. Xeon-based alternatives.

    Marko, there are many inaccuracies in your opinion piece above. I would suggest you contact APM to get a better assessment of their products. A couple of obvious notes:
    - APM’s first product is the X-Gene 1, not “Helix 1”. It is ARMv8 64-bit based, but it uses APM’s proprietary, ISA-compliant implementation of ARMv8, not standard A57 cores from ARM. This is a significant distinction, since the APM version delivers ~30% better SPECint_2006 performance vs. standard ARM cores. The X-Gene 1 has been in production since Q3’14.
    - Your SPECint comparisons to Intel should be done via an open-source, apples-to-apples metric, i.e. GCC SPECint, not the proprietary ICC SPECint numbers you reference, which are optimized for Intel x86 only and biased vs. ARM-based solutions. Show the GCC numbers for the X-Gene 1 and Xeon E3/E5 to get a real representation of SPECint performance.
    - Lastly, you don’t note it in your piece, but it’s referenced in the CERN report: the X-Gene 2 is a 28nm version that provides 15% better SPECint performance and 30% lower TDP vs. the X-Gene 1, is sampling now, and will be in production in 1H’15. Start plugging in the numbers vs. E5 Haswell and you see an even more competitive alternative to Intel. Intel’s process-node advantage will be going away by the time the X-Gene 3 is sampling in 2015 and in production in 2016, when you will be comparing 16nm vs. 14nm process nodes and looking at 3-4x the number of cores vs. today’s ARMv8 64-bit 8-core offerings. The Intel monopoly is strong, but it’s not the end of the discussion. Qualcomm would not have announced its intention to enter this market with its own ARMv8 64-bit solution if it did not think it could be a new billion-dollar market.

    1. Kurt Marko

      My report is based on public data from APM’s website along with the published research paper from CERN. Neither of these sources goes into detail regarding the nature of the ARM cores used in the APM part. As I noted, the X-Gene uses an uncompetitive process generation and performance would likely improve by scaling to smaller geometries; however, the magnitude of improvement is unknowable, since there are too many variables involved in porting a design from one process generation and set of voltage specs to another.

      For the record, APM has reached out to brief me on their upcoming products and I will hopefully have more details to publish soon, but for now I remain convinced that Intel’s performance advantage is significant and not threatened by ARM CPUs. How that might change in another year or two as both parties move to 14nm-scale processes and new chip architectures is anyone’s guess, but ARM advocates can be assured that Xeon is a fast-moving target.

    2. Bolak

      To ARM64 – You should have at least consulted Wikipedia (http://en.wikipedia.org/wiki/Intel_Tick-Tock) before making multiple erroneous comments.

      Firstly, as soon as the APM X-Gene 2 goes into production in 1H’2015, Intel will have not one but two next-generation Xeons ready to launch in Q3’2015: the E3-1200 v4 series (Broadwell) and the even more advanced E3-1280 v5 (Skylake). Both will be on 14nm technology. See http://www.digitimes.com/news/a20141027PD205.html for details. So comparing the 28nm X-Gene 2 with Intel’s 22nm Haswell is pointless, since it will have to compete with the 14nm Broadwell (or better).

      Secondly, if and when the 16nm APM X-Gene 3 starts production in 2016, Intel will also release its Cannonlake processor on 10nm, not 14nm as you were thinking for some odd reason. In short, Intel’s Xeon is indeed a very fast-moving target.

      About the use of Intel’s ICC compiler suite for SPECint benchmarking: there is nothing wrong with using a better compiler if one is available on a specific platform. After all, the customer would be the one who benefits from it the most.

Comments are closed.