
The Machine at HP Discover – The First Year


A year ago I wrote about HP’s first public disclosures for “The Machine”. At HP Discover last week in Las Vegas, HP Labs showed mechanical mock-ups for an early functional prototype of The Machine. I think their prototype architecture takes a direction that is clearly differentiated from Intel’s Rack Scale Architecture (RSA) and from others’ earlier disaggregation experiments. It is also a good idea for HP to keep the industry up to date on their progress as the split between HP Enterprise and HP Inc. approaches.

HP simplified the mock-ups for clarity in communicating The Machine’s initial architecture, so they show only the major system components. However, I was told the architecture is locked and will be implemented as depicted.

Rack Level Mechanical Prototype

The prototype will be based on HP’s existing Apollo 6000 System chassis, which HP Labs seems determined to call an “enclosure”, so I’ll use enclosure here. It looks like HP Labs has gutted the Apollo 6000 enclosure and is using the bent metal, fans, and maybe the power supply as a handy scaffolding for their new architectural work.

HP showed three enclosures in a rack, with a couple of mechanical board mock-ups housed in one of them, as well as an optical connector and fiber cable coming out the back of the enclosures in place of the currently available Ethernet and InfiniBand options.

As per the Apollo 6000 System specs, 8 of the 5U-tall enclosures fit in a rack and each enclosure seats 10 sleds, so a rack can physically seat up to 80 sleds. Operating 80 sleds in one rack comes with “power permitting” caveats. For example, at 18kW of power delivery per rack and assuming 250W of power consumption for each enclosure’s NIC and switch board, each compute and memory sled needs to fit into about a 200W total power budget. At 24kW of rack power, each sled might receive a 275W power budget.
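For readers who want to check the math, here is a quick back-of-envelope sketch in Python using the same figures (8 enclosures per rack, 10 sleds per enclosure, and my assumed 250W per enclosure switch board):

    # Back-of-envelope per-sled power budget, using the figures discussed above.
    ENCLOSURES_PER_RACK = 8      # Apollo 6000 spec: 8 x 5U enclosures per rack
    SLEDS_PER_ENCLOSURE = 10     # 10 sleds per enclosure
    SWITCH_BOARD_W = 250         # assumed NIC/switch board power per enclosure

    def sled_power_budget(rack_power_w):
        """Watts left for each sled after the enclosure switch boards take their share."""
        switch_overhead = ENCLOSURES_PER_RACK * SWITCH_BOARD_W
        sleds = ENCLOSURES_PER_RACK * SLEDS_PER_ENCLOSURE
        return (rack_power_w - switch_overhead) / sleds

    print(sled_power_budget(18_000))  # -> 200.0 W per sled at 18kW racks
    print(sled_power_budget(24_000))  # -> 275.0 W per sled at 24kW racks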

Enclosure Switch Mechanical Prototype

HP has designed their own optical-to-electrical VCSEL (vertical cavity surface emitting laser) based transceivers as the basis for their optical interconnect between enclosures. Each VCSEL operates at 100Gbps, and the transceivers bundle 24 fibers for 2.4Tbps aggregate cable throughput, or 1.2Tbps in each direction.

As a point of reference, Intel’s first announced MXC connector-based cable uses 24 fibers. HP’s disclosure last week quadruples Intel’s two-year-old 300Gbps bandwidth goal for silicon photonics signaling on 24 fibers, which still seems to stand at 25Gbps per fiber. Silicon photonics, which emits light in the horizontal plane of a silicon chip, is newer and much different than VCSEL technology, which emits light vertically out of the top of a chip. In this case VCSELs may let HP get to market sooner with a faster solution, or at least maintain parity by the time both organizations can prototype their architectures and then move them into production.
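The per-cable numbers reduce to simple multiplication; this little sketch restates them, with the even 12-transmit/12-receive fiber split being my assumption for how the per-direction figures are reached:

    # Per-cable throughput arithmetic for the two 24-fiber approaches discussed above.
    FIBERS_PER_CABLE = 24

    def cable_bandwidth_gbps(gbps_per_fiber, fibers=FIBERS_PER_CABLE):
        aggregate = gbps_per_fiber * fibers   # both directions combined
        per_direction = aggregate / 2         # assumes 12 fibers each way
        return aggregate, per_direction

    print(cable_bandwidth_gbps(100))  # HP VCSEL: (2400, 1200.0) -> 2.4Tbps aggregate, 1.2Tbps per direction
    print(cable_bandwidth_gbps(25))   # Intel silicon photonics goal: (600, 300.0) -> 300Gbps per direction
    print(1200 / 300)                 # -> 4.0, the "quadruples" comparison above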

Within an enclosure, HP has designed its own non-Ethernet copper-based point-to-point connectors capable of supporting 100Gbps or greater speeds (one source indicated “multiple 100s of Gbps”) between the enclosure switch and each of the 10 sleds. The new connectors will use multiple copper lanes; HP is not planning on using single-lane 100Gbps signaling. Also, I’ll note that there are no cables associated with this copper interconnect; it is a direct connection between connectors on the switch module and a connector on each of the sleds.

The decision to go point-to-point made losing the Ethernet protocol stack easy for HP, as much higher signaling efficiency can be achieved with lighter-weight protocols. The downside is that network driver stacks are likely to be significantly different with The Machine. This may be an acceptable compromise, as HP must also rewrite significant parts of their OS and virtualization kernels to enable The Machine’s new memory models. As long as HP is creating a new hardware architecture and redesigning OS kernels, maintaining ties with current network stacks is probably not high on their list, since doing so could limit their architectural innovation.

The two enclosure fabric switch chips are fully redundant. I modeled power consumption for the entire subsystem at 250W, using the HP Moonshot-180G Switch Module (singular, non-redundant) as an approximate match at 200W typical and 280W maximum. There are obviously a lot of unknowns with the new enclosure fabric switches, but the 180G switch has a lot of Ethernet lanes and throughput capability, so it is not an unreasonable proxy.

Memory and Compute Sled Mechanical Prototype

The Machine’s first prototype sleds have a dual personality. They are split between a globally addressable memory subsystem and a single SoC-based compute node.

The first implementation of the sled will use an off-the-shelf ARMv8-based SoC for the compute node; I was told this quite firmly. HP also pointed out that they can and are likely to use other architectures (x86, FPGA, etc.) in subsequent prototypes. I’m budgeting 35-45W for the compute SoC, a reasonable estimate for a moderate core count at moderate clock speeds. With 10 compute nodes per enclosure and 80 compute nodes per rack, the prototype will have more than enough compute power available for HP Labs to start analyzing the impact of large global memory pools on system performance.

The compute node has its own local memory system. HP indicated they would target high-density DRAM: 128GB per slot, and therefore 1TB per node across eight slots. I budgeted 6-7W per slot for high-density DDR4 memory. Compute node memory will probably remain DRAM for at least a few years, as local compute performance still depends on fast memory writes.

The globally addressable memory subsystem consists of 4 banks of 6 DIMMs each. Each bank of memory is controlled by a global system memory fabric interface. For the initial prototype, HP indicates the globally addressable memory will be implemented as DRAM, also at 128GB density. That works out to 3TB of global memory per sled, and therefore 30TB per enclosure and 240TB per rack.
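The capacity roll-up is straightforward arithmetic, sketched here with the DIMM counts and densities described above:

    # Global memory capacity roll-up using the sled layout described above.
    BANKS_PER_SLED = 4
    DIMMS_PER_BANK = 6
    DIMM_GB = 128
    SLEDS_PER_ENCLOSURE = 10
    ENCLOSURES_PER_RACK = 8

    global_per_sled_gb = BANKS_PER_SLED * DIMMS_PER_BANK * DIMM_GB     # 3072 GB = 3 TB
    per_enclosure_tb = global_per_sled_gb * SLEDS_PER_ENCLOSURE / 1024  # 30 TB
    per_rack_tb = per_enclosure_tb * ENCLOSURES_PER_RACK                # 240 TB
    print(global_per_sled_gb, per_enclosure_tb, per_rack_tb)  # -> 3072 30.0 240.0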

HP intends to convert the globally addressable memory pool to non-volatile memory as soon as new non-volatile memory technologies enter the market with near-DRAM read speeds and not-horrible write speeds. I can speculate that new NVDIMM technologies might meet HP’s requirements by the time HP demonstrates a working prototype of this system.

In the meantime, HP intends to simulate non-volatile global system memory using DRAM…by simply not turning the system off. I have to assume HP will write some clever software to stall the system when a sled fails so the sled can be replaced, and then un-stall the system and continue the test. Or perhaps they will punt and restart their analysis in the event of such failures. Either way, this is really not a deal-breaker for an architectural prototype.

Both the globally addressable memory subsystem and the compute node are connected to a sled fabric switch. The sled fabric switch routes data through the enclosure fabric switch, and it can also route data directly between the two subsystems on the sled.

I allocated a few watts each for the sled fabric switch and the global system memory fabric interfaces in my power model.
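Pulling the estimates from the preceding sections together, here is a rough sketch of the kind of rack-level power model described in this post. The SoC, DIMM, and enclosure switch figures are the ones quoted above; the eight local DIMM slots, the per-bank memory fabric interfaces, and the specific “few watts” values are my assumptions:

    # Rough rack-level power model built from the estimates quoted in this post.
    # High/low cases use the ranges given above; fabric-part wattages are assumptions.

    def sled_power_w(soc_w, dimm_w, local_slots=8, global_dimms=24,
                     sled_switch_w=5, mem_fabric_if_w=3, mem_fabric_ifs=4):
        return (soc_w
                + local_slots * dimm_w              # local compute-node DRAM
                + global_dimms * dimm_w             # globally addressable DRAM (4 banks x 6 DIMMs)
                + sled_switch_w                     # sled fabric switch (assumed)
                + mem_fabric_ifs * mem_fabric_if_w) # memory fabric interfaces, one per bank (assumed)

    def rack_power_kw(soc_w, dimm_w, enclosure_switch_w=250,
                      sleds_per_enclosure=10, enclosures=8):
        sleds = sleds_per_enclosure * enclosures
        return (sleds * sled_power_w(soc_w, dimm_w)
                + enclosures * enclosure_switch_w) / 1000

    print(rack_power_kw(soc_w=35, dimm_w=6))  # low case  -> ~21.5 kW
    print(rack_power_kw(soc_w=45, dimm_w=7))  # high case -> ~24.9 kW

The high case lands just shy of 25kW, consistent with the at-least-24kW figure in the observations below.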

Observations

It is a good idea to leverage as much available infrastructure as possible in early prototypes. It helps direct R&D costs to the most differentiated parts of a new architecture and speeds time to prototype when a lab does not have to debug every aspect of a new infrastructure.

I calculate that this initial prototype will require at least 24kW of rack-level power to support the intended memory types and densities. 24kW is not out of line with high-end, state-of-the-art scale-out datacenter analytics architectures today. Rack-level power consumption is likely to fall quite a bit as new non-volatile memories become available for the global system memory pool, but that is beyond the scope of this prototype. This is a proof-of-concept, and HP Labs looks like they are leveraging HP datacenter infrastructure where appropriate.

Memory cost is also not a factor for this prototype. 128GB DDR4 DIMMs will be expensive; they will be at the top end of DIMM production when they ship later this year from SK Hynix and perhaps Samsung. But these densities are required to start modeling system behavior and performance in preparation for much denser, lower-cost non-volatile memory technologies later in the decade.

Although HP has placed a big bet on memristor technology, I’ll note that I made it through writing this post without mentioning memristors until now. HP Labs showed an obligatory memristor wafer (close-ups of individual dies on the wafer are in the preview photo for this post), but they have designed this prototype of The Machine to use any memory that will fit a standard DDR4 DIMM format. As with future choices for the compute node SoC, HP intends to stay non-denominational on memory so they can rapidly develop The Machine’s architecture using whatever memory technologies are available.

I think HP needs to demonstrate forward progress on The Machine as they split into two companies. The announced effective date for the split is November 1, and HP is scheduled to host their next Discover event in London one month later. I hope to see a functional hardware prototype of The Machine near the end of this year, and London would be a terrific venue for it. HP Enterprise should keep that date and venue and continue their Discover event series.

[Thanks to Gina Longoria for sanity-checking my power consumption model!]

 

