Say what you like about Prime Minister Rishi Sunak’s government, but down the road the UK will be grateful for one big decision: Isambard-AI.
Sunak’s interest in technology (the Guardian’s sketchwriter frequently describes him, not kindly, as RishGPT) has seen him back a £225 million project that will deliver a genuinely meaningful supercomputer to Bristol, taking the UK from supercomputing laggard to, if not leader, then contender.
Isambard-AI (built on the HPE Cray EX) will pack 5,448 NVIDIA GH200 Grace Hopper Superchips to deliver 21 exaflops of AI performance, which would make it one of the world’s Top 10 supercomputers today.
“Isambard-AI represents a huge leap forward for AI computational power in the UK,” said Professor Simon McIntosh-Smith, director of Bristol’s Isambard National Research Facility. “When in operation later in 2024, it will be one of the most powerful AI systems for open science anywhere.”
(Grace Hopper Superchips combine NVIDIA’s Arm-based Grace CPU with a Hopper-based GPU optimised for power efficiency. Isambard-AI will feature the HPE Slingshot 11 interconnect, and nearly 25 petabytes of storage using the Cray Clusterstor E1000 optimised for AI workflows, said HPE.)
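A quick sanity check on the headline figures: dividing 21 exaflops of “AI performance” (a low-precision, sparse-throughput number, not FP64) across 5,448 superchips gives the per-chip rate below, which is broadly consistent with NVIDIA’s published sparse FP8 Tensor Core ratings for the Hopper GPU. The arithmetic is illustrative only:

```python
# Back-of-the-envelope check on the announced figures.
# "AI performance" here is the vendor's low-precision headline number.
total_ai_flops = 21e18   # 21 exaflops, from the announcement
n_superchips = 5448      # GH200 Grace Hopper Superchips

per_chip_pflops = total_ai_flops / n_superchips / 1e15
print(round(per_chip_pflops, 2))  # ≈ 3.85 petaflops per superchip
```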
In response to queries about the choice of Grace Hopper hardware and whether it was optimal for AI workloads, McIntosh-Smith added on X: “There are some very interesting large memory and multi-GPU features enabled by the EX Grace-Hopper solution which will be incredibly powerful for large language model training, but I’m not sure if those are public yet…”
He spoke as OpenAI Engineering Manager Evan Morikawa this month lifted the veil on some engineering challenges the company had faced around LLM architectures, saying “bottlenecks can arise from everywhere: memory bandwidth, network bandwidth between GPUs, between nodes, and other areas. Furthermore, the location of those bottlenecks will change dramatically based on the model size, architecture, and usage patterns.”
He added: “GPU RAM is actually one of our most valuable commodities.
“It's frequently the bottleneck, not necessarily compute.
“Cache misses have this weird massive nonlinear effect into how much work the GPUs are doing because we suddenly need to start recomputing all this stuff. And this means when scaling ChatGPT, there wasn't some simple CPU utilization metric to look at. We had to look at this KV cache utilization and maximize all the GPU RAM that we had…”
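Why the KV cache eats GPU RAM so quickly is easy to see with some rough arithmetic: every generated token stores a key and value vector per attention layer. The sketch below uses hypothetical model dimensions (not OpenAI’s actual configuration) to show the scale involved:

```python
# Rough, illustrative KV-cache sizing for a transformer.
# Per token, the cache holds a key and a value vector for every layer:
# 2 (K and V) * n_layers * n_kv_heads * head_dim * bytes_per_element.
# All model dimensions below are assumptions for illustration only.

def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies (fp16/bf16 by default)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# A hypothetical 70B-class model: 80 layers, 8 KV heads, head_dim 128.
per_token = kv_cache_bytes_per_token(80, 8, 128)
print(per_token)  # 327680 bytes, i.e. ~320 KB per token

# A single 4,096-token context then occupies:
per_request_gb = per_token * 4096 / 1e9
print(round(per_request_gb, 2))  # ~1.34 GB of GPU RAM for one request
```

At that rate a handful of concurrent long-context requests can saturate an 80 GB accelerator before compute is anywhere near the bottleneck, which is the dynamic Morikawa describes.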
See also: A bootleg API, AI’s RAM needs, cat prompts and GPU shortages: Lessons from scaling ChatGPT
A second, less powerful but still impressive, system coming next year to the National Composites Centre (NCC) in the Bristol and Bath Science Park (also home to Isambard-AI) will showcase chipmaker Arm’s energy efficiency for non-accelerated high-performance computing workloads.
Isambard-3 will deliver an estimated 2.7 petaflops of FP64 peak performance and consume less than 270 kilowatts of power, ranking it among the world’s three greenest non-accelerated supercomputers.
The system, part of a research alliance among the universities of Bath, Bristol, Cardiff and Exeter, will sport 384 Arm-based NVIDIA Grace CPU Superchips to power medical and scientific research. Together, the two machines will give researchers access to resources with more than 30 times the capacity of the UK’s current largest public AI computing tools, HMG said this week.
“They will be able to use the machines, which will be running from summer 2024, to analyse advanced AI models to test safety features and drive breakthroughs in drug discovery and clean energy,” it said.
The investment will also connect Isambard-AI to a newly announced Cambridge supercomputer called ‘Dawn’. This computer, delivered through a partnership with Dell and UK SME StackHPC, will be powered by over 1,000 Intel chips that use water-cooling to reduce power consumption. It is set to be running within the next two months and will target breakthroughs in fusion energy, healthcare and climate modelling.
Chaired by Ian Hogarth, the Frontier AI Taskforce (explored in detail in The Stack here) will have priority access to the connected computing tools to support its work to mitigate the risks posed by the most advanced forms of AI, including national security risks from the development of bioweapons and cyberattacks. The resource will also support the work of the AI Safety Institute as it develops a programme of research looking at the safety of frontier AI models and supports government policy with this analysis.
HPE said Isambard-AI will feature “sophisticated direct liquid-cooling capabilities... to improve energy efficiency and overall carbon footprint impact.” The system will be hosted in a self-cooled, self-contained data centre, and HPE is also collaborating with the University of Bristol, it said, “on a highly energy-efficient heat re-use model, extracting waste heat from the Isambard-AI system to use as renewable energy to heat local buildings.”