Model for success
The university’s first step towards making ConFlux a reality was choosing a vendor for its HPC infrastructure.
Duraisamy says: “IBM was an ideal choice for both servers and storage. In our benchmarks, the IBM Power System S822LC performed better than the competition. Its low-latency architecture, IBM POWER8 processors, and integrated NVIDIA Tesla GPUs offered better performance than equivalent x86-based systems. To store and manage our big data, we needed a high-performance storage solution that we could seamlessly scale out as data volumes grew, and IBM Elastic Storage Server (ESS)—based on IBM Spectrum Scale—was the perfect fit.”
In addition to offering eight hardware threads per core and more on-chip cache than x86 processors, IBM® POWER8® features IBM’s Coherent Accelerator Processor Interface (CAPI) technology. CAPI provides a high-performance solution for customizable, computation-intensive algorithms. Where a traditional accelerator can access shared memory only via the main processor’s I/O subsystem, CAPI connects acceleration engines directly to the coherent fabric of the POWER8 chip, removing complexity, cutting latency, and resulting in significant overall performance gains for many applications.
For GPU-accelerated workloads, IBM Power System S822LC for HPC servers unlock new performance from massively parallel, tightly connected NVIDIA Tesla P100 with NVLink GPUs. The Phase 2 POWER8 with NVLink nodes in ConFlux feature a wider data path between CPU and GPU, delivering 2.5 times the CPU-to-GPU bandwidth of alternative platforms for superior application performance.
By incorporating Mellanox InfiniBand fabric, the IBM HPC platform delivers high bandwidth and low latency when jobs span the cluster. This means that the IBM POWER8 processors spend more time doing useful work and less time waiting for data. With physics simulations that can last days, or even weeks, the cumulative impact of eliminating processor idling represents a huge performance advantage.
IBM ESS combines the parallel file system of the IBM Spectrum Scale™ (formerly GPFS) solution with IBM POWER8 servers and dual-ported storage enclosures. IBM Spectrum Scale facilitates system throughput growth while still provisioning a single namespace, eliminating data silos, simplifying storage management, and delivering peak performance. By consolidating storage requirements across an organization, IBM ESS helps reduce inefficiency, lower acquisition costs, and support demanding workloads.
Duraisamy adds: “With the Power Systems approach, IBM is bringing together leading technologies from the OpenPOWER foundation—NVIDIA GPUs, Mellanox InfiniBand networking, and POWER8 processors—in a single HPC platform.”
Todd Raeker, Research Technology Consultant for the University of Michigan, comments: “The workload on ConFlux will be hugely data-intensive. Any given simulation will generate 10 to 20 terabytes of data, and there could be dozens of simulations running at one time. With this in mind, we expect to see surges in storage requirements as the datasets grow—which is why we chose IBM Spectrum Scale as our software-defined storage solution. With Spectrum Scale, we can seamlessly increase performance and capacity by simply adding nodes.”
Raeker expands on the advantages of choosing one vendor for both computing and storage: “One of our objectives was to build a relationship with a single vendor, so rather than having separate vendors for software and hardware, we wanted IBM to be responsible for the whole stack. We knew that with everyone on the same page, we would be able to resolve potential issues much more quickly, avoiding downtime.”
The university will be running a variety of big-data-driven workflows, including some on Apache Hadoop. With IBM Spectrum Scale, U-M can effectively layer a Hadoop Distributed File System (HDFS)-style environment over its Spectrum Scale environment, greatly increasing flexibility.
Raeker says: “We can work with data in HDFS and Spectrum Scale simultaneously, which significantly simplifies scientists’ workflow; they can focus on their experiments instead of worrying about whether the infrastructure is in the correct state to support them.”
With an environment optimized for rapidly moving large volumes of data into and out of compute nodes, U-M can ensure that results from data queries are delivered much faster than before. Crucially, machine learning algorithms will then pull fresh insights from these results and use them to adapt predictive models in real time, continuously improving the accuracy of simulations even as they are in progress.
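The loop described above, in which a running simulation feeds fresh results to a learning algorithm that adapts the predictive model on the fly, can be sketched in a few lines. This is a minimal, hypothetical illustration only: the stand-in "simulation" function, the one-parameter model, and the learning rate are assumptions for clarity, not U-M's actual pipeline.

```python
import random

def run_simulation(x):
    """Stand-in for an expensive physics simulation step.

    In a ConFlux-style workflow, each sample would come from an HPC
    solver run; here we use a noisy linear response y = 3x + noise.
    """
    return 3.0 * x + random.gauss(0.0, 0.1)

class OnlineLinearModel:
    """One-parameter surrogate model updated by stochastic gradient descent."""

    def __init__(self, lr=0.05):
        self.w = 0.0   # model coefficient, refined as data streams in
        self.lr = lr   # learning rate (assumed value)

    def update(self, x, y):
        # Gradient of the squared error (w*x - y)^2 with respect to w.
        grad = 2.0 * (self.w * x - y) * x
        self.w -= self.lr * grad

    def predict(self, x):
        return self.w * x

random.seed(0)
model = OnlineLinearModel()
for step in range(500):
    x = random.uniform(0.5, 1.5)
    y = run_simulation(x)   # fresh result from the running simulation
    model.update(x, y)      # adapt the predictive model immediately

# The learned coefficient converges toward the true value of 3.0.
print(round(model.w, 1))
```

The key design point is that the model is updated per sample rather than retrained in batch, so its predictions improve while the simulation campaign is still in progress.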
For the first phase of the ConFlux deployment, U-M installed IBM Power System S822LC servers with POWER8 processors alongside an IBM Elastic Storage Server; in the second phase, it deployed additional compute nodes with newer POWER8 with NVLink processors and NVIDIA Tesla P100 with NVLink GPUs.
Duraisamy comments: “IBM explained to us that we could get an S822LC with POWER8 straightaway, and then move up to the next generation of processors when they become available. It was great knowing that we could invest immediately and also take advantage of future performance enhancements.”
With ConFlux in place, the university is set to advance the threshold of scientific knowledge in a wide range of fields. Already, the HPC environment at U-M is being used for several projects, including studies of cardiovascular disease, turbulence, and dark matter.
For example, arterial stiffness is a strong indicator of certain cardiovascular diseases, such as hypertension. By combining noninvasive imaging techniques, such as MRI and CT scans, with a model of blood flow produced by ConFlux, scientists at the university hope to be able to accurately estimate arterial stiffness in less than an hour, helping doctors deliver the appropriate treatment to patients more quickly than ever.
Even though this research is in its early stages, scientists in the field are already seeing the benefits of the ConFlux HPC environment. One academic noted in a recent journal article that this innovative approach could save significant time in the initial assessment of cardiovascular diseases, and could improve outcomes by helping establish the best surgical approach in each case.
When it comes to designing efficient aircraft and rocket engines, understanding turbulence is key, yet the patterns that air produces when it breaks up are exceptionally difficult to predict. U-M is working with leading engineers to improve the accuracy of turbulence simulations using ConFlux, enabling better testing and accelerating the development of superior aircraft designs.
The university also intends to use ConFlux to better understand our universe. By feeding data from large galaxy-mapping studies like The Dark Energy Survey into simulations of galaxy formation, U-M hopes to gain a better understanding of the role that dark matter plays in the continuing expansion of the universe—potentially driving a revolution in the understanding of core scientific concepts such as gravity.
Raeker adds: “With IBM hardware boosting our HPC environment, we can offer scientists the tools to conduct research that could revolutionize entire industries. Because ConFlux is at the bleeding edge of supercomputing technology, we are continually learning how to optimize our architecture. IBM hardware provides us with the flexibility to adapt to a variety of experimental scenarios so we can always create the right conditions for innovation.”
Looking further to the future, U-M has plans to share the new platform with research groups at other universities, and is looking to integrate material learned from ConFlux into its degree programs.
Duraisamy concludes: “ConFlux allows us to bring together large-scale scientific computing and machine learning to accomplish research that was previously impractical, given time and resource constraints. This capability will close a gap in the US research computing infrastructure and place U-M at the forefront of the emerging field of big-data-driven physics.”