Barcelona Supercomputing Center Accelerates HPC

Intelligent allocation of data to Intel® Optane™ persistent memory stores more data close to CPU with fewer DIMMs.

Executive Summary
Antonio Peña, a senior researcher at Barcelona Supercomputing Center/Centro Nacional de Supercomputación (BSC), is heading development of new software tools and expanding the software ecosystem for 2nd Generation Intel® Xeon® Scalable processors and Intel® Optane™ persistent memory (Intel® Optane™ PMem). His work, under the Intel-BSC Exascale Lab and in collaboration with the EPEEC project (European joint Effort toward a Highly Productive Programming Environment for Heterogeneous Exascale Computing), is helping accelerate High Performance Computing (HPC) applications using heterogeneous memory architectures.

Peña is no stranger to HPC. At the U.S. Department of Energy's (DoE) Argonne National Laboratory (ANL), he contributed to projects that included accelerating Message Passing Interface (MPI)-based applications, studying remote virtualized GPUs, and pioneering heterogeneous memory systems in HPC clusters. Under the direction of computer science director Professor Jesús Labarta and in close collaboration with Dr. Harald Servat and Marie-Christine Sawley from Intel, he continues to explore how to accelerate large HPC workloads by leveraging hierarchical memory systems. With Intel Optane PMem and 2nd Gen Intel Xeon Scalable processors, he is driving breakthrough architectures that enable high-performance workloads with large data sets on HPC clusters while using less power than DRAM-only configurations.

"Right now, HPC applications are constrained by the amount of DRAM in the nodes and cluster," Peña explained. "They need more and more memory but adding larger and more DIMMs with the current technology is not feasible due to the power constraints on the overall system."

In today's large HPC clusters, main memory is typically sized between 2 and 3 GB per core to optimize performance. A study in 2016 showed that scores from High-Performance Linpack (HPL) runs tend to plateau around 2 GB of main memory per core in typical HPC systems.1 Indeed, the leading x86-based clusters on the TOP500 list follow this trend. For example, BSC's MareNostrum 4, with 3,456 nodes (165,888 Intel® Xeon® Platinum 8160 processor cores), has 2 GB/core on all but a relatively small number of nodes. Two hundred sixteen of MareNostrum 4's nodes offer large memory capacity with 8 GB/core to accommodate much larger data sets. The same study predicted that as data sets expand and future systems grow to very large clusters (e.g., one million cores), reaching as much as 99 percent of the ideal HPL performance will require per-core memory of 7 to 16 GB. That will have a large impact on server power consumption.

Traditional memory manufacturers recommend budgeting about three watts per 8 GB of DDR3 or DDR4 memory, and more for RDIMMs.2 In a large HPC node with 56 x86 cores and 112 GB of memory (2 GB per x86 core), the memory power budget should be between 42 and 50 watts. Large memory nodes with 8 GB/core could consume 168 watts or more. Looking into the future with massively large systems based on nodes with 112 cores and 16 GB/core, for example, memory power by itself becomes unwieldy, possibly consuming nearly 700 watts in one node.3 Operators of today's HPC systems want to accelerate applications while reducing power demand, not increase it by expanding memory.
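The figures in the paragraph above follow directly from the 3-watts-per-8-GB rule of thumb. A minimal Python sketch of that arithmetic is shown here; the function and constant names are illustrative and not taken from any BSC or Intel tool.

```python
# Rule of thumb quoted in the text: about 3 W per 8 GB of DDR3/DDR4 memory.
WATTS_PER_8GB = 3.0


def dram_power_watts(cores: int, gb_per_core: float) -> float:
    """Estimate DRAM power for one node from core count and memory per core."""
    total_gb = cores * gb_per_core
    return total_gb / 8.0 * WATTS_PER_8GB


# The three node configurations discussed in the text.
for cores, gb_per_core in [(56, 2), (56, 8), (112, 16)]:
    watts = dram_power_watts(cores, gb_per_core)
    print(f"{cores} cores x {gb_per_core} GB/core: {watts:.0f} W")
```

The three cases evaluate to 42, 168, and 672 watts, matching the lower bound of the 42 to 50 watt range, the 168-watt large-memory node, and the "nearly 700 watts" projection.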

Peña's research shows promise for using heterogeneous memories, including large banks of non-volatile memory, to help accelerate applications at lower power.

"We're trying to reduce server power while accelerating applications by using Intel Optane PMem and intelligently managing where the data is located and its movements," Peña said. "We can take advantage of the big memory footprint that the new technology offers and put more data closer to the processor. There is a slightly longer latency than DRAM, but we don't have to pay for the penalty of even more latency going to other storage technologies."

Peña's vision of heterogeneous memory architectures formed while he was at ANL, where he eventually worked with the Intel® Xeon Phi™ processor. He has continued his work at BSC using Intel Optane PMem.

Figure: Simplified view of a sample heterogeneous memory node.4

Intel Optane PMem is a high-density 3D memory technology in a DIMM form factor that delivers a unique combination of large capacity, low power, and support for data persistence on the memory bus. 2nd Generation Intel Xeon Scalable processors support the memory modules.

"Large non-volatile memory as a basis for hierarchical memory solutions is a great candidate," Peña added. "It's byte-addressable, so we can use it for regular load-stores, offers large capacity, and uses less power."

Intel Optane persistent memory can operate in different modes: Memory Mode, App Direct Mode, and Storage over App Direct. In Memory Mode, the CPU manages the memory, with DRAM transparently used as a cache for the most frequently accessed data. In App Direct Mode, it can be used as either volatile or nonvolatile memory, with the application intelligently handling allocation. Peña is using the technology in App Direct Mode, treating the memory modules as large-capacity, manageable system memory with data objects intelligently allocated to optimize the performance of the running application. To optimize usage in this mode, Peña's team develops and runs custom software both to tune where data is placed in the memory structure and to directly manage the persistent memory. The team is collaborating with Intel to help build a new software ecosystem around byte-addressable persistent memory to accelerate HPC applications while enabling more power-efficient supercomputers.

Using Intel Optane PMem, the power reduction is significant. Compared with the recommended 3 watts per 8 GB (375 mW/GB) for standard DDR3 and DDR4 DIMMs, Intel Optane PMem's 128 GB memory modules consume only 117 mW/GB, and the 512 GB modules use just 35 mW/GB, roughly a 10X reduction in power per gigabyte compared to DRAM DIMMs.
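As a quick check of the per-gigabyte comparison, the following Python sketch derives the reduction factor and the implied per-module power from the numbers quoted above. The per-module wattage is computed here from those numbers and is not a figure stated in the source.

```python
# DRAM rule of thumb from the text: 3 W per 8 GB, i.e. 375 mW/GB.
DRAM_MW_PER_GB = 3000 / 8

# Module capacity in GB -> quoted power in mW/GB for Intel Optane PMem.
PMEM_MW_PER_GB = {128: 117, 512: 35}


def reduction_factor(pmem_mw_per_gb: float) -> float:
    """How many times less power per GB the module draws compared to DRAM."""
    return DRAM_MW_PER_GB / pmem_mw_per_gb


for capacity_gb, mw_per_gb in PMEM_MW_PER_GB.items():
    module_watts = capacity_gb * mw_per_gb / 1000  # derived, not quoted
    print(f"{capacity_gb} GB module: {mw_per_gb} mW/GB, "
          f"about {module_watts:.1f} W per module, "
          f"{reduction_factor(mw_per_gb):.1f}x less power per GB than DRAM")
```

Under these inputs the 128 GB module comes out at about 3.2 times less power per gigabyte and the 512 GB module at about 10.7 times less, which is where the 10X figure comes from.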

Innovative Data Profiling and Memory Allocation Tools for Intelligent Data Management
To enable his approach with heterogeneous memories, Peña and his team have created several software tools using, among others, Extrae (a general-purpose profiler developed by BSC), VTune™ Profiler, and Extended Valgrind for Object Differentiated Profiling (EVOP). EVOP was first developed by Peña at ANL and is now maintained at BSC. Their tools first perform what Peña calls data-oriented profiling by running the profiling tools while the application executes normally. The tools analyze the demand and latencies for different objects and create a large file listing all data accesses.

"Knowing how each data object is accessed during execution helps us decide in the optimization step where those have to be allocated in the different memories," Peña described. "In a simplified view, we associate metrics with the different data objects. Then we count the number of accesses or the number of last level cache misses for each object. From this, we can apply different algorithms for memory allocations to maximize the performance."

Once the software knows how and when data objects are accessed, the size of each data object, and the size of each memory tier, this information is fed into variations of the multiple knapsack problem.5 A knapsack optimization algorithm attempts to fit the objects of greatest total value into a 'container' with limited capacity. In Peña's case, the data objects are the items and the memories in the system are the knapsacks. The solution provides guidance on which memory in the hierarchy each data object should be allocated to.
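To make the knapsack framing concrete, here is a minimal Python sketch for a simplified two-tier case: a single 0/1 knapsack solved by dynamic programming selects the objects for DRAM, and everything else falls through to persistent memory. This is not the team's algorithm, which uses variations of the multiple knapsack problem; the object names, sizes, miss counts, and DRAM capacity are all invented for illustration.

```python
def knapsack(items, capacity):
    """Solve a 0/1 knapsack by dynamic programming.

    items: list of (name, size, value) tuples with integer sizes.
    Returns the set of item names with the highest total value that fits.
    """
    # best[c] holds (total value, chosen names) for capacity at most c.
    best = [(0, frozenset())] * (capacity + 1)
    for name, size, value in items:
        # Walk capacities downward so each item is used at most once.
        for c in range(capacity, size - 1, -1):
            candidate = best[c - size][0] + value
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - size][1] | {name})
    return best[capacity][1]


# Hypothetical profile: (object name, size in GB, last-level cache misses).
objects = [
    ("matrix_A", 6, 900),
    ("matrix_B", 5, 700),
    ("index_buf", 4, 650),
    ("halo", 1, 100),
    ("checkpoint", 8, 50),
]
DRAM_CAPACITY_GB = 10  # assumed size of the fast tier

in_dram = knapsack(objects, DRAM_CAPACITY_GB)
placement = {name: ("DRAM" if name in in_dram else "PMem")
             for name, _, _ in objects}
for name, tier in placement.items():
    print(f"{name}: {tier}")
```

With these inputs, matrix_A and index_buf fill the 10 GB of DRAM exactly and the remaining objects are placed in PMem. Extending this to more than two tiers turns it into a multiple knapsack problem, which is harder to solve exactly and is typically approached with heuristics.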

"After profiling, a script we call the hmem advisor, for heterogeneous memory advisor, parses the large data object profile and generates a distribution list of different objects to the memories," Pena continued. "Then we can run the application binary without changes and, as regular mallocs are called, we have a runtime library, an interposer, that intercepts these calls to allocate the different data objects to the appropriate memory tier."

For their research, Peña's team uses a system at BSC with 2nd Gen Intel Xeon Scalable processors and 6 TB of Intel Optane PMem to do their development and run their testing. The system is part of the Intel-BSC Exascale Lab, where, since 2011, innovative research and collaborative development projects have been carried out to enable next-generation Exascale supercomputing.

Heterogeneous memory research is ongoing at BSC. Currently, the team's code runs statically: it allocates memory objects based on a profiling run and the optimization steps. The team plans to make it much more dynamic, so that it can accommodate changes at run time and user-specified marking of data.

"Today, the tool focuses on allocating data for optimized performance," Pena explained. "If there are standard DIMMs, ECC DIMMs, and Intel Optane PMem modules in a system, we would allocate the most called data to the standard DIMMs, then less frequently demanded data to ECC, and even less demanded objects to NVRAM. But we envision the tool to not only be dynamic at run time but be able to respond to user marks of data as well. For example, if the user wants to ensure data is protected, we will allocate it to ECC instead of standard DRAM. Or, if it has certain access patterns, such as many writes, we will allocate it to regular memory instead of NVRAM, which has slower write speeds."

Peña's work targets large HPC workloads seen in the world's supercomputing centers, but his team is also running the tools on smaller applications, such as Intel® Distribution for HPCG, LULESH, miniFE, and SNAP. The team is benchmarking its code against Intel Optane PMem in Memory Mode, where the processor itself manages the data.

"We are testing our code on most of the applications from the U.S. DoE, like LAMMPS, OpenFOAM, and NWChem. A key goal of this project is to enable large applications to run with high performance on systems with large NVRAM capacities and smaller amounts of DRAM. We are currently seeing performance improvements in many mini-applications, plus up to 18 percent in OpenFOAM and 10 percent on LAMMPS compared to Memory Mode," Peña concluded.

Peña's approach optimizes performance and power for supercomputers, where larger and larger problems on bigger and bigger machines mean more and more memory.

Solution Summary
Barcelona Supercomputing Center and the Intel-BSC Exascale Lab are at the heart of innovative hierarchical memory research using 2nd Gen Intel Xeon Scalable processors and Intel Optane persistent memory in App Direct Mode. The center hosts a team led by Antonio Peña that develops software tools to intelligently allocate data to multiple memory tiers, including Intel Optane PMem, and so help optimize the performance of large supercomputing applications. According to Peña, their ongoing research on both large and small HPC applications is showing up to a 2x performance speedup compared to Memory Mode in some mini-applications, such as miniFE.

Solution Ingredients

  • Experimental platform for developing hierarchical memory software tools
  • Intel Xeon Platinum 8260L processors
  • 6 TB of Intel Optane persistent memory




3. Intel Xeon Platinum 9282 processor with 56 cores/socket × 2 sockets/node = 112 cores; 112 cores × 16 GB/core ÷ 8 GB × 3 watts = 672 watts.
4. Based on Intel Optane persistent memory brief at