MLPerf Results Validate CPUs for Deep Learning Training

I have worked on optimizing and benchmarking computer performance for more than two decades, on platforms ranging from supercomputers and database servers to mobile devices. It is always fun to highlight performance results for the product you are building and compare them with others in the industry. SPEC*, LINPACK*, and TPC* have become familiar names to many of us. Now, MLPerf* is filling the benchmarking void for machine learning.

I am excited to see the Intel® Xeon® Scalable processor MLPerf results submitted by our team because we work on both the user side and the computer system development side of deep learning. These results show that Intel® Xeon® Scalable processors have surpassed a performance threshold where they can be an effective option for data scientists looking to run multiple workloads on their infrastructure without investing in dedicated hardware.1 2 3

Back in 2015, I had a team working on mobile devices. We had to hire testers to manually play mobile games. It was fun for the testers at first, but it soon became boring for them and costly for us. One tester we hired quit on their first day. Our team adopted deep learning and built a robot to test mobile games. Our game-testing robot played games automatically and found more bugs than human testers did. We wanted to train the neural networks on the machines we already had in the lab, but they were not fast enough. I had to allocate budget for the team to buy a GPU, an older version than the MLPerf reference GPU.4

Today, CPUs are capable of deep learning training as well as inference. Our MLPerf Intel® Xeon® Scalable processor results compare well with the MLPerf reference GPU4 on a variety of MLPerf deep learning training workloads.1 2 3 For example, the single-system, two-socket Intel® Xeon® Scalable processor results submitted by Intel achieved a score of 0.85 on the MLPerf Image Classification benchmark (Resnet-50)1; 1.6 on the Recommendation benchmark (Neural Collaborative Filtering, NCF)2; and 6.3 on the Reinforcement Learning benchmark (mini GO).3 In all these scores, 1.0 is defined as the score of the reference implementation on the reference GPU.4 All the preceding results use FP32, the numerical precision commonly used in today's market. From these MLPerf results, we can see that our game-testing robot could easily train on Intel® Xeon® Scalable processors today.
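
To make these relative scores concrete, here is a minimal sketch of how such a score can be read, assuming a score is simply the reference implementation's time-to-train divided by the submission's time-to-train (so 1.0 means parity with the reference GPU). The function name and the minute values below are hypothetical, chosen only to illustrate a 0.85 ratio.

```python
# Minimal sketch of interpreting a relative MLPerf-style score, assuming
# score = reference time-to-train / submission time-to-train, so that the
# reference implementation on the reference GPU scores exactly 1.0.
# relative_score() and the minute values below are hypothetical.

def relative_score(reference_minutes: float, submission_minutes: float) -> float:
    """Speedup of a submission relative to the reference run."""
    return reference_minutes / submission_minutes

# Example: a submission that needs 600 minutes where the reference needs
# 510 minutes would report 510 / 600 = 0.85.
print(relative_score(510.0, 600.0))  # 0.85
```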

The deep learning and machine learning world continues to evolve: from image processing using convolutional neural networks (CNNs) and natural language processing using recurrent neural networks (RNNs), to recommendation systems using MLP layers and general matrix multiply, to reinforcement learning (mixing CNNs and simulation), and to hybrid models that mix deep learning and classical machine learning. A general-purpose CPU is very adaptable to this dynamically changing environment, in addition to running existing non-DL workloads.

Enterprises have adopted CPUs for deep learning training. For example, today Datatonic* published a blog showing up to 11x cost savings and a 57 percent performance improvement when running a neural network recommender system, used in production by a top-5 UK retailer, on a Google Cloud* VM powered by Intel® Xeon® Scalable processors.5 CPUs can also accommodate the large-memory models required in many domains. The pharmaceutical company Novartis used Intel® Xeon® Scalable processors to accelerate training of a multiscale convolutional neural network (M-CNN) on 10,000 high-content cellular microscopy images, which are much larger than typical ImageNet* images, reducing time to train from 11 hours to 31 minutes.6

High performance computing (HPC) customers use Intel® Xeon® processors for distributed training, as showcased at Supercomputing 2018. For instance, GENCI/CINES/INRIA trained a plant classification model for 300K species on a 1.5TByte dataset of 12 million images using 128 2S Intel® Xeon® processor-based systems.7 DELL EMC* and SURFSara used Intel® Xeon® processors to reduce training time to 11 minutes for a DenseNet-121 model.8 CERN* showcased distributed training using 128 nodes of the TACC Stampede 2 cluster (Intel® Xeon® Platinum 8160 processor, Intel® OPA) with a 3D Generative Adversarial Network (3D GAN) achieving 94% scaling efficiency.9 Additional examples can be found at https://software.intel.com/en-us/articles/intel-processors-for-deep-learning-training.
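
The distributed runs in these examples follow the familiar data-parallel pattern: one worker process per socket or node, launched with MPI, with gradients averaged across workers at each step. Below is a minimal sketch of that pattern using Horovod with TensorFlow 1.x (the stack cited in the configuration details later in this page); the tiny dense-layer model, the random data, and all hyperparameters are placeholders, not the actual training scripts used in these studies.

```python
# A minimal sketch (not the exact scripts from the studies above) of
# data-parallel CPU training with Horovod over MPI.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one MPI rank per socket/node, launched e.g. via mpirun

# Placeholder model: a single dense layer on random data.
features = tf.random_normal([32, 224 * 224 * 3])
labels = tf.random_uniform([32], maxval=1000, dtype=tf.int32)
logits = tf.layers.dense(features, 1000)
loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients are averaged across ranks with an MPI allreduce.
opt = tf.train.MomentumOptimizer(learning_rate=0.1 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss)

hooks = [hvd.BroadcastGlobalVariablesHook(0)]  # sync initial weights from rank 0
with tf.train.MonitoredTrainingSession(hooks=hooks) as sess:
    for _ in range(10):
        sess.run(train_op)
```

A script like this would typically be launched with an mpirun command along the lines of those shown in the configuration details below (for example, two ranks per node across the cluster).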

CPU hardware and software performance for deep learning has increased by a few orders of magnitude in the past few years. Training that used to take days or even weeks can now be done in hours or even minutes. This level of performance improvement was achieved through a combination of hardware and software. For example, current-generation Intel® Xeon® Scalable processors added the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instruction set (longer vector extensions), which allows a large number of operations to be done in parallel, along with a larger number of cores, essentially becoming a mini-supercomputer. The next-generation Intel® Xeon® Scalable processor adds Intel® Deep Learning Boost (Intel® DL Boost): higher-throughput, lower-numerical-precision instructions to boost deep learning inference. On the software side, the performance difference between the baseline open source deep learning software and the Intel-optimized software can be up to 275X10 on the same Intel® Xeon® Scalable processor (as illustrated in a demo I showed at the Intel Architecture Day forum yesterday).
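
For readers who want to check whether their own machine exposes these instruction sets, here is a small illustrative sketch (Linux only) that reads the CPU feature flags from /proc/cpuinfo; avx512f is the AVX-512 foundation flag and avx512_vnni is the flag the kernel reports for the Intel DL Boost (VNNI) instructions. This is a convenience check, not an official detection API.

```python
# Illustrative Linux-only check for AVX-512 and DL Boost (VNNI) support by
# parsing the "flags" line of /proc/cpuinfo.
def cpu_flags(path="/proc/cpuinfo"):
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                # "flags : fpu vme ... avx512f ..." -> set of flag names
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
print("AVX-512 available:", "avx512f" in flags)
print("DL Boost (VNNI) available:", "avx512_vnni" in flags)
```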

Over the past few years, Intel has worked with DL framework developers to optimize many popular open source frameworks, such as TensorFlow*, Caffe*, MXNet*, PyTorch*/Caffe2*, PaddlePaddle*, and Chainer*, for Intel® processors. Intel has also designed a framework, BigDL for SPARK*, and the Intel® Deep Learning Deployment Toolkit (DLDT) for inference. Since the core computation is linear algebra, we have created a new math library specifically for deep learning, the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN), based on many years of experience with the Intel® Math Kernel Library (MKL) for high performance computing (HPC). The integration of Intel MKL-DNN into the frameworks, and the additional optimizations contributed to the frameworks to fully utilize the underlying hardware capabilities, are the key reasons for the huge software performance improvement.
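
In practice, getting the full benefit of the MKL-DNN-enabled frameworks also involves a little threading configuration. The sketch below shows the kind of settings commonly tuned for Intel-optimized TensorFlow 1.x builds (intra-op/inter-op thread pools plus the OpenMP environment variables that also appear in the configuration details below); the specific values are placeholders to be tuned per system, not recommendations.

```python
# A minimal sketch of threading configuration for an Intel-optimized
# (MKL-DNN-enabled) TensorFlow 1.x build; values are placeholders.
import os
import tensorflow as tf

os.environ["OMP_NUM_THREADS"] = "24"   # OpenMP/MKL-DNN worker threads per process
os.environ["KMP_BLOCKTIME"] = "1"      # how long threads spin-wait after finishing work
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"  # pin threads to cores

config = tf.ConfigProto(
    intra_op_parallelism_threads=24,  # threads used inside a single op (e.g. a conv)
    inter_op_parallelism_threads=2,   # ops that may run concurrently
)
with tf.Session(config=config) as sess:
    # ... build and run the training graph as usual ...
    pass
```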

I've often been asked whether CPUs are faster or slower than accelerators. Of course, accelerators have certain advantages. For a specific domain, if an accelerator is not generally faster than a CPU, then it is not much of an accelerator. Even so, given the increasing variety of deep learning workloads, in some cases a CPU may be as fast or faster while retaining the flexibility that is core to the CPU value proposition. Thus, the more pertinent question is whether CPUs can run deep learning well enough to be an effective option for customers who don't wish to invest in accelerators. These initial MLPerf results,1 2 3 as well as our customer examples, show that CPUs can indeed be used effectively for training. Intel's strategy is to offer both general purpose CPUs and accelerators to meet the machine learning needs of a wide range of customers.

Looking forward, we are continuing to add new AI and deep learning features to our future generations of CPUs, like Intel® Deep Learning Boost (Intel® DL Boost), plus bfloat16 for training, as well as additional software optimizations. Please stay tuned. For more information on Intel® software optimizations, see ai.intel.com/framework-optimizations. For more information on Intel® Xeon® Scalable processors, see intel.com.tw/xeonscalable.

Product and Performance Information

1

Score of 0.85 on the MLPerf Image Classification benchmark (Resnet-50), i.e., 0.85x the MLPerf baseline(+), using a 2-socket Intel® Xeon® Platinum 8180 processor system. MLPerf v0.5 training, Closed Division; system uses Intel® Optimization for Caffe* 1.1.2a with the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) v0.16 library. Retrieved from www.mlperf.org on December 12, 2018, entry 0.5.6.1. The MLPerf name and logo are trademarks. See www.mlperf.org for more information.

2

Score of 1.6 on the Recommendation benchmark (Neural Collaborative Filtering, NCF), i.e., 1.6x the MLPerf baseline(+), using a 2-socket Intel® Xeon® Platinum 8180 processor system. MLPerf v0.5 training, Closed Division; system uses the BigDL 0.7.0 framework. Retrieved from www.mlperf.org on December 12, 2018, entry 0.5.9.6. The MLPerf name and logo are trademarks. See www.mlperf.org for more information.

3

Score of 6.3 on the Reinforcement Learning benchmark (mini GO), i.e., 6.3x the MLPerf baseline(+), using a 2-socket Intel® Xeon® Platinum 8180 processor system. MLPerf v0.5 training, Closed Division; system uses TensorFlow 1.10.1 with the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) v0.14 library. Retrieved from www.mlperf.org on December 12, 2018, entry 0.5.10.7. The MLPerf name and logo are trademarks. See www.mlperf.org for more information.

(+) MLPerf baseline (from the MLPerf v0.5 community press briefing): MLPerf Training v0.5 is a benchmark suite for measuring how fast ML systems can train models. Each MLPerf Training benchmark is defined by a dataset and a quality target. MLPerf Training also provides a reference implementation for each benchmark that uses a specific model. The following table summarizes the seven benchmarks in version v0.5 of the suite.

Benchmark | Dataset | Quality Target | Reference Implementation Model
Image classification | ImageNet | 74.90% classification | Resnet-50 v1.5
Object detection (lightweight) | COCO 2017 | 21.2% mAP | SSD (Resnet-34 backbone)
Object detection (heavyweight) | COCO 2017 | 0.377 Box min AP, 0.339 Mask min AP | Mask R-CNN
Translation (recurrent) | WMT English-German | 21.8 BLEU | Neural Machine Translation
Translation (non-recurrent) | WMT English-German | 25.0 BLEU | Transformer
Recommendation | MovieLens-20M | 0.635 HR@10 | Neural Collaborative Filtering
Reinforcement learning | Professional games | 40.00% move prediction | Mini Go

MLPerf training rules: https://github.com/mlperf/training_policies/blob/master/training_rules.adoc

4

MLPerf* reference system: Google Cloud Platform configuration: 16 vCPUs, Intel Skylake or later, 60 GB RAM (n1-standard-16), 1 NVIDIA* Tesla* P100 GPU, CUDA* 9.1 (9.0 for TensorFlow*), nvidia-docker2, Ubuntu* 16.04 LTS, Preemptibility: off, Automatic restart: off, 30 GB boot disk + 1 SSD persistent disk of 500 GB, docker* image: 9.1-cudnn7-runtime-ubuntu16.04 (9.0-cudnn7-devel-ubuntu16.04 for TensorFlow*).

6

Novartis: measured as of May 25, 2018. Based on the speedup of 8 nodes relative to a single node. Node configuration: CPU: Intel® Xeon® Gold 6148 processor @ 2.4GHz, 192 GB memory, Hyper-Threading: enabled. NIC: Intel® Omni-Path Host Fabric Interface (Intel® OP HFI), TensorFlow: v1.7.0, Horovod: 0.12.1, OpenMPI: 3.0.0. OS: CentOS* 7.3, OpenMPU 23.0.0, Python 2.7.5. Time to train for the model to converge to 99% accuracy. Source: https://newsroom.intel.com/news/using-deep-neural-network-acceleration-image-analysis-drug-discovery

7

GENCI: Occigen: 3,306 nodes x 2 Intel® Xeon® processors (12-14 cores). Compute node: 2-socket Intel® Xeon® processors with 12 cores each @ 2.70GHz, for a total of 24 cores per node, 2 threads per core, 96 GB DDR4, Mellanox InfiniBand Fabric Interface, dual-rail. Software: Intel® MPI Library 2017 Update 4; Intel® MPI Library 2019 Technical Preview OFI 1.5.0 PSM2 with Multi-EP, 10 Gbit Ethernet, 200 GB local SSD, Red Hat* Enterprise Linux 6.7. Caffe*: Intel® Optimization for Caffe*: https://github.com/intel/caffe Intel® Machine Learning Scaling Library (Intel® MLSL): https://github.com/intel/MLSL Dataset: Pl@ntNet: CINES/GENCI internal dataset. Performance results are based on testing as of October 15, 2018.

8

Intel, Dell, and SURFsara collaboration: measured as of May 17, 2018 on 256 nodes of 2-socket Intel® Xeon® Gold 6148 processors. Compute node: 2-socket Intel® Xeon® Gold 6148F processor with 20 cores each @ 2.40GHz, for a total of 40 cores per node, 2 threads per core, L1d cache 32K; L1i cache 32K; L2 cache 1024K; L3 cache 33792K, 96 GB DDR4, Intel® Omni-Path Host Fabric Interface (Intel® OP HFI), dual-rail. Software: Intel® MPI Library 2017 Update 4; Intel® MPI Library 2019 Technical Preview OFI 1.5.0 PSM2 with Multi-EP, 10 Gbit Ethernet, 200 GB local SSD, Red Hat* Enterprise Linux 6.7. TensorFlow* 1.6: built and installed from source: https://www.tensorflow.org/install/install_sources ResNet-50 model: topology specification from https://github.com/tensorflow/tpu/tree/master/models/official/resnet. DenseNet-121 model: topology specification from https://github.com/liuzhuang13/DenseNet. Convergence and performance model: https://surfdrive.surf.nl/files/index.php/s/xrEFLPvo7IDRARs. Dataset: ImageNet2012-1K: http://www.image-net.org/challenges/LSVRC/2012/. ChexNet*: https://stanfordmlgroup.github.io/projects/chexnet/. Performance measured with: OMP_NUM_THREADS=24 HOROVOD_FUSION_THRESHOLD=134217728 export I_MPI_FABRICS=tmi, export I_MPI_TMI_PROVIDER=psm2 \ mpirun -np 512 -ppn 2 python resnet_main.py --train_batch_size 8192 --train_steps 14075 --num_intra_threads 24 --num_inter_threads 2 --mkl=True --data_dir=/scratch/04611/valeriuc/tf-1.6/tpu_rec/train --model_dir model_batch_8k_90ep --use_tpu=False --kmp_blocktime 1. https://ai.intel.com/diagnosing-lung-disease-using-deep-learning/

9

CERN: measured on Stampede2/TACC on May 17, 2018: https://portal.tacc.utexas.edu/user-guides/stampede2. Compute node: 2-socket Intel® Xeon® Platinum 8160 processor with 24 cores each @ 2.10GHz, for a total of 48 cores per node, 2 threads per core, L1d cache 32K; L1i cache 32K; L2 cache 1024K; L3 cache 33792K, 96 GB DDR4, Intel® Omni-Path Host Fabric Interface (Intel® OP HFI), dual-rail. Software: Intel® MPI Library 2017 Update 4; Intel® MPI Library 2019 Technical Preview OFI 1.5.0 PSM2 with Multi-EP, 10 Gbit Ethernet, 200 GB local SSD, Red Hat* Enterprise Linux 6.7. TensorFlow* 1.6: built and installed from source: https://www.tensorflow.org/install/install_sources Model: CERN* 3D GANs from https://github.com/sara-nl/3Dgan/tree/tf Dataset: CERN* 3D GANs from https://github.com/sara-nl/3Dgan/tree/tf Measured on 256 nodes using: OMP_NUM_THREADS=24 HOROVOD_FUSION_THRESHOLD=134217728 export I_MPI_FABRICS=tmi, export I_MPI_TMI_PROVIDER=psm2 \ mpirun -np 512 -ppn 2 python resnet_main.py --train_batch_size 8 \ --num_intra_threads 24 --num_inter_threads 2 --mkl=True \ --data_dir=/path/to/gans_script.py --kmp_blocktime 1. https://www.rdmag.com/article/2018/11/imagining-unthinkable-simulations-without-classical-monte-carlo

10

275x improvement in inference throughput with Intel® Optimization for Caffe* compared to BVLC-Caffe*: measured by Intel as of December 11, 2018. 2S Intel® Xeon® Platinum 8180 processor CPU @ 2.50GHz (28 cores), Hyper-Threading on, Turbo on, 192 GB total memory (12 slots * 16 GB, Micron 2666MHz), Intel® SSD SSDSC2KF5, Ubuntu 16.04, kernel 4.15.0-42.generic, BIOS: SE5C620.86B.00.01.0009.101920170742 (microcode: 0x0200004d). Topology: Resnet-50. Baseline: FP32, BVLC Caffe* (https://github.com/BVLC/caffe.git) commit 99bd99795dcdf0b1d3086a8d67ab1782a8a08383. Current performance: INT8, Intel® Optimizations for Caffe* (https://github.com/Intel/caffe.git), Caffe* commit: e94b3ff41012668ac77afea7eda89f07fa360adf, MKLDNN commit: 4e333787e0d66a1dca1218e99a891d493dbc8ef1.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information, visit www.intel.com.tw/benchmarks

Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Performance results may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure.

Intel, the Intel logo, Xeon Scalable processors, and Deep Learning Boost are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
© Intel Corporation.