Baidu BigSQL: Faster Spark Interactive Queries

To lower TCO and maintain performance, Baidu deployed Intel® Optane™ persistent memory to optimize its ad hoc query service.

Over the past few years, the world’s data volume has grown almost exponentially, so companies, especially tech companies, face greater challenges in meeting service time requirements. Apache Spark, a unified analytics engine for large-scale, high-performance data processing, is designed to meet this challenge. Spark SQL, one module of Apache Spark, is widely used for working with structured data in large data centers. Baidu’s BigSQL data processing platform is based on Spark SQL and adds many features and performance enhancements.

“In order for Baidu BigSQL to provide users with high-performance ad hoc query services, large memory is needed to cache hot data locally on compute nodes to avoid DFS I/O slowing performance down. With Intel Optane persistent memory, we managed to ensure outstanding cache performance, while at the same time greatly improving cluster processing and achieving significant TCO benefits.” (LI Shiyong, Senior System Engineer, Baidu)

One important enhancement pertains to meeting sub-second performance requirements for interactive queries. This is where Intel and Baidu collaborated to create the Optimized Analytics Package (OAP) for Spark Platform project. OAP leverages the columnar data format and user-defined indexes built over selected columns to improve data scanning efficiency. It also adopts a fine-grained in-memory data caching strategy to remove disk and network I/O bottlenecks, bringing query response times down to the sub-second range.

As Baidu’s business expands, the scale of hot data grows rapidly. Memory scaling is needed to deliver the same level of performance that users demand. However, the high cost of Dynamic Random-Access Memory (DRAM) adds increasing pressure to the Total Cost of Ownership (TCO). To lower TCO while ensuring satisfactory performance, Baidu and Intel collaborated and introduced Intel® Optane™ persistent memory (PMem) as a more cost-efficient solution to replace DRAM.

Baidu’s internal testing has demonstrated that Intel Optane PMem improves OAP cache performance and performance per dollar compared with solutions without PMem. This has had a direct business impact: PMem helped optimize Baidu’s ad hoc query service, Tuling, by offloading part of its workload and reducing average query latency.

Baidu BigSQL with OAP
One fundamental characteristic of Spark SQL is that it is designed to deliver optimized performance for batch processing. However, some of Baidu’s service queries, known as interactive queries, have totally different characteristics. They usually query a large dataset with specific filtering conditions in order to identify a relatively small amount of data. Users expect this small result to be returned in seconds or even sub-seconds rather than the minutes or hours typical of batch processing, which the current Spark SQL implementation usually cannot deliver.

To solve this problem, Baidu and Intel collaborated and implemented OAP, which uses index and caching techniques to accelerate interactive query response. By integrating OAP, Baidu BigSQL successfully achieved the desired level of interactive query performance.

Figure 1. Baidu BigSQL and OAP Integration.

When a query has specific filtering conditions, indexes can be created over the columns with such conditions. By creating and storing a full B+ Tree index side-by-side with the columnar data file, OAP can identify target rows by quickly searching through the B+ Tree index, and skip unnecessary data scans over backend storage such as HDFS. Furthermore, the index file is separated from the original data file. This makes it possible to create or drop indexes without the need to rewrite the original data files.
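The index idea above can be illustrated with a minimal sketch. It is not OAP's implementation: a sorted array with binary search stands in for the B+ Tree, and the class name, function names, and sample table are hypothetical. What it shows is that the index lives apart from the data and lets a filtered query touch only the matching rows.

```python
from bisect import bisect_left, bisect_right


class SortedColumnIndex:
    """Stand-in for a B+ Tree index: sorted keys stored apart from the data."""

    def __init__(self, column_values):
        # Build step: sort (value, row_id) pairs once, without touching the data file.
        pairs = sorted((value, row_id) for row_id, value in enumerate(column_values))
        self.keys = [value for value, _ in pairs]
        self.row_ids = [row_id for _, row_id in pairs]

    def lookup_range(self, low, high):
        """Return the ids of rows whose indexed value lies in [low, high]."""
        start = bisect_left(self.keys, low)
        end = bisect_right(self.keys, high)
        return sorted(self.row_ids[start:end])


def query_with_index(rows, index, low, high):
    # Only rows named by the index are read; all other rows are skipped.
    return [rows[row_id] for row_id in index.lookup_range(low, high)]


# Hypothetical table of (user_id, clicks); the index covers the user_id column.
rows = [(42, 7), (17, 3), (99, 1), (17, 5), (63, 2)]
index = SortedColumnIndex([user_id for user_id, _ in rows])
matches = query_with_index(rows, index, 17, 42)
```

Because the index is a separate object built from the column values, it can be created or discarded without rewriting `rows`, which mirrors the separation of index file and data file described above.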

Figure 2. OAP Cache & Index Concept.

To further reduce query response time from seconds to sub-seconds, OAP optimizes index and data access with cache. By caching the index and data in memory, index loading and data scanning get orders of magnitude faster, avoiding disk and network I/O overhead when reading from distributed file systems. What’s more, index and data can be configured with separate caches, enabling independent eviction and memory space management for both.

Additionally, because the cache operates at the column level, it is possible to cache only the columns that a query requires. Under the Least-Recently-Used (LRU) policy, the least recently used data items are evicted when the cache reaches maximum capacity, making room for more recent data items. Guided by this policy, an advanced cache manager is implemented in Baidu BigSQL to proactively populate hot columns and retire columns that are no longer required in the cache.
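As a rough illustration of column-level LRU caching with separate index and data caches, consider the sketch below. The class, the cache keys, and the byte sizes are hypothetical and chosen only for the example; OAP's actual cache manager is not described in enough detail here to reproduce.

```python
from collections import OrderedDict


class LruCache:
    """Size-bounded cache that evicts the least recently used entry first."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self.entries = OrderedDict()  # key -> (payload, size_bytes), oldest first

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key][0]

    def put(self, key, payload, size_bytes):
        if key in self.entries:
            # Replacing an entry: release its old space first.
            self.used_bytes -= self.entries.pop(key)[1]
        # Evict least recently used entries until the new one fits.
        while self.entries and self.used_bytes + size_bytes > self.capacity_bytes:
            _, (_, evicted_size) = self.entries.popitem(last=False)
            self.used_bytes -= evicted_size
        if size_bytes <= self.capacity_bytes:
            self.entries[key] = (payload, size_bytes)
            self.used_bytes += size_bytes


# Separate caches let index and data be sized and evicted independently.
index_cache = LruCache(capacity_bytes=10)
data_cache = LruCache(capacity_bytes=100)

# Cache individual columns of a hypothetical file, keyed by (file, column).
data_cache.put(("part-0", "user_id"), b"u" * 40, 40)
data_cache.put(("part-0", "clicks"), b"c" * 40, 40)
data_cache.get(("part-0", "user_id"))                 # user_id becomes most recent
data_cache.put(("part-0", "region"), b"r" * 40, 40)   # evicts clicks, the LRU column
```

Keying the cache by column rather than by file is what allows a query that reads two columns of a wide table to occupy cache space for those two columns only.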

Baidu BigSQL Optimization with Intel Optane Persistent Memory
When the data scale is small, Baidu BigSQL can deliver optimal performance by caching index or data in DRAM. However, as Baidu’s business continues to grow, datasets are rapidly increasing in size. When cache space becomes too small to accommodate a large amount of hot data, performance suffers.

The simple solution is to add more DRAM, but this has several disadvantages. First, the price per GB is high, putting great pressure on TCO. Second, memory is a precious resource for computation, especially in a Spark environment, where the total DRAM capacity that can be configured on each node is limited. Third, although DRAM has higher random-access bandwidth and lower latency, those benefits are wasted when it is used to cache large data blocks that are read mostly sequentially. To find more cost-effective alternatives, Baidu and Intel worked together to integrate Intel Optane PMem.

Intel Optane PMem is an innovative technology that delivers a unique and affordable combination of large memory capacity and persistence. It represents a new class of memory and storage technology, explicitly architected for data centers. It offers several key benefits that match the specific requirements of Baidu BigSQL:

  • High bandwidth for sequential read
  • Large capacity and affordable cost

Intel Optane PMem supports two operating modes. When configured in Memory Mode, applications perceive a pool of volatile memory no differently than they do on DRAM-only systems; when configured in App Direct Mode, the application can direct how the available space is used. Since the OAP cache has the specific purpose of caching index and input data, App Direct Mode is used to ensure the application has full control over how the device is used. In addition, the cache can be repopulated from backend storage and does not need to be persistent, so OAP uses the memkind library to access PMem without persistence and the performance penalties that come with it.

To use PMem in place of DRAM, Intel extended OAP to allow memory manager plugins, and implemented a PMem-based memory manager to allow the allocation of cache space in PMem. Users can switch between DRAM and PMem, or even mix the two, for instance using DRAM to cache index while using PMem to cache data.
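The plugin arrangement described above might be pictured along the following lines. This is a sketch under stated assumptions rather than OAP's code: the interface, class names, and configuration keys are all hypothetical, and a plain `bytearray` stands in for memory that a real PMem manager would obtain from the device in App Direct Mode.

```python
from abc import ABC, abstractmethod


class MemoryManager(ABC):
    """Hypothetical plugin interface; the names here are not OAP's actual API."""

    medium = "unknown"

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0

    @abstractmethod
    def _raw_allocate(self, size_bytes):
        """Return a writable buffer of size_bytes from the backing medium."""

    def allocate(self, size_bytes):
        # Refuse allocations that would exceed the configured cache space.
        if self.used_bytes + size_bytes > self.capacity_bytes:
            raise MemoryError(f"{self.medium} cache space exhausted")
        self.used_bytes += size_bytes
        return self._raw_allocate(size_bytes)


class DramMemoryManager(MemoryManager):
    medium = "dram"

    def _raw_allocate(self, size_bytes):
        return bytearray(size_bytes)


class PmemMemoryManager(MemoryManager):
    medium = "pmem"

    def _raw_allocate(self, size_bytes):
        # Placeholder: a real plugin would allocate from persistent memory;
        # an ordinary bytearray is used here so the sketch runs anywhere.
        return bytearray(size_bytes)


MANAGERS = {"dram": DramMemoryManager, "pmem": PmemMemoryManager}


def build_cache_managers(config):
    """Create one memory manager per cache role from a hypothetical config."""
    return {
        role: MANAGERS[spec["medium"]](spec["capacity_bytes"])
        for role, spec in config.items()
    }


# Mixed setup from the text: index cached in DRAM, data cached in PMem.
managers = build_cache_managers({
    "index": {"medium": "dram", "capacity_bytes": 1024},
    "data": {"medium": "pmem", "capacity_bytes": 4096},
})
data_buffer = managers["data"].allocate(2048)
```

The point of the design is that the cache logic only ever calls `allocate`, so switching a cache from DRAM to PMem is a configuration change rather than a code change.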

Additionally, to fully integrate PMem with Baidu’s specific OS environment, Baidu and Intel carried out further wide-ranging collaborations in areas including hardware, operating system, and libraries.

To validate the performance and benefits of Intel Optane PMem in OAP, Baidu conducted several evaluations and internal tests, first with decision support benchmark queries and then with Baidu’s real workload queries. The main objective was to test and understand the cost-efficiency of PMem.

In the tests with decision support benchmark queries, the dataset size was first capped at 1 TB, with DRAM and PMem configured at the same capacity. Test results show that both can cache all the data, and that PMem trails DRAM by only 11.7% in performance while costing much less [2]. In the second scenario, the dataset reaches 3 TB and DRAM and PMem are configured at the same cost, so DRAM can no longer cache all the data due to its lower capacity. PMem, in comparison, not only has the capacity to cache all the data but also performs 6 times better [2]. DRAM performs poorly in the second scenario because, when data size greatly exceeds cache size, data must frequently be read from backend storage, which delays the response. The decision support benchmark tests show clear evidence that, at the same cost level, Intel Optane PMem can provide larger capacity and higher performance than DRAM.
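The effect of data outgrowing the cache can be seen in a toy model. The sketch below is not a reproduction of Baidu's tests: it assumes an LRU cache, a workload that repeatedly scans the same chunks in order, and made-up chunk counts, and it reports only the cache hit ratio rather than any latency.

```python
from collections import OrderedDict


def hit_ratio(num_chunks, cache_chunks, passes):
    """Hit ratio of an LRU cache under repeated in-order scans.

    Assumes cache_chunks >= 1 and that all chunks have equal size.
    """
    cache = OrderedDict()
    hits = 0
    total = 0
    for _ in range(passes):
        for chunk in range(num_chunks):
            total += 1
            if chunk in cache:
                hits += 1
                cache.move_to_end(chunk)       # refresh recency on a hit
            else:
                if len(cache) >= cache_chunks:
                    cache.popitem(last=False)  # evict least recently used
                cache[chunk] = True            # fetched from backend storage
    return hits / total


# Cache large enough for the whole dataset: only the first pass misses.
fits = hit_ratio(num_chunks=30, cache_chunks=30, passes=4)

# Cache one third of the dataset: each chunk is evicted before it is reused.
thrashes = hit_ratio(num_chunks=30, cache_chunks=10, passes=4)
```

Under these assumptions the undersized cache never produces a hit, so every access goes to backend storage. Real workloads are less regular than a cyclic scan, but the model shows why a cache that is only somewhat too small can perform far worse than one that holds the full hot set.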

Figure 3. DRAM and Intel Optane PMem Comparison Tests [2].

The next stage of testing is based on the same two scenarios, but with Baidu’s actual workload and a slightly different approach. In the first scenario, both DRAM and PMem are tested to cache 50% of the frequently used columns. Results show that PMem is only about 12% slower than DRAM at caching [2]. Since its cost is disproportionately lower, it is the more cost-efficient solution. In the second scenario (DRAM and PMem at the same cost), only PMem has the capacity to cache all the hot data columns, and it demonstrates a 22% performance improvement while avoiding 30% of I/O requests to underlying systems [2].

Based on these test results, Baidu concluded that Intel Optane PMem can replace DRAM in BigSQL as a more cost-efficient cache solution. Baidu has since deployed PMem in BigSQL and used it to optimize its ad hoc query service, Tuling. Supported by Intel Optane PMem, the cluster offloaded more than 30% of the workload from Tuling [1]. Additionally, after PMem was deployed, average query latency fell by 20% [1]. Spark/OAP performance per PMem server instance improved by 50% on the Tuling Spark SQL workload, at an additional cost of only 20% [1].
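The last pair of figures can be combined into a single performance-per-dollar number. The short calculation below is one reading of them, assuming that both percentages are measured against the same server without PMem; the function name is illustrative.

```python
def perf_per_dollar_gain(perf_gain, cost_increase):
    """Relative change in performance per dollar versus the baseline.

    Both arguments are fractions, e.g. 0.50 for a 50% improvement.
    """
    return (1 + perf_gain) / (1 + cost_increase) - 1


# 50% more performance for 20% more cost, as reported for the Tuling workload.
gain = perf_per_dollar_gain(perf_gain=0.50, cost_increase=0.20)
print(f"performance per dollar changes by about {gain:.0%}")
```

Under that assumption, the ratio 1.5 / 1.2 works out to roughly a 25% improvement in performance per dollar.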

Outlook
Emerging trends are driving big data technologies to change and evolve. The focus is shifting from providing key functionalities to cloud-based solutions, with in-depth optimizations to meet performance targets and reduce cost. In the future, as Baidu’s BigSQL becomes cloud-based, Intel Optane PMem will bring it more significant advantages in terms of performance and TCO.

And beyond input data cache acceleration for Spark SQL, with its high capacity and high bandwidth, PMem has an even bigger role to play in Spark-based machine learning and deep learning scenarios which require many computational iterations in order to process very large volumes of data. Furthermore, Spark shuffle can be optimized to access PMem through RDMA and utilize it as shuffle storage, further reducing shuffle latency and improving performance.

Going forward, Baidu and Intel will continue working together to optimize Spark. As Intel Optane PMem and 2nd Generation Intel® Xeon® Scalable Processors become more advanced, Baidu and Intel will be able to leverage them to introduce more acceleration features to Spark, pushing performance and cost-efficiency to the next level.


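Notices and Disclaimers

The features and benefits of Intel® technologies depend on system configuration and may require enabled hardware, software, or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer, or learn more at https://www.intel.com.tw. // Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information, visit https://www.intel.com.tw/benchmarks. // Performance results are based on testing as of the dates shown in the configurations and may not reflect all publicly available security updates. See the configuration disclosure for details. No product or component can be absolutely secure. // The cost reduction scenarios described are intended as examples of how a given Intel®-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction. // Intel does not control or audit third-party benchmark data or the websites referenced in this document. You should visit the referenced website and confirm whether the referenced data are accurate. // Results in some test cases have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and are provided for informational purposes only. Any differences in system hardware, software, or configuration may affect actual performance.

Product and Performance Information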

[1] The production performance data was given on August 16, 2019. For more complete information about these test results, please contact Baidu. Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.
[2] The evaluation performance data was given on January 31, 2019. For more complete information about these test results, please contact Baidu. Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.