Spark Native Accelerators and Associated Technologies

Author

Trent Hauck

Published

May 30, 2024

Landscape

There are a few budding technologies that are looking to accelerate Spark, generally by replacing the execution engine with a more efficient one. As of 2024-05-30, many of these technologies are not yet stable.

Feel free to email me at trent@trenthauck.com if you have any suggestions or corrections.

| Description | Execution Engine | Language | Related Companies | Open Source | TPC-H Link [1] |
|---|---|---|---|---|---|
| Gluten + Velox | Velox | C++ | Meta wrote Velox, MSFT is using Gluten + Velox in Fabric | Yes | Velox TPC-H |
| DataFusion Comet | DataFusion | Rust | Apple released DataFusion Comet | Yes | DataFusion Comet TPC-H |
| Photon | Photon | C++ | Databricks | No | N/A |
| Blaze | DataFusion | Rust | Kwai | Yes | Blaze TPC-H |
| RAPIDS | RAPIDS | C++/Cuda | NVIDIA | Yes | N/A |

Component Notes

  • Gluten dubs itself a middle layer for offloading computation to native engines. In practice, it seems to be used primarily with Velox.
  • Velox is a C++ execution engine developed by Meta. It is used in Fabric from Microsoft.
  • Apache DataFusion Comet is a native execution engine for Spark written in Rust. It is based on Apache DataFusion and was released by Apple.
  • Photon is a C++ execution engine developed by Databricks available on their platform.
  • RAPIDS is a suite of libraries for data science and analytics that uses GPUs. It includes a Spark plugin. https://github.com/NVIDIA/spark-rapids-jni/ contains the bindings, implemented in CUDA and C++.
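
Most of the open-source options above attach to Spark as a plugin rather than as a fork, which is what "replacing the execution engine" looks like in practice. Here is a minimal sketch of that idea, assuming the plugin class names each project documents at the time of writing. The `PLUGIN_CLASSES` table and the `spark_submit_args` helper are hypothetical names for illustration, and the class names may differ between releases, so check them against the version you install.

```python
# Plugin class names are assumptions taken from each project's documentation;
# they can change between releases, so verify them before use.
PLUGIN_CLASSES = {
    "gluten": "org.apache.gluten.GlutenPlugin",
    "comet": "org.apache.spark.CometPlugin",
    "rapids": "com.nvidia.spark.SQLPlugin",
}


def spark_submit_args(accelerator: str, jar_path: str) -> list[str]:
    """Build the spark-submit flags that load one accelerator plugin.

    `jar_path` is a placeholder for wherever the accelerator jar lives.
    """
    plugin = PLUGIN_CLASSES[accelerator]
    return ["--jars", jar_path, "--conf", f"spark.plugins={plugin}"]


if __name__ == "__main__":
    # Print the flags for a hypothetical Comet jar location.
    print(" ".join(spark_submit_args("comet", "/path/to/comet.jar")))
```

Real deployments usually need more than this (off-heap memory settings, a matching Spark and Scala version, GPU resources for RAPIDS), so treat the output as a starting point. Photon is not in the table because it is enabled through the Databricks platform rather than by loading a jar.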

Footnotes

  1. These are links, if available, to TPC-H benchmarks for the given technology. They may not use the same scale factor or configuration as other technologies, so please be cautious when comparing them. Also, if you have better links, please let me know.