Alibaba’s world record for data warehousing (TPC-DS) performance has been comprehensively demolished by Databricks, using the company’s new integrated Big Data "Lakehouse" platform. The formally audited tests covered a combination of SQL workloads: loading the data set, processing a sequence of queries, processing several concurrent query streams, and running data maintenance functions that insert and delete data.
The new world record was set in the 100TB TPC-DS benchmark. (That's an enterprise-class benchmark, published and maintained by the non-profit, vendor-neutral Transaction Processing Performance Council.)
TPC-DS data is used to test performance, scalability and SQL compatibility across a range of Data Warehouse queries. The tests use a model of a retailer selling through three channels. Data is sliced across 17 dimensions, with the bulk of it held in large fact tables representing daily transactions spanning five years. The 100TB version of TPC-DS is the largest publicly available sample relational database for testing and evaluation; its STORE_SALES table alone, for example, contains over 280 billion rows, or 42 terabytes of CSV files.
The performance (2.2X the previous record) is an impressive vindication of the work Databricks has put into building its open data management architecture. That architecture aims to bring together the classic data warehouse (a repository for typically structured and analytics-ready data) and the capabilities of a data lake (a store of raw structured and unstructured data) into what the company calls a data “Lakehouse”. (The company set the record using its commodity software rather than the custom-built setup used by Alibaba.)
Databricks has used a range of innovations, including its “Photon Engine” (a query engine developed in C++ and designed to be directly compatible with APIs for open source analytics engine Apache Spark), to deliver what it can now legitimately boast is the fastest data warehouse in the market. The new TPC-DS record saw Databricks delivering 32,941,245 “QphDS” (a metric that includes 99 queries ranging from simple aggregations to complex pattern mining), more than doubling the 14,861,137 QphDS previously achieved by Alibaba.
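For readers who want to check the headline multiple, the two QphDS figures quoted above are enough. The snippet below is only a quick sanity check, not part of the TPC-DS methodology; the function name `speedup` is a hypothetical helper introduced for illustration.

```python
# QphDS figures as reported in the article (100TB TPC-DS).
databricks_qphds = 32_941_245
alibaba_qphds = 14_861_137


def speedup(new: int, old: int) -> float:
    """Return how many times larger `new` is than `old` (hypothetical helper)."""
    return new / old


ratio = speedup(databricks_qphds, alibaba_qphds)

# Print the ratio to two decimal places for comparison with the quoted 2.2X.
print(f"Databricks / Alibaba = {ratio:.2f}x")
```

The ratio rounds to 2.2, which matches the multiple quoted for the new record.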
Databricks used the performance win to have a quick dig at rivals, noting: “Many data warehouse systems, even the ones built by the most established vendors, have tweaked the official benchmark so their own systems would perform well. (Some common tweaks include removing certain SQL features such as rollups or changing data distribution to remove skew). This is one of the reasons why there have been very few submissions to the official TPC-DS benchmark... The tweaks also ostensibly explain why most vendors seem to beat all other vendors according to their own benchmarks,” the company added in a November 2 blog.
Databricks has raised over $3.5 billion in funding to date. Its team noted that this had allowed it to “fund the most aggressive SQL team build out in history; in a short period of time, we’ve assembled a team with extensive data warehouse background… Among them are lead engineers and designers of some of the most successful data systems, including Amazon Redshift; Google’s BigQuery, F1 (Google’s internal data warehouse system), and Procella (Youtube’s internal data warehouse system); Oracle; IBM DB2; and Microsoft SQL Server.”
The new TPC-DS record was set using Databricks SQL, the company's analytics platform built on open source standards, which lets users query data lakes with standard BI tools like Tableau and Microsoft Power BI. Performance was powered by “Photon Engine”, a query engine developed in C++ and designed to be directly compatible with Apache Spark APIs. The company says early adopters include supermajor Shell, which is using the platform to "execute rapid queries on petabyte scale datasets using standard BI tools."
The full TPC-DS record results can be found here.