We have already written about Data Lakehouse technologies and compared the most prominent Data Lake Table Formats. All of them have their strengths and weaknesses, but are ultimately exciting tools that enable us to maintain and utilize data lakes more efficiently.
Here we wanted to give a rough overview of our customers’ needs and wants when it comes to Data Lake Table Formats and File Formats. It offers a unique perspective on the direction the market is currently heading.
Customers and numbers
Synvert has worked with over 250 customers, with success stories across many different industries. Our goal was to take a peek at the preferences within our customers’ analytical stacks, especially at the trend of emerging lakehouse technologies. So, what are the results of our questionnaire?
Figure 1: Results of the customer questionnaire
Delta Lake had the biggest push among customers, mostly because of its integration with Databricks and Microsoft, who have been using the technology as the default on their platforms. Customers that want to use those platforms in the future, especially Databricks and Spark, tend to go with Delta Lake. Where customers stopped using Delta Lake, it was mostly due to overhead: Spark was more than they needed, and a shift to a modern data warehouse solution was sufficient for their use case. Still, Delta Lake is the most widely used lakehouse storage format, and since the announcement of Delta Lake 3.0 and its compatibility with Apache Iceberg and Apache Hudi, it might only gain a larger footprint.
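To illustrate why Delta Lake feels like the default on Spark-based platforms, here is a minimal PySpark sketch that writes and reads a Delta table. It assumes the delta-spark package is on the classpath; the path, app name, and sample schema are placeholders chosen for the example, not anything from our customers’ setups.

```python
from pyspark.sql import SparkSession

# Minimal sketch: assumes the delta-spark package is available on the classpath
spark = (
    SparkSession.builder.appName("delta-example")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Hypothetical sample data; the path and column names are placeholders
df = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "event"])
df.write.format("delta").mode("overwrite").save("/tmp/delta/events")

# Reading the table back works like any other Spark data source
spark.read.format("delta").load("/tmp/delta/events").show()
```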
Iceberg has great future potential, as it is the recommended format for Cloudera, Dataiku, and Dremio environments. As an Apache project it has gathered support from many top companies, especially in the open-source community. Many customers plan and want to use Iceberg in future projects (Figure 1, Figure 2).
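For comparison, a minimal sketch of how an Iceberg table can be created and queried through Spark SQL. It assumes the iceberg-spark runtime is on the classpath and a catalog named `local` backed by a Hadoop catalog; the catalog name, warehouse path, and table name are assumptions made purely for illustration.

```python
from pyspark.sql import SparkSession

# Minimal sketch: assumes the iceberg-spark runtime package is available
spark = (
    SparkSession.builder.appName("iceberg-example")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg/warehouse")
    .getOrCreate()
)

# Hypothetical table name and schema, purely for illustration
spark.sql("CREATE TABLE IF NOT EXISTS local.db.events (id BIGINT, event STRING) USING iceberg")
spark.sql("INSERT INTO local.db.events VALUES (1, 'click'), (2, 'view')")
spark.sql("SELECT * FROM local.db.events").show()
```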
Hudi is another Apache project, and compared to the other two parts of the lakehouse trinity, it is losing popularity among customers (Figure 2). That doesn’t mean it is going away anytime soon: Hudi is integrated into Amazon EMR and remains the first choice for many AWS customers.
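For completeness, a minimal sketch of writing a Hudi table from Spark, assuming the hudi-spark bundle is on the classpath. The table name, record key, precombine field, and path below are placeholders chosen for the example.

```python
from pyspark.sql import SparkSession

# Minimal sketch: assumes the hudi-spark bundle is available; Hudi recommends Kryo serialization
spark = (
    SparkSession.builder.appName("hudi-example")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Hypothetical sample data; column names and path are placeholders
df = spark.createDataFrame([(1, "click", 1000), (2, "view", 1001)], ["id", "event", "ts"])

hudi_options = {
    "hoodie.table.name": "events",                     # target table name (placeholder)
    "hoodie.datasource.write.recordkey.field": "id",   # unique record key
    "hoodie.datasource.write.precombine.field": "ts",  # field used to pick the latest record
}

df.write.format("hudi").options(**hudi_options).mode("overwrite").save("/tmp/hudi/events")

# Reading back via the Hudi data source
spark.read.format("hudi").load("/tmp/hudi/events").show()
```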
Other table formats and alternatives include first-generation lakehouses such as Hive together with the ORC file format, as well as proprietary implementations and modern data warehouse solutions like Redshift, BigQuery, and Snowflake used in their place.
In that context, it makes sense that many customers are considering one of the modern lakehouse solutions for the future. At the moment, customers are mostly guided by their existing stack, choosing the solution that fits it best and carries the lowest maintenance costs.
It is great to see that big tech is supporting all of the main table formats, and especially noteworthy has been the rise of complementary projects like lakeFS, a data version control system for data lakes. Looking at the GitHub star history for these projects (Figure 4), it seems to mirror our customers’ interest (Figure 3) for now. The question remains: will the industry converge? Will Delta Lake leave the Apache projects behind, or will one of them manage to outperform it?