Cloud data lake vendor Cloudera has announced the general availability of Apache Iceberg in its data platform.
Developed through the Apache Software Foundation, Iceberg offers an open table format designed for high performance on big data workloads, with support for query engines including Spark, Trino, Flink, Presto, Hive and Impala.
Iceberg started out as a Netflix project before it was donated to the Apache foundation two years later in 2018.
In a blog, Cloudera, the data platform vendor with its roots in Hadoop-based systems, said its goal was to allow multi-function analytics on data lakes, repositories that support both structured and unstructured data. The introduction of the lakehouse concept is meant to encourage users to employ analytics and BI on data lake systems.
“However, it still remains driven by table formats that are tied to primary engines, and oftentimes single vendors. Companies, on the other hand, have continued to demand highly scalable and flexible analytic engines and services on the data lake, without vendor lock-in,” Cloudera said.
The deployment of Iceberg in the Cloudera Data Platform (CDP) includes Cloudera Data Warehousing, Cloudera Data Engineering, and Cloudera Machine Learning. “These tools empower analysts and data scientists to easily collaborate on the same data, with their choice of tools and analytic engines,” Cloudera said.
Benefits are set to include support for schema and partition changes in a single command, time travel with point-in-time queries for forensic visibility and regulatory compliance, and concurrent multi-function analytics to meet end-to-end data lifecycle needs. Performance is also set to improve with aggressive partitioning to handle very large-scale data sets, Cloudera said.
Tussle of the open source techies
However, Cloudera is not the only data lake or lakehouse vendor to commit to an open-source path.
Databricks, which began as an Apache Spark vendor, has also donated its storage format layer to the open-source community. The latest iteration, Delta Lake 2.0, was announced last week at the Data and AI Summit.
“Delta Lake 2.0 will bring unmatched query performance to all Delta Lake users and enable everyone to build a highly performant data lakehouse on open standards. With this contribution, Databricks customers and the open-source community will benefit from the full functionality and enhanced performance of Delta Lake 2.0,” Databricks said.
Speaking to The Register, Joel Minnick, Databricks marketing VP, said: “After Delta Lake was open-sourced and there’s a lot of performance enhancements and features that we had continued to build inside of the Databricks platform. We’ve always been an open-source company at heart and if we were doing those enhancements, we really did want to be able to give those back to the community.”
Minnick said the enhancements were on the “data processing, data warehousing side of things.”
Delta Lake 2.0 was donated to the Linux Foundation this week. ®