For engineers with years of experience building enterprise data warehouses, a data lake architecture can be confusing. Some parts appear similar – both have data storage systems, query engines, and user management. Other parts look alike but play different roles. At the heart of the difference is how each architecture uses its data catalog.
In a data warehouse architecture, the catalog controls how data is loaded. This is why warehouses are considered less flexible but faster: data is structured first, then written. In a data lake, the catalog describes where existing data can be found and in what format. Data lakes have traditionally been thought of as flexible but slower, because data is written in any format and structured later.
An enterprise data warehouse defines clear structures for how enterprise-wide data is collected, organized, and queried. The core of the warehouse is the catalog, which holds the schema for each table in the warehouse. As data is collected from various sources, it must be transformed into the structure the warehouse expects for each data type. For example, a retail data warehouse might specify the attributes of a customer record, including an identifier that also appears as an attribute in the transaction table. Users then retrieve information about customer transactions by running a query that joins these tables, and the data warehouse query engine is tuned to answer exactly those queries. But if a group wants to load new data or change the attributes of a data type, the changes must be coordinated through updates to the central catalog before anything can be collected. Similarly, users can only query data according to how it is structured in the catalog.
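The warehouse pattern above (schema defined up front, data validated on load, queries joining on the shared identifier) can be sketched with an in-memory SQLite database. The table and column names here are illustrative assumptions, not from any particular warehouse:

```python
# Sketch of schema-on-write, as enforced by a data warehouse catalog.
# Table/column names (customers, transactions, customer_id) are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")

# The catalog defines the schema first; data must fit it to be written.
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE transactions (
        txn_id      INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        amount      REAL NOT NULL
    );
""")

def load_customer(record: dict) -> None:
    """Transform/validate an incoming record into the catalog's structure."""
    conn.execute(
        "INSERT INTO customers (customer_id, name) VALUES (?, ?)",
        (record["customer_id"], record["name"]),
    )

load_customer({"customer_id": 1, "name": "Ada"})
conn.execute("INSERT INTO transactions VALUES (1, 1, 19.99)")

# Users query through the structure the catalog defines: a join on customer_id.
rows = conn.execute("""
    SELECT c.name, t.amount
    FROM customers c
    JOIN transactions t ON c.customer_id = t.customer_id
""").fetchall()
print(rows)  # [('Ada', 19.99)]
```

A record missing `name`, or a transaction referencing an unknown customer, fails at load time, which is precisely the rigidity (and the query speed) the paragraph above describes.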
By contrast, a data lake can accept any kind of data in any format. Instead of coordinating how data is stored, the catalog in a data lake helps users identify the types of data that have been added to the lake. This lets organizations be as strict or as permissive as the business demands. A data lake catalog can verify data as it is loaded and enforce data types, or it can be updated in response to data changes, mapping and combining old and new formats. Users might be required to use the central catalog as defined (much like a data warehouse), or they can update the catalog to mirror their own view of the data that has been loaded. Data lakes initially could not match the query performance of data warehouses and were often paired with them. Modern data lakes, however, have query engines that are just as performant as data warehouses, with the added benefit of data flexibility, thanks to the power of the data lake catalog.
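The schema-on-read idea (raw records land in whatever shape they arrive, and the catalog maps old and new formats to one logical view at query time) can be sketched in a few lines. The field names and the mapping here are hypothetical:

```python
# Sketch of schema-on-read: records are written as-is, and a lightweight
# catalog entry maps each known format to logical columns when data is read.
# Field names ("cust", "amount_usd", etc.) are illustrative assumptions.
raw_records = [
    {"cust": 1, "total": 19.99},             # old format, already in the lake
    {"customer_id": 2, "amount_usd": 5.00},  # new format, loaded with no coordination
]

# The catalog maps each logical column to the source fields that can supply it,
# newest format first.
catalog = {
    "customer_id": ["customer_id", "cust"],
    "amount": ["amount_usd", "total"],
}

def read_with_schema(record: dict) -> dict:
    """Apply the catalog's mapping when data is read, not when it is written."""
    return {
        logical: next(record[src] for src in sources if src in record)
        for logical, sources in catalog.items()
    }

normalized = [read_with_schema(r) for r in raw_records]
print(normalized)
# [{'customer_id': 1, 'amount': 19.99}, {'customer_id': 2, 'amount': 5.0}]
```

When a new format appears, only the `catalog` mapping is updated; nothing already written needs to change, which is the flexibility the paragraph above attributes to the data lake catalog.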
See how Varada’s big data indexing dramatically accelerates queries vs. AWS Athena:
To see Varada in action on your data set, schedule a short demo!