Data lakes are built on a versatile set of technologies that support a wide range of solutions, including raw data collection, flexible data access for users, and fast, efficient data warehouses.
Data lake solutions fall into two primary types: a data staging ground, which transforms raw data into a format usable for analysis and reporting, and a complete data warehouse with a built-in query engine. In a classic enterprise data warehouse architecture, these correspond to the Extract, Transform, Load (ETL) system and the analytics query engine. Implemented in a data lake, ETL and analytics are typically faster, in both query response time and time to insight, and more cost-effective than with traditional technologies.
In a data lake architecture, data teams can store 10x more data, yet the entry cost is often a fraction of the price of data warehouse solutions, which tend to start at a seven-figure price tag. As adoption of the data lake architecture gains momentum and more business units migrate workloads to it, the cost will increase to accommodate the growing need for compute and storage.
Using a data lake for raw data collection and transformation makes it possible to replace legacy ETL solutions at lower cost and with higher performance. ETL is a sequential activity: it starts with modeling the target data in the data warehouse catalog, followed by designing data transformations that read the source data and make it conform to the target system schema.
For example, a classic ETL process for collecting telemetry data from a rideshare system would normalize and validate all of the location fields, then look up the driver and rider IDs before storing valid records in the data warehouse. Invalid records might be rejected, or in some cases could cause the entire process to fail. In a data lake-based solution, the rideshare data is first collected and stored in raw format. That data can be queried directly without any transformation, or it can be transformed after being stored. Data engineers can address exceptions and resume the transformation instead of restarting it. Users get faster access to data, and any data transformation is more efficient than when done via ETL.
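The rideshare flow described above can be sketched in a few lines of Python. This is a minimal illustration rather than any particular product's pipeline: the record layout, the `KNOWN_DRIVERS` lookup, and the `validate` and `transform` helpers are all hypothetical names invented for the example. Raw lines are kept exactly as collected, invalid rows are set aside instead of aborting the run, and the `start` offset lets a run resume partway through.

```python
import json

# Hypothetical raw telemetry lines, stored unvalidated as they were collected.
RAW_LINES = [
    '{"ride_id": 1, "driver_id": "d1", "rider_id": "r1", "lat": 40.7, "lon": -74.0}',
    '{"ride_id": 2, "driver_id": "d2", "rider_id": "r9", "lat": 95.0, "lon": 10.0}',
    'not json at all',
    '{"ride_id": 4, "driver_id": "d1", "rider_id": "r2", "lat": 34.1, "lon": -118.2}',
]

# Hypothetical lookup table standing in for the driver ID check.
KNOWN_DRIVERS = {"d1", "d2"}


def validate(record):
    """Return the record if it passes all checks, else raise ValueError."""
    if not -90 <= record["lat"] <= 90 or not -180 <= record["lon"] <= 180:
        raise ValueError("coordinates out of range")
    if record["driver_id"] not in KNOWN_DRIVERS:
        raise ValueError("unknown driver")
    return record


def transform(raw_lines, start=0):
    """Transform raw lines from offset `start`, quarantining bad rows."""
    clean, quarantined = [], []
    for offset, line in enumerate(raw_lines[start:], start=start):
        try:
            clean.append(validate(json.loads(line)))
        except (ValueError, KeyError, TypeError) as exc:
            # Keep the offset so an engineer can inspect the raw line later.
            quarantined.append((offset, str(exc)))
    return clean, quarantined


clean, quarantined = transform(RAW_LINES)
resumed_clean, resumed_quarantined = transform(RAW_LINES, start=2)
```

Because the raw lines are never discarded, the quarantined offsets can be fixed and reprocessed later without collecting the data again.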
Once data is stored in a data lake, it can be transformed and exported to a data warehouse or analytics engine. Moving data to a data warehouse was a common implementation when data lakes first emerged and were slow to query. However, modern data lakes can support high-performance query engines, giving users direct access to both raw and transformed data in the data lake.
A common solution today involves loading raw data directly into a data lake, then transforming it into an optimized format that users can query directly. The latest high-performance query engines used in data lakes also rely on technologies such as columnar data storage, data partitioning, and big data indexing.
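To give a rough sense of why columnar storage and partitioning speed up queries, here is a toy in-memory sketch in Python. It is not how any real query engine is implemented; production systems typically use on-disk columnar formats such as Parquet. The `RIDES` data and the `to_partitioned_columns` and `total` helpers are hypothetical. The point is only that a query touching one column in one partition never reads the rest of the data.

```python
from collections import defaultdict

# Hypothetical transformed ride records.
RIDES = [
    {"city": "nyc", "fare": 12.5, "distance_km": 3.1},
    {"city": "sf", "fare": 20.0, "distance_km": 6.4},
    {"city": "nyc", "fare": 7.5, "distance_km": 1.8},
]


def to_partitioned_columns(rows, partition_key):
    """Group rows by partition key, storing each partition column by column."""
    partitions = defaultdict(lambda: defaultdict(list))
    for row in rows:
        columns = partitions[row[partition_key]]
        for name, value in row.items():
            if name != partition_key:
                columns[name].append(value)
    return partitions


def total(partitions, partition_value, column):
    """Sum one column in one partition; nothing else is scanned."""
    return sum(partitions[partition_value][column])


lake = to_partitioned_columns(RIDES, "city")
nyc_fares = total(lake, "nyc", "fare")
```

In a row-oriented layout, the same query would have to read every field of every ride in every city before it could add up the fares for one of them.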
Following the previous example, a data lake-based solution for rideshare data could feed a real-time machine learning system for ride assignment and rate calculation, a real-time reporting system that shows updated user reviews, and an interactive analytics system for business analysts to query. All of these can be served from the same data lake.