A data lake is an essential tool for big data analytics, making data accessible so that it can drive business change as quickly and efficiently as possible. Rather than merely running reporting and business intelligence on existing operations in a data warehouse to analyze how the business is performing, a data lake gives organizations the opportunity to rapidly reevaluate where the business should be innovating. The foundation of a data lake is a storage system that can accommodate all of the data across an organization, from supplier quality information, to customer transactions, to real-time product performance data. Unlike a data warehouse or an analytics database, a data lake can collect any type of data without requiring details such as the schema or the meaning in advance.
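Collecting data without declaring a schema up front is often called schema-on-read. A minimal sketch of the idea, with entirely hypothetical record and field names, might look like this: heterogeneous records land in the lake as-is, and structure is discovered only when someone reads them.

```python
import json

# Hypothetical raw records as they might land in a data lake:
# different shapes, no schema declared at write time.
raw_events = [
    '{"type": "transaction", "customer_id": 42, "amount": 19.99}',
    '{"type": "sensor", "product_id": "P-7", "temperature_c": 71.3}',
    '{"type": "supplier_audit", "supplier": "Acme", "defect_rate": 0.02}',
]

def fields_seen(records):
    """Discover the schema at read time instead of enforcing it at write time."""
    schema = {}
    for line in records:
        record = json.loads(line)
        for key, value in record.items():
            schema.setdefault(key, type(value).__name__)
    return schema

print(fields_seen(raw_events))
```

A warehouse would have rejected or forced all three records into one predefined table; the lake accepts them all and lets each analysis decide which fields matter.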
Built on highly scalable storage, a good data lake design starts with a data collection system that can accept any type of data. Further, a data lake requires governance- and lineage-aware tools for combining and transforming datasets. What was traditionally a simple linear extract, transform, load (ETL) process for collecting data into a data warehouse becomes an extract, load (really just collect), transform (ELT) process. A data lake might be used to combine information about the lifecycle of each product, letting the business analyze what level of quality from a given supplier correlates with the most customer complaints.
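The ELT reordering can be sketched in a few lines; the zone names and record fields below are hypothetical, chosen only to illustrate that loading happens before, and independently of, any transformation.

```python
# Hypothetical ELT sketch: land raw data first, transform later in place.
raw_zone = []       # the lake's landing area: accepts anything, untyped
curated_zone = []   # derived datasets produced by later transformations

def extract_and_load(record):
    """EL: collect the record exactly as received -- no schema, no cleanup."""
    raw_zone.append(record)

def transform_later():
    """T: run only when an analysis needs it, reading from the raw zone."""
    for record in raw_zone:
        if record.get("source") == "supplier" and "defect_rate" in record:
            curated_zone.append({
                "supplier": record["supplier"],
                "defect_rate": float(record["defect_rate"]),
            })

extract_and_load({"source": "supplier", "supplier": "Acme", "defect_rate": "0.02"})
extract_and_load({"source": "web", "page": "/checkout"})  # kept even though untyped
transform_later()
print(curated_zone)  # [{'supplier': 'Acme', 'defect_rate': 0.02}]
```

In classic ETL, the second record would have been dropped at the door for not matching the warehouse schema; in ELT it stays in the raw zone, available to any future transformation.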
Though it’s common to start using the data lake as just a collection point, a well-designed system gives users access to data as it arrives. First-generation data lakes still loaded data that had been transformed in the data lake into a data warehouse, and even today separate analytics systems are common add-ons to a data lake. These transitional architectures are helpful for adding a data lake to an existing data management system, but they fail to deliver the speed and efficiency possible with a well-designed data lake. A modern data lake architecture adds a third layer, data virtualization, in which the analytics engine operates directly on the data lake instead of relying on an add-on legacy system. A data lake with data virtualization gives users a single location with direct access to any data they are authorized to see. Rather than modeling data, transferring it from the data lake to a third-party system, and then managing permissions and access controls in multiple locations while maintaining data consistency, a data lake with data virtualization offers a true single source of truth. This architecture, with a query engine running directly on the data lake, gives organizations the ultimate flexibility and the agility necessary to support innovation.
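The core of this pattern can be sketched with an in-memory stand-in for the lake (the dataset, user, and ACL names below are hypothetical): the "engine" scans the raw files directly at query time, nothing is copied into a separate warehouse, and authorization is enforced in a single layer.

```python
import io
import json

# Hypothetical lake: dataset name -> raw JSON-lines content, stored as collected.
lake = {
    "complaints": '{"supplier": "Acme", "count": 9}\n'
                  '{"supplier": "Bolt", "count": 2}\n',
}
acl = {"analyst": {"complaints"}}  # per-user dataset permissions, one place only

def query(user, dataset, predicate):
    """Scan the raw file on demand; no copy, access checked in this one layer."""
    if dataset not in acl.get(user, set()):
        raise PermissionError(f"{user} may not read {dataset}")
    rows = (json.loads(line) for line in io.StringIO(lake[dataset]))
    return [row for row in rows if predicate(row)]

print(query("analyst", "complaints", lambda r: r["count"] > 5))
# [{'supplier': 'Acme', 'count': 9}]
```

Because the query reads the lake's own copy, there is no second system whose permissions and contents must be kept consistent; real engines in this role (e.g. SQL engines over object storage) add indexing and distributed execution on top of the same idea.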
New technologies like Varada are making data virtualization possible, helping organizations realize well-designed data lake architectures. By running fast, efficient query engines on top of a data lake’s flexible transformation and scalable collection capabilities, organizations can deliver data-driven processes that impact their businesses while also supporting reporting and business intelligence systems.
See how Varada’s big data indexing dramatically accelerates queries vs. AWS Athena:
To see Varada in action on your data set, schedule a short demo!