Until recently, the predominant tools for large-scale data analytics were SQL query engines built into data warehousing systems or specialized analytic database management systems. Data lakes emerged to handle the accelerating growth of data in both volume and complexity, prompting a new type of analytics system based on data virtualization.
Before the data lake, organizations collected and transformed data in flight, storing it in highly structured data warehouses that were designed to model the business. As demand for analytics grew, data warehouse administrators deployed specialized systems optimized for high-speed analytics on a subset of the data. Users extracted just the data they needed for their analytics. As the world became digitized, organizations shifted to collecting raw data in data lakes. In some cases those lakes continued to feed data warehouses and then analytics engines, while in other cases users started pulling data directly from the data lake into their analytics engines. The result has been a proliferation of data rather than a consolidation, leaving organizations with data bloat, exploding costs, and exposure to liabilities from data theft and from data governance that is all but impossible to enforce.
Though data lakes were initially used just for data collection, cleansing, and transformation, the SQL query engine technologies from the world of data warehouses and standalone analytics engines have been making their way onto the data lake. This new way of using data lakes is known as data virtualization. By taking the best of data analytics and data warehousing technology and running the query engine directly on a data lake, data virtualization helps organizations give their users lightning-fast time-to-insight with no delay between data collection and analytics. At the same time, administrators are able to simplify data lake governance and keep costs under control.
When comparing data warehouses with data virtualization, the focal point is the data lake and its ability to reduce the cost and complexity of large-scale data management. The missing piece so far has been an analytics solution that runs directly on the data lake, instead of having to transform and migrate data to a data warehouse and then to an analytics engine. Data virtualization is a new, highly efficient, and cost-effective approach to analytics. By putting query engines directly on data lakes, data virtualization is changing the way companies run their analytics and data management.
Varada is a data platform that is deployed in your VPC, on top of your data lake. Queries from any data consumer are routed via Varada, which acts as the query engine. Any SQL application or BI tool, as well as analysts and data scientists, can easily query any data source in your data lake without needing to move, prepare, or model the data in advance.
Varada is 100x faster than any other data lake query engine. Data teams and users no longer need to compromise on performance in order to achieve agility and fast time-to-insight. Queries run so much faster thanks to Varada’s dynamic and adaptive indexing technology. Unlike partitioning-based platforms, Varada indexes any column in any table, so it can fetch data extremely fast. The indexing adapts to the type of data, and Varada’s engine knows automatically which data to index based on a smart observability layer that continuously monitors demand. Indexing is best for complex queries that run on high-dimensional data and that would otherwise have required extensive modeling to achieve acceptable response times.
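To make the idea of demand-driven indexing more concrete, here is a minimal toy sketch in Python. It is not Varada's implementation. The class `AdaptiveIndexedTable`, the `index_threshold` parameter, and the simple hash index are all hypothetical, chosen only to illustrate the general pattern of watching which columns queries filter on and indexing a column once demand for it passes a threshold.

```python
from collections import Counter, defaultdict


class AdaptiveIndexedTable:
    """Toy in-memory table (hypothetical, for illustration only).

    It counts how often each column is used as a filter and builds a
    simple hash index on a column once that count reaches a threshold.
    """

    def __init__(self, rows, index_threshold=3):
        self.rows = list(rows)
        self.index_threshold = index_threshold
        # Stand-in for an "observability layer": filter demand per column.
        self.filter_counts = Counter()
        # column name -> {value: [row positions]}
        self.indexes = {}

    def _build_index(self, column):
        index = defaultdict(list)
        for pos, row in enumerate(self.rows):
            index[row[column]].append(pos)
        self.indexes[column] = index

    def select(self, column, value):
        """Return all rows where row[column] == value."""
        self.filter_counts[column] += 1
        hot = self.filter_counts[column] >= self.index_threshold
        if column not in self.indexes and hot:
            self._build_index(column)
        if column in self.indexes:
            # Indexed path: jump straight to the matching row positions.
            positions = self.indexes[column].get(value, [])
            return [self.rows[pos] for pos in positions]
        # Fallback path: full scan of every row.
        return [row for row in self.rows if row[column] == value]


# Made-up sample data, assumed for the example.
events = [
    {"country": "US", "device": "ios"},
    {"country": "DE", "device": "android"},
    {"country": "US", "device": "android"},
]
table = AdaptiveIndexedTable(events, index_threshold=2)
first = table.select("country", "US")   # served by a full scan
second = table.select("country", "US")  # threshold reached, served by the index
```

In this sketch only the column that queries actually filter on gets indexed, so no up-front modeling or choice of partition key is needed. A production engine would also have to handle range predicates, index storage, and eviction, none of which are modeled here.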
See how Varada’s big data indexing dramatically accelerates queries vs. AWS Athena:
To see Varada in action on your data set, schedule a short demo!