Varada’s CEO, Eran Vanounou, chats with TWiET’s Louis Maresca, Brian Chee, and Curt Franklin about how to solve modern data challenges by running analytics directly on the data lake. Varada’s secret sauce is autonomous indexing, which transforms the data lake into a destination for interactive analytics without compromising on agility, flexibility, or cost.
Organizations store data in all sorts of places, in the cloud and on-prem. For the last two decades, data warehouses have been the default destination for big data analytics. This mindset required business teams to define what they expect in advance, for example slicing and dicing data per product, country, region, price point, etc. Data teams would then digest those requirements and model the data to make sure it is optimized for interactive analytics. Optimizing data requires significant effort, including partitioning, caching, cubes, and more. It also means business units are asked to wait days, and often weeks, for data to be available for querying. For many business units this is no longer acceptable. The long time-to-market, coupled with a costly process that is heavily reliant on data engineers, is no longer aligned with the fast pace business teams expect. The result is a transition toward data lakes.
Data lakes deliver many advantages. The first is a single storage destination that is cheap and easy; thanks to public cloud vendors, it has become very easy to get started. In addition, data lakes deliver a true separation between how data is stored and how it is practically consumed. Unlike data warehouses, where there is a tight coupling between the form in which data is stored and how it is consumed, data lakes allow storing data as it is created, in a granular format, without losing any dimensions.
Even so, many organizations are forced to put their data somewhere else for analytics. They pipe a cold path of data to data warehouses or other forms of cloud storage, and separate services then pick it up and run analytics. This is cumbersome and expensive. For many use cases, such as business intelligence, data science, or data applications, relying on data warehouses means a significant loss of the data’s dimensionality. This cold path also blocks real-time analytics, as only 10%-15% of available data is actually moved to operational data warehouses. Until recently, the data lake was predominantly used for experimental and ad hoc analytics. The end result is multiple data silos serving specific use cases, which weighs heavily on both cost and operational efficiency.
Listen to business teams’ needs and one theme emerges: they want access to all the available data for a very wide range of use cases. In a data-driven world, we cannot expect data consumers to indicate in advance which queries they will need to run. This is exactly the concept of agile analytics, and it is accelerating the adoption of data lakes.
The vision of data virtualization on top of the data lake is the ability to consume data without considering its format, source, size, or dimensionality, and, most importantly, without the need to move or copy data. But there is a gap between this vision and how data consumers actually use data lakes.
Varada’s query acceleration layer eliminates this gap. Varada is deployed on top of the data lake and serves any SQL data consumer, including dashboards, BI tools, data apps, data scientists, etc. Varada delivers a suite of dynamic and adaptive acceleration strategies that leverage our secret sauce: autonomous indexing.
To deliver this “magic”, Varada starts with deep workload-level observability, using machine learning to constantly monitor queries that hit the data lake and analyzing their structure and behavior. This includes which tables are frequently used, which tables are joined together, which users and workloads are the heaviest, when queries run, and much more. These insights enable the platform to autonomously decide which queries need to be accelerated, when, and how. To ensure full alignment with business needs, data teams can indicate which workloads have greater business priority; the platform takes this into consideration when accelerating queries, given the budget and overall cluster size.
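As a rough illustration of how such workload-aware decisions might work, here is a minimal, hypothetical Python sketch. The names and the scoring formula are invented for illustration and are not Varada’s actual logic: it ranks tables by query frequency, weighted by business priority, per gigabyte of index, then greedily fills an acceleration budget.

```python
from collections import Counter

def pick_acceleration_candidates(query_log, priorities, budget_gb, index_cost_gb):
    """Hypothetical sketch of budget-constrained acceleration decisions.

    query_log: list of table names, one entry per query touching that table.
    priorities: table -> business-priority multiplier (defaults to 1.0).
    index_cost_gb: table -> estimated index size in GB.
    Returns tables to index, greedily, highest score-per-GB first.
    """
    freq = Counter(query_log)
    # Rank tables by (frequency * priority) per GB of index they would need.
    ranked = sorted(
        freq,
        key=lambda t: freq[t] * priorities.get(t, 1.0) / index_cost_gb[t],
        reverse=True,
    )
    chosen, used = [], 0.0
    for table in ranked:
        cost = index_cost_gb[table]
        if used + cost <= budget_gb:  # only index while budget remains
            chosen.append(table)
            used += cost
    return chosen

# Illustrative workload: "orders" is queried most, but "clicks" is marked
# as twice as important to the business.
log = ["orders"] * 50 + ["clicks"] * 30 + ["audit"] * 5
picked = pick_acceleration_candidates(
    log,
    priorities={"clicks": 2.0},
    budget_gb=10,
    index_cost_gb={"orders": 6, "clicks": 4, "audit": 1},
)
```

With these made-up numbers, the high-priority "clicks" table is indexed first, "orders" still fits the budget, and the rarely queried "audit" table is skipped.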
Varada’s acceleration strategies combine indexing and caching to deliver the “holy grail” of data access: very rapid responses to ad hoc queries. Unlike in-memory, cache-only solutions, which are restricted in size, very expensive, and demand significant data engineering effort, Varada’s indexing technology resides atop the Presto (Trino) distributed query engine, which eliminates the need to move data. It enables querying many data sources, so data consumers can, for example, slice and dice data from AWS S3 together with data from MySQL, and it scales easily. This is a modern approach that truly embraces the data lake architecture. The platform automatically decides which data and queries to accelerate, and how, to deliver the fastest possible performance. Autonomous acceleration gives data teams a zero-dataops solution that works for them without taking away control. Data teams can now think strategically rather than tactically, and focus on expanding data lake analytics use cases, setting priorities, and setting budgets.
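To make the indexing idea concrete, here is a toy, self-contained sketch of block-level index pruning. It is purely illustrative and not Varada’s implementation: an index maps each value to the data blocks that contain it, so a selective query touches only a fraction of the data instead of scanning everything.

```python
def build_block_index(blocks):
    """Map each value to the set of block ids that contain it."""
    index = {}
    for block_id, rows in enumerate(blocks):
        for value in rows:
            index.setdefault(value, set()).add(block_id)
    return index

def query_with_index(blocks, index, value):
    """Read only the blocks the index says may match the predicate."""
    hits, scanned = [], 0
    for block_id in sorted(index.get(value, ())):
        scanned += 1
        hits.extend(v for v in blocks[block_id] if v == value)
    return hits, scanned

# Four small "data blocks"; the value "a" appears in only two of them.
blocks = [["a", "b"], ["c", "c"], ["a", "d"], ["e", "f"]]
idx = build_block_index(blocks)
hits, scanned = query_with_index(blocks, idx, "a")
```

Here the query for "a" reads 2 of the 4 blocks; a full scan would read all 4. Real columnar indexes are far more sophisticated, but the skip-what-you-can principle is the same.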
To get started easily, go to AWS Marketplace and spin up a free single-node instance to see Varada’s magic. Within a few hours the platform will automatically start implementing acceleration strategies, and queries will run dramatically faster, from 5x to as much as 100x. The more selective and complex the queries, the more impactful indexing is.
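The selectivity effect can be seen with simple back-of-envelope arithmetic. The numbers below are illustrative, assuming a block-skipping index that reads only the blocks containing the queried value:

```python
def index_speedup(total_blocks, matching_blocks):
    """Rough scan speedup from block skipping: a full scan reads every
    block, while an index reads only the blocks that may match."""
    return total_blocks / matching_blocks

# A common value appearing in half the blocks barely benefits...
common = index_speedup(total_blocks=10_000, matching_blocks=5_000)
# ...while a rare, highly selective value is sped up dramatically.
rare = index_speedup(total_blocks=10_000, matching_blocks=100)
```

This is why the speedup range is so wide: low-selectivity queries still have to read most of the data, while highly selective ones skip almost all of it.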
You can also schedule a short demo to see Varada in action on your data set!