Users are embracing data virtualization because the architecture gives direct access to any data they need. With less ETL overhead, there’s no waiting around for the data engineering team to load a separate data warehouse, change access control settings, and correct the inevitable mistakes in data transformation. Amazon Web Services Athena has become a critical building block in any modern data lake architecture. As an AWS-native implementation of Presto, Athena offers both the versatility of the Presto query engine and tight integration with other AWS services.
Athena lets data engineering teams create direct access to their AWS data sources such as S3, feed data into other AWS services, and share and control ad-hoc access to AWS data. Users can build reporting dashboards against original datasets via Athena’s standard Presto endpoint rather than going through complex and brittle ETL pipelines. Yet as Athena administrators know, Athena works exceptionally well out of the box until users run into performance issues. Even though core Presto has powerful tools for optimization, a zero-ops solution such as Athena doesn’t include any tooling for analyzing performance issues.
Varada has introduced a Presto-based query engine that gives data lake administrators the power to optimize their Athena-based data virtualization architecture and ensure they can meet all their performance requirements. In the long term, it will enable organizations to leverage data virtualization for a very wide range of use cases, including low-latency workloads. Varada uses several unique technologies that address the gaps that often lead people to abandon a purely Athena-based architecture.
We ran a set of 19 queries on AWS Athena and compared them to a Varada cluster. We used different use cases to illustrate the performance uplift data teams can expect over a wide range of workloads.
On average, Varada delivered response times 30x faster than Athena, and as much as 70x faster on some queries, with a minimal compute footprint. On full scan queries, Varada ran slower than Athena because of the small cluster.
Click here to view the detailed benchmarking analysis and report.
Varada’s dynamic and adaptive indexing technology is able to accelerate queries automatically without any overhead to query processing or any background data maintenance. Users see performance benefits when filtering, joining and aggregating data. Varada transparently applies indexes to any SQL WHERE clause, on any column, within an SQL statement. Indexes are used for point lookups, range queries and string matching of data in nanoblocks. Varada automatically detects and uses indexes to accelerate JOINs using the index of the key column. Varada indexes can be used for dimensional JOINs combining a fact table with a filtered dimension table, for self-joins of fact tables based on time or any other dimension as an ID, and for joins between indexed data and federated data sources. SQL aggregations and grouping are accelerated using nanoblock indexes as well.
Varada’s indexing works transparently for users, and indexes are managed automatically by Varada’s proprietary cost-based optimizer extensions. Varada indexes data directly from the data lake, across any column, so that every query is optimized automatically. Varada indexes adapt to changes in data over time, taking advantage of Presto’s vectorized columnar processing by splitting each column into small chunks, called nanoblocks. To ensure fast performance for every query and each nanoblock, Varada dynamically selects from a set of indexing algorithms and indexing parameters that adapt and evolve as data changes, so that each data nanoblock gets a best-fit index.
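To make the nanoblock idea more concrete, here is a minimal Python sketch of per-chunk index selection. It is purely illustrative and is not Varada’s implementation: the chunk size, the cardinality threshold, the two index kinds ("value_map" and "sorted") and every function name are assumptions chosen for readability. The sketch splits a column into small chunks, picks an index type for each chunk based on how many distinct values it holds, and then answers a point lookup (the equivalent of a WHERE column = value filter) from the indexes alone.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass
from typing import Any, List, Sequence

CHUNK_SIZE = 4            # hypothetical nanoblock size, tiny for readability
MAX_MAP_CARDINALITY = 2   # hypothetical threshold for choosing the index kind


@dataclass
class Nanoblock:
    kind: str    # "value_map" or "sorted" (illustrative index kinds)
    index: Any   # dict for "value_map", (keys, rows) tuple for "sorted"


def build_index(values: Sequence, start: int) -> Nanoblock:
    """Pick an index type for one chunk based on its distinct-value count."""
    if len(set(values)) <= MAX_MAP_CARDINALITY:
        # Few distinct values: map each value to the rows that hold it.
        value_map = {}
        for offset, value in enumerate(values):
            value_map.setdefault(value, []).append(start + offset)
        return Nanoblock("value_map", value_map)
    # Many distinct values: keep sorted keys so lookups can binary-search.
    pairs = sorted((value, start + offset) for offset, value in enumerate(values))
    keys = [value for value, _ in pairs]
    rows = [row for _, row in pairs]
    return Nanoblock("sorted", (keys, rows))


def index_column(column: Sequence, chunk_size: int = CHUNK_SIZE) -> List[Nanoblock]:
    """Split a column into chunks and index each chunk independently."""
    return [
        build_index(column[i:i + chunk_size], i)
        for i in range(0, len(column), chunk_size)
    ]


def point_lookup(blocks: List[Nanoblock], target) -> List[int]:
    """Return the row numbers where the column equals target, using only indexes."""
    matches = []
    for block in blocks:
        if block.kind == "value_map":
            matches.extend(block.index.get(target, []))
        else:
            keys, rows = block.index
            lo, hi = bisect_left(keys, target), bisect_right(keys, target)
            matches.extend(rows[lo:hi])
    return sorted(matches)


if __name__ == "__main__":
    column = ["us", "us", "eu", "us", "a", "b", "c", "d", "eu", "eu"]
    blocks = index_column(column)
    print([block.kind for block in blocks])
    print(point_lookup(blocks, "eu"))
```

Because each chunk is indexed on its own, a chunk whose data distribution changes can be re-indexed with a different index kind without touching the rest of the column, which is the adaptive behavior the paragraph above describes in general terms.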
If you are considering a cost-effective data lake stack or have already started running analytics workloads on Presto and Athena, consider augmenting AWS Athena with Varada’s autonomous indexing platform. Athena delivers a true serverless solution for AWS data lakes, enabling any SQL data consumer to run experimental queries with no costly data duplication or data movement. For production workloads that are performance- or budget-sensitive, Varada enables data architects to autonomously accelerate and optimize workloads to meet specific performance and cost requirements directly on the data lake, with no need to model data.
To learn more about how to close the data virtualization gap, download the whitepaper.