Big Data Indexing Dramatically Accelerates Athena Query Performance

David Krakov
By David Krakov
November 23, 2020
November 23, 2020

AWS Athena is a critical building block in any modern data lake architecture. As an AWS native implementation of Presto, Athena offers both the Presto query engine’s versatility and tight integration with other AWS services. Athena lets data engineering teams create direct access to their AWS data sources such as S3, feed data into other AWS services, as well as share and control ad-hoc access to AWS data. Users can build reporting dashboards against original datasets via Athena’s standard Presto endpoint rather than complex and brittle ETL pipelines. Athena also gives administrators centralized control over ad-hoc access to data.

Benchmarking AWS Athena vs. Varada

Schedule A Demo

To see Varada in action on your data set, schedule a short demo!

The Data Lake Architecture Falls Short on Performance

Athena works exceptionally well out of the box until users run into performance issues. Even though core Presto has powerful tools for optimization, a zero devops solution such as Athena doesn’t include any tooling for analyzing performance issues. As a result, when an Athena deployment gains adoption in an organization, users run into roadblocks trying to productionalize the system. 

Varada’s data virtualization technology delivers dramatic query performance and resource utilization uplift and enables data teams to reduce AWS Athena cost while meeting business requirements. Varada offers a Presto-based query engine that gives data lake administrators the power to optimize their Athena based data virtualization architecture. Varada lets you run a full scale production grade data virtualization solution without needing to resort to an add-on data warehouse or hand optimize every query. 

Best of all, Varada runs directly in your VPC, with an easy initial deployment through AWS Marketplace. Users can access everything in the data lake via Varada through the shared catalog using AWS Glue or the Hive metastore. Administrators simply need to make Varada available to users via a standard Presto endpoint.

In this short benchmarking demo video, we ran three queries on AWS Athena and compared them to a Varada cluster. We used different types of queries to illustrate the performance uplift data teams can expect over a wide range of workloads.

For this benchmarking analysis, we used data and queries based on a popular ride sharing application. Queries illustrate use cases used by marketing and product teams for segmentation, user behavior, etc.

Running Queries Directly on the Data Lake 

Queries are leveraging the data lake as the data source, accessing files in parquet format (S3) with snappy compression. The data set includes 27bn rows / 1.5TB (compressed).

Varada Cluster
Varada Cluster: data set includes 27bn row

Cluster Overview

AWS Athena is using 3 months of data, partitioned daily. As a severless solution the actual size of the cluster and the type of EC2 machines used is not transparent to users.

Varada is using a 3-node cluster including two i3.metal workers and one i3.4xlarge coordinator.

Varada Cluster on AWS
Varada Cluster on AWS

Varada runs directly on the data lake within the customer’s VPC so there is absolutely no modeling or partitioning and there is no need to move / duplicate data to an isolated environment.

Varada’s secret sauce and our ability to dramatically accelerate queries is based on our unique big data indexing technology. Columns are automatically indexed according to workload needs.

Eliminate Manual Join Optimizations for SQL Order of Operations

Based on AWS Athena best practices, SQL optimization for joins requires to manually SQL order joins: larger table on the left side of join and the smaller table on the right side. Otherwise, queries often result in “timeouts”.

AWS Athena query fails because the JOIN was not optimized
AWS Athena query fails because the JOIN was not optimized

This is a common mistake by inexperienced users, and often occurs when queries are generated automatically by BI tools.

Varada leverages CBO and dynamic filtering to accelerate joins, and integrates advanced CBO with tables statistics. Dynamic filtering accelerates joins by orders of magnitude. These advancements enable to avoid timeouts and manual query re-writes to support a wide range of SQL consumers.

Predictable Cost Structure for Queries on the Data Lake

As a serverless solution, AWS Athena delivers a true zero devops deployment, reducing all barriers to entry. But when more and more data consumers onboard the platform, the full-scan nature of query execution and the fact pricing is based solely on data scanner, often spiral out of control. This unpredictable and often very pricey service can prevent many business units from leveraging the benefits of the data lake architecture.

Varada introduces a very simple pricing model, which is based on the size of the cluster deployed. By introducing a smart big data indexing technology, Varada eliminates the need for full scans in many queries, reducing data scanned from 100s of GBs to 100s of KBs. Overall Varada can reduce total cost of ownership (TCO) by 40%-60%.

To see Varada in action on your data set, schedule a short demo!

We use cookies to improve your experience. To learn more, please see our Privacy Policy