3 Common Pitfalls to Embracing the Data Lake Architecture with AWS Athena

By Roman Vainbrand
December 18, 2020

Data lakes have emerged as the standard for big data management, with data virtualization becoming the predominant approach for versatile large-scale analytics. Data virtualization is achieved by layering a query engine on top of a data lake, which gives users direct access to any data they need. With less ETL overhead, there’s no waiting around for the data engineering team to load a separate data warehouse, change access control settings and correct the inevitable mistakes in data transformation.

Pioneered by early open-source data technologies such as Hive on Hadoop, modern day data virtualization rivals the most powerful data warehouse technologies with one critical advantage: rather than moving data from a data lake out to separate data warehouses for analysis, data virtualization gives users the ultimate flexibility to access a single view of all their data. 

AWS Athena Leverages Presto to Deliver Modern Data Virtualization

Presto was created to deliver a distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. Unlike other solutions, Presto doesn’t require you to pre-load the data from its sources. Presto can connect to all common SQL and NoSQL databases as well as read directly from HDFS and S3.

This leads to significant advantages for using Presto over an ETL process:

  • Saves you the overhead of authoring and maintaining various ETLs. Not only do you save time and money now, but you’re also better positioned for future changes, since it is much easier to respond to changing business requirements.
  • The data is as fresh as possible.
  • All the data is accessible in its most granular form. Most ETLs will result in data granularity loss, leaving only the “important” parts of the data. Often this results in the loss of the ability to drill down.

Due to its unique advantages, Presto has quickly become a tool of choice for many data driven companies.

As the leading hosted Presto based query engine for data virtualization, AWS Athena offers a wide range of capabilities. It runs directly on an AWS based data lake, with no data duplication, no data movement, and no issues with data consistency. Athena lets data engineering teams create direct access to their AWS data sources such as S3, feed data into other AWS services, as well as share and control ad-hoc access to AWS data. Users can build reporting dashboards against original datasets via Athena’s standard Presto endpoint rather than going through complex and brittle ETL pipelines.

As a result, the operational overhead of Athena is incredibly low for basic data virtualization use cases. Indeed, Athena has become a critical building block in any modern data lake architecture. As an AWS native implementation of Presto, Athena offers both the versatility of the Presto query engine and tight integration with other AWS services.

Challenge #1:
Serve Interactive Queries Without Compromising on Agility

As Athena administrators know, Athena works exceptionally well out of the box, until users run into performance issues. As an Athena deployment gains adoption in an organization, users hit roadblocks when they try to move workloads into production.

For example, some queries incur expensive and time consuming scans, which means users can’t reliably power real time dashboards. Users also run into issues getting predictable query times when issuing complicated joins.

As a result, many data teams revert to a data warehouse, which instantly voids the benefits of data virtualization: data engineering teams suddenly need to deal with data migration, consistency and multiple permission models, while users struggle to find data across multiple data catalogs.

These common performance issues have limited the spectrum of relevant use cases that can run on Athena, which often ends up as a platform only for ad hoc queries and not business-critical analytics.

Challenge #2:
Visibility & Control

Even though core Presto has powerful tools for optimization, a zero devops solution such as Athena doesn’t include any tooling for analyzing performance issues.
Admins don’t have access to continuous monitoring and deep visibility into how workloads perform, how resources are used on an hourly / weekly basis, who are the heavy spenders, what is the “hottest data”, etc.
Athena administrators often find they need domain expertise to understand the data that users are querying and to optimize users’ workflow.
Administrators struggle to help users because there is no good way to analyze what Athena is doing under the covers. Since data sets and use cases change quickly, any hard-earned gains from manual optimization go out the window.

Challenge #3:
Full Scans Often Result in Spiraling & Unpredictable Costs

Athena is priced based on data scanned, which seems simple and easy to understand. But where it shines on simplicity, it fails on predictability.
Data teams often struggle to estimate the level of performance users should expect and don’t have the tools to estimate cost and ensure it matches the allocated budget.
Unlike other popular data platforms, Athena doesn’t include query acceleration options that offer users and data platform teams the ability to consistently meet performance and concurrency requirements.
The “serverless” nature of Athena brings tremendous benefits in ease of use, but when it comes to managing budgets and business requirements, data teams are forced to manually optimize data even before it hits the platform, which again misses the goal of a modern data lake architecture.
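To make the scan-based pricing model concrete, here is a minimal back-of-the-envelope sketch in Python. The price per terabyte and the per-query billing minimum are assumptions based on commonly published Athena list pricing and may differ by region or over time. The function names, the binary definition of a terabyte, and the workload numbers are hypothetical and chosen only for illustration.

```python
# Rough cost model for scan-based pricing.
# ASSUMPTIONS (not from the article): list price of $5 per TB scanned and a
# 10 MB billing minimum per query. Check current AWS pricing before relying
# on these numbers. Rounding to the nearest megabyte is ignored here.
PRICE_PER_TB_USD = 5.00
MIN_BYTES_PER_QUERY = 10 * 1024**2
BYTES_PER_TB = 1024**4  # assumes binary terabytes


def query_cost_usd(bytes_scanned: int) -> float:
    """Estimated cost of a single query from the bytes it scans."""
    billed_bytes = max(bytes_scanned, MIN_BYTES_PER_QUERY)
    return billed_bytes / BYTES_PER_TB * PRICE_PER_TB_USD


def monthly_cost_usd(bytes_per_query: int, queries_per_day: int, days: int = 30) -> float:
    """Estimated monthly cost of a repeated query, e.g. a dashboard refresh."""
    return query_cost_usd(bytes_per_query) * queries_per_day * days


# Hypothetical workload: a dashboard query issued 200 times a day.
full_scan = monthly_cost_usd(2 * 1024**4, queries_per_day=200)   # scans a 2 TB table
pruned_scan = monthly_cost_usd(20 * 1024**3, queries_per_day=200)  # scans 20 GB
print(f"full scan:   ${full_scan:,.2f} per month")
print(f"pruned scan: ${pruned_scan:,.2f} per month")
```

The same query costs two orders of magnitude more when it falls back to a full scan, which is why costs are hard to predict unless data teams know in advance how much data each query will touch.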

Varada’s Fresh Approach to Data Virtualization Puts Indexing at the Front

Varada’s dynamic indexing technology eliminates the need for full scans and can accelerate queries automatically without any overhead to query processing or any background data maintenance.
Users see performance benefits when filtering, joining and aggregating data. Varada transparently applies indexes to any SQL WHERE clause, on any column, within an SQL statement. Indexes are used for point lookups, range queries and string matching of data in nanoblocks.

Varada automatically detects and uses indexes to accelerate JOINs using the index of the key column. Varada indexes can be used for dimensional JOINs combining a fact table with a filtered dimension table, for self-joins of fact tables based on time or any other dimension as an ID, and for joins between indexed data and federated data sources. SQL aggregations and grouping are accelerated using nanoblock indexes as well.

Varada’s indexing works transparently for users and indexes are managed automatically by Varada’s proprietary cost-based optimizer extensions. Varada’s unique indexing efficiently indexes data directly from the data lake across any column so that every query is optimized automatically. Varada indexes adapt to changes in data over time by splitting each column into small chunks, called nanoblocks.

To ensure fast performance for every query, Varada dynamically selects from a set of indexing algorithms and indexing parameters that adapt and evolve as data changes, so that each data nanoblock gets the best-fit index.
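The general idea of splitting a column into small chunks and indexing each chunk can be illustrated with a toy Python sketch. This is not Varada's implementation, which is proprietary and not described in detail here. The `Chunk` class, the function names, the chunk size and the choice of a min/max summary plus a value-to-position map are all assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """A small slice of one column plus a simple per-chunk index (hypothetical)."""
    start_row: int
    values: list
    lo: int
    hi: int
    positions: dict  # value -> list of row offsets inside this chunk


def build_chunks(column, chunk_size=4):
    """Split a column into fixed-size chunks and index each one."""
    chunks = []
    for start in range(0, len(column), chunk_size):
        values = column[start:start + chunk_size]
        positions = {}
        for offset, value in enumerate(values):
            positions.setdefault(value, []).append(offset)
        chunks.append(Chunk(start, values, min(values), max(values), positions))
    return chunks


def point_lookup(chunks, target):
    """Rows where column == target, plus the number of chunks actually probed."""
    rows, probed = [], 0
    for chunk in chunks:
        if target < chunk.lo or target > chunk.hi:
            continue  # min/max summary rules this chunk out without reading it
        probed += 1
        rows.extend(chunk.start_row + off for off in chunk.positions.get(target, []))
    return rows, probed


def range_lookup(chunks, low, high):
    """Rows where low <= column <= high, skipping chunks that cannot match."""
    rows = []
    for chunk in chunks:
        if chunk.hi < low or chunk.lo > high:
            continue
        rows.extend(chunk.start_row + i
                    for i, value in enumerate(chunk.values) if low <= value <= high)
    return rows


column = [3, 5, 4, 3, 10, 12, 11, 10, 20, 25, 3, 21]
chunks = build_chunks(column)
print(point_lookup(chunks, 3))
print(range_lookup(chunks, 10, 12))
```

Because each chunk carries its own index, a filter only touches the chunks that can contain matching rows, and a chunk can be re-indexed independently when its data changes.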

The Ultimate Data Democratization Solution: Serving a Wide Range of Queries and Use Cases

Whether you’re considering more cost-effective architectures for a cloud data lake or have already gotten started with Presto and Athena, you’ll find a lot of success with an AWS Athena based solution paired with Varada.
Athena brings the reality of a no-devops query engine to AWS based data lakes, enabling true data virtualization without costly data duplication and brittle data movement.
For large-scale use cases, Varada enables data architects to seamlessly accelerate and optimize workloads to meet specific performance and cost requirements with zero data-ops and effective resource utilization.

To see Varada in action, schedule a short demo!
