Data lakes have emerged as the standard for big data management, and data virtualization has become the predominant approach for versatile, large-scale analytics. Data virtualization is achieved by layering a query engine on top of a data lake. Pioneered by early open-source data technologies such as Hive on Hadoop, modern data virtualization rivals the most powerful data warehouse technologies and holds one critical advantage: rather than moving data out of the data lake into separate data warehouses for analysis, it gives users a single view of all their data in place.
As the leading hosted Presto-based query engine for data virtualization, AWS Athena offers a wide range of capabilities. It runs directly on AWS-based data lakes, with no data duplication, no data movement, and no issues with data consistency. As a result, the operational overhead of Athena is very low for basic data virtualization use cases. What drives users to augment Athena is a lack of control over performance and cost. The usual remedy, introducing a data warehouse, instantly voids all of these benefits: data engineering teams suddenly need to deal with data migration, consistency, and multiple permission models, and users struggle to find data across multiple data catalogs.
As Athena administrators know, Athena works exceptionally well out of the box until users run into performance issues. Even though core Presto has powerful tools for optimization, a zero-devops solution such as Athena doesn’t include any tooling for analyzing performance issues. As a result, when an Athena deployment gains adoption in an organization, users run into roadblocks when they try to move the system into production. For example, some queries incur expensive and time-consuming scans, which means users can’t reliably power real-time dashboards. Users also struggle to get predictable query times when issuing complicated joins.
Varada offers a Presto-based query engine that gives data lake administrators the power to optimize their Athena-based data virtualization architecture. Varada lets you run a full-scale, production-grade data virtualization solution without resorting to an add-on data warehouse or hand-optimizing every query. Best of all, Varada runs directly in your VPC through the AWS Marketplace offering. Users can access everything in the data lake via Varada through the shared catalog, using AWS Glue or the Hive metastore. Administrators simply need to make Varada available to users via a standard Presto endpoint.
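To make "a standard Presto endpoint" concrete, here is a minimal sketch of how a client might prepare a query for such an endpoint using the Presto REST protocol, in which a statement is submitted as a POST to /v1/statement. The host name, port, user, catalog, schema, and table below are all hypothetical placeholders, not values from a real Varada deployment, and the sketch only builds the request without sending it. In practice most users would connect through an existing Presto client, JDBC driver, or BI tool rather than raw HTTP.

```python
import urllib.request


def build_statement_request(host, port, sql, user, catalog, schema):
    """Build (but do not send) a Presto REST request that submits one SQL statement."""
    url = f"http://{host}:{port}/v1/statement"
    headers = {
        "X-Presto-User": user,        # identity the query runs as
        "X-Presto-Catalog": catalog,  # default catalog for unqualified table names
        "X-Presto-Schema": schema,    # default schema for unqualified table names
    }
    # The request body is the raw SQL text.
    return urllib.request.Request(
        url, data=sql.encode("utf-8"), headers=headers, method="POST"
    )


# Hypothetical endpoint and table names, for illustration only.
request = build_statement_request(
    host="varada.example.internal",
    port=8080,
    sql="SELECT count(*) FROM events",
    user="analyst",
    catalog="hive",
    schema="default",
)

# Sending the request with urllib.request.urlopen(request) would return JSON
# containing a "nextUri" field that the client polls until results are complete.
print(request.get_method(), request.full_url)
```

Because the endpoint speaks the same protocol as any other Presto cluster, pointing an existing client at it should only require changing the host and port, assuming the catalog and schema names are shared through AWS Glue or the Hive metastore as described above.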
To deliver dramatic performance gains while maintaining a zero-devops and zero-data-ops footprint, Varada uses several unique technologies that address the gaps that often lead people to abandon a purely Athena-based architecture. Varada’s dynamic and adaptive indexing technology accelerates relevant queries automatically, without adding overhead to query processing or requiring background data maintenance. The indexing works transparently for users, and indexes are managed automatically by Varada’s proprietary cost-based optimizer extensions, which identify which queries to accelerate and which indexes to maintain.
Varada includes out-of-the-box native support for all community-supported Presto connectors, giving access to a wide array of data sources. The Varada query engine also expands upon the open-source Presto query engine by adding enterprise-grade high availability for both the Coordinator and the Workers, so both can withstand node failures. Varada’s cost-based optimizer extends the basic optimizer with knowledge of how and when to accelerate queries with inline indexes. Varada Workers can also auto-scale based on dynamic workload and administrator configuration.
Whether you’re considering more cost-effective architectures for a cloud data lake or have already gotten started with Presto and Athena, an AWS Athena-based solution paired with Varada is a strong fit. Athena brings a no-devops query engine to AWS-based data lakes, enabling true data virtualization without costly data duplication and brittle data movement. For large-scale use cases, Varada enables data architects to seamlessly accelerate and optimize workloads to meet specific performance and cost requirements, with zero data-ops and effective resource utilization.
To learn more about how to close the data virtualization gap, download the whitepaper.