Read part I of the blog here.
Presto-based AWS Athena is a good example of a zero data-ops implementation, perfect for ad hoc queries and initial experiments with data virtualization. However, in these kinds of native implementations, users are often driven to augment the data stack due to a lack of control over performance and cost.
Data virtualization platforms tend to work exceptionally well out of the box until users run into performance issues. Some platforms, such as Presto, offer powerful tools for optimization but don’t include any tooling for analyzing performance issues.
As a result, when data virtualization deployments gain adoption in an organization, users run into roadblocks trying to productionize the system. For example, some queries incur expensive and time-consuming scans, which means users can’t reliably power real-time dashboards. Users also struggle to get predictable query times when issuing complicated joins.
Administrators are tasked with acquiring specific domain expertise to understand the data that users are querying and to optimize users’ workflows. Administrators also struggle to help users because there is no good way to analyze what is going on under the covers. Since data sets and use cases change quickly, any hard-earned gains from manual optimization go out the window. As a result, there are practical limits on how broadly users can adopt such platforms for access to data in the data lake. While the benefits of using data virtualization solutions still outweigh the limitations, these issues are what ultimately lead organizations to abandon the pure data lake strategy and adopt a hybrid approach, duplicating data into a data warehouse. Instead of being the unifying solution promised by data virtualization, the data lake ends up being relegated to yet another data silo.
Introducing a data warehouse instantly voids all of these benefits, as data engineering teams suddenly need to deal with data migration, consistency, and multiple permission models, while users struggle to find data across multiple data catalogs.
As the data lake becomes a key priority for many organizations, data platform teams are tasked with finding creative solutions to minimize pre-consumption ETLs.
To help with that, Varada has introduced a Presto-based query engine that gives data lake administrators the power to seamlessly optimize their data virtualization architecture. Varada lets you run a full-scale, production-grade data virtualization solution without needing to resort to an add-on data warehouse or hand-optimize every query. Best of all, Varada runs directly in your VPC, with an easy initial deployment through AWS Marketplace. Your data does not need to be moved or duplicated.
Interested in learning more? We’ll be happy to show you how Varada works! Click here to schedule a short demo.