The Data Virtualization Evolution is Just Beginning

By Eran Vanounou
November 13, 2020

Data virtualization revolutionized the data infrastructure space by serving data consumers directly on top of the data lake, without the need to move data elsewhere. At its core, data virtualization was born to deliver flexibility and quick time-to-market to any data consumer, mitigating many of the challenges that slow down data teams while making it easy to run any query at any time.

But data virtualization was not designed to deliver performance, and it falls short of meeting the expectations of a wide range of use cases that require fast, consistent response times. Data teams often try to optimize for this, but doing so requires extensive and complex data-ops. Financial governance also turns out to be a significant challenge, as inefficient brute-force scans and compute scaling result in an unpredictable cost structure that tends to quickly spiral out of control.

Recently, a new standard for data virtualization has been emerging. In order to fully operationalize and monetize the data lake, and to avoid moving data into separate data silos, data virtualization must deliver interactive performance while significantly reducing the data-ops required to optimize for price and performance.

The Innovation in Data Infrastructure is Driven by the Need for Simplicity and Agility

Agility, ease-of-use, and performance are the driving forces of the data infrastructure space. Companies are shifting their attention to supporting high-velocity development, placing fast time-to-market and maximum flexibility at the top of their priorities. This new approach requires decoupling data consumption from data preparation. Ultimate flexibility can only be achieved when you don’t need to move or prepare data at all (beyond the basic ETLs, of course).

To ensure predictable and consistent performance, many enterprises compromise on accessing all of their available data and settle for isolated data silos that have been prepared and modeled to enable speedy queries. But the best platforms should automatically accelerate queries according to workload behavior. Data teams should have the ability to define business priorities and adjust performance and budgets accordingly. This will enable them to serve a wide range of use cases on a single data platform, directly on the data lake, and eliminate the need to build separate silos for each use case.

Presto is a Good Start for Optimizing Cost & Performance

One of the early query engines developed to support high-performance data virtualization is Presto. A distributed query engine with support for a wide range of data lake platforms, Presto gives data teams the ultimate versatility. It also delivers the core benefits of data virtualization: no data duplication, centralized access controls for administrators, and a shared catalog that makes collaboration easier. Presto stands apart from other solutions because of its broad support, deep extensibility, and powerful standard ANSI SQL interface.
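
As a sketch of what that versatility looks like in practice (the catalog, schema, and table names here are hypothetical and depend on how the connectors are configured), a single ANSI SQL statement can join data-lake files with a live operational database:

```sql
-- A hypothetical federated Presto query: data on the lake and data in an
-- operational database are joined in one statement, with no data movement.
SELECT o.order_id,
       c.customer_name,
       o.order_total
FROM hive.sales.orders o               -- e.g., Parquet files on the data lake
JOIN postgresql.public.customers c     -- e.g., an operational PostgreSQL database
  ON o.customer_id = c.customer_id
WHERE o.order_date >= DATE '2020-11-01';
```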

Under the covers, Presto processes queries in memory, without relying on slow intermediate disk-based storage. This in-memory, pipelined, distributed execution engine ensures highly efficient query processing. Integrated with the in-memory pipeline processing is a cost-based optimizer (CBO). Just as in a data warehouse, the query optimizer evaluates different execution plans for each query and chooses the best one based on data statistics and available resources, reducing overall processing time. The latest version of Presto also includes dynamic filtering, which accelerates join performance, by as much as 10x, by reducing the amount of data that queries need to process when joining tables.
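
To see what the cost-based optimizer decided for a given query, Presto’s EXPLAIN statement can be used; the tables below are hypothetical:

```sql
-- EXPLAIN prints the plan selected by the cost-based optimizer, including
-- the join order, the join distribution (broadcast vs. partitioned), and
-- estimated row counts for each plan node.
EXPLAIN
SELECT f.store_id, SUM(f.sale_amount) AS total_sales
FROM sales_fact f
JOIN store_dim s ON f.store_id = s.store_id
WHERE s.store_region = 'EMEA'
GROUP BY f.store_id;
```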

Varada’s engineers added an elegant design for dynamic filtering after realizing that the cost-based optimizer made a broadcast-join implementation possible, in which the join-key values from the smaller (inner) side are pushed down to filter the scan phase of the larger joined table.
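
Using the same hypothetical tables as above, this is the query shape that benefits: the selective filter on the small dimension table yields a short list of join-key values, which are pushed down to the scan of the large fact table so that most of its data is skipped rather than scanned and discarded.

```sql
-- Dynamic filtering in action: the store_id values matching the dimension
-- filter are broadcast and applied during the scan of the fact table.
SELECT s.store_region,
       SUM(f.sale_amount) AS total_sales
FROM sales_fact f              -- large fact table
JOIN store_dim s               -- small dimension table
  ON f.store_id = s.store_id
WHERE s.store_region = 'EMEA'  -- selective dimension filter
GROUP BY s.store_region;
```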

Powerful Query Engine

Though most data virtualization solutions are able to read any type of data, all of the in-memory processing is optimized around a columnar architecture, which is ideal for analytic queries. Combined with data sources that are stored in columnar-optimized formats, platforms can optimize query execution by reading only the fields that are required for any individual query. With these advanced query processing capabilities exposed through standard ANSI SQL and JDBC connectors, it’s obvious why data virtualization solutions have become extremely popular.
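
For instance, against a columnar format such as Parquet or ORC, a query that names only the columns it needs avoids reading the rest of the table entirely (the events table here is hypothetical):

```sql
-- Only user_id and event_time are read from the columnar files; all other
-- columns of the events table are skipped. A SELECT * would read them all.
SELECT user_id, event_time
FROM events
WHERE event_time >= DATE '2020-11-01';
```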

Data virtualization is rapidly gaining momentum due to its agility and ease-of-use. But data teams are now struggling with optimizing performance and controlling spiraling costs.

Tackling Data Virtualization Performance Issues

Presto-based AWS Athena is a good example of a zero data-ops implementation, perfect for ad hoc queries and initial experiments with data virtualization. However, with these kinds of native implementations, users are often driven to augment the data stack due to a lack of control over performance and cost.
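
As a minimal sketch of that zero data-ops experience (the bucket and table names are hypothetical), Athena only needs an external table definition over files already sitting in S3 before ad hoc querying can begin:

```sql
-- Define an external table over existing S3 data; nothing is moved or loaded.
CREATE EXTERNAL TABLE web_logs (
    request_time timestamp,
    url          string,
    status_code  int
)
STORED AS PARQUET
LOCATION 's3://example-bucket/web-logs/';  -- hypothetical bucket

-- Ad hoc exploration can start immediately.
SELECT status_code, COUNT(*) AS hits
FROM web_logs
GROUP BY status_code
ORDER BY hits DESC;
```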

Data virtualization platforms tend to work exceptionally well out of the box, until users run into performance issues. Some platforms, such as Presto, offer powerful tools for optimization but don’t include any tooling for analyzing performance issues.

As a result, when data virtualization deployments gain adoption in an organization, users run into roadblocks trying to productionalize the system. For example, some queries incur expensive and time-consuming scans, which means users can’t reliably power real-time dashboards. Users also run into issues getting predictable query times when issuing complicated joins.

Administrators are tasked with acquiring specific domain expertise to understand the data that users are querying and to optimize users’ workflows. Administrators also struggle to help users because there is no good way to analyze what is going on under the covers. And since data sets and use cases change quickly, any hard-earned gains from manual optimization go out the window. As a result, there are practical limits on how broadly users can adopt such platforms for access to data in the data lake.

While the benefits of using data virtualization solutions still outweigh the limitations, these issues are what ultimately lead organizations to abandon the pure data lake strategy and adopt a hybrid approach, duplicating data into a data warehouse. Instead of being the unifying solution promised by data virtualization, the data lake ends up being relegated to yet another data silo.

Will Data Virtualization End Up as Yet Another Data Silo?

Introducing a data warehouse instantly voids all of these benefits: data engineering teams suddenly need to deal with data migration, consistency, and multiple permission models, while users struggle to find data across multiple data catalogs.

Avoiding Data Duplication is Critical

As the data lake becomes a key priority for many organizations, data platform teams are tasked with finding solutions that minimize pre-consumption ETLs.

To help with that, Varada has introduced a Presto-based query engine that gives data lake administrators the power to seamlessly optimize their data virtualization architecture. Varada lets you run a full-scale, production-grade data virtualization solution without needing to resort to an add-on data warehouse or hand-optimize every query. Best of all, Varada runs directly in your VPC, with an easy initial deployment through AWS Marketplace. Your data does not need to be moved or duplicated.

Users can access everything in the data lake via Varada through the shared
catalog using AWS Glue or the Hive metastore. Administrators simply need
to make Varada available to users via a standard Presto endpoint.
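
Because the endpoint is standard Presto, any existing SQL client or BI tool can connect and browse the shared catalog as usual (the schema name below is hypothetical):

```sql
-- Standard Presto discovery statements work unchanged against the endpoint.
SHOW CATALOGS;                     -- e.g., hive, backed by AWS Glue or the Hive metastore
SHOW SCHEMAS FROM hive;
SHOW TABLES FROM hive.analytics;   -- hypothetical schema
```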

In order to achieve these dramatic performance gains while maintaining a zero-devops and zero-dataops footprint, Varada uses several unique technologies that address the gaps which often lead people to abandon a traditional data virtualization architecture. Varada’s dynamic and adaptive indexing technology is able to accelerate relevant queries automatically, without adding overhead to query processing or requiring any background data maintenance. Varada’s indexing works transparently for users, and indexes are managed automatically by Varada’s proprietary cost-based optimizer extensions, which identify which queries to accelerate and which indexes to maintain.


The Power of Big Data Indexing

Varada’s unique indexing technology efficiently indexes data directly from the data lake, across any column, so that every query is optimized automatically. Varada’s indexes adapt to changes in data over time, taking advantage of Presto’s vectorized columnar processing by splitting columns into small chunks, called nanoblocks™. Based on the data type, structure, and distribution of the data in each nanoblock, Varada automatically creates an optimal index. To ensure fast performance for every query and every nanoblock, Varada automatically selects from a set of indexing algorithms and indexing parameters that adapt and evolve as data changes, ensuring a best-fit index for any data nanoblock.

At query time, when running through the Varada endpoint, users see transparent performance benefits when filtering, joining, and aggregating data. Varada transparently applies indexes to any SQL WHERE clause, on any column, within a SQL statement. Indexes are used for point lookups, range queries, and string matching on data in nanoblocks. Varada also automatically detects and uses indexes to accelerate JOINs via the index on the key column: Varada indexes can be used for dimensional JOINs combining a fact table with a filtered dimension table, for self-joins of fact tables based on time or any other dimension used as an ID, and for joins between indexed data and federated data sources. SQL aggregations and grouping are accelerated using nanoblock indexes as well.
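
A sketch of the query shapes this covers, using hypothetical tables; the SQL itself is unchanged, and only the execution path differs:

```sql
-- Point lookup and range filter, both served from nanoblock indexes:
SELECT *
FROM events
WHERE user_id = 12345                         -- point lookup
  AND event_time BETWEEN DATE '2020-11-01'
                     AND DATE '2020-11-07';   -- range query

-- String matching on an indexed column:
SELECT COUNT(*)
FROM events
WHERE url LIKE '%/checkout%';

-- Dimensional join accelerated via the index on the join-key column:
SELECT c.campaign_name, COUNT(*) AS clicks
FROM events e
JOIN campaigns c ON e.campaign_id = c.campaign_id
WHERE c.campaign_type = 'email'
GROUP BY c.campaign_name;
```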

The Next Generation of CBO

Varada takes the CBO to the next level by automatically analyzing and introducing indexes for filters, joins, and aggregations; by continuously reanalyzing query performance on the fly; and by balancing resources across the entire system.

Varada uses machine learning to decide when and what to optimize. With the benefit of lightweight indexing, Varada is able to use intelligent and elastic resource allocation and to leverage intermediate results. The resulting cost model is exposed to administrators and users, who can then prioritize specific user queries.


Interested in learning more? We’ll be happy to show you how Varada works! Click here to schedule a short demo.
