Data virtualization revolutionized the data infrastructure space by serving
data consumers directly on top of the data lake, without the need to move data elsewhere. At its core, data virtualization was born to deliver flexibility and quick time-to-market to any data consumer, letting them run any query at any time while mitigating many of the challenges that slowed data teams down.
But data virtualization was not designed to deliver performance, and it falls short of the consistent response times that a wide range of use cases require. Data teams often try to optimize for that, but doing so requires extensive and complex data-ops. Financial governance also turns out to be a significant challenge, as inefficient brute-force scans and compute scaling result in an unpredictable cost structure that tends to spiral quickly out of control.
Recently, a new standard for data virtualization has been emerging. To fully operationalize and monetize the data lake, and to avoid moving data into separate data silos, data virtualization must deliver interactive performance and significantly reduce the data-ops required to optimize for price and performance.
Agility, ease-of-use, and performance are the driving forces of the data
infrastructure space. Companies are shifting attention to support high-velocity development, placing fast time-to-market and maximum flexibility
as top priorities. This new approach requires decoupling data consumption from its preparation. The ultimate flexibility can only be achieved when you don’t need to move or prepare data at all (beyond the basic ETLs, of course).
To ensure predictable and consistent performance, many enterprises
compromise on accessing all their available data and settle for isolated data
silos that have been prepared and modeled to enable speedy queries. But
the best platforms should automatically accelerate queries according to
workload behavior. Data teams should have the ability to define business
priorities and adjust performance and budgets accordingly. This will enable
them to serve a wide range of use cases on a single data platform,
directly on the data lake, and eliminate the need to build separate silos for
each use case.
One of the early query engines developed to support high-performance data
virtualization is Presto. A distributed query engine with support for a wide range of data lake platforms, Presto gives data teams the ultimate versatility. It also delivers the core benefits of data virtualization: no
data duplication, centralized access controls for administrators, and a
shared catalog that makes collaboration easier. Presto stands apart from other
solutions because of its broad support, deep extensibility, and powerful,
standard ANSI SQL language.
Under the covers, Presto processes queries in memory without relying on
slow intermediate disk-based storage. The in-memory pipelined distributed
execution engine ensures highly efficient query processing. Integrated with
the in-memory pipeline processing is a cost-based optimizer. Just as in a data
warehouse, the query optimizer evaluates different execution plans for each query and chooses the best one based on data statistics and available
resources, reducing overall processing time. The latest version of Presto
includes dynamic filtering, which accelerates joins by reducing, by as much as 10x, the amount of data that queries need to process when joining tables. Varada’s engineers added an elegant design for dynamic filtering after realizing that the cost-based optimizer allowed for an implementation on top of a broadcast join: the values from the inner side of the join are pushed down to filter the scan of the larger joined table.
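To make the idea of dynamic filtering concrete, here is a minimal sketch in plain Python. It is not Presto's implementation: the function names (`scan`, `broadcast_hash_join`), the row-as-dict representation, and the sample tables are all hypothetical and chosen only for illustration. The sketch builds a hash index on the smaller table, collects its join-key values, and hands that set to the scan of the larger table so non-matching rows are dropped before they ever reach the join.

```python
def scan(rows, key=None, allowed=None, stats=None):
    """Simulated table scan. If `allowed` is given, rows whose join key
    is not in that set are skipped at scan time (the dynamic filter)."""
    for row in rows:
        if allowed is not None and row[key] not in allowed:
            continue
        if stats is not None:
            stats["rows_emitted"] += 1
        yield row


def broadcast_hash_join(small, large, key, use_dynamic_filter=True):
    """Join `small` and `large` on `key`.

    Returns the joined rows and the number of rows the large-table scan
    had to hand to the join operator.
    """
    # Build phase: index the smaller (broadcast) side by join key.
    index = {}
    for row in scan(small):
        index.setdefault(row[key], []).append(row)

    # The set of build-side keys is pushed down to the probe-side scan.
    allowed = set(index) if use_dynamic_filter else None

    stats = {"rows_emitted": 0}
    joined = []
    for row in scan(large, key, allowed, stats):
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined, stats["rows_emitted"]


# Hypothetical sample data: 2 customers, 100 orders spread over 10 customers.
customers = [
    {"customer_id": 1, "region": "EU"},
    {"customer_id": 2, "region": "EU"},
]
orders = [{"order_id": i, "customer_id": i % 10} for i in range(100)]

filtered_rows, filtered_count = broadcast_hash_join(customers, orders, "customer_id")
plain_rows, plain_count = broadcast_hash_join(
    customers, orders, "customer_id", use_dynamic_filter=False
)
```

Both calls produce the same join result; the difference is how many rows the large-table scan passes along. In a real engine the savings come from skipping I/O and network transfer, which this in-memory sketch does not model.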
Though most data virtualization solutions are able to read any type of data, all of the in-memory processing is optimized around a columnar architecture, which is ideal for analytic queries. When data sources are stored in columnar-optimized formats, platforms can optimize query execution by reading only the fields that are required for any individual query. With these advanced query processing capabilities exposed through standard ANSI SQL and JDBC connectors, it’s obvious why data virtualization solutions have become extremely popular.
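The benefit of reading only the required fields can be illustrated with a small sketch, again in plain Python. The `ColumnarTable` class, its `values_read` counter, and the sample columns are hypothetical, not part of any real engine or file format; they only show that a query touching two of four columns reads half the stored values.

```python
class ColumnarTable:
    """Toy columnar store: each column is kept as its own list."""

    def __init__(self, columns):
        self.columns = columns
        self.values_read = 0  # counts values fetched from "storage"

    def read(self, names):
        """Fetch only the requested columns."""
        result = {}
        for name in names:
            column = self.columns[name]
            self.values_read += len(column)
            result[name] = column
        return result


def sum_amount_for_country(table, country):
    """Roughly: SELECT sum(amount) FROM t WHERE country = ?
    Only the two columns the query references are read."""
    cols = table.read(["country", "amount"])
    return sum(
        amount
        for c, amount in zip(cols["country"], cols["amount"])
        if c == country
    )


# Hypothetical sample table with four columns and four rows.
table = ColumnarTable({
    "order_id": [1, 2, 3, 4],
    "country": ["DE", "US", "DE", "FR"],
    "amount": [10.0, 20.0, 30.0, 40.0],
    "comment": ["a", "b", "c", "d"],
})

total = sum_amount_for_country(table, "DE")
```

Here the query reads 8 of the 16 stored values, because `order_id` and `comment` are never touched. A row-oriented layout would have to read every field of every row to answer the same question.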
Data virtualization is rapidly gaining momentum due to its agility and ease-of-use. But data teams are now struggling with optimizing performance and controlling spiraling costs.