Varada clusters run within your Virtual Private Cloud (VPC) and serve as an operational data tier between data consumers and existing data sources, such as columnar data on a data lake. The existing data sources continue to serve as the single source of truth.
Data consumers include any application that generates SQL, such as interactive BI tools, data APIs, and custom decision systems, as well as analysts and data scientists.
Varada’s big data infrastructure platform is deployed within your VPC to ensure optimal control, security and governance. Varada connects directly to a wide range of data sources, including:
Operational datasets are defined using the CREATE MATERIALIZED VIEW SQL command, which determines the dataset definition and lifecycle and applies last-mile ETL. The ability to create operational datasets in a single click enables:
Bottom line, the materialized view approach shifts the focus from building custom data pipelines and optimization to creating a flexible live definition. The materialized view can be easily updated or changed over time to accommodate evolving business needs.
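As a rough illustration of the "live definition" idea, the sketch below composes a generic CREATE MATERIALIZED VIEW statement and then revises it. The helper `materialized_view_sql`, the table `hive.lake.rides`, and all column names are hypothetical, and the statement follows the common `CREATE MATERIALIZED VIEW name AS SELECT ...` form; Varada's exact syntax and options are not shown in this document and may differ.

```python
def materialized_view_sql(name, source, columns, where=None):
    """Compose a generic CREATE MATERIALIZED VIEW statement (illustrative only)."""
    select_list = ", ".join(columns)
    sql = f"CREATE MATERIALIZED VIEW {name} AS SELECT {select_list} FROM {source}"
    if where:
        # A filter is one simple form of last-mile ETL applied by the definition.
        sql += f" WHERE {where}"
    return sql


# Initial definition (all names are assumptions made for this example).
v1 = materialized_view_sql(
    "recent_rides",
    "hive.lake.rides",
    ["ride_id", "city", "fare"],
    where="ride_date >= DATE '2021-01-01'",
)

# A later business need adds a column: only the definition changes,
# no custom pipeline has to be rewritten.
v2 = materialized_view_sql(
    "recent_rides",
    "hive.lake.rides",
    ["ride_id", "city", "fare", "tip"],
    where="ride_date >= DATE '2021-01-01'",
)

print(v1)
print(v2)
```

The point of the sketch is that the dataset is described declaratively, so adapting it is a matter of editing one statement.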
All the data in the data lake, as well as external sources, can also be accessed via a unified semantic layer leveraging query pushdown capabilities.
Varada lets you combine, in SQL (via JOIN, UNION, etc.), data from direct data source connectivity (such as a relational database or data lake) with materialized indexed datasets and their inline indexes.
Virtual views can seamlessly mix data sources and materialized indexed datasets, making it possible to transparently serve data applications and users from different data tiers.
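The pattern of a virtual view that joins a directly connected source with a materialized dataset can be sketched with Python's built-in `sqlite3` module. This is only a stand-in: both tables live in one in-memory SQLite database, whereas in Varada they would sit in different data tiers, and the names `customers`, `orders_mv`, and `revenue_by_region` are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Stand-in for a table reached through a direct connector
# (for example, a relational database).
cur.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT)")
cur.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "EMEA"), (2, "APAC")])

# Stand-in for a materialized, indexed dataset.
cur.execute("CREATE TABLE orders_mv (order_id INTEGER, customer_id INTEGER, amount REAL)")
cur.executemany(
    "INSERT INTO orders_mv VALUES (?, ?, ?)",
    [(10, 1, 25.0), (11, 1, 75.0), (12, 2, 40.0)],
)

# The virtual view hides which tier each table comes from.
cur.execute(
    """
    CREATE VIEW revenue_by_region AS
    SELECT c.region AS region, SUM(o.amount) AS revenue
    FROM orders_mv AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.region
    """
)

rows = cur.execute(
    "SELECT region, revenue FROM revenue_by_region ORDER BY region"
).fetchall()
print(rows)  # [('APAC', 40.0), ('EMEA', 100.0)]
```

A consumer queries `revenue_by_region` like any other table and does not need to know where the underlying data is stored.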
Varada keeps the operational data continuously synchronized, enabling easy data lifecycle management directly on the data lake. Varada leverages native cloud services such as AWS Glue and SQS, as well as customer-managed catalogs such as Hive Metastore, to keep data continuously fresh. Varada supports different synchronization and update scenarios:
During the data materialization process, Varada loads and indexes the data across all dimensions.
The operational dataset is indexed as it is loaded, at the rate of the data ingest, without any user intervention or post-processing. The result is that any query on an inline indexed dataset will find an index ready for it.
The inline index is adaptive to the data: each dimension (column) is split into very small pieces, called nanoblocks, which are then stored on NVMe SSDs.
To ensure fast performance for every query and each nanoblock, Varada leverages a variety of indexing algorithms and indexing parameters that adapt and evolve as the data changes, so that every data nanoblock gets the best-fit index.
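The general idea of per-block adaptive indexing can be illustrated with a toy sketch. It is not Varada's implementation: the block size, the two index kinds ("set" for low-cardinality blocks, "minmax" otherwise), and the `max_distinct` threshold are all assumptions chosen to keep the example small.

```python
from dataclasses import dataclass

BLOCK_ROWS = 4  # hypothetical block size, kept tiny for illustration


@dataclass
class Block:
    start: int           # row offset of the first value in the block
    values: list         # the column values held by this block
    index_kind: str      # "set" for low cardinality, "minmax" otherwise
    lo: object           # smallest value in the block
    hi: object           # largest value in the block
    distinct: frozenset  # distinct values, consulted by the "set" index


def build_blocks(column, block_rows=BLOCK_ROWS, max_distinct=2):
    """Split a column into small blocks and pick an index kind per block."""
    blocks = []
    for start in range(0, len(column), block_rows):
        chunk = list(column[start:start + block_rows])
        distinct = frozenset(chunk)
        kind = "set" if len(distinct) <= max_distinct else "minmax"
        blocks.append(Block(start, chunk, kind, min(chunk), max(chunk), distinct))
    return blocks


def matching_rows(blocks, value):
    """Return row numbers equal to value, skipping blocks the index rules out."""
    rows = []
    for b in blocks:
        if b.index_kind == "set":
            may_match = value in b.distinct
        else:
            may_match = b.lo <= value <= b.hi
        if not may_match:
            continue  # the whole block is skipped without scanning it
        rows.extend(b.start + i for i, v in enumerate(b.values) if v == value)
    return rows


column = [5, 5, 5, 5, 1, 9, 3, 7, 8, 8, 2, 2]
blocks = build_blocks(column)
kinds = [b.index_kind for b in blocks]
hits = matching_rows(blocks, 8)
print(kinds)  # ['set', 'minmax', 'set']
print(hits)   # [8, 9]
```

Because the index kind is chosen from each block's own contents, a column whose distribution changes over time ends up with different index kinds in different blocks, which is the adaptive behavior the text describes.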
Inline Indexing is used for:
To enable workloads to run out of the box, Varada embeds Presto (community edition), a “SQL-on-anything” engine that supports different connectors to access different data sources. Varada offers every Presto capability out of the box, including support for Presto connectors. All community-supported connectors are included by default. The Varada SQL engine extends the Presto community edition: