Eliminating the DataOps Challenge for Interactive Queries on Presto & Trino

By Roman Vainbrand
June 14, 2021

By autonomously accelerating Presto & Trino queries, Varada reduces dataops to a bare minimum, enabling data teams to focus on business priorities, deliver on analytics demands faster, and maintain control over data lake analytics cost and performance.

The number one priority for every data leader and team is to eliminate inefficiencies: get rid of the ops, deliver analytics workloads faster, improve the cost/performance balance, and focus on business priorities and high-priority workloads, all while maintaining control. To meet the ongoing demand for agile, more flexible data analytics, many organizations have deployed a Presto-based data lake architecture. Presto and Trino promise access to a heterogeneous, up-to-date stream of data (formats, sources, storage locations, etc.) that is available for analysis at any time and for any business purpose, with zero time-to-market for ad hoc queries and unprecedented flexibility.

But Presto, in and of itself, does little to alleviate dataops. Teams still need to manage the data analytics budget, prioritize query requests, and optimize query performance. Manual query optimization is time-consuming and requires a specific and extensive skill set. The optimization backlog grows every day, creating a vicious cycle. Presto’s reliance on full scans either limits the relevant use cases to non-interactive analytics or requires significant data modeling to make queries run faster at a reasonable cost. In addition, the lack of workload-level observability prevents dataops teams from seeing how data is used across the entire organization and from identifying which workloads need priority based on business needs, rather than on the needs of an individual user or query.

As analytics use cases grow in demand across almost every business unit, with costs bloating in direct proportion, data teams are desperate for a way to simplify dataops management while getting the cost of query acceleration under control.

Varada’s query acceleration platform enables teams to instantly operationalize their data lake, running queries 10x-100x faster at a 40%-60% cost reduction with zero optimization dataops. The platform includes autonomous query acceleration, workload-based observability, and a workload control center that gives data architects full control over prioritizing workloads and defining budgets and performance requirements.

Dynamic Indexing: Accelerate Presto Queries Autonomously

Presto optimization takes up enormous staff resources. The dataops team spends time poring over queries, looking for ways to speed them up and make them more resource-efficient. Starting with the slowest or most budget-consuming queries, which are not necessarily the most important ones, the team attempts to optimize each query by hand using a well-known bag of tricks, such as data structure changes, query rewrites and cluster-level optimizations. Since Presto is based on brute-force query processing, these optimizations can only take you so far. In fact, almost 90% of compute resources are consumed by full scans.

Varada continuously learns from and analyzes all your queries. With machine learning, the platform autonomously decides which queries to accelerate, when to accelerate them and how, leveraging a mix of indexing and caching strategies. The platform takes into account not only usage patterns, but also business priorities and budgets, so that acceleration decisions are fully aligned with business needs. Varada’s unique indexing technology is adaptive and dynamic, relying on many types of indexing, from the basic ones all the way to advanced text and log analytics. Based on the observability layer, the platform decides which data to index at any given time to support fluctuating and changing demand.

Varada’s proprietary indexing logic automatically analyzes the data lake and introduces indexes for filtering, joins and aggregates, continuously evaluating query performance on the fly. Varada’s engine automatically prioritizes the data to index or cache based on a smart observability layer that continuously monitors demand. Varada indexes data directly from the data lake across any column. This means that every query is optimized automatically.

Data teams using Varada don’t need to compromise on performance to achieve agility and optimal resource utilization on the data lake: they can leverage the power of autonomous indexing, caching, intermediate results, and an optimized dynamic filtering implementation to accelerate Presto queries by 10x-100x on their existing cluster, at a 40%-60% cost reduction.

Data Lake Observability: Preemptively Adapt to Changes

Varada’s workload-level observability enables dataops teams to seamlessly monitor, optimize and accelerate workloads to meet dynamic business requirements.

Step 1: Monitor Workloads

Workload-level monitoring gives dataops teams an open view to see how data is being used across the entire organization. By monitoring the workload behavior and identifying trends — such as deteriorated execution times or inconsistent results — data teams can preemptively adapt to changes.
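
As a rough illustration of what spotting a "deteriorated execution time" trend could look like, here is a minimal Python sketch. It is not Varada's implementation: the function name, the window size and the 1.5x threshold are all assumptions chosen for the example. It simply compares the median of a workload's most recent runs against the median of its earlier runs.

```python
from statistics import median


def has_deteriorated(durations_s, window=5, threshold=1.5):
    """Return True if recent runs look markedly slower than earlier ones.

    durations_s: execution times in seconds, oldest first (hypothetical input).
    window:      number of most recent runs treated as "recent" (assumption).
    threshold:   how many times slower the recent median must be (assumption).
    """
    if len(durations_s) < 2 * window:
        return False  # not enough history to compare
    baseline = median(durations_s[:-window])
    recent = median(durations_s[-window:])
    return baseline > 0 and recent > threshold * baseline


steady = [10, 11, 9, 10, 10, 11, 10, 9, 10, 11]
slowing = [10, 11, 9, 10, 10, 18, 20, 19, 22, 21]
print(has_deteriorated(steady), has_deteriorated(slowing))
```

Medians are used rather than means so that a single outlier run does not trigger an alert on its own.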

Step 2: Optimize Presto

Based on the query execution metadata, Varada delivers easy-to-digest and actionable workload-level observability that enables administrators to easily understand how data is used by different workloads and users, how resources are allocated, and how and why bottlenecks occur. Data teams gain control by effectively identifying and optimizing high-priority workloads, instead of optimizing highly elastic and dynamic queries one by one.

Step 3: Accelerate Presto Queries

Armed with a deep understanding of workload patterns and business priorities, Varada automatically tailors acceleration strategies for each workload. Data teams can easily set priorities, performance requirements and budget caps. Together, these features give data teams truly hands-free query acceleration, designed to meet a wider range of use cases and workload business requirements directly on the data lake.

Instantly analyze your Trino and PrestoDB clusters with Varada’s Analyzer (available on GitHub)

Maintain Full Cost & Performance Control: Enjoy Instantly Interactive Presto Queries

The cost of data workloads is notoriously hard to predict and to keep under control. Using Varada’s workload control center, administrators and data consumers can prioritize workloads and set budget caps to ensure the platform meets business requirements across different use cases. The platform then uses this workload prioritization to drive its caching and indexing strategies.

Insights are continuously revised based on real-time usage and query performance, and are translated into two types of acceleration strategies on the data lake: caching and indexing. These strategies are used to automatically create acceleration instructions that specify which data to index, and how, and which data to cache. Varada creates millions of adaptive caches and indexes, structured in a mesh of “nano-blocks”: tiny projections of the different data subsections in use.

  • Cache strategies rely on SSD columnar nano-block caching to speed up data access and are based on the frequency of data usage and its business priority. 
  • Indexing strategies speed up data searches, filters and joins: the impact of each index is evaluated separately based on data type and level of selectivity. Varada’s indexing technology breaks data, across any column, into nano-blocks. Varada automatically chooses the most effective index for each nano-block based on the data content and structure. We use a variety of indexes, such as bitmap, dictionary, tree and text-analysis indexes, and tailor each one to its nano-block. This unique indexing technology is what makes all your data available and interactive.
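
To make the idea of choosing an index per nano-block more concrete, here is a small Python sketch. It is purely illustrative and does not reflect Varada's proprietary logic: the `NanoBlock` fields, the `choose_index` function and every threshold in it are assumptions made up for this example. It only shows how data type and cardinality (a proxy for selectivity) could drive the choice among the index families named above.

```python
from dataclasses import dataclass


@dataclass
class NanoBlock:
    """Hypothetical summary of one nano-block of a column."""
    column: str
    dtype: str            # e.g. "string", "int", "text"
    row_count: int
    distinct_values: int


def choose_index(block: NanoBlock) -> str:
    """Pick an index family for a block (illustrative thresholds only)."""
    if block.row_count == 0:
        return "none"
    if block.dtype == "text":
        return "text"          # free-form text or log lines
    if block.distinct_values <= 64:
        return "bitmap"        # few distinct values: one bitmap per value
    if block.distinct_values / block.row_count <= 0.1:
        return "dictionary"    # moderately repetitive values
    return "tree"              # high cardinality: ordered lookups


blocks = [
    NanoBlock("country", "string", 10_000, 40),
    NanoBlock("user_agent", "string", 10_000, 800),
    NanoBlock("event_id", "int", 10_000, 10_000),
    NanoBlock("message", "text", 10_000, 9_000),
]
for b in blocks:
    print(b.column, "->", choose_index(b))
```

Because the decision is made per block rather than per column, the same column could end up with different index types in different blocks if its content varies.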

Acceleration instructions are extremely granular and are based on specific tables, columns and even partitions. To ensure the acceleration meets additional business considerations, acceleration strategies are also directly influenced by workload priorities, as determined by administrators and data consumers. Each new set of instructions is automatically configured and implemented, according to the budget caps and allocated resources set by administrators.
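
One simple way to picture how priorities and a budget cap could shape a set of acceleration instructions is a greedy selection, sketched below in Python. This is an assumption-laden illustration, not the platform's actual algorithm: the `Candidate` fields, the cost and benefit units, and the scoring formula (priority times benefit, divided by cost) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical acceleration candidate for one table column."""
    table: str
    column: str
    strategy: str     # "index" or "cache"
    cost: float       # assumed resource cost units
    benefit: float    # assumed estimate of query time saved
    priority: int     # workload priority, higher is more important


def plan_instructions(candidates, budget_cap):
    """Greedily pick candidates by priority-weighted benefit per unit cost."""
    if any(c.cost <= 0 for c in candidates):
        raise ValueError("candidate cost must be positive")
    ranked = sorted(
        candidates,
        key=lambda c: c.priority * c.benefit / c.cost,
        reverse=True,
    )
    chosen, spent = [], 0.0
    for c in ranked:
        if spent + c.cost <= budget_cap:   # never exceed the cap
            chosen.append(c)
            spent += c.cost
    return chosen, spent


candidates = [
    Candidate("orders", "customer_id", "index", cost=4.0, benefit=10.0, priority=3),
    Candidate("logs", "message", "index", cost=6.0, benefit=12.0, priority=1),
    Candidate("orders", "order_date", "cache", cost=3.0, benefit=6.0, priority=2),
]
chosen, spent = plan_instructions(candidates, budget_cap=8.0)
print([(c.table, c.column, c.strategy) for c in chosen], spent)
```

A greedy pass like this is not guaranteed to be optimal, but it keeps the plan within the cap and favors high-priority workloads, which is the behavior the paragraph above describes.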

Though acceleration instructions are generated automatically by the platform, administrators have full control to view, manage and override specific instructions and determine which datasets to accelerate and which strategies to apply. 

Varada’s control center gives dataops teams the ability to see usage at both the query and workload level — and the tools to direct the underlying optimization system based on workload priority. This means that when prioritizing where to optimize, data platform teams have a full view of the system and can understand where their efforts will make the biggest impact on the business.

Varada: Zero DataOps with Unprecedented ROI and Cost & Performance Balance

Companies using Varada make the most of their data lake architecture, without the need for additional and extensive investments in dataops or additional use case-specific platforms, such as text analytics platforms. Varada delivers high ROI through effective indexing-based resource utilization — coupled with zero dataops. 

Varada relies on its smart indexing to reduce the data processed to 1-2% of what Presto would scan and process without indexing. This is not only extremely fast, but also reduces compute resources to 10-30% of the initial Presto cluster footprint.
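
To put those percentages in concrete terms, the short Python sketch below applies them to a made-up baseline. The 10,000 GB scan volume and the 100-node cluster are assumptions chosen only for the arithmetic; they are not figures from Varada.

```python
def fraction_range(baseline, low_pct, high_pct):
    """Return (low, high) values for a percentage range of a baseline."""
    return baseline * low_pct / 100, baseline * high_pct / 100


# Assumed baseline: a workload that scans 10,000 GB on a 100-node cluster.
scan_low, scan_high = fraction_range(10_000, 1, 2)     # 1-2% of data
nodes_low, nodes_high = fraction_range(100, 10, 30)    # 10-30% of compute

print(f"Data processed: {scan_low:.0f}-{scan_high:.0f} GB instead of 10000 GB")
print(f"Cluster footprint: {nodes_low:.0f}-{nodes_high:.0f} nodes instead of 100")
```

Under these assumed numbers, the workload would process 100-200 GB rather than 10,000 GB, on the equivalent of 10-30 nodes rather than 100.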

Varada’s zero dataops means that companies can really go to market fast. Varada optimizes interactive use cases autonomously and delivers agility and flexibility at zero time-to-market, with no need to optimize anything. This means more analytics for the organization on a single data stack, with no need for optimized data platforms and no need to partition data.

The bottom line: data consumers using Varada get top performance with the ultimate flexibility of running queries directly on the data lake. Data platform teams using Varada get a zero dataops solution that really works, without relinquishing control. By autonomously handling dataops optimization tasks, Varada not only delivers high ROI but also frees data teams to think strategically and focus on expanding data lake analytics use cases and setting cross-organization, business-driven priorities and budgets.

Schedule a short demo now to see Varada in action on your data set!
