Trino (PrestoSQL) & Varada: The Ultimate Data Lake Architecture

David Krakov
By David Krakov
I
March 25, 2021
March 25, 2021

Accelerate your Trino (PrestoSQL) queries performance and slash compute costs with zero data-ops using Varada

Due to its unique advantages, Presto has quickly become the tool of choice for data driven companies looking for the agility and flexibility of the data lake architecture. Presto is an open-source distributed SQL query engine for running interactive analytic queries against data sources ranging from gigabytes to petabytes. It has been proven at scale in a variety of use cases at Airbnb, Comcast, Facebook, Netflix, Twitter, and Uber, and has a vibrant community of contributors addressing problems and improving the product offering.

Presto offers low-latency queries, high concurrency, cost-based optimization (CBO), and native ability to query multiple data sources in raw form, delivering short time-to-insight. The platform is highly customizable to meet business needs, giving IT full control over Presto installation. 

But Presto also poses many challenges. Because it relies on highly skilled talent, Presto operations are heavily dependent on IT, DevOps and DataOps, who are responsible for customizing and managing all enterprise capabilities. In addition, even with the CBO, performance is expensive and cost tends to spiral over time due to the natural dependency of brute force on compute resources. Further performance optimizations are often capped and require significant compute and DataOps resources.

As analytics use cases grow in demand across almost every business unit, with costs bloating in direct proportion, data teams are desperate for a way to simplify data ops management — while getting the cost of query acceleration under control.

Schedule A Demo

Benchmarking Trino (PrestoSQL) vs. Varada

Varada enables data architects to seamlessly accelerate and optimize Trino workloads to meet specific performance and cost requirements, with zero data-ops and effective resource utilization. Varada can be easily deployed as a “Presto Connector” directly on existing clusters to instantly accelerate your Trino clusters, leveraging existing investments. 

Varada’s indexing technology is designed to continuously monitor and learn which datasets are frequently used or required to meet specific high-priority workloads performance requirements. By using this feedback loop, different datasets are dynamically and automatically operationalized by indexing, cache, intermediate results, or any combination that delivers optimal performance and price balance.

In this short benchmarking analysis we measured Varada’s performance and cost per query against Trino, using a 27bn rows / 1.5TB (compressed in snappy Parquet) dataset, inspired by a popular ride-sharing app.

We ran four different queries:
Query #1: Full scan distinct count aggregation that matches the partitioning scheme;
Query #2: Cohort selection from the data for a simple counting aggregation;
Query #3: Selective projection of many columns in few rows at a needle in a haystack search;
Query #4: Selective join operation between two tables, fact and dimension.

Queries were executed a few times and the median result was taken. As we ran each query multiple times, we didn’t use the cache to store results. Only cached data (“warm”) was used to represent active workloads.

Cluster configurations:

  • Trino (PrestoSQL): 3x r5.8xlarge (3 on-demand nodes, EMR), total data size as loaded 2.1TB
  • Varada: 3 x r5d.8xlarge, with 1.25TB allocated for index and cache

Query Performance Results

As demonstrated below, Varada + Trino outperformed Trino on all queries. As expected, the value of our unique indexing technology is strongly proven to accelerate selective queries and joins, where Varada was over 45x and 60x faster, respectively.

Trino Varada Data Lake Presto
Varada Performance Uplift vs. Trino
Trino Varada Data Lake Presto
Query Run Time (seconds)

Cost Per Query

When it comes to cost per query, Varada was between 20x to 47x cheaper for cohort, selective queries and joins. Varada’s indexing is extremely lean on compute resources, which can help data teams avoid Trino’s spiraling costs.

Trino Varada Data Lake Presto
Cost / Query ($, normalized for 1-hour 100% utilization, except for on-demand)

Supercharge Presto / Trino with Indexing

With Varada, data teams and users no longer need to compromise on performance in order to achieve agility and fast cost effectiveness: they can leverage the power of autonomous indexing, caching of data and intermediate results to accelerate Presto queries by 10x-100x on their existing cluster. 

Varada’s proprietary indexing logic automatically analyzes the data lake and introduces indexes for filtering, joins and aggregates, continuously evaluating and query performance on the fly. Varada indexes data directly from the data lake across any column, so that every query is optimized automatically. Varada indexes adapt to changes in data over time, taking advantage of Presto’s vectorized columnar processing by splitting columns into small chunks, called nanoblocks™. Based on the data type, structure, and distribution of data in each nanoblock, Varda automatically creates an optimal index. To ensure fast performance for every query and each nanoblock, Varada automatically selects from its library of indexing algorithms and indexing parameters the optimal set for any data nanoblock, even as data changes and evolves.

Varadad’s engine automatically prioritizes the data to index or cache based on a smart observability layer that continuously monitors demand. Users running queries through the Varada endpoint see transparent performance benefits when filtering, joining and aggregating data. Varada transparently applies indexes to any SQL WHERE clause, on any column, within an SQL statement. Indexes are used for point lookups, range queries and string matching of data in nanoblocks. Varada automatically detects and uses indexes to accelerate JOINs using the index of the key column. Varada indexes can be used for dimensional JOINs combining a fact table with a filtered dimension table, for self-joins of fact tables based on time or any other dimension as an ID, and for joins between indexed data and federated data sources. SQL aggregations and grouping is also accelerated using nanoblock indexes.

While indexing is most effective for complex queries that run on highly dimensional data, that would have otherwise required extensive modeling to achieve acceptable response time, all queries perform significantly faster with Varada’s dynamic and adaptive indexing technology.


See how Varada’s big data indexing dramatically accelerates queries vs. AWS Athena:

To see Varada in action on your data set, schedule a short demo!


Query Syntax

Query #1 (full scan): 

SELECT year(d_date) as "year",        
       month(d_date) as "month",       
       count(distinct tripid) as cnt_trips 
FROM   
       hive.trips.trips_data_04012021_071819 as trips_data 
WHERE   
       d_date between date '2018-03-23' and date '2018-03-25' GROUP BY  1,2

Query #2 (cohort selection):

SELECT        
      trips_data.rider_gender AS rider_gender,        
      SUM(1) AS cnt 
FROM hive.trips.trips_data_04012021_071819 as trips_data 
WHERE d_date between date '2018-02-01' and date '2018-03-01'       
      AND rider_age IN (18,19)       
      AND d_weekday = 5 
GROUP BY  1 
order by 2 desc;

Query #3 (selective project):

SELECT   
      tripid, rider_id,    
      d_date, t_hour,    
      fare,  duration, distance,
      rider_age, rider_gender,
      lon, lat 
FROM
      hive.trips.trips_data_04012021_071819 trips_data 
WHERE   
      rider_id = 3380311
      AND t_hour between 7 and 10 

Query #4 (selective join):

SELECT
     month(fact.ts) as month,
     count(distinct fact.tripid) as cnt_unq_trips,
     count(distinct fact.rider_id) as cnt_unq_riders,
     round(avg(fact.fare),2) as avg_fare,
     round(avg(fact.duration)/60,2) as avg_duration_min 
FROM   
     hive.trips.trips_data_04012021_071819 as fact,
     hive.trips.riders_campaign as dim 
WHERE
     dim.rider_id = fact.rider_id AND
     dim.segememt IN ('Students_Promotion','Churning_Riders_Last3Months') AND
     dim.num_trips < 9 
group by 1 
order by 1;
We use cookies to improve your experience. To learn more, please see our Privacy Policy
Accept