How to Solve Your Data Lake Analytics Headache? Take 1 Trino with 1 Varada & Presto! Problem Solved!

By Guy Mast
I
February 3, 2022
February 3, 2022

Supercharge your Trino query performance, while slashing compute costs with zero data-ops by using Varada. A rapidly growing number of organizations are adopting both Trino and Varada’s connector as complementary solutions to solve their business challenges.

Solving agility and flexibility challenges, while quickly expanding the use of data lake architecture, is no easy feat. Especially with data engineers already struggling to keep up as companies generate more data than ever. This deluge of data has to be stored across multiple systems, and even though it is now harder to access, this data requires more and more complex analysis. The solution used to be to move and copy data, but this only created long and expensive workflows. The Trino project and its Enterprise offering by Starburst  solves this problem, which is why Varada is such a big fan. Trino allows you to provide easy and accurate access to all your data, for all your users. This is exactly why Varada chose Trino as our query engine, and why an increasing number of companies have been turning to the unique advantages of Trino. 

Trino is an open-source distributed SQL query engine for running interactive analytic queries against data sources ranging from gigabytes to petabytes. And it has been gaining serious traction by offering companies low-latency queries, high concurrency, cost-based optimizer (CBO), and native ability to query multiple data sources in raw form, delivering short time-to-insight. Its surge in popularity is not such a surprise given its proven scalability in a variety of high profile use cases, such as at Airbnb, Comcast, Facebook, Netflix, Twitter, and Uber. Its growth as the ‘go to’ solution has seen it develop a vibrant community of contributors addressing problems and improving the product offering.

Customizable Optimization 

The platform is highly customizable to meet business needs, giving data teams full control over Trino installation. However, Trino also poses some challenges. Because it relies on highly skilled talent, Trino operations are heavily dependent on DevOps and DataOps, who are responsible for customizing and managing all enterprise capabilities. In addition, even with the CBO, performance is expensive and cost tends to spiral over time due to the natural dependency of brute force on compute resources. Further performance optimizations are often capped and require significant compute and DataOps resources.

As analytics use cases grow in demand across almost every business unit, with costs bloating in direct proportion, data teams are desperate for a way to simplify data ops management — while getting the cost of query acceleration under control.

Varada enables data architects to seamlessly accelerate, optimize and query workloads sent to Trino to meet specific performance and cost requirements, with zero data-ops and effective resource utilization. Varada can be easily deployed as a “Trino Connector” directly on existing clusters to instantly accelerate your Trino clusters, leveraging existing investments. The new community edition of Varada’s connector for Trino is free (for clusters of up to 4 worker nodes), and is a part of Varada’s ongoing commitment and contribution to the Trino community.

Varada’s indexing technology is designed to continuously monitor and learn which datasets are frequently used or required to meet specific high-priority workload performance requirements. By using this feedback loop, different datasets are dynamically and automatically operationalized by indexing, cache, intermediate results, or any combination that delivers optimal performance and price balance.

Benchmarking the Benefits of Using Trino & Varada

We measured Trino’s queries cost/ performance with the Hive connector and then with Trino and Varada’s connector, using a 27bn rows / 1.5TB dataset, inspired by a popular ride-sharing app. The data was compressed in snappy Parquet and stored in the data lake (AWS S3).

We ran four different queries:

Query #1: Full scan distinct count aggregation that matches the partitioning scheme;

Query #2: Cohort selection from the data for a simple counting aggregation;

Query #3: Selective projection of many columns in few rows at a needle in a haystack search;

Query #4: Selective join operation between two tables, fact and dimension.

Queries were executed a few times and the median result was taken. As we ran each query multiple times, we didn’t use the cache to store results. Only cached data (“warm”) was used to represent active workloads.

Cluster configurations:

  • Trino with the Hive connector: 3x r5.8xlarge (3 on-demand nodes, EMR)
  • Trino with the Varada connector: 3 x r5d.8xlarge

Query Performance Results

As demonstrated below, Varada + Trino outperformed using just Trino + Hive connector on all queries. As expected, the value of our unique indexing technology is strongly proven to accelerate selective queries and joins, where Varada was over 45x and 60x faster, respectively.

Query Performance Results

Go for a test drive: Download Varada Trino Connector

Cost Per Query

When it comes to cost per query, Varada’s connector was between 20x to 47x cheaper for cohort, selective queries and joins than using the Hive connector. Varada’s connector indexing is extremely lean on compute resources, which can help data teams avoid Trino’s spiraling costs by using Hive to query the data lake.

Cost / Query ($, normalized for 1-hour 100% utilization, except for on-demand)

Cost / Query ($, normalized for 1-hour 100% utilization, except for on-demand)

Supercharge Presto / Trino with Indexing

With Varada, data teams and users no longer need to compromise on performance in order to achieve agility and fast cost effectiveness: they can leverage the power of autonomous indexing, caching of data and intermediate results to accelerate Presto queries by 10x-100x on their existing cluster. 

Varada’s proprietary indexing logic automatically analyzes the data lake and introduces indexes for filtering, joining and aggregating data, continuously evaluating and query performance on the fly. Varada indexes data directly from the data lake across any column, so that every query is optimized automatically. Varada indexes adapt to changes in data over time, taking advantage of Presto’s vectorized columnar processing by splitting columns into small chunks, called nanoblocks™. Based on the data type, structure, and distribution of data in each nanoblock, Varda automatically creates an optimal index. To ensure fast performance for every query and each nanoblock, Varada automatically selects from its library of indexing algorithms and indexing parameters the optimal set for any data nanoblock, even as data changes and evolves.

Varadad’s engine automatically prioritizes the data to index or cache based on a smart observability layer that continuously monitors demand. Users running queries through the Varada endpoint see transparent performance benefits when filtering, joining and aggregating data. Varada transparently applies indexes to any SQL WHERE clause, on any column, within an SQL statement. Indexes are used for point lookups, range queries and string matching of data in nanoblocks. Varada automatically detects and uses indexes to accelerate JOINs using the index of the key column. Varada indexes can be used for dimensional JOINs combining a fact table with a filtered dimension table, for self-joins of fact tables based on time or any other dimension as an ID, and for joins between indexed data and federated data sources. SQL aggregations and grouping is also accelerated using nanoblock indexes.

While indexing is most effective for complex queries that run on highly dimensional data that would have otherwise required extensive modeling to achieve acceptable response time, all queries perform significantly faster with Varada’s dynamic and adaptive indexing technology.


See how Varada’s big data indexing dramatically accelerates queries vs. AWS Athena:

To see Varada in action on your data set, download the connector! 


Query Syntax

Query #1 (full scan): 

SELECT year(d_date) as “year”,        

       month(d_date) as “month”,       

       count(distinct tripid) as cnt_trips 

FROM   

       hive.trips.trips_data_04012021_071819 as trips_data 

WHERE   

       d_date between date ‘2018-03-23’ and date ‘2018-03-25’ GROUP BY  1,2

Query #2 (cohort selection):

SELECT        

      trips_data.rider_gender AS rider_gender,        

      SUM(1) AS cnt 

FROM hive.trips.trips_data_04012021_071819 as trips_data 

WHERE d_date between date ‘2018-02-01’ and date ‘2018-03-01’       

      AND rider_age IN (18,19)       

      AND d_weekday = 5 

GROUP BY  1 

order by 2 desc;

Query #3 (selective project):

SELECT   

      tripid, rider_id,    

      d_date, t_hour,    

      fare,  duration, distance,

      rider_age, rider_gender,

      lon, lat 

FROM

      hive.trips.trips_data_04012021_071819 trips_data 

WHERE   

      rider_id = 3380311

      AND t_hour between 7 and 10

Query #4 (selective join):

SELECT

     month(fact.ts) as month,

     count(distinct fact.tripid) as cnt_unq_trips,

     count(distinct fact.rider_id) as cnt_unq_riders,

     round(avg(fact.fare),2) as avg_fare,

     round(avg(fact.duration)/60,2) as avg_duration_min 

FROM   

     hive.trips.trips_data_04012021_071819 as fact,

     hive.trips.riders_campaign as dim 

WHERE

     dim.rider_id = fact.rider_id AND

     dim.segememt IN (‘Students_Promotion’,’Churning_Riders_Last3Months’) AND

     dim.num_trips < 9 

group by 1 

order by 1;

We use cookies to improve your experience. To learn more, please see our Privacy Policy
Accept