Due to its unique advantages, Presto has quickly become the tool of choice for data driven companies looking for the agility and flexibility of the data lake architecture. Presto is an open-source distributed SQL query engine for running interactive analytic queries against data sources ranging from gigabytes to petabytes. It has been proven at scale in a variety of use cases at Airbnb, Comcast, Facebook, Netflix, Twitter, and Uber, and has a vibrant community of contributors addressing problems and improving the product offering.
Presto offers low-latency queries, high concurrency, cost-based optimization (CBO), and native ability to query multiple data sources in raw form, delivering short time-to-insight. The platform is highly customizable to meet business needs, giving IT full control over Presto installation.
But Presto also poses many challenges. Because it relies on highly skilled talent, Presto operations are heavily dependent on IT, DevOps and DataOps, who are responsible for customizing and managing all enterprise capabilities. In addition, even with the CBO, performance is expensive and cost tends to spiral over time due to the natural dependency of brute force on compute resources. Further performance optimizations are often capped and require significant compute and DataOps resources.
As analytics use cases grow in demand across almost every business unit, with costs bloating in direct proportion, data teams are desperate for a way to simplify data ops management — while getting the cost of query acceleration under control.
Varada enables data architects to seamlessly accelerate and optimize Trino workloads to meet specific performance and cost requirements, with zero data-ops and effective resource utilization. Varada can be easily deployed as a “Presto Connector” directly on existing clusters to instantly accelerate your Trino clusters, leveraging existing investments.
Varada’s indexing technology is designed to continuously monitor and learn which datasets are frequently used or required to meet specific high-priority workloads performance requirements. By using this feedback loop, different datasets are dynamically and automatically operationalized by indexing, cache, intermediate results, or any combination that delivers optimal performance and price balance.
In this short benchmarking analysis we measured Varada’s performance and cost per query against Trino, using a 27bn rows / 1.5TB (compressed in snappy Parquet) dataset, inspired by a popular ride-sharing app.
We ran four different queries:
Query #1: Full scan distinct count aggregation that matches the partitioning scheme;
Query #2: Cohort selection from the data for a simple counting aggregation;
Query #3: Selective projection of many columns in few rows at a needle in a haystack search;
Query #4: Selective join operation between two tables, fact and dimension.
Queries were executed a few times and the median result was taken. As we ran each query multiple times, we didn’t use the cache to store results. Only cached data (“warm”) was used to represent active workloads.
As demonstrated below, Varada + Trino outperformed Trino on all queries. As expected, the value of our unique indexing technology is strongly proven to accelerate selective queries and joins, where Varada was over 45x and 60x faster, respectively.
When it comes to cost per query, Varada was between 20x to 47x cheaper for cohort, selective queries and joins. Varada’s indexing is extremely lean on compute resources, which can help data teams avoid Trino’s spiraling costs.
With Varada, data teams and users no longer need to compromise on performance in order to achieve agility and fast cost effectiveness: they can leverage the power of autonomous indexing, caching of data and intermediate results to accelerate Presto queries by 10x-100x on their existing cluster.
Varada’s proprietary indexing logic automatically analyzes the data lake and introduces indexes for filtering, joins and aggregates, continuously evaluating and query performance on the fly. Varada indexes data directly from the data lake across any column, so that every query is optimized automatically. Varada indexes adapt to changes in data over time, taking advantage of Presto’s vectorized columnar processing by splitting columns into small chunks, called nanoblocks™. Based on the data type, structure, and distribution of data in each nanoblock, Varda automatically creates an optimal index. To ensure fast performance for every query and each nanoblock, Varada automatically selects from its library of indexing algorithms and indexing parameters the optimal set for any data nanoblock, even as data changes and evolves.
Varadad’s engine automatically prioritizes the data to index or cache based on a smart observability layer that continuously monitors demand. Users running queries through the Varada endpoint see transparent performance benefits when filtering, joining and aggregating data. Varada transparently applies indexes to any SQL WHERE clause, on any column, within an SQL statement. Indexes are used for point lookups, range queries and string matching of data in nanoblocks. Varada automatically detects and uses indexes to accelerate JOINs using the index of the key column. Varada indexes can be used for dimensional JOINs combining a fact table with a filtered dimension table, for self-joins of fact tables based on time or any other dimension as an ID, and for joins between indexed data and federated data sources. SQL aggregations and grouping is also accelerated using nanoblock indexes.
While indexing is most effective for complex queries that run on highly dimensional data, that would have otherwise required extensive modeling to achieve acceptable response time, all queries perform significantly faster with Varada’s dynamic and adaptive indexing technology.
See how Varada’s big data indexing dramatically accelerates queries vs. AWS Athena:
To see Varada in action on your data set, schedule a short demo!
Query #1 (full scan):
SELECT year(d_date) as "year", month(d_date) as "month", count(distinct tripid) as cnt_trips FROM hive.trips.trips_data_04012021_071819 as trips_data WHERE d_date between date '2018-03-23' and date '2018-03-25' GROUP BY 1,2
Query #2 (cohort selection):
SELECT trips_data.rider_gender AS rider_gender, SUM(1) AS cnt FROM hive.trips.trips_data_04012021_071819 as trips_data WHERE d_date between date '2018-02-01' and date '2018-03-01' AND rider_age IN (18,19) AND d_weekday = 5 GROUP BY 1 order by 2 desc;
Query #3 (selective project):
SELECT tripid, rider_id, d_date, t_hour, fare, duration, distance, rider_age, rider_gender, lon, lat FROM hive.trips.trips_data_04012021_071819 trips_data WHERE rider_id = 3380311 AND t_hour between 7 and 10
Query #4 (selective join):
SELECT month(fact.ts) as month, count(distinct fact.tripid) as cnt_unq_trips, count(distinct fact.rider_id) as cnt_unq_riders, round(avg(fact.fare),2) as avg_fare, round(avg(fact.duration)/60,2) as avg_duration_min FROM hive.trips.trips_data_04012021_071819 as fact, hive.trips.riders_campaign as dim WHERE dim.rider_id = fact.rider_id AND dim.segememt IN ('Students_Promotion','Churning_Riders_Last3Months') AND dim.num_trips < 9 group by 1 order by 1;