This blog post was initially published on The New Stack.
With a properly tuned Presto cluster you can run fast queries on your data lake, with response times ranging from sub-second to minutes. To accelerate your queries and reach top performance, you need to understand what Presto is trying to do and why. Several features and actions can be applied throughout the query lifecycle to greatly speed up Presto query performance, reducing query execution time and the associated compute resources.
Varada, a leading data lake acceleration platform, has been an active member of the Presto and Trino communities, with many contributions including dynamic filters, CBO optimizations, and the Workload Analyzer. In this post, I’ll provide practical tips and workflows that will help data ops engineers accelerate their queries. Let’s get started!
The first step is analyzing the query and associated tables to understand Presto’s execution plan. There are several alternatives available:
Run EXPLAIN (plan structure + cost estimates) or EXPLAIN ANALYZE (plan structure + cost estimates + actual execution statistics) on your query and evaluate the plan to see if you get what you want. Find the most expensive query fragment (largest CPU time) and operator, then ask: Are you reading too much data? Did predicate pushdown work as expected? Are your filters implemented correctly, or are they expensive due to bad selectivity? Do the JOINs look right (order and type)?
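As a sketch, assuming TPC-DS-style tables (store_sales and date_dim are illustrative names), the two commands look like this:

```sql
-- Static plan with cost estimates only (the query is not executed)
EXPLAIN
SELECT d_year, sum(ss_net_paid)
FROM store_sales
JOIN date_dim ON ss_sold_date_sk = d_date_sk
WHERE d_year = 2002
GROUP BY d_year;

-- Executes the query and annotates each operator with actual
-- CPU time, rows, and bytes processed
EXPLAIN ANALYZE
SELECT d_year, sum(ss_net_paid)
FROM store_sales
JOIN date_dim ON ss_sold_date_sk = d_date_sk
WHERE d_year = 2002
GROUP BY d_year;
```

EXPLAIN is cheap and safe to run on any query; EXPLAIN ANALYZE actually executes the query, so use it once the query is already known to complete in reasonable time.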
Now you can update your table statistics (if no estimates are available or they are stale) to improve CBO operations, override CBO if needed, and refine your query mechanics to improve cost and increase selectivity. Sometimes you’ll have distributed workload issues (over/under resource utilization or skewed workload distribution) that you’ll need to further investigate via observability tools.
The explain plan is a powerful tool, but it’s not very user friendly. Reading the plans and making sense of them can be overwhelming for even the most experienced data ops engineers. For a deep dive into how Presto executes a query and how to read EXPLAIN, check out this great webinar by Martin Traverso from Trino: Understanding and Tuning Presto Query Processing.
Varada’s open-source Workload Analyzer collects details and metrics on every query, aggregates and extracts information, and delivers dozens of charts describing all the facets of cluster performance. Using the Workload Analyzer, data teams can learn how resources are used on an hourly and weekly basis and define scaling rules, identify heavy spenders, improve the utilization of resources, and improve predicate pushdown. The Workload Analyzer also helps data teams significantly reduce IO and CPU, identify the “hottest” data, improve JOIN performance, provide a better production roll-out experience, and identify upgrade risks upfront.
Both Trino and PrestoDB provide a web-based user interface (UI) for monitoring a Presto cluster and managing queries. The main page has a list of queries along with information like unique query ID, query text, query state, percentage completed, username and source from which this query originated. The currently running queries are at the top of the page, followed by the most recently completed or failed queries. The query detail page contains multiple tabs of information ranging from resource utilization to live query plan and operators performance data.
Below are screenshots from Trino’s Web UI showing a query with a broadcast join plan between a large store_sales table and small time_dim table (correctly ordered by CBO), which uses dynamic filtering (can be seen in the operators’ view). The high-level statistics (query overview and resources) are useful for comparing similar queries, with different optimizations enabled.
Presto’s CBO uses the table statistics provided by the connectors to estimate the costs for different join implementations and automatically picks the one with the lowest estimated costs. Therefore, it is recommended to keep table statistics up to date. This can be done using the SHOW STATS and ANALYZE commands. The SHOW STATS command returns current statistics for the named table or for the results of a (limited) query, while the ANALYZE command updates table and column statistics of a given table.
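For example (the table name is illustrative), statistics can be inspected and refreshed as follows:

```sql
-- Inspect current statistics for a table
SHOW STATS FOR store_sales;

-- Inspect statistics for a filtered subset of the table
SHOW STATS FOR (SELECT * FROM store_sales WHERE ss_quantity > 10);

-- Collect fresh table and column statistics
ANALYZE store_sales;
```

Running ANALYZE after large data loads keeps the CBO’s row-count and value-distribution estimates accurate, which directly affects join ordering and distribution choices.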
File format, size, compression, and organization can go a long way to improving your Presto query performance. Check out this great webinar by Dain Sundstrom from Trino on Configuring and Tuning Presto to learn more about tuning and sizing your Presto deployment for optimal performance.
Columnar format has become the de facto standard for big data analytics. The process of converting rows into columns is time consuming and compute-intensive but critical for optimal performance. We recommend using ORC or Parquet – both outperform text format considerably – but ORC is faster in Presto. Presto does not mandate how data is stored but it processes the data as vectorized columns. This means that Presto only holds the amount of data it processes without carrying around the additional fields that are not relevant to the query in process. Moreover, columnar data usually compresses better since each column can be compressed separately – according to its data distribution.
File size is a big deal in Presto. Small files create many small IO requests, which hurt performance (high latency, throttling, or running out of IO capacity) and make file listing slow. Each file requires separate handling, which increases the scheduling time and cost. We recommend using file sizes of at least 100MB to overcome potential IO issues.
Additionally, always consider compressing your data for better performance. If using high-compression formats, prefer ZSTD over GZIP. For low compression, prefer LZ4 over Snappy. In any case, you should avoid using LZO altogether.
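With the Hive connector, for instance, the codec used when Presto writes data can be set in the catalog properties file (the ZSTD value assumes a reasonably recent Presto/Trino version):

```properties
# etc/catalog/hive.properties
# Compression codec used when writing files
# (typical values: NONE, SNAPPY, LZ4, ZSTD, GZIP)
hive.compression-codec=ZSTD
```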
Partitioning divides your table into parts and keeps the related data together based on column values such as date or product. You define partitions at table creation, and they help reduce the amount of data scanned for queries with predicates over the partitioned column, thereby improving performance. Additionally, sorting your data also helps narrow the range of partitions that need to be read by Presto. If we sort each file, we can efficiently skip irrelevant chunks, but only if we filter over the sorted column.
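As an illustrative sketch using the Hive connector (table and column names are hypothetical), partitioning and sorting are declared at table creation; note that in the Hive connector, partition columns must be last in the column list, and sorted_by requires bucketing:

```sql
CREATE TABLE sales (
    product_id BIGINT,
    amount     DOUBLE,
    sale_date  DATE                          -- partition column, listed last
)
WITH (
    format = 'ORC',
    partitioned_by = ARRAY['sale_date'],     -- prune partitions on sale_date predicates
    bucketed_by = ARRAY['product_id'],
    bucket_count = 32,
    sorted_by = ARRAY['product_id']          -- sort rows within each file
);
```

A query filtering on sale_date will then only read the matching partitions, and a filter on product_id can skip irrelevant chunks within each sorted file.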
Improving your filters to give you less results (high selectivity) is key to reducing latency and the amount of data scanned. Presto has several features it uses to optimize filtering operations.
One way Presto optimizes querying is by skipping over unnecessary data, referred to as pruning. Some statistics (e.g. the minimum and maximum values) of each column are recorded per file, per stripe (~1M rows), and for every 10K rows. Using this information, Presto can skip any segment that could not possibly match the query predicate. Below are screenshots of EXPLAIN ANALYZE outputs for queries with and without predicate pushdown.
EXPLAIN ANALYZE output for query without predicate pushdown: reading 2.79B rows with complex predicate – which currently cannot be pushed down.
EXPLAIN ANALYZE output for query with predicate pushdown: reading 2.13B rows with simple predicate.
The last feature is lazy reads – delaying the reading of columns until their data is actually needed. Predicate pushdown is amazing when it works, but for a lot of data sets, it doesn’t work at all. If the data has a large number of distinct values and is well-shuffled, the minimum and maximum stats will cover almost the entire range of values, rendering predicate pushdown ineffective. With lazy reads, the query engine first inspects only the columns needed to evaluate the query filter, and only then reads the other columns for segments that match the filter (if any are found). For many workloads, this feature saves additional CPU on top of predicate pushdown. Additionally, you can take advantage of another great optimization, late materialization, which allows lazy reads to be propagated within the same stage. This can be enabled via the ‘experimental.late-materialization.enabled’ config property.
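The latter is a coordinator/worker config-file entry; as a sketch (the property name is taken from the text above and, being experimental, may change across versions):

```properties
# etc/config.properties
# Propagate lazy reads within a stage (experimental)
experimental.late-materialization.enabled=true
```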
The order in which joins are executed in a query can have a significant impact on the query’s performance. If a join that produces a lot of data is performed early in the execution, then subsequent stages will need to process large amounts of data for longer than necessary, increasing the time and resources needed for the query. Presto’s CBO will always try to choose the most efficient plan for a given query, based on table statistics and resource availability. This greatly reduces query processing time.
In order for advanced optimizations like dynamic filtering to work efficiently, the smaller dimension table needs to be chosen as a join’s build side. The cost-based optimizer can automatically do this using table statistics provided by connectors. Therefore, it is recommended to keep table statistics up to date and rely on the CBO to correctly choose the smaller table on the build side of the join.
There are two types of join distributions in Presto: broadcast and partitioned. In a broadcast join, the right-hand side (build) table, if “small”, is replicated to all join workers – saving the CPU and network cost of repartitioning the left-hand side – which can be significant for a large left-hand side (probe) table. In a partitioned join, both tables are repartitioned using the join key, allowing joins with larger right-hand side tables. For more details on how to optimize joins in selective use cases, see my Presto join optimization SlideShare.
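The distribution can be left to the CBO or forced per session; the session property below exists in Trino (PrestoDB exposes a similarly named property – check your version’s documentation):

```sql
-- Let the cost-based optimizer choose (default)
SET SESSION join_distribution_type = 'AUTOMATIC';

-- Force a broadcast join, e.g. when the build side is known to be small
SET SESSION join_distribution_type = 'BROADCAST';
```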
Very large join operations can sometimes run out of memory. Colocated joins (also known as grouped join execution) will use less memory, CPU, and shuffle less data among Presto workers. To leverage these benefits, you must make sure the two tables to be joined are partitioned on the same keys. When both sides of a join have the same table partitioning and the partitioning is addressable, partial data can be loaded into memory at a time, making it possible to execute the join with less peak memory usage.
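As a sketch with the Hive connector (table and column names are hypothetical), the prerequisite is two tables bucketed on the same join key with matching bucket counts:

```sql
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    total       DOUBLE
)
WITH (bucketed_by = ARRAY['customer_id'], bucket_count = 64);

CREATE TABLE customers (
    customer_id BIGINT,
    name        VARCHAR
)
WITH (bucketed_by = ARRAY['customer_id'], bucket_count = 64);
```

A join between these two tables on customer_id can then be executed bucket by bucket, loading only partial data into memory at a time.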
Presto was recently enhanced with the dynamic filtering feature (contributed by Varada), which pre-filters the probe-side table before joining. Dynamic filtering significantly improves the performance of queries with selective joins by avoiding reading data that would otherwise be filtered out by the join condition. This has been shown to reduce query execution time by a factor of 3x to 10x! Without dynamic filtering, Presto scans all the data in the fact table, since there are no filters on it in the query, and the join operator ends up throwing away most of the probe-side rows because the join criteria are highly selective.
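Dynamic filtering is enabled by default in recent Trino versions; as a sketch, the session property below can be used to toggle it when comparing plans (the exact property name may differ in older Presto releases):

```sql
-- Toggle dynamic filtering for the current session
SET SESSION enable_dynamic_filtering = true;

-- The plan should show a dynamic filter created at the join
-- and applied as an extra predicate on the probe-side table scan
EXPLAIN
SELECT ss.*
FROM store_sales ss
JOIN date_dim d ON ss.ss_sold_date_sk = d.d_date_sk
WHERE d.d_year = 2002;
```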
With Varada, data teams and users no longer need to compromise on performance in order to achieve agility and cost effectiveness: they can leverage the power of autonomous indexing, caching, intermediate results, an efficient scan and predicate pushdown implementation, and an optimized dynamic filtering implementation to accelerate Presto queries by 10x-100x on their existing cluster.
Varada’s proprietary indexing logic automatically analyzes the data lake and introduces indexes for filtering, joins, and aggregates, continuously evaluating query performance on the fly. Varada’s engine automatically prioritizes the data to index or cache based on a smart observability layer that continuously monitors demand. Varada indexes data directly from the data lake across any column, so that every query is optimized automatically. Varada indexes adapt to changes in data over time, taking advantage of Presto’s vectorized columnar processing by splitting columns into small chunks, called nanoblocks™. Based on the data type, structure, and distribution of data in each nanoblock, Varada automatically creates an optimal index, selecting the best-fitting algorithm and parameters from its library of indexing algorithms for each nanoblock, even as data changes and evolves.
Schedule a short demo now to see Varada in action on your data set!