Unveiling the Secrets Behind Your Presto Cluster

By Roman Zeyde

April 12, 2020

Last updated: June 14, 2020


Are you ready to get started? Download the Workload Analyzer. It’s free!


Data architects consistently strive to provide users with the best experience possible (ease of use, query response times, data availability). But they are also challenged to build an efficient, robust, highly available, scalable, and cost-efficient infrastructure with as little time and effort as possible.

This delicate balance also applies to managing Presto clusters. The benefits delivered by Presto are tremendous, but is it really running as efficiently as possible? Can it support the SLAs on response time and concurrency that workloads and users demand?

As we learned from our customers and partners, most optimization opportunities are hidden beneath the surface and require truly understanding how users and data interact.

To address that, we built a Presto cluster insights tool that provides in-depth visibility into how to best serve Presto users and workloads. This is a preview of some of the insights we can help uncover. For a real sneak peek, check out the sample report – it is based on real customer data that has been anonymized.

Community first! As part of our deep commitment to the Presto community, we decided to release a standalone open-source version of the tool so any Presto user can evaluate potential performance improvements in their cluster. The tool will help you optimize your Presto clusters on your own with existing solutions, as well as evaluate how adding an indexing layer can help. Stay tuned for the open-source version!


Understand What Your Cluster Does

To execute queries, Presto performs many activities, such as scanning, filtering, joining, and aggregating data. By focusing on what workloads truly require and which resources they use, you can identify the heaviest consumers and improve the pipeline:

  • Identify the top users consuming most of the CPU on your cluster
  • Identify which query operators and types of queries consume most of your cluster resources
  • Identify which users require further education on improving and optimizing queries, as well as on applying performance best practices
  • Identify cohorts/groups of users and apply resource group policies to enhance overall workload management
  • Optimize how you model chargeback pricing based on actual user consumption levels and patterns

Presto cluster: CPU utilization by operator
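Breakdowns like the one above reduce to a small aggregation over Presto's per-query JSON statistics. Here is a minimal sketch in Python; the `session.user` and `queryStats.totalCpuTime` field names follow recent Presto query JSONs, but verify them against your version:

```python
from collections import defaultdict

# Presto reports durations as strings like "500.00ms" or "1.50m".
_UNITS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0, "m": 60.0, "h": 3600.0, "d": 86400.0}

def parse_duration(text):
    """Convert a Presto duration string to seconds."""
    for suffix in sorted(_UNITS, key=len, reverse=True):  # try "ms" before "s"
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * _UNITS[suffix]
    raise ValueError("unrecognized duration: %r" % text)

def cpu_seconds_by_user(queries):
    """Sum CPU seconds per user across a list of Presto query JSONs,
    heaviest consumer first. Field names are taken from recent Presto
    versions and may need adjusting for yours."""
    totals = defaultdict(float)
    for q in queries:
        totals[q["session"]["user"]] += parse_duration(q["queryStats"]["totalCpuTime"])
    return dict(sorted(totals.items(), key=lambda kv: -kv[1]))
```

The same aggregation, keyed on operator type instead of user, yields the per-operator chart.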

Future-Proof Your Cluster: Leverage Workload Patterns to Plan Ahead

The Workload Analyzer offers deep insight into resource utilization (compute, RAM, networking bandwidth) and visibility into which resources are used and when. This lets you easily optimize the configuration, right-size the cluster based on usage trends, and define scaling rules:

  • Identify your low, medium, and high peaks for smarter auto-scaling rules based on actual facts and query patterns (ETL, internal/external apps)
  • Identify new opportunities to reduce costs – are you over-provisioning infrastructure resources?

Presto cluster: CPU utilization over time


Presto cluster: RAM utilization by user
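The utilization-over-time view above can feed directly into scaling schedules: bucket CPU consumption by hour of day and label each hour relative to the busiest one. A hedged sketch, assuming each query record carries an ISO-8601 `createTime` and a pre-parsed numeric `cpuSeconds` (illustrative names, not a fixed Presto schema):

```python
from collections import Counter
from datetime import datetime

def cpu_by_hour(queries):
    """Bucket CPU seconds by hour of day to surface daily peaks."""
    buckets = Counter()
    for q in queries:
        buckets[datetime.fromisoformat(q["createTime"]).hour] += q["cpuSeconds"]
    return buckets

def classify_peaks(buckets, low=0.33, high=0.66):
    """Label each hour low/medium/high relative to the busiest hour,
    as a starting point for auto-scaling rules."""
    top = max(buckets.values())
    return {
        hour: "high" if v >= high * top else "medium" if v >= low * top else "low"
        for hour, v in buckets.items()
    }
```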


Identify Hotspots to Optimize Performance and Cost

Not all data and users are equal. For example, some data elements – sources, tables, partitions – are used more frequently than others. Many clusters follow a power-law distribution, where a small fraction of users and data consumes a disproportionate share of resources.

The Workload Analyzer also offers insights into usage patterns, so you can easily identify frequently accessed data that may cause bottlenecks on resources such as network bandwidth, and consider storing it differently. It also detects data that causes heavy processing, a candidate for pre-aggregation or indexing.
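Finding hot data is essentially a frequency count over the tables each query touches. A minimal sketch, assuming each query record lists its tables under an `inputTables` key (an illustrative name; in practice you would derive this from the query plan JSONs):

```python
from collections import Counter

def hottest_tables(queries, top=10):
    """Rank tables by how often queries touch them, to spot hot data
    worth caching, re-partitioning, or moving to faster storage."""
    counts = Counter()
    for q in queries:
        counts.update(q.get("inputTables", []))
    return counts.most_common(top)
```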

Furthermore, the Workload Analyzer helps identify selectivity patterns (output rows vs. input data read, after applying query filters) so you can effectively address:

  • Inefficient data formats and wasted I/O/CPU – convert to columnar ORC/Parquet with optional fast compression, to avoid scanning redundant columns not needed for the query
  • Non-optimal or missing partitioning, which can result in full table scans, slow performance, and cluster bottlenecks
  • Predicate pushdown – even with partitioning in place, the Workload Analyzer helps identify other potential I/O and CPU inefficiencies, such as full table scans, which in many cases stem from a lack of pushdown

Presto cluster: Table scan CPU utilization


Presto cluster: CPU utilization by user
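Selectivity itself is just output rows divided by input rows per table scan: a ratio near 1.0 means the filters prune almost nothing, flagging candidates for partitioning, predicate pushdown, or indexing. A sketch with illustrative field names (map them to your query JSONs):

```python
def scan_selectivity(scans):
    """Compute selectivity (output rows / input rows) per table scan,
    least selective first. Ratios near 1.0 mean the scan's filters
    prune almost nothing."""
    ratios = {}
    for s in scans:
        if s["inputRows"] > 0:  # skip empty scans to avoid dividing by zero
            ratios[s["table"]] = s["outputRows"] / s["inputRows"]
    return dict(sorted(ratios.items(), key=lambda kv: -kv[1]))
```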

Detect Inefficient Usage and Help Educate Users

Small changes to a query’s SQL can result in dramatic changes in performance and resource consumption. For example, reordering joins so that a small right-hand side is joined with a large left-hand side yields a significant improvement. Using an approximate distinct count instead of an exact distinct count is at times an acceptable trade-off between accuracy and query time.

You can also identify if and where you can unlock performance-boosting potential with Dynamic Filtering. This is extremely relevant for the very common star-schema DWH architecture, where a large (fact) table sits on one side of the join and a much smaller one (a dimension, or a highly selective large table) on the other.
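The join-ordering check above can be automated: Presto builds an in-memory hash table from the right-hand relation of a join, so flagging joins whose right side is larger than the left side surfaces rewrite candidates. A sketch over simplified join summaries (illustrative fields, not Presto's actual plan schema):

```python
def flag_misordered_joins(joins):
    """Return the IDs of joins whose right (build) side is larger than
    the left (probe) side -- candidates for reordering, since the
    right-hand relation is built into an in-memory hash table."""
    return [
        j["joinId"]
        for j in joins
        if j["rightRows"] > j["leftRows"]
    ]
```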

Local and Non-Intrusive by Design: How is Data Collected?

Presto provides a lot of detailed information about query execution and exposes it via a REST API, in JSON format.

This standalone tool collects and locally stores query JSONs from Presto for the recording period. Data collection has negligible compute needs and does not impact or interact with the cluster’s query execution in any way. The Presto statistics stored in these JSONs can then be analyzed for the recorded period using an open-source Python script that runs locally.
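The collection step described above can be sketched in a few lines: list finished queries from the coordinator, fetch each query's full JSON, and write it to disk for offline analysis. This uses the `/v1/query` endpoints exposed by recent Presto versions; check yours (and any authentication your deployment requires) before relying on it:

```python
import json
import urllib.request
from pathlib import Path

def collect_query_jsons(coordinator_url, out_dir, fetch=None):
    """Pull query JSONs from the coordinator's REST API and store them
    locally, one file per query, for later offline analysis."""
    # fetch() is injectable so the function can be tested without a cluster.
    fetch = fetch or (lambda url: json.load(urllib.request.urlopen(url)))
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for summary in fetch(coordinator_url + "/v1/query"):
        query_id = summary["queryId"]
        # The per-query endpoint returns the full statistics document.
        detail = fetch(coordinator_url + "/v1/query/" + query_id)
        (out / (query_id + ".json")).write_text(json.dumps(detail))
        saved.append(query_id)
    return saved
```

Run periodically (e.g. via cron), this captures the recording period without touching query execution itself.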

