Unveiling the Secrets Behind Your Presto Cluster

Roman Zeyde
April 12, 2020

Data architects constantly strive to give users the best possible experience: ease of use, fast query response times, and data availability. They are also challenged to build an efficient, robust, highly available, scalable, and cost-efficient infrastructure with as little time and effort as possible.

This delicate balance also applies to managing Presto clusters. The benefits delivered by Presto are tremendous, but is your cluster really running as efficiently as possible? Can it support the SLAs on response time and concurrency that workloads and users demand?

As we learned from our customers and partners, most optimization opportunities are hidden beneath the surface and depend on truly understanding how users and data interact.

To address that, we built a Presto cluster insights tool that provides in-depth visibility into how to best serve Presto users and workloads. This is a sneak preview of some of the insights it can help uncover.

Community first! As part of our deep commitment to the Presto community, we decided to release a standalone open source version of the tool so that any Presto user can evaluate potential performance improvements in their cluster. The tool will help you optimize your Presto clusters on your own with existing solutions, as well as evaluate how adding an indexing layer can help. Stay tuned for the open source version!

Understand What Your Cluster Does

To execute queries, Presto performs many activities such as scans, filtering, joining, and aggregating data. By focusing on what workloads truly require and which resources are used, you will be able to set priorities for an optimization roadmap. 

The tool will help you:

  • Learn how resources are consumed, and identify associated cost
  • Prioritize initiatives, such as re-partitioning to improve predicate pushdown, and focus user education efforts

Presto cluster: CPU utilization by operator

Break down by user to optimize specific needs, or by data source to understand the impact of data pipeline design choices on query performance.
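
As a minimal sketch of this kind of breakdown, the Python below groups CPU time by an arbitrary dimension and reports each group's share of the total. The record layout (`operator`, `user`, `cpu_seconds`) is a hypothetical simplification chosen for illustration; it is not the tool's actual schema, nor the format of Presto's query JSON.

```python
from collections import defaultdict

def cpu_share_by(records, key):
    """Return {group: fraction of total CPU time} for the given key."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec[key]] += rec["cpu_seconds"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {group: cpu / grand_total for group, cpu in totals.items()}

# Hypothetical records, assumed to be flattened from collected query statistics.
records = [
    {"operator": "TableScan", "user": "alice", "cpu_seconds": 60.0},
    {"operator": "HashJoin", "user": "alice", "cpu_seconds": 25.0},
    {"operator": "TableScan", "user": "bob", "cpu_seconds": 10.0},
    {"operator": "Aggregation", "user": "bob", "cpu_seconds": 5.0},
]

by_operator = cpu_share_by(records, "operator")  # share of CPU per operator type
by_user = cpu_share_by(records, "user")          # same data, sliced per user
```

The same function serves both views because the grouping key is a parameter; a data-source breakdown would only need an extra field in each record.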

Future-Proof Your Cluster: Leverage Workload Patterns to Plan Ahead

The tool offers deep insight into resource utilization (compute, RAM, network bandwidth) and visibility into which resources are used and when. This makes it easy to tune the configuration and right-size the cluster according to usage trends.
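
To illustrate how usage over time might be derived from recorded queries, here is a rough sketch that sums CPU seconds per hour of day. The `start` and `cpu_seconds` fields are hypothetical, and attributing a query's entire CPU time to the hour in which it started is a deliberate simplification.

```python
from collections import defaultdict
from datetime import datetime

def hourly_cpu(queries):
    """Sum CPU seconds per hour of day (0-23), keyed by each query's start hour."""
    buckets = defaultdict(float)
    for q in queries:
        start = datetime.fromisoformat(q["start"])
        buckets[start.hour] += q["cpu_seconds"]
    return dict(buckets)

# Hypothetical query records spanning two days.
queries = [
    {"start": "2020-04-01T09:15:00", "cpu_seconds": 120.0},
    {"start": "2020-04-01T09:40:00", "cpu_seconds": 80.0},
    {"start": "2020-04-01T14:05:00", "cpu_seconds": 30.0},
    {"start": "2020-04-02T09:05:00", "cpu_seconds": 100.0},
]

usage = hourly_cpu(queries)
peak_hour = max(usage, key=usage.get)  # hour of day with the highest CPU demand
```

A recurring peak hour like this is the kind of trend that can inform cluster sizing or scheduled scaling.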

Presto cluster: CPU utilization over time

Further breaking down and comparing resource usage by dimensions such as users, data sources, or query types can help determine whether separate cluster configurations are needed.

Presto cluster: RAM utilization by user

Identify Hotspots to Optimize Performance and Cost

Not all data and users are equal. For example, some data elements (sources, tables, partitions) are used more frequently than others. Many clusters follow a power law distribution, in which a small fraction of users and data consumes a disproportionate share of resources.

The tool offers insights into usage patterns so that you can easily identify frequently accessed data that may cause bottlenecks on resources such as network bandwidth, and consider storing it differently. The tool also detects data that causes heavy processing, which may be a candidate for pre-aggregation or indexing.
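
One simple way to surface such hotspots, sketched here under assumptions, is to rank tables by CPU time and keep the heaviest ones until they account for a chosen share of the total. The `hotspots` helper, the 80% threshold, and the sample figures are all hypothetical and do not describe the tool's actual method.

```python
def hotspots(cpu_by_table, threshold=0.8):
    """Return the heaviest tables that together reach `threshold` of total CPU."""
    total = sum(cpu_by_table.values())
    selected, running = [], 0.0
    ranked = sorted(cpu_by_table.items(), key=lambda kv: kv[1], reverse=True)
    for table, cpu in ranked:
        # Stop once the tables selected so far already cover the threshold.
        if total == 0 or running / total >= threshold:
            break
        selected.append(table)
        running += cpu
    return selected

# Hypothetical table-scan CPU seconds per table.
cpu_by_table = {"events": 700.0, "users": 150.0, "orders": 100.0, "logs": 50.0}
hot = hotspots(cpu_by_table)
```

With a power law distribution the returned list stays short, which makes it a practical shortlist for re-partitioning, pre-aggregation, or indexing.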

Presto cluster: Table scan CPU utilization

Understand your users and allocate resources to them according to budget, requirements, and priorities.

Presto cluster: CPU utilization by user

Detect Inefficient Usage and Help Educate Users

Small changes in a query's SQL can result in dramatic changes in performance and resource consumption. For example, reordering joins so that a large left-hand side is joined with a small right-hand side can result in a significant improvement, because Presto builds the join's in-memory hash table from the right-hand side. Using an approximate distinct count instead of an exact distinct count is at times an acceptable trade-off between accuracy and query time.

The tool helps you proactively detect bottlenecks and suggests query alternatives to improve the user experience, reduce cluster cost, and build trust in and familiarity with data across all data practitioners.
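
As an illustration of what such a suggestion could look like, the sketch below flags queries that use an exact `COUNT(DISTINCT ...)` and points to Presto's `approx_distinct` function. The regular expression is a crude, hypothetical heuristic (it can misfire on comments or string literals) and is not the tool's actual detection logic.

```python
import re

# Matches "count(distinct", tolerating whitespace and any letter case.
COUNT_DISTINCT = re.compile(r"\bcount\s*\(\s*distinct\b", re.IGNORECASE)

def suggest_rewrites(sql):
    """Return a list of optimization hints for a single SQL statement."""
    hints = []
    if COUNT_DISTINCT.search(sql):
        hints.append(
            "Exact COUNT(DISTINCT ...) found: consider approx_distinct(...) "
            "if an approximate result is acceptable."
        )
    return hints

exact = "SELECT COUNT(DISTINCT user_id) FROM events"
approximate = "SELECT approx_distinct(user_id) FROM events"

exact_hints = suggest_rewrites(exact)
approximate_hints = suggest_rewrites(approximate)
```

Whether the approximation is acceptable depends on the use case, so a hint like this is best treated as a prompt for a conversation with the query's author rather than an automatic rewrite.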

Local and Non-Intrusive by Design: How is Data Collected?

Presto provides a lot of detailed information about query execution and exposes it via a REST API in JSON format.

This standalone tool collects query JSONs from Presto and stores them locally for the recording period. Data collection has negligible compute needs and does not impact or interact in any way with query execution on the cluster. The Presto statistics stored in the JSONs can then be used to analyze the cluster over the recorded period with an open-source Python script that is executed locally.
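
The collection step can be pictured roughly as follows. This is a minimal sketch, not the tool's actual code: it assumes the coordinator's `/v1/query` endpoint lists queries with `queryId` and `state` fields and that `/v1/query/{queryId}` returns the full query JSON. The function names, the `X-Presto-User` header value, and the one-file-per-query layout are illustrative choices, and some deployments may require authentication.

```python
import json
import urllib.request
from pathlib import Path

def http_get_json(url):
    """Fetch a URL and decode the JSON body."""
    # Assumption: the coordinator accepts a plain user header without auth.
    request = urllib.request.Request(url, headers={"X-Presto-User": "query-collector"})
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)

def collect_finished_queries(coordinator, out_dir, get_json=http_get_json):
    """Save the JSON of each finished query not yet stored; return the new IDs."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for summary in get_json(f"{coordinator}/v1/query"):
        if summary.get("state") != "FINISHED":
            continue  # skip queries that are still running or that failed
        query_id = summary["queryId"]
        target = out_dir / f"{query_id}.json"
        if not target.exists():
            details = get_json(f"{coordinator}/v1/query/{query_id}")
            target.write_text(json.dumps(details))
            saved.append(query_id)
    return saved
```

Calling `collect_finished_queries("http://localhost:8080", "query_jsons")` periodically (the address is a placeholder) would accumulate one file per query for later offline analysis. Because the collector only reads from the REST API, it stays outside the query execution path.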