Last updated: June 14, 2020
Data architects strive to give users the best possible experience (ease of use, query response times, data availability). At the same time, they are challenged to build an efficient, robust, highly available, scalable, and cost-efficient infrastructure with as little time and effort as possible.
This delicate balance also applies to managing Presto clusters. The benefits Presto delivers are tremendous, but is it really running as efficiently as possible? Can it support the response-time and concurrency SLAs that workloads and users demand?
As we have learned from our customers and partners, most optimization opportunities are hidden beneath the surface, and uncovering them requires truly understanding how users and data interact.
To address that, we built a Presto cluster insights tool that provides in-depth visibility into how to best serve Presto users and workloads. This post is a sneak preview of some of the insights we can help uncover. For a real sneak peek, check out the sample report – it is based on real customer data that has been anonymized.
Community first! As part of our deep commitment to the Presto community, we decided to release a standalone, open-source version of the tool for any Presto user to evaluate potential performance improvements in their cluster. The tool will help you optimize your Presto clusters on your own with existing solutions, as well as evaluate how adding an indexing layer can help. Stay tuned for the open source version!
To execute queries, Presto performs many activities such as scanning, filtering, joining, and aggregating data. By focusing on what workloads truly require and which resources they consume, you can identify the heavy spenders and improve the pipeline:
Presto cluster: CPU utilization by operator
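To make a chart like the one above concrete, here is a minimal, hypothetical sketch of how per-operator CPU time can be derived from the query-detail JSONs that Presto exposes (and that the Workload Analyzer collects). The field names used below (queryStats.operatorSummaries, operatorType, addInputCpu, and friends) are assumptions based on recent Presto versions and may differ in yours; the JSONs are assumed to be saved locally under a queries/ directory.

```python
# Hypothetical sketch, not the Workload Analyzer itself:
# aggregate CPU seconds per operator from query-detail JSONs stored locally.
import glob
import json
import re
from collections import Counter

# Presto serializes durations as strings such as "1.52ms" or "3.40s".
_UNITS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0, "m": 60.0, "h": 3600.0, "d": 86400.0}

def duration_seconds(value):
    match = re.fullmatch(r"([0-9.]+)\s*([a-z]+)", value.strip())
    return float(match.group(1)) * _UNITS.get(match.group(2), 0.0) if match else 0.0

cpu_by_operator = Counter()
for path in glob.glob("queries/*.json"):             # locally stored query JSONs
    with open(path) as f:
        query = json.load(f)
    for op in query.get("queryStats", {}).get("operatorSummaries", []):
        cpu = sum(duration_seconds(op.get(key, "0s"))
                  for key in ("addInputCpu", "getOutputCpu", "finishCpu"))
        cpu_by_operator[op.get("operatorType", "Unknown")] += cpu

for operator, seconds in cpu_by_operator.most_common():
    print(f"{operator:40s} {seconds:12.1f} CPU-seconds")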
The Workload Analyzer offers deep insight into resource utilization (compute, RAM, networking bandwidth) and visibility into which resources are used and when. This makes it easy to optimize for the best configuration, determine the right cluster size based on usage trends, and define scaling rules:
Presto cluster: CPU utilization over time
Presto cluster: RAM utilization by user
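The utilization-over-time trend behind charts like these can be approximated from the same query JSONs. The following hypothetical sketch buckets total CPU seconds by hour of day to spot peak windows for right-sizing and scaling rules; createTime (ISO-8601) and totalCpuTime are assumed queryStats field names.

```python
# Hypothetical sketch: bucket total CPU seconds per hour of day.
import glob
import json
import re
from collections import Counter
from datetime import datetime

_UNITS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0, "m": 60.0, "h": 3600.0, "d": 86400.0}

def duration_seconds(value):
    match = re.fullmatch(r"([0-9.]+)\s*([a-z]+)", value.strip())
    return float(match.group(1)) * _UNITS.get(match.group(2), 0.0) if match else 0.0

cpu_by_hour = Counter()
for path in glob.glob("queries/*.json"):
    with open(path) as f:
        stats = json.load(f).get("queryStats", {})
    created = stats.get("createTime")
    if not created:
        continue
    hour = datetime.fromisoformat(created.replace("Z", "+00:00")).hour
    cpu_by_hour[hour] += duration_seconds(stats.get("totalCpuTime", "0s"))

for hour in range(24):
    print(f"{hour:02d}:00  {cpu_by_hour.get(hour, 0.0):10.1f} CPU-seconds")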
Not all data and users are equal. For example, some data elements – sources, tables, partitions – are used more frequently than others. Many clusters follow a power-law distribution, where a small subset of users and data consumes a disproportionate share of resources.
The Workload Analyzer also offers insights into usage patterns, so you can easily identify frequently accessed data that may create bottlenecks on resources such as network bandwidth, and consider storing it differently. It also detects data that causes heavy processing and is therefore a candidate for pre-aggregation or indexing.
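One simple way to surface the "hot" data behind that power-law distribution is to count how often each table appears as a query input. In the hypothetical sketch below, the top-level "inputs" field and its schema/table sub-fields are assumptions about the query-detail JSON and may vary across Presto versions.

```python
# Hypothetical sketch: count how often each table appears as a query input.
import glob
import json
from collections import Counter

table_hits = Counter()
for path in glob.glob("queries/*.json"):
    with open(path) as f:
        query = json.load(f)
    for source in query.get("inputs", []):
        table_hits[f'{source.get("schema")}.{source.get("table")}'] += 1

for table, hits in table_hits.most_common(20):       # the 20 hottest tables
    print(f"{table:50s} {hits:8d} queries")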
Furthermore, the Workload Analyzer helps identify selectivity patterns – how much data a query outputs relative to what it reads once query filters are applied – so you can focus improvements where they will have the most impact:
Presto cluster: Table scan CPU utilization
Presto cluster: CPU utilization by user
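As a rough, hypothetical sketch of how such selectivity numbers can be derived from the same query JSONs, the snippet below estimates scan selectivity (rows kept / rows read) from operatorSummaries. The operator and field names (operatorType, inputPositions, outputPositions) are assumptions, and the per-table attribution is crude when a query reads more than one table.

```python
# Hypothetical sketch: estimate scan selectivity per set of input tables.
import glob
import json
from collections import Counter

rows_read, rows_kept = Counter(), Counter()
for path in glob.glob("queries/*.json"):
    with open(path) as f:
        query = json.load(f)
    tables = ",".join(sorted(f'{i.get("schema")}.{i.get("table")}'
                             for i in query.get("inputs", []))) or "unknown"
    for op in query.get("queryStats", {}).get("operatorSummaries", []):
        if "Scan" in op.get("operatorType", ""):
            rows_read[tables] += op.get("inputPositions", 0)
            rows_kept[tables] += op.get("outputPositions", 0)

for tables, read in rows_read.most_common():
    selectivity = rows_kept[tables] / read if read else 0.0
    print(f"{tables:60s} read={read:>15,d}  kept={selectivity:6.1%}")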
Small changes to a query's SQL can result in dramatic changes in performance and resource consumption. For example, reordering joins so that a small right-hand side is joined with a large left-hand side can yield a significant improvement, and using an approximate distinct count instead of an exact one is at times an acceptable trade-off between accuracy and query time.
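As a hypothetical illustration of both rewrites (the table and column names are made up), here is what they can look like:

```python
# Hypothetical rewrite examples; page_views / users and their columns are made up.

# Exact distinct count -- accurate, but memory- and CPU-hungry at scale:
exact = "SELECT count(DISTINCT user_id) FROM page_views"
# approx_distinct() trades a small, bounded error for a much cheaper query:
approximate = "SELECT approx_distinct(user_id) FROM page_views"

# Join ordering: keep the large table on the left (probe side) and the small
# table on the right (build side), so the in-memory hash table stays small:
reordered_join = """
SELECT v.url, u.country
FROM page_views v        -- large table on the left
JOIN users u             -- small table on the right
  ON v.user_id = u.user_id
"""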
You can also identify if and where you can leverage the performance-boosting potential of Dynamic Filtering. This is extremely relevant for the very common star-schema DWH architecture, where a large (fact) table sits on one side of the join and a much smaller one (a dimension table, or a highly selective large table) on the other.
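A typical shape of a query that benefits looks like the hypothetical example below (all names are made up). Where dynamic filtering is available and enabled, the highly selective predicate on the small dimension table can be used to skip most of the large fact table's rows or splits at scan time.

```python
# Hypothetical star-schema join where dynamic filtering can pay off:
star_schema_join = """
SELECT f.order_id, f.amount
FROM orders_fact f              -- large fact table
JOIN store_dim d                -- small dimension table
  ON f.store_id = d.store_id
WHERE d.region = 'EMEA'         -- highly selective dimension filter
"""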
Presto provides a lot of detailed information about query execution and exposes it via a REST API, in JSON format.
This standalone tool collects and locally stores query JSONs from Presto for the recording period. The tool's data collection has negligible compute needs and does not impact cluster query execution in any way. The Presto statistics stored in these JSONs can then be analyzed for the recorded period using an open-source Python script that runs locally.
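Conceptually, the collection step can be as simple as the sketch below (the Workload Analyzer's actual implementation differs). It assumes an unauthenticated coordinator reachable over HTTP at a made-up address; real deployments may require authentication headers, TLS, and handling of result paging.

```python
# Hypothetical collection sketch using Presto's REST API: /v1/query lists recent
# queries, /v1/query/<id> returns the full statistics JSON for one of them.
import json
import os
import requests

COORDINATOR = "http://localhost:8080"    # made-up coordinator address
os.makedirs("queries", exist_ok=True)

for summary in requests.get(f"{COORDINATOR}/v1/query", timeout=30).json():
    if summary.get("state") != "FINISHED":        # keep only completed queries
        continue
    query_id = summary["queryId"]
    detail = requests.get(f"{COORDINATOR}/v1/query/{query_id}", timeout=30)
    with open(f"queries/{query_id}.json", "w") as f:
        json.dump(detail.json(), f)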