With roughly 80% of compute resources “wasted” on ScanFilter operations, partitioning can only go so far in reducing data reads and optimizing query performance across multiple workloads. Here’s why you should be using indexing instead.
Presto and Trino have quickly become the tools of choice for data-driven companies looking for the agility and flexibility of the data lake architecture. As demand for analytics use cases grows across almost every business unit, data teams are desperate for a way to balance cost and performance. The single most common problem in data lake analytics ROI relates to the fact that Presto and Trino are based on brute-force query processing.
Best practices tell us that a good partitioning strategy, aligned with the dimensions that are frequently used in query filters, can go a long way toward limiting the volume of data scanned, dramatically accelerating queries, and reducing costs. But since Presto still has to comb through all of the data in the partitions that were not filtered out, partitioning can only take you so far.
But filtering by only a few columns and partitions is no longer enough. As data-driven companies treat their data as one of their most strategic assets and a driver of their competitive edge, data consumers constantly look for deeper insights. This means running more sophisticated, complex queries that leverage hundreds and often thousands of columns to drive smart decisions.
In fact, we’ve discovered that on average, customers using PrestoDB and Trino clusters waste 80% of compute resources on the ScanFilter operation, i.e., on reading and filtering the underlying data, regardless of their partitioning strategy.
This means partitioning alone is not enough to reduce data reads and optimize query performance across multiple workloads.
Partitioning is the closest you can get to indexing in a cloud data lake. To reduce query time and cost, the best partitioning strategy will let queries filter out as many partitions as possible, enabling PrestoDB / Trino to scan as little data as possible. Data is commonly partitioned by time, often using up to three levels of partitioning (e.g. year, month, day), in order to facilitate queries that filter by time. In multi-tenant use cases there is another level of partitioning by tenant as well.
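As an illustrative sketch of this layout (the table and column names are hypothetical), a time- and tenant-partitioned table in Trino’s Hive connector might look like this — note that partition columns must come last in the column list:

```sql
-- Hypothetical events table, partitioned by time and by tenant.
CREATE TABLE hive.analytics.events (
    event_id   varchar,
    event_type varchar,
    payload    varchar,
    year       varchar,
    month      varchar,
    day        varchar,
    tenant_id  varchar
)
WITH (
    format = 'PARQUET',
    partitioned_by = ARRAY['year', 'month', 'day', 'tenant_id']
);

-- A query that filters on the partition columns prunes partitions,
-- so only a small fraction of the data lake is scanned:
SELECT count(*)
FROM hive.analytics.events
WHERE year = '2021' AND month = '06' AND day = '15'
  AND tenant_id = 'tenant_42';
```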
However, despite the best efforts, queries still read massive amounts of data: the partitioning strategy does not reduce data reads for most queries, which also filter on columns other than the partition columns. Partitioning likewise doesn’t help when query patterns are dynamic and involve predicates on more than a few columns. Another challenge is partitioning by multiple dimensions or by high-cardinality columns — this may result in a long-tailed data distribution across partitions and extremely small files per partition. Such small files on the data lake can significantly degrade query performance.
This is a huge challenge in cases where SLA requirements are strict (i.e. interactive performance), several workloads coexist, and the partitioning strategy needs to enable fast analytics for all use cases. For example, consider a scenario where customer-facing apps and internal analytics workloads coexist on the data lake. The data might be partitioned by date and by tenant in order to serve the customer-facing app workload in a timely manner, but that layout does not support cross-tenant event analysis workloads (to answer business questions such as “get all tenants who had event X in the last week”). Queries in the event analysis workload won’t benefit from the partitioning strategy, which results in reading all tenants’ partitions and long query response times.
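To make the mismatch concrete, here is a sketch of the cross-tenant question above against the same hypothetical date-and-tenant-partitioned table. Because the filter is on `event_type` — not a partition column — and no single `tenant_id` is specified, partition pruning only helps on the date dimension:

```sql
-- "Get all tenants who had event X in the last week" (hypothetical schema).
-- Date partitions are pruned, but every tenant's partitions within those
-- dates must still be read and filtered row by row on event_type.
SELECT DISTINCT tenant_id
FROM hive.analytics.events
WHERE year = '2021' AND month = '06'
  AND day BETWEEN '08' AND '14'
  AND event_type = 'X';
```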
If a company wants both internal analytics across multiple customers and external analytics that present data to each customer separately, it can make sense to duplicate the table data and use time- and event-based partitioning for internal analytics, and time- and tenant-based partitioning for the customer-facing analytics — but that means the data lake will no longer exist as a single source of truth. Another option would be to try to consolidate both use cases and partition by date, tenant, and event. With 50 tenants and 40 event types, we end up with 730,000 partitions for just the last year of data (50 × 40 × 365 = 730,000). This enormous number of partitions will most likely result in very small files, long listing times for the query engine, and overall poor query performance.
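A sketch of that consolidated layout (again with hypothetical names) makes the explosion easy to see:

```sql
-- Hypothetical consolidated layout: partitioned by date, tenant, and event.
CREATE TABLE hive.analytics.events_consolidated (
    event_id   varchar,
    payload    varchar,
    event_date varchar,
    tenant_id  varchar,
    event_type varchar
)
WITH (
    format = 'PARQUET',
    partitioned_by = ARRAY['event_date', 'tenant_id', 'event_type']
);
-- 365 dates x 50 tenants x 40 event types = 730,000 partitions for one
-- year of data, most of them holding only a handful of tiny files.
```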
In short, in today’s modern data analytics era, data challenges aren’t necessarily related to the number of rows — a complexity that can be solved through several approaches, including partitioning.
Rather, data challenges arise from the multi-dimensionality of data — hundreds and often thousands of columns — making partitioning simply ineffective at reducing data reads.
Don’t believe us? Check it out for yourself using our Presto Workload Analyzer (free and available on GitHub) to get accurate data on how much compute is actually spent on ScanFilter operations. The Analyzer offers deep, actionable insights and unprecedented observability for Presto and Trino clusters, including Starburst, AWS EMR, Dataproc, etc. You can review our tips to get started and optimization tips to learn more about the Analyzer and its reports. You can also review a sample report of the Analyzer.
Once you’ve benchmarked your wall time usage by operator type (see chart #17 in the Analyzer report), you can evaluate the effectiveness of your partitioning strategy. If your results are significantly higher than the benchmark, chances are that you need to re-evaluate your partitioning strategy in order to support more workloads and reduce ScanFilter to expected levels.
In order to do this, you’ll need to identify top tables by scan time consumption (chart #18 in the Analyzer report). For these tables, compare the partitioning layout with the most common predicates to make sure that the partitioning strategy fits the query patterns and effectively reduces the amount of data being read by queries.
But remember: despite your best optimization efforts, even with the best partitioning strategy you may still spend around 80% of resources scanning your data. So, how can you reduce full scans? Indexing!
Indexing is the tried and true method of creating separate structures that can be used to quickly identify where to retrieve records from the data set. Indexing is multi-dimensional by nature and is extremely effective for selective and highly selective queries that filter on many columns. But implementing indexing on truly massive data sets is not feasible for most organizations — it’s a complicated and expensive development effort that requires deep skills and expertise to deliver. The key to big data indexing is a dynamic, intelligent indexing system: a perfect fit for massive data sets with strict, interactive performance requirements, and one that keeps data lake ROI optimal.
Varada is an autonomous query acceleration platform that gives data teams control over the performance and cost of their cloud data lake analytics. Varada leverages dynamic and elastic indexing and caching, an efficient scan and predicate pushdown implementation, and an optimized dynamic filtering implementation to accelerate SQL queries by 10x-100x and reduce costs by 40%-60%. As an example, check out this data from our benchmarking of Trino vs. Varada.
Varada’s autonomous indexing eliminates the need for full scans and accelerates queries automatically, with no additional DataOps, no overhead to query processing, and no background data maintenance. This reduces the amount of data scanned by orders of magnitude.
Because query patterns keep changing (and we can’t keep updating the data layout), and because partitioning can’t really reduce the amount of data being read, Varada’s autonomous indexing can serve multiple workloads without compromising the performance provided by the existing partitioning strategy and data layout.
Enabling Trino clusters to leverage Varada’s acceleration is as simple as installing a JAR file on the nodes (“Connector” deployment). Any query using presto-hive to access the data lake, such as S3 or HDFS, can benefit from Varada’s unprecedented query performance.