While both platforms are used to extract value off massive amounts of data in the data lake, Athena and Presto / Trino have benefits and drawbacks for different use cases — this post will highlight when and how to extract the most value of each.
If you are reading this blog, you are seriously considering implementing a data lake architecture or have already figured out that you need to use PrestoDB or Trino (AKA PrestoSQL) to support interactive analytics use cases and data-driven culture.
But should you start with a simple solution like Amazon Athena, which is based on Presto — or manage your own PrestoDB or Trino clusters? To make an informed decision, the first step is to understand the pros and cons of each solution, as well as what they offer in the context of your specific use cases, customer requirements and SLAs, data sets, and queries.
Amazon Athena is a serverless query service, providing the easiest way to run ad-hoc queries for data in S3 without the hassle of setting up and managing clusters. Just point it at the data and get started for $5 per terabyte scanned. Athena makes it easy for anyone with SQL skills to quickly analyze large-scale datasets and is a great choice for getting started with analytics if you have nothing set up yet.
Presto and Trino are open-source distributed query engines running on a cluster of machines.
Both Presto and Trino were designed to be flexible, and extensible. Both support a wide variety of use cases with diverse characteristics. Presto and Trino are available as vanilla deployments in the cloud or on-premises, as well as a managed solution (e.g. AWS EMR, GCP Dataproc, Starburst Data, Ahana and more), and have been proven at scale in a variety of use cases at Facebook, Airbnb, Comcast, Netflix, Twitter, Uber and many more.
Since Presto and Trino are open-source, you only pay for the infrastructure used. For an additional incremental cost, you can use managed solutions offered by public cloud vendors such as Amazon EMR and GCP Dataproc, which makes it simple and cost effective to run Presto and Trino when compared to on-prem or self-managed cloud deployments. Alternatively, you can pay even more for fully managed commercial solutions such as Ahana and Starburst Data, that offer improved performance, support and security — while making it easy to deploy, connect and manage a Presto environment.
In this analysis we’ll zoom in on Trino and compare it to AWS Athena. You can easily apply these observations to PrestoDB.
Athena and Trino aim to provide users with the same benefits of data lake analytics , but they differ substantially in cost and scale, visibility and control, and setup and integration considerations. Below we outline the main differences between the two platforms.
Amazon Athena is priced per query. Charges are based on the amount of data scanned by the query – $5 per terabyte, with a 10MB minimum per query. You can save on query costs and achieve better performance by compressing, partitioning, and converting your data into columnar formats. Each of these operations reduces the amount of data Amazon Athena needs to scan to execute a query.
The Trino pricing scheme enables you to drastically lower the cost per query when you deploy Trino yourself. This is doubly true if you run queries extensively and scan significant amounts of data for each query. A basic Presto cluster of 10 workers and 1 coordinator on AWS EMR using 2x r5.8xlarge + r5.xlarge will cost $5 per hour. This enables users to run a full hour of concurrent queries that scan hundreds of terabytes of data — for the same cost of one Athena query scanning one terabyte of data.
One of AWS Athena’s biggest limitations is that it’s a shared, multi-tenant service. As such it cannot guarantee good predictable performance: it’s hard to predict SLA because for each query execution a different cluster size might be allocated under the hood, and managing the allocated resources is impossible. Trino enables users to manage resources, so SLAs are more controllable and predictable.
Scalability and concurrency are also a major concern. Athena by default does not allow more than 20 concurrent active queries (running and queued) per account. If your total of running and queued queries exceeds 20, query 21 will result in a “too many queries” error.
In Trino the concurrency limits are affected by the coordinator’s available CPU and the cluster’s available memory. Trino cluster size can be scaled up and down as required, and admins can choose the machine types and their number in order to optimize cost-performance ratio. This option is not available with Athena, being a serverless solution.
Additionally, Athena limits by default the runtime of each query to 30 minutes and then cancels the query without charge. When using a lambda function, there’s a limit of 15 minutes. With Trino there is no real upper limit on a query, but admins can use query.max-execution-time to set the time after which queries will be aborted.
Query Optimizations. Trino can leverage table statistics for its cost-based optimizer (CBO). Since Athena engine version 2 is based on Presto 0.217 in which CBO was introduced, we would expect that Athena 2 would leverage CBO as well. However, CBO can’t be fully leveraged since Athena users usually rely on Glue metastore, which only enables to update stats manually but not automatically using the ANALYZE command.
In addition, Athena can run out of resources — users will get a “Query exhausted resources at this scale factor“ error — a problem that can’t be mitigated since it’s impossible to tweek parameters or enlarge the underlying cluster size. Trino, on the other hand, does allow users to set session parameters, which can significantly boost query performance.
Monitoring & Debugging. The Trino web UI is a great query monitoring tool, showing all executed (and failed) queries along with performance statistics which enable users to fine-tune their cluster for faster and cheaper queries. Athena’s console displays only the amount of data scanned per query. You can run EXPLAIN on both platforms to see the query plan, but Athena doesn’t support EXPLAIN ANALYZE which collects runtime statistics that are beneficial for calculating the actual cost of each operation.
In Trino you can also look at query logs and analyze your workloads. To make it even easier, Varada recently released a new open source tool that enables data platform teams to analyze, monitor and optimize Presto clusters. The Presto Analyzer offers workload-level insights on how Trino and PrestoDB clusters perform, and enables to instantly identify heavy users and bottlenecks and effectively optimize performance and concurrency.
Trino version. Athena supports some, but not all, of Presto’s functions and features. Athena makes infrequent updates to the platform: the Athena engine version 2 is based on Presto release 0.217 (originally released on Feb. 2019), while the current Presto release is 0.259. The Trino community is exceptionally active in developing and introducing new features.
Supported infrastructure. Trino can run on-premises and on multiple cloud vendors (Dataproc for GCP, EMR on AWS), including Kubernetes. Athena runs on AWS.
Supported data connectors. Athena can connect and query managed AWS services via the federated queries capability, and enables users to write their own custom connectors (including on-prem data sources) using the Athena Query Federation SDK. Trino has an impressive set of connectors out of the box that enables to query not just data on S3 and MySQL instances (via JDBC), but also on non-relational data stores like MongoDB, Redis, Elasticsearch and even Kafka.
Query editor. Athena has a query editor in which you can run queries and save them. Trino doesn’t offer an organic query editor but can be connected to different IDEs with JDBC connections such as DBeaver.
Athen and Trino can be beneficial for data teams in different scenarios, for example when only ad hoc queries are required, or when SLAs are not a hard requirement. Below are the main use cases and pros and cons of both platforms.
Athena can work for you if you are either running experimental ad hoc queries, or if you query only small datasets in the data lake. It’s also perfect if you want to run interactive ad hoc SQL queries against data on Amazon S3, without having to manage any infrastructure or clusters. And, if consistent performance is not a major concern Athena can work for you in other scenarios as well.
Trino is a manageable and controllable query engine that keeps evolving and can be a good long-term solution for running analytics on the data lake at scale. When SLAs are a hard requirement or the cost is too high, Trino enables some query optimization, which is limited by the platform’s brute force query processing.
To summarize, Trino is a better choice if you’re running analytics on a massive data lake — mostly due to the ability to control and manage the cluster. But if SLA is a hard requirement, in most cases it will be hard to achieve reasonable cost/performance even with Trino. In these cases, a different approach is needed.
Autonomous indexing can remove the need to keep tuning the cluster performance for each new workload and can break the cost/performance glass ceiling.
Varada is an autonomous query acceleration platform which gives data teams control over the performance and cost of their Athena or Presto/Trino clusters. Varada’s dynamic and adaptive big data indexing solution enables to effectively balance performance and cost with zero data-ops.
Varada’s proprietary indexing logic automatically analyzes the data lake and introduces indexes for filtering, joins and aggregates, continuously evaluating query performance on the fly. Varada’s engine automatically prioritizes the data to index or cache based on a smart observability layer that continuously monitors demand. Varada indexes data directly from the data lake across any column. This means that every query is optimized automatically.
Varada has native connectors for PrestoDB and Trino (including Starburst and Ahana) or can be easily deployed as a standalone cluster. Regardless of your choice, you can add Varada to your stack and leverage the advantages of autonomous indexing:
Very fast queries at a lower cost. Data teams using Varada don’t need to compromise on performance to achieve agility and optimal resource utilization on the data lake. With the power of autonomous indexing, caching, intermediate results, and optimized dynamic filtering implementation, teams can accelerate Presto/Trino queries by 10x-100x on their existing cluster — at a 40%-60% cost reduction. As an example, check out this data from our benchmarking of Varada vs. Trino, and vs. Athena.
Deep workload observability. With Varada, data teams can easily set priorities, performance requirements and budget caps. Varada’s workload-level observability component enables seamless monitoring, optimization and acceleration of workloads to meet dynamic business requirements.
Eliminate dataops to get optimal agility and fast time-to-delivery. By autonomously accelerating Athena & Presto/Trino queries, Varada brings down dataops to the bare minimum, enabling data teams to focus on business priorities, deliver on analytics demands faster, and maintain control over data lake analytics cost and performance.