Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, from gigabytes to petabytes. It was designed to be adaptive, flexible, and extensible, and it supports a wide variety of use cases with diverse characteristics. As a distributed system commonly used to query data in the Hadoop Distributed File System (HDFS), it features one coordinator node that works with numerous worker nodes. When users submit a SQL query, the coordinator uses a custom query and execution engine to parse the query, plan it, and schedule a distributed query plan across the worker nodes.
The Presto SQL engine was built to support standard ANSI SQL semantics, including aggregations, inner and outer joins, window functions, subqueries, distinct counts, and approximate percentiles. Presto data sources are wide and varied, from HDFS and traditional relational databases to Google Docs, and include NoSQL data sources such as Cassandra.
Thanks to these advantages, and because it can be tuned and accelerated further, Presto has quickly become a tool of choice for many significant data-driven companies. In this post, we’ll review 5 features that make Presto great, followed by 5 features that would make Presto even greater if it had them.
Querying data where it lives is the feature that sets Presto apart from other data platforms. Presto does not have its own internal storage. Instead, it connects to other data stores and reads the data from them (it can also write, but Presto shines at reading pre-existing data). Presto can connect to a wide variety of data stores: all common SQL and NoSQL databases, as well as unmodeled data read directly from S3 and HDFS. The ability to read raw, unmodeled data directly from S3 saves you from waiting for an ETL to pre-process your data, so you can access it as soon as it is stored in your data lake. This gives you a very short time-to-insight on your fresh data, no matter where it is stored.
Not only that, with Presto, you can use multiple data sources at once. One query can cross reference data from numerous data sources!
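To make the cross-source idea concrete, here is a minimal sketch of what such a query could look like. Presto addresses tables as `catalog.schema.table`, so one statement can join tables that live in different systems. The catalog, schema, table, and column names below are hypothetical, and the snippet only builds the SQL text; sending it to a cluster is left to whichever client you use.

```python
def qualified(catalog: str, schema: str, table: str) -> str:
    """Build a fully qualified Presto table name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


# Hypothetical catalogs and tables, for illustration only.
ORDERS = qualified("hive", "sales", "orders")       # e.g. files in a data lake
CUSTOMERS = qualified("mysql", "crm", "customers")  # e.g. an operational database

# A single query that joins data from both sources.
FEDERATED_QUERY = f"""
SELECT c.country, sum(o.total) AS revenue
FROM {ORDERS} AS o
JOIN {CUSTOMERS} AS c ON o.customer_id = c.id
GROUP BY c.country
"""

print(FEDERATED_QUERY)
```

Which catalogs are available depends entirely on how the connectors are configured on your own cluster.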
Presto performs its processing in memory. Unlike many other distributed solutions, it does not store intermediate results on disk. This mode of operation keeps the distributed processing cycle as streamlined as possible.
Presto has several features that greatly speed up query planning and execution. We’ll discuss a couple here. The first is well known from other databases: Cost Based Optimization. Presto estimates the cost of candidate plans for a given query, based on table statistics and resource availability, and chooses the one it expects to be most efficient. This greatly reduces query processing time.
If that’s not enough, Presto was recently enhanced with Dynamic Filtering (by our very own Varada developer Roman Zeyde!). With Dynamic Filtering, the join keys collected from one side of a JOIN are used to filter the other side before the tables are joined, so rows that cannot match are dropped early. This has been shown to reduce query execution time by a factor of 3x to 10x! Full details can be found in Roman’s blog post.
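The idea behind Dynamic Filtering can be illustrated with a small in-memory simulation. This is a toy sketch, not Presto's implementation: the function name, the table contents, and the use of Python lists and dicts as "tables" are all assumptions made for the example.

```python
def join_with_dynamic_filter(build_rows, probe_rows, key):
    """Toy hash join that prunes the probe side before joining.

    Returns the joined rows and the number of probe rows that
    survived the filter.
    """
    # 1. Build a hash table from the (usually smaller) build side.
    build_index = {}
    for row in build_rows:
        build_index.setdefault(row[key], []).append(row)

    # 2. The "dynamic filter": the set of join keys seen on the build side.
    dynamic_filter = set(build_index)

    # 3. Apply the filter while scanning the probe side, so rows that
    #    cannot possibly match never reach the join.
    surviving = [row for row in probe_rows if row[key] in dynamic_filter]

    # 4. Join only the surviving rows.
    joined = [{**b, **p} for p in surviving for b in build_index[p[key]]]
    return joined, len(surviving)


# Hypothetical data: one store left after filtering, four sales rows.
stores = [{"store_id": 1, "city": "Paris"}]
sales = [
    {"store_id": 1, "amount": 10},
    {"store_id": 2, "amount": 20},
    {"store_id": 1, "amount": 5},
    {"store_id": 3, "amount": 7},
]

joined, scanned = join_with_dynamic_filter(stores, sales, "store_id")
print(scanned, "of", len(sales), "sales rows reached the join")
```

In this example only two of the four sales rows reach the join. In a real engine the saving comes from skipping those rows as early as possible, ideally while reading from the data source.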
Columnar formats have become the de facto standard for big data analytics. Presto does not mandate how the data is stored, but it processes the data as vectorized columns. This means that Presto holds only the columns a query needs, without carrying around fields that are not relevant to the query being processed. Moreover, for columnar data sources, this optimization extends all the way to reading and writing the data.
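As a rough illustration of why the columnar layout helps, the sketch below pivots a handful of rows into columns and then answers an aggregate query by touching only the two columns it needs. The data and helper names are invented for the example, and plain Python lists stand in for the vectorized column blocks a real engine would use.

```python
def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_by(columns, group_col, value_col):
    """Sum value_col per group_col, reading only those two columns."""
    totals = {}
    for group, value in zip(columns[group_col], columns[value_col]):
        totals[group] = totals.get(group, 0) + value
    return totals


# Hypothetical rows with a wide "notes" field the query never needs.
rows = [
    {"user": "a", "country": "US", "amount": 10, "notes": "long text"},
    {"user": "b", "country": "DE", "amount": 20, "notes": "long text"},
    {"user": "c", "country": "US", "amount": 5, "notes": "long text"},
]

columns = to_columns(rows)
totals = total_by(columns, "country", "amount")
print(totals)
```

The `user` and `notes` columns are never read by `total_by`, which is the point: the wider the table, the more work a columnar engine avoids.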
Supporting standard ANSI SQL makes using Presto a breeze. Whether you’re a data analyst or a developer, not having to learn a dedicated language to query your data is always a big plus. Moreover, Presto easily connects to all common BI tools through its JDBC driver.
With big data analytics, there is no such thing as fast enough. This is especially true for situations that require complex queries over a big data set, such as running a BI dashboard over a big data lake or a user-facing, data-driven application. And although Presto lets you build these without an intermediary system, the performance will be lacking.
A Presto cluster consists of a single Coordinator node and a slew of Worker nodes. The Worker nodes are elastic by design: you can easily add or remove workers as your needs change.
Presto does not have a caching layer. This means that if you have “hot” queries that run again and again, Presto will not serve them any faster on repeat runs. To be fair, you may have a caching layer in the underlying data source.
Some solutions, like Spark, allow your queries to withstand node failure. In Presto, if a worker node fails while a query is running, that query fails. This can be troublesome, although queries that finish quickly make it easier to swallow.
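Since a failed query has to be rerun from the outside, one common workaround is a retry wrapper on the client side. The sketch below is one way this might look. `QueryFailed` and the `run_query` callable are hypothetical stand-ins for whatever error and submit function your own client library exposes; nothing here is part of Presto itself.

```python
import time


class QueryFailed(Exception):
    """Hypothetical error raised when a query dies, e.g. after a worker failure."""


def run_with_retry(run_query, sql, attempts=3, backoff_seconds=0.0):
    """Call run_query(sql), rerunning the whole query if it fails."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return run_query(sql)
        except QueryFailed as error:
            last_error = error
            if attempt < attempts:
                # Wait a little longer before each new attempt.
                time.sleep(backoff_seconds * attempt)
    raise last_error


# Usage with a fake client that fails twice and then succeeds.
calls = {"count": 0}


def flaky_run_query(sql):
    calls["count"] += 1
    if calls["count"] < 3:
        raise QueryFailed("worker node lost")
    return [("ok",)]


result = run_with_retry(flaky_run_query, "SELECT 1")
print(result, "after", calls["count"], "attempts")
```

Note that every retry restarts the query from scratch, so this only softens the problem for queries that are cheap enough to run again.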
The Coordinator, however, is a single point of failure. It is the point of contact for the client and does all the query planning and resource management. When the Coordinator fails, the whole cluster is unavailable for queries. Unfortunately, in an otherwise enterprise-ready product, Presto has no inherent HA solution for it.
Presto is an open source project. This has advantages (rapid development cycle, a strong developer community to confer with) but also raises some challenges.
The downside of using an open source tool is that it’s all on you: installing, configuring, and maintaining the application is your responsibility. Most big companies that employ Presto have a skilled team dedicated to it. Hiring such highly skilled staff is time consuming, and maintaining the team is expensive. This often makes the “free” open source solution just as costly as an enterprise solution.
Varada extends the goodness of Presto with an indexing layer on top of hot data. Queries can now run 10x to 100x faster, even across thousands of dimensions.
Varada’s Inline Indexing™ technology enables large and complex datasets to optimally serve analytics users and apps by making filtering, joining and aggregating data extremely fast on every dimension of any data source.
Presto is a wonderful fit for our innovative solution: we offload SQL parsing, distributed planning, and execution to Presto, and we concentrate on reading the data as quickly as possible.
Varada’s solution is deployed in your virtual private cloud (VPC) and offers an inherent HA solution for the Coordinator. The platform keeps a hot standby running, and if the primary Coordinator fails, the secondary takes over smoothly.
If you need a predictable and consistent SLA plus the great flexibility and fast time-to-insight that Presto provides, take Varada for a spin! Click here to schedule a short demo.