Big data has created a massive challenge for most companies, one that requires sophisticated tools for better data management. One of these tools is Trino, an open-source distributed SQL query engine for ad hoc and batch ETL queries against multiple types of data sources.
But like any solution, Trino isn’t perfect. In particular, one of the main downsides of federated queries is that they can involve trade-offs in speed. These can be caused by a lack of metadata stored and managed by Trino that it could use to run queries more efficiently.
In addition, Trino (formerly PrestoSQL) grew out of Presto, which was initially developed at Facebook, a company that essentially runs its own cloud. For Facebook, expanding its clusters as it needs more speed isn’t a huge problem. However, for other organizations to get the same level of performance, they might need to spend more money to add more machines to their clusters. This can become very expensive, and all for the sake of managing unindexed data.
This is where products like Varada can optimize queries that are running on Trino.
In this article, we are going to discuss how to improve performance, lower costs, and optimize queries by using Varada.
Varada is a platform that leverages a dynamic and adaptive indexing technology, making it easier for businesses to query data across services.
By indexing data in Trino, Varada reduces the CPU time spent on data scanning. This frees the CPU up for other tasks such as aggregations, joins, and other complex data manipulation. In turn, these optimizations allow SQL users to run queries across dimensions and facts, perform other types of joins, and do data lake analytics on indexed data as federated data sources.
Varada’s goal is to help companies close the gaps in data lake analytics. Data lakes are designed to give analysts and end users access to data and to deliver agility and quick time-to-market for analytics. But in trying to meet performance and cost requirements, data teams are held back by heavy data ops and ineffective query engines.
Trino on its own often attempts to brute-force queries, and it doesn’t track metadata that could be used to improve them. In contrast, Varada uses its proprietary indexing technology to help companies gain efficiencies that lead to faster queries and improved workloads.
Varada is on a mission to deliver the power of big data indexing to dramatically accelerate queries on the data lake. It does so by automatically learning the needs of a query, or of a group of related business queries (a workload), analyzing the columns, the partition level, and the actual operations that need to be performed on the data.
But what is the foundation of Varada’s performance?
Varada has developed a unique indexing mechanism, called nanoblock indexing, that is optimized for fast analytical queries. Instead of storing one large index for each column that the user selects, Varada dynamically creates millions of nanoblocks, which are subsections of the indexed column, each a few dozen kilobytes in size.
To ensure fast performance for every query, Varada selects, for each nanoblock, from a set of indexing algorithms and indexing parameters that adapt and evolve as the data changes, so that every nanoblock gets the best-fit index.
These indexes become useful when data analysts and engineers use operations such as filters, joins, and aggregations. These SQL operations use the metadata stored in Varada’s indexes and nanoblocks to reduce search time and improve performance.
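To make the idea concrete, here is a minimal sketch of block-level indexing in Python. It is not Varada’s implementation, which is proprietary. The block size, the cardinality threshold, the two index kinds, and the names `build_blocks` and `find_equal` are all assumptions chosen for illustration. The sketch splits a column into small blocks, keeps min/max metadata per block, picks an index type per block, and uses the metadata to skip blocks that cannot match a filter.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import List, Sequence, Tuple

BLOCK_ROWS = 4        # hypothetical block size, tiny so the example is readable
LOW_CARDINALITY = 2   # hypothetical threshold for choosing an index type


@dataclass
class Block:
    start: int     # row offset of this block within the column
    lo: int        # smallest value in the block (metadata)
    hi: int        # largest value in the block (metadata)
    kind: str      # "inverted" or "sorted" (illustrative index types)
    index: object  # dict value -> offsets, or sorted list of (value, offset)


def build_blocks(column: Sequence[int]) -> List[Block]:
    """Split a column into small blocks and index each one separately."""
    blocks = []
    for start in range(0, len(column), BLOCK_ROWS):
        chunk = column[start:start + BLOCK_ROWS]
        if len(set(chunk)) <= LOW_CARDINALITY:
            # Few distinct values: map each value to its row offsets.
            index = {}
            for offset, value in enumerate(chunk):
                index.setdefault(value, []).append(offset)
            kind = "inverted"
        else:
            # Many distinct values: sorted pairs allow binary search.
            index = sorted((value, offset) for offset, value in enumerate(chunk))
            kind = "sorted"
        blocks.append(Block(start, min(chunk), max(chunk), kind, index))
    return blocks


def find_equal(blocks: List[Block], target: int) -> Tuple[List[int], int]:
    """Return (matching row numbers, number of blocks actually probed)."""
    rows, probed = [], 0
    for block in blocks:
        if target < block.lo or target > block.hi:
            continue  # min/max metadata proves the block cannot match
        probed += 1
        if block.kind == "inverted":
            offsets = block.index.get(target, [])
        else:
            i = bisect_left(block.index, (target, -1))
            offsets = []
            while i < len(block.index) and block.index[i][0] == target:
                offsets.append(block.index[i][1])
                i += 1
        rows.extend(block.start + offset for offset in offsets)
    return rows, probed


column = [1, 1, 2, 2, 5, 9, 7, 8, 3, 3, 3, 3, 9, 4, 6, 2]
blocks = build_blocks(column)
rows, probed = find_equal(blocks, 9)
print(rows, probed)
```

In this toy column, a filter on the value 9 only probes two of the four blocks, because the other two are ruled out by their min/max metadata. A real engine would work on far larger blocks and richer index types, but the skip-then-probe pattern is the same.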
The ability to dynamically accelerate different datasets is at the core of Varada’s solution. Varada’s big data indexing technology is designed to continuously monitor and learn which datasets within the massive data lake are frequently used, or are required to meet the performance requirements of high-priority workloads. Using this feedback loop, different datasets are dynamically and automatically operationalized through indexing, caching, storing intermediate results, or any combination of these that delivers the best balance of performance and price.
Trino on its own has a lot of benefits, but balancing speed against cost is a major issue you can run into when running large queries.
To demonstrate the speed and cost savings Varada provides to Trino users, Varada benchmarked its solution against Snowflake.
According to Varada’s query tests, it outperformed Snowflake on all queries, as demonstrated in the chart above.
The short analysis covered four different queries, each focused on one of the scenarios below:
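The feedback loop described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not Varada’s algorithm: the `WorkloadMonitor` class, the `budget` parameter, and the rule "index the most frequently filtered columns" are all assumptions made for the example.

```python
from collections import Counter
from typing import Iterable, List


class WorkloadMonitor:
    """Toy feedback loop: watch which columns queries filter on and
    keep only the most frequently used ones accelerated."""

    def __init__(self, budget: int) -> None:
        self.budget = budget          # hypothetical cap on accelerated columns
        self.filter_hits = Counter()  # how often each column appears in a filter

    def observe(self, filtered_columns: Iterable[str]) -> None:
        """Record the columns one query filtered on."""
        self.filter_hits.update(filtered_columns)

    def columns_to_index(self) -> List[str]:
        """Pick the hottest columns, breaking ties alphabetically."""
        ranked = sorted(self.filter_hits.items(), key=lambda kv: (-kv[1], kv[0]))
        return [column for column, _ in ranked[: self.budget]]


monitor = WorkloadMonitor(budget=2)
monitor.observe(["user_id", "country"])
monitor.observe(["user_id"])
monitor.observe(["event_date", "user_id"])
monitor.observe(["event_date"])
first_choice = monitor.columns_to_index()

# The workload shifts: "country" becomes the hottest column.
for _ in range(3):
    monitor.observe(["country"])
second_choice = monitor.columns_to_index()
print(first_choice, second_choice)
```

The point of the sketch is that the set of accelerated columns is recomputed from observed usage, so it changes as the workload changes. A production system would also weigh storage cost, query priority, and the choice between indexing and caching.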
Query #1: Full scan distinct count aggregation that matches the partitioning scheme
Query #2: Cohort selection from the data for a simple counting aggregation
Query #3: Selective projection of many columns from a few rows, in a needle-in-a-haystack search
Query #4: Selective join operation between two tables, fact and dimension
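To give a feel for what these four query shapes might look like, here is a hedged sketch with example SQL held in a Python dictionary. These are not the queries Varada ran. The `events` and `users` tables, every column name, and the literal values are hypothetical, chosen only to match the descriptions above.

```python
# Hypothetical schema: events(event_date, user_id, session_id, country,
# device, url, referrer, duration_ms) partitioned by event_date, and
# users(user_id, plan). None of this comes from Varada's benchmark.
EXAMPLE_QUERIES = {
    "q1_full_scan_distinct_count": """
        SELECT event_date, COUNT(DISTINCT user_id) AS daily_users
        FROM events
        GROUP BY event_date
    """,
    "q2_cohort_count": """
        SELECT COUNT(*) AS cohort_size
        FROM events
        WHERE country = 'US' AND device = 'mobile'
    """,
    "q3_needle_in_haystack": """
        SELECT event_date, user_id, country, device, url, referrer, duration_ms
        FROM events
        WHERE session_id = 'a1b2c3'
    """,
    "q4_selective_join": """
        SELECT e.event_date, COUNT(*) AS events_for_plan
        FROM events AS e
        JOIN users AS u ON e.user_id = u.user_id
        WHERE u.plan = 'enterprise'
        GROUP BY e.event_date
    """,
}

for name, sql in EXAMPLE_QUERIES.items():
    print(name, "->", " ".join(sql.split()))
```

Queries like the second, third, and fourth touch only a small fraction of the rows, which is where an index has the most room to help compared with a full scan.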
As demonstrated below, Varada’s unique indexing technology outperforms competing solutions across all query types, but especially in selective and highly selective queries.
But Varada doesn’t just improve performance. Varada also aims to improve the overall cost when using Trino.
When it comes to cost per query, Varada was developed to reduce cost by using multiple techniques that range from smart indexing to caching. This matters especially with Trino, which can occasionally attempt to brute-force queries for performance at great expense.
Trino on its own can be difficult to manage. Many data professionals have said that data lakes can quickly become data swamps that do not scale with company needs.
This is where Varada’s Data Platform can take data lakes to the next level by providing capabilities that go beyond simply storing data in a cloud platform.
As one of the co-founders stated, “The storage aspects of a data lake have already been solved.”
Now Varada comes in and solves the next set of problems: helping end users scale data analytics quickly. You won’t need data engineers to spend as much time fine-tuning every new data set, an approach that is far from scalable.
Instead, the performance benefits you get from Varada can give your analysts easier and faster access to the data so they can test out their use cases.
Data lakes continue to play a valuable role for many companies, helping them deliver flexible data systems with quick turnarounds for new features and insights. Still, heavy data ops and ineffective query engines make it hard for data teams to meet performance and cost requirements.
Varada’s implementation of its unique indexing technology allows companies to balance cost and performance.
Its goal is to give analysts and data scientists quick access to the vast amounts of data stored in their company’s data lakes.
For example, text analytics queries are significantly accelerated by using integrated Apache Lucene indexing (which Varada manages), enabling analysts, data scientists, and data applications to run blazing-fast text filters without any need for SQL performance tuning or SQL optimizations. Data teams can also easily integrate text search into business intelligence systems and dashboards.
The queries developed by the researchers relied heavily on joins and on multiple types of data (including experiments, gene databases, and medical research), with some of those datasets being very large. Optimizing and accelerating these complex queries to meet the performance requirements required extensive optimizations on the data structure accessed by AWS Athena, further delaying time-to-insights and resulting in a difficult-to-maintain data pipeline environment.
Varada enables the company’s data teams to democratize data by operationalizing the entire dataset while ensuring interactive performance, without the need to move data or optimize its structure. Varada’s acceleration capabilities deliver a suite of indexing and caching strategies.
Observability into data workloads: This allows teams to do more than just ad hoc analysis on their data lake. Instead, they also can build operational dashboards quickly.
Varada’s platform serves as a smart acceleration layer on top of your data lake, which remains the single source of truth, and it runs in the customer’s cloud environment. The secret sauce is the ability to automatically and dynamically index relevant data at the structure and granularity of the source. Varada enables any query to meet various performance and concurrency requirements, running at the same speed as the internal in-memory databases without exponentially growing.
The number of data sources and the volume of data are growing at a rate that data engineers and data infrastructure at most companies will not be able to keep up with. Being able to take advantage of tools like Varada and Trino can therefore help companies improve their query performance and reduce costs.
Varada indexes adapt to changes in data over time, taking advantage of Presto’s vectorized columnar processing by splitting columns into small chunks, called nanoblocks™. In addition, Varada has put together a community to help data engineers and analysts utilize Trino to its fullest.