In the battle of analytics solutions, the data lake architecture is quickly becoming the leading alternative. The strong need for agility and flexibility in querying massive amounts of data is the key differentiator, and it is what makes organizations truly data-first.
There are three key approaches to running analytics on the data lake. The basic approach relies on brute force: scan the entire data lake, leveraging compute-based optimizations to do so as fast as possible. Snowflake opted for micro-partitioning, which allows the query engine to retrieve, in parallel, only the data blocks needed to answer a query, based on the query predicate over the partitioning key. Varada took an entirely different approach and brought dynamic indexing to the data lake: by indexing across all data columns, or dimensions, at once, Varada significantly reduces the compute resources needed to meet interactive SLAs.
Indexing is the tried-and-true method of reducing data scans. The query engine uses indexes together with the query predicate to narrow down which records must be retrieved, and only then reads the relevant columns of each matching record for processing.
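To ground this, here is a minimal relational-database sketch (the table, index, and column names are hypothetical): a classic B-tree index resolves a selective predicate without a full table scan.

-- hypothetical names; a classic row-store index on a single column
CREATE INDEX idx_trips_rider ON trips (rider_id);

-- the engine walks idx_trips_rider to locate matching rows first,
-- then fetches only the relevant columns for those rows
SELECT tripid, fare
FROM trips
WHERE rider_id = 3380311;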
Although this has been the prevalent optimization method in relational databases, indexes have not found strong adoption in big data. Data for analytics is typically stored in a columnar layout, which makes it much easier to scan and aggregate but makes any classic row-based indexing approach very inefficient. Partitioning or sorting the data on a few columns has therefore become the popular approach.
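For example, a Trino/Presto-style Hive table is typically partitioned on one or two columns, as in the hedged sketch below (table and column names are illustrative). Filters on d_date prune partitions; a filter on any other column still scans everything in range.

CREATE TABLE trips_data (
    tripid   BIGINT,
    rider_id BIGINT,
    fare     DOUBLE,
    d_date   DATE          -- Hive-style partition columns must come last
)
WITH (
    format = 'PARQUET',
    partitioned_by = ARRAY['d_date']
);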
However, one key element of big data is the large number of dimensions in play due to the data's complexity. It is common to find thousands of columns used daily, making any partitioning scheme woefully inadequate for the business need and reducing the dimensions actually used to only a few. Dynamic indexing eliminates this trade-off. Instead of creating partitions to index several columns, Varada dynamically creates millions of nanoblocks: sub-sections a few dozen kilobytes in size. Each nanoblock stores an index for only a subset of the data, and each nanoblock index is independent and uniquely adapted to that nanoblock. By taking advantage of modern storage systems, such as SSDs, nanoblocks are fast to load and update, and systems like Varada can quickly read and execute queries against nanoblock indexes without the overhead inherent in either traditional indexes or micro-partitions.
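To make the contrast concrete, consider a filter on a non-partition column from the benchmark schema used later in this post. The comments describe the expected behavior under each approach; this is an illustration, not a trace of either engine.

-- partitioned only on d_date, this predicate would force a scan of
-- every partition; with independent per-nanoblock indexes on every
-- column, only nanoblocks that may contain rider_id = 3380311 are read
SELECT tripid, fare, duration
FROM v2_demo_trips.trips_data
WHERE rider_id = 3380311;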
Varada can also dynamically create, modify, and remove both indexes and table columns without rewriting table data and without background rebalancing or clean-up processes.
In this short analysis we measured Varada’s performance and cost per query against Snowflake on four different queries.
The queries include examples of:
Query #1: Full-scan distinct count aggregation that matches the partitioning scheme
Query #2: Cohort selection from the data for a simple counting aggregation
Query #3: Selective projection of many columns in a few rows, a needle-in-a-haystack search
Query #4: Selective join operation between two tables, fact and dimension
As demonstrated below, Varada's unique indexing technology shines across all query types, but especially in selective and highly selective queries.
For the benchmarking analysis, we used a 27bn-row / 1.5TB (snappy-compressed Parquet) dataset, inspired by a popular ride-sharing app. Each query was executed several times and the median result was taken. Because each query ran multiple times, we disabled result caching; only cached data ("warm" runs) was used, to represent active workloads.
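On the Snowflake side, disabling the result cache for a session is a single statement (USE_CACHED_RESULT is a standard Snowflake session parameter; the exact benchmark scripts were not published, so this reflects our reading of the methodology):

-- re-execute each repeated query instead of serving the cached result
ALTER SESSION SET USE_CACHED_RESULT = FALSE;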
Cluster configurations:
Varada outperformed Snowflake on all queries, and, as expected, the value of our unique indexing technology showed most strongly in accelerating selective queries and joins.
To illustrate how much faster Varada is: in Query #3, Varada was almost 30x faster than Snowflake, and in selective joins the uplift is even greater, at over 60x.
When it comes to cost per query, Varada's indexing is extremely lean on compute resources, which helps data teams avoid Snowflake's spiraling costs.
The agility and flexibility benefits of the data lake architecture are clear. But delivering on performance and cost is the critical driving force behind massive data lake adoption.
As analytics use cases come into high demand across almost every business unit, data teams constantly struggle to balance performance and costs. To expand the open data lake across the entire organization, data teams should seek a smart, dynamic solution that autonomously accelerates queries without breaking the budget.
Augmenting Snowflake with a cloud data lake solution delivers significant benefits.
However, SQL workloads over data lakes do not benefit from data warehouse optimizations such as automatic clustering, so they become scan-heavy and expensive. Varada helps bridge that gap.
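Inside the warehouse, Snowflake maintains clustering for a table once a clustering key is declared with standard DDL; data sitting in an external lake gets no equivalent automatic optimization. A one-line sketch (the table name is illustrative):

-- declare a clustering key; Snowflake's automatic clustering service
-- then keeps the table's micro-partitions well-clustered on d_date
ALTER TABLE trips CLUSTER BY (d_date);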
Varada enables fast SQL analytics directly on your cloud data lake and coexists with data warehouses, whether the data lake stands apart from a data warehouse or serves a hybrid environment.
See how Varada’s big data indexing dramatically accelerates queries vs. AWS Athena:
To see Varada in action on your data set, schedule a short demo!
Query #1 (full scan):
SELECT extract(year from "d_date") as year,
       extract(month from "d_date") as month,
       count(distinct "tripid") as cnt_trips
FROM v2_demo_trips.trips_data
WHERE "d_date" between date '2018-03-23' and date '2018-03-25'
GROUP BY 1, 2;
Query #2 (cohort selection):
SELECT "rider_gender", count(*) as cnt FROM v2_demo_trips.trips_data WHERE "d_date" between date '2018-02-01' and date '2018-03-01' AND "rider_age" IN (18,19) AND "d_weekday" = 5 GROUP BY 1 order by 2 desc;
Query #3 (selective project):
SELECT "tripid", "rider_id", "d_date", "t_hour", "fare", "duration", "distance", "rider_age", "rider_gender", "lon", "lat" FROM v2_demo_trips.trips_data WHERE "rider_id" = 3380311 AND "t_hour" between 7 and 10
Query #4 (selective join):
SELECT extract(month from fact."ts") as month,
       count(distinct fact."tripid") as cnt_unq_trips,
       count(distinct fact."rider_id") as cnt_unq_riders,
       round(avg(fact."fare"), 2) as avg_fare,
       round(avg(fact."duration") / 60, 2) as avg_duration_min
FROM v2_demo_trips.trips_data as fact,
     v2_demo_trips.riders_campaign as dim
WHERE dim."rider_id" = fact."rider_id"
  AND dim."segment" IN ('Students_Promotion', 'Churning_Riders_Last3Months')
  AND dim."num_trips" < 9
GROUP BY 1
ORDER BY 1;