Varada vs. Snowflake: The Power of Indexing on Your Data Lake

David Krakov
By David Krakov
I
February 22, 2021
February 22, 2021

Data Lake Architecture Gains Momentum

In the battle of analytics solutions, the data lake architecture is quickly becoming the leading alternative. The strong need for agility and flexibility in querying massive amounts of data is the key differentiation factor and what makes organizations truly data-first.

There are three key approaches to running analytics on the data lake. The basic approach is based on brute force, and leverages smart compute-based optimizations to scan the entire data lake. Snowflake has opted for the micro-partitioning approach to allow the query engine to only retrieve the data blocks that are required to answer a query based on the query predicate for the partitioning key, and to do so in parallel. Varada took an entirely different approach, bringing the power of dynamic indexing to the data lake — by indexing over all data columns or dimensions at once, Varada significantly reduces compute resources to support interactive SLAs.

Repurposing Indexing to Meet Performance Requirements on Very Large Datasets

Indexing is the tried and true method of reducing data scans. The query engine uses indexes and the query predicate to reduce the amount of data retrieved before reading all of the relevant columns for each record to process. 

Although this has been the prevalent optimization method in relational databases, indexes have not found a strong usage in big data. Data for analytics is typically stored in a columnar way that makes it much easier to scan and aggregate – but makes any classic row-based indexing approach very inefficient. Therefore partitioning or sorting the data on few columns has become the popular approach.

However, one key element of big data is the large amount of dimensions in play due to the data complexity. It is common to find thousands of columns used daily, making any partitioning scheme woefully inadequate to the business need, and reduces the actual dimensions used to only a few. Dynamic indexing eliminates this trade-off. Instead of creating partitions to index several columns, Varada dynamically creates millions of nanoblocks – a few dozen kilobyte sized sub sections. Each nanoblock only stores an index for a subset of the data, and each nanoblock index is independent and uniquely adapted to that nanoblock. By taking advantage of modern storage systems, such as SSDs, nanoblocks are fast to load, update, and systems like Varada can quickly read and execute queries against nanoblock indexes without the overhead inherent in either traditional indexes or micro-partitions.

Varada is also able to dynamically create, modify, and remove both indexes and table columns without the overhead of rewriting table data or any other background rebalancing or clean-up processes.

Schedule A Demo

Benchmarking Snowflake vs. Varada

In this short analysis we measured Varada’s performance and cost per query against Snowflake on four different queries. 

The queries include examples for: 

Query #1: Full scan distinct count aggregation that matches the partitioning scheme
Query #2: Cohort selection from the data for a simple counting aggregation; 
Query #3: Selective projection of many columns in few rows at a needle in a haystack search 
Query #4: Selective join operation between two tables, fact and dimension. As demonstrated below, Varada’s unique indexing technology shines across all query types, but especially in selective and highly selective queries.

For the benchmarking analysis, we used a 27bn rows / 1.5TB (compressed in snappy Parquet) dataset, inspired by a popular ride-sharing app. Queries were executed a few times and the median result was taken. As we ran each query multiple times, we didn’t use the cache to store results. Only cached data (“warm”) was used to represent active workloads.

Cluster configurations:

  • Snowflake: Medium (4 credits), total data size as loaded 2.1TB
  • Varada: 3 x r5d.8xlarge, with 1.25TB allocated for index and cache

Query Performance Results

Varada Snowflake Query Performance
Query Run Time (seconds)

Varada outperformed Snowflake on all queries, and as expected the value of our unique indexing technology is strongly proven to accelerate selective queries and joins.

To illustrate how much faster Varada is, check this out:

Varada Snowflake Query Performance
Varada Performance Uplift vs. Snowflake

For example, in Query #3 Varada was almost 30x faster than Snowflake. In selective joins the uplift is even greater, and is over 60x.

Cost Per Query

When it comes to cost per query, Varada’s indexing is extremely lean on compute resources, which can help data teams to avoid Snowflake spiraling costs.

Varada Snowflake Query Cost
Cost / Query ($, normalized for 1-hour 100% utilization, except for on-demand)

Taking Data Lake Architecture to the Max: Run as Fast as Your Business!

The agility and flexibility benefits of the data lake architecture are clear. But, delivering performance and cost are the critical driving forces behind massive data lake adoption.

As analytics use cases become in high demand across almost every business unit, data teams are constantly struggling with balancing performance and costs. In order to widely expand the open data lake concept across the entire organization, data teams should seek a smart and dynamic solution that will autonomously accelerate queries without breaking the budget.

Augmenting Snowflake with Agile Data Lake Analytics

Augmenting Snowflake with a cloud data lake solution delivers significant benefits:

  • Cost reduction by lowering cost of storage and data transfer with a cloud data lake  
  • Data lifecycle, data latency and ETL needs that fit a data lake better
  • Multi-purpose data marts that serve non-SQL workloads such as Machine Learning
  • Regulatory/contractual requirements for data location or retention

However, SQL workloads over data lakes do not benefit from data warehouse optimization like automatic clustering, and become data scan heavy and expensive. Varada helps bridge that gap.

Varada enables fast SQL analytics directly on your cloud data lakes, coexisting with data warehouses, both for data lakes separate from a data warehouse and for data lakes serving a hybrid environment.


See how Varada’s big data indexing dramatically accelerates queries vs. AWS Athena:

To see Varada in action on your data set, schedule a short demo!


Query Syntax

Query #1 (full scan): 

SELECT extract(year from "d_date") as year,
       extract(month from "d_date") as month,
       count(distinct "tripid") as cnt_trips
FROM
  v2_demo_trips.trips_data
WHERE
  "d_date" between date '2018-03-23' and date '2018-03-25'
GROUP BY  1,2

Query #2 (cohort selection):

SELECT "rider_gender",
       count(*) as cnt
FROM v2_demo_trips.trips_data
WHERE "d_date" between date '2018-02-01' and date '2018-03-01'
      AND "rider_age" IN (18,19)
      AND "d_weekday" = 5
GROUP BY  1
order by 2 desc;

Query #3 (selective project):

SELECT
  "tripid", "rider_id", 
  "d_date", "t_hour",  
  "fare", "duration", "distance",
  "rider_age", "rider_gender",
  "lon", "lat"
FROM
  v2_demo_trips.trips_data
WHERE
  "rider_id" = 3380311
  AND "t_hour" between 7 and 10

Query #4 (selective join):

SELECT
  extract(month from fact."ts") as month,
  count(distinct fact."tripid") as cnt_unq_trips ,
  count(distinct fact."rider_id") as cnt_unq_riders,
  round(avg(fact."fare"),2) as avg_fare,
  round(avg(fact."duration")/60,2) as avg_duration_min
FROM
  v2_demo_trips.trips_data as fact,
  v2_demo_trips.riders_campaign as dim
WHERE
  dim."rider_id" = fact."rider_id" AND
  dim."segment" IN ('Students_Promotion','Churning_Riders_Last3Months') AND
  dim."num_trips" < 9
group by 1
order by 1;
We use cookies to improve your experience. To learn more, please see our Privacy Policy
Accept