Disrupting Data Lake Economics with Indexing

By Ori Reshef
November 8, 2021

Spiraling compute costs prevent organizations from efficiently utilizing their massive data sets and getting top ROI on their data lake. Big data indexing technologies offer a resource-effective solution to this data deadlock.

It is a truism that data grows in scale and complexity with every passing year (not to mention day), so it should come as no surprise that data-related challenges are multiplying at the same pace. These challenges stem not only from the sheer mass of data but also from its multi-dimensionality: hundreds and often thousands of columns that must be sifted through to gain sought-after insights.

Data-driven companies rightly refer to their data as one of their most valued strategic assets driving their competitive edge, and are constantly looking to mine it for deeper insights. This means running more sophisticated and complex queries that leverage thousands of columns to drive smart decisions. To enable a competitive advantage, data consumers demand agility and flexibility, including exploratory analytics and drill-downs that can produce original analytical insights. Indeed, if you run the same analytics as everyone else, how can you stand out?

The data lake seems to offer a solution for these needs. Data lakes are gaining momentum precisely because they serve agile analytics, allowing data consumers to query any data, from any source, in any format, at any time. The downside is that data lake query engines still rely on brute force. With highly dimensional queries designed to find smart insights, 80%-90% of compute resources are wasted. Queries can still be very fast, but they need massive clusters and massive compute resources to deliver on their promise.

This analytics model turns querying your data into a huge economic undertaking. Considering the shortage of server chips and the rising adoption and cost of compute, it might also soon become a showstopper. Based on an analysis of Cisco’s latest earnings report and comments made by AMD, it looks likely that we will have to contend with higher prices and chip shortages for the foreseeable future, with price increases locked in for at least the next year or two. Everyone will simply have to do more with less.


Data Query Costs – Big Data Unit Economics

So how much does a query actually cost? When considering all the factors that go into generating data for one query, it becomes clear why modern data analytics is so resource-heavy. These are some of the parameters that need to be taken into account in the economics of analytics:

  • Compute. This is your cloud compute infrastructure which scales up/down based on your query workload needs.
  • Storage. A commodity that grows linearly with the size of your data lake. 
  • ETLs. The actual human and platform costs involved in performing ETLs, as well as the loss of agility and flexibility stemming from delays. 
  • Data teams. The many data, analytics and operational experts involved in deriving data insights and cluster management, including their training for the multiple skill sets required to support various platforms.
  • Analytics platforms. The actual cost of the numerous platforms used across the stack.

In a recent study, we discovered that on average it costs a large organization over $360 to answer a single question, that is, one data query. Of course, data-driven organizations often run hundreds of queries every day, resulting in heavy usage of compute and massive analytics bills. This is the data deadlock: in theory, you have at your disposal the infrastructure and data you need to get ahead. In practice, however, deriving insights from your data is so expensive that querying is no longer economically sound.

Lower Your Data Lake Analytics Unit Cost with Big Data Indexing

Big data indexing offers a fresh approach to data lake analytics that eliminates massive data scans. You can think of big data indexing technology as a smart acceleration layer on top of your data lake. Varada is extremely lean on compute because indexing reduces the amount of scanned data to 1-2% of what query engines would normally scan.
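To make the idea of skipping data with an index concrete, here is a minimal Python sketch. It is not Varada's implementation: the chunk layout, the `build_index`, `full_scan` and `indexed_scan` helpers, and the synthetic `user_id` data are all assumptions made purely for illustration. The 1% result follows from how the toy data is constructed and is not a benchmark.

```python
from collections import defaultdict


def build_index(chunks, column):
    """Map each value of `column` to the ids of the chunks that contain it."""
    index = defaultdict(set)
    for chunk_id, rows in enumerate(chunks):
        for row in rows:
            index[row[column]].add(chunk_id)
    return index


def full_scan(chunks, column, value):
    """Brute force: read every row of every chunk."""
    matches, rows_read = [], 0
    for rows in chunks:
        rows_read += len(rows)
        matches.extend(r for r in rows if r[column] == value)
    return matches, rows_read


def indexed_scan(chunks, index, column, value):
    """Read only the chunks the index says can contain `value`."""
    matches, rows_read = [], 0
    for chunk_id in sorted(index.get(value, ())):
        rows = chunks[chunk_id]
        rows_read += len(rows)
        matches.extend(r for r in rows if r[column] == value)
    return matches, rows_read


# Hypothetical data set: 100 chunks of 1,000 rows, each user_id appears once.
chunks = [
    [{"user_id": chunk_id * 1000 + i, "amount": i} for i in range(1000)]
    for chunk_id in range(100)
]
index = build_index(chunks, "user_id")

full_matches, full_rows = full_scan(chunks, "user_id", 42_007)
idx_matches, idx_rows = indexed_scan(chunks, index, "user_id", 42_007)

print(f"full scan read {full_rows} rows, "
      f"indexed scan read {idx_rows} rows ({idx_rows / full_rows:.0%})")
# prints: full scan read 100000 rows, indexed scan read 1000 rows (1%)
```

Both scans return the same row. The difference is how much data each one had to read to find it, which is where the compute savings come from.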

Varada indexes relevant data at the structure and granularity of the source. The platform automatically knows which data to index, how to index it and when, based on actual demand by queries (using machine-learning-based automated acceleration) or on requirements defined by admins. Each query is split into work tasks with optimal query paths, which include hot data and indexes on the cluster’s SSDs, warm data and indexes on the customer’s data lake, or cold data on the data lake, which remains the single source of truth. Varada dynamically balances performance against the cost of SSDs (hot vs. warm data and indexes), enabling a true zero-DataOps data lake experience and dramatically reducing TCO and the associated data team overhead.

Varada’s index-once approach ensures that clusters can quickly scale to meet peak demand or shrink to avoid overprovisioning and idle resources. When a cluster is eliminated, its indexes remain available as warm data on the customer’s data lake, enabling fast warm-up when new clusters are added. Additionally, admins can leverage the state and acceleration instructions of previously live clusters when starting a new cluster.

The impact of big data indexing on analytics unit economics is twofold:

  • Data indexing dramatically accelerates query run time and concurrency without depending on massive compute resources, so cost per query is significantly lower.
  • Data indexing cuts down the total resources and cost required for data utilization by breaking the dependency on compute and reducing cluster size along with associated resources, including staff overhead.

Benchmarking: Cost per Query ($)

In this benchmarking analysis we ran two queries on several analytics platforms and analyzed the cost per query. Costs are normalized to reflect comparable infrastructure. For more detailed analysis, check out the benchmarking of Varada against Trino, AWS Athena and Snowflake.

Varada SQL Cost


In addition to significantly lower cost per query, Varada enables organizations to shrink TCO by 40%-60%.

For example, an organization with a 100TB data lake and a required median response time of 10 seconds using Trino (PrestoSQL) on EMR would require 100 nodes of EC2 r5.4xlarge (no SSD, $1.01/hr) for a total cost of $101/hr.

With Varada ($2.00/TB/hr), assuming 10% of the data is accelerated, the same analytics workload would require only 20 nodes of EC2 r5d.4xlarge (w/SSD, $1.15/hr). That comes to $20/hr for Varada (10TB accelerated) plus $23/hr for EC2, a total of $43 per hour, or a 57% reduction in TCO.
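The arithmetic behind this comparison can be reproduced with a short Python sketch. The figures are the ones quoted in the example above. The `hourly_cost` helper and its parameter names are hypothetical, introduced only to show how the totals are derived.

```python
def hourly_cost(nodes, node_price_per_hr, accelerated_tb=0.0, price_per_tb_hr=0.0):
    """Hourly cost = compute nodes + (optional) per-TB charge for accelerated data."""
    return nodes * node_price_per_hr + accelerated_tb * price_per_tb_hr


DATA_LAKE_TB = 100          # size of the data lake in the example
ACCELERATED_FRACTION = 0.10  # share of the data assumed to be accelerated

# Trino on EMR: 100 nodes without SSD at $1.01/hr each.
trino = hourly_cost(nodes=100, node_price_per_hr=1.01)

# Varada: 20 SSD-backed nodes at $1.15/hr, plus $2.00/TB/hr on accelerated data.
varada = hourly_cost(
    nodes=20,
    node_price_per_hr=1.15,
    accelerated_tb=ACCELERATED_FRACTION * DATA_LAKE_TB,
    price_per_tb_hr=2.00,
)

reduction = (trino - varada) / trino
print(f"Trino: ${trino:.2f}/hr, Varada: ${varada:.2f}/hr, reduction: {reduction:.0%}")
# prints: Trino: $101.00/hr, Varada: $43.00/hr, reduction: 57%
```

Changing `ACCELERATED_FRACTION` or the node count shows how sensitive the result is to the assumptions: the saving depends on how much of the data lake has to be accelerated to meet the response-time target.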

TCO Analysis: Trino (AWS EMR) vs. Varada

Varada helps organizations transform their analytics cost structure and data lake financial governance by leveraging an autonomous indexing technology that decouples the data lake from the compute cluster that processes it. The result is a dramatic reduction in the dependency on full scans.

Varada delivers high ROI, flexibility and predictability to organizations performing large-scale analytics, enabling them to cost-effectively expand their analytics strategy.

Ready to see Varada in action on your data lake? Schedule a live demo!
