Building the Analytics-Ready Data Lake Stack

By Ori Reshef
May 12, 2021

For many organizations, data lake architecture fails to deliver on its analytics promise and often remains little more than an aggregation and storage layer. New data virtualization and acceleration platforms maximize data lake ROI and turn big data into a strategic asset.

In recent years, data investment has reached a state where it is nearly universal across enterprises — and the pace of investment is accelerating. There is, however, a huge gap between initiatives and results. According to this 2021 NewVantage Partners survey of Fortune 1000 executives, big data is still not used effectively and firms are continuing to struggle to derive value from their investments in this area. For example:

  • Only 48.5% are driving innovation with data
  • Only 41.2% are competing on analytics
  • Only 29.2% are experiencing transformational business outcomes
  • Only 24.0% have created a data-driven organization

The ongoing demand for agile, more flexible data analytics to leverage big data investments has fueled the rise of data lakes and distributed SQL query engines (like Presto and Trino). The power of data lakes to hold vast amounts of raw data in native formats until needed by the business, combined with the agility and flexibility of distributed engines in querying that data, promises organizations the ability to maximize data-driven growth. 
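To make this concrete, here is a minimal sketch of what querying raw data lake files through a distributed SQL engine can look like, using Trino's Python client. The coordinator host, catalog, schema, and table names are hypothetical placeholders rather than references to a real deployment.

```python
# A minimal sketch of querying raw data lake files through Trino's Python client.
# Install the client with `pip install trino`. Host, catalog, schema, and table
# names below are hypothetical placeholders.
import trino

conn = trino.dbapi.connect(
    host="trino-coordinator.example.com",  # hypothetical coordinator endpoint
    port=8080,
    user="analyst",
    catalog="hive",      # e.g. a Hive/Glue catalog backed by S3
    schema="web_logs",   # hypothetical schema of raw Parquet files
)

cur = conn.cursor()
cur.execute("""
    SELECT country, count(*) AS sessions
    FROM raw_events              -- hypothetical table over native-format files
    WHERE event_date = DATE '2021-05-01'
    GROUP BY country
    ORDER BY sessions DESC
    LIMIT 10
""")
for country, sessions in cur.fetchall():
    print(country, sessions)
```

The point of the example is that the data stays in its native files on the lake; the engine brings SQL to the data rather than forcing the data through an upfront modeling pipeline.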

The biggest advantage of data lakes is flexibility. Because data remains in its native, granular format, it does not need to be modeled in advance or transformed in flight or at the target storage. The result is an up-to-date stream of data that is available for analysis at any time, for any business purpose.

The main value organizations derive from the data lake stack is three-fold:

  1. It enables instant access to their wealth of data, regardless of where it resides, with near-zero time-to-market (no need for IT or data teams to prepare or move data)
  2. It creates a pervasive, data-driven culture 
  3. It transforms data into the digital intelligence that is a prerequisite for achieving a competitive advantage in today’s data-driven ecosystem

But data lakes only advance an organization's vision when they help solve business problems through data democratization, re-use, and exploration with agile, flexible analytics. Access to the data lake becomes a real force multiplier when it is used thoroughly, across business units.

In practice, however, even after a successful implementation many enterprises use the data lake only on the fringes, running a limited number of ad hoc, high-value queries. They thus fall dramatically short of their data lake's potential and experience poor ROI as a result.

Analytics is at the center of the data lake architecture paradigm shift

There are several obstacles that prevent organizations from utilizing the power of their data lake stack, all of which require organizations to rethink their data lake architecture in order to capitalize on their investment in big data and analytics. 

The single most common cause of poor ROI is that traditional data lake query engines rely on brute-force query processing, scanning through all of the data to return the result sets needed for application responses or analytics.

In fact, 80% of compute resources are “wasted” on full scans! 

Throwing such excessive resources at every query runs up significant costs. The resulting SLAs cannot support interactive use cases and realistically support only ad hoc analytics or experimental queries. To effectively support a wide range of analytics use cases, dataops teams have no choice but to revert to optimized data silos and query traditional data warehouses.
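To see why brute-force scanning is so costly, consider a back-of-the-envelope sketch. The lake size, query selectivity, and per-terabyte scan cost below are hypothetical assumptions chosen only to mirror the 80% figure above; they are not benchmark data.

```python
# Illustrative back-of-the-envelope estimate of compute wasted on full scans.
# All figures are hypothetical assumptions, not benchmark results.

LAKE_SIZE_TB = 100       # total data in the lake
RELEVANT_TB = 20         # data a selective query actually needs
SCAN_COST_PER_TB = 5.0   # assumed $ per TB scanned (amortized cluster time)

full_scan_cost = LAKE_SIZE_TB * SCAN_COST_PER_TB    # engine scans everything
pruned_scan_cost = RELEVANT_TB * SCAN_COST_PER_TB   # engine scans only what matters
wasted = full_scan_cost - pruned_scan_cost

print(f"Full scan:   ${full_scan_cost:,.0f}")
print(f"Pruned scan: ${pruned_scan_cost:,.0f}")
print(f"Wasted:      ${wasted:,.0f} ({wasted / full_scan_cost:.0%} of the full-scan spend)")
```

Under these assumptions, four out of every five compute dollars go to reading data the query never needed.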

But dataops teams are already spread thin, with responsibilities for managing the data analytics budget, prioritizing query requests, and optimizing query performance. Manual query optimization is time consuming, and the optimization backlog grows every day, creating a vicious cycle. The lack of workload-level visibility prevents dataops teams from identifying which workloads need priority based on business needs rather than on the needs of an individual user or query. The frustration of data users throughout the organization, and the burnout experienced by the dataops team, can stymie even the best-laid plans to capitalize on big data and build a data-driven culture.

Overcoming these obstacles to leveraging the power of the data lake demands a transition to an analytics-ready data lake stack, which is composed of:

  • Scalable, massive storage (petabyte to exabyte scale), such as AWS S3
  • Data virtualization layer that provides access to many data sources and formats
  • Distributed SQL query engine such as Trino (formerly PrestoSQL)
  • Query acceleration and workload optimization engine that balances performance and cost, eliminating the drawbacks of the brute-force approach and their implications

These tools enable agile data lake analytics that harness near-perfect data, with performance and cost comparable to a traditional data warehouse. With them in place, the business no longer needs to adapt to the existing data architecture, which limits which queries can be run; instead, the data architecture adapts itself to specific business needs, which are highly elastic and dynamic. Together they offer a simple, cost-effective way for enterprises to shift their analytics to the data lake, making it the one-stop shop for agile, flexible data analytics.

Autonomous query acceleration: the missing link in the data lake stack

Data lake query acceleration platforms like Varada are the missing link in your data lake stack. Sitting on top of your data lake and query engine (Presto/Trino), they serve as a smart acceleration layer, while the lake itself remains the single source of truth. The data lake becomes the business's mainstream data analytics platform, enabling enterprises to turn it into a strategic competitive advantage and achieve data lake ROI. Data also becomes a strategic asset, as businesses can use it to respond with agility to new opportunities and threats through innovations that drive business growth and competitive advantage.

Varada is an automated dataops and observability platform that gives you control over the performance and cost of your data lake analytics. It autonomously and continuously learns and adapts to the users, the queries they run, and the data being used. Workload-level visibility gives dataops teams an open view of how data is being used across the entire organization, so they can better focus dataops resources on business priorities.

Dataops teams can tell Varada which workloads are more important and how to allocate budgets. Based on this information, Varada automatically and dynamically creates appropriate indexes, refines which queries to cache, and even materializes tables with the right column sets, including pre-joining dimensions.

Varada delivers high ROI by leveraging indexing technology. Its dynamic indexing eliminates the need for full scans and can accelerate queries automatically, without any overhead to query processing or any background data maintenance. This reduces the amount of data scanned by orders of magnitude. As an example, check out this data from our benchmarking of Trino vs. Varada:

[Chart: Trino vs. Varada query performance (seconds)]

Enterprises using Varada can also expect a 40%-60% cost reduction compared to current solutions such as Presto/Trino on EMR.

The following analysis demonstrates the expected cost reductions of Varada vs. Presto EMR.

Business requirements:

  • Data lake: 100TB
  • Required response time (median): 10 seconds
  • Accelerated data: 10TB (10%)

Infrastructure:

  • AWS EC2 r5.4xlarge (no SSD): $1.01/hr
  • AWS EC2 r5d.4xlarge (w/SSD): $1.15/hr
  • Varada: $2.00/TB/hr
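
As a rough way to reason about the comparison, the sketch below combines the list prices above into hourly cost estimates. The cluster sizes are illustrative placeholders rather than sizing recommendations, and the calculation assumes the per-terabyte fee applies to the 10TB of accelerated data.

```python
# Rough hourly cost comparison built from the prices listed above.
# Node counts are hypothetical placeholders -- substitute your own sizing.

R5_4XLARGE_HR = 1.01     # AWS EC2 r5.4xlarge (no SSD), $/hr
R5D_4XLARGE_HR = 1.15    # AWS EC2 r5d.4xlarge (with SSD), $/hr
VARADA_PER_TB_HR = 2.00  # Varada, $ per TB of accelerated data per hour
ACCELERATED_TB = 10      # 10% of a 100TB lake

def presto_emr_hourly(nodes: int) -> float:
    """Hourly EC2 cost for a brute-force Presto/Trino EMR cluster."""
    return nodes * R5_4XLARGE_HR

def varada_hourly(nodes: int) -> float:
    """Hourly cost for a Varada cluster: SSD-backed nodes plus the
    per-terabyte acceleration fee on the accelerated data set."""
    return nodes * R5D_4XLARGE_HR + ACCELERATED_TB * VARADA_PER_TB_HR

# Hypothetical example: a large brute-force cluster vs. a smaller accelerated one.
print(f"Presto EMR, 60 nodes: ${presto_emr_hourly(60):.2f}/hr")
print(f"Varada,     10 nodes: ${varada_hourly(10):.2f}/hr")
```

The intuition is that acceleration trades a per-terabyte fee on a small, hot subset of the data for a much smaller compute footprint on the rest of the workload.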

Schedule a short demo now to see Varada in action on your data set!
