Faster Analytics On Your Data Lake, Without Breaking Your Budget

By Ori Reshef
November 4, 2020

Day to day, only three things matter in your analytics deployment: production SLAs, user satisfaction, and staying within budget. Like many three-factor problems, you get to pick two and have to compromise on the third. Production SLAs are not negotiable, so what you end up balancing is user satisfaction against budget. When you’re just getting started and budgets are small, it’s easy to excuse the occasional overage in order to eke out performance gains for users. Long term, you need a sustainable way to give users the access and reliable performance they need to get their jobs done, without the unexpected cost hiccups that bring the CFO banging on your door. The answer to walking this fine line is knowing where and how to apply the right optimizations.

Evaluating Your Analytics Engine Options

Broadly speaking, you have two options for running analytics, and each one has significant implications for how you can optimize and what you need to budget. The easiest and most obvious way to run analytics is to outsource to a managed analytics provider. By offloading your analytics workloads and moving your data to a third party, you can simply send your users and budget to someone else who will provide as many resources and as much speed as you can afford. You save on DevOps costs and can control for either fast performance or budget spend (but usually not both).

When budgets are small and you have limited users, outsourced providers offer a valuable trade-off. Instead of buying software and staffing up teams and dedicated resources, you can send just the data that users need to analyze and point them at the hosted provider. You pay for the cost to store and read data, plus CPU and memory. Some outsourced analytics engines offer automation that will speed up queries (at a cost), trading off DataOps costs. For the most part, the way to get more speed from an outsourced provider is to pay for more disk, CPU, and RAM.

Data Lake Based Engines and Data Virtualization Offer Better Speed and Control

Moving to an in-house analytics engine can give you greater control over maximizing your query performance per dollar. Managing your own resources, especially as user adoption grows, lets you control the price-performance of your analytics by choosing the right optimizations across your workloads.

But just putting a standard query engine on your data lake doesn’t automatically optimize every dollar you spend. You need to apply the right types of optimizations, including prioritizing queries by workload, applying intelligent indexing, and dynamically caching data sets.

Much like managed analytics engines, the basic data lake-based query engine has very few controls. A simple system, such as Amazon Athena or Presto, will show you when queries are issued, when they complete, and how long they spend reading from disk. Often the only control you have is data scan speed: the more parallelized your queries, the faster they return. You may be able to time-share your resources so that you run fewer queries at a time and each completes more quickly. This means juggling which queries are required to support which workloads, which is a DataOps nightmare. Ultimately you end up paying for as many disks as necessary to scan the largest data set and speed up the slowest query.

You can get better insight into how to partition your data if you have visibility into which queries are accessing which data at the column level, and what kinds of predicates are common among queries. Having column access statistics per query also helps you decide which data sets can be cached or pre-materialized. If your query engine supports it, you can selectively index data sets that are frequently used for looking up reference data. At the scale of an enterprise-wide analytics deployment, these optimization choices can quickly become overwhelming, especially if your DataOps team doesn’t have the information it needs about which queries support which business workloads. The ideal engine combines visibility, workload-level reporting, advanced controls, and automation to help you speed up the workloads that are most important to the business.
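To make the idea of column access statistics concrete, here is a minimal Python sketch. It assumes a hypothetical query log in which each entry records the workload, the columns read, and the columns used in predicates; the record layout, the column names, and the 50% threshold are all illustrative assumptions, not the output format of any particular engine.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class QueryRecord:
    """One entry from a hypothetical query log (field names are assumptions)."""
    workload: str
    columns_read: Tuple[str, ...]
    predicate_columns: Tuple[str, ...]


def column_stats(log):
    """Count how many queries read each column and how many filter on it."""
    reads, filters = Counter(), Counter()
    for q in log:
        # Use sets so a column counts once per query, even if repeated.
        reads.update(set(q.columns_read))
        filters.update(set(q.predicate_columns))
    return reads, filters


def candidates(counter, total_queries, min_share):
    """Return columns used by at least min_share of queries, most used first."""
    return [col for col, n in counter.most_common()
            if n / total_queries >= min_share]


# Made-up sample log for illustration only.
log = [
    QueryRecord("production", ("orders.customer_id", "orders.total"),
                ("orders.order_date",)),
    QueryRecord("production", ("orders.total",),
                ("orders.order_date", "orders.region")),
    QueryRecord("ad_hoc", ("customers.name", "orders.customer_id"),
                ("orders.region",)),
    QueryRecord("ad_hoc", ("orders.total",), ("orders.order_date",)),
]

reads, filters = column_stats(log)
# Columns that appear in many predicates: partitioning or indexing candidates.
hot_filter_columns = candidates(filters, len(log), min_share=0.5)
# Columns that many queries read: caching or pre-materialization candidates.
hot_read_columns = candidates(reads, len(log), min_share=0.5)
print(hot_filter_columns)
print(hot_read_columns)
```

In this toy log, the frequently filtered columns would be the first place to look for a partition key or an index, and the frequently read columns the first place to look for caching. A real deployment would also want to break these counts down per workload.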


Leveraging Dynamic Indexing to Balance Performance and Budget

Varada augments its unique big data indexing-based Data Virtualization technology with a Visibility Center and Workload Management. These capabilities are a critical piece of scaling analytics while balancing production SLAs, user experience, and overall spending. The Visibility Center surfaces everything from query-level statistics, such as data access at the table and column level, to resources consumed at the query level and aggregated by your specific workloads. You can define workloads in Varada to report on how collections of user queries and automated processes combine to answer business-level needs.

With Varada’s Workload Manager, you direct the query engine on how to prioritize and allocate resources on a per-workload basis. Varada automatically adjusts the underlying optimizations in the query engine, including creating and managing appropriate indexes, materializing intermediate data sets, and caching query results.

For example, by prioritizing production workloads, Varada will make sure the cache is pre-warmed for all of the slower queries in the production workload and that the correct indexes are created and maintained to avoid costly and slow data scans. Varada can also automatically scale out the underlying hardware resources and allocate them to queries in different workloads, all within the budgetary constraints set by administrators.
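As a rough illustration of allocating resources by workload priority within a fixed budget, the Python sketch below grants compute nodes greedily in priority order. This is a generic technique shown under stated assumptions, not Varada's actual allocation logic; the `Workload` fields, the priority convention, and the node counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    priority: int       # lower number = more important (assumed convention)
    nodes_wanted: int   # nodes requested to hit the workload's target latency


def allocate(workloads, node_budget):
    """Greedily grant nodes in priority order until the budget is used up."""
    remaining = node_budget
    plan = {}
    for w in sorted(workloads, key=lambda w: w.priority):
        granted = min(w.nodes_wanted, remaining)
        plan[w.name] = granted
        remaining -= granted
    return plan


# Made-up workloads for illustration only.
workloads = [
    Workload("ad_hoc_exploration", priority=3, nodes_wanted=6),
    Workload("production_reports", priority=1, nodes_wanted=8),
    Workload("dashboards", priority=2, nodes_wanted=4),
]

plan = allocate(workloads, node_budget=14)
print(plan)
```

With a budget of 14 nodes, the production and dashboard workloads get everything they ask for and ad hoc exploration gets the remaining 2 nodes. The budget is never exceeded, which is the property an administrator cares about; a fairer scheme could reserve a minimum share for low-priority workloads.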

As workloads shift and queries change, Varada adapts optimizations for the appropriate workloads based on your pre-set priorities.

A common example: a user tweaks a query for a production report to join in a new data set. Varada has already seen the new query run while the user was testing it, recognizes that the updated query is part of a high-priority workload, and decides whether to pre-cache the new data set or add an index, depending on the query cost.
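The kind of choice described above can be sketched with a simple rule of thumb. The Python function below is an illustrative heuristic only, not Varada's decision logic: it assumes you can estimate how many rows a query matches and how large the data set is, and the function name, parameters, and 5% selectivity threshold are all hypothetical.

```python
def choose_acceleration(rows_total, rows_matched, dataset_bytes,
                        cache_free_bytes, selective_threshold=0.05):
    """Pick an acceleration for a newly joined data set.

    Illustrative rule only: selective lookups favor an index, broad reads of
    a data set that fits in the free cache favor pre-caching, and anything
    else is left to a plain scan.
    """
    selectivity = rows_matched / rows_total if rows_total else 1.0
    if selectivity <= selective_threshold:
        return "index"
    if dataset_bytes <= cache_free_bytes:
        return "pre-cache"
    return "scan"


GB = 1024 ** 3

# A lookup that touches 0.2% of a large table: an index avoids the full scan.
lookup = choose_acceleration(10_000_000, 20_000, 50 * GB, 10 * GB)

# A small reference table read almost in full: cheap to keep in cache.
reference = choose_acceleration(500_000, 400_000, 2 * GB, 10 * GB)

# A broad read of a table too large for the cache: neither option helps.
broad = choose_acceleration(10_000_000, 6_000_000, 50 * GB, 10 * GB)

print(lookup, reference, broad)
```

A production system would weigh more factors, such as how often the query runs and the cost of maintaining the index, but the sketch shows why the same query change can lead to different accelerations.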

With Varada you both save on DevOps costs and offset the DataOps burden thanks to the ability to automatically optimize the underlying query engine and accelerations directly on your data lake.

Deliver Faster Analytics On a Fixed Budget, Even On Your Data Lake

When you start offering analytics to users via an external provider, it’s easy to get stuck trying to balance production SLAs and day-to-day query performance while keeping everything within the available budget. The key to solving these challenges is knowing which optimizations to apply to each workload. You need to make sure your query engine gives you the right types of optimizations to accelerate your queries, such as dynamic indexing and intelligent caching, in addition to automation and the ability to scale out within your budget. By adding automated query optimizations, DataOps teams can speed up your overall analytics without breaking the bank.

To see Varada in action, schedule a short demo!
