Getting your users access to analytics is just the first step in delivering on the promise of a data lake. To scale up to organization-wide analytics, you need to make sure your solution is tuned to your users’ needs. The simplest approach is to outsource your analytics to an external provider. By shipping off your data on a per-user basis, you offload the burden of tuning and tweaking individual queries, trading performance for potentially astronomical cost. Ultimately, there’s never going to be enough budget to outsource all of your organization’s analytics needs. You need an in-house solution that gives you the right balance of automation and control over your cost-performance trade-off.
It’s easy to dismiss the value of an analytics solution you can run in-house, ideally directly on your data lake. With so many managed analytics services available, surely you can find one that addresses your users’ demands. That may be true at small scale, for one or two teams, but managed services operate primarily with one large knob: more speed equals higher cost. As user adoption grows, your spending balloons because you lack visibility into and control over query performance. Users keep throwing unchecked queries at your service provider, and you ratchet up the raw power at ever higher expense. As you hit the budget ceiling, it becomes apparent that the only scalable path is an in-house analytics solution running directly on your data lake.
When you finally decide to bring your analytics solution in-house to manage cost, you’re still not sure what you need in terms of visibility. Most analytics solutions, including managed services, give you just enough query statistics to know that you need to spend more money. If you have the expertise and the time, you can hand-optimize every user query to try to rein in costs…until you realize that not all queries are equally important. What you really need from your analytics solution is a way to understand how groups of user queries combine into workloads that map to business needs, and then control over the budget allocated to those workloads. In many cases the answer isn’t to speed up a query but to slow it down, so that a higher-priority workload can use more of the shared resources and finish on time without blowing your budget.
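One way to picture this trade-off is priority-weighted allocation of a fixed resource pool: instead of speeding up every query, each workload gets a share of capacity proportional to its business priority, which implicitly slows the rest. The sketch below is illustrative only; the workload names, weights, and the `allocate_shares` helper are hypothetical, not any particular engine’s API.

```python
def allocate_shares(priorities, total_slots):
    """Split a fixed pool of concurrency slots across workloads in
    proportion to each workload's business-priority weight."""
    total_weight = sum(priorities.values())
    return {
        workload: round(total_slots * weight / total_weight)
        for workload, weight in priorities.items()
    }

# Hypothetical priorities: customer reports matter three times as much
# as ad-hoc development queries, so they get 3/4 of the shared pool.
shares = allocate_shares({"customer_reports": 3, "dev": 1}, total_slots=40)
print(shares)  # {'customer_reports': 30, 'dev': 10}
```

With a scheme like this, a lower-priority workload is deliberately held to a smaller share rather than tuned for speed, which is exactly the “slow it down” lever described above.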
Managing and optimizing the analytics workloads at a large organization often ends up being an exercise in optimizing where the data team spends its time. There are more users and more queries than any team can realistically optimize. User workloads change, and business priorities change faster than you can tune workloads. To provide acceptable performance for your users, any in-house solution needs the same performance automation capabilities available in a managed service. The difference is that an in-house solution also lets you override and control workload performance based on business priorities and cost.
The key to tuning analytics performance effectively at large scale is having the right level of visibility into workloads, along with automation and controls at the workload level. Some of this information comes from the query statistics available in most query engines. More sophisticated solutions let you group queries by user to identify overall workload behavior. By mapping out the resource utilization of entire workloads, you can focus on where to optimize queries and where to allocate your fixed resource pool. Ultimately, query prioritization should map back to business priorities: determine which workloads and queries the business teams require, and allocate resources to those workloads.
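The grouping step described above can be sketched in a few lines: take the per-query statistics most engines expose (user, CPU time, bytes scanned), map users to business workloads, and roll the numbers up. The field names, the sample data, and the user-to-workload mapping here are all assumptions for illustration, not a real engine’s schema.

```python
from collections import defaultdict

# Hypothetical per-query statistics, of the kind most query engines expose.
queries = [
    {"user": "alice", "cpu_s": 120, "gb_scanned": 50},
    {"user": "bob",   "cpu_s": 30,  "gb_scanned": 10},
    {"user": "carol", "cpu_s": 300, "gb_scanned": 200},
]

# Assumed mapping from users to business workloads, defined by the data team.
workload_of = {"alice": "customer_reports", "bob": "dev", "carol": "customer_reports"}

def aggregate_by_workload(queries, workload_of):
    """Roll per-query statistics up into workload-level totals."""
    totals = defaultdict(lambda: {"cpu_s": 0, "gb_scanned": 0, "queries": 0})
    for q in queries:
        workload = workload_of.get(q["user"], "unmapped")
        totals[workload]["cpu_s"] += q["cpu_s"]
        totals[workload]["gb_scanned"] += q["gb_scanned"]
        totals[workload]["queries"] += 1
    return dict(totals)

print(aggregate_by_workload(queries, workload_of))
```

Even a rollup this simple shows which workloads dominate the shared pool, which is the starting point for deciding where tuning effort and resources should go.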
With Varada, data admins have access to the critical workload-level metrics and controls they need to optimize their enterprise-wide analytics usage. Varada’s Visibility Center shows everything from data usage at the table and column level to CPU and memory load, combined with costs. Admins can aggregate this information at the workload level, for example by viewing all of the resources and costs associated with customer reports, and compare them to the development workloads.
The Varada Workload Manager further lets admins prioritize resources and control costs on a per-workload basis. The same customer reports that have high scan rates may be candidates for indexing or materialization. Instead of directing the data team to hand-optimize that set of queries, admins can simply tell Varada to apply automated optimizations, reducing the continuous whack-a-mole that can quickly burn out a small data team. Varada gives admins the flexibility to dynamically define workloads, apply built-in optimizations, and manually tweak performance so that they maintain their business SLAs.
When evaluating solutions for data lake analytics, keep in mind that there are no shortcuts. The solutions that are attractive for small groups of users end up costing you more time and money to optimize at scale. Before you get caught with that surprise end-of-month bill, look closely at the level of visibility you get from your analytics solution. If all you have is basic costs and a single “budget” knob, you’ll likely end up paying both for uncapped resources and for a data ops team to handle ongoing query optimizations.
To see Varada in action, schedule a short demo!