As a data team, you’re drowning in demand from users requesting analytics access to your corporate data lake. There’s a good reason to welcome these requests. Having spent years investing in data collection pipelines and establishing data governance and controls, the organization is primed to start using data as a strategic asset rather than just a cost center. After all, a data lake without analytics doesn’t fulfill any purpose other than taking up space and checking regulatory boxes. You can handle these user requests in one of two ways: either start to move subsets of data out to externally managed systems or provide a query engine to access data directly on the data lake.
Moving data to external systems is an attractive option for the first few users. Externally managed systems free up the DevOps team and give users unfettered access to their segmented subset of data. Additionally, all your performance and scale barriers can be solved by throwing money at your service provider. Yet without understanding what your users are doing, those costs can quickly spiral out of control. The price tag for letting all your users analyze whatever data they want can quickly surpass the operating costs of entire divisions. With real budget limitations, you need a way to prioritize access based on the needs of the business. To do this responsibly, you need transparency and visibility into how users are analyzing data at the workload level. This is often impossible with an external analytics provider.
External analytics providers are by their nature black box systems. You send over your data and give users access to analytics. In exchange for giving up control, you pay for flexibility for your users. Running more queries or speeding up queries is generally a matter of increasing your budget. With limited visibility into how the system is running, where you need to tune, or how slow-running queries relate to business-critical questions, you end up with one big control knob: budget. By moving the data into a black box query system, data teams lose control over performance, reliability, and cost, as users can run amok querying data and throwing endless compute at every question that comes to mind. With a managed analytics service, you give up more than just governance and control. Without the visibility to optimize workloads, your only choice is to pay more to go faster.
To avoid the money pit of outsourced analytics, responsible data teams need to map analytics usage back to business value so they can optimize and prioritize budgets. This mapping requires visibility you often don’t get in managed systems. Consider the information you need in order to map user queries to business priorities. More than just the amount and cost of data scanned on the account, you need query-level visibility into all the resources and costs, including CPU usage, memory usage, and tables accessed. The total cost of supporting user analytics involves everything from selecting what data to make available to deciding how to prioritize spending so that production workloads meet their SLAs and user queries stay responsive.
Basic query history, individual query statistics, and billing reports are granular, but they don’t reveal the entire picture. Business questions typically aren’t answered with a single query. Beyond query statistics, you need user-level and workload-level resource utilization and costs. To trade off query priorities and performance within a fixed budget, you need controls that make some sets of user queries run faster and other sets run cheaper, perhaps at the expense of performance.
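To make the jump from query-level statistics to workload-level costs concrete, here is a minimal sketch in Python. It is an illustration only, not how Varada or any particular engine does it: the `QueryRecord` fields, the workload labels, and the unit prices are all hypothetical, and a real cost model would depend on your own infrastructure and pricing.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical unit prices, chosen only for illustration.
CPU_SECOND_COST = 0.00005  # dollars per CPU-second (assumed)
GB_SCANNED_COST = 0.005    # dollars per GB scanned (assumed)


@dataclass
class QueryRecord:
    """One row of per-query statistics (hypothetical schema)."""
    user: str
    workload: str
    cpu_seconds: float
    peak_memory_gb: float
    gb_scanned: float


def query_cost(q):
    """Estimate the cost of one query from CPU time and data scanned."""
    return q.cpu_seconds * CPU_SECOND_COST + q.gb_scanned * GB_SCANNED_COST


def summarize_by(records, key):
    """Roll query-level records up by an attribute such as 'user' or 'workload'."""
    totals = defaultdict(lambda: {
        "queries": 0,
        "cpu_seconds": 0.0,
        "gb_scanned": 0.0,
        "peak_memory_gb": 0.0,
        "cost": 0.0,
    })
    for q in records:
        t = totals[getattr(q, key)]
        t["queries"] += 1
        t["cpu_seconds"] += q.cpu_seconds
        t["gb_scanned"] += q.gb_scanned
        # Memory is tracked as a high-water mark, not a sum.
        t["peak_memory_gb"] = max(t["peak_memory_gb"], q.peak_memory_gb)
        t["cost"] += query_cost(q)
    return dict(totals)


# Made-up sample data for the example.
sample = [
    QueryRecord("alice", "finance_dashboard", 120.0, 4.0, 50.0),
    QueryRecord("bob", "finance_dashboard", 60.0, 8.0, 10.0),
    QueryRecord("carol", "ad_hoc", 600.0, 16.0, 400.0),
]

for name, stats in summarize_by(sample, "workload").items():
    print(name, stats)
```

Calling `summarize_by(sample, "user")` gives the same breakdown per user. The point of the sketch is that none of this is possible unless the engine exposes per-query CPU, memory, and scan statistics in the first place.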
Full visibility into your analytics solution is critical for ensuring reliable query performance for business-critical workloads and keeping non-production queries responsive, all while managing costs. To fill these needs at enterprise scale, Varada introduces the Visibility Center. Data admins can use the Varada Visibility Center to investigate resource utilization and costs of user analytics across a wide range of dimensions. The Visibility Center provides a breakdown by queries run, CPU used, memory, and data used. Admins can further group queries into workloads and see both total workload costs and a breakdown within the workload by each resource. The Varada Visibility Center is a critical tool for answering the inevitable question: why did my monthly bill double? Admins get the information they need both to keep users happy and to respond to the CFO’s drumbeat to keep costs down.
While outsourced analytics is a quick way to start and can work for small teams, as your user base grows, the need for visibility to manage the cost vs. business priority equation means bringing analytics in-house. The ideal analytics engine runs directly against your data lake and gives you the same level of automation and flexibility as an external solution, along with the visibility you need to manage priorities and budgets. By deploying your own query engine, you get access to the granular data as well as key information about which user groups need similar information, when users run particular workloads and how those workloads perform, and the statistics on user groups that you need to understand how well business questions are being answered.
To see Varada in action, schedule a short demo!