Your end-to-end, analytics-ready data lake stack is critical to optimizing your cloud data lake's performance and ROI. Here's what you need to know.
Data lakes enable us to handle vast, complex datasets. Modern data lakes can support high-performance query engines, giving users access to both raw and transformed data directly in the lake. Their biggest advantage is flexibility: data lakes offer an up-to-date stream of data that is available for analysis at any time and for any business purpose, and the data can remain in its native, granular format. But the data lake is not just a storage destination for your data; it's a strategic technology stack that, at the most basic level, consists of three essential layers: scalable object storage such as S3 (petabyte to exabyte scale) to hold the data; a distributed query engine with a data virtualization layer, such as Presto, that provides access to many data sources and formats; and a big data catalog such as the Hive metastore or AWS Glue to find that data and define its access policies.
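To make the three layers concrete, here is a minimal Python sketch, assuming a Trino/Presto coordinator whose hive catalog is backed by a Hive metastore or AWS Glue; the host, bucket, schema, table, and column names are all placeholders:

```python
# Minimal sketch of the three layers: Parquet files in S3 (storage), a table
# definition in the Hive metastore / Glue catalog (catalog), and a Presto/Trino
# query over it (query engine). All names below are hypothetical.
import trino  # pip install trino

conn = trino.dbapi.connect(
    host="presto.example.internal",  # your coordinator endpoint
    port=8080,
    user="analyst",
    catalog="hive",       # catalog backed by the Hive metastore or AWS Glue
    schema="analytics",
)
cur = conn.cursor()

# Catalog layer: register the raw files sitting in object storage as a table.
cur.execute("""
    CREATE TABLE IF NOT EXISTS web_clicks (
        user_id bigint,
        url varchar,
        event_date date
    )
    WITH (external_location = 's3://my-data-lake/raw/web_clicks/', format = 'PARQUET')
""")
cur.fetchall()  # make sure the DDL completes

# Query engine layer: read the files directly from S3, no load step required.
cur.execute("SELECT event_date, count(*) FROM web_clicks GROUP BY event_date")
print(cur.fetchall())
```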
Cloud data lakes are transforming the way organizations think about their infrastructure and data. Enterprises today understand that cloud migration is critical to their long-term success. According to a 2021 Accenture report on cloud trends, worldwide end-user spending on public cloud services is forecast to grow 18.4% in 2021 and 24% the following year, demonstrating that commitment to the public cloud is only growing. But more importantly, the report states that “the cloud is more than an efficient storage solution — it’s a unique platform for generating data and innovative solutions to leverage that data.” Across industries, more and more companies are moving to the cloud, and the on-premises data lake is no exception. If you don’t have a data lake yet, the cloud should definitely be a top priority. Cloud-based solutions offer elastic scalability, agility, up to 40% lower total cost of ownership, increased operational efficiency, and the ability to innovate rapidly.
But to get the most out of your cloud data lake, you’ll need to deploy an analytics-ready data lake stack that lets you turn your data into a strategic competitive advantage and achieve data lake ROI. The analytics-ready stack requires two additional capabilities: workload observability and optimization, and query acceleration. These tools sit on top of your cloud data lake and query engine, allowing you to operationalize your data and serve as many use cases as possible with instantly responsive interactive queries, full cost and performance control, and minimal DataOps.
Here are six critical questions to consider when choosing the best cloud data lake stack for your business.
Choosing the right cloud platform provider can be a daunting task, but you can’t go wrong with the big three: AWS, Azure, and Google Cloud Platform. Each offers its own massively scalable object storage, a data lake orchestration solution, and managed Spark, Presto, and Hadoop services. In object storage, these vendors compete on usability and price (each offers pricing tiers and models based on availability and access time), with Azure being the cheapest and Google the most expensive on average.
AWS Lake Formation & S3. AWS Lake Formation provides a wizard-style interface over various pieces of the AWS ecosystem that lets organizations easily build a data lake. The primary backend storage of an AWS data lake is S3. S3 is highly scalable and available and can be made redundant across a number of availability zones. S3 offers storage tiers such as Standard, Standard-IA, and Glacier, trading lower storage costs for higher read/write and retrieval costs as availability decreases. S3 also supports object versioning, enabled per bucket, where each version is addressable and can be retrieved at any time. S3 is the clear leader in cloud object storage: it offers rich functionality, it’s been around the longest, and many applications have been developed to run on it.
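As a rough illustration with boto3 (not a recommended configuration; the bucket name, prefix, and day thresholds below are placeholders), enabling versioning and tiering aging objects down to Standard-IA and Glacier looks roughly like this:

```python
# Illustrative sketch: turn on bucket versioning and add a lifecycle rule that
# moves older raw-zone objects to cheaper storage classes as they age.
import boto3

s3 = boto3.client("s3")
bucket = "my-data-lake-bucket"  # placeholder

# Versioning is enabled per bucket; once on, every overwrite keeps the prior version.
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# Lifecycle rule: tier down objects under the raw/ prefix as they age.
s3.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-raw-zone",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```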
Azure Data Lake & Blob Storage. Azure Data Lake is the competitor to AWS Lake Formation. As with AWS, Azure Data Lake is centered on its storage layer, with Azure Blob Storage being the equivalent of Amazon S3. It offers three access tiers (Hot, Cool, and Archive) that differ mainly in price, with lower storage cost but additional read and write costs for data that is infrequently or rarely accessed. Azure Data Lake relies heavily on the Hadoop architecture. Additionally, Azure Blob Storage can be integrated with Azure Search, allowing users to search the contents of stored documents, including PDF, Word, PowerPoint, and Excel files. Although Azure provides some level of versioning through blob snapshots, these must be taken explicitly; unlike S3 versioning, prior versions are not captured automatically on every write.
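For comparison, a hedged azure-storage-blob sketch that re-tiers a rarely accessed blob to Cool and takes an explicit snapshot; the connection string, container, and blob names are placeholders:

```python
# Illustrative only: re-tier a blob and snapshot it with the azure-storage-blob SDK.
from azure.storage.blob import BlobServiceClient, StandardBlobTier

service = BlobServiceClient.from_connection_string("<your-connection-string>")
blob = service.get_blob_client(
    container="lake",
    blob="raw/events/2021/01/part-0000.parquet",
)

# Move an infrequently accessed blob to the Cool tier to lower its storage cost.
blob.set_standard_blob_tier(StandardBlobTier.Cool)

# Snapshots are explicit: each one is an addressable, read-only point-in-time copy.
snapshot = blob.create_snapshot()
print(snapshot["snapshot"])  # snapshot identifier (a timestamp)
```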
Google Cloud Storage. Google Cloud Storage is the backend storage driving data lakes built on Google Cloud Platform. As with the other cloud vendors, Google Cloud Storage is divided into storage classes (Standard, Nearline, Coldline, and Archive) by availability and access time, with less frequently accessed storage being much cheaper. Like AWS, Google supports object versioning, enabled at the bucket level.
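And a small google-cloud-storage sketch showing versioning being switched on for a bucket and older generations being listed; the bucket name and prefix are placeholders:

```python
# Illustrative only: enable object versioning on a GCS bucket and list generations.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-data-lake-bucket")  # placeholder

bucket.versioning_enabled = True   # keep noncurrent generations on overwrite/delete
bucket.patch()

# versions=True returns noncurrent generations alongside the live objects.
for blob in client.list_blobs(bucket, prefix="raw/events/", versions=True):
    print(blob.name, blob.generation)
```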
There are many open-source and commercial query engines to choose from. The most popular are:
Presto & Trino (formerly PrestoSQL). Originally built at Facebook, Presto and Trino are distributed ANSI SQL query engines that work with many BI tools and are capable of querying petabytes of data. While Presto was built to solve for speed and cost-efficiency of data access at massive scale, Trino has been expanded by Presto’s founders to accommodate a much broader variety of customers and analytics use cases. Why are Presto/Trino the leading candidates? Both are user-friendly options with good performance, high interoperability, and a strong community. They can access and combine data from multiple sources within a single query, support many data stores and data formats, and ship with many connectors, including Hive, Phoenix (for HBase), Postgres, MySQL, Kafka, and more.
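To make the federation point concrete, here is a hypothetical sketch with the trino Python client joining a lake table (Hive catalog) to an operational MySQL table in one statement; all catalog, schema, and table names are assumptions:

```python
# Hypothetical federated query: one statement spanning the data lake and MySQL.
# Assumes hive and mysql catalogs are already configured on the cluster.
import trino

conn = trino.dbapi.connect(host="presto.example.internal", port=8080, user="analyst")
cur = conn.cursor()
cur.execute("""
    SELECT c.country, count(*) AS orders
    FROM hive.sales.orders AS o          -- Parquet files in the data lake
    JOIN mysql.crm.customers AS c        -- operational database
      ON o.customer_id = c.id
    GROUP BY c.country
    ORDER BY orders DESC
""")
print(cur.fetchall())
```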
Apache Drill. Drill is an open-source distributed query engine that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Drill is an open-source implementation of the ideas behind Google’s Dremel, which Google also offers as an infrastructure service called BigQuery. Drill uses a columnar in-memory representation for computation (its value vectors were the basis for Apache Arrow) and Apache Calcite for query parsing and optimization, but it has never enjoyed wide adoption, mainly because of its inherent performance and concurrency limitations. Some products based on Drill attempt to overcome these limitations. Drill shares many features with Presto/Trino, including support for many data stores, nested data, and rapidly evolving structures. Drill doesn’t require schema definitions, which adds flexibility but can let malformed data through, and its throttling functionality may limit concurrent queries.
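As an illustration of that schema-free querying, here is a sketch that submits SQL over raw JSON files through Drill’s REST API; the endpoint, file path, and field names are placeholders:

```python
# Illustrative only: query raw JSON files via Drill's REST API, no schema required.
import requests

resp = requests.post(
    "http://drill.example.internal:8047/query.json",
    json={
        "queryType": "SQL",
        # dfs is Drill's file-system storage plugin; the path is a placeholder.
        "query": (
            "SELECT t.geo.country AS country, COUNT(*) AS events "
            "FROM dfs.`/data/raw/events/*.json` AS t "
            "GROUP BY t.geo.country"
        ),
    },
)
resp.raise_for_status()
print(resp.json()["rows"][:5])
```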
Spark. Apache Spark is an open-source unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. SQL is supported through Spark SQL, with queries executed by a distributed in-memory computation engine over structured and semi-structured datasets. Spark works with multiple data formats but is more general in its application, supporting a wide range of workloads such as data transformation, machine learning, batch queries, iterative algorithms, streaming, and more. Spark has seen less adoption for interactive queries than Presto/Trino or Drill.
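A short PySpark sketch of the same pattern on the batch side: read lake files, query them with Spark SQL, and write a curated copy back. Paths and column names are placeholders:

```python
# Illustrative only: a small batch job over the same lake files with Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-batch").getOrCreate()

# Read raw Parquet directly from object storage (s3a:// via the Hadoop S3 connector).
clicks = spark.read.parquet("s3a://my-data-lake/raw/web_clicks/")
clicks.createOrReplaceTempView("web_clicks")

daily = spark.sql("""
    SELECT event_date, count(*) AS events
    FROM web_clicks
    GROUP BY event_date
""")

# Write a curated, analytics-ready copy back to the lake.
daily.write.mode("overwrite").parquet("s3a://my-data-lake/curated/daily_clicks/")
```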
With managed analytics services, enterprises can start using data analytics quickly — and let their third-party provider deal with all the hassle of storing and managing the data. A managed solution also allows users throughout the company to quickly run unlimited queries without having to wait on the DevOps team to allocate resources. However, as adoption and query volume grow, spending balloons dramatically.
At the end of the day, every managed analytics solution becomes another data silo with its own data flows to manage. Unified access control, audit trails, data lineage, discovery, and governance become complex, requiring custom integrations and creating vendor lock-in. That’s the double-edged sword of “quick-start” managed solutions that CIOs need to be aware of, so they can prepare to shift to more economical, in-house managed DataOps programs for the cost and control advantages they offer in the long term.
The cloud data lake analytics stack can be used for a wide range of analytics use cases:
The cloud data lake analytics stack dramatically improves speed for ad hoc queries, dashboards and reports. It enables you to operationalize all your data and run existing BI tools on lower-cost data lakes without compromising performance or data quality, while avoiding costly delays when adding new data sources and reports.
To serve as many use cases as possible and shift your workloads to the cloud data lake, you need to avoid data silos and make sure your stack is analytics-ready, with workload observability and acceleration capabilities, so it can be easily integrated with niche analytics technologies such as text analytics for folder and log analysis. A solution with integrated text analytics lets data teams run text search at petabyte scale directly on the data lake for marketing, IT, and cybersecurity use cases (and more). Traditional text analytics platforms were not designed for these “needle in a haystack” searches at petabyte scale.
The agility and flexibility benefits of the cloud data lake are clear, but performance and cost are the critical driving forces behind the massive adoption of data lakes. As analytics use cases grow in demand across almost every business unit, data teams constantly struggle to balance performance and cost.
Manual query prioritization and performance optimization are time-consuming, don’t scale, and often result in a heavy DataOps burden. To expand the open data lake across the entire organization, data teams should seek a smart, dynamic solution that autonomously accelerates queries using advanced techniques such as micro-partitioning or dynamic indexing.
Varada is an autonomous query acceleration platform that gives users control over the performance and cost of their cloud data lake analytics. Varada delivers high ROI by leveraging dynamic and adaptive indexing and caching, efficient scan and predicate pushdown, and optimized dynamic filtering to accelerate SQL queries by 10x-100x compared with other data lake query engines. Varada autonomously and continuously learns and adapts to the users, the queries they’re running, and the data being used. Workload-level observability gives DataOps teams an open view of how data is being used across the entire organization, so they can focus DataOps resources on business priorities.
With Varada, data teams and users no longer need to compromise on performance in order to achieve agility and cost-effectiveness. As an example, check out our benchmark data comparing Varada with Trino and with Snowflake.
Now is the time to migrate your analytics workloads to your cloud data lake. Chances are, your competition is already there.