I’ve held leadership positions in tech companies for over 20 years. Like many others, I’ve seen data grow dramatically in scale and importance. Unfortunately, data management has become more complex in direct proportion to this surge in data criticality.
In the past, there really wasn’t much to consider; you had a central SQL database storing your data.
At the turn of the century, with Web 2.0 taking over, databases were required to serve numerous clients concurrently, while handling greater volume and throughput than before. NoSQL surged in popularity to address these needs but ultimately fell out of favor as the one-stop solution; as many of us learned the hard way, reporting and running BI over NoSQL is no fun.
In recent years, the importance of data has risen dramatically. It is now common practice to store raw data as well as processed data. When dealing with massive amounts of data, a single SQL or NoSQL database is far from sufficient. This drove many companies to adopt a strategy of separating data capture and data consumption.
Data Capture is the practice of storing raw data (often from various data sources) in an affordable and robust silo in the cloud.
Data Consumption is the layer of clients that read the data; this includes applications, services, BI tools, data scientists, machine learning programs, etc.
This separation supports the single-source-of-truth paradigm, makes it possible for data to be available in near real time, and allows storing the raw data in full, without losing any potentially important dimensions.
Data servicing tools reside in the layer between Data Capture and Data Consumption. Generally speaking, these tools can be divided into two groups: those that emphasize Cost and Performance, and those that emphasize Agility.
Unfortunately, tools that emphasize Cost and Performance are rigid; run a query the system wasn’t designed for and you’ll be waiting far too long for results. Tools that emphasize Agility free you from the need to prepare the data in advance, but their performance-to-cost ratio is not viable for operational workloads.
The rise of distributed SQL solutions such as AWS Redshift, Google BigQuery, and Azure Synapse as well as a slew of independent solutions signaled the return to SQL databases for most applications. The distributed aspect brought great improvements in speed and concurrency.
These solutions work wonderfully once they are up and running. However, getting there is no small feat.
To get these solutions working properly, you’ll need to model the data in an optimal way. The model should be designed around the queries you’ll be running, so you need a good process for anticipating what these queries will be. Once the design is ready, you’ll need to prepare the ETL process that brings the data from your data lake to your solution of choice.
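To make the trade-off concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for a warehouse. The event data, the table name `daily_revenue` and the helper functions are all hypothetical, chosen only for illustration. The ETL step models the data for one anticipated query (revenue per day and country) and drops the product dimension along the way.

```python
import sqlite3

# Hypothetical raw events standing in for records in a data lake.
# Columns: (day, country, product, revenue)
RAW_EVENTS = [
    ("2020-01-01", "US", "widget", 10.0),
    ("2020-01-01", "DE", "widget", 5.0),
    ("2020-01-02", "US", "gadget", 7.5),
    ("2020-01-02", "US", "widget", 2.5),
]


def build_warehouse(conn):
    """ETL step: load a table modeled for one anticipated query,
    'revenue per day and country'. The product dimension is dropped."""
    conn.execute(
        "CREATE TABLE daily_revenue (day TEXT, country TEXT, revenue REAL)"
    )
    totals = {}
    for day, country, _product, revenue in RAW_EVENTS:
        totals[(day, country)] = totals.get((day, country), 0.0) + revenue
    conn.executemany(
        "INSERT INTO daily_revenue VALUES (?, ?, ?)",
        [(d, c, r) for (d, c), r in sorted(totals.items())],
    )


def revenue_by_country(conn, day):
    """The anticipated query: answered directly from the modeled table."""
    rows = conn.execute(
        "SELECT country, revenue FROM daily_revenue "
        "WHERE day = ? ORDER BY country",
        (day,),
    )
    return dict(rows.fetchall())


def can_answer(conn, column):
    """Check whether the modeled table still carries a given dimension."""
    info = conn.execute("PRAGMA table_info(daily_revenue)")
    return column in [row[1] for row in info]  # row[1] is the column name


conn = sqlite3.connect(":memory:")
build_warehouse(conn)
```

In this sketch `revenue_by_country(conn, "2020-01-02")` is served straight from the modeled table, while `can_answer(conn, "product")` is false: a new question about products means going back to the raw data and reworking both the model and the ETL.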
All this requires planning, resources (tools and a dedicated team) and time.
If you are lucky, business requirements will remain static for at least as long as it took you to set up all of the above. Unfortunately, this is often not the case, and being able to respond quickly to changing business requirements is a necessity, not a convenience.
There’s another family of data solutions that emphasize agility and offer high flexibility. Solutions such as AWS Athena and Presto on EMR allow you to run queries directly on your data lake, with no need to prepare the data.
AWS Athena is a great tool that allows anyone with basic SQL knowledge to run powerful queries directly on the data lake. The freedom to run any query and get live results without any data preparation whatsoever is a game-changer. Many companies make Athena available to their data engineers and analysts for preliminary data research. Unfortunately, the pricing model and performance usually make this tool impractical for most operational and customer-facing solutions.
Presto is an open-source distributed SQL engine designed for the world of big data. Presto integrates with a wide range of data sources such as RDBMSs, NoSQL solutions and Hadoop data warehouses. One of Presto’s unique advantages is that it allows you to run SQL queries across different data sources.
Presto on Amazon EMR is a popular choice as it gives you a very flexible solution. It can leverage data partitioning where applicable but can also handle queries that don’t align with the partitions. The distributed aspect allows you to improve performance by adding nodes to the Presto cluster. However, this translates to higher costs.
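The effect of partitioning can be sketched with a toy model. This is not how Presto is implemented. It is a simplified illustration in which the layout, the byte sizes and the `scan` function are all hypothetical. The data is partitioned by day, so a query that filters on the day reads a single partition, while a query that filters on any other dimension has to read everything.

```python
# Hypothetical data lake layout: one partition per day, each with a size
# in bytes and a few rows. All names and numbers are illustrative.
PARTITIONS = {
    "2020-01-01": {"bytes": 400, "rows": [{"country": "US"}, {"country": "DE"}]},
    "2020-01-02": {"bytes": 300, "rows": [{"country": "US"}]},
    "2020-01-03": {"bytes": 300, "rows": [{"country": "FR"}, {"country": "US"}]},
}


def scan(day=None, country=None):
    """Return (matching_rows, bytes_scanned) for a simple filter query.

    A filter on `day` matches the partition key, so non-matching
    partitions are skipped without being read (partition pruning).
    A filter on `country` cannot prune: every partition is read.
    """
    bytes_scanned = 0
    matches = []
    for part_day, part in PARTITIONS.items():
        if day is not None and part_day != day:
            continue  # pruned: this partition is never read
        bytes_scanned += part["bytes"]
        for row in part["rows"]:
            if country is None or row["country"] == country:
                matches.append(row)
    return matches, bytes_scanned
```

Here `scan(day="2020-01-02")` reads 300 bytes, whereas `scan(country="US")` reads all 1000. Both queries return correct results, but the one that does not match the partitioning scheme pays for a full scan, in time, in cluster size, or in both.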
Are we doomed to cycle between these two compromises? Must we choose between the freedom and agility to query on different dimensions and a fast, cost-efficient system? Is it really impossible to have both?
Can’t we do better? We definitely need to. Looking to the future, the challenges will only grow. The overall volume of data, the number of data sources and the variety of data types will all continue to increase. Moreover, the drive for insights will keep rising as the number of data consumers grows rapidly. Today, a company’s success is directly related to how effectively it utilizes its data. In the future, this correlation will only grow stronger.
This is exactly why we started Varada!
After being in stealth mode for a couple of years, we can finally share that we’ve built a big data infrastructure platform that is both fast and flexible: it is 100x faster than any database we compared it with.