Data Mesh: Why We Need It and How It Works

By Ori Reshef
March 10, 2022

Let’s begin this discussion with a prediction: By 2025, those of us who live in the ever-churning world of data aggregation, transportation, ETL, storage, BI, and accessibility will look at Data Mesh much as we look at cloud computing today: a strategy that simultaneously shrunk overhead (time and money), reduced grunt work (maintenance, upgrades, backups), and provided end-user abilities that didn’t exist before. In short, it’s a dramatic evolution whose technical and business advantages make it both obvious and inevitable.

Before getting into what Data Mesh is, let’s review how we got here: what challenges does this evolving paradigm come to solve?

1.     Too much data from too many sources. The tsunami of data pouring in as businesses embrace full digital transformation is staggering. Data points flow in dynamically, globally, at a granularity never before contemplated. And while historical financial/operational data has always been used as an analytical tool to drive business decisions by management, we’re now seeing BI providing game-changing insights driven by always-on transactional data for the marketing, sales, and product development teams as well. These customer-facing teams can finally, instantly know what’s working and what’s not, based on every single action taken by customers. It’s an extraordinary power to have, but the amount of data they have to work with is hard to collect, store, query, and manage.

2.     Data lakes — No longer the single source of truth. Nobody will argue that siloed data is a good thing; hundreds of startup companies have emerged, offering solutions to break open those silos. But while the goal over the past decade has been the unification of data sources into a single repository to yield a “single source of truth,” that repository suddenly – amazingly – feels like yesterday’s strategy. Why? Because it introduces several limitations while that immense single source swells day by day:

a.     Large-scale enterprise data management is messy. In particular, it’s a challenge to integrate live, flowing data into static or historical data. (A trend we address in the second point in our 2022 predictions)

b.     Moving data in and out of the data lake from edge sources – and managing its storage once it arrives – is time-consuming, resource-intensive, and very expensive. The bottlenecks become more frequent, and business agility declines.

c.     A single, aggregated collection of data cannot easily satisfy data residency and privacy regulations that vary from country to country; data governance is geographically diverse, whereas the hardware is not.

d.     Finally – often the most painful feature of a bloated data lake – query overhead doesn’t scale. As more and more users need to query the same database, add sources, or manipulate what’s there, response times slow. This assumes, of course, that the data lake incorporates true data virtualization to seamlessly allow anyone with permission to connect to any data source or platform – an important concern according to our recent survey.

3.     There’s just too much work for a centralized data team. Enterprise data teams are “trapped” between data providers and data users. Serving disparate business domains, each with its own complex, changing, and ad-hoc requests, is taking its toll. Centralizing control and access among a limited number of people means that even minor requests often take their place in a long queue, delayed until the data team can assess the request, create the pipeline, and provide the data for analytics-based insights. And the data team usually resides in an ivory tower, far from the business users (geographically and tactically), which makes all of this even more of a struggle.

In short, putting all your eggs in one basket has some appeal, but that’s going to be one heavy basket that’s hard to carry … or to locate the right egg.

So how do we maintain the benefits of a centralized, standardized data lake while introducing scalability and access that currently don’t exist? Can there be such a thing as a “distributed data lake”?

What is Data Mesh?

A Data Mesh is a decentralized approach to data management, in which the data itself remains within the business domain that collected it, cleaned it, and now manages it. But before the “silo” alarm bells go off, there’s one critical difference: on top of this privately owned, coherent business data sits a distributed query engine that SQL clients across the entire organization can use to access and unify it for interoperability, rather than storing it centrally. In other words, it democratizes the data. It creates “datasets as a product”: a standardized offering, available to anyone with permission. It’s secure, in compliance with local regulations, and suddenly considerably more scalable. In short, with data mesh architecture, the business domain user rises to the top of the priority list. It is the domain user – rather than a central data scientist – who owns the decisions about what data can and cannot be provided, and when, freed from the costly infrastructure constraints that keep the organization from accessing the accumulated wisdom of all its data.
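The “dataset as a product” idea above can be sketched in a few lines of code. This is a minimal, hypothetical in-process model – the class and field names are invented for illustration, and a real mesh would use a platform-grade catalog and query layer – but it shows the key inversion: the data and the access policy both live with the owning domain, while a thin central catalog only routes requests.

```python
# Minimal sketch of "datasets as a product": each domain publishes its
# data along with its own access policy; the catalog stays thin.
# All names here (DataProduct, MeshCatalog) are illustrative, not a real API.
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """A domain-owned dataset published to the mesh."""
    domain: str                     # owning business domain, e.g. "sales"
    name: str                       # product name, e.g. "orders"
    rows: list                      # the data itself stays with the domain
    allowed_roles: set = field(default_factory=set)  # domain-owned policy


class MeshCatalog:
    """Central catalog of products; the data remains decentralized."""

    def __init__(self):
        self._products = {}

    def publish(self, product: DataProduct):
        self._products[(product.domain, product.name)] = product

    def query(self, domain, name, role):
        product = self._products[(domain, name)]
        if role not in product.allowed_roles:
            raise PermissionError(f"{role} may not read {domain}.{name}")
        return product.rows


# The sales domain publishes its data and decides who may read it.
catalog = MeshCatalog()
catalog.publish(DataProduct("sales", "orders",
                            rows=[{"id": 1, "amount": 120}],
                            allowed_roles={"analyst"}))
print(catalog.query("sales", "orders", role="analyst"))
```

Note that access control is set by the publishing domain, not by a central gatekeeper – which is exactly the shift in ownership the paragraph describes.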

It’s all about Empowering People

The upshot? Automated, comprehensive, instant analytics at scale: Data scientists – and more interestingly, data consumers with less expertise and training – will now be able to access business data, conducting their own analysis focused on their own business needs. This self-service strategy, with its single point of access control, represents for the first time a people-centric plan for data management; a faster and more effective way to get answers without taxing the DevOps team or waiting on their availability. Zhamak Dehghani, Director of Emerging Technologies at Thoughtworks, who is credited with creating this paradigm in 2019 at an O’Reilly conference (she named it later, when she literally wrote the book on the subject), refers to it as a hybrid: “a decentralized sociotechnical approach — concerned with organizational design and technical architecture.”

Access Drives Insights

The data mesh is also, in a sense, the next phase in the “anyone/anywhere” model that we’ve come to expect from cloud computing and data virtualization. A business domain’s own applications and access tools are usually designed for its own users and their specific needs. And in an ideal situation, its data is local, so latency is minimal. But if members of one business unit seek data from another, they are limited by their own frameworks. If they do gain access to that centralized data lake, its remote location and sheer size (most of which isn’t the business unit’s own data) increase latency. With a data mesh, it is easier than ever to have systems interact, share their on-site data, and make the results available to a diverse group of business users. These may be completely independent teams (say, HR and R&D) or cross-functional teams with the same goals and often the same data (QA working with Product Management, or Sales working with Marketing). This new effortless transparency promises new levels of productivity.

How does it look?

As this approach takes hold, keep an eye out for three “flavors” of data mesh; most companies will use a combination of these:

·       File-based: The data is compiled, packaged, and simply provided as a static file. This is the closest to today’s simple cloud-storage approach, but will exist under the new universal peer-to-peer sharing model.

·       Event-driven: No matter what business unit or department, consumers can “sign up” for alerts when data changes in a way that may be meaningful to them. Again, not rocket science, but only available once this previously siloed data is exposed and accessible across the organization.

·       Query-enabled: The most powerful flavor: any user can submit federated queries spanning multiple databases, creating insights only possible when combining results. This is the Holy Grail that gives end-users new capabilities – and data scientists some stored-up vacation time.
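To make the query-enabled flavor concrete, here is a minimal sketch that simulates a federated query with SQLite’s ATTACH mechanism. A production mesh would use a true federated engine (Trino, Presto, and similar tools fill this role), and the table and column names below are invented for illustration – but the shape of the query is the point: one statement joins data owned by two different domains.

```python
# Simulate a federated query: two "domains" keep separate databases,
# and a single SQL statement joins across them.
# Table/column names are hypothetical; a real mesh would use a
# distributed engine rather than SQLite's ATTACH.
import sqlite3

# The "sales" domain keeps its own database (in memory, for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INT, amount REAL)")
conn.execute("INSERT INTO orders VALUES (1, 120.0), (2, 75.5)")

# The "marketing" domain's database is attached, not copied or centralized.
conn.execute("ATTACH DATABASE ':memory:' AS marketing")
conn.execute("CREATE TABLE marketing.campaigns (customer_id INT, campaign TEXT)")
conn.execute("INSERT INTO marketing.campaigns VALUES (1, 'spring_promo')")

# One query spans both domains: revenue per campaign, an insight
# neither domain could produce from its own data alone.
rows = conn.execute("""
    SELECT c.campaign, SUM(o.amount)
    FROM orders o JOIN marketing.campaigns c USING (customer_id)
    GROUP BY c.campaign
""").fetchall()
print(rows)  # [('spring_promo', 120.0)]
```

The data never leaves its owner’s store; only the query travels, which is what keeps residency and latency concerns local to each domain.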

Let’s remember that this movement isn’t entirely about the corporate business employee; the end user – whether a financial client, gamer, streamer, or researcher – is going to feel the speed of a response when the data is coming from an optimized, dedicated, distributed source, rather than a massive, multi-purpose one. In return, the clickstream and web data these users provide along the way can be instantly absorbed and processed as a pure feedback loop to improve performance, product features, and ultimately, profits.

Empowering data democratization

While widespread implementation and adoption aren’t going to happen overnight, many organizations are embracing data mesh architecture to democratize and scale their data.

However, this move puts responsibility on data teams to become truly autonomous: they will need to ingest and clean data themselves, create ETL pipelines (and maintain them), and manage access control. At the same time, the more they invest in these fully-owned steps, the better results they can expect. And yes, that means a new “warm and fuzzy” era of mutually beneficial sharing as each domain helps the others by simply transforming and offering their data to their community.
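The ingest-clean-publish responsibility described above can be sketched as three small steps. This is a toy illustration with invented field names – in practice a domain team would reach for tools like dbt, Airflow, or Spark – but it shows what “owning the pipeline” means end to end: raw events come in, malformed rows are dropped and fields normalized, and the result is published as a product for the mesh.

```python
# A toy domain-owned pipeline: ingest raw events, clean them, and
# publish the result as a data product. Field names are hypothetical.

def ingest(raw_records):
    """Extract: accept raw events exactly as the domain receives them."""
    return list(raw_records)


def clean(records):
    """Transform: drop malformed rows and normalize fields."""
    return [
        {"email": r["email"].strip().lower(), "amount": float(r["amount"])}
        for r in records
        if r.get("email") and r.get("amount") is not None
    ]


def publish(records):
    """Load: expose the cleaned dataset as a product for the mesh."""
    return {"schema": ["email", "amount"], "rows": records}


raw = [{"email": " Ada@Example.com ", "amount": "9.5"},
       {"email": None, "amount": "3"}]   # malformed row: no email
product = publish(clean(ingest(raw)))
print(product["rows"])  # [{'email': 'ada@example.com', 'amount': 9.5}]
```

Because the domain team writes and runs all three steps, it also answers for the product’s quality – which is the flip side of the autonomy the paragraph promises.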

We’ll close with a parting thought from Dehghani, as she describes the overarching value of distributed data mesh architecture, with domain-owned data under a centralized access system: “Over the last decades, the technologies that have exceeded in their operational scale have one thing in common: they have minimized the need for coordination and synchronization.”
