You can’t discuss game-changing improvements in Enterprise Data Management without discussing its evolution; each re-imagined paradigm, strategy, or technology represents a response to real-world challenges that may not have existed just a few years ago. Whether the goal is to improve efficiency, accuracy, or cost, or to reduce frustration among the data team, no new approach appears without specific triggers.
Here we’re going to discuss the cause and effect of the latest game-changing shift: dynamic indexing.
Ever since John Mashey, from Silicon Graphics, coined the term “Big Data” in 1998, data teams have struggled to “tame” it. Their challenge, in a nutshell: The more data you have, the longer it takes to query it for meaningful results. And we know that today’s digital migration across all sectors is creating unimaginable quantities of data – both structured and unstructured – tracking every activity, individual, product, document, image, and more. It’s way beyond Terabytes – it’s Zettabytes or Brontobytes. And we’ll probably come up with new names. But the key point is that even with all that volume, velocity, and variety (SAS adds variability and veracity), everyone wants quick, easy access to the insights this information offers.
It has long been apparent that a direct, brute-force query to a massive complex of data is limited: Searching for a customer (using his unique customer ID) is easy enough when you have a thousand customers, or even a hundred thousand. But how about finding one in a million? And that’s a simple lookup – what if you want to identify that customer only if he has other characteristics (recent activity, point balance, a specific mobile phone version) drawn from joining this database with another?
Now let’s introduce that other evolutionary term – the Data Lake. While this massive, centralized collection of data in its original, raw form is designed to present a single source of truth, its very nature emphasizes the need for more effective queries that don’t have to churn through all your data, all the time. After all, data is actually outpacing Moore’s Law, meaning that the hardware’s computing power isn’t keeping up with the data it stores and processes.
Your Data Lake has to ingest data from various sources, then distill it into some kind of structured schema, cleaning and transforming it for analysis. Only then can it serve real-time and batch queries from API or human data clients. Each of these steps is designed to optimize the process of making meaning from raw data, but often it’s the final one, the actual query, that’s the toughest nut to crack.
First, the basics: An index is a data structure that collects, parses, and stores data to speed up the retrieval and analysis of relevant records. To eliminate the need to search every row of a database in a query, the index catalogs only the specific columns most often queried, drastically reducing the processing power required to search through it. For example, most customer queries are based on unique ID, name, or perhaps product owned; rarely is a street or email address, or “date of most recent phone call,” the search’s starting point. These fields may be included in the results, but they are usually not used as the main filter of the query. The key search criteria are sometimes obvious, and sometimes easily (and more accurately) discerned from a review of queries over time.
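To make the difference concrete, here is a minimal sketch in Python that compares a brute-force scan with a lookup through a simple index on one column. The customer records, the field names, and the helper functions are all hypothetical and exist only for illustration; a real database index is far more sophisticated than a dictionary of row positions.

```python
from collections import defaultdict

# Hypothetical customer records; the field names are illustrative only.
customers = [
    {"customer_id": i, "name": f"customer-{i}", "state": "WI" if i % 50 == 0 else "NY"}
    for i in range(100_000)
]


def full_scan(rows, column, value):
    """Brute force: inspect every row and keep the ones that match."""
    return [row for row in rows if row[column] == value]


def build_index(rows, column):
    """Map each distinct value in `column` to the positions of the rows holding it."""
    index = defaultdict(list)
    for position, row in enumerate(rows):
        index[row[column]].append(position)
    return index


def indexed_lookup(rows, index, value):
    """Touch only the rows the index points to instead of the whole data set."""
    return [rows[position] for position in index.get(value, [])]


# Build the index once for the column that queries filter on most often.
id_index = build_index(customers, "customer_id")

scan_result = full_scan(customers, "customer_id", 73_421)
index_result = indexed_lookup(customers, id_index, 73_421)
```

Both calls return the same single record, but the scan reads all 100,000 rows while the indexed lookup reads one. The catch is that the index only helps queries that filter on the column it was built for.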
So at its simplest level, a data indexing technique is designed to reduce the amount of data that must be retrieved and read in order to answer a query. But therein lies the problem: the simplest (and most common) approach is to pre-sort, partition, or limit the data being indexed. And yes, if that sounds like a limiting solution … it is. It’s like solving a daily rush-hour traffic problem by limiting the cars allowed to drive. The traffic moves, but way too many people can’t get to where they want to go. When that marketing exec does want to find all female customers from Wisconsin who called Tech Support in the last month, she’s out of luck.
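The limitation of partitioning can be sketched in a few lines of Python. In this simplified model the data is partitioned on a single column, and a query can skip partitions only when it filters on that column. The data set, the choice of signup_year as the partition column, and the function names are assumptions made purely for illustration.

```python
from collections import defaultdict

STATES = ["WI", "NY", "CA", "TX"]

# Hypothetical customer rows; every field here is illustrative.
rows = [
    {
        "customer_id": i,
        "signup_year": 2018 + i % 4,
        "state": STATES[(i // 4) % 4],
        "gender": "F" if i % 2 == 0 else "M",
        "called_support": i % 10 == 0,
    }
    for i in range(1000)
]


def partition_by(rows, column):
    """Group rows into partitions keyed by the value of one column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[column]].append(row)
    return dict(partitions)


def run_query(partitions, partition_column, predicates):
    """Return (matching rows, number of rows read).

    Partitions are skipped only if the query filters on the partition column.
    """
    if partition_column in predicates:
        wanted = predicates[partition_column]
        candidates = [partitions.get(wanted, [])]
    else:
        candidates = list(partitions.values())

    matches, rows_read = [], 0
    for partition in candidates:
        for row in partition:
            rows_read += 1
            if all(row[col] == val for col, val in predicates.items()):
                matches.append(row)
    return matches, rows_read


partitions = partition_by(rows, "signup_year")

# Filters on the partition column: only one partition is read.
_, pruned_reads = run_query(partitions, "signup_year", {"signup_year": 2020, "state": "WI"})

# The marketing exec's question ignores the partition column: everything is read.
exec_matches, full_reads = run_query(
    partitions, "signup_year", {"state": "WI", "gender": "F", "called_support": True}
)
```

In this toy data set the first query reads 250 rows and the second reads all 1,000. Partitioning helped only the query it was designed around.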
A unique, powerful, and proven option has emerged that eliminates this tradeoff by indexing all data directly from your data lake, breaking data across any column into what we at Varada call nanoblocks: tiny subsets of your data, managed by an index management platform. By analyzing the data type, structure, and distribution of the data in each nanoblock, a dynamic indexer is adaptive: it assigns each nanoblock the most appropriate and effective index, whether Bitmap, Dictionary, Tree, Lucene, or one of many others. This approach looks at Big Data in very small pieces that can be accurately indexed, and creates a mesh of randomly accessed, optimized indexes that represents the entire indexed data set. The result? You can quickly execute queries against these nanoblock mesh indexes without the overhead inherent in traditional indexes. And yes, there are certainly basic queries that don’t actually need this type of acceleration; the platform managing this process seamlessly chooses whether and how each query will be accelerated, either by using the correct index type or by caching the data if indexing is not the solution. In other words, with this approach to Big Data indexing, you avoid the resource-draining brute-force query, as well as the need to restrict your solution to a limited set of partitions. The benefits are numerous:
Speed: Dynamic query analyzers meet continuously evolving performance and concurrency requirements for both human and API queries. When filtering, joining, and aggregating data, a side-by-side comparison to a static index is almost unfair.
Expense: Data teams can keep cost per query predictable and under control, while improving business decision-making with queries that take seconds, not minutes, or ultra-complex queries that take minutes, not hours.
Labor: When a fully-packaged platform is deployed on top of your existing data lake, there’s no DevOps headache – no need to model data or move it to optimized data platforms.
Familiarity: To many, this is the biggest surprise as they begin working with dynamic indexes: queries remain identical to the standard SQL you’re used to. Any WHERE clause on indexed columns within a SQL statement works as before, whether for point lookups, range queries, or string matching.
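To give a feel for the adaptive selection described earlier, here is a toy Python sketch that splits a column into small blocks and picks an index type for each block based on the data it finds there. This is not Varada’s implementation: the block size, the thresholds, the rules, and the function names are all assumptions invented for illustration, and a real dynamic indexer would weigh many more signals.

```python
def split_into_blocks(values, block_size=64):
    """Break one column's values into small, fixed-size blocks."""
    return [values[i:i + block_size] for i in range(0, len(values), block_size)]


def choose_index_type(block):
    """Pick an index type for one block from its type and value distribution.

    The rules and thresholds below are illustrative assumptions only.
    """
    if not block:
        raise ValueError("block must not be empty")

    distinct = len(set(block))

    # Free text (strings containing spaces) suits a text-search style index.
    if all(isinstance(v, str) for v in block) and any(" " in v for v in block):
        return "text"
    # A handful of distinct values: one bitmap per value is compact and fast.
    if distinct <= 16:
        return "bitmap"
    # Many repeats but more distinct values: encode through a dictionary.
    if distinct / len(block) < 0.5:
        return "dictionary"
    # Mostly unique values: an ordered tree supports point and range lookups.
    return "tree"


def plan_indexes(column_values, block_size=64):
    """Choose an index type independently for every block of a column."""
    return [choose_index_type(block) for block in split_into_blocks(column_values, block_size)]


# Hypothetical point-balance column: the first block holds a few repeated
# values, the second block holds values that are all different.
point_balances = [0, 100, 200, 300] * 16 + list(range(1000, 1064))
index_plan = plan_indexes(point_balances)
```

Here the same column ends up with a bitmap index for its first block and a tree index for its second, because the distribution of values differs between the two. That per-block decision is the essence of the idea: no single index type has to fit the whole column.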
As data management continues to behave like a roller coaster facing a tsunami that grows exponentially (there’s an image fit for Hollywood!), watching the evolution of strategies for analytics engines is actually exciting for those of us in the sector. The growing demands to cut costs, accelerate retrieval of fresh, clean, live data, and provide business insights and continuous intelligence (even to non-technical workers in a self-service mode) all mean that dynamic indexing may very well hold the key to the next generation of business-driven data management strategies.