A standard approach for speeding up access to data in a database is indexing fields that are commonly used in query predicates. The most common indexing techniques were developed for transaction processing databases. For example, by indexing on a customer identifier, a database management system can efficiently find just the records associated with a customer. Instead of scanning the entire dataset, the engine searches an index of customer identifiers which lists the location of all the records relating to each customer. The query engine then processes just the relevant records.
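To make the contrast concrete, here is a minimal sketch of an index lookup versus a full scan. It assumes a tiny in-memory list standing in for a table; the record layout and the names `build_index`, `full_scan`, and `index_lookup` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical in-memory "table": a record's list position stands in for its location.
records = [
    {"customer_id": 17, "amount": 30},
    {"customer_id": 42, "amount": 12},
    {"customer_id": 17, "amount": 8},
    {"customer_id": 99, "amount": 55},
]


def build_index(rows, column):
    """Map each distinct value of `column` to the positions of the rows holding it."""
    index = defaultdict(list)
    for position, row in enumerate(rows):
        index[row[column]].append(position)
    return index


def full_scan(rows, column, value):
    """Baseline: inspect every record in the dataset."""
    return [row for row in rows if row[column] == value]


def index_lookup(rows, index, value):
    """Read only the rows the index points to."""
    return [rows[position] for position in index.get(value, [])]


customer_index = build_index(records, "customer_id")
matches = index_lookup(records, customer_index, 17)
```

Both paths return the same records; the difference is that the lookup touches only the positions listed for the requested customer, while the scan visits every row.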
Because an index must be updated with every change to the dataset, this approach works well for manageable amounts of slowly changing data, at rates of up to hundreds of thousands of transactions per second.
Big data systems have traditionally eschewed indexing because updating an index slows data collection when data streams in at millions of events per second or more. Instead, big data management systems often use compressed partitioning schemes, where data is stored in blocks segmented by a primary column, often the data timestamp.
At query time, instead of speeding up requests by looking up predicates in an index, big data systems quickly read compressed data in parallel across many servers, filtering out partitions based on the partition column. Looking for a few days from last month in a data set that spans years is just as fast in a big data system as with classic transactional indexing, if not faster.
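The partition-based approach can be sketched roughly as follows. This is an illustration under stated assumptions, not any particular system's implementation: each partition is one day of events stored as a zlib-compressed JSON block, a thread pool stands in for many servers, and the names `make_partition`, `scan_partition`, and `query` are hypothetical.

```python
import json
import zlib
from concurrent.futures import ThreadPoolExecutor
from datetime import date


def make_partition(day, events):
    """Store one day's events as a single compressed block keyed by its date."""
    return {"day": day, "block": zlib.compress(json.dumps(events).encode("utf-8"))}


def scan_partition(partition, predicate):
    """Decompress one block and filter its events."""
    events = json.loads(zlib.decompress(partition["block"]).decode("utf-8"))
    return [event for event in events if predicate(event)]


def query(partitions, start, end, predicate):
    """Prune partitions by date range, then scan the survivors in parallel."""
    selected = [p for p in partitions if start <= p["day"] <= end]
    with ThreadPoolExecutor() as pool:
        chunks = pool.map(lambda p: scan_partition(p, predicate), selected)
        results = [event for chunk in chunks for event in chunk]
    return results, len(selected)


# Ten daily partitions, two events each (made-up data).
partitions = [
    make_partition(date(2023, 5, d), [{"user": u, "day": d} for u in ("a", "b")])
    for d in range(1, 11)
]
rows, scanned = query(
    partitions, date(2023, 5, 3), date(2023, 5, 5), lambda e: e["user"] == "a"
)
```

Only the three partitions inside the date range are decompressed; the other seven are skipped on the partition column alone. Note that the predicate on `user` still requires reading every event in the selected blocks, which is the gap an index on a non-partition column would close.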
A new approach to big data indexing is emerging, which blends traditional indexing techniques with the big data approach to partitioning. Partitioning an index eliminates the drawbacks of indexing massive amounts of data and delivers the speed advantages data teams expect from big data analysis. The approach to managing big data indexes is based on nano-block indexing, which involves storing multiple small chunks of each index. Each chunk is a segment of the complete index. Put together, the individual index segments recreate the equivalent of a global index in a traditional database. Nano-block partitions are written independently and read in parallel at query time. Since big data rarely changes once written, nano-blocks don’t need to be optimized for regular updates the way transactional indexes are.
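The idea of a segmented index can be sketched in a few lines. This is a toy model based only on the description above, not Varada's implementation: the block size, the row layout, and the names `split_blocks`, `build_segment`, and `search` are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 3  # hypothetical; kept tiny so the example is easy to follow


def split_blocks(rows, size=BLOCK_SIZE):
    """Cut the dataset into small, fixed-size blocks."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def build_segment(block, column):
    """Index one block only: value -> offsets within that block."""
    segment = {}
    for offset, row in enumerate(block):
        segment.setdefault(row[column], []).append(offset)
    return segment


def search(blocks, segments, value):
    """Probe every index segment in parallel and fetch only the matching rows."""
    def probe(pair):
        block, segment = pair
        return [block[offset] for offset in segment.get(value, [])]

    with ThreadPoolExecutor() as pool:
        hits = list(pool.map(probe, zip(blocks, segments)))
    return [row for block_hits in hits for row in block_hits]


rows = [{"id": i, "country": c} for i, c in enumerate("US DE US FR DE US JP".split())]
blocks = split_blocks(rows)
# Each segment is built from its own block alone, so segments can be written independently.
segments = [build_segment(block, "country") for block in blocks]
us_rows = search(blocks, segments, "US")
```

Taken together, the segments cover every row exactly once, so they answer the same lookups a single global index would. Because a block never changes after it is written, its segment never needs an in-place update.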
Unlike basic big data partitions, which are limited to a single primary segment column, big data indexes can be created on any column, and column indexes can be added and removed without updating the primary dataset. Big data indexing results in even higher performance for data-driven analytics with increased flexibility, and no sacrifice in the speed of data collection or query response time.
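A rough sketch of that flexibility, assuming a hypothetical `IndexedBlock` class that keeps the row data immutable and holds its column indexes in a separate structure. None of these names come from a real product.

```python
class IndexedBlock:
    """An immutable block of rows plus optional, independently managed column indexes."""

    def __init__(self, rows):
        self.rows = tuple(rows)  # primary data: written once, never rewritten
        self.indexes = {}        # column name -> {value: [offsets]}

    def add_index(self, column):
        """Build an index for any column by reading, not modifying, the rows."""
        index = {}
        for offset, row in enumerate(self.rows):
            index.setdefault(row[column], []).append(offset)
        self.indexes[column] = index

    def drop_index(self, column):
        """Discard an index; the rows themselves are untouched."""
        self.indexes.pop(column, None)

    def find(self, column, value):
        """Use the column's index when one exists; otherwise fall back to a scan."""
        if column in self.indexes:
            return [self.rows[o] for o in self.indexes[column].get(value, [])]
        return [row for row in self.rows if row[column] == value]


block = IndexedBlock([
    {"ts": 1, "device": "a", "status": 200},
    {"ts": 2, "device": "b", "status": 500},
    {"ts": 3, "device": "a", "status": 500},
])
snapshot = block.rows
block.add_index("status")           # index a non-partition column
errors = block.find("status", 500)  # served from the index
block.drop_index("status")          # later queries fall back to scanning
```

The same query returns the same rows with or without the index; only the access path changes, which is what allows indexes to come and go as workloads shift.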
Varada has taken big data indexing to the next level. By building nano-block indexing deep into a query engine that runs directly on data lake solutions, Varada is able to deliver faster big data analytics than is possible with partitioning alone. Varada has also taken advantage of the flexibility inherent in nano-block indexing by dynamically and automatically adding and removing indexes in response to changing workloads. Varada has demonstrated that the future of big data analytics is in big data indexing.
See how Varada’s big data indexing dramatically accelerates queries vs. AWS Athena:
To see Varada in action on your data set, schedule a short demo!