The most important decision in order to get reasonable performance out of your analytics database is to choose how data is stored and accessed. There are two approaches, both specifically optimized for efficient, high performance analytics. Micro-partitioning uses the values of a single column to divide subsets of each table into blocks of a tens to hundreds of megabytes in size. In contrast, indexing expands beyond the single column and separates out how data is stored (typically in similar blocks to the micro-partition architecture) with how selected column values are accessed, allowing users to optimize access to any subset of columns. The system that you choose has a significant impact on cost and performance.
Micro-partitioning is an approach for creating relatively small blocks of data, usually a few tens or hundreds of MB. Those blocks are organized around some partitioning or clustering key. The goal of micro-partitioning based systems is to allow the query engine to only retrieve the data blocks that are required to answer a query based on the query predicate for the partitioning key, and to do so in parallel. While micro-partitioning is more efficient and faster than full table scans, this approach still requires reading extensive amounts of data from disk, even when compressed. Micro-partitioning doesn’t help with very large join-based predicates, where tables are restricted based on the output of another table that itself is large, unless the predicate happens to align with the partitioned column. When you have few columns and few or no joins, micro-partitions can be a simple, hassle-free approach to structuring your data.
Indexing is the tried and true method of creating separate files that can be used to quickly identify where to retrieve records from the data set. The query engine uses indexes and the query predicate to reduce the amount of data retrieved before reading all of the columns for each record. In some cases, indexes can be used to answer the entire query, a technique known as covered indexes. Indexes are more efficient at pruning data based on joins, for example when looking for transactions from different people at the same organization. In this case, the query engine may need to join two or three tables — indexes can make this join more efficient. Though indexes can be highly performant and cost efficient at query time, they typically require careful design considerations and most implementations are not optimized for loading data quickly. Furthermore, traditional indexes are most optimized for row-based data layouts and not for columnar layouts that are typically used with big data. With columnar data, you can’t index every column without exploding your storage and grinding your load times to halt. The key to big data indexing solutions is to have a dynamic, intelligent indexing system.
Varada uses an indexing mechanism that is uniquely optimized for high performance analytics called nanoblock indexing. Instead of storing one large index for each column that the user selects, Varada dynamically creates millions of nanoblocks – a few dozen kilobyte sized sub sections of the indexed column. Each nanoblock only stores an index for a subset of the data, and each nanoblock index is independent and uniquely adapted to that nanoblock. By taking advantage of modern storage systems, such as SSDs, nanoblocks are fast to load, update, and systems like Varada can quickly read and execute queries against nanoblock indexes without the overhead inherent in either traditional indexes or micro-partitions. Varada is also able to dynamically create, modify, and remove both indexes and table columns without the overhead of rewriting table data.
In both micro-partitions and nanoblock index architectures, data is usually stored in a columnar format and can be processed efficiently by the query engine. How much data needs to be read and filtered can make a huge difference in both query performance and cost efficiency. While micro-partitions exhibit faster load and grouping performance over traditional indexes, Varada’s nanoblock index implementation combined with columnar storage delivers both fast loads and faster queries than micro-partitions. Nanoblock indexes allow the query engine to read just the columns required to compute query predicates, and in some cases can answer queries entirely using the indexes. With nanoblock indexes, users are free to add and remove indexes and table columns dynamically. Micro-partitions must be optimized against a single partition key and contain all the columns, dramatically reducing both query speed and flexibility compared to using indexes. As the system is used and data keeps changing, micro-partitions may require heavy background maintenance such as defragmentation and rebalancing. The independent and columnar nature of nanoblock-based indexes removes the need for background heavy data reordering operations.
As an example, consider an analytics system for analyzing data from a ride sharing application. Users might need to track a couple dozen data points about each ride, as well as information about drivers and riders. Any queries that include constraints on drivers, riders, start locations, and end locations, will benefit from using one or more indexes. With a large number of columns, any query that returns or aggregates a subset of the columns outperforms micro-partitioning by better managing I/O, and reducing query cost. On average, testing shows a 40x increase in performance for these types of queries. In contrast, queries that can be optimized in advance, such as constraints or aggregates by date where data is micro-partitioned by data, ignoring indexes will perform better, often 4-10x.
The choice of a micro-partitioning or indexing based analytics engine can impact individual query performance and overall cost efficiency. In particular, as user load increases, the impact of hundred or thousands of queries a day can add up to significant I/O cost. Micro-partitions often result in larger overall I/O depending on the selectivity of your queries and the number of columns in your data set. While micro-partitions work well for skinny tables, the more columns, either in your main tables or in the tables you join, the higher the cost of using micro-partitions.
In order to decide between a micro-partition based solution or a nanoblock indexing-based query engine, take a look at your query logs and identify the number of joins and complexity of predicates. Compare the performance of the two using popular implementations such as Snowflake, which uses micro-partitions, and Varada, which uses nanoblock indexing. Varada has taken the approach of building intelligent indexing directly into the query engine and integrated it with the query optimizer and query statistics. Rather than requiring ongoing design, Varada creates and maintains indexes based on how users query your data, delivering critical observability capabilities. Varada also responds dynamically to changes in data and to workload prioritization set by administrators. Varada gives you the performance and cost efficiency of indexes with the ease of use of micro-partitions. Using Varada you don’t need to skip on wide columns or complex joins just to get the query times down and costs within budget.
See how Varada’s big data indexing dramatically accelerates queries vs. AWS Athena:
To see Varada in action on your data set, schedule a short demo!