In my previous post, I gave a brief history of big data evolution and reviewed the inevitable compromises one must make when choosing a big data tool.
In this article, I’m happy to unveil how we at Varada built a big data infrastructure platform that is not only fast and flexible but, in our benchmarks, up to one hundred times faster than every database we compared it with.
Varada designed a revolutionary data infrastructure that will support the requirements of the future.
Most modern big data solutions are based on the assumption that the way to achieve top performance is to maximize data locality. This assumption is based on the inherent limits of the underlying layers of storage and networking. By incorporating data locality as a design constraint, these systems are hard-wired to be optimal for only a certain class of queries (those that play nicely around the dimension on which the data is distributed and/or partitioned). By definition, this limits the flexibility of such systems.
Are these limitations still relevant?
SSD costs have been in steady decline since their introduction over two decades ago. It’s been possible to spin up databases on top of SSDs for several years now. But the improvement in performance is surprisingly underwhelming, and the reason is historical: database applications were designed to run on HDDs. Reading large blocks of data is one example of a low-level implementation choice that is optimal for an HDD. Just as NVMe was designed as a new device interface to fully exploit SSDs, database applications can also be redesigned to leverage an underlying SSD storage device.
In addition, network bandwidth has continuously been increasing, and it is now easy to spin up a cluster of machines designated to reside in close proximity. This ensures high-bandwidth, low-latency inter-instance traffic within the cluster. When you have such a cluster of machines, there’s no need to limit yourself to data locality. The advantages of being able to freely distribute workloads greatly outweigh the slight overhead of reading data from another machine.
Acknowledging the great advancements in storage and networking allowed us to take a unique and antithetical approach when designing our solution, enabling enterprises to leverage both speed and flexibility in how they consume big data.
The heart of our technology is an innovative approach we call Inline Indexing™. Like many big data infrastructure solutions, we index all columns during data ingestion. Here comes the big “but” – unlike any other solution, we store the data in nanoblocks™ of 8KB (compared to Redshift’s 1MB blocks, HDFS’s 128MB blocks, or Parquet’s blocks of tens to hundreds of MB); each nanoblock™ has its own set of independent indexes, and this allows us to load from disk only the minimal amount of data required.
This allows us to be extremely fast for any incoming query without the user being required to define a single index.
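To make the idea concrete, here is a minimal sketch of how per-nanoblock indexes let a scan skip disk reads. The `Nanoblock` class and its min/max index are illustrative assumptions for this example, not Varada’s actual on-disk format; the point is only that with many small, independently indexed blocks, a selective predicate touches very few of them.

```python
from dataclasses import dataclass, field

@dataclass
class Nanoblock:
    """A small unit of column data (8KB in Varada's design; simplified here)
    carrying its own lightweight min/max index."""
    values: list
    min_val: int = field(init=False)
    max_val: int = field(init=False)

    def __post_init__(self):
        self.min_val = min(self.values)
        self.max_val = max(self.values)

def scan(nanoblocks, lo, hi):
    """Return matching values and the number of blocks actually read."""
    hits, blocks_read = [], 0
    for nb in nanoblocks:
        if nb.max_val < lo or nb.min_val > hi:
            continue  # index says nothing relevant is here: skip the read
        blocks_read += 1
        hits.extend(v for v in nb.values if lo <= v <= hi)
    return hits, blocks_read

# A sorted column split into 100 small blocks: a selective range
# predicate ends up reading only a single block.
blocks = [Nanoblock(list(range(i, i + 100))) for i in range(0, 10_000, 100)]
rows, read = scan(blocks, 250, 260)
```

With coarse 1MB or 128MB blocks, the same predicate would drag in vastly more irrelevant data per read; fine-grained blocks shrink the I/O to nearly the result size.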
How fast? In our internal benchmarks against AWS Redshift, Athena and EMR, we return results 10x faster for selective and cohort-analysis queries and 100x faster for highly selective queries.
Jason Tavoularis, Product Manager at IBM Business Analytics software, explains how IBM leverages Varada to empower IBM’s analytics solutions for big data: “It’s a visually engaging, highly interactive dashboard where behind-the-scenes Varada is accessing a data store with over two billion records and responding to all the user’s analytical gestures within a few seconds.
Despite the large data volumes, the snappy response times on the Varada-powered Cognos dashboard ensure users won’t be impeded from cross-examining intersections of data points that may lead them to uncover patterns which allow them to make smarter decisions”.
Check out the full case study here.
From the start, we focused on developing an innovative and modern approach for storing data and accessing it. We knew we wanted to make this solution accessible via SQL but we didn’t want to reinvent the wheel.
We looked for a distributed SQL engine that would do the SQL heavy lifting and would allow us to concentrate on our innovation. Presto was a perfect match. It gives us an ANSI SQL client and an easy and native method for loading data from a plethora of data stores. Not only is Presto spectacular at distributing SQL queries, it has a rich and thriving community that continues to improve the product. Two years later we can safely say we made an excellent choice and our collaboration with the Presto community has been beneficial to all sides.
We redesigned data layout at the storage level, to take full advantage of the capabilities of SSD drives (NVMe, random access, parallel reading, small blocks, minimal overhead in data fragmentation).
Our architecture fully utilizes the cluster’s high network bandwidth: rather than adopting a shared-nothing architecture mandating data locality, we chose a shared-everything architecture. One node can handle requests pertaining to data stored on another node. This allows for a scalable, balanced distributed system free of bottlenecks, regardless of the dimensions of the incoming queries.
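The scheduling consequence of shared-everything can be sketched in a few lines. This is a toy least-loaded-first scheduler under the assumption that any worker can serve any split; the names (`Split`, `schedule`) are hypothetical and do not reflect Varada’s actual scheduler.

```python
import heapq
from collections import namedtuple

Split = namedtuple("Split", ["name", "size"])

def schedule(splits, workers):
    """Assign each split to the currently least-loaded worker.
    Because any node can read any data, placement is never
    constrained by where the data happens to live."""
    load = [(0, w) for w in workers]  # (pending work, worker)
    heapq.heapify(load)
    plan = {w: [] for w in workers}
    for s in splits:
        cost, w = heapq.heappop(load)
        plan[w].append(s.name)
        heapq.heappush(load, (cost + s.size, w))
    return plan

# Six equal-sized splits spread evenly over three nodes,
# regardless of which node stores the underlying data.
splits = [Split(f"s{i}", 10) for i in range(6)]
plan = schedule(splits, ["node-a", "node-b", "node-c"])
```

In a shared-nothing system the same splits would be pinned to the nodes holding their data, so a skewed partitioning scheme produces hot spots no scheduler can fix.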
Bottom line: this is nothing short of a game changer for big data analytics!
In analytics, the process is often iterative; you run a query, get results, and the results lead you to devise the next query. With most tools out there you’d have to wait: either for data engineers to prepare a new database indexed for your new query, or for the existing database to churn out the result.
With Varada you can run query after query after query and the system will always choose the optimal index or set of indices to answer each query.
No preparation or data remodeling necessary. No breaks for coffee while the system is handling a query it wasn’t built for. Just a smooth, iterative and interactive quest into the heaps of data.
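Since every column is already indexed, choosing the right index per query reduces to a planning decision. The cost model below is a deliberately simplified assumption (pick the candidate index with the lowest estimated selectivity); it illustrates the idea of automatic per-query index selection, not Varada’s actual planner.

```python
def pick_index(indexes, predicate_cols):
    """Pick the candidate index expected to prune the most data.
    Hypothetical cost model: lowest estimated selectivity wins,
    where selectivity is the fraction of rows passing the filter."""
    candidates = [ix for ix in indexes if ix["column"] in predicate_cols]
    return min(candidates, key=lambda ix: ix["selectivity"], default=None)

# All columns are indexed at ingestion; the planner simply chooses
# per query, so a brand-new query pattern needs no preparation.
indexes = [
    {"column": "country", "selectivity": 0.30},
    {"column": "user_id", "selectivity": 0.001},
]
best = pick_index(indexes, {"user_id", "country"})
```

A query filtering on both columns would be driven by the `user_id` index here, since it eliminates the largest share of rows before any data is read.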
So, do you still think compromise is inevitable?