Cloud BI Can Fly on the Data Lake

David Krakov
July 11, 2019

Varada is great for BI on the cloud

Varada is a platform for fast data analytics on cloud data lakes, which is a long way of saying that we love BI on the cloud and help make it work.

The more data there is on a cloud data lake, the harder interactive BI on that data becomes. Your data might be on S3 because of its scale or structure, and queries simply run through a back-end tool such as Athena or Presto. But there are a few fundamental reasons why connecting your favorite BI tool directly to the lake is challenging.

  1. Volume: the data no longer fits in the memory of the BI tool. Even when it does, rebuilding the in-memory copy to keep the data fresh is a heavy operation. So a live connection is required to access the breadth of the data.
  2. Format: the data on the lake is not always in a format that BI tools can handle. Most BI tools need tables and relations. Data must be transformed to be usable, usually by an ETL job that creates a copy.
  3. Model: efficient analysis requires the data model to match the BI needs. Partitioning, sorting, bucketing, denormalization and other data tuning techniques are needed, and they mean running and managing various transformation jobs. Doing this effectively takes both a deep understanding of the BI needs and a high level of data expertise.
  4. Dimensionality: many data sources can be used for effective analytics. The sheer breadth and variety of the data make it hard to create an effective model. The best insights can come from correlating different sources of data, and almost no tuning can capture this in advance.
  5. Concurrency: in a cloud environment, many users might interact with data coming from a few centralized sources like a lake. Workload management is a hard problem. It can become even harder if the data is copied from the lake to a traditional data warehouse.
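To make the modeling point concrete, here is a minimal sketch in plain Python of why a layout tuned for one question does not help with another. It is a toy illustration only, not how Athena, Presto or Varada work internally; the rows, the column names and the helper functions `partition_by` and `total_amount` are all made up for this example.

```python
from collections import defaultdict

# Hypothetical event rows; the column names are illustrative only.
rows = [
    {"day": "2019-07-09", "country": "US", "amount": 10},
    {"day": "2019-07-09", "country": "DE", "amount": 7},
    {"day": "2019-07-10", "country": "US", "amount": 3},
    {"day": "2019-07-10", "country": "FR", "amount": 5},
    {"day": "2019-07-11", "country": "US", "amount": 8},
    {"day": "2019-07-11", "country": "DE", "amount": 2},
]


def partition_by(rows, key):
    """Group rows by one column, mimicking a partitioned layout on the lake."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[key]].append(row)
    return dict(parts)


def total_amount(parts, partition_value=None, **filters):
    """Sum `amount` over matching rows and return (total, rows_scanned)."""
    if partition_value is not None:
        # The query filters on the partition column: read one partition only.
        candidates = parts.get(partition_value, [])
    else:
        # No filter on the partition column: every partition has to be read.
        candidates = [row for part in parts.values() for row in part]
    total = sum(
        row["amount"]
        for row in candidates
        if all(row[col] == value for col, value in filters.items())
    )
    return total, len(candidates)


by_day = partition_by(rows, "day")

# Question the layout was tuned for: totals for a single day.
day_total, day_scanned = total_amount(by_day, partition_value="2019-07-11")

# Question nobody planned for: totals for a single country.
us_total, us_scanned = total_amount(by_day, country="US")

print(day_total, day_scanned)
print(us_total, us_scanned)
```

The per-day question touches only the rows of one partition, while the per-country question has to scan everything. Repartitioning by country would fix the second query and break the first, which is the kind of trade-off that makes tuning in advance so hard.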

We at Varada built a platform that tackles the core of the problem. We eliminate most of the data modeling, copy management, complex data pipelines, and workload management for BI on a lake.