Presto Carries Your Data Silo Baggage

By Roman Vainbrand
April 5, 2020

A data silo is any data store that is under the control of a single department and is isolated from the rest of the organization. Data silos are a construct of the past; their existence in today’s world is more often than not a testimony to the hardship of migrating away from a working solution.

Data silos often originated naturally; up until a decade ago it was a pretty standard strategy – if an application, service or tool needed a database it was perfectly reasonable to provide it with its own designated database. Remember, the microservice craze started before the data-driven philosophy that is now widely practiced. In other cases it may have resulted from an organizational structure of the company that practiced department autonomy in IT decisions.

Data Silos Woes

Data silos are a problem for several reasons:

  • Data isolated from the rest of your organization is an issue in itself. Today’s data-driven approach calls for data democratization – making as much data as possible accessible to any part of the organization that can and should use it. Data is valuable only if you can make use of it.
  • Data duplication is troublesome for a couple of reasons. First of all, you’re paying twice to store the same data. The more duplication and the larger the data, the higher this redundant cost is. The bigger problem is often inconsistencies – data tends to evolve over time and you can find yourselves with two separate departments maintaining almost duplicate data; which is the correct one?

All your Data at your Fingertips

A consolidated store of all your data brings several rewards:

  • Gain a comprehensive view of all your data
  • A single point of access to all your data
  • The ability to cross-reference data from different parts of your organization. For example, cross-referencing customer retention (R&D’s application data) with quality of support (HR’s personnel headcount).

Data Warehouse, Let’s Go!

Now you’re convinced and you start looking into migrating everything to a single data warehouse. Easier said than done! First of all, your applications are running in production. These various data silos are servicing critical aspects of your company. You can’t just stop them and renovate. In some cases there is severe technical and cultural pushback; e.g. a certain department’s daily work routine may be strongly tied to an application that needs its own SQL database.

And perhaps you don’t want to move some of the data silos. For security or privacy reasons you may need to stick with the solution in place. Or you find that the data silo is a NoSQL database and you don’t have anyone with enough technical know-how to migrate its data. The reasons NOT to move all data to a data warehouse can be technical, cultural, historical or by design. But you still want to access all data at once, don’t you?

Presto for the Win

Presto is a distributed SQL engine built for analytics, and it can be used to virtually consolidate all your data. One of Presto’s top features is its ability to connect to almost any data store and access its data (learn more here). Unlike solutions that depend on ETL, Presto doesn’t require you to pre-load the data from its sources. It has connectors for many common SQL and NoSQL databases and can read directly from HDFS and S3.

This leads to several advantages for using Presto over an ETL process:

  • It saves you the overhead of authoring and maintaining various ETLs. Not only do you save time and money NOW, but you’re also more versatile for future changes: it is much easier to respond to changing business requirements.
  • The data is as fresh as possible, because it is read from the source at query time.
  • All the data is accessible in its most granular form. Most ETLs will result in data granularity loss, leaving only the “important” parts of the data. Often this results in the loss of the ability to drill down.

With Presto you can leave the data silos where they are, connect Presto to all of them along with your data lake, and run queries on ALL your data. You can run cross-reference queries in plain SQL (even if some of the data silos are NoSQL or hold unstructured data). This can serve you as an alternative to migrating to a data warehouse, or as an interim solution while the migration is taking place.
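
As a rough illustration of what a cross-reference query in plain SQL looks like, the sketch below uses Python’s built-in sqlite3 module with two attached in-memory databases standing in for two silos. This is only an analogy, not Presto itself: in Presto the two sides would be separate catalogs, with tables addressed as catalog.schema.table. The database names, table names, columns and rows are all hypothetical.

```python
import sqlite3

# Two attached in-memory databases stand in for two separate data silos.
# ASSUMPTION: "app" and "support", their tables and their rows are made up
# for illustration; in Presto these would be two catalogs, not SQLite files.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS app")
conn.execute("ATTACH DATABASE ':memory:' AS support")

conn.execute("CREATE TABLE app.customers (customer_id INTEGER, retained INTEGER)")
conn.execute("CREATE TABLE support.tickets (customer_id INTEGER, satisfaction INTEGER)")
conn.executemany(
    "INSERT INTO app.customers VALUES (?, ?)",
    [(1, 1), (2, 0), (3, 1)],
)
conn.executemany(
    "INSERT INTO support.tickets VALUES (?, ?)",
    [(1, 5), (1, 4), (2, 1), (3, 5)],
)

# One SQL statement joins data that lives in two different stores:
# customer retention on one side, support satisfaction on the other.
CROSS_SILO_QUERY = """
SELECT c.retained,
       AVG(t.satisfaction) AS avg_satisfaction,
       COUNT(*)            AS tickets
FROM app.customers AS c
JOIN support.tickets AS t ON t.customer_id = c.customer_id
GROUP BY c.retained
ORDER BY c.retained
"""

rows = conn.execute(CROSS_SILO_QUERY).fetchall()
for retained, avg_satisfaction, tickets in rows:
    print(retained, avg_satisfaction, tickets)
```

The point of the sketch is the shape of the query: neither store was copied into the other beforehand, and the join is expressed once, in SQL, against both.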

Wish it were Faster? Indexing is the Solution

It’s very easy to deploy a Presto cluster. Setting up the connections is straightforward and in no time you can start querying your data silos and data lake.

If you attempt to use Presto to query large amounts of data from your data lake, you might find yourself waiting too long for comfort. This stems from Presto’s multi-data-source architecture: the downside of supporting many different data vendors is that the API between Presto and the Connectors must conform to a common denominator.

Presto supports the concept of predicate pushdown. This means that part of the query’s predicate logic (i.e. filtering) is passed to the Connector so that it loads only the relevant data. Due to the API limitations, most Connectors will still load more data than is actually required, and Presto will filter out the excess when it processes the data. Some Connectors do not support predicate pushdown at all and always load all the data; others have limited support that requires indices to be defined and configured manually by the user. Even with a “smart” connector (like the Hive connector), the data itself is often unstructured and may be partitioned on one dimension (e.g. date), whereas the query predicates can span multiple dimensions. So in most cases you’ll load more data than necessary.
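
The effect of predicate pushdown can be sketched with a toy model. The Python below is not Presto’s connector API; the `scan` and `query` functions, the `Row` type and the data are all hypothetical. It assumes data partitioned by date and a query that filters on both date and country, so only the date part of the predicate can be used to prune what gets loaded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    date: str
    country: str
    amount: int


# Toy data "partitioned" by date. ASSUMPTION: all values are made up.
PARTITIONS = {
    "2020-04-01": [Row("2020-04-01", "US", 10), Row("2020-04-01", "DE", 20)],
    "2020-04-02": [Row("2020-04-02", "US", 30), Row("2020-04-02", "FR", 40)],
    "2020-04-03": [Row("2020-04-03", "DE", 50), Row("2020-04-03", "US", 60)],
}


def scan(pushed_date=None):
    """Toy connector: prunes partitions only if a date predicate is pushed down."""
    if pushed_date is None:
        # No pushdown: every partition is loaded.
        return [row for rows in PARTITIONS.values() for row in rows]
    return list(PARTITIONS.get(pushed_date, []))


def query(date, country, pushdown):
    """Toy engine: hands the date predicate to the connector when allowed,
    then applies the full predicate itself to whatever was loaded."""
    loaded = scan(date if pushdown else None)
    result = [r for r in loaded if r.date == date and r.country == country]
    return result, len(loaded)


no_pushdown_result, no_pushdown_loaded = query("2020-04-02", "US", pushdown=False)
pushdown_result, pushdown_loaded = query("2020-04-02", "US", pushdown=True)

# Same answer either way; the difference is how many rows had to be loaded.
print(no_pushdown_loaded, pushdown_loaded, len(pushdown_result))
```

Even with pushdown, the toy connector loads the whole date partition although only one of its rows matches on country, which mirrors the case described above where the data is partitioned on one dimension and the predicates span several.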

That’s where Varada can help you out.

Varada offers a big data infrastructure solution for fast analytics on thousands of dimensions.

Based on Varada’s Inline Indexing technology, we enable large and complex datasets to optimally serve analytics users and apps by making filtering, joining and aggregating data extremely fast on every dimension of any data source.

Varada embeds Presto and that means that in addition to the speed Varada gives you, you get to enjoy Presto’s full bag of goodies.

Varada offers a free and easy-to-use tool that provides deep insights into how your workloads perform on Presto, so you can easily optimize resources and improve performance. To learn more, click here.

Furthermore, we enriched Presto’s pushdown capabilities to allow more information to pass from Presto to the Connectors, and with our Inline Indexing™ technology all dimensions are indexed automatically. Varada allows you to zero in on the “hot” data – the data you access most frequently – and create materialized views for it (all in plain SQL). You can create materialized views for the data in your data lake but also for your data silos. In most cases the silos won’t be a bottleneck, but if any are, you can use Varada to boost access to their data as well.

Using Varada, your queries will fly faster than ever before. Now you can smoothly analyze and cross-reference all your data, whether it’s in your archaic data silos or deep in your data lake.

Ready to see Varada in action? Click here to schedule a short demo.
