Tips to Weaponize the Data Lake to Use Near-Perfect Data

By Roman Vainbrand
April 25, 2021

Forbes estimates that every human on the planet will create 1.7 megabytes of information each second. In only one year, the accumulated world data will grow to 44 zettabytes (that’s 44 trillion gigabytes). This means executives want real-time data lake analytics that manipulate current data from business systems. For example, a dashboard should mash up an enterprise’s “big data” with inventory, logistics, production, and other operational data to show every basic aspect of the business, and use automation to act on that data instantly.

We call this “near-perfect data,” since no system can be 100% optimized.

Success starts with the ability to easily manipulate big data, which makes selecting the right technology critical. Today’s data lakes and data warehouses can both store big data. The terms are often used interchangeably, but they are vastly different concepts and technologies. A data warehouse is a repository for structured data, pre-defined for a specific purpose. Most traditional data warehouses do not provide access to real-time data, and the data they do provide is weeks, sometimes months, old.

A data lake is a vast pool of structured and unstructured data, with its purpose defined at the time of use. Many organizations now want to move to data lakes, both for the strategic use of data and for the ability to define tactical success around emerging and existing business problems.

As more organizations begin to transition from data warehouses to data lakes, most barely understand the processes and the enabling technology needed to make the transition work the first time. They also have new business issues to consider, such as the cost and risk of switching to data lake-based analytics, and the most strategic use of all types of data generated by enterprise systems.

A newer development is to leverage the data where it exists, an approach that requires neither transformation nor redundant copies of the data. This makes the transition even more compelling: because no new data storage systems need to be stood up, risk is minimized and overall costs are reduced.

Data Lakes Enable Agile Analytics on Vast and Complex Datasets

The biggest advantage of data lakes is flexibility. Allowing the data to remain in its native, granular format means it does not need to be transformed in flight or at target storage. The result is an up-to-date stream of data that is available for analysis at any time, for any business purpose, which lets us handle vast, complex datasets. This is unlike traditional data warehouses, where the data is not current, the structure is fixed, and data must be transformed and translated before use.

If agility is indeed the objective, then how do we make data lakes more agile? Solutions can be found in emerging approaches to leveraging the physical data where it exists. Data virtualization lets us connect to data where it is without moving it, and thus becomes the best chance for weaponizing strategic data, since these types of data lakes can be built with minimal effort and disruption.
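To illustrate the data virtualization idea, here is a minimal Python sketch (the source names and records are hypothetical, not any particular product’s API): a “virtual view” merges rows from several sources at query time, without copying them into a new store.

```python
def crm_source():
    # Imagine this streaming rows from an on-premises CRM database.
    yield {"customer": "Acme", "region": "EMEA"}
    yield {"customer": "Globex", "region": "APAC"}

def cloud_sales_source():
    # Imagine this streaming rows from a cloud object store.
    yield {"customer": "Acme", "revenue": 120}
    yield {"customer": "Globex", "revenue": 95}

def virtual_view(*sources):
    """Merge rows by customer key at query time; nothing is copied to a new store."""
    merged = {}
    for source in sources:
        for row in source():
            merged.setdefault(row["customer"], {}).update(row)
    return list(merged.values())

rows = virtual_view(crm_source, cloud_sales_source)
```

Real data virtualization layers do this at far greater scale, but the principle is the same: the join happens at query time, against the sources in place.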


The Data Lake “Reset”

Consider the growth of data versus the value of that data over the last 33 months: we see a troubling trend. (Source: David Linthicum)

Once enterprise data has accumulated for about 2.75 years, a disconnect emerges between the continued growth of data and the value enterprises can obtain from that data. This will not reset itself, and thus becomes internal IT’s mandate: learn to leverage data around more flexible frameworks and data abstraction technology.

Near-perfect data means we’re getting close to having access to all data, at all times, for any business purpose. It’s time to move from 10% access to the correct data, as is now typical, to an access metric of 96% to 99%. Because most enterprises are cost- and risk-averse, the fact that we can now leverage data lake technology without a net-new physical or virtual storage platform is critical to the “big reset.”

This new paradigm will also remove current impediments that restrict access to data, allow a system to leverage current data, and mix and match data for use in core business applications and processes. For example, the ability to mash up external weather data with sales data could provide better sales predictions.
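A toy Python example of that weather/sales mashup (all numbers and dates are made up) shows the kind of join-and-summarize that becomes possible once both sources are accessible:

```python
weather = {"2021-04-01": "rain", "2021-04-02": "sun"}  # external feed (made up)
sales = [                                              # internal sales data (made up)
    {"date": "2021-04-01", "units": 40},
    {"date": "2021-04-02", "units": 70},
]

# Join the two sources on date, then average units sold per weather condition.
enriched = [dict(row, weather=weather[row["date"]]) for row in sales]

avg_units = {}
for row in enriched:
    avg_units.setdefault(row["weather"], []).append(row["units"])
avg_units = {w: sum(u) / len(u) for w, u in avg_units.items()}
```

A sales-prediction model would consume `avg_units`-style features; the point here is simply that the mashup is a cheap join once the data is reachable.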

So, how do you leverage a data lake and data virtualization to provide access to near-perfect data? Below we’ve defined a stepwise process you can employ:

  1. Define the use of business data. This is also called “logical data usage.” Define what the data means and how it’s likely to be leveraged. This is typically decoupled from the data sources, since we will remove structure that limits how the data can be used.
  2. Define the data sources. Look for the sources of the data. Typically, they are physical sources, on premises or in the cloud.
  3. Define the use cases. How will the data be used by humans and business processes?
  4. Define the application usage. How will the data be used by business applications, such as ERP, accounting, logistics, etc.?
  5. Select technology that will lead you to a data lake solution. This will include several types of technology, including the data lake itself, data virtualization, and data access software.
  6. Deploy the data lake. This covers testing and deployment to ensure it lives up to the requirements defined above, including the use cases.
  7. Define data ops. How will you operate the data lake moving forward? This incorporates items such as playbooks, security operations, governance operations, and the use of automated management and monitoring systems such as AIOps.
  8. Continuous improvement. Create a communications channel that provides feedback on all the above, including how to improve the data lake, plus the ability to track configuration changes.

The Next Step: Shift Analytics to the Data Lake

Let’s consider what to look for in query engines or data lake analytics platforms. Traditional platforms that run on the data lake support the agility and flexibility of running any query on any data source. The downside is that they rely on brute-force query processing, culling through all of the data to return the result sets needed for application responses or analytics. This consumes too many resources, which, on pay-per-use services such as public clouds, can run up unexpected costs. The result is that SLAs are not sufficient to support use cases beyond ad hoc analytics, which drives data teams to revert to traditional data warehouses.

However, there are new technologies that leverage indexing to avoid the disadvantages of the brute-force approach, enabling data lake analytics to not only support agility, but also deliver near-perfect data at performance and cost comparable to a traditional data warehouse.
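To make the brute-force-versus-indexing distinction concrete, here is a generic Python sketch (not any particular vendor’s engine, and the records are invented): a full scan touches every record to answer a query, while a pre-built index reads only the matching rows.

```python
# 1,000 synthetic records split across two regions.
records = [{"id": i, "region": "EMEA" if i % 2 else "APAC"} for i in range(1000)]

def full_scan(region):
    # Brute force: every record is touched to answer the query.
    return [r for r in records if r["region"] == region]

# Build an index once, up front; queries then read only matching rows.
index = {}
for r in records:
    index.setdefault(r["region"], []).append(r)

def indexed_lookup(region):
    # Work done at query time is proportional to the result set, not the lake.
    return index.get(region, [])
```

Both functions return the same rows; the difference is that the scan’s cost grows with the size of the lake, while the index lookup’s cost grows only with the size of the answer, which is what keeps pay-per-use bills predictable.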


About David Linthicum

David Linthicum was named one of the top 9 Cloud Pioneers in Information Week 7 years ago but started his cloud journey back in 1999 when he envisioned leveraging IT services over the open internet.

Dave was named the #1 cloud influencer via a major report by Apollo Research, and is typically listed as a top 10 cloud influencer, podcaster, and blogger.
