Forbes estimates that every human on the planet will create 1.7 megabytes of information each second! In only one year, the accumulated world data will grow to 44 zettabytes (that’s 44 trillion gigabytes). With that much data in play, executives want real-time data lake analytics that work with current data from business systems. For example, a dashboard should mash up an enterprise’s “big data” with inventory, logistics, production, and other operational data to show all basic aspects of the business, and use automation to act on that data instantly.
Success starts with the ability to easily manipulate big data, and selecting the right technology is critical. Today’s data lakes and data warehouses can both store big data. The terms are often used interchangeably, but they are vastly different concepts and technologies. A data warehouse is a repository for structured data, pre-defined for a specific purpose. Most traditional data warehouses do not provide access to real-time data, and the data they do provide is weeks, sometimes months, old.
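To make that contrast concrete, here is a minimal sketch of the warehouse pattern in Python: because the structure is fixed up front (schema-on-write), source records must be transformed before they can be loaded. The table, columns, and records are illustrative assumptions, not taken from any particular system.

```python
# A minimal sketch of the warehouse pattern: the schema is fixed up front
# (schema-on-write), so raw records must be transformed before loading.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales_fact (
        order_id   INTEGER PRIMARY KEY,
        order_date TEXT NOT NULL,      -- structure decided before any data arrives
        region     TEXT NOT NULL,
        amount_usd REAL NOT NULL
    )
""")

# Raw source records must be reshaped to fit the predefined schema first.
raw_records = [
    {"id": 1, "ts": "2023-01-05T10:00:00", "geo": {"region": "EMEA"}, "total": "129.99"},
    {"id": 2, "ts": "2023-01-06T14:30:00", "geo": {"region": "APAC"}, "total": "84.50"},
]
transformed = [
    (r["id"], r["ts"][:10], r["geo"]["region"], float(r["total"]))
    for r in raw_records
]
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", transformed)

print(conn.execute(
    "SELECT region, SUM(amount_usd) FROM sales_fact GROUP BY region"
).fetchall())
```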
A data lake is a vast pool of structured and unstructured data, with its purpose defined at the time of use. Many organizations now want to move to data lakes, both for the strategic use of data and for the ability to define tactical success around emerging and existing business problems.
As more organizations begin to transition from data warehouses to data lakes, most barely understand the processes and the enabling technology needed to make the transition work the first time. They also have new business issues to consider, such as the cost and risk of switching to data lake-based analytics, and the most strategic use of all the types of data generated by enterprise systems.
A newer development is to leverage the data where it exists, an approach that does not require transformation or redundant copies of the data. This makes the transition even more compelling: no new data storage systems need to be stood up, so both the overall cost and the risk are reduced.
The biggest advantage of data lakes is flexibility. Allowing the data to remain in its native, granular format means it does not need to be transformed in flight or at the target storage. The result is an up-to-date stream of data that is available for analysis at any time, for any business purpose, and that lets us handle vast, complex datasets. This is unlike traditional data warehouses, where the data is not current, the structure is fixed, and data must be transformed and translated before it can be used.
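By contrast, a data lake applies structure only when a question is asked (schema-on-read). The sketch below, with made-up file contents and field names, shows raw JSON events queried in place for one purpose today; the same untouched file could answer a completely different question tomorrow.

```python
# A minimal schema-on-read sketch: raw events stay in their native, granular
# JSON form, and structure is applied only at query time, for the question
# being asked. The file contents and fields are illustrative assumptions.
import json
import io

# Stand-in for a raw file landed in a data lake (one JSON document per line).
raw_lake_file = io.StringIO(
    '{"ts": "2023-01-05T10:00:00", "event": "purchase", "sku": "A-100", "total": 129.99}\n'
    '{"ts": "2023-01-05T10:02:11", "event": "page_view", "path": "/pricing"}\n'
    '{"ts": "2023-01-06T14:30:00", "event": "purchase", "sku": "B-200", "total": 84.50}\n'
)

# Purpose defined at time of use: today we care about purchase revenue by day.
revenue_by_day = {}
for line in raw_lake_file:
    event = json.loads(line)
    if event.get("event") == "purchase":
        day = event["ts"][:10]
        revenue_by_day[day] = revenue_by_day.get(day, 0.0) + event["total"]

print(revenue_by_day)  # {'2023-01-05': 129.99, '2023-01-06': 84.5}
```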
If agility is indeed the objective, then how do we make data lakes more agile? Solutions can be found in emerging approaches that leverage the physical data where it exists. Data virtualization lets us connect to data where it is, without moving it, and is thus the best chance for weaponizing strategic data, given that these types of data lakes can be built with minimal effort and disruption.
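Here is a rough illustration of the data virtualization idea, under the assumption of two hypothetical sources: an operational database and a file in the lake are each queried where they live, and only the results are joined in a thin virtual layer. A real deployment would use a virtualization or federation engine rather than pandas, but the shape of the work is the same.

```python
# A rough sketch of data virtualization: query two systems where they live
# and join the results in a virtual layer, instead of copying both into a
# new store. Sources, tables, and columns here are hypothetical.
import sqlite3
import pandas as pd

# Source 1: an operational database (simulated with in-memory SQLite).
orders_db = sqlite3.connect(":memory:")
orders_db.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)")
orders_db.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                      [(1, 101, 250.0), (2, 102, 99.0), (3, 101, 410.0)])

# Source 2: a file sitting in the data lake (simulated as a DataFrame).
customers = pd.DataFrame({"customer_id": [101, 102], "segment": ["enterprise", "smb"]})

# The "virtual" query: pull only what is needed from each source and join.
orders = pd.read_sql_query("SELECT customer_id, amount FROM orders", orders_db)
report = orders.merge(customers, on="customer_id").groupby("segment")["amount"].sum()
print(report)
```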
Once enterprise data has accumulated for about 2.75 years, a disconnect appears between the continued growth of that data and the value enterprises can obtain from it. This will not reset itself, and so it becomes internal IT’s mandate: learn to leverage data through more flexible frameworks and data abstraction technology.
Near-perfect data means we’re getting close to having access to all data, at all times, for any business purpose. It’s time to move from the 10% access to the correct data that is typical today to an access metric of 96% to 99%. Because most enterprises are cost- and risk-averse, the fact that we can now leverage data lake technology without a net-new physical or virtual storage platform is critical to the “big reset.”
This new paradigm will also remove current impediments that restrict access to data, allowing systems to leverage current data and to mix and match data for use in core business applications and processes. For example, the ability to mash up external weather data with sales data could provide better sales predictions.
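As a toy version of that weather-and-sales mash-up, the sketch below joins made-up external weather observations to internal sales by date and computes a simple correlation as a stand-in for a real prediction model.

```python
# A small illustration of the weather/sales mash-up: external weather
# observations joined to internal sales by date, then a simple correlation
# as a stand-in for a prediction model. All values are invented.
import pandas as pd

sales = pd.DataFrame({
    "date": pd.to_datetime(["2023-07-01", "2023-07-02", "2023-07-03", "2023-07-04"]),
    "units_sold": [120, 95, 180, 160],
})
weather = pd.DataFrame({
    "date": pd.to_datetime(["2023-07-01", "2023-07-02", "2023-07-03", "2023-07-04"]),
    "high_temp_f": [88, 75, 97, 93],
})

combined = sales.merge(weather, on="date")
print(combined)
print("temp/sales correlation:", combined["units_sold"].corr(combined["high_temp_f"]))
```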
So, how do you leverage a data lake and data virtualization to provide access to near-perfect data? Below we’ve defined a stepwise process you can employ:
Let’s consider what to look for in query engines or data lake analytics platforms. Traditional platforms that run on the data lake support the agility and flexibility of running any query on any data source. The downside is that they rely on brute-force query processing, culling through all of the data to return the result sets needed for application responses or analytics. This consumes too many resources, which, on pay-per-use services such as public clouds, can run up unexpected costs. The result is that SLAs are not sufficient to support use cases beyond ad hoc analytics, which drives data teams to revert to traditional data warehouses. However, there are new technologies that leverage indexing to avoid the disadvantages of the brute-force approach, enabling data lake analytics to not only support agility but also deliver near-perfect data at performance and cost comparable to a traditional data warehouse.
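To show why indexing changes the economics, here is a toy sketch of file-level pruning: a small min/max index per file lets a query skip files that cannot contain matching rows, instead of brute-force scanning everything. Real engines maintain far richer indexes; the file layout and dates here are invented for illustration.

```python
# Toy sketch of index-based pruning versus brute-force scanning: a min/max
# index per file lets the query skip files that cannot match the filter.
from dataclasses import dataclass

@dataclass
class LakeFile:
    name: str
    rows: list          # each row: (order_date, amount)
    min_date: str
    max_date: str

def build_index(files):
    # One small index entry per file: the min/max of the filter column.
    return {f.name: (f.min_date, f.max_date) for f in files}

def query_with_index(files, index, start, end):
    scanned, hits = 0, []
    for f in files:
        lo, hi = index[f.name]
        if hi < start or lo > end:      # file cannot match; skip it entirely
            continue
        scanned += 1
        hits += [r for r in f.rows if start <= r[0] <= end]
    return hits, scanned

files = [
    LakeFile("part-0", [("2023-01-03", 10.0), ("2023-02-20", 5.0)], "2023-01-03", "2023-02-20"),
    LakeFile("part-1", [("2023-05-11", 7.5), ("2023-06-02", 3.0)], "2023-05-11", "2023-06-02"),
    LakeFile("part-2", [("2023-09-14", 12.0)], "2023-09-14", "2023-09-14"),
]
index = build_index(files)
rows, files_scanned = query_with_index(files, index, "2023-05-01", "2023-06-30")
print(rows, "files scanned:", files_scanned)  # only 1 of 3 files is read
```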
David Linthicum was named one of the top 9 Cloud Pioneers by Information Week seven years ago, but he started his cloud journey back in 1999, when he envisioned leveraging IT services over the open internet.
Dave was named the #1 cloud influencer in a major report by Apollo Research, and is typically listed as a top 10 cloud influencer, podcaster, and blogger.