Varada has enhanced its platform by adding text analytics capabilities. TFiR's Swapnil Bhartiya sat down with Ori Reshef, Vice President of Products at Varada, to understand what exactly 'text analytics' is, how it helps data teams and data scientists, and then went deeper into the tech components (which include the open-source Apache Lucene) of this new capability.
If you ask 10 people, you will probably get 10 different answers to ‘what is text analytics’. Ori Reshef, Varada’s VP of Products, argues that text analytics includes two different categories:
The first category is extracting new information and insights from text. Natural Language Processing (NLP) has been around since the 1950s and focuses on automatically understanding text. It includes techniques such as stemming, text summarization, and more.
The second category is search on text. Many analytics use cases require this functionality, for example identifying a specific section of a gene, log or event analytics, and many more.
Varada recently added the Apache Lucene open-source library to our extensive indexing capabilities to enable advanced text search. Coupling Varada's unique data lake indexing technology with Lucene text indexing and search enables data teams and data scientists to run text search at petabyte scale directly on the data lake. Use cases include, for example, folder analysis, often used in marketing analytics. A URL essentially includes a series of folders that describe a relevant product or service. Text search capabilities enable analysts to identify how often specific categories (i.e. folders) are being consumed by customers, for example 'men's shoes' for an online retailer. Text search functions such as "Contain" and "Like" are the basis of this analysis.
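To make the folder-analysis idea concrete, here is a minimal sketch (illustrative only, not Varada's implementation): extract the folder segments from each URL in a clickstream and count how often each category appears. The URLs and category names are made up for the example.

```python
from collections import Counter
from urllib.parse import urlparse

def url_folders(url):
    """Split a URL's path into its folder segments."""
    return [seg for seg in urlparse(url).path.split("/") if seg]

# Hypothetical clickstream sample for an online retailer.
urls = [
    "https://shop.example.com/mens-shoes/running/item-123",
    "https://shop.example.com/mens-shoes/boots/item-456",
    "https://shop.example.com/womens-bags/item-789",
]

# Count how often each folder (category) is consumed by customers.
category_hits = Counter(folder for url in urls for folder in url_folders(url))
print(category_hits["mens-shoes"])  # → 2
```

At data lake scale the same question is expressed as a "Contain"/"Like" text predicate over billions of URL rows, which is exactly where text indexing pays off.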
When it comes to massive amounts of data, this kind of analysis becomes a huge challenge for data consumers, in this case marketing analysts. Another example is cyber threat detection; anomaly detection is very similar in that sense, running specific text searches across massive amounts of logs and events.
As use cases have evolved, and as organizations have been actively collecting billions (and more) of rows they want to analyze, text analytics has also evolved and required a fresh approach. Text search now requires a technology that can run on billions of rows directly on the data lake. To enable these search capabilities, organizations often overpay for advanced text analytics solutions that focus more on extracting insights from text than on enabling agile text searches. These heavy platforms are expensive not just in TCO, but also in time-to-market and maintenance, which slows down the pace of innovation for data-driven organizations.
Size matters! Text search requires a very specific solution that can easily be served on the data lake, at massive scale, without the need to move data to heavy text-optimized platforms.
Varada's innovative approach to big data and data lake analytics is all about cutting the data into tiny pieces, called nanoblocks, on SSD. Each nanoblock contains approximately 60,000 rows of a specific column. Varada uses this prism to continuously analyze data on the data lake. According to query behavior and business requirements, Varada's query acceleration engine autonomously indexes or caches the relevant data. Each nanoblock the platform indexes is assigned the optimal index, including Lucene of course, according to its structure and type. Cardinality, the number of different values that appear in each dataset, is critical here. By assigning indexes at the nanoblock level, the cardinality challenge is dramatically reduced.
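The per-nanoblock index choice can be sketched as a toy heuristic (the block size comes from the description above; the index names and cardinality thresholds are invented for illustration and are not Varada's actual logic):

```python
NANOBLOCK_ROWS = 60_000  # approximate rows per nanoblock, per the description above

def choose_index(values):
    """Pick an index type for one nanoblock from its local cardinality.

    The thresholds and index names here are illustrative only.
    """
    cardinality = len(set(values))
    if all(isinstance(v, str) for v in values) and cardinality > len(values) // 2:
        return "lucene"   # mostly-distinct text: use a text index
    if cardinality <= 100:
        return "bitmap"   # few distinct values: a bitmap-style index fits
    return "tree"         # everything else

def nanoblocks(column):
    """Cut a column into nanoblock-sized chunks."""
    for i in range(0, len(column), NANOBLOCK_ROWS):
        yield column[i:i + NANOBLOCK_ROWS]

# A column whose cardinality varies across the data still gets a good
# index choice per block, because each block is indexed independently.
column = ["error"] * 70_000 + [f"user-{i} logged in" for i in range(70_000)]
result = [choose_index(block) for block in nanoblocks(column)]
print(result)  # → ['bitmap', 'lucene', 'lucene']
```

The point of the sketch is the last line: the low-cardinality stretch of the column and the free-text stretch end up with different indexes, which is the "cardinality challenge" the nanoblock granularity addresses.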
By applying deep workload observability, Varada can not only adaptively choose the optimal index, but also dynamically and autonomously decide which datasets to accelerate out of the entire data lake. The platform can automatically decide whether to index and/or cache data, how to do it, and when to do it.
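A workload-driven acceleration decision of this kind can be sketched as a simple frequency heuristic (hypothetical function and knobs, for illustration only): observe which datasets queries touch, then accelerate only the hottest ones within a budget.

```python
from collections import Counter

def pick_datasets_to_accelerate(query_log, budget=2, min_hits=3):
    """Return the datasets worth indexing/caching, by observed query frequency.

    `budget` and `min_hits` are made-up knobs for this sketch.
    """
    hits = Counter(ds for query in query_log for ds in query["datasets"])
    hot = [ds for ds, n in hits.most_common() if n >= min_hits]
    return hot[:budget]

# A toy query log: each entry lists the datasets one query touched.
query_log = [
    {"datasets": ["clickstream", "users"]},
    {"datasets": ["clickstream"]},
    {"datasets": ["clickstream", "orders"]},
    {"datasets": ["orders"]},
]
print(pick_datasets_to_accelerate(query_log))  # → ['clickstream']
```

A real engine would weigh far more signals (query cost, data size, SLAs), but the shape of the decision is the same: acceleration follows observed workload, not manual configuration.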
We at Varada are great advocates of open source, which has become a core part of our philosophy. We take pride in the fact that we don't only use open source but are active contributors. The first open-source project Varada decided to use was PrestoSQL, which was recently renamed Trino. The Trino community is extremely vibrant and innovative, developing the foundational distributed query engine for data lakes, which Varada leverages for our indexing-based query acceleration platform. Dynamic filtering is just one example of how we support this phenomenal community. Varada also recently announced that we open-sourced our Presto Workload Analyzer for easy observability on top of Presto and Trino.
For text analytics, as recently announced, we implemented the Lucene Java library as part of our nanoblock indexing technology. Any nanoblock that includes text will automatically benefit from Lucene's capabilities as part of the mesh of different indexes the platform uses. The performance uplift for text-driven queries is dramatic, and can be 10x–100x faster than standard data lake performance with Presto/Trino.
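The reason a text index delivers that kind of uplift can be shown with a toy inverted index, the core data structure behind Lucene-style search (a deliberate simplification; Lucene itself adds tokenization, scoring, compressed postings, and much more):

```python
from collections import defaultdict

def build_inverted_index(rows):
    """Map each token to the set of row ids containing it."""
    index = defaultdict(set)
    for row_id, text in enumerate(rows):
        for token in text.lower().split():
            index[token].add(row_id)
    return index

# Toy log rows standing in for text stored in a nanoblock.
rows = [
    "connection timeout on host alpha",
    "user login succeeded",
    "connection refused on host beta",
]
index = build_inverted_index(rows)

# A word search becomes a dictionary lookup plus a set intersection,
# instead of scanning and substring-matching every row.
matches = index["connection"] & index["host"]
print(sorted(matches))  # → [0, 2]
```

Scanning cost grows with the total data size; lookup cost grows only with the number of matches, which is why the gap widens as the data lake grows.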
Over the last six months we have seen many customers shift focus to log analytics. Logs are massive in size, and data consumers are interested in keeping more and more historical logs available for analytics. Text analytics on the data lake is the perfect tool for these use cases. The scale of the data lake makes it easy to deliver access to 5+ years of logs, and with smart text indexing, text searches can also be delivered at interactive performance. Traditional text analytics platforms were not designed to handle such specific tasks, often described as finding a "needle in a haystack" at petabyte scale.
We are also witnessing the rise of folder analytics, way beyond marketing analytics. For example, cyber threat detection leverages text searches for address analysis and pattern recognition.
Varada’s roadmap is specifically designed to add more exciting features to support these trends. Schedule a short demo to see Varada in action on your data set!