The Security Data Lake: It’s All About Timing

By Shira Sarid
October 19, 2021

It seems like everyone is coming to the realization that the glory days of SIEM are over. Security teams are no longer measured only by how much data they can collect; the emphasis is moving to how effectively they can analyze massive amounts of complex security data arriving in a constant real-time stream.

To top that off, security data is primarily composed of events and logs that are growing in complexity and dimensionality. This means that each row (log or event) often includes dozens and even hundreds of different attributes.

Yes, data is complex, and being able to sift through it effectively will drive ROI and build a solid competitive advantage. In this blog I want to focus on the “effectively” piece of analytics. It all comes down to timing: how quickly you can start analyzing, how quickly you can change queries and adapt, and how quickly your queries complete so you can run as many of them as possible to detect anomalies, new threats and attacks.

Breakout Time KPIs

To frame the concept of timing, CrowdStrike introduced the concept of Breakout Time. It’s not enough to identify and react; you need to do it very fast. Attacks generally include five stages: initial access, persistence, discovery, lateral movement, and objective.

For effective mitigation, security teams need to detect, analyze and respond before lateral movement begins. That window is essentially breakout time. According to 2018 research, the average breakout time was 1 hour and 58 minutes.

The last piece of CrowdStrike’s argument is that almost 2 hours doesn’t cut it, and security teams should aim for the following:

1 minute to detect. 10 minutes to understand. 60 minutes to respond.


With SOC teams overwhelmed with alerts and data, traditional SIEM, XDR and optimized search platforms merely compound the problem, let alone support the 1 hour and 11 minutes goal (1 + 10 + 60 minutes) needed to beat breakout time.
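
To make the 1-10-60 targets concrete, here is a minimal Python sketch that checks an incident timeline against them. It assumes the three stages run back to back (which is how the targets add up to 1 hour and 11 minutes); the function name `check_1_10_60` and the sample timestamps are hypothetical and only there for illustration.

```python
from datetime import datetime, timedelta

# Targets taken from the 1-10-60 rule quoted above.
TARGETS = {
    "detect": timedelta(minutes=1),
    "understand": timedelta(minutes=10),
    "respond": timedelta(minutes=60),
}

# Assumption: stages are sequential, so the total budget is the sum (71 minutes).
total_budget = sum(TARGETS.values(), timedelta())


def check_1_10_60(intrusion, detected, understood, responded):
    """Return {stage: (duration, met_target)} for one incident timeline."""
    durations = {
        "detect": detected - intrusion,
        "understand": understood - detected,
        "respond": responded - understood,
    }
    return {stage: (d, d <= TARGETS[stage]) for stage, d in durations.items()}


# Hypothetical incident: detected in 45 seconds, understood 12 minutes later,
# contained 30 minutes after that.
t0 = datetime(2021, 10, 19, 9, 0, 0)
detected = t0 + timedelta(seconds=45)
understood = detected + timedelta(minutes=12)
responded = understood + timedelta(minutes=30)

result = check_1_10_60(t0, detected, understood, responded)
for stage, (duration, ok) in result.items():
    print(f"{stage:<10} {duration}  {'ok' if ok else 'missed'}")
```

In this made-up timeline the team detects and responds in time but misses the 10-minute "understand" target, which is exactly the stage where query speed matters most.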

Going back to my focus on efficiency, I strongly believe that the security data lake is the only analytics destination that can beat breakout time.

The Data Lake Offers Optimal Agility for Security Analytics

Threat hunting can be described as finding a “needle in a haystack”. It’s not something you can prepare for or model. Data changes all the time, anomalies change all the time, and data has to remain in its most granular state. In addition to granularity, maintaining easy and quick access to historical data is critical, making the data lake a strong alternative for analytics. 

By leveraging commodity (i.e. cheap) object storage (AWS S3, for example) and existing data pipelines, the data lake offers a cost-effective architecture for a wide range of data sources.

The advantages of the data lake are dramatic:

  • Access all relevant data in its raw, granular form
  • Data is retained for multiple uses over time, enabling full timeline investigations
  • The data lake serves as the single source of truth for all security analytics workloads

The Security Data Lake Fails to Deliver Efficiencies

There has to be a big “but” coming up, right? The agility advantages of the data lake architecture are often offset by a lack of efficiency.

Most data lake query engines are based on brute force technology, which essentially scans the entire dataset to return answers. So where you save on storage, you may end up paying for enormous compute clusters to support decent performance and concurrency SLAs, which are critical to deliver the 1-10-60 goal.

In reality, 80%-90% of compute will be “wasted” on scan and filter operations, resulting in:

  • Inefficient resource utilization
  • Inconsistent performance
  • Unpredictable costs 
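
The cost of brute-force scanning can be illustrated with a small Python sketch. It is a toy comparison on synthetic data, not a model of any particular query engine: the event layout, the IP addresses and the helper names (`full_scan`, `build_index`, `indexed_lookup`) are all assumptions made for the example.

```python
from collections import defaultdict


def full_scan(rows, column, value):
    """Brute force: inspect every row. Returns (matches, rows_inspected)."""
    matches = [r for r in rows if r[column] == value]
    return matches, len(rows)


def build_index(rows, column):
    """Map each distinct value to the positions of the rows that hold it."""
    index = defaultdict(list)
    for pos, row in enumerate(rows):
        index[row[column]].append(pos)
    return index


def indexed_lookup(rows, index, value):
    """Index path: touch only matching rows. Returns (matches, rows_inspected)."""
    positions = index.get(value, [])
    return [rows[p] for p in positions], len(positions)


# Synthetic events: 100,000 rows, only 3 of them from the IP we are hunting.
events = [
    {"src_ip": f"10.0.{i % 250}.{i % 200}", "action": "allow"}
    for i in range(100_000)
]
for pos in (17, 50_000, 99_999):
    events[pos] = {"src_ip": "203.0.113.7", "action": "deny"}

scan_hits, scan_cost = full_scan(events, "src_ip", "203.0.113.7")
ip_index = build_index(events, "src_ip")
idx_hits, idx_cost = indexed_lookup(events, ip_index, "203.0.113.7")

print(f"full scan inspected {scan_cost} rows, index inspected {idx_cost}")
```

Both paths return the same three events, but the scan inspects every row on every query, while the index pays a one-time build cost and then touches only the matching rows.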

Bottom line, data lake analytics solutions haven’t yet matured enough to support the fast pace and performance required for threat hunting, incident response and security investigations.

The Missing Ingredient to Enable the Security Data Lake is Indexing

Indexing is all about finding that “needle in a haystack.” But if you learned anything about indexing back in school / undergrad / university / anywhere else, just scrap it. When it comes to truly massive data sets, the traditional concepts of indexing won’t deliver the much sought after efficiencies. It took a fresh look at indexing, and adjusting it, to solve big data challenges.

Introducing Varada’s Autonomous Big Data Indexing Technology

Partitioning-based optimizations are designed to reduce the amount of data scanned and subsequently boost performance (or reduce cost), but they are limited to a handful of attributes. Varada’s big data indexing technology has no such limit: it quickly finds relevant data across any attribute (column).

Because the data lake is not homogeneous and includes data from many different sources and formats, the platform leverages a rich suite of indexes, such as Bitmap, Dictionary, Trees, Bloom and Lucene (for the text searches that are so critical for log and event analytics).
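
As an illustration of one of these index families, here is a minimal bitmap index sketch in Python. It shows the general technique only and says nothing about how Varada implements it; the column values and the helper names (`build_bitmap_index`, `rows_of`) are made up for the example.

```python
def build_bitmap_index(values):
    """One integer bitmask per distinct value; bit i is set if row i holds it."""
    bitmaps = {}
    for i, v in enumerate(values):
        bitmaps[v] = bitmaps.get(v, 0) | (1 << i)
    return bitmaps


def rows_of(bitmap):
    """Decode a bitmask back into ascending row positions."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Two low-cardinality columns from six hypothetical firewall events.
actions = ["allow", "deny", "allow", "deny", "deny", "allow"]
protocols = ["tcp", "udp", "tcp", "tcp", "udp", "tcp"]

by_action = build_bitmap_index(actions)
by_protocol = build_bitmap_index(protocols)

# WHERE action = 'deny' AND protocol = 'udp' becomes a single bitwise AND.
denied_udp = rows_of(by_action["deny"] & by_protocol["udp"])
print(denied_udp)
```

Bitmaps work well when a column has few distinct values, because filters on several columns combine with cheap bitwise operations instead of row-by-row comparisons.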

The platform doesn’t require any special skill sets and automatically finds the best index. Let’s take it a step further – to deliver optimal performance on varying cardinality, the platform breaks down each column into small pieces, nano-blocks, and finds the optimal index for each nano-block. 
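
The idea of choosing an index per small block can be sketched as follows. This is a hypothetical heuristic for illustration, not Varada's actual selection logic: the thresholds, the block size of 1,000 rows and the names `choose_index` and `plan_column` are all assumptions.

```python
def choose_index(block):
    """Pick an index kind for one block from its cardinality (made-up thresholds)."""
    distinct = len(set(block))
    ratio = distinct / len(block)
    if distinct <= 16:
        return "bitmap"      # few distinct values: one bitmap per value stays small
    if ratio < 0.5:
        return "dictionary"  # moderate repetition: encode values as compact codes
    return "tree"            # mostly unique values: ordered structure for lookups


def plan_column(values, block_size=1000):
    """Split a column into fixed-size blocks and choose an index for each one."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return [choose_index(b) for b in blocks]


# One column whose cardinality changes from block to block.
column = (
    ["GET", "POST"] * 500                         # block 0: 2 distinct values
    + [f"user-{i % 100}" for i in range(1000)]    # block 1: 100 distinct values
    + [f"session-{i}" for i in range(1000)]       # block 2: every value unique
)

plan = plan_column(column)
print(plan)
```

The point of the sketch is that a single column can warrant different index types in different regions, so deciding per block rather than per column avoids a one-size-fits-all compromise.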

Varada’s Security Data Lake Solution


Using deep observability and machine learning, Varada’s platform dynamically and automatically indexes data according to workload priorities (some applications will always top others), changing queries and changing data. Indexes, as well as cached data, are stored on SSDs to enable highly effective access and extremely fast performance.

To highlight how fast Varada is, we ran a benchmarking analysis against data platforms and query engines such as Snowflake and Trino. The results were very clear. In our benchmark, Varada was 10x-100x faster, and as queries got more complex (needle in a haystack), Varada’s performance advantage became much more pronounced.

Let’s go back to where we started. Varada is designed to finally beat breakout time:

  • No need to move data or model it – ask questions whenever you want
  • Keep data granular – ask any question you need
  • Run directly on your data lake – run on 10x more data, including historical data
  • Add the power of indexing to your analytics – get blazing fast answers

All you need to do is start timing it!

See the magic of Varada’s Security Data Lake on AWS Marketplace or schedule a live demo!
