It seems like everyone is coming to the realization that the glory days of SIEM are over. Security teams are no longer measured only by their ability to collect as much data as possible; the emphasis is shifting to how effectively they can analyze massive amounts of complex security data flowing in as a constant real-time stream.
To top that off, security data is primarily composed of events and logs that are growing in complexity and dimensionality. This means that each row (a log or event) often includes dozens or even hundreds of different attributes.
Yes, data is complex, and being able to sift through it effectively will drive ROI and build a solid competitive advantage. In this blog I want to focus on the “effectively” piece of analytics. It all comes down to timing: how quickly you can start analyzing, how quickly you can change queries and adapt, and how quickly your queries complete, so you can run as many queries as possible to detect anomalies, new threats, and attacks.
To frame timing, CrowdStrike introduced the concept of Breakout Time. It’s not enough to identify and react; you need to do it very fast. Attacks generally include five stages: initial access, persistence, discovery, lateral movement, and objective.
For effective mitigation, security teams need to detect, analyze, and respond before lateral movement begins. That window is essentially the breakout time. According to 2018 research, the average breakout time was 1 hour and 58 minutes.
The last piece of CrowdStrike’s argument is that almost 2 hours doesn’t cut it, and security teams should aim for the following:
1 minute to detect. 10 minutes to understand. 60 minutes to respond.
With SOC teams overwhelmed by alerts and data, traditional SIEM, XDR, and optimized search platforms merely compound the problem, let alone support the 1-10-60 goal of beating breakout time (1 hour and 11 minutes in total).
Going back to my focus on “effectively,” I strongly believe that the security data lake is the only analytics destination that can achieve breakout time.
Threat hunting can be described as finding a “needle in a haystack”. It’s not something you can prepare for or model. Data changes all the time, anomalies change all the time, and data has to remain in its most granular state. In addition to granularity, maintaining easy and quick access to historical data is critical, making the data lake a strong alternative for analytics.
By leveraging commodity (i.e., cheap) object storage, such as AWS S3, and existing data pipelines, the data lake offers a cost-effective architecture for a wide range of data sources.
The advantages of the data lake are dramatic.
There has to be a big “but” coming up, right? The agility advantages of the data lake architecture are often offset by a lack of efficiency.
Most data lake query engines are based on brute-force technology, which essentially scans the entire dataset to return answers. So what you save on storage you may end up paying for in enormous compute clusters to support decent performance and concurrency SLAs, which are critical to delivering the 1-10-60 goal.
In reality, 80%-90% of compute will be “wasted” on scan and filter operations.
Bottom line: data lake analytics solutions haven’t yet matured to support the fast pace and performance required for threat hunting, incident response, and security investigations.
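To make the brute-force problem concrete, here is a minimal Python sketch (the dataset, column names, and values are invented for illustration) contrasting a full scan, which touches every row on every query, with a pre-built inverted index, which answers the same query by touching only the matching rows:

```python
# Hypothetical sketch: brute-force scan vs. index lookup on a log table.
# The events, columns, and values below are made up for illustration.
from collections import defaultdict

events = [
    {"id": i, "src_ip": f"10.0.0.{i % 250}", "action": "login"}
    for i in range(100_000)
]

# Brute force: every query scans every row, even when only a few match.
def full_scan(rows, ip):
    return [r for r in rows if r["src_ip"] == ip]

# Build the index once; each lookup then touches only the matching rows.
index = defaultdict(list)
for row in events:
    index[row["src_ip"]].append(row["id"])

def indexed_lookup(ip):
    return index[ip]

# Both approaches return the same rows; only the work done differs.
assert len(full_scan(events, "10.0.0.7")) == len(indexed_lookup("10.0.0.7"))
```

The full scan does work proportional to the whole table on every query, which is exactly the compute that gets “wasted” on scan and filter operations; the index lookup does work proportional only to the result size.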
Indexing is all about finding that “needle in a haystack.” But if you learned anything about indexing back in school, undergrad, university, or anywhere else, just scrap it. When it comes to truly massive data sets, the traditional concepts of indexing won’t deliver the much-sought-after efficiencies. It takes a fresh look at indexing, adapted to big data challenges.
Unlike partitioning-based optimizations, which are designed to reduce the amount of data scanned and thereby boost performance (or reduce cost), Varada’s big data indexing technology is not limited to a handful of attributes; it enables you to quickly find relevant data across any attribute (column).
Because the data lake is not homogeneous and includes data from many different sources and formats, the platform leverages a rich suite of indexes, such as bitmap, dictionary, tree, Bloom, and Lucene (text-search indexes, which are so critical for log and event analytics).
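As an illustration of one index type from that suite, here is a toy Bloom filter in Python (a minimal sketch, not Varada’s actual implementation). Its value for data skipping is that a negative answer is definitive: if the filter says a value is absent, the engine can skip that data entirely without reading it.

```python
# Toy Bloom filter: illustrative only, not a production implementation.
import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size = size      # number of bits in the filter
        self.hashes = hashes  # number of hash functions per item
        self.bits = 0         # bit array packed into one integer

    def _positions(self, item):
        # Derive k bit positions from k salted hashes of the item.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        # False means "definitely absent" -- the block can be skipped.
        # True means "possibly present" -- the block must still be read.
        return all(self.bits >> p & 1 for p in self._positions(item))

bf = BloomFilter()
bf.add("failed_login")
assert bf.might_contain("failed_login")
```

The trade-off is that Bloom filters allow rare false positives but never false negatives, which is exactly the property a query engine needs to prune data safely.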
The platform doesn’t require any special skill sets and automatically finds the best index. Let’s take it a step further: to deliver optimal performance on varying cardinality, the platform breaks down each column into small pieces, called nano-blocks, and finds the optimal index for each nano-block.
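The per-nano-block idea can be sketched as follows; note that the block size and cardinality thresholds below are invented for illustration, and Varada’s actual selection heuristics are not public:

```python
# Hypothetical sketch: pick an index type per "nano-block" by cardinality.
# Block size and thresholds are assumptions, not Varada's real values.

def choose_index(values):
    distinct = len(set(values))
    ratio = distinct / len(values)
    if ratio < 0.01:
        return "bitmap"      # few distinct values: bitmaps compress well
    if ratio < 0.5:
        return "dictionary"  # moderate cardinality: dictionary encoding
    return "tree"            # near-unique values: tree/range index

def index_column(column, block_size=8192):
    # Split the column into fixed-size nano-blocks, choosing per block.
    return [
        choose_index(column[i:i + block_size])
        for i in range(0, len(column), block_size)
    ]

statuses = ["ok"] * 8000 + ["error"] * 192    # low-cardinality block
session_ids = [f"s{i}" for i in range(8192)]  # unique value per row
assert index_column(statuses + session_ids) == ["bitmap", "tree"]
```

The point of the sketch is that a single column can contain regions with very different cardinality, so choosing the index at the block level rather than the column level lets each region get the structure that suits it.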
Case Study: Varada’s Impact on Endpoint Security
Using deep observability and machine learning, Varada’s platform dynamically and automatically indexes data according to workload priorities (some applications will always top others), changing queries, and changing data. Indexes, as well as cached data, are stored on SSDs to enable highly effective access and extremely fast performance.
To highlight how fast Varada is, we ran a benchmarking analysis against alternative query platforms such as Snowflake and Trino. The results were very clear: Varada is 10x-100x faster, and as queries get more complex (needle in a haystack), Varada’s performance advantage becomes even more extreme.
Let’s go back to where we started. Varada is designed to finally beat breakout time.
All you need to do is start timing it!