As organizations look to become more agile from the start, many have jumped headfirst into what is called a security data lake.
Gartner defines data lakes as “a concept consisting of a collection of storage instances of various data assets. These assets are stored in a near-exact, or even exact, copy of the source format and are in addition to the originating data stores.”
Expanding this concept to include security-specific data, “security data lakes” can help you centralize and store unlimited amounts of data to power investigations, analytics, threat detection, and compliance initiatives. Analysts and applications can access logs from a single source to perform data-driven investigations at optimal speed with centralized, easily searchable data.
The global data lake market size was valued at over $8 billion USD last year and is expected to grow at a compound annual growth rate (CAGR) of over 21% from 2021 to 2028. According to Gartner, over half of organizations plan to implement a data lake in the next 24 months. An enormous amount of information is generated daily on digital information platforms and requires efficient processing and indexing architectures.
Security data lakes are designed to centralize all of your data so you can support complex security analysis use cases, including threat hunting and anomaly detection at scale. One of the top challenges is long-term retention and the ability to search across collected telemetry. Most vendors cap data retention at between 7 and 30 days and often pass the cost on to the buyer, whether the buyer knows it or not. For example, according to Gartner and multiple cloud benchmark studies over the years, 7 days of continuously recorded endpoint detection and response (EDR) data costs $6 USD per endpoint per year on average, which is why EDR solutions are so expensive.
Access to all of your historical data is critical to having the right context for an effective and efficient security investigation.
As seen with the SolarWinds supply chain attack, it was months before the security community was made aware of the malicious artifacts, the adversarial tactics, techniques, and procedures (TTPs), and the motivations and scope behind such a complex attack. This meant that many organizations could not hunt across the relevant time window because those logs had already aged out of the platform or had been moved into offline archives, making it difficult to triage the scope of the attack.
There are 4 key data-related challenges that security teams must address for a security data lake architecture to operate efficiently and effectively.
Organizations are taking extra care to implement a best-of-breed approach that addresses not only immediate needs but also long-term ones.
The main pitfall for data lake architectures, especially when evaluated against existing SIEM solutions and other optimized platforms, is resource efficiency. Data lake query engines are often based on brute-force technology that essentially scans the entire data set. The result is that 80%-90% of compute resources are “wasted” on ScanFilter operations.
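To make the cost of brute-force scanning concrete, here is a minimal Python sketch that compares a full scan with a lookup through a prebuilt index. The rows, column layout, and function names are hypothetical and exist only for illustration. The sketch counts rows read as a rough stand-in for compute and does not model any particular query engine.

```python
from collections import defaultdict

# Hypothetical log rows: (timestamp, source_ip, event_type)
ROWS = [
    (1, "10.0.0.1", "login"),
    (2, "10.0.0.2", "dns_query"),
    (3, "10.0.0.1", "file_write"),
    (4, "10.0.0.3", "login"),
    (5, "10.0.0.2", "login"),
    (6, "10.0.0.1", "dns_query"),
]

def scan_filter(rows, ip):
    """Brute force: read every row and keep the ones that match."""
    rows_read = 0
    matches = []
    for row in rows:
        rows_read += 1
        if row[1] == ip:
            matches.append(row)
    return matches, rows_read

def build_index(rows):
    """Map each source_ip to the positions of the rows that contain it."""
    index = defaultdict(list)
    for pos, row in enumerate(rows):
        index[row[1]].append(pos)
    return index

def index_lookup(rows, index, ip):
    """Read only the rows the index points at."""
    positions = index.get(ip, [])
    return [rows[p] for p in positions], len(positions)

IP_INDEX = build_index(ROWS)
scan_matches, scan_reads = scan_filter(ROWS, "10.0.0.3")
idx_matches, idx_reads = index_lookup(ROWS, IP_INDEX, "10.0.0.3")
print(f"scan read {scan_reads} rows, index read {idx_reads} rows")
```

Both paths return the same single matching row, but the scan reads all six rows while the index reads one. On a real data set the gap grows with the size of the data and the selectivity of the filter, at the price of building and storing the index.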
Organizations that have already attempted to leverage data lake architectures often find themselves managing huge clusters to ensure performance and concurrency requirements are met. This is extremely expensive, both in compute resources and in the large data teams needed to maintain them.
There are different ways to tackle these challenges, ranging from partitioning-based optimizations to managed platforms or even serverless solutions such as Amazon Athena.
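As a rough illustration of what a partitioning-based optimization buys, the following Python sketch stores hypothetical events in one partition per day and counts how many partitions a query has to read. The layout, dates, and the query function are assumptions made for this example, not the behavior of any specific platform.

```python
from datetime import date

# Hypothetical layout: one partition per day, each holding (user, action) events.
PARTITIONS = {
    date(2021, 3, 1): [("alice", "login"), ("bob", "login")],
    date(2021, 3, 2): [("alice", "file_write")],
    date(2021, 3, 3): [("carol", "login"), ("bob", "dns_query")],
    date(2021, 3, 4): [("alice", "login")],
}

def query(partitions, start=None, end=None, user=None):
    """Return (matching events, number of partitions read)."""
    results = []
    partitions_read = 0
    for day, events in sorted(partitions.items()):
        # Pruning: skip whole partitions outside the requested date range.
        if start is not None and day < start:
            continue
        if end is not None and day > end:
            continue
        partitions_read += 1
        for event_user, action in events:
            if user is None or event_user == user:
                results.append((day, event_user, action))
    return results, partitions_read

# A filter on the partition key prunes; a filter on any other column does not.
by_date, read_by_date = query(PARTITIONS, start=date(2021, 3, 3), end=date(2021, 3, 4))
by_user, read_by_user = query(PARTITIONS, user="bob")
print(f"date filter read {read_by_date} partitions, user filter read {read_by_user}")
```

The date-range query reads two of the four partitions, while the query by user has to read all four because the data is not partitioned on that column. Pruning only helps on the few dimensions the data was partitioned by.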
Unlike partitioning-based data lake optimizations, which are limited to several dimensions, Varada offers a data lake analytics solution that is based on proprietary big data indexing technology.
Varada can index any column and automatically decides which data to index and which index to use on each nano-block (a small chunk of 64K rows from a single column). Varada’s indexing suite includes a variety of indexes such as Bitmap, Dictionary, Trees, Bloom, Lucene (text searches), etc. Based on the format, structure, and cardinality of the data, the platform automatically assigns the most effective index, driving optimal performance.
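The idea of choosing an index per chunk from the shape of the data can be sketched in a few lines of Python. This is not Varada's selection logic, which is proprietary and not described here. The function names, the thresholds (16 distinct values, a 0.5 distinct ratio), and the mapping from data shape to index type are all assumptions chosen purely for illustration.

```python
def choose_index(values):
    """Pick an index type for one chunk of a single column (illustrative heuristic)."""
    if not values:
        return "none"
    distinct = len(set(values))
    ratio = distinct / len(values)
    if all(isinstance(v, str) and " " in v for v in values):
        return "text"        # free-text values: candidates for a Lucene-style index
    if distinct <= 16:
        return "bitmap"      # very few distinct values
    if ratio <= 0.5:
        return "dictionary"  # values repeat often
    return "tree"            # mostly unique values, such as timestamps or IDs

def split_chunks(column, chunk_rows):
    """Split one column into fixed-size chunks (the last chunk may be shorter)."""
    return [column[i:i + chunk_rows] for i in range(0, len(column), chunk_rows)]

# Hypothetical columns; the chunk size is kept tiny so the example runs instantly.
status = ["ok", "fail"] * 50
users = [f"user{i % 40}" for i in range(100)]
timestamps = list(range(100))
messages = ["failed login from host", "connection reset by peer"] * 50

for name, column in [("status", status), ("users", users),
                     ("timestamps", timestamps), ("messages", messages)]:
    choices = [choose_index(chunk) for chunk in split_chunks(column, 100)]
    print(name, choices)
```

Because the decision is made per chunk, the same column could end up with different index types in different chunks if the distribution of its values changes over time.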
The platform seamlessly chooses which queries to accelerate and which data to index according to workload behavior and automatic detection of hot data and bottlenecks. The platform also enables data teams to define business priorities and accordingly adjust performance and budgets, eliminating the need to build separate silos for each use case.
Agility applies not only to the type of queries but also to their volume, which means volatility in compute for query processing is expected to be high. Data teams are often measured on how quickly they can react to spikes in demand. Varada’s architecture is extremely elastic, enabling teams to add more clusters and use cases quickly and to scale out and in dynamically, delivering the most effective TCO. Effective separation of compute and storage enables teams to scale elastically and add clusters as query traffic fluctuates, avoiding overprovisioning and idle resources.
As a part of Varada’s indexing suite, text searches with Apache Lucene are a native part of the platform and are applied automatically by the platform.
Organizations collect massive amounts of data on various events from many different applications and systems. These events need to be analyzed effectively to enable real-time threat detection, anomaly detection, and incident management. In various security-related use cases, text analytics is leveraged to provide deep insights on traffic and user behavior (segmentation, URL categorization, etc.). Text analytics has proven to be critical for security information and event management (SIEM) and other SOC tools in reducing the overall time and resources required to investigate a security incident while being as effective and efficient as possible.
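To show why indexed text search speeds up an investigation, here is a toy inverted index in plain Python. It only illustrates the general technique behind text indexes and does not use Apache Lucene or its API. The log lines, the tokenizer, and the function names are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical log lines (IP addresses are from documentation ranges).
LOG_LINES = [
    "Failed password for admin from 203.0.113.7",
    "Accepted password for alice from 198.51.100.4",
    "Failed password for root from 203.0.113.7",
    "Connection closed by 198.51.100.4",
]

def tokenize(text):
    """Lowercase the text and split it into terms, keeping dotted IP addresses whole."""
    return re.findall(r"[a-z0-9.]+", text.lower())

def build_inverted_index(lines):
    """Map each term to the set of line numbers that contain it."""
    index = defaultdict(set)
    for line_no, line in enumerate(lines):
        for term in tokenize(line):
            index[term].add(line_no)
    return index

def search(index, query):
    """Return the sorted line numbers that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))

INDEX = build_inverted_index(LOG_LINES)
for line_no in search(INDEX, "failed 203.0.113.7"):
    print(LOG_LINES[line_no])
```

The search intersects two small sets of line numbers instead of rereading every log line, which is the property that lets an analyst pivot quickly on an IP address, username, or error string.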
About the author
Brad LaPorte is a former top-rated Gartner Research Analyst for cybersecurity and held senior positions in US Cyber Intelligence, Dell, and IBM, as well as at several startups.
Brad has spent most of his career on the frontlines fighting cybercriminals and advising top CEOs, CISOs, CIOs, CxOs, and other thought leaders on how to be as efficient and effective as possible. He is currently a Partner at High Tide Advisors, where he actively helps cybersecurity and tech companies grow their go-to-market strategies.