0:00:00.7 Max Havey: Hello, and welcome to another edition of Security Visionaries, a podcast all about the world of cyber, data and tech infrastructure, bringing together experts from around the world and across domains. I'm your host, Max Havey. And today we're diving into the world of data lakes with Troy Wilkinson, CISO at Interpublic Group, also known as IPG. Troy, welcome to the show.
0:00:21.5 Troy Wilkinson: Thanks Max. Really a great pleasure to be here.
0:00:23.8 Max Havey: Glad to have you. So to get things started here, can you take us through the concept of data lakes and why they are important as an aspect of modern security?
0:00:34.2 Troy Wilkinson: Yeah, absolutely. I think it's important to take a little bit of a step back and talk about the reason we collect data in the first place. Everything that we do, feel, see and touch in technology has some type of machine logs that come out of it. Some of those event logs are just normal log-ins and log-outs, but some of that is very important security telemetry. And so what we've been doing over the last 25, 30 years is really trying to decide what's important to us from a security operations perspective, what data we need to collect, what data is important to an event or an incident, and then really kind of diving into the logic or the data science behind how we can tie those incidents or events together. So this has been a data problem for the longest period of time, and we are really stepping into the next generation or the next frontier of this data by decoupling the data from the analytics. For so long, we've been asked to put this data into a singular place, what I like to call legacy SIEM, where you pipe your data into a massive database and then you do the analytics on top of it there to gather insights from all of your incidents. Now with the data lake structure, being able to put your data into a common schema in a data lake allows you to decouple that data from your analytics.
0:01:55.3 Troy Wilkinson: So if the next whiz-bang AI solution comes along and you wanna apply that AI to this data set, that's great. It's a flip of a switch. You don't have to move your data into a new solution, you don't have to port it anywhere, you can just apply those new analytics. I think this really gives security leaders and security operators flexibility in how they do security operations and correlation searches in their data lakes. So I really feel like that flexibility, that transparency and the data ownership, and really being able to decide how long you keep that data, are really important decision-making criteria for data lakes and how they're gonna change the industry of security operations.
0:02:30.5 Max Havey: To an extent, it's sort of a place that's serving as a repository for all of this data that organizations have created over all these years, and they can now use that data for whatever purposes they may need, whether that's with an AI model or with analytics or what have you. But it's essentially something that helps them keep it all contained in a way that they can keep it secure as well.
0:02:51.7 Troy Wilkinson: Yeah, absolutely, and I wanna touch on cost as well. If you think about it, the cost of data has come down tremendously. Storing data in the cloud is less than pennies per gigabyte now, so you're able to store more data. In the past, you really had to be cognizant of what data you were bringing into your SIEM and what data you could do correlations on, so there were limitations. I may decide as a security leader that I can't bring in that very voluminous data source because it's too expensive to do that, but I really wanted to.
0:03:17.7 Troy Wilkinson: And so now with the data lake structure, you're able to bring that in at a much lower cost and use it to do correlation searches you've never been able to do before. So as an example, DNS logs are usually very noisy and very unusual, so a lot of security leaders don't bring them in. However, they're very valuable in times of incidents, or if you wanna go back and see if a user went to a particular site and really get down into the weeds. So having that data in a data lake, where storage is very cheap, you're able to have that for long-term and very in-depth investigations, especially in a forensic investigation after an incident.
0:03:51.5 Max Havey: Totally. The advent of having that cheap storage and being able to even just have all of this data ultimately creates new opportunities for how you can best use it. Having more storage is leading to more innovation with this data, and more exciting things that folks in security and elsewhere can do with it.
0:04:08.7 Troy Wilkinson: Absolutely, and another thing to mention is that being able to store this data over time allows the security leader to apply different types of analytics to it. As an example, today we have multiple types of AI-generated searches and AI-generated correlation events, and being able to stitch together telemetry from all of your data sources at scale and at speed, we've never been able to do that before. Now, that was the promise of SIEM in the past: bring all of your data into a single place, let's do all this fancy interpretation of it. However, I think that we just never got there from a security operator's perspective at scale, because of the expense, because of the knowledge that it took to run that, and because of the upkeep of it. We were on-prem for a long time, so you had a data center full of servers that you had to maintain, and then we moved into a cloud era where now your SIEM is in the cloud, and it's very expensive with the compute power needed to do those highly complex analytics. Being able to decouple your data and, most importantly, having that data in a common schema, or the Open Cybersecurity Schema Framework, so that every log source is in the same schema, so that a host name is a host name and a computer is a computer and an IP address is an IP address, means you don't have to translate that. You don't have to look across multiple indexes or data sources and translate it.
0:05:24.2 Troy Wilkinson: In other words, it's all in the same language. You can ask questions of your data at scale and across multiple different places, and so that really helps find that needle in the stack of needles, as we like to say: to find threat actors doing bad things, moving laterally, exploiting your infrastructure, your servers, your cloud, really tying it together where you may have missed those insights before.
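To illustrate the common-schema point above, here is a minimal sketch in Python. The log sources, field names, and sample records are all hypothetical, and the common field names are not taken from the OCSF specification. The idea is only that once every source is mapped to the same field names, a single query can run across all of them without per-source translation.

```python
# Hypothetical per-source field mappings. The common field names used here
# ("src_ip", "hostname") are illustrative assumptions, not OCSF attributes.
FIELD_MAPS = {
    "firewall": {"src": "src_ip", "device_name": "hostname"},
    "edr": {"ip_address": "src_ip", "computer": "hostname"},
}


def normalize(source, record):
    """Rename source-specific fields to the shared schema; keep other fields as-is."""
    mapping = FIELD_MAPS[source]
    normalized = {"source": source}
    for key, value in record.items():
        normalized[mapping.get(key, key)] = value
    return normalized


def find_by_ip(events, ip):
    """One query over every source, possible because the field name is shared."""
    return [event for event in events if event.get("src_ip") == ip]


# Made-up sample records from two differently shaped log sources.
raw = [
    ("firewall", {"src": "10.0.0.5", "device_name": "fw-01", "action": "allow"}),
    ("edr", {"ip_address": "10.0.0.5", "computer": "laptop-17", "process": "powershell.exe"}),
    ("edr", {"ip_address": "10.0.0.9", "computer": "laptop-22", "process": "chrome.exe"}),
]

events = [normalize(source, record) for source, record in raw]
hits = find_by_ip(events, "10.0.0.5")
for hit in hits:
    print(hit["source"], hit["hostname"])
```

The design choice worth noting is that normalization happens once, at ingestion, so any analytics layered on later only needs to know the shared field names.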
0:05:45.8 Max Havey: Absolutely. And that sort of brings me to my next thought here: what are some of the challenges that you've run into as a CISO when it comes to using data lakes and protecting data lakes?
0:05:54.6 Troy Wilkinson: Well, I think that the challenges tend to be the same as they are for any type of data source. You have to have data protections in place, you have to have data ownership and lineage, and you have to make sure that you're deprecating data in the right time frame per the regulatory requirements that you have. So you still have the same data protection concerns that you would with any other data source.
0:06:14.5 Max Havey: Absolutely. And in that same sort of vein, why have data lakes become an increasingly important threat surface to protect from malicious actors and other folks who are either trying to get in there or to poison that data? Why is that becoming an important threat surface for security practitioners to keep in mind?
0:06:31.9 Troy Wilkinson: Yeah, good question. I think that from a data perspective, threat actors are always looking for data to exfiltrate. I think we've seen that as a rising theme across the threat actors in the past few years. The recent Snowflake incidents that we've seen across multiple large organizations show us that threat actors are looking for large data sources to exfiltrate, and so data protections are extremely important. Certainly data exfiltration is at the top of the threat actor's playbook, and so we are always looking to protect against that. I think threat actors are really intent on getting to companies' data and they find it very valuable. We used to see ransomware attacks where it was just encrypt the servers and hold the companies for ransom. Now they're actually exfiltrating that data. And so there's secondary and even tertiary data ransomware, where they say, "If you don't pay us, we're gonna release your data to the public." So data has become a monetized commodity for the threat actors and will continue to be a target.
0:07:24.6 Max Havey: Absolutely, and you've seen that with corporations or organizations that have had, like, giga leaks. I remember Nintendo specifically. There have been large-scale entertainment corporations and other folks throughout industries over the years where they've had those sorts of huge leaks of data. And I think that's an interesting point, that there are these troves of data now that maybe weren't there 15, 20 years ago, just because we are able to keep a hold of it now.
0:07:49.2 Troy Wilkinson: As we look at data sets, we look at the Sony hack and exfiltrating movie information. You look at other industries, like banking, where they're trying to exfiltrate information on customers. I think that every data set is unique and needs protecting. But if you think about the security data lakes that we're talking about here, and the security telemetry for security operations, threat actors could gain a very big insight into what a customer does to protect themselves, which would give them a path to take advantage of them even more. In other words, they could find ways to get into their backups, into their databases, into their servers. And so this security telemetry is very valuable to threat actors as well, so we need to put even more guardrails around our data lakes.
0:08:29.4 Max Havey: Absolutely. And I know we talk about the idea of using data lakes as something to help train AI models and things of that sort. I know poisoning data is a real risk when it comes to training generative AI and other AI models. How is that an issue, and what are some ways that folks can think about protecting against that when it comes to data lakes?
0:08:47.5 Troy Wilkinson: So as we look at large language models and other types of foundation models for artificial intelligence that we're feeding ourselves, meaning a model that you're building and maintaining on-premise or in your own cloud, I think it's really important to understand that this data poisoning option is there for threat actors to take advantage of. You need to have input validation, and you need to make sure that nobody's able to poison the inputs or exfiltrate data. Even if you have a RAG architecture or a reference architecture of sharing an AI model, you can still have some data poisoning at the input level, and you can also have data exfiltration where there is an exchange between the input from the user and the underlying foundation model. So I think it's so important to protect all the components of that.
0:09:32.9 Troy Wilkinson: And it's a different genre of security at this point, where we're seeing AI protection: protecting the foundation model, detecting and protecting against data poisoning, and also bias. That bias can be inherent bias or it can be unknown bias, where you don't even realize that your model is turning into some big algorithm that is taking you down the wrong path. So as it relates to security data lakes, SIEM, SOAR and security operations, I think we're still a long way from there. I think that we're pretty safe on that because we're not implementing or instituting AI models on top of our security data lakes at scale yet. There are vendors that are doing that behind the scenes, so they would have a big challenge to protect those underlying models. But for us as practitioners across the industry, getting all of our data into a central data lake and applying advanced analytics to it is still what I would consider machine learning and some of the older-school types of security correlation searches. Now, the best part about a data lake, again, is that having your data in a common schema and in a centralized place like that, you're able to change out analytics.
0:10:44.9 Troy Wilkinson: So if the next AI solution comes along, let's say in the next 12 months, and security operators say, "I wanna apply that new AI to my data lake," it's very easy to flip that switch and do it without having to move that data. So we have the flexibility there, but I don't think that we are at the point yet where we need to protect that foundation model on our data lake.
0:10:58.8 Max Havey: Absolutely, and it gets back to what you were saying about how the idea is that all of the data is sort of speaking the same language, that nothing there needs to be decoded in a way that is going to confuse your security operators and such. And I think it's especially interesting considering how quickly security and AI and all technology innovation is moving at this point. We're seeing new solutions popping up every other week, it feels like, and being able to adjust that data and apply it accordingly if you do see a solution that is coming your way, I think that's really exciting and really interesting, and it speaks volumes for what you can do with innovation down the road here.
0:11:33.2 Troy Wilkinson: Absolutely. I think one of the most unique advantages that I see in the near term for AI is being able to translate natural language from a security operator into complex queries. I think that security operations teams have gotten really adept at writing complex scripts and queries to query their data, but training that next generation of security operators is gonna be much easier if they can just ask questions of their data: show me where this is, or show me where I have this vulnerability. Being able to ask just normal questions and then have the AI translate that into a complex query that can search the data lake very quickly is gonna help us have better outcomes, and faster. I also think that the data lake is gonna empower us to keep this data for longer periods of time, so that if a company has a breach, you can look back and stitch together the telemetry that you may not have had the option to before.
0:12:25.6 Troy Wilkinson: As an example, the IBM Ponemon Institute study last year showed the average length of a breach before detection is about 180 days. So that's six months before a company realizes threat actors are in their environment. And so if you're not keeping six months of that full telemetry from your firewalls and your endpoint detection and response and your antivirus, you're gonna be missing some of those critical data components to put the story back together of how that threat actor got in and what they did in the beginning of this access. A data lake allows you to store that data at very low cost over longer periods of time, and so you're able to then go back and use that in your investigation to find out exactly what happened from the moment of entry all the way through today.
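The retention arithmetic above can be sketched in a few lines of Python. This is only an illustration: the retention periods per log source and the detection date are made-up values, and the 180-day dwell time is the rough figure quoted in the conversation. The function reports, for each source, how many days of the intrusion would fall outside what was retained.

```python
from datetime import date, timedelta


def coverage_gaps(retention_days, detection_date, dwell_days):
    """Return {source: missing_days} for sources whose retained logs
    do not reach back to the estimated date of initial entry."""
    entry_date = detection_date - timedelta(days=dwell_days)
    gaps = {}
    for source, days in retention_days.items():
        oldest_retained = detection_date - timedelta(days=days)
        if oldest_retained > entry_date:
            gaps[source] = (oldest_retained - entry_date).days
    return gaps


# Hypothetical retention settings per log source, in days.
retention = {"firewall": 90, "edr": 365, "antivirus": 30}

# Assumed detection date, with the roughly 180-day dwell time cited above.
gaps = coverage_gaps(retention, date(2024, 6, 30), 180)
print(gaps)
```

With these example numbers, only the source retained for a full year covers the whole intrusion window; the other two leave the earliest part of the access unobservable.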
0:13:06.3 Max Havey: And in that same sort of thought there, have there been any major security incidents that you've seen reported that have come as a result of improperly secured data lakes? And if so, are there any major lessons that can be learned from those sorts of incidents?
0:13:19.6 Troy Wilkinson: Yeah, I think the recent Snowflake issue is a good example. So this is a massive database data lake that customers use for a variety of reasons. We saw Ticketmaster, which is one of the most well-known incidents relating to Snowflake this year. I believe that it's a good example of why you need proper cyber hygiene: having all of your accounts behind multi-factor authentication, having the right application firewalls in place to make sure that those service accounts are protected. So I think just those best practices for accessing the data or data lake are so important in this. Being able to craft that and have the right cyber hygiene is key to success.
0:13:57.7 Max Havey: Absolutely, you don't wanna have a situation where you have passwords in plain text, or things laying around that shouldn't be laying around when you're dealing with data of this volume and of this sensitivity.
0:14:08.3 Troy Wilkinson: Absolutely.
0:14:09.1 Max Havey: Bring us around here. What are some strategies or advice that you'd recommend to CISOs and other security practitioners when it comes to protecting data lakes? Beyond just broad cyber hygiene, are there any other pieces of advice or strategies you'd wanna recommend to folks?
0:14:24.8 Troy Wilkinson: Yeah, I think from a protection perspective of Data lakes, I think that you have to decide what's right for your particular company, you can run this on-premise, you can run this in the cloud, and all of the same protections that you would nor