Coffee Sessions #45

Enterprise Security and Governance MLOps

MLOps in the enterprise is difficult due to security and compliance. In this MLOps Coffee Session, Diego Oppenheimer, the CEO of Algorithmia, talks to us about how we can better approach MLOps within the enterprise. This is an introduction to the essential principles of security in MLOps and why it is crucial to be aware of security best practices as an ML professional.


Demetrios: Welcome everyone to another MLOps Community coffee session. Today I'm joined by none other than Diego Oppenheimer and my man Vishnu. I want to start out the session by saying a big thank you to Diego and the Algorithmia crew for sponsoring this session. It is an absolute honor to have them a) in the community and b) throwing their weight behind the community, really supporting us in what we're doing and showing that they appreciate what is happening. Diego has been a huge supporter from the get-go. He sent me over one of these awesome bottles that I managed to break, and then I got a new one. So big thanks to Diego and Algorithmia. Today we're going to be talking a lot about security and governance and how that relates to MLOps. There's a blog post from Algorithmia if you want to know more, and we'll link to that in the description. So without further ado, let's get into it. Diego, it's been a while since we chatted. How are you doing, man?
Diego: I'm doing great. Thanks for having me. I'm always super excited to be chatting with the MLOps community. It's definitely my favorite community that I've ever worked with. It's so relevant to the space, and the growth is really exciting to see. I think when you first invited me to a meetup, this was, what, like a hundred people? What are we at now? 3,000, 4,000, 5,000?
Vishnu: I was looking at the numbers today. When you joined the meetup last year, it was May or June 2020, and I think it was at 300 to 500 people. Today it's at 5,287. This space is exploding.
Diego: Yeah. Great work. Super excited. I love the fact that we're getting to that level of maturity in machine learning where people are actually thinking about operating at scale and how it's going to be applied. The fact that there's that level of interest, I mean, who could ask for anything better?
Demetrios: Yeah.
It's super exciting. And it is very funny because we talked pretty much one year ago, almost to the day, and you gave an incredible explanation of the build versus buy paradigm, which I still refer back to many times when that comes up because it was so in-depth. It covered all of the bases on what you need to know, especially when you're looking at building versus buying an MLOps tool or just your MLOps infrastructure. We will also link to that in the description below because I think that's a great one. Even though it is a year old, I don't think the greater ideas there have really changed much. So that is very cool. Today, though, we're going to talk a lot about security. I love the way that you look at it. In the community Slack, you're very active whenever security topics come up. That's why I wanted to talk about this with you, because I think you're looking at it in a very unique way, and you're also very, very focused on it. Maybe you can share with us why you're so focused on it. Is it just because you have all these enterprise customers that are asking for it, or were you burned in the past? Do you have any war stories for us? What is it, your fascination or your fixation with security?
Diego: Yeah. So maybe I'll start from the separation in our heads, in how we think about things, between what I would consider machine learning development and production operational systems. People call the first part training, but let's just call it MLDev, because it's bigger than just training. It's everything from data acquisition to building things out. The two are obviously very, very tied together. You can't delink them, because you have to retrain on new data, and it has to be all automated, with all the other goodies that we talk about in the end-to-end processes. But when you're really focused on the operational side, which is what Algorithmia does, there's a key component. An operational system is one that is running all the time. It's built into the operations of an organization, and it has a level of scrutiny from a governance and security perspective that is way higher than any dev environment. And this is not new to machine learning. Think about software development. "The level of scrutiny on operational software is much higher than on apps in development." What DevOps does for the things that run the business is held to a much higher standard; you can allow for more when you're on the development side. So at a high level, we as an organization focus on the operational side of machine learning, and because the operational side of machine learning is so tied to the operations of an organization, you have to live within the security and governance standards of that organization. And this goes even higher if you're talking about regulated industries. In financial services, life sciences, and defense, these operational systems have a level of scrutiny that is way beyond what you generally see in development. That's just the world we live in. And the way I look at it is, five years from now, we won't be talking about machine learning anymore, we'll just be talking about software.
It'll be implied that every single piece of software that we're writing will have some level of predictive analytics or machine learning embedded into it. So what we're really talking about here is operational systems software, and what is going to be required for that to happen and for organizations to bet the house on ML for fraud detection, for sales, for acquisition, for all these use cases that exist. And when we double-click into that, you start getting into what traditionally is IT governance, with the flavor of machine learning. So it's like, hey, how is this strategic to the organization? How are we doing cost controls? How are we securing it to make sure that we don't expose ourselves to operational risk? So that's a bit of a long-winded way of saying why our focus is so much on security and governance: operational software has a high level of scrutiny. And I know you make fun of this, but when you double-click on how many models don't get into production, or why it takes so long to get into production, it's not that putting a model in a Flask app is hard, or putting an API in front of the model. It's actually getting through the security and governance requirements of an organization before you can certify a system as operational. That's the long tail of a lot of these machine learning projects. I know you like picking on that statistic, but the reality is, if you're in a bank, getting through security might take six months for an operational system. So that's why "We take the Ops part of MLOps very, very seriously and it's really about the operational side of the equation."
Vishnu: Yeah, that makes a lot of sense. Before I jump into my question here, I just want to say thanks again for your support. Whenever Diego is posting on a Slack thread, I'm always checking it out. I advise everyone listening, if you're on the Slack, to always check out those threads that Diego's commenting on. And it's funny to look back now, but your meetup about build versus buy was my first meetup in the community. It's crazy to think. That's what I realized when we were talking before the taping about the story of how we met. I was like, I met you, but you never met me, because it was a meetup.
Diego: Got it. Got it.
Vishnu: So it makes sense what you're saying about really focusing in on this operational component of machine learning and how, as that becomes more of a reality, some new risks are being presented. Some new challenges are becoming a reality for organizations in a way that they haven't had to grapple with before. As a machine learning engineer myself, I think one of the challenges is that you start off coming from this world of ML and training algorithms. Then you're first exposed to software engineering and all the best practices there. And then you get exposed to this world of IT and infrastructure management and all these components, like networking, that seem a little bit foreign to you. And now there's this newer component of security operations, with security becoming more and more not just the role of IT, but really the responsibility of the entire organization to help ensure that things are secure. What would you tell an MLE who is learning about these components, or about what I guess we're trying to call MLSecOps? How would you tell them to go about getting started in this realm and understanding their responsibility?
Diego: Yeah.
So, a lot of this is organizational. If you're in a large enterprise, you probably have a DevSecOps team that is existing and established. I would bet the house that that group exists there. When you're starting to think about what you're going to be putting into production and you start that planning process (and this is something I talk about in the build versus buy talk), you have to have the end game in mind. You can try, but figuring it out step by step is a recipe for this thing taking forever. There's always a new door to open up. So on the learning side, one of the things that we suggest to teams is: bring your DevSecOps team into the conversation as early as possible and explain to them what it is that you're trying to do and where this system is going to be. It's kind of unfair on the ML engineer to be expected to literally know everything. And this is not because of a lack of smarts or a lack of IQ or a lack of ability to learn. There's just a lot. Suddenly you have to know everything about all the libraries, you need to know about data science, you need to know about training, you need to know about productionizing, and then on top of that, you need to know about every single software process inside the organization. It's really hard, and there are only 24 hours in a day. And so, "One of the things that we suggest is these are kind of perfect partnerships to make up early and bring in the IT and Ops people into the conversation of how you move things into production as early as possible, because it has a dual purpose."
One, the DevSecOps teams are not super aware of the intricacies of ML. They understand software, they understand IT, but in a lot of cases (sometimes they do) they don't understand deterministic code versus probabilistic code. So it's like, "the code doesn't change, but things change cause the data changed." And so, "there's this dual education process that happens in an organization as you drive through getting more of these workloads, and you have to think about it as an opportunity to educate both sides." You want to get your DevSecOps team and your security teams more up to speed on the world of machine learning and how it's the same, but different in some cases, from a software development perspective. And you also want to learn where the gotchas are going to be. I'll give you a clear example of this, and it's a really basic one. Let's just say you've got a container that you built out with a model, and you plan on deploying it as a Flask app. Okay. What happens now when container baseline images need to be updated and containers need to be scanned for vulnerabilities on a weekly basis? What happens when the system needs to be taken down? A lot of these banks kill all their systems on a weekly basis and repave. So imagine taking down the entire system, with zero downtime, on a weekly basis to repave all the images, because from a security perspective all the patches need to come in. So you're now dealing with a world where you're like, hey, I got the system up and running and I have my APIs running. Great. And now I need to repave the whole thing from a vulnerabilities perspective, just in case, on a weekly basis.
That's not something that you've probably gotten used to, or that you even know how to do. I mean, this is taking down entire Kubernetes clusters and flipping them. These are just things that people have to figure out, and I'll pick on Demetrios again about how long it takes to do these things. These are the things that take so long, because suddenly you get to the end of the line and you're like, okay, we're ready for production. That only took two months. Okay, great. Now take down the entire system. Oh, I need to go build that automation. It's going to take me another six months. That kind of thing. So bring in the team early to understand it. It varies across organizations. If you're in a startup, you don't have these super high requirements, but any company that's increasing its maturity is going to have a CISO, and that CISO is going to be looking at the operational system and determining risk, saying, okay, this is what it's exposed to, this is where we could potentially have attack vectors, and these are our policies to avoid that. And you're not going to get away with not being part of that conversation just because you have the cool new machine learning, because at the end of the day, the risk component here is too big. So you have to either adapt to the process or work with that team to come up with a new process, if that process is not adaptable to the world of machine learning. So, to your original question about learning: bring people into the conversation early and in the planning. We always recommend that by the second conversation that you're having about building out an operational system, you bring in your DevOps and SecOps team.
Vishnu: Yeah, that makes a ton of sense. I think that level of collaboration is key to any kind of organization that's seriously doing ML. Honestly, I'll just say this as a professional: one of the big realizations you have to make is that you can't do everything that has to happen. Sure, the stuff has to happen, and you recognize it, but you've got to bring people in, and you have to figure out how to create leverage for your team and your project. So I totally see how that makes sense from a security standpoint. In particular, I want to seize on one thing that you said there, which is about maturity, and the maturity of the security processes. One of the things that Demetrios and David have spent a lot of time breaking down is the Google Cloud maturity model for MLOps systems. There's a seven-step sort of rubric that they have, and you can score yourself on that maturity: whether you're doing continuous retraining, whether you're able to deploy automatically, all those different things. Do you think that we have such a clear vision for what the maturity of an ML system's security is? And if so, what does that look like?
Diego: The baseline is there. We'll be publishing a new survey soon, and pretty much everybody right now is deploying some level of machine learning with Kubernetes, and Kubernetes itself has a security maturity thing that's going on right now. I mean, it's not like we've been using Kubernetes for the last decade. And that's at the container level, at the networking level, and so on. To answer your question specifically, I do think there's a maturity rubric that applies to ML security, and a lot of it is around the systems that you're using under the hood. There are obviously things that are very specific to machine learning, and we'll probably talk about those things.
They're mostly around how you can affect the model or how you can affect the data. Those are the ones that are very specific to machine learning. But everything under the hood, in terms of compute and networking and access controls and containers, mimics pretty well the security models of any of the other components that exist in the IT world. So as long as you can recognize that split, I think there's a lot that can be inherited from those models. And this is nothing new. These security models are built out and pretty mature in a lot of organizations. A lot of it is around processes: hey, we use these container registries, we use these dependencies. I used to joke around that I've seen super secure companies, with a big, big focus on security, whose data science containers were allowed to bring in anything they wanted from PyPI. And I'm like, well, that's an attack vector, right? Everything's locked down, but you can bring in anything from PyPI. How are you accounting for that? How are you actually making sure that you're not causing issues? These are the details in dependency management, and mirroring dependency package managers is not a new thing either. So that's the maturity that we look at: what ML is running on top of, and the components that we use.
And we can inherit a lot of the security model from those, while understanding that there are certain things around data and the models that are different, and you adapt to those, which is kind of the ethos of MLOps. "To a certain degree, we have general parameters from software engineering and DevOps, and we're adapting them to this new world of ML."
Vishnu: Yeah. Okay. So I'm sorry, Demetrios, but I've got to ask another one, about something that you mentioned there: this concept of dependency management not being a new thing, PyPI being an attack vector. It makes total sense, and I think this speaks a little bit to the development workflow itself. This is something that we talk about ad nauseam in the community: the different flavors of machine learning professionals, their respective roles, and how their workflows impact the ultimate delivery of ML solutions. So we talk about, should data scientists know Docker? Should data scientists know Kubernetes? In your experience, how are organizations that are doing security right dealing with the fact that their data scientists may want a lot of flexibility, their machine learning engineers may be thinking about how to put something into production quickly, and then there are IT professionals saying, I need something that's secure and well specified? How do organizations deal with that sort of professional complexity?
Diego: Well, my answer is going to be super biased, because we literally built a software platform for this. But the way that I like to think about it is that IT and the Ops team are under the CIO, and their job is to provide an operational system, essentially as a service, so that data scientists can move into production, and to provide them that flexibility.
So if you split that world into MLDev and MLProd, and you think, hey, prod is this system that's supposed to be built with the proper guardrails while giving that flexibility, that's what we see in more mature organizations. You can grab models, you can build them, you can have the dependencies, but you pull those dependencies from the right mirror. You can go check it. Security is taken care of for you. Authentication is taken care of for you. The operational system and the scans are already set up. And so, "In the ideal world, you're just sitting in your data science platform, your auto ML platform, whatever it is that you're working with, you can push a model." Through CI/CD or a git push or whatever the automation methodology is there. And that triggers the process of going through the proper security scans and packaging the whole thing in a way that says, look, you have all this flexibility. Maybe one way of thinking about it is that it's like a bowling alley with the guard rails up. You can throw it as hard as you want, and you're not going to be allowed to gutter it, but you still have the flexibility in how you throw it. That's maybe a little dumb analogy, but that's the way I think about it. Let's set up the guard rails so that data science can actually be flexible. If you go back historically and look at, maybe before machine learning, the quants in financial services who would deploy regression systems into production, how did that process work? It was usually written in R or MATLAB. It was given to a software engineer, and they would rewrite it in C, for performance reasons in a lot of cases. But again, there was a rewrite of the entire software, and it was built into the software process that would have security and authentication and all of that. That rewrite can't happen anymore, because we need the speed of retraining and deploying, and we want to give flexibility to the data scientist. So we broke up that slow process of somebody being in charge of hardening everything, and now it's, hey, can we give you a pre-hardened environment so that you can flexibly go in and deploy? I think that's the general trend, being able to provide that.
Vishnu: Yeah, that makes a lot of sense. What I'm hearing is that you have to bake these best practices, these accepted ways of doing things, the secure way or the efficient way of doing things, into the platforms that you're using. And that's really the responsibility of, in an organization, a CIO or a CTO or a chief data science officer, whoever it is that's responsible. They need to bake it in at an underlying level. Is that fair to say?
Diego: Yeah. Yeah. I think automation, automation, automation, and automation including security, authentication, and governance. These are the things where, if you're building it up one-off per environment or per system, first of all, it's completely uncontrollable. Your technical debt gets out of control. You asked about what an ML engineer needs to learn about dev security. Well, "what you don't want to learn is how to do this every single time there's a new use case. Right? Like, that's just not a good use of your time."
Demetrios: So Vishnu took all of the questions that I had, but luckily I thought of some new ones on the fly.
When you were talking there, you were saying some awesome things about how your vision for the future is that everything is going to have machine learning involved in it. There's not going to be the separation that we see right now between DevOps and MLOps. It's all just going to be some kind of implied machine learning, because it will touch everything. Maybe that's in five years, maybe that's in 10 years, maybe that's next month. Who knows? But along those lines, I'm wondering: there are some pretty significant hurdles that we need to clear before that happens, and you were talking about this, so that we feel like when we operationalize something, it is bulletproof. Can you talk about some of those hurdles that need to be cleared?
Diego: Yeah. So "I would argue that there's no such thing as Bulletproof in software, right. That doesn't exist. It never has and never will." Right? But what you can do is reduce risk. Like, "there's no such thing as an impenetrable software system. Period." But you can cover your bases, and a lot of it is around doing exactly that. If you look at it from a risk framework, it's actually really interesting, because essentially there are three kinds of risk that an organization can take. There's operational risk: something goes wrong and I lose a lot of money. There's brand risk: something goes wrong and I look really, really bad. I can reduce those risks by tightening everything. And there's a pendulum, and I can go to the extreme, where I lock down everything and nothing gets into production. It takes 24 months to get anything into production because I've completely locked it down. Now I've exposed myself to what's called strategic risk, the risk of not doing it. And this is particularly important in ML, right?
It's like, hey, what am I losing by not putting this model into production? What am I losing by not doing that? So there's a pendulum between this strategic risk and the operational and brand risk that needs to be navigated. And you can do a lot of that by just making sure that the operational systems you're building take into account the things that avoid the operational and brand risk, which is, you know, "I got hacked." Again, a lot of this is somewhat known, but I'll double-click on the fact that there's no such thing as an impenetrable system. It's software. I mean, I guess unless you're in a completely air-gapped environment, but we've all seen the Tom Cruise movie, so, you know.
Demetrios: So maybe we should zoom in a little bit on what some specific machine learning security risks are. Also, Vishnu just wrote me on Slack saying that if you haven't written a blog post on those three types of risk, we may have to write one, because that is some wisdom right there. The idea of all of these different risks that you're facing, and how you can't be extreme with one or the other, because the more you go to this side, the more risk you're going to have on the strategy side, or vice versa. So that's awesome.
Vishnu: Yeah. It's a trade-off, not necessarily a choice. You don't just pick and say, oh yeah, I'm going to prevent brand risk. It's like, okay, if you're going to do that, you're taking on some strategic risk. You've got to make a trade-off there. So I think that's definitely a blog post.
Diego: Definitely. And this is why I always talk about thinking about the business value of what you're getting into production and what you're doing here. Because if your risk is super high and the ROI potential is low, that's again the trade-off that you need to make. Now, if your ROI is potentially super high, like, hey, we're going to reduce fraud by, I don't know, 90%, and that's a giant number, then you have to make these decisions on risk and reward. There are entire offices inside financial services that are doing this on a daily basis. So how you apply that to these systems is also important.
Demetrios: Exactly. So let's jump into some of these specific machine learning security risks, and what we as machine learning engineers can talk to others about, especially, like you were mentioning, the DevSecOps teams that have to come in and are potentially going to kill your whole project because it doesn't live up to their standards. On the other side, there may be things that you know about as a machine learning engineer that you want to tell them about, so that they're not flying blind.
Diego: Yeah. So I think there are about five big categories, at least today, and I'm probably selling some of the categories short. There are risks around data confidentiality. Did I somehow leak data in any of the processes, in terms of PII, or whatever I'm using, especially if it's sensitive information? And that's no different, in my opinion, from traditional analytics systems, right?
Like you have that same risk and there are ways of locking it down and understanding it and, you know, consider, you know, a lot of companies have this concept of like platinum data and gold data and having just different standards around like who can access it and why they can access it. And, you know, like if you're a credit card processor and you have super high, you know, you probably shouldn't be, you know, you'd be careful with like exposing people's credit card behaviors and stuff like that. And so these are kind of around that data confidentiality you have kind of system manipulation kind of like a, you know, an ML security risk. And this is where you're exposing potentially a recommender to the outside. You're potentially exposing something like an endpoint, right. Where somebody else had your organization can interact with it. And especially in an online learning scenario, like could somebody manipulate it to take advantage of it? So like, could you somehow trick or recommend, or to giving you a bunch of discounts, could you use somehow trick a, you know, kind of like. You know, betting site for potentially that's using a kind of like your model to give you like some different, I don't know, like, you know, like there's like when you're exposing an ML system that takes in inputs and potentially can actually you know you know, respond to it and you can take advantage of it. Like that's the kind of like system manipulation. There are adversarial, examples why like US, which is similar to the one in the system. Start giving it bad examples in the expectation to kind of like veer off the model in one way or the other. Right. And by, you know, by looking at it and you know, there's, "There's a world where you can actually, like, it's, it's pretty hard, but there's a world where kind of reverse engineer, a model by essentially see like, you know, feeding it a whole bunch of data and understanding like where, you know, how that comes back." 
And say, okay, now I actually understand how this model is working, and potentially manipulate it, right? And so these are the kinds of attacks that a DevSecOps person is probably not thinking about, but an ML engineer would start thinking, oh, how do I, you know, do that in the transfer learning world? Like, if you use a baseline model, and I don't actually know of any attacks in this space, I'm just kind of thinking out loud, everybody's kind of using the same base NLP models. So could you, from a transfer learning perspective, go reverse engineer one of these NLP models to make it say something bad? I mean, it wasn't really an attack, but you saw what happened with Tay at Microsoft, and a bunch of those language bots, you know, people got them to be really racist. There you go, to your brand risk, right? If we go back to that brand risk, people were manipulating these models and they had to shut it off because, you know, the last thing Microsoft wanted was a racist bot, once people figured out how to get it to do that. They can poison the data. So these are some of the categories. But they really come down to: can I leak data in some bad way that I shouldn't be, or can I manipulate the system to do something that it shouldn't be doing? And this is where proper monitoring and a proper understanding of what can potentially go wrong with the model is super important. Because, to your point, Vishnu, the DevSecOps people are probably unlikely to understand, you know, this kind of attack. It's like, hey, the containers are secure, the network's secure, the authentication's secure. 
Like, I got everything. Okay, well, here's another level of security that's important. And then, again, going back to that exposure risk: with an internal system, machine to machine, your exposure area is, okay, somebody's a bad actor inside your organization, or something went wrong by mistake. If you have an externally exposed system, now it's a random person from the internet. So you have to kind of measure where that risk is. Vishnu: Yeah, I think it's really helpful to hear, again, how to think about risk and what the different forms of risk are. And one of the things that comes to mind for me, just from a very basic standpoint, is that, to some degree, our impression of security in the entire culture, internet culture and everything, is still kind of stuck in the Nigerian prince era, right? It's like some bad guy is out there to get you. And I think for a lot of employees, especially because it's not like security is taught as one of the first five things you learn in school, or even when you walk into a work environment a lot of times, right? It's just one of those corporate training things you gotta do. I think it's hard to really embrace the security mindset beyond just saying, oh, you know, there are a couple of bad guys. The downside of that is, especially, like, I work at a smaller company. It's easy as a smaller company, or maybe a non-consumer-facing company, to kind of be like, well, we're not really a target anyway, and not necessarily think that this is something that you have to embrace early on. I know that financial services is an industry that you've mentioned a lot as an example. And I could see there, for example, that mindset being sort of aggressive because money's at stake. 
What would you say to, you know, professionals or employees or even employers who may have a little bit of a lax attitude toward security right now? How would you encourage them to change their mindset? Diego: I don't know if I would change their mindset. I would actually just make the trade-off decision, right? Make that risk-reward decision. It takes time and money to figure out security, right? It's a fact. If you're looking at it and you're making this a conscious decision... The problem is when you don't make a conscious decision, right? That's when you get burnt really bad. The lack of awareness is a problem. But, you know, I wouldn't go and say, look, everybody needs to be thinking about security on day one. I mean, that would be ideal. But I think there's a risk-reward ratio here in terms of, look, you look at your system and you look at what's exposed. If I have an internal recommender that's only exposed between my machines, it's kind of like, what's the potential outcome that could be problematic, right? I'll give you a good example of this. Is there a really big, high risk? And I know the Spotify folks are always... so I'm talking out of turn here, but I can't really imagine a world where poisoning the Spotify recommender somehow becomes a problem. Maybe I can think downstream, like, okay, suddenly there's a random artist that gets a bazillion hits, and now they have to pay out royalties to that. I mean... Demetrios: Hopefully it's me. The random artist is myself, hopefully. Take that one out. Diego: There are systems where, you know, the risk-reward, or the cost of getting things done, is just not there. 
So to you, Vishnu, in your company, what I would be looking at is saying, okay, what do we do with machine learning? As long as you're aware of what could potentially go wrong, and how would this be problematic for the organization? And are we an increased target because of it, right? Because eventually everybody becomes a target. The more valuable a system, the more it becomes a target and the larger its surface area. And so it's really about going and figuring that out. I mean, you can say, hey, look, we use ML internally only. There's nothing exposed to the outside world. And in the worst-case scenario, you know, a model goes wild, we capture that and it's not really a big problem. And so why would you go spend a ton of time on security today? That might change over time. I think it's a bigger problem when you're just not aware of what the potential risks and problems are. If you go into it completely blind and then suddenly get bit, that's where, you know... So I think you can make a conscious decision. Demetrios: Well, that's... So my question is, as the machine learning engineer in this situation, how do you properly think through the situation that you're getting yourself into? How do you know all of these data points? Is it just by educating yourself on ways that you can mess up, or talking to different DevSecOps people? How can I, as a machine learning engineer, be more conscientious when I'm trying to build these systems? Diego: So I think you always start with the end result, right? What are you actually doing? What's the business case for this machine learning workflow? What am I affecting? And when you look at that use case, again, just forget about technology for a second. 
Like, you're looking at the use case and being like, okay, now that I'm looking at the use case, what are the risks of getting this wrong? And then, when I'm looking at the risk of getting it wrong, how could I get it wrong? You start going down there, and what you find in that kind of workflow thinking, from the end result backwards, is that you'll start exposing a lot of places where this could go wrong. And then, if you want to get really precise about it, and machine learning engineers are pretty good about this, you can say, let me make a probabilistic assessment of where I think these things could go wrong, work backwards, and then make a decision on it, you know, kind of a cost-benefit analysis. I'm probably being more prescriptive here than necessary, but this is something that you should be able to understand. If what I'm building is a fraud system, right, the risk of getting it wrong is big. So, okay, who's exposed to that fraud system, how do they get involved, and who could actually access it? These are the kinds of things where you look at the end result of the workflow and understand the value of that workflow, which you should know at that point, right? Because if you're going into an ML workflow without understanding what the end value is going to be, it's not a good sign. And now you can understand, okay, if I start working backwards, where are the risks? That's how I would educate myself in terms of what could potentially go wrong there. And you'll see that as you work through that, you'll find that there are a lot of opinions, and there are a lot of people who are going to be involved in kind of figuring it out. 
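Editor's sketch: the probabilistic, cost-benefit assessment Diego describes can be written down in a few lines of Python. This is only an illustration under stated assumptions, not something shown in the episode. The `Risk` class, the helper functions, and every number below are hypothetical; real estimates would come out of conversations with your security and business teams.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Risk:
    """One way the ML workflow could go wrong (all figures are made-up estimates)."""
    name: str
    probability: float      # estimated chance of it happening in a year, 0..1
    impact: float           # estimated cost to the business if it happens
    mitigation_cost: float  # estimated cost of locking it down

    @property
    def expected_loss(self) -> float:
        # Probabilistic assessment: likelihood times impact.
        return self.probability * self.impact


def rank_risks(risks: List[Risk]) -> List[Risk]:
    """Sort risks so the largest expected loss comes first."""
    return sorted(risks, key=lambda r: r.expected_loss, reverse=True)


def worth_mitigating(risk: Risk) -> bool:
    """Cost-benefit check: mitigate when the expected loss exceeds the cost."""
    return risk.expected_loss > risk.mitigation_cost


# Hypothetical numbers, listed while working backwards from the end result.
risks = [
    Risk("internal recommender poisoned", 0.01, 10_000, 5_000),
    Risk("PII leaked through training data", 0.05, 1_000_000, 20_000),
    Risk("public endpoint manipulated", 0.10, 50_000, 15_000),
]

for risk in rank_risks(risks):
    decision = "mitigate" if worth_mitigating(risk) else "accept for now"
    print(f"{risk.name}: expected loss {risk.expected_loss:,.0f} -> {decision}")
```

The point of the sketch is the conscious decision: a risk that is accepted for now is still written down, so it can be revisited when the system's exposure or value changes.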
Vishnu: Yeah, this sounds a lot like the five whys that Toyota does, you know, where it's like a root cause analysis. It's a similar sort of thing that you can apply in the ML security realm. And it kind of gets me thinking this is a really cool workshop or blog post or something for us to do. In the same way, there's an article that we read in the reading group, Continuous Delivery for Machine Learning, a Martin Fowler post where they talked through how you do CD for ML. And I think it would be really cool to do the same thing for a sample ML system and say, well, what does security look like? What are those questions, and going through and answering them. I think we may have our next point of collaboration, Diego. Diego: Yeah, absolutely. Look, pretty much every single large company that builds software has threat modeling associated with building a new component of the offering, right? And I think there's a natural next step here where there's threat modeling for ML systems, and it's a task that gets built and understood, and nobody's going to enjoy doing it. But it's kind of a necessary evil. In my experience working at Microsoft, there was not a single piece of software we could ship that didn't have an associated threat model with it, right? What's the surface area, what's the potential attack surface area, how do we get through that? And this was a collaboration between the technical PMs like myself and the InfoSec and DevSecOps teams, where they would ask you a bunch of questions and help you build out a threat model around it. And I think that threat modeling scenario for machine learning is clear. 
Like, if that's going to be the future of software, you know, we're going to have to be building out these threat models. Yeah, I love that: threat modeling for MLOps. Vishnu: That's it. That's the title of our blog post. Demetrios: Yeah, and there are a lot of blog posts that we're going to be writing after this one. Vishnu: So much content. So, a slightly different direction I kind of want to take this in, and that's really around going back to that idea of baking in best practices. Something that's, I think, inspiring a lot of people in the MLOps world is how DevOps evolved with infrastructure as code. So something like AWS CloudFormation and CDK and Serverless and a lot of other DevOps frameworks. What they did was take a manual process that operations professionals and development professionals were collaborating on and turn it into a codable process, turn it into something that could take advantage of the beautiful properties of code: anybody can write it, it can be version controlled, and it can be scaled. So my question to you is, we've seen this happen with infrastructure. We've seen it happen with other components of the entire software creation process. Do you see a future where we have almost, like, security as code? Is that already a reality? Is that possible? Diego: Yeah, I mean, I think even as a component of infrastructure as code, security is baked into that. If you think about how you're developing something in AWS or in GCP, you start with setting up all the permissioning and IAM roles. That's step one. It's annoying, but it's true. In the world of infrastructure as code, security and auth is a big, big part of it. 
And so there's a question now of, okay, can we build that same concept into our machine learning workflows? And I would say yes. I mean, I'm a little bit biased because that's what we've done with our platform, which is, you know, how you actually go deploy and run: security is built into it, right? So for every model that you deploy in Algorithmia, authentication is already stood up. Audit controls for all the models, who's calling it, when, with what data, are already set up for you. Which package managers you can access and use, and how those dependencies get managed, are already set up. The use of the source code management system underneath, which will have all the code scans and the Dependabots and all these kinds of things, is already set up. And so that's the concept of automating the security layer and governance layer of your production pipeline, so that, from an ML engineer's perspective, the ideal world is: I've got a candidate model, I pushed it into production, right, and everything else was just done. Fully automated. Vishnu: Yeah, so it sounds like what you're saying is that there are certain areas of the threat model that you have basically managed away or abstracted in Algorithmia. And I really appreciate some of the examples that you mentioned around the package managers and the dependencies. It's pretty powerful to be able to do that. Demetrios: So I wanted to jump in and talk to you, at the risk of talking about everything and us not having more to talk about the next time we chat. But I think that's impossible, because I originally told Diego, hey, let's try and do this for, I don't know, an hour and a half, two hours. He was like, whoa, whoa, whoa, I can't have a podcast be longer than a workout session. 
And apparently he's also busy running a company, so we can't take up too much of his time. But I do want to talk to you real fast about this idea that you told me about the last time we spoke: the MLRE. Can you shed some light on that? Diego: Yeah, so, okay. Let's think about who gets called when an operational system goes bad in an organization today. There's an entire world of people who are SREs, right? I mean, they have, you know, pagers, response times, uptime, this whole concept of: I am the first line of defense, something goes wrong, I'm getting woken up, I need to jump on it, right? And especially with cloud and SaaS, this is core to how many companies operate. I'm sure you guys have them as well. You have these SRE folks that are inside your organization and responsible for that. So what happens now when you need somebody who is able to react like that for an ML application? The traditional way that this would work would be: hey, something went wrong with my operational system. I wake up, I look at it. Hey, was it a networking problem, or any of those kinds of operational problems? And no, it was something with the model. I'm going to go wake up the data scientist and say, hey, something's wrong with this model, you should go look at that, right? And then the data scientist will go in and investigate. Okay, what happened with the model? Was it a data problem? Was it a feature problem? You know, what was it? Why the drift? I think as more and more real-time systems get stood up, we're going to have quick response teams that are more trained in machine learning but are really responsible for the operational application, right? 
Like, if your entire business is based on a recommender system to sell more stuff, you're probably going to move some people to be kind of like an MLRE, an ML reliability engineer, which means you own not just the ML model but also the application that it's running through, and you are that first line of defense, right? And maybe all you do here is, hey, look, this new model, especially when models are being published automatically, right, so if you think about online learning, it's like, hey, something went wrong, I need to roll it back, I need to go do something with that system. I think we're going to start seeing SREs that are more ML skilled. I see this world of SREs that are going to start learning more and more of the ML skills, what can go wrong in the ML world, and being more conscious of that. So I don't think it's ML engineers becoming SREs. I think it's actually SREs becoming more ML conscious. Vishnu: Yeah, I think that's a brilliant point. And I can already see the points of interface where that can occur, right? If you're doing continuous training or something of that sort, your training set gets polluted by some sampling bug, you have an unexpected drop in performance, and it turns out that the test set that you sampled may not necessarily have been the best possible sample. And I think it's interesting, actually. I like looking at papers from KDD, the conference, because you always get interesting MLOps papers: backwards compatibility, changes between different updates and how that breaks certain systems. There's a lot of research going on, and you could see how the MLRE could start to be the interface between production and development environments. 
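Editor's sketch: the "something went wrong, I need to roll it back" reaction of an ML-aware SRE can be illustrated with a small Python example. Everything here is hypothetical and simplified: `ModelRegistry` is an in-memory stand-in rather than any real platform's API, and the 0.05 threshold and the metric values are assumptions picked for the illustration.

```python
from typing import List, Optional


def should_roll_back(baseline: float, recent: List[float], max_drop: float = 0.05) -> bool:
    """True when the mean of recent metric values sits more than max_drop below baseline."""
    if not recent:
        return False  # no evidence yet, so do not act
    recent_mean = sum(recent) / len(recent)
    return baseline - recent_mean > max_drop


class ModelRegistry:
    """Hypothetical in-memory registry that tracks published model versions."""

    def __init__(self) -> None:
        self._versions: List[str] = []

    def publish(self, version: str) -> None:
        self._versions.append(version)

    @property
    def live(self) -> Optional[str]:
        return self._versions[-1] if self._versions else None

    def roll_back(self) -> Optional[str]:
        # Keep at least one version live; otherwise drop the newest one.
        if len(self._versions) > 1:
            self._versions.pop()
        return self.live


registry = ModelRegistry()
registry.publish("v1")
registry.publish("v2")  # e.g. published automatically by continuous training

# Assumed numbers: v1 scored 0.90 on the monitored metric, v2 is doing worse.
if should_roll_back(baseline=0.90, recent=[0.80, 0.82]):
    registry.roll_back()

print(registry.live)
```

In this sketch the person on call does not need to diagnose why the model degraded. They only need enough ML awareness to recognize the drop and take the first action, after which the data scientist can investigate the cause.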
Diego: Yeah. And Vishnu, are you on PagerDuty? Like, do you have a pager for your workload, or is it somebody else who does? Vishnu: It's not me. It's somebody else. Diego: Yeah, exactly. Right. That's the most common answer that you're going to hear from ML engineers and data scientists. So, okay, great, now who's that person wearing the pager, and how far can they take their debugging or action? How quickly can they react, or do they depend on bringing you, Vishnu, into the picture? And so I think this idea of a skill set for SREs that are going to be more ML focused is going to allow that quick reaction time to problems. And that's kind of where I see that going. I didn't mean to put you on the spot. It's just a very natural thing that we see: data scientists don't usually wear pagers. Vishnu: Yeah, no, for sure. I totally get that. No worries at all. Yeah, I don't wear that pager and I don't wish to. I don't want to have that brick of lead on me. Demetrios: He's happy just passing the buck, for sure, handing it off to somebody else. That's somebody else's problem. Others have said, you know, that mentality definitely needs to come into the picture more, as you start to realize, hey, I'm responsible for the result and the output of whatever I am creating. I'm not just responsible for my one little piece. And so having this missing piece of the whole team, I think, is a really interesting part. And I like how you're talking about it as going from an SRE perspective and getting more involved with the machine learning side, as opposed to trying to go the other way. I think that's a really important point. So this has been awesome. 
Diego, once again, you do not fail to give me so many things to think about. I cannot tell you how much I'm probably going to reflect back on this, and then reiterate what you say here in many of the podcasts to come and many of the different meetups, because there are so many gems in this. I think we just barely scratched the surface of the blog post that we mentioned before. So if anyone wants to really get down and dirty with all of this security stuff, and have it be more focused, since we went off on a few tangents there, the blog post is in the description. Have a read of it. And also, it would mean a lot to us if you gave this video a thumbs up, or if you're in podcast land, subscribe, so we can keep doing more of it. Again, Diego, thank you so much for the support. Thank you, Algorithmia, for sponsoring the community. It is absolutely amazing to see the amount of weight that you put behind this and the wisdom that you bring, not only anytime you jump in a thread on Slack, but in these conversations that we have. So thanks again. Diego: Love chatting with you guys. Thanks so much for the invite and the conversation. And I look forward to the next one. Vishnu: Lots of blog posts to come. Demetrios: That's what we get to promise you. We're always saying we're going to write blog posts, but we never really do. I mean, we have, kind of. Vishnu: Positivity. Come on. Demetrios: There we go. Vishnu, this one's on you. Vishnu: Fine. Demetrios: I'll just throw it over the fence to Vishnu and let him take it. But we'll see you all later!

In this episode

Diego Oppenheimer

CEO & Co-Founder, Algorithmia

Diego Oppenheimer is co-founder and CEO of Algorithmia. Previously, he designed, managed, and shipped some of Microsoft’s most used data analysis products including Excel, Power Pivot, SQL Server, and Power BI. He holds a Bachelor’s degree in Information Systems and a Master’s degree in Business Intelligence and Data Analytics from Carnegie Mellon University.



Demetrios Brinkmann


Demetrios is one of the main organizers of the MLOps community and currently resides in a small town outside Frankfurt, Germany. He is an avid traveller who taught English as a second language to see the world and learn about new cultures. Demetrios fell into the Machine Learning Operations world and has since interviewed the leading names around MLOps, Data Science, and ML. Since diving into the nitty-gritty of Machine Learning Operations, he has felt a strong calling to explore the ethical issues surrounding ML. When he is not conducting interviews, you can find him stacking stones with his daughter in the woods or playing the ukulele by the campfire.

Vishnu Rachakonda


Vishnu Rachakonda is the operations lead for the MLOps Community and co-hosts the MLOps Coffee Sessions podcast. He is a machine learning engineer at Tesseract Health, a 4Catalyzer company focused on retinal imaging. In this role, he builds machine learning models for clinical workflow augmentation and diagnostics in on-device and cloud use cases. Since studying bioengineering at Penn, Vishnu has been actively working in the fields of computational biomedicine and MLOps. In his spare time, Vishnu enjoys suspending all logic to watch Indian action movies, playing chess, and writing.