Finding an ML model that solves a business problem can feel like winning the lottery, but it can also be a curse. Once a model is embedded at the core of an application and used by real users, the real work begins. That's when you need to make sure that it works for everyone, that it keeps working every day, and that it can improve as time goes on. Just like building a model is all about data work, keeping a model alive and healthy is all about developing operational excellence.
First, you need to monitor your model and its predictions and detect when it is not performing as expected for some types of users. Then, you'll have to devise ways to detect drift and gauge how quickly your models get stale. Once you know how your model is doing and can detect when it isn't performing, you have to find ways to fix the specific issues you identify. Last but definitely not least, you will be faced with the task of deploying a new model to replace the old one without disrupting the day of all the users that depend on it.
A lot of the topics covered are active areas of work around the industry and haven't been formalized yet, but they are crucial to making sure your ML work actually delivers value. While there aren't any textbook answers, there is no shortage of lessons to learn.
Performance metrics can't always be directly correlated to training metrics
Models suffer from a "curse of success": the more successful a model is, the more likely it is to be retrained, and the harder it becomes to retrain it without breaking someone's workflow
You develop operational excellence by exercising it
**What's happening, Adam? How are you doing, man?**
**Very good, very good. How are you?**
**I am trying very hard not to make the joke about what your last name means. [laughs] That is how I’m doing right now. People are sick of it. It's just basically “insert random animal” and “random country” and say that’s what your last name means. [laughs] If it is an animal that has to do with the sea or the ocean, even better. But anyway, we're here today. We just got off this podcast with Emmanuel Ameisen, who – wow – he blew me away. I don't know about you.**
**I think “wowzers” is the word. Yeah, definitely.**
**All right. So some top takeaways from you, and then I'll give you mine.**
**Yeah, he obviously wrote the book on machine learning powered applications – well, wrote _a_ book – which is very favorably reviewed. He obviously knows his stuff. And he's at that end of the MLOps space where he's worked in really quite mature organizations. That's quite interesting. Lots of interesting takes on how to go from the clunky, hard, “how do you get value out of the model” process to flying, and the hard yards you put in to go from that starting point to where we kind of want to be, which is really quite cool. He’s got quite a pragmatic approach to things, actually, like how to align your metrics to get value, and how to _do things_ properly. I think those were the big ones for me.**
**Yeah. It's funny that you mentioned that because it was very much like, “Okay, this isn't _really_ the zero to one phase – this is more like the one to infinity phase” that he is trying to fine-tune right now at Stripe. He's doing an incredible job. And I think probably one of the best quotes that I've heard in a while was (before we actually started recording), he mentioned, “You develop operational excellence by exercising it.” Then I asked him to go into it, and he goes _hard_ at what that means and how he implements it. I just loved it, man.**
**So let's jump into this. Before we do, there are all kinds of things I want to announce. One is that we've got all kinds of cool swag in our shop – on the website. Go check it out. Adam’s baby has been wearing it. It's our best model. [chuckles] That's incredible to see. We've got baby clothes, but we also have big kids’ clothes and big humans’ clothes.**
**The next thing I will mention is that we're looking for people to help us take some of these podcasts, choose where the best parts are, and create snippets out of them – basically a highlight reel for those who don't have time to listen to the _whole_ podcast and just want the ‘quick and dirty’, ‘best of’ version. So if you are up for telling us your favorite snippets of the podcast episodes, get in touch with me, because we would love to have you help out. And that's it. Let's jump into this conversation with Emmanuel.**
**Yeah, man. Let's talk about your book – Building Machine Learning Powered Applications. What inspired you to write this?**
Yeah. I used to work at a company called Insight Data Science. I joined it after being a data scientist myself for a bit. And that company – the whole goal was machine learning education, and more specifically, professional education, so teaching people how to actually do machine learning in a corporate setting to actually deliver value and then get hired for it. And so, the way it worked is we would lead projects with people that wanted to get hired as machine learning engineers and data scientists – oftentimes applied projects in partnership with companies. And, you know, we'd build a machine learning application to do text classification, email classification for support tickets, computer vision, or even reinforcement learning.
We kind of touched a _broad_ range of applications. And I started seeing that the failure modes of all of those applications, the ways in which things would go wrong, actually had a lot more in common than I thought. Initially, I thought that maybe every machine learning application was its own special jewel. And then I realized, “No, success criteria and failure criteria are _pretty_ consistent across the board, at least in some ways.” That felt interesting. So that's why I wanted to write about that. Before I started writing, I saw that there wasn't much being written about it, so that was further motivation. I was like, “Oh, this is an interesting topic and also I can't find resources on it. So I'm going to try it.”
**I always find it fascinating when you talk to people that have written books because it's such an undertaking – a document that size. How did you find the writing process? How did you go about starting, as well? Did you start writing and then find a publisher? And what would be the next book you'd write?**
Oh, man. Okay, how did I start writing – I'll start there. Because I feel like I didn't really… Well, I always wanted to write a book eventually – at some point – but I didn't want it then. What happened was, I was actually getting pretty frustrated with NLP projects, and how 1) they all tended to look the same (the successful ones) and 2) everybody always wanted to overengineer them. This was like three or four years ago and it was wonderful because for most NLP projects (natural language processing projects) you can do stuff that's pretty simple and get just _amazing_ value immediately. But everybody was excited about incredibly complex architectures.
I wrote this blog post (I don’t really remember what it was called) something like “How to solve 95% of NLP problems,” and it was a tutorial of like, “You just do this, and then you do this. And if that doesn't work, you do that. And then that doesn't work just do that.” It was based on literally dozens and dozens of NLP projects, and just seeing them succeed and fail. And that blog post just took off. I think now it probably has like half a million reads or something. People just really liked it. And O'Reilly, the technical publisher actually reached out to me and they were like, “Hey, we love your blog posts. Do you want to write a book?” So that's how I wrote – that's how we got started with writing. Yeah.
**Just before you tell us what your next book will be – do you think it's changed since then? Do people still want to overengineer NLP?**
I think yes and no. So I think what's cool about ML is that it is a field that's evolving fast. And so the kind of easy, “Hey, just do _this_. Don't do the complicated thing, just do _this_.” That solution gets more elaborate as time goes on. To give you an example, think like 10 years ago, you'd say, “Oh, you want to do review classification or something. Just do something called TF-IDF. Do counts on words. Have some vector for each sentence – count the words, count their occurrences – and then you train a classifier. And that's fine. That'll work.” Then maybe five years later, papers like Word2Vec and stuff like that came out, where you could actually download pre-trained embeddings that were really good. And it was like, “Oh, don't train your own model, just download these embeddings and use them as-is. It will take you like an hour. It's easy.”
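To make that concrete, here's a minimal, self-contained sketch of the TF-IDF idea he describes – count words per document and weigh down terms that appear everywhere – with no library dependencies. The function name and toy documents are illustrative, not from the episode.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn each document into a sparse TF-IDF dict: word -> weight."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        # Term frequency times inverse document frequency.
        vec = {
            word: (count / total) * math.log(n / df[word])
            for word, count in counts.items()
        }
        vectors.append(vec)
    return vectors

docs = [
    "great product works great",
    "terrible product broke fast",
    "works well great value",
]
vecs = tfidf_vectors(docs)
# "product" appears in two of three docs, so it gets a lower weight
# than words that are unique to a single document.
```

These vectors would then feed any off-the-shelf classifier, which is the "simple thing that works" he's pointing at.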
We're slowly – we're not exactly there yet – getting to the point where maybe that version is like, “Oh, just make an API call to OpenAI or some service and they’ll have deep models that you can just kind of use.” So that's changed. But what I feel like has always remained true is that machine learning engineers are incredibly good at self-sabotaging. Whatever the simple, rational thing that you could do to solve the problem, they're like, “No, no, no. I'm gonna build my own 16-layer thing. First, I’m going to buy a supercomputer, then I'm gonna train on it for like three months and do this thing.” And you look at it like, “It's been a year. What have you produced?” And they’re like “Nothing, but I burned $100,000 of cloud credits.” I think, somehow, we're still always excited about the fancy stuff. So as the fancy stuff becomes normal, it becomes boring, like, “Oh, I don't care about this anymore. I want the _new_ fancy stuff.” So that hasn't changed.
**Do you think that's intentional? I have a bit of a view on that. Years ago, I used to do talks about it… I used to call it something I probably can't say on a family-friendly podcast – but it was about this idea of CV-driven development, where you get people choosing a solution for their CV rather than for the problem. Do you think it's intentional? Or do you think it's actually just a natural component of the world we work in, where you've got these interesting things that are technically complicated, and it's easy to go into a complex solution?**
I think it's a bit of both. There's definitely resume-driven development – for sure. I think, for a while (and still now) there was this perception – again, I worked with people that _wanted_ to get a job in machine learning and so one of the things that maybe they believe is they’re like, “Well if I want to get a job as a machine learning engineer, I have to find _the most complicated_ ML solution and implement it. I'm not gonna get hired because I took some pre-trained model and used it for something super useful. I'm gonna get hired because I invented a new type of machine learning model (or whatever).” And that's not true.
In fact, what happens is that, people that will be hired and will be successful are the ones that were like, “Hey, here's what I _achieved_. It doesn't really matter how I do it, but here’s what I achieved.” So I think that resume-driven development is a part of it. I think the other part of it is that this stuff’s just cool. A lot of folks that work in this industry (me included) just probably like nerding out on this stuff. If I'm just doing something for fun, yeah, for sure I'll go for something that's overkill and complicated, because it's fun. So I think that's a natural tendency that when you're trying to deliver on impactful and important projects you have to fight. You need to really ask yourself, “How can I make this simpler?” Rather than “How can I make this fancier?”
**So basically – there ain't no shame in using a pre-trained model. Really, the idea is getting down to what metrics matter… How are you moving the needle? And how can you prove that you're moving the needle? Which leads nicely into some questions that I wanted to ask you. As you're now working at Stripe and you're looking at how your machine learning team is moving the needle – how do you look at KPIs? How do you attribute things back to the machine learning team? What does that look like when you interact with stakeholders, so that you have something to bring them?**
Yeah. Hitting me with the easy questions, huh? Okay.
**We've already done… This is like year two of the podcast. So we already got all the easy ones out of the way. [chuckles] You didn't realize. You should have been on here a year ago, I would have given you the real softballs.**
Tough, tough. Well, I think there are a few ways to tackle your question. One thing is – it's always easier, especially in machine learning and engineering in general, but especially machine learning, if you can tie whatever you're working on to some number that the company you work for cares about. In a lot of cases, that number is some version of revenue. Sometimes it can be other things – it could be cost-related, it could be security-related. I work on a fraud team so it's not always obvious to say, “We've improved our fraud prevention – how many dollars does that make us?” It's not always straightforward. But anyway, I think being able to tie it to that number is important.
One of my mentors, when I was just getting started, had this saying that was really cool, which was basically, “Write the press release before you start doing the work.” And I think that helps a lot in those situations where, before you pitch some new model or some work to your manager, or try working on something, you just write the email you're gonna send to the company as if you were done with the work. Sometimes, you'll realize that that email sucks. It'll be like, “Well, I spent six months building the system. Now, I guess, we’re kind of slightly better at doing this thing that some people care about. Maybe.” If that's the email – then just don't work on it. So I found that that's really helpful. If you can write the email in advance and be like, “Wow, this would be _really_ cool,” then that kind of motivates the project and helps make sure that you _will_ have something to go to stakeholders with. You've already written it, so you just need to execute it.
**Not like, “Yeah, we spent six weeks and turns out this is hard.”**
Yeah. [chuckles] That’s right. Well, that's the other thing, though. The other way to handle your question is to just say, “Just reduce the iteration cycle.” Maybe that's one of the ones that you got out of the way with two years of running the podcast and many machine learning folks coming on. But if there's one thing that's been a constant in my career, it's that the shortest iteration cycle wins. And making the iteration cycle shorter will make you win. As much as you can say, “We wasted our time in the last six weeks” – that's not a great look. But if you're like, “Hey, I ran an experiment. Took me five hours. It wasn't good.” That's fine. Who cares?
**So how do you shorten the iteration cycle? What are some tricks that you’ve found to do that?**
I'm a big fan of automating everything you can. It's hard – especially, what I've found is that it gets harder as companies get more mature, when you have a bunch of business processes or a bunch of different outcomes that you care about that aren't just, “Hey, is this model good?” but also looking at different slices of traffic, different key users – that sort of stuff. But as much as you can, automating away every single step is really what matters. There are two different iteration cycles that I often think of. One is the experimentation cycle. You want to get into a world where, if you want to try a new feature for an important model or a new model for some task, you can get your answers really quickly. So for that, you want to automate: How do you gather data? How do you generate your features from the data? How do you train the model? How do you generate evaluation metrics?
Oftentimes, when you start on teams, these are like 12 different things that you have to do and maybe have a checklist that's like, “And then download this thing. And then you go there and you pull this branch, and you do that thing.” And, of course, it takes forever and if you do one thing wrong, you have to redo the whole thing. So just kind of automating the glue there and the system. And then one thing that we worked on a lot last year is the iteration cycle of deploying your models, which I think is often maybe not taken care of as early as it should. So how do you make it so that if you have a new cool model, it can just be in production and you're not worried about it without you spending weeks of work to get there, or you’re not getting woken up at 3 AM when you deployed the model and it was terrible? So I think those are the two loops that you kind of shorten.
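As a rough illustration of that first loop, the point is to glue gather-data → build-features → train → evaluate into a single call, so an experiment never involves a 12-step checklist. Everything here (the function names, the toy data, the one-feature "model") is a hypothetical stand-in for the real steps:

```python
# A minimal sketch of gluing the experimentation steps together so that
# one call runs the whole loop; each step is a stand-in for real work.

def gather_data():
    # Stand-in: in practice this would query a warehouse or read a snapshot.
    return [(0.1, 0), (0.9, 1), (0.2, 0), (0.8, 1)]

def build_features(rows):
    # Stand-in feature pipeline: wrap the raw value in a feature tuple.
    return [((x,), y) for x, y in rows]

def train(examples):
    # Stand-in "model": classify by thresholding the single feature
    # at the mean of the training values.
    threshold = sum(x[0] for x, _ in examples) / len(examples)
    return lambda feats: int(feats[0] >= threshold)

def evaluate(model, examples):
    correct = sum(model(x) == y for x, y in examples)
    return correct / len(examples)

def run_experiment():
    rows = gather_data()
    examples = build_features(rows)
    model = train(examples)
    return evaluate(model, examples)

accuracy = run_experiment()
```

Once the loop is one function, trying a new feature or model means editing one step and re-running, instead of redoing a manual checklist.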
**So that leads to another interesting question. I completely agree – I think that is usually where the big wins lie. And they're the ones that scale, right? They're the things that remove all the barriers to scale, especially if you're starting out. It can be scary, though, to do it with critical stuff. I think it's interesting that you work in finance, because there are a lot of crossovers – like, in the energy industry, we are tied into national critical infrastructure. Basically, we get fined out the wazoo if anything breaks, so automating things and iterating on things can get scary. So do you have advice for how you tackle that, other than “Leave it alone and let it creep over in the corner”? [chuckles]**
Yeah, the first option is just to never touch it. “We deployed this model eight years ago. The person that deployed it left the company and we haven't touched it. If we ever have to change it, we'll probably just go bankrupt because we'll break everything.” Yeah, that's one option. [laughs] I actually think that, in many ways, automation will make it safer. So I agree with you. We have the main model that I work on, which is the model that powers what's called Stripe Radar. Basically, it decides, for every transaction on Stripe, whether we allow it or block it. So you can imagine that messing up the deployment of that model is pretty bad – potentially, absolutely tragic.
So we have to be pretty careful not just to make sure that we deploy a model that is good on average – what if it's good on average, but it decides to block all payments to every podcast provider on Stripe? And it's a small enough slice that we didn't notice it. That would be really, really bad. So there are a lot of failure modes. But I think automating a lot of the things ends up making it a lot safer, because what happens when you have a lot of failure modes and a lot of things to think about is that they get distilled into the team's knowledge. And it's like, “Oh, Adam knows about this particular thing. Make sure that you ask him, because you have to run this one analysis – I don't really remember – it's based on this query. It's kind of broken. But last time we didn't run it, something went wrong.” And then, “Oh, and this thing happens.” And you have a bunch of weird tribal knowledge, and if any of it goes wrong or goes stale for any reason, everything goes to hell. So I think automation is safer.
The other thing I'll say is, you asked, “How do you do it?” One thing that we found really helpful is – before you automate, you suggest. That's kind of the motto. And that works for machine learning, too. Let's say that when you deploy machine learning models, usually you deploy them with a threshold – it depends – with classifiers. You might say, like for fraud, “Anything above this score is positive or negative.” Maybe you want to select a threshold automatically, but you're not too sure. So initially, you write your automation and all you do is – you have a human in the loop and you suggest that value to them. And you do that over a few cycles. And once a few cycles go by and they say “Yeah – actually, I never change it. It's always good.” Then you can feel more comfortable automating.
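A sketch of that "suggest before you automate" pattern for thresholds might look like this: the automation computes a candidate threshold from held-out scores and a human reviews the suggestion for a few cycles before it is ever applied directly. The function name, precision target, and numbers are made up for illustration:

```python
def suggest_threshold(scores, labels, min_precision=0.95):
    """Suggest the lowest score threshold whose precision on held-out
    data still meets the target; a human reviews the suggestion before
    it is applied in production."""
    candidates = sorted(set(scores), reverse=True)
    best = None
    for t in candidates:
        # Labels of everything we would flag at this threshold.
        flagged = [y for s, y in zip(scores, labels) if s >= t]
        if not flagged:
            continue
        precision = sum(flagged) / len(flagged)
        if precision >= min_precision:
            best = t  # keep lowering the bar while precision holds
        else:
            break
    return best

scores = [0.99, 0.95, 0.90, 0.60, 0.40, 0.10]
labels = [1,    1,    1,    0,    1,    0]   # 1 = confirmed fraud
suggestion = suggest_threshold(scores, labels, min_precision=1.0)
```

Once the human has accepted the suggestion unchanged for a few cycles, wiring `suggestion` straight into the deployment becomes a much smaller leap of faith.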
**Yeah, I completely agree – for some of the big critical stuff like that, where it's possible. When I went into insurance, that was kind of the only approach. People weren't confident in taking their hands off the controls. And I kind of got to the point where I thought, “Maybe that's the only way to do this, actually. Let's not think about full automation just yet.” I find it really interesting, actually – going off-piste a bit here – how similar your language about doing this stuff, and especially about failure modes, is to the chat that we had with Mohamed Elgendy a few weeks back. He talked about really similar stuff, and that idea of failure modes – it's really interesting to hear. Demetrios and I spoke with him about building confidence around testing, and he said that actually identifying failure modes is the key part – it's certainly something to check out. When there are similar lines of thought like that, it's a sign it's obviously the right approach.**
Yeah, I feel like there's a common thread in operational work where, to succeed, you just obsess about the potential failures and reducing their likelihood.
**So, can we dive in for a moment to what your actual infra looks like at Stripe and how things are going there? Like the nitty-gritty of it. I guess we can maybe start off with you giving us a little bit of a background of what you're working with. But really – you've been at Stripe for a while and I imagine you, as you mentioned, you've been iterating quite a bit. So how has the iteration looked from the infrastructure side? What have you seen, as you mentioned in the beginning, when you got there? Potentially, there were 13 steps on data collection. I imagine you've cut that down a little bit or you automated it so it's not as painful. What were the low-hanging fruits when you iterated on some of the stuff in _your_ specific use case?**
Yeah. So we started from a pretty privileged place, I would say. Because Stripe is certainly a larger company than the median company, I guess, statistically speaking. So what that means is there's a lot of infra. So, we have compute infrastructure and also batch compute infrastructure, meaning we have teams that handle orchestration tools and infrastructure tools like Airflow. We have teams that handle a model training and serving service where you can train a model and with one function call have that model be an API that you can call live. So when you deploy your model this kind of happens magically for you.
We have a feature team – a feature computation team – that has a framework where you can define your features offline, and once you've defined them, they're available to you online. These are pretty gnarly problems that I'm pretty happy we don't have to solve on our team. So that's always been a tremendous help. For context, last year we actually spent quite a bit of time focusing on reducing both of those feedback loops – prototyping and deployment – and almost all of it, on both sides, ended up being in these layers of glue that I mentioned. Let me give you an example. Let's say you have a new feature, you've tried it, it helps the model a lot, and you want to deploy it. So first you’ve got a notebook somewhere, you trained your model with the new feature, and you saw it was better.
Then you'll have to get that code and merge it into our production code without breaking our production training code, and that code will then train a model. Once you have that model, you'll have to score it in a bunch of tasks against our current production model – like we talked about, there are _a bunch_ of different slices of the world that you'd want to compare. So you use, let's say, ad hoc Airflow jobs to score millions and millions and millions of charges – expensive jobs that take time. Then, once you've scored all these, you're going to have to run your analysis on it. So you have a notebook, you have some queries, you have some SQL, and you do analysis.
Then what happens is – at Stripe, we don't just deploy this particular model as-is. We actually customize the model for large users. And the way we do so is mainly by (at least initially) customizing the actioning threshold. So for hundreds and hundreds of users, we decide on a different actioning threshold. That means that you have to do your analysis for this, too, and figure out which threshold to use. And then finally, you think, “Oh, I'm ready,” and you do something called a shadow deployment – which we can go into in a bit, it's a very useful process – where, again, we can rely on the infrastructure that we have. And then you monitor this manually. Then if this looks good, you say, “Okay, let's slowly ramp up traffic to production.”
And then once you deploy, for a few weeks or a month, you'll want to also keep an eye on it because maybe there's something wrong, so you take a look at performance. Basically, that's the process. I'll stop in just a short bit, but what I want to say is – everything we did last year was just automating all of this. It's kind of unglamorous work in many ways, where you're just going in the guts of your system, and you're like, “Okay, this job does this thing. That job does this thing. And then a human does this. Can we systematize all of it, make those processes just directly connect to each other _and_ define and code what the human was doing?” When they were looking at two curves and were like, “Yeah, it looks good to me.” So a lot of it was just doing that.
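The shadow deployment he mentions can be sketched roughly like this: the candidate model scores the same live traffic as production, its decisions are logged but never acted on, and you compare disagreement rates before ramping up. This is a toy illustration under those assumptions, not Stripe's actual system:

```python
def shadow_compare(prod_model, shadow_model, charges, tolerance=0.01):
    """Score the same traffic with both models, act only on the
    production scores, and report how often the candidate would
    have disagreed."""
    disagreements = 0
    for charge in charges:
        prod_action = prod_model(charge)
        shadow_action = shadow_model(charge)  # logged, never acted on
        if prod_action != shadow_action:
            disagreements += 1
    rate = disagreements / len(charges)
    return rate, rate <= tolerance

# Toy models: block a charge when its score crosses a threshold.
prod = lambda c: c["score"] >= 0.50
shadow = lambda c: c["score"] >= 0.55
charges = [{"score": s / 100} for s in range(100)]
rate, ok_to_ramp = shadow_compare(prod, shadow, charges, tolerance=0.1)
```

A real version would slice the disagreement rate by country, merchant size, and so on – exactly the "small slice" failure modes discussed above – before anyone flips the switch.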
**It's such valuable work, though. I totally… I think it's one of the things where people who try to bake machine learning into their product or platform don't actually understand how complicated and kind of _unknowable_ that path is. Because that path you've described there sounds right for Stripe and what you're doing, but it'd be completely different for the next organization. Right? Because it's context-specific. People come in with their own thoughts and ideas, _and_ actually common knowledge and the consensus are changing so rapidly. Did you have to do any of that path-trailblazing-type stuff yourself? Or was it quite a natural fit for where you were going at Stripe?**
Sorry, what do you mean, exactly?
**Well, I mean how much of that was quite apparent “These are the steps we'd have to take to deploy it this way. And these are the tests.” How much of it was actually going out on a limb and trying things?**
Oh. Yeah. [chuckles] A lot of it, I think, you learn by trying to deploy maybe without doing one of these steps. You know? You're like, “Okay. Well, we're gonna just deploy a new machine learning model.” And then, you get to this point where you’re like, “Well, do we know that we didn't completely break this specific…? Are we just rolling the dice here?” And you're like, “Okay. Well, I guess we should do this one analysis.” And then you do it and you’re like “Okay, I'm pretty confident with this.” And then like, “Well, okay. So we know that we haven't broken this user,” but then you're like, “What about this _other_ use case? Is this something that we've taken into account?” And I think two things.
1) One of the reasons I joined Stripe and one of the reasons I really enjoy working on this is – this is a problem you only have if you're really successful. So it's a good problem to have. We have this problem because there are a bunch of different users using our APIs every day, in a variety of ways. And so, that's a great problem to have, that we have to think about all of these different use cases and we have all these business processes.
The other thing that I feel like I've become more of a zealot about after doing this work is – basically, the work that we ended up doing this year was encoding our business expectations in code, for lack of a better phrase. So, what that means is, usually, there's a lot of arcane stuff that goes on in machine learning and it's like, “Oh, you train this model and then whatever, like you do some stuff, you do some analysis – whatever that means – and then you deploy.” Instead, it was like, “No. When we have _this_ model, here's the contract we have.
The contract we have is that we won't change the rate of actioning of charges in this specific country (whatever it is) by more than _this_ rate. If it's more than this rate, then we'll change it.” And we'll also always move towards having the same (this is just an example) false-positive rate and a higher recall. We’ll never trade the false-positive rate away – we’ll always improve recall, that sort of stuff. And that's really the key there for us.
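Encoding that kind of contract in code might look something like this minimal sketch – the metric names, thresholds, and function are illustrative, not Stripe's real release criteria:

```python
def release_allowed(old, new, max_action_rate_delta=0.02):
    """Encode the release contract: the action rate can't swing by more
    than the allowed delta, the false-positive rate must not get worse,
    and recall must not regress. `old`/`new` are metric dicts for the
    current and candidate models."""
    checks = {
        "action_rate_stable": abs(new["action_rate"] - old["action_rate"])
                              <= max_action_rate_delta,
        "fpr_not_worse": new["false_positive_rate"] <= old["false_positive_rate"],
        "recall_not_worse": new["recall"] >= old["recall"],
    }
    return all(checks.values()), checks

old = {"action_rate": 0.050, "false_positive_rate": 0.010, "recall": 0.80}
new = {"action_rate": 0.055, "false_positive_rate": 0.009, "recall": 0.84}
ok, detail = release_allowed(old, new)
```

Because each expectation is a named check, a failed release tells you exactly which part of the contract the candidate model broke – instead of "somebody looked at two curves and said it looks good."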
**So, a few things. [chuckles] First one – awesome vocab word that you said, “zealot”. I haven't heard that word in a while and it made me think, “Oh, I need to incorporate that into my speech more.” The other thing is, you mentioned this phrase before we got on here, which was already a quote, and I was like, “Oh, man, we gotta get you saying that when we're on the podcast because I want to make a t-shirt out of it.” And it goes, “You develop operational excellence by exercising it.” That, what you've just said, is basically that – if I'm not mistaken. But can we go deeper into that quote of yours?**
Yeah. So [chuckles] Okay. Essentially, maybe the clearest example I have for this is, again, releasing models. Here's how most machine learning teams I've been on will go about releasing models. They'll have a new use case, they'll think of a model, they'll make the model, they'll be happy with it, and they'll release it. Eventually, sometime between weeks and years later, somebody will say, “Oh, we should release a new version of this model.” And then in comes the problem, right? The code they used to train the model is _way_ out of date – doesn't work anymore. The data – nobody knows where it is. The release criteria, again, aren't in existence.
So you're just kind of looking at what the model does today, like, “_Well_, it _looks_ similar.” And so you end up with this problem, but again, I've had multiple times in my career where you’re kind of doing this reverse archaeology to figure it out. That's because, again, you haven't _exercised_ your release pipeline – you did it _once_. But what ends up happening is, even if you do this exercise twice – you release the model once and then you did it again – what I found is that, as long as there's enough time between when you're releasing this model, your code will rot. Just because code naturally rots. So, whatever you're building, even if it's very smart and very fancy, and you're like, “Oh, I've mathematically solved how to set whatever parameters for this model,” it’s like, “Yeah, sure. But all of the assumptions that you had about the world broke.” When you created this pipeline, maybe your company only operated in _this_ country. Now it operates in 12 countries. So all of your assumptions are wrong, all the distributions are different, all this stuff.
So really, the only way that you can have your models retrained and re-released – which for many applications is a _gigantic_ boost in performance that you would be foolish to leave on the table – is to have that pipeline (that operational work of releasing a model) run _at least_ as frequently as, essentially, the data demands it to. Basically, once you get to the point where, maybe automatically, you're going through this _whole_ process every week or every two weeks, the issues that happen at a two-week granularity are small enough that you can fix them. It'll be like, “Oh, we changed this one thing, so I'm gonna change this thing.” And then it's fine. So your pipeline will only ‘break’ in terms of small cracks, and you'll fix the small cracks.
It's kind of like cleaning regularly or doing housework regularly – you fix a small thing here and there. But if you leave it for a year, you come back to it and it's like a haunted mansion – the ceiling's falling on your face and you just give up. You burn it all down and you build a new house. So I think it's all about doing that _frequently_ enough so that you don't end up in the haunted mansion scenario. You can just regularly exercise the work and regularly just touch it up. It becomes almost like a Marie Kondo zen-like thing rather than a nightmare.
**Oh, that's so good, man. That's brilliant to think about. Though, do you ever feel like you get the ‘death of 1000 cuts’ because you're continually doing all this small work? You're like, “How can I get out of this painful patching up these little cracks?” Or is that just part of it?**
**Yeah, I was gonna kind of similarly ask – do you ever struggle with the balance of that work versus doing new stuff that isn't that kind of productionization piece? Is that balance hard to find?**
Yeah. I mean, I think some of it is… maybe you have to be the kind of person that likes tending to your garden, you know? Like trimming the weeds and all that stuff.
**[cross-talk] Yeah, the constant gardener.**
[chuckles] I think what happens is – everything is always a prioritization exercise. So, for this stuff – to take the example of our platform – essentially, we document this pretty well. You just released a model – What went well? What went wrong? Did anything break? And then I'll just stack-rank it – we'll be like, “Okay. Well, this thing – there's this bug. You have a 1 in 10 chance that this kind of annoying thing happens, but it's not dangerous. It's kind of annoying. It would take a week to solve. We have more important stuff to do.” You know? So some of that stuff, we'll just decide not to do.
And then some of the stuff is like, “No. This could cause a big, big issue.” In some cases, we actually get the _wrong_ performance metrics, like, “No, we _have_ to solve this.” And so you prioritize that above doing new work. But again, you can always bring it back to “Is this worth _more_ than me building a new model for this other thing?” Then you make your decision that way. One thing I'll say – maybe I'm just becoming superstitious – but when we were doing all this foundational work, four times, _four times_, I found a small thing and I was like, “Oh, this is kind of annoying. It's a small thing. It's not worth our time to fix it.” And the same thing, two months later, ended up causing just a _huge_ issue. Every time.
So I've become kind of way more of a stickler, being like “Yeah, yeah. It's a small thing.” But if I can, I'll just do it. I don't know if that's rational. It's just that I kind of got just unlucky breaks four times in a row on small things becoming a big thing. So I guess you develop your own heuristics.
**There’s something about being in the mind space at the time and thinking “Right, I'm thinking about this now. So let's just fix it.” As opposed to [cross-talk] thinking.**
**Also, along the lines of the way that you do reproducibility, you mentioned how much of a headache it is for a lot of people out there, and all the teams that you had been on previously, where you have to figure out like, “Alright, I want to go and get a better model. Let's see. What data was this trained on? What code was this trained on? Where did we get this data? How did we clean it?” All that stuff that goes into trying to reproduce the same results – how do you go about that now?**
Yeah. I can maybe go a little bit more into the Stripe tooling around it because I think it's really good. We also have some blog posts on our engineering blog about it. I think there are a few things that you kind of want: 1) You want your data generation to be something that you can rerun. Honestly, that can be as simple as – if you are going to train a model, it has to be attached to a Spark job, ideally, that runs in a scheduled manner. Again, because if you just have like, “Hey, I wrote the Spark job two years ago. Just rerun it.” Then I guarantee you that if you try, it'll break – the assumptions are all messed up. And so, if you can have something that generates your training set every week – even if you don't use it, it's fine – and that will alert you if it breaks, then that's really nice.
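The kind of "nothing crazy" sanity testing described here can be sketched in a few lines. This is a hypothetical illustration, not Stripe's actual tooling: `validate_training_set` and the row format are invented names, and in practice the check would run after a scheduled Spark (or similar) job and page someone on failure.

```python
# Hypothetical sketch: cheap sanity checks on a regenerated training set.
# In a real setup this would run after a scheduled data job; here the
# "training set" is just a list of (features, label) rows.

def validate_training_set(rows):
    """Check the set is non-empty and contains more than one label class."""
    if not rows:
        raise ValueError("training set is empty -- upstream job likely broke")
    labels = {label for _, label in rows}
    if len(labels) < 2:
        raise ValueError(f"only one label class present: {labels}")
    return True

# A healthy weekly run passes...
assert validate_training_set([((1.0, 2.0), 0), ((0.5, 0.1), 1)])

# ...while a degenerate one raises, which would trigger an alert.
try:
    validate_training_set([((1.0, 2.0), 0), ((0.5, 0.1), 0)])
except ValueError as err:
    print("alert:", err)
```

The point is that the checks stay trivial; the value comes from them running on a schedule, whether or not anyone is about to retrain.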
So that's something that I think is accessible for most companies. It doesn’t have to be Spark, but have _some_ data job that just keeps running, keeps making your training set and, ideally, you do some simple testing on it. Nothing crazy – but just test that it's not all empty or just has only one kind of label. 2) The other thing that we have that helps with this is – we do the same thing for our training workflows. Our training workflows are essentially… think of it as similar to a scikit-learn model. It sits on top of the data generation job and basically, everything is defined either in Python or in a JSON config, telling you, like, “You take the data. You filter out these three rows. The label is defined _this_ way. You take _this_ column and you multiply it by three (and like whatever). You train the model with _these_ parameters, with this many months of data for training and this many months of data for test.” And you have the same workflow for model evaluation where it's like, “Cool, you take the results of the previous workflow and just do a bunch of stuff on it.” As we get to more complicated use cases, we extend this by just adding more, essentially, scheduled jobs.
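A declarative training workflow in the spirit described above might look like the following. Everything here is an invented example – the job name, columns, and config keys are hypothetical, not Stripe's actual schema – but it shows the idea: filters, label definition, feature transforms, and train/test windows all live in one reviewable config rather than in someone's notebook.

```python
# Hypothetical sketch of a declarative, JSON-style training config.
import json

config = {
    "dataset_job": "weekly_charges_training_set",  # assumed upstream data job
    "filters": ["amount > 0", "currency is not null"],
    "label": {"column": "is_fraud"},
    "features": [{"column": "amount", "transform": "multiply", "by": 3}],
    "train_window_months": 6,
    "test_window_months": 1,
    "model": {"type": "gradient_boosting", "params": {"n_estimators": 200}},
}

def apply_feature_transforms(row, cfg):
    """Apply the declared transforms to one row (a dict of column -> value)."""
    out = dict(row)
    for feat in cfg["features"]:
        if feat["transform"] == "multiply":
            out[feat["column"]] = out[feat["column"]] * feat["by"]
    return out

row = {"amount": 10.0, "is_fraud": 0}
print(apply_feature_transforms(row, config))  # amount becomes 30.0

# Because the whole spec is plain data, it can be serialized, diffed,
# and re-run on a schedule alongside the data job it sits on top of.
print(json.dumps(config["label"]))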
For us, we'll have scheduled jobs that take like, “Yes, we've trained this model. We've tested it. It looks good in terms of general performance. But now we want to take the actual binary score – let's say like a bunch of French charges – and compare how it does to the previous model and verify that in France _this_ specific condition is met.” And then we do a bunch of things. But essentially, what I'm trying to say is that all of this is just scheduled and it runs whether we deploy models or not. And if it breaks, we know it.
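The segment-specific check described here (verifying a condition on, say, French charges before shipping) could be sketched as a simple release gate. The function name, threshold, and scores below are all hypothetical; a real version would pull the two models' scores on the slice from the evaluation workflow's output.

```python
# Hypothetical sketch of a segment-level release gate: before shipping a
# candidate model, compare its mean score on one slice (e.g. charges from
# France) against the production model and require it not to regress.

def segment_release_gate(prod_scores, cand_scores, max_regression=0.01):
    """Block release if the candidate's mean score on this segment drops
    by more than `max_regression` versus production."""
    prod_mean = sum(prod_scores) / len(prod_scores)
    cand_mean = sum(cand_scores) / len(cand_scores)
    return cand_mean >= prod_mean - max_regression

# Candidate is slightly better on the slice: gate passes.
assert segment_release_gate([0.80, 0.82, 0.78], [0.81, 0.83, 0.79])

# Candidate regresses badly on the slice: gate blocks the release.
assert not segment_release_gate([0.80, 0.82, 0.78], [0.60, 0.62, 0.58])
```

Run on a schedule, a gate like this fails loudly even when nobody is actively trying to deploy, which is exactly the "if it breaks, we know it" property being described.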
So extending that to all models – this is something that you can build at a platform level, where you just say for every data scientist in the company, “You want to ship a model? You have to have this job that creates your data. You just write the job and then your model training just keeps running.” That is super helpful. Because then you just have it by default. It's maybe like 50% more work when you do your initial model. But then you kind of get away from this huge problem scot-free after that.
**Yes, interesting approach there. I suppose I've not thought of it like that. But you're kind of artificially creating users for your pipelines so that you can then just treat them like you would any other software product, right? You just go, “Well, the user raised a bug. Need to go fix it.” Which is quite cool. As opposed to going, “This thing works. Put it on the shelf and let it collect dust and fall over when I next need it.” Yeah, it's quite an interesting idea, though. Maybe it's just my lack of reading, but I've not thought of that.**
Yeah. I haven't seen much writing about it either, but yeah. I think it protects you from exactly what you called out. Right now, I think one of the biggest risks of ML is the stakeholder that made the ML model oftentimes is not around, and nobody understands anything when it breaks. So that’s kind of like a forcing function. Because maybe you built the model and the pipelines, but now, usually the ownership is at a team level. So it's like, yeah, but the _team_ now owns the pipeline that trains and evaluates this model, and so they _keep_ owning it as long as the models are in production.
**Yeah, because I always think that the tricky bit comes from things like the static artifact of the dataset that trains the model can easily get forgotten. Not so much the model itself, which might be fine, but the actual artifacts that hang off of it. Yeah, that's cool, that’s cool. I like that.**
**That's incredible. This has been so insightful, man. And I just want to call out everyone who has _not_ read your book – go and get that. You can buy it from O'Reilly Media or Amazon. And I think they probably have that thing where you do the 30 days free trial with O'Reilly – I know they do that a lot. So if you want to, go read the book in those free 30 days [chuckles] if you're really cheap, go do it. Get after it. The book is called Building Machine Learning Powered Applications, Emmanuel. If I got that right, with my French – my poor French accent. I appreciate you coming on here so much, man. It's been _really_ enlightening. There are so many key takeaways that I've had from this chat. Just thinking about… [chuckles]**
**From the beginning, you were dropping bombs – machine learning engineers are extremely good at self-sabotage. We all know that. And they're trying to overcomplicate things. I think that's just an engineering thing in general, _but_ with all the different shiny tools and new frameworks and PyTorch Lightning, or whatever the latest PyTorch thing is that just came out yesterday – I can't remember the name. But we all want to try it, right? And maybe we don't need to for our use case. Like, let's bring it back to the KPIs and let's bring it back to the stakeholders and actually move the needle. Then – this was a great one from your mentor – “Write the press release before you start doing the work.”**
**And make sure the press release doesn't say, as Adam mentioned, “We worked on this for six weeks and it turns out it was really hard. [laughs] So we gave up.” Also – this was key for me – when we were talking about the KPIs and how to interact with stakeholders, “Tie whatever you're working on to some number the company cares about. Figure out what number that is – it doesn't have to be revenue, specifically, but figure out what the company cares about. Tie whatever you're working on to that and be able to show that you can move the needle on that.” **
**We're also going to link to something in the blog post about your internal tooling at Stripe because I'm sure there are a lot of people that want to go a lot deeper into what you mentioned in this call. Last but not least, and maybe this is all doom and gloom – Adam and I created a fake startup for the MLOps tooling sector while we were listening to you called Haunted Mansion, which is, in effect, doing exactly what you talked about. [chuckles]**
**One of the best analogies I've ever heard. I loved it when it was mentioned. I'm sorry, I love that.**
**[chuckles] The cobwebs and all of that, that's what we're trying to avoid. “The stakeholder who built the model is not around when it breaks,” and I think we've _all_ had to deal with that. We all know that feeling. Last but not least, “You develop operational excellence by exercising it.” Man. _So_ many good quotes and _so_ many key takeaways here. This was incredible. Thank you so much.**
Thank you. This is really fun. Thanks for having me.
In this episode
Senior ML Engineer, Stripe
Emmanuel Ameisen has worked for years as a Data Scientist and ML Engineer. He is currently an ML Engineer at Stripe, where he works on improving model iteration velocity. Previously, he led Insight Data Science's AI program, where he oversaw more than a hundred machine learning projects. Before that, he implemented and deployed predictive analytics and machine learning solutions for Local Motion and Zipcar. Emmanuel holds graduate degrees in artificial intelligence, computer engineering, and management from three of France's top schools.
Demetrios is one of the main organizers of the MLOps community and currently resides in a small town outside Frankfurt, Germany. He is an avid traveller who taught English as a second language to see the world and learn about new cultures. Demetrios fell into the Machine Learning Operations world and has since interviewed the leading names in MLOps, Data Science, and ML. Since diving into the nitty-gritty of Machine Learning Operations he has felt a strong calling to explore the ethical issues surrounding ML. When he is not conducting interviews you can find him stone stacking with his daughter in the woods or playing the ukulele by the campfire.
Dr. Adam Sroka, Head of Machine Learning Engineering at Origami Energy, is an experienced data and AI leader helping organizations unlock value from data by delivering enterprise-scale solutions and building high-performing data and analytics teams from the ground up. Adam shares his thoughts and ideas through public speaking, tech community events, on his blog, and in his podcast.