Coffee Sessions #49

Aggressively Helpful Platform Teams

At Stitch Fix there are 130+ “Full Stack Data Scientists” who, in addition to doing data science work, are also expected to engineer and own data pipelines for their production models. One data science team, the Forecasting, Estimation, and Demand (FED) team, was in a bind. Their data generation process was causing them iteration and operational frustrations in delivering time-series forecasts for the business. The solution? Hamilton, a novel Python micro-framework, solved their pain points by changing their working paradigm. Hamilton was built primarily by a dedicated engineering team called Data Platform, which builds services, tools, and abstractions to enable data scientists to operate in a full-stack manner, avoiding hand-offs. In the beginning, this meant data scientists built the web apps that serve model predictions; now, as layers of abstraction have been built up over time, they still dictate what is deployed, but write much less code.
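As Stefan explains later in the transcript, Hamilton's core paradigm is declaring each data frame column as a named Python function whose parameter names are the upstream columns it depends on. Below is a minimal, hypothetical sketch of that idea – the column names and the tiny driver are invented for illustration, not Hamilton's actual API:

```python
import inspect

# Each "column" is a plain function: its name is the output column,
# and its parameter names are the input columns it depends on.
def spend_per_signup(spend: float, signups: float) -> float:
    """Cost of acquiring one signup."""
    return spend / signups

def spend_zero_mean(spend: float, spend_mean: float) -> float:
    """Spend with the mean removed."""
    return spend - spend_mean

def compute(target: str, funcs: dict, inputs: dict):
    """Minimal DAG resolver: recursively evaluate a column's dependencies."""
    if target in inputs:
        return inputs[target]
    fn = funcs[target]
    deps = inspect.signature(fn).parameters  # dependency names from the signature
    return fn(**{name: compute(name, funcs, inputs) for name in deps})

funcs = {f.__name__: f for f in (spend_per_signup, spend_zero_mean)}
inputs = {"spend": 100.0, "signups": 25.0, "spend_mean": 80.0}
print(compute("spend_per_signup", funcs, inputs))  # 4.0
print(compute("spend_zero_mean", funcs, inputs))   # 20.0
```

In Hamilton proper, a driver walks the function graph in much the same way to assemble the requested columns into a data frame, which is what lets a team manage thousands of columns as small, testable functions.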

Take-aways

Demetrios framed the conversation up front: "Let's talk about Hamilton and just the general way of doing ML over there [Stitch Fix]."

Transcript

0:00 Vishnu **Hello and welcome to another episode of MLOps Community Coffee Sessions. Demetrios and I are joined today by Stefan Krawczyk. He is a manager on the Data Platform Team at Stitch Fix. As many of you know, Stitch Fix has one of, if not the most legendary data science processes and teams in the valley – in the world. There’s a lot to learn from what they're working on and how they are powering all the clothing recommendations that my girlfriend gets (and is very satisfied with). Thank you, Stefan, for joining us.** 0:35 Stefan Hello! Thanks for having me. 0:36 Demetrios **Stefan, one of the things that you said that I just want to start out with is that you and the team created Hamilton, and you chose to name it Hamilton, because…?** 0:50 Stefan Oh, yeah. So, the team that we're working with – their short nickname is The FED (the Forecasting, Estimation, and Demand team) – and we were trying to think: this is going to be a pretty foundational piece of software for them, or like an abstraction for them. So who was the foundation of the Fed? It was Alexander Hamilton. Then, because of the way that we actually approached the problem, there's a bunch of graph theory involved, and so there are a bunch of Hamiltonian-type things with graphs. So we thought Hamilton was pretty well-placed. It pays homage to that team's name, but then also pays a bit of homage to how we're actually approaching and helping them solve the problem. 1:35 Vishnu **Yeah, that's a very appropriate name – an appropriate name for what sounds like a pretty cool system. Before we jump into the entirety of that framework, I'd love to learn a little bit more about your background, your journey, and how you ended up where you're at, at Stitch Fix.** 1:51 Stefan Right. So how far back do I start? I mean, do you guys want from birth [chuckles] or just from time in the Valley or…? 1:59 Vishnu **Time in the Valley sounds good. 
[cross-talk]** 2:02 Demetrios **Also, you've got a wild story about where you're from, and how you came to be in the Valley. So maybe you can start with that, too.** 2:10 Stefan Yeah. I grew up in New Zealand, born to Polish immigrant parents. I did my undergrad there in computer science, with a bit of maths as well. Then I came out to the Valley on an internship with IBM. That was my first foray into Silicon Valley. I was there for, I guess, a year and a half. I applied to graduate school. Did Stanford for two years – a computer science kind of specialization – and while there, I did an internship at Honda Research, building a spoken dialogue system at Stanford. I focused a little bit more on the classical NLP stuff, which was obviously made obsolete by most of the deep learning stuff these days. But after Stanford, I was at LinkedIn for two and a half years, on the growth side and then on prototyping recommendation products. I went to Nextdoor, built a first version of quite a few things – data warehouse, email tracking infrastructure, A/B testing infrastructure – doing full stack data science, prototyping a bunch of things. So yeah, at Nextdoor I got to build a couple of first versions of everything. Then I went to a startup called Idibon, which did NLP for enterprise – there I went into the depths of trying to build machine learning on top of Spark and then making it accessible from a web server. Then the last five years or so, I've been at Stitch Fix, where I've been on the data platform – so engineering for data science – where I've kind of gotten to build three teams – rather, I'm on my third team now – and spanned everything from API backends to experimentation, to now focusing more on the machine learning platform, or what my team has called the “Model Lifecycle Team”. 
We still call ourselves the “ML Team,” but because there are so many data scientists at Stitch Fix doing so many different things, machine learning would kind of alienate some teams. So we actually broadened it to be the Model Lifecycle Team, but still have the ML shortening. 4:14 Vishnu **Awesome. So your team – the model lifecycle team. Can you tell us a little bit more about how Stitch Fix thinks about what the model lifecycle looks like? And maybe kind of get into Stitch Fix’s uniquely empowering mindset about data scientists? I'd love to kind of get a sense of that.** 4:32 Stefan Sure. Since I've been there for so long, I've seen things kind of evolve over time. Back when I started, it was hiring PhDs from applied math, stats, and physics backgrounds and kind of giving them the autonomy to build things end-to-end, so they were in charge of prototyping something, building an ETL, productionizing it, and then getting whatever they were building to production. Over the course of the years, the data platform team has built more platforms and abstractions. So I want to say the data scientists still have autonomy, but they're engineering less of the things required to get things to production and using more platform tools. So I think the autonomy that Stitch Fix very much thinks about – autonomy is a very key component in what the data scientists are doing, or at least the agency that they have – and so I think that means we can iterate faster, because there are fewer people involved. The data scientists can do something end-to-end: they can go talk to the business partner, figure out the problem. They're solely in charge of figuring out what needs to be built, and if they can build it themselves, then, obviously, they can give themselves better feedback as to what's working and what's not. As we continue to scale, I think we're trying to figure out how we can best ensure things keep progressing into the future. 
So with my lifecycle team, for instance, we're trying to reduce the amount of code that someone needs to write to get a model to production. Ideally, we can kind of reduce it to zero and make it just configuration, but… Yeah, what do you want to dive into more? 6:15 Vishnu **[chuckles] Yeah, no. There's a lot there. I want to hone in on one particular point, which is – you mentioned that Stitch Fix has this notion of a full stack data scientist who can go identify a need, and ideally, _themselves_, solve the problem end-to-end. This is something that we talk a lot about in the community: “How do different professionals create business value?” “Where do those teams come together?” And I think the specific question that I have, given what you just said about what the full stack data scientist at Stitch Fix looks like, is – how do you manage it if every one of those data scientists is maybe building a product end-to-end themselves? How do you manage all those different products? Or how do you manage all the outcomes that each of those fully empowered data scientists generate?** 7:03 Demetrios **Ooh. Good question.** 7:04 Stefan Yeah, tough question. I want to say… this answer is probably different for organizations of different sizes – at Stitch Fix, we have 130+ full stack data scientists. I think to be able to manage that, 1) you’ve got to be able to kind of control and measure what they're doing. So I think having a first class experimentation system – and what I mean by experimentation system, I mean an online A/B testing system. I guess in the MLOps machine learning world, experimentation these days is tending to mean back testing or an offline test. Here, I mean when something gets to production and it's being served by an API, how can you test that that model is working well? The experimentation system should enable concurrent tests so that people aren't stepping on each other's toes. 
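The requirement Stefan describes – concurrent tests that don't interfere with each other – is commonly met with deterministic, per-experiment hash bucketing. The sketch below illustrates that general technique only; it is not Stitch Fix's actual system, and the experiment and user names are invented:

```python
import hashlib

def assign(user_id: str, experiment: str, variants: list) -> str:
    """Deterministically bucket a user for one experiment.
    Salting the hash with the experiment name makes assignments
    independent across experiments, so many tests can run concurrently
    without stepping on each other's toes."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# The same user always lands in the same arm within one experiment,
# but is bucketed independently for every other experiment.
print(assign("user_42", "ranker_v2", ["control", "treatment"]))
```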
I want to say that’s one core component – that you can actually push multiple things to production and people don't step on each other's toes. I think the other side of it is just ensuring that there's accountability, and that when something's pushed, there is some sort of measurement. So, not everything is pushed in an online sense – some things happen offline. Just having some sort of rigor around analysis and documentation and sharing out. I think this is where it depends on how companies structure their processes, or at least their planning processes. One of them can be OKRs (objectives and key results) – how can you map what people are doing to objectives that the business needs, and then see whether you can move the needle on them? Technically, I think it’s the experimentation system; otherwise, there's a bunch of human planning and accountability, which I think is a little nuanced and dependent on the company. But at least at Stitch Fix that’s kind of how we've been doing things. 9:03 Demetrios **Basically, diving deeper into that – have you ever seen teams that are doing the same thing? And how do you keep that from happening?** 9:17 Stefan I mean… I want to say, it really depends. So, yes. I guess, with Stitch Fix, it's a clothing business, where we try to sell clothes, but we also have inventory and we try to optimize the amount of inventory we have for the number of clients we have. So you could say some people end up modeling very similar objectives. I want to say that by enabling them to be autonomous, we've actually enabled them to be self-sufficient in some regard. This means that if there's a bug in someone's code but no one else is using that model, it's really isolated to that team and that kind of vertical. I've seen it not really be an issue as Stitch Fix has grown up, in terms of model reuse or people redoing the same thing really hampering things. If you are trying to iterate quickly, then the more control you have, the better. 
At least at the size that we have been for the last five years, it hasn't been an issue. As we get bigger, there are some models where, yeah, they're getting really big and it doesn't make sense for other teams to build their own version of them. That's where a model essentially becomes more horizontal – it's not only serving one particular vertical – so you need to kind of have some organizational maturity around process, like with how you communicate; you have to know that there are downstream stakeholders and things. Therefore, I guess my short answer is “Yeah, it hasn't been too much of an issue.” I think with features and tables, that's where we're trying to actually get more leverage and reuse. But from a modeling perspective, we actually haven't seen too many things crop up so far. I think this is also potentially related to metrics. In the analytics world, people are coming up with their own definition of LTV (lifetime value), so it's important that there is potentially some team that owns a central definition that everyone can understand and revolve around. I want to say that this is a new problem in general for data warehouses and backend data science teams. But so far, at Stitch Fix, because of the autonomy that we have, we've been getting by just fine. 11:46 Vishnu **Got it. Yeah. I love what you said about organizational maturity and how, in a structured fashion, you've been able to avoid that problem. You sort of mentioned a phrase around this idea of ML platform and model lifecycle that I think is really interesting, which is: you'd like to ideally get the amount of code you have to write to deploy (or create and deploy) a model down to zero and have it be configuration. I think this idea around reducing the amount of code and increasing self-sufficiency has been the motto, or rather the guiding principle, for a lot of platform teams, which we're starting to see evolve in a lot of companies. 
Even at my early stage company, inspired by what Uber, Stitch Fix, DoorDash, and other companies have called their platform team, we've started to think more aggressively about our platform team now. So I guess my question based on that general observation is – over the course of your last five years at Stitch Fix, how have you seen that vision of getting code down to zero emerge and coalesce as the system-wide complexity has increased? How have you been able to go about that process in building the platform?** 13:11 Stefan [chuckles] Well, part of it is – one nice thing is that my team does engineering for data science. So we're not beholden to any other real stakeholders. Our stakeholders are other data scientists. We're also not on any critical path, necessarily, because we try to give the full stack data scientists autonomy. We're trying to avoid handoffs, so in some sense, it's actually reasonably easy for us to start something new and try to get people to adopt it as they do new work. You could say we're not necessarily burdened with how data scientists have been doing it previously. We can think entirely about “How should we be doing it and what are the steps to get there?” So this means 1) we need to be able to communicate this vision and then find a way. Or I guess, what I think has been the most successful way for us to approach this is to really understand the process that data scientists are following, and try to figure out what the most impactful things that we can abstract are, or what the core abstractions are that we need to really get to zero code in order to get something to production. This hasn't been a quick thing. 
In terms of our journey, we actually started… When I first joined Stitch Fix, it was really hard to spin up APIs or backend APIs, so rather than focusing on modeling, we actually focused on “How do you make it really easy to spin up a backend API?” Before, people were writing Flask apps; instead, we wrote a little abstraction on top of FastAPI, where we abstracted the web service and just enabled people to think about functions. If you think from a layers approach of how to do things, we started at the outermost layer of serving online predictions: “How can we abstract that and make that easy?” So with that, and then with experimentation cobbled together, we made it really easy for them to kind of experiment and deploy things in production from an API perspective. 15:34 Stefan Then, we started down… So, Hamilton was a specific case for a specific team, so we focused on that. But otherwise, since then, we've viewed features as a more general, centralized featurization kind of problem that is, I think, one of the hardest ones to tackle of the zero-code things. So we've actually left that to be… or rather, we're now thinking about it. And before that, we actually decided that we really needed to understand what a model is, to be able to get to the zero-code thing. Knowing that we've made deploying a backend API, or a model in an API, very easy, we then thought about “How can we actually automatically generate that code?” What we decided, or figured out, was that we really need to understand what the model is. So “What are the model’s dependencies? What's the model’s API input schema?” And then “How can we make this configuration-driven?” So those are kind of the starting points. I can keep going – or do you have any questions while I'm thinking on what to say next? 16:46 Vishnu **[chuckles] No, no. I feel like I threw a very general question at you and you handled it like an absolute pro. 
I think some of the things that I really enjoyed hearing about from you are that, number one, the stakeholders are the data scientists first and foremost – maybe not necessarily the customers at the end of the pipeline who actually get their clothes, but really the data scientists – the corollary of that being that you're not necessarily on the critical path. I think this is an interesting thing: to be on a software engineering team but not necessarily associated with the core delivery of the product itself – kind of working adjacent to that – because your rhythm is very different. The timelines that you think along are very different. And it's actually very helpful to hear that this is the way that you manage that element of building the platform.** 17:36 Stefan Yeah. Once they are on our platform, obviously, we have to keep production SLAs. But at least from a starting perspective, it's quarterly, early on – find the right data science team to work with, get them to validate what you're building, co-develop it with them, and reduce their pain points. Because then I think it makes us look good, because we have a win on the table: “Hey, look. We helped this team.” That team feels good, because they got some pain points removed. Then it gives us more credibility to expand and look at other teams and say, “Hey, can you adopt our stuff?” 18:19 Demetrios **I’ve got a quick one that is a little bit of a tangent, but I keep thinking about it. You're enabling all these different data scientists to effectively go out and solve their own problems or solve business problems. Have you seen that… Of course, you can always deny answering this question – you can plead the fifth. But have you seen where that freedom and that autonomy has gone wrong? Whether or not that has been that they've consumed a ridiculous amount of resources for the problem that they're trying to solve? 
Or they just managed to introduce a bug and it threw a jackknife into everything that it touched? Have you seen any kind of war stories or major fuck-ups?** 19:08 Stefan [laughs] Yeah. There's this quote of like – I forget how it starts – we essentially gave data scientists keys to AWS, and “with great power comes great responsibility.” So that's the kind of cultural attitude we have. I don't think it was through any malice that anyone did anything on purpose, but there were cases where we made something really easy to do and some people took us up on that offer and really used it. Like backing up the Spark cluster, because we make it really, really easy to backfill something. So it’s nothing bad or egregious, but just by making something really easy, some enterprising data scientist can go, “Oh, okay. Well, let me try this for loop to spin off 1,000 jobs at once.” So those are the kinds of things that come to mind. The other quote I like to say is, “process exists when culture breaks down.” In API design, someone without too much experience doesn't necessarily know the best practices, so we've had to institute a bit more of a process there, rather than it being cultural, around how to design an API: “Hey, you should get the API checked and this other team needs to sign off.” That's something we've had to learn along the way. But I think that's reasonable. It's not fair to assume that every team knows how to design APIs well, especially if they're not from a software engineering background. 20:52 Vishnu **Yeah, that makes total sense. It’s always useful to get those war stories. Demetrios loves those – absolutely loves them.** 21:02 Demetrios **Yeah, I’m a sucker for them. [chuckles]** 21:04 Stefan I mean, I'm gonna say that the applied physicists are usually the ones who figure out how to abuse systems, so… [chuckles] 21:11 Demetrios **[laughs] Always those physicists. 
[cross-talk]** 21:12 Vishnu **[laughs] [cross-talk] So Demetrios and I were researching Stitch Fix a little bit, and as we understand it, you have a data platform team, which – as the famous blog post puts it – is “aggressively helpful”. And then you have, as a sub-component of that, or like a sub-team within that, your model lifecycle team. Is that correct?** 21:35 Stefan Yep. Yeah. So, the platform team has a bunch of teams, but the model lifecycle is one of them. Yeah. 21:41 Vishnu **Okay, what are some of the other sub teams? Just curious.** 21:45 Stefan Yeah. We have a team that's focused basically on compute infrastructure – looking at the Spark cluster, Presto, “How do you do distributed compute?” So they take care of distributed compute, Kafka, and then how things are stored. Essentially data access: how do you get it to some compute, and then provisioning that. Then we have another team focused more on workflows, you could say, and environment: “How do you build a Docker container? How do you build an image? How do you create a workflow and get it scheduled and executed?” Then we have a team that's focused on – you could say an internal UI/UX team. But the UX that they’re mainly focused on is data visualization. We have an awesome team that helps spin up internal dashboarding or provide internal React tooling to make it really easy to spin up some sort of front end to visualize some data, which is a common need data scientists have. And another team that's focused more on the backend APIs – sort of this interface layer between the algorithms organization and the wider business. They take care of that and build tooling and infrastructure for it. And then we have the centralized experimentation team. They handle centralized experimentation for the company. So they have the highest-traffic service, and they have the front ends to turn on experiments and plot results. Yeah… does that cover all the teams? I think. Oh, and then we have our data engineering team. 
As we've grown bigger, we've actually spun up more of a data engineering team, who own some core data model abstractions within the data warehouse. Then they also provide self-service access through BI tools like Looker. So we've spun that team up as well. 23:58 Vishnu **Very, very helpful. So this is the big picture of the data platform, and model lifecycle sits within that. So, as you kind of mentioned, your mandate is really focused on this idea – and correct me if I'm wrong – “create and deploy models easily”. Is that correct?** 24:14 Stefan I mean, yeah. I don't think I actually said it. But essentially, the mission of the team is to streamline model productionization for data scientists. So that’s anything we can do to streamline their process of getting a model developed and into production. 24:27 Vishnu **That's awesome. So now, I would love to dive into some of the specific projects that your team has done around that model productionization work, because that sounds like a pretty interesting challenge to define across the lifecycle of very different models, or maybe models that serve different needs of the business. Is that fair to say?** 24:47 Stefan Yep, totally. I mean, at Stitch Fix, we have people doing time series forecasting, neural net stuff, matrix factorization, logistic and linear regression, constrained optimization, and simulation. So they're all models, they all want to be tracked, they all have some sort of notion of metrics. So trying to build a platform that can enable all of them is kind of what we're trying to do. 25:14 Vishnu **Man. So if you had to give an example of a project that your team worked on that you're most proud of in terms of how it really helped that process of model productionization, what would you say?** 25:30 Stefan Good question. [chuckles] I mean, I like everything my team does, so it’s hard to pick favorites. [laughs] 25:36 Demetrios **[chuckles] There’s a diplomat. 
He is very diplomatic, isn't he?** 25:39 Stefan [chuckles] Hard to pick favorites. I mean, I want to say Hamilton is a nice framework for featurization for time series forecasting, and then there's Model Envelope, which is our core abstraction for capturing what a model is and then being able to auto-generate code to get it to production. I would say those two are the things that I'm most proud of thus far. If you want to talk models and things like zero code, then I think Model Envelope is where we should go. If you want to talk about DAGs and a different way of creating data frames, then Hamilton is where we can go. 26:17 Demetrios **Well, independent of what Vishnu says right now – I'm going to cut him off real fast. I just want to know about Hamilton, because the last time that we spoke was at the Apply conference, where we met. And you mentioned that Hamilton may be open sourced. Any updates there?** 26:38 Stefan Yep, yep. I've engaged the internal process – just waiting for a few more boxes to be ticked, for instance, legal. But otherwise, we're looking to get that out as soon as I can – as fast as our own internal process allows us. I’m very excited to try to get that open sourced and get other people to use it. In short, I guess Hamilton is a micro-framework for creating data frames. What it essentially does (the TL;DR) is change the way that you create a data frame. For instance, for time series forecasting, you need to create a data frame with a bunch of columns – at Stitch Fix, it's thousands of columns wide – and managing the codebase to do that actually gets pretty cumbersome. What Hamilton does is enable a team to manage that process really well by focusing on building functions. Hamilton can then build the data frame that's required by that team for use in time series forecasting. 27:49 Demetrios **Yeah, I'm pretty stoked. 
I know there was a _huge_ amount of reception, or a very warm reception, for the idea of it being open sourced. So keep us updated. Once it is open sourced, please let us know. And maybe come back on after a few months of it being open sourced and you can teach us how different of a beast it is to open source something versus just having it within the company. It's not like they're diametrically opposed, but they are different to work with, we can say.** 28:25 Stefan Yep. This will be, I guess, my first real experience open sourcing something, so definitely, yeah – we'll probably have something to chat about. 28:35 Demetrios **I get the feeling Vishnu wanted to go down the other route, though.** 28:40 Vishnu **No, I mean, I will say that Hamilton sounds like a very impressive piece of engineering. The reason I am interested in talking about Model Envelope is because I actually had to work on a model deployment system. And so I would love to just talk a little bit about how you guys were thinking about model deployment and go through that process overall. Maybe we can start with something similar to how you summarized Hamilton. Could you summarize for us what Model Envelope really is and what it meant to the team?** 29:08 Stefan Yeah. I guess the Model Envelope, in a nutshell, is a container – hence why we use the term “envelope”. So it's a model and a bunch of things that we can put into an envelope and then ship and send to production. That's kind of the metaphor we were going for. What it essentially entails is: a data scientist uses an API much like MLflow's to save the model – that turns up in a registry – and then they just need to click a few buttons, provide some config, and they can get that model deployed to production via an API, or served in a batch workflow. But essentially, it enables us to think of the model as a black box. We know the inputs, the outputs, and what environment is required to run it. 29:51 Vishnu **Got it. 
So essentially, that process of specifying all those inputs and outputs and everything, and thinking about how the model is used in the end use case, is kind of abstracted away for the data scientist?** 30:06 Stefan Yep. Yeah. We started this project around the same time as MLflow was kicking off. There's also the project ModelDB, which is, I guess, the predecessor of MLflow. TensorFlow Extended has also come out – they have a nice, you could say, ‘flowchart’ of the different steps that they have in the model deployment pipeline. Essentially, we were thinking of what problems we had internally, but then also looking at where industry was going. So we took a bunch of ideas from those projects and incorporated them into the Model Envelope. We wanted to make it super easy to save the model and then deploy it. But then we also wanted to decouple deployment. We encourage that any and every model we create is saved in the Model Envelope, even if you don't know it's gonna go to production. That enables a bunch of other things that we can do. But really, I think, from a data science ergonomics standpoint, we wanted to have them focus on the model. Then, when they want to take it to production, because we know what the model is and what API inputs are required, the service that actually ends up getting deployed is really simple – it’s only about the model, it’s not wrapped in some business logic, etc. So it's actually pretty easy to put into some production workflow without knowing the context a priori, because that's not required. 31:34 Demetrios **Before Vishnu jumps into more technical questions and pillages your brain for all this information for his next project, I want to ask you about something. I know we've talked about this before, but we weren't recording when we talked about it. And I love what you say about this. It's like “deployment for free,” I think that’s how you put it. 
And that is just a great way of explaining what exactly you're trying to do – the vision there. But as you mentioned, you were creating this when MLflow was coming out. Also, TensorFlow Extended has come out. Has there been a moment since you created it that you've thought, “We need to jump ship because MLflow (or TensorFlow Extended) is doing this and we like the way that they're going. Maybe we should just outsource it and bring that on.”?** 32:34 Stefan Yeah… I mean, in all honesty, no. And the main reason is because it's much easier to integrate with our internal tooling than it would be with those options. Now, we do keep tabs in terms of features and things, and how they're doing. For instance, I actually think the TensorFlow data validation module is pretty awesome from the perspective of what it's trying to do, and the schema that it then uses downstream to determine API validity – whether the right data is making it to the API, what the expected bounds of the data should be, etc. A nice thing about open source code is that you can have a look at the internal details and really try to understand what they're trying to solve. MLflow has since, for instance, added the ability where you pass in a data frame and it introspects that to determine the API. Before, it didn't do any kind of API validation. So it's nice to actually have some of our internal ideas validated – that's made us feel good that we've made some reasonable design choices. One of the things that I think is key for us is that we actually wanted a really easy way to tag models. We didn't want to institute any hierarchy, because we wanted to make it really easy for not just an individual but a whole team to manage a suite of models – or for a manager to quickly understand what's happening. So we really needed a different approach to model management. And so because of that, I want to say, no, we haven't gone “Oh, damn. 
We should have used them instead.” But for the most part, we've actually developed things in parallel and had our ideas validated, which has made us feel good. 34:32 Vishnu **Got it! Got it. Interesting. And so… [cross-talk]** 34:35 Stefan I mean TensorFlow Extended is a pretty good standard to reach for, but unfortunately, not everyone wants to use TensorFlow, so that's also one of the reasons why we didn't go down that route. 34:47 Vishnu **Exactly. It makes total sense. I mean, you have very different modeling use cases. So that kind of leads me into my question, which may be a little bit more of an exploration. So I'm a data scientist. I've written some modeling code, I pulled some data in some format. I came up with a model, it has this performance, and I'm like, “Okay, Stefan. You're telling me I should use this Model Envelope. It seems like our company standard. How do I actually use this Model Envelope on my model artifact?”** 35:21 Stefan Yeah. We try to make it so that it is as simple and easy as possible. One is making the installable dependency pretty light so it's not trying to pull in different libraries and things – it's pretty agnostic to that. You can write it all in one line of code, but if you want nice-looking code it’s gonna be a bunch of lines. So essentially, you just want to make it as ‘cut-and-pasteable’ as possible as well, so that when someone sees it in someone else's script, they can cut and paste it and generally, it should just work. You don’t have to tweak too much about it. Because, as you know, most people start projects by cutting and pasting the previous one, so we wanted to ensure that it's clear from the API description – or at least from the parameters that are passed – that people can understand what it's going for. But we tried to make it as simple as possible. 
With other frameworks, for instance, you then also have to have a deploy step in that same script – we explicitly remove that, because we just wanted to make it super simple. If someone just had a model, they just need to add these lines to the end of their script and it should just work. 36:36 Vishnu **Wow! What an elegant workflow. Okay,** 36:38 Stefan Yeah, I mean, the other thing is we try to also introspect as much about the environment that the model is running in as possible. As opposed to other APIs, where you need to pass in the Python dependencies that are required, we instead snapshot the environment there and try to do some reasonably pseudo-intelligent things to determine, “What is the actual set of dependencies required for this model?” Therefore, we also try to make it so that nothing has to change as the model evolves over time. You don't really have to change what's passed into the API because it should just work. 37:16 Vishnu **Okay. So let's say I'm working on a model, it takes in five features in my API spec. I take in five features – age, height, gender, whatever it might be – and in return, “Okay, you should have this clothing item.” But then, after some additional research six months later, I come back and say, “Stefan! It's a new feature. We've actually discovered that which color eyes you have is really important! How do I extend the API definition in Model Envelope for my model?”** 37:44 Stefan Yeah, so that's kind of where I was saying that, hopefully, the way that we structured what inputs are required means it's always going to be correct. You don't have to do anything. Maybe it's only on the API deployment side that you might have to do something there. So, for instance, if the predict function takes Python primitives, so there's no data frames – it's all ints, floats, and things – we take that as a signature for the inputs for your model. 
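(To make the introspection idea Stefan describes concrete, here is a minimal sketch – not Stitch Fix's actual code – of deriving an API input schema from a type-annotated predict function's signature. The names `predict` and `introspect_api` are hypothetical.)

```python
import inspect

def predict(age: int, height: float, eye_color: str) -> str:
    """A toy predict function whose inputs are Python primitives."""
    return "blue sweater" if eye_color == "blue" else "gray sweater"

def introspect_api(fn):
    """Read the function signature to derive the model's API inputs."""
    sig = inspect.signature(fn)
    return {name: param.annotation.__name__
            for name, param in sig.parameters.items()}

schema = introspect_api(predict)
# schema == {'age': 'int', 'height': 'float', 'eye_color': 'str'}
# Adding a new parameter to predict() updates the schema automatically,
# with no change to the save-model call.
```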
So if eye color becomes a new parameter in that function, we automatically figure that out and that's captured in the new model. Also, if it's in a data frame, we actually require you to pass in example data. Ideally, that example data is actually what you use for training in its simplest case, in which case, that's always going to be up to date. So no code change required there. It's only on production integration, where, if no one's passing you the eye color feature, there'll be an issue. But then that should return, like, an HTTP 400 result or something like that. So therefore, very much focused on, “How can we make this very low maintenance and always correct?” 38:59 Vishnu **Got it. So once I've packaged up my model into the Envelope using the five lines of code that basically describe – I'm assuming dependencies or, as you said, snapshot dependencies – I guess, I do some kind of validation of the API? Is that correct?** 39:17 Stefan Not strictly… it really depends on what validation you want to do, right? Validation really is potentially a little context-dependent on where you want to deploy it. So that should probably be more part of the CI/CD step before deploying. But yeah. We enable you to log metrics and other things as well. So if you want to not just have five lines, add a few more lines to log metrics about the model. Then we also capture that and then store that. With validation, we haven't been opinionated about modeling that, but we have the constructs there to enable whichever opinionated way you want to do things and just have it work. 40:06 Vishnu **Fascinating. Okay. So now I have deployed my model. You've told me how I can actually extend the model as well, which is, “Hey, Vishnu. Don't worry about it.” Now, I'm curious – kind of zooming back into the platform engineer’s thought process. 
Okay, now all the data scientists are actually deploying using the standard format – how have you built on top of that to maybe enable monitoring at scale? You mentioned model management using the tagging. I'm curious where you've been able to build on top of this in other elements of your infrastructure?** 40:41 Stefan Yeah. So with tagging, the way that we've divorced saving the model and deployment is that we then have a system where people essentially use tags and some operations. Basically, they come up with a query. They can query over anything about the model and then determine whether it's an eligible candidate to be deployed. So whenever a new model is written, we go, “Oh, look. Does it match any of the current rules that are required to deploy something?” If so, we then kick off a deployment. Once things are saved in the Envelope, the only things that are really mutable are tags and metrics – everything else is static. Each new model saved means a new deployment, a new model instance ID, but you can update tags on it. This enables us, for instance, to have – to your validation question – they can save the model with a bunch of tags saying, “Hey, this was trained.” They can then kick off the second thing that pulls that model and does some validation, pulls it, scores it, etc. Then they can add another tag saying, “Hey, this is good for deployment,” in which case, as long as they've set up rules to look for that particular tag on that particular model, that will then qualify it for deployment. So that's one thing that we build on top of.  41:58 Stefan The tags function also gives people a way to query for their models, using whatever tagging system they've come up with. So we haven't actually been very opinionated on tags, purely because we wanted it to be pretty organic and to let people figure out the right processes – human processes – for how teams are organized and how the business works. 
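(A hypothetical sketch of the tag-query deployment rule Stefan describes: saved models carry free-form tags, and a deployment rule is just a query over those tags. None of these names are Stitch Fix's real API.)

```python
# Each saved model gets an instance ID plus mutable, free-form tags.
saved_models = [
    {"id": "m-001", "tags": {"team": "forecasting", "stage": "trained"}},
    {"id": "m-002", "tags": {"team": "forecasting", "stage": "trained",
                             "validated": "true"}},
]

def matches(model, rule):
    """A model qualifies if every key/value in the rule appears in its tags."""
    return all(model["tags"].get(key) == value for key, value in rule.items())

# A deployment rule: deploy forecasting models that passed validation.
deploy_rule = {"team": "forecasting", "validated": "true"}
to_deploy = [m["id"] for m in saved_models if matches(m, deploy_rule)]
# Only m-002 qualifies; saving or re-tagging a model re-runs the rule check.
```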
We've been pretty flexible there. But in terms of “building on top of,” we store the API signature, and then if we're given data, we also have expected data shapes and distributions. We are in the process of connecting that to the online side of actually capturing things online so that we can determine, “Do we have training-serving skew?” We have the hooks in place – we auto-generate the service, and we can log the things that we need. So those are things on our roadmap that we're actually getting to in the next quarter or two. What else have we done? We also enable batch deployment, so anyone can put someone else's model into their workflow, as long as they know the tag query required, and it will always get the latest model when queried. Or if you know the instance ID, you can use that. So we've also made it easy for teams to use each other's models in an offline context. 43:27 Vishnu **That is awesome. So much that you've enabled by just defining what a model is and how it will work in a number of different settings. And so, that observation right there – if you were talking to a more junior developer, let's say, or somebody that was maybe in an earlier stage system, where modeling is still kind of an emergent use case (maybe someone named Vishnu Rachakonda) what advice would you have, in terms of the process of… I'll share my challenge candidly. When I was building a model deployment system, it's like – you have your glue code and you have the model itself, right? And it's like both of those kind of evolve together. How do you actually have this neat abstraction where you can just say, “Okay, this is where whatever the model is required to slot in – this is the envelope that encapsulates it. That is not mutable.” That was always hard for me to define. How did you guys actually go about saying, “This is what a model constitutes.” I feel like that's a pretty hard challenge. 
How did you guys do that?** 44:34 Stefan Essentially, if you can serialize some bytes, then essentially, it's a model. I mean, you could write a plain Python function and serialize it and it will just work. We were, on purpose, trying to be very agnostic. The only requirement early on was, “Can you serialize it in Python?” Now we do have hooks – different frameworks have specialized serialization formats, so we do have hooks to enable that. But essentially, getting started was like, “If you can use Pickle or Cloud Pickle to serialize it, then our framework can serialize it.” Because it's Python, you do need to give us a pointer to the function so we can actually ensure that this is something that we can call and execute. But that was essentially all that we required, like, “Hey, it's some function that we can execute.” That’s essentially how we were thinking about it. And it just so happens that with Cloud Pickle, a lot of model frameworks and other things can be serialized pretty easily that way. There are a few that can't, TensorFlow being one of them. So that's kind of the point – we tried to be as agnostic as possible. Now, the other side is that Python dependencies are generally where people get caught up, so you have to be super strict on that, or at least be able to capture those dependencies properly. And on the production side, we're just gonna make sure we have tooling that doesn't really conflict with the Python versions, which isn't the hardest thing to do – as long as you know what versions you have of things and where things conflict, then at least if a conflict does arise, it's easy to track down so you can fix it. 46:18 Vishnu **Got it! Okay. So is it fair to say then (the final question before I kick it back to Demetrios) that Envelope is a little bit more constrained and more well-specified than a container?** 46:33 Stefan Basically, we make a base assumption that everything runs on the same kind of Linux base – Linux version. 
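(A rough illustration of the “a model is anything you can serialize and call” idea, using the standard library's `pickle` for simplicity; Stitch Fix uses Cloud Pickle plus framework-specific hooks, and the helper names here are invented.)

```python
import pickle

class DoubleModel:
    """A toy 'model': any serializable callable counts."""
    coef = 2.0
    def __call__(self, x: float) -> float:
        return self.coef * x

def save_model(model):
    """The only hard requirements: it serializes, and it's callable."""
    assert callable(model), "need a pointer to something we can execute"
    return pickle.dumps(model)

def load_model(blob):
    return pickle.loads(blob)

restored = load_model(save_model(DoubleModel()))
# restored(3.0) gives 6.0; cloudpickle extends this approach to
# closures and lambdas that plain pickle cannot handle.
```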
We track what Python version it was run on and ensure the container has the right Python version as well. So I want to say it builds on top of Docker, so eventually we created a Docker container. But essentially, this enables us to use that model in any context. We could use a Python environment locally if we wanted, with the same information. So I want to say it goes hand in hand with it, but it doesn't… We don't necessarily capture the C or other underlying dependencies. We make a bit of an assumption that there's a common base layer. But otherwise, it's pseudo like that, but not entirely. 47:30 Vishnu **Right. Okay. Makes sense.** 47:33 Demetrios **So – zooming out a little bit and coming back to the whole vision and goal for this that you talked about in the beginning, in that you're trying to speed up the time to development and the time to production. Do you have numbers around how this has helped and what the benefits have been?** 47:56 Stefan Yeah, that's always a challenging question for platform teams. Partly in that, at least in our case, we have data scientists who are building very different models and also have very different tenures. So if you were to measure time to do something, you'd have to ensure that you're comparing similar data scientists to each other, and gathering that data in and of itself is challenging, I want to say. The other side that I think about is actually, “Who would be very disappointed if they couldn't use Model Envelope?” And everyone who uses us really enjoys it and actually gives us a shout-out because we made some part of the process much easier to do. So what we instead look for is, you can say, a “net promoter score” type thing like, “Would you tell another colleague to use us?” But we do track the number of models created – that's obviously a vanity metric, because if someone's doing backtesting or something, they can create 1000 models in a day. 
We can't track that as an absolute number, so we’ve actually been tracking more “team penetration” – “How many teams are using us? How many services are deployed?” – and that's nice linear growth over the course of quarters, because that's just how people develop and that's the pace of new things cropping up. So as long as our metrics are generally trending up and to the right and people are happy, those are kind of our success metrics thus far. But I'd love to be able to quantify that more easily – it's just a challenging problem. 49:44 Vishnu **I think that makes perfect sense. I think, at the end of the day, as you said, there are all these compounding factors, and people using the tool is the greatest endorsement that they can give. It certainly sounds like that's the case. Thank you so much for taking us through the Model Envelope use case. It feels like we've learned a lot about how platform teams can actually enable better data-oriented and machine learning-oriented workflows through the right kinds of abstractions. And that's always something that we're talking about in the community. And with that, thank you so much, Stefan, for joining us and sharing your knowledge.** 50:21 Stefan Thanks for having me. 50:24 Demetrios **This was awesome, man. Yeah, and I'm gonna have to recommend – or I will nominate you as the guest with the best background that we've had. So if you're just listening and you do not see Stefan's background, you might want to just pop over to YouTube and check out what he's got as his background, cause it is _insane_.** 50:46 Vishnu **And so if anybody wants to maybe read your blog posts, or look at your content on Hamilton or Model Envelope or anything else – those slides are available publicly. Is that correct?** 50:58 Stefan Yeah. So I have a SlideShare account. So if you look me up on SlideShare, you should be able to see all the presentations I've given. Otherwise, I am in the MLOps Community Slack. 
So I'm also happy to take questions there. 51:11 Demetrios **Yeah. We'll link to your SlideShare in the description. And if you all who are listening enjoyed this, give us a thumbs up, like it, all that good stuff – whatever the cool kids do these days to show some love to this.** 51:25 Vishnu **It’s still a thumbs up, Demetrios. Still is a thumbs up. [chuckles]** 51:29 Demetrios **It still is? Alright. I didn't know if we needed to TikTok or something. Or do a special podcast to show the appreciation. Anyway, we're getting a little ahead of ourselves. I think it's time to wrap. Thanks again, Stefan. ** 51:43 Stefan Thank you so much. 51:44 Vishnu **Thanks again, guys.**

In this episode

Stefan Krawczyk

Manager, Data Platform, Stitch Fix

Stefan loves the stimulus of working at the intersection of design, engineering, and data. He grew up in New Zealand, speaks Polish, and spent formative years at Stanford, LinkedIn, Nextdoor & Idibon. Outside of work, in pre-COVID times, Stefan liked to 🏊, 🌮, 🍺, and ✈.

Twitter

LinkedIn

Demetrios Brinkmann

Host

Demetrios is one of the main organizers of the MLOps community and currently resides in a small town outside Frankfurt, Germany. He is an avid traveller who taught English as a second language to see the world and learn about new cultures. Demetrios fell into the Machine Learning Operations world and has since interviewed the leading names in MLOps, Data Science, and ML. Since diving into the nitty-gritty of Machine Learning Operations, he has felt a strong calling to explore the ethical issues surrounding ML. When he is not conducting interviews, you can find him stacking stones with his daughter in the woods or playing the ukulele by the campfire.

Vishnu Rachakonda

Host

Vishnu Rachakonda is the operations lead for the MLOps Community and co-hosts the MLOps Coffee Sessions podcast. He is a machine learning engineer at Tesseract Health, a 4Catalyzer company focused on retinal imaging. In this role, he builds machine learning models for clinical workflow augmentation and diagnostics in on-device and cloud use cases. Since studying bioengineering at Penn, Vishnu has been actively working in the fields of computational biomedicine and MLOps. In his spare time, Vishnu enjoys suspending all logic to watch Indian action movies, playing chess, and writing.