Coffee Sessions #50

Creating MLOps Standards

With the explosion in tools and opinionated frameworks for machine learning, it's very hard to define standards and best practices for MLOps and ML platforms. Drawing on their experience building AWS SageMaker and Intuit's ML Platform, respectively, Alex Chung and Srivathsan Canchi talk with Demetrios and Vishnu about navigating "tooling sprawl". They discuss their efforts to solve this problem organizationally with Social Good Technologies and technically with mlctl, the control plane for MLOps.

Transcript

0:00 Demetrios **What's going on, everyone? We are back for another edition of the MLOps Community Coffee Sessions. This is a unique Coffee Session because we have two guests. We are lucky enough to be talking to Alex Chung and Srivathsan Canchi – he'll correct me if I pronounced his name wrong, because everyone knows that I am a master at pronouncing names incorrectly. I haven't quite figured out how to pronounce them correctly. But today, what we're going to be talking about is something that is dear to my heart. It is the standardization within the machine learning space – within the MLOps space. And I'm excited to dive into this. I've talked to Alex quite a few times and we have the social good on the YouTube channel. Actually, we can probably start there in just a minute with that. But we have talked a lot about how we can standardize things within this space, because there are so many tools out there, and there are so many opinions being made. And it's not really clear where we're gonna go from here and how we're going to make it easy for everyone – if that is even possible. So that's what we'll be explaining, or exploring, today. First up, maybe we should just go around and hear about who you guys are? Alex, who are you, man? Sri? Who are you? Alex, I'll let you go first.** 1:35 Alex Thanks, Demetrios. I've always really enjoyed our conversations just because the MLOps community’s really become, for lack of better words, the community and the hub and spoke for a lot of the conversations that are happening of new ML engineers into the field, but also for those that are looking for a dialogue in a forum. And in this pandemic, it's just, frankly, quite hard to have those organic Coffee Chats at NeurIPs, or these other conferences where normally we'd have very thought provoking sessions, in terms of trying to understand what the direction of the production automation processes is. So, again, my name is Alex Chung. I used to be a PM on the SageMaker team at AWS. 
Before that, I ran the computer vision DataOps at Facebook. When I was at AWS, my whole role was to figure out, of the SageMaker tooling, what can we sell to larger enterprise customers? If we look at these end-to-end ML platforms like SageMaker that basically do everything, the reality is – not every customer needs everything. The larger organizations that have a strong cloud practice and IT teams are probably thinking about how they blend what they have existing with SageMaker and with some of these other tooling vendors. And it was my role to figure out with SageMaker, “How do we pull out these features and glue them into an enterprise's internal ML stack?” And that's actually how I met Srivathsan. He and I became very good friends and he was one of my customers. 3:07 Srivathsan Hey! My name is Srivathsan. Until recently, I used to run the engineering for Intuit’s machine learning platform. That's where I've been a customer of Alex’s products and SageMaker. Intuit uses a lot of SageMaker capabilities and a lot of home-grown capabilities for building out its ML lifecycle, and my team's job was to make sure that ML at Intuit was going on at full speed and we did not have any roadblocks in the way of data scientists dreaming up wonderful models or ideas and putting them out there in production and testing them out. The platform is used to enable all of the ML lifecycle steps and make them really, really smooth so we can actually take an idea from inception to production in less than a week. 4:22 Demetrios **That is everyone's dream, I think. And we've heard that stat being quoted before. I know it makes a few people in the community jealous, to say the least – if not pissed off. But Alex, what is SGT? I am sure that some of the people that are listening have heard or seen some of the stuff that we've been putting on the YouTube channel about SGT and it would probably be good to let people in on what you're doing. 
What's that effort around it?** 4:58 Alex So the backstory is: when I was at AWS, I realized very quickly that the biggest challenge of selling SageMaker into Enterprises was that there was no common integration point. Each customer of mine had basically built in their own internal libraries and then those libraries connected with services like AWS, or Kubeflow pipelines from open source. And it was basically one big mess. It was clear to me that we needed to build an industry forum for MLOps specifically – so nothing to do with the models themselves – but really on this productionization process. “How can we either converge some of the schemas or incubate certain efforts so that these different tools for MLOps can come together?” So within SGT, it's a nonprofit organization organized here in the United States – we have a 501(c)(3) status. We really break down our work into three categories.  First, it's education: we partner/collaborate with various influencers in the field and we really make sure that there's good content and we share it. We also give a lot of advice and feedback. Second, we focused on having discussions of standardization efforts. Basically, when I say standardization, it's a common link between different parts of the MLOps lifecycle: from model serving to your feature serving. Or from your training systems to taking the artifacts to your model serving systems. Then the last component – and this is where I think we've started investing even more time around is – how do we incubate open source projects that can really tie together MLOps? And we think of this as the next generation of orchestration. Historically, orchestration has been around workflow management, but now we see it as the jobs themselves and how they can get composed together from different tools but all in a coherent experience for the data scientists and ML engineer. 6:58 Vishnu **That's awesome – really admirable effort. 
Hearing you talk about what SGT is doing, and what some of your career experiences have been, both of you, Alex and Sri, the first question that strikes me, that I think a lot of our listeners are asking, is “Why is standardization desirable?” You know, I'd love to understand from the standpoint of a Platform PM in Alex and from a Head of Engineering in Sri – why is it that we look for standardization across different tools? Maybe we can kick it to Sri to start.** 7:29 Srivathsan Alright. I would like to take from my previous experience and how the cloud journey has been. Sometimes history repeats itself and that's very true in the ML space. A lot of the problems that we're seeing in the ML space are not new. If you take the cloud world, for example, in the early 2010s timeframe, there were many players in the cloud space in the first half of the decade. You had AWS, you had Azure coming up, Google was a big player as well. And then you had a lot of the smaller players, and you had the private clouds with the OpenStack console. It was just very difficult to actually build applications and make them portable across the clouds, or to build a technical quorum, or a group, who could share experiences and share knowledge. So you saw these pockets of expertise – there'd be experts in AWS, there’d be experts in OpenStack, and so on, and so forth. Then along came Kubernetes, which said, “Hey! We will create this control plane which can operate across any cloud and you will have a consistent way of interacting with the controller.” Fast forward to today, you build and run an application on your local laptop and run it in production on any cloud, anywhere – with the same set of commands, the same set of interfaces, and you don't really need to learn something brand new just because you're going from a dev environment to a test environment to a production environment. We are kind of living through the dramatic effects of what that standardization has actually unleashed. 
We can see that Kubernetes is pretty much everywhere right now in terms of application talent. You see the same evolution happening in the ML space. We see a lot of fragmentation in terms of how tooling is being done and there is an opportunity to unleash even more machine learning creativity, which is even better for the entire world, by democratizing the power of machine learning through the standardization efforts. 10:07 Alex I would add to that and say, what's been even more challenging about MLOps in particular is – it's not just one thing that's MLOps. I think DevOps is, you test and then you deploy onto your production instances for an application. But for MLOps, we've kind of seen a whole lifecycle – I like to call it “the big eight stages”. If you look at Azure marketing, they call it “nine steps,” in AWS, they call it “four”. But it's essentially, you have data collection, you have data processing, you have feature engineering, you have data labeling, you have the model design, and you have the training optimization, and then the deployment and monitoring steps. It's a lot of different _fragmented_ pieces. Arguably speaking, for the data scientists, you're really manipulating the data and then you're explaining how the model should be trained and then from there, you almost want to hand it off, because in a production environment, you're scaling it out and you have all these various interconnects in terms of telemetry – so metadata, experimentation, logging artifacts.  All of these, in most cases, are independent tools. You very rarely see tools that can bridge across all these different steps. If you even look at things like Kubeflow, the reality is Kubeflow pipelines is the most popular product and that's just a workflow orchestrator. The experimentation system within Kubeflow isn't even that widely used. 
If we even go through the MLOps community Slacks or the Kubeflow, you actually see a lot of people talking about integrating MLflow with Kubeflow pipelines, which almost defeats the purpose of what Google intended when they built out these tools. And it's completely representative, like this idea of mix and match, in a larger enterprise for various reasons. The first reason is, if we look at any – let's call it a bank, or retailing company, or services company that's aiming to be more tech-centric. When we look at their ML teams, it's not one team, i.e. Lyft is a great example. I think they came and spoke a few months back.  Lyft has the ride sharing group and for a long time, they had their own autonomous team. And even within the ride sharing group, there's like three different data science orgs – many of them have different patterns and tooling approaches. And for a platform owner, their focus is, “How do we enable flexibility to do and pick what are the best tools for a use case, while still having that centralized, common-enough experience, and also, more importantly, its governance?” It's also thinking about the ability of tracking all your jobs from spend management and resource allocation that ends up being a lot more of a corporate and decision making process and not so much the traditional “in the weeds of a data science and model”. Standards bring all these fragmented pieces together. 13:01 Vishnu **That makes a ton of sense. And I think it's very interesting to hear the perspective from both sides – with sort of the engineering side really being that, there's all this opportunity to simplify repetitive tasks and enable more time to be spent on actual use cases. Really, from a platform perspective, standardization can allow different organizations to more centrally enable machine learning in different ways, without the diffuseness that happens in a lot of different organizations. 
It looks like Demetrios has a question, so I want to kick it to him.** 13:38 Demetrios **Yeah, I love, how you're saying Sri, the fact that you don't have to know different things for each environment that you're working in when you have that standardization. I know we've seen this a bunch in the community – people talking about how they _do_ need to translate a model, or take it out of a Jupyter Notebook, or do something where they have extra work added on top just because of the process that they've created. I can't remember who (I think it was Hendrik from Tide) I interviewed months ago and he was talking about how he had to change something from R into C, or C Sharp or something. I'm getting this horribly wrong, I'll have to go back and watch it. But they ended up doing some compiler or some translator, like F Sharp that it was, and he just talked about the headaches that that caused. It's really like what you all are trying to get at – to get rid of that, so you can spend less time worrying about that kind of stuff. And then really spend the time doing what you were talking about and enabling the creativity to really blossom and you can see more and more come out of it. So maybe, Sri, can we touch on what's going on at Intuit? How is it… like, how big is the team? How many data scientists are there? I know it's a gigantic team. They're blazing all kinds of new pathways. I see their blog and I love to read what you all are up to. So let's just get the profile breakdown of what it looks like.** 15:38 Srivathsan Sure thing. At Intuit, the data science team is roughly around 250-300 strong with data scientists and around 100-150 ML engineers who support the data scientists in building the models. There are over 400 models in production and most of them were built in the last two, two and a half years. When I joined the team, it was 30… I think 28 models in production – roughly 30. And it just kind of exploded to 400+. We also built a feature store. 
The feature store, I think, was similar to that of the feature store team that was on your podcast a few weeks back. There are over 10,000 features now, I lost track of how many features there are – and most of these are getting shared across the company from a feature store perspective. And we built a platform that orchestrates between data exploration, model training and then feature engineering, followed by model training and evaluation of the model, followed by productionizing the model, and planning it in prod either in an online context, where it’s used directly by Intuit’s applications, or in an offline context, like batch production space. 17:17 Demetrios **That's incredible to see. The one thing that I think about when we're talking about these standards, and we really wanted to focus on the standards within the enterprise – this is something, Alex, I think you were mentioning before we hit ‘record’ – is the importance of it and like… Why is it more interesting…? Or do I have this wrong? Is it more interesting for the enterprise to have this as opposed to just those 1-2 men lone wolf teams, (or woman, 1-2 women lone wolf teams)? Like, why is it an enterprise thing?** 17:59 Alex The reason why, I think, is that in larger organizations, they really feel this pain. It hints back at the idea that I shared earlier, that one company is actually composed of many smaller data science teams. And when these different data science teams have different tooling approaches, it's what many people have called a “tooling sprawl”. I think it’s probably more common in the Kubernetes and classic application stack. But you essentially have five different ways of doing, one telemetry and logging for applications. In MLOps, I think we kind of hit that same spot. There are now many different model servers out there and there are different ways of encapsulating your metadata, whether it’s Weights & Biases or MLflow or, you can pick and choose one of the more common Cloud ML platform solutions. 
Within the context of a large organization, internal standards aim to provide a common interface so you can abstract that boilerplate. I can tell you with very high confidence that the way that SageMaker constructs its training environment, and the way that you would have it in Kubernetes, is almost night and day. SageMaker is going to hotload all your data for you – it’s gonna have very opinionated ways of passing in your parameters for your model training. Versus in Kubernetes, it's almost wild, wild west. It lets you, as the application administrator, figure out how the container gets run. But going back to the data scientist who is responsible for the model code, I don't think they really care. At most, they would like to just learn one; it would be even better if they didn't have to learn any at all. But there's this idea that, as organizations improve their architecture and production systems, e.g. moving from Airflow, because that's what most use for DataOps – and as they get their ModelOps practice, they're adding in Kubeflow pipelines. For the data scientist who has created a model that is now running in production, they're going to need to go ask the ML engineer to rewire many parts of that application to now work for Kubeflow. Or if they're moving to SageMaker, or even have some SageMaker already while other parts of the organization have Kubernetes, this whole “tooling sprawl” means that you have to learn a lot of the individual paginations. I think these new ML orchestrator libraries, these internal standards, have emerged within larger organizations that have hundreds of data scientists, like in Intuit’s case, that Srivathsan leads. These internal standards really provide that abstraction layer that reduces the barriers to entry for a model in prod. And if we can continue to innovate in that, or even get to a spot where – in open source, we're developing these for many use cases across a number of large customers that are trying to scale out their models in production. 
I think there's gonna be a lot of value and I think that's where SGT is playing a role right now. 21:03 Demetrios **So if I'm hearing you correctly, what you're saying is that we have the tooling sprawl in Kubernetes world, and there's many different ways to do things, _but_ they all have that underlying compatibility, I guess you could say. But then when you move over to the MLOps tooling sprawl, everyone is doing it in their own way. So because there are so many opinions, and so many different routes that you can take, that's where you're losing time?** 21:40 Alex It's where the data scientists have a significant extra amount of work that's irrelevant to the actual model code itself. So once I have that model – either a decision tree model or a deep learning model, although I don't think deep learning is as widely used in enterprises just yet – there are the extra layers of binding that model, in terms of figuring out where the data inputs are, making sure that it logs correctly to your metadata service, being sure it is picking up the hyperparameters from your optimizer, and then from there, storing that artifact. That's all what I would consider “boilerplate”. The data scientist hopefully just learns it once, but in reality, as organizations improve their infrastructure and tooling, they're relearning this over and over again. But if we look at the Kubernetes world, there's tools like Helm charts, and there's other very canonical provisioning mechanisms that the SREs and the platform teams then provide to their application engineers. In our landscape, from what I've seen at enterprises, they're already working on these different types of systems. Our effort is, “How can we bring this out and make this commonly used?” I think startups and smaller teams are likely to pick up on this once it's been fleshed out. 
But I really think that the biggest users are teams like Srivathsan’s, where they are already leveraging a platform like SageMaker, and also have X,Y, and Z that he can talk about as well. 23:11 Srivathsan One thing to add in addition to what Alex said – when I joined the ML team, it used to take about… let's say, it took data scientists a couple of weeks to build the model and the actual model code. It used to take them anywhere between four to six weeks to actually see it in production for all the reasons that Alex mentioned. Even though the platform was running on SageMaker, there were a lot of these things that the data scientist had to learn and tie things together. And that's where large enterprises always face challenges in bringing their ML ideas to production. 23:51 Vishnu **Yeah, that makes perfect sense. I think the way that you guys are describing the narrative here resonates a lot with what previous conversations of ours at different organizations, and with different technical leaders, have stated. I think I'm curious now to maybe get into a little bit more of the technical side, which is something our listeners are always asking for and wondering about. So from an organizational standpoint, understanding these toolings, this question of tooling sprawl, this notion of wanting to move towards time to production – how do you see, technically, the standards evolving? How should engineers maybe make use of the work that you guys are doing in terms of standardizing the different interfaces between MLOps tools?** 24:41 Alex That's a loaded question and there's a lot of different ways that we can take this first. I think our core audience today – and this will expand over time – is the teams that are building ML systems and servicing, let's call it, greater than 50 data scientists. At that point, you're not usually just using one tool, and you have a variety of different systems and mechanisms. 
So between the different organizations, primarily enterprise customers, and also a number of startups, and just interested MLOps enthusiasts in the field – we're working on both conversations about “how we can converge different implementations?” in terms of “How do you start a model server?” Or “How do you bind your model artifact to all the different other layers of your MLOps tooling kit (the metadata service, the artifacts store, etc.) in a common experience?” But we're also actually trying to build out some of these libraries. So Srivathsan and I have been collaborating on a project that's called mlctl – in a second, I can let him talk about that further. But to the ML platform teams that are actually facing this: reach out to us and we'd love to have a conversation with you to bring out what your needs are and where we see the common overlaps. In almost every company, if you're greater than, let's call it 25 data scientists, we've seen very consistent problems that are not related to the underlying tooling stack – while I'm sure those still exist – but it's “How do you compose all these different tools together from a common experience?” And across the board, we haven't really seen much innovation in _this_ part of the MLOps training yet. So that's the most tangible piece that ML platform engineers and individuals working in larger enterprises can get from this conversation. Come participate with us, we have a number of different discussions and a very tangible way of how we're trying to solve this with one of these open source libraries that we're working on incubating and pulling out. Srivathsan, do you want to just chat about mlctl? I know it's still fairly early, but we do have a lot of interesting ideas and demos out there, coming out as well. 27:02 Srivathsan Absolutely. Mlctl is our take on how to approach this tooling sprawl and provide a common interface to our data scientists. 
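To make the idea of a "common interface" concrete, here is a minimal Python sketch of one way such a layer could look. It is not mlctl's actual API: `JobSpec`, `Backend`, `SageMakerBackend`, `KubernetesBackend`, and `run` are hypothetical names invented for illustration, and the backends return placeholder strings where a real implementation would call each platform's SDK.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class JobSpec:
    """Hypothetical, platform-neutral description of one ML job."""
    kind: str    # e.g. "train", "deploy", "batch_inference"
    image: str   # container image that holds the model code
    params: dict = field(default_factory=dict)


class Backend(ABC):
    """What every platform adapter has to provide (illustrative only)."""

    @abstractmethod
    def submit(self, job: JobSpec) -> str:
        ...


class SageMakerBackend(Backend):
    def submit(self, job: JobSpec) -> str:
        # Placeholder: a real adapter would call the platform SDK here.
        return f"sagemaker:{job.kind}:{job.image}"


class KubernetesBackend(Backend):
    def submit(self, job: JobSpec) -> str:
        # Placeholder: a real adapter would create a cluster job here.
        return f"kubernetes:{job.kind}:{job.image}"


BACKENDS = {"sagemaker": SageMakerBackend, "kubernetes": KubernetesBackend}


def run(job: JobSpec, config: dict) -> str:
    """Route a job to whichever backend the config names for its kind."""
    backend_name = config[job.kind]
    return BACKENDS[backend_name]().submit(job)


# The data scientist's code stays the same; only this mapping changes.
config = {"train": "sagemaker", "batch_inference": "kubernetes"}
print(run(JobSpec(kind="train", image="fraud-model:1"), config))
```

Under these assumptions, moving training from one platform to another means editing the `config` mapping rather than the job definition.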
Mlctl actually started off as an internal project and is widely used at Intuit by our data scientists to go between the various stages, like model training happens on SageMaker, model batch hosting happens on the Kubernetes model, online hosting happens on SageMaker, where they have a common interface to do all of these lifecycle stages. So what we're working on in the open source is a stripped-down version of the mlctl library, but with adding a lot more capabilities for interacting with platforms like Azure, Kubernetes, and SageMaker, and providing the capabilities so these can interact. There's a lot of interesting things that we are adding to the mlctl ecosystem, including things like, “Hey, can I get started from scratch on an ML model? And can you provide me with an opinionated way to do that?” So we build capability for data scientists, or anyone, to just come in and say, “I want to get started on ML.” Mlctl provides a way to do that. Alex mentioned the word boilerplate – we actually allow for fairly simple boilerplate that can then be imported into a notebook and then starting to innovate on top of that. And we'll also be adding support for the workflow orchestration where there are various pieces and we are brainstorming different ideas of how we would do that. There's also the notion of “How do you publish experimentation metrics?” And as your model training is happening, “How do you publish those to the MLflow [inaudible]?” Also we are also building as part of the ecosystem, libraries that allow for consistent ways of transferring this information, persisting the metadata, persisting the artifacts, into their respective systems, and mediating some of that work. Go ahead, Alex. 29:53 Alex Let me summarize this and explain what got me so excited about this mlctl project that Intuit has been working on. 
As a previous person and vendor – so I'm not at AWS anymore, in case that wasn't clear earlier – when I was trying to figure out how we take the AWS APIs and make it so that an enterprise's data scientists can access them, I always had to go through this middleman, and that is the ML platform team. They act almost like the “gate” – the protection for the data scientists from all the vendors that are trying to annoy them and get them to use their products. And that's okay, you know? I've come to really appreciate how important they are in terms of centralizing the common efforts. What mlctl provides for those data scientists is a common interface, one SDK, that can basically route the individual jobs. And when I say “jobs,” I mean a processing job, a model training job, a model deployment job, a batch inference job. And then, the ML engineers can compose it in a KFP (Kubeflow Pipelines) pipeline or an Airflow pipeline’s DAG that actually runs all those jobs together. It's this idea that the jobs themselves should be portable, because in a multi-cloud setting, that's actually very valid. 76% of compute is still on premise and this was out of a Morgan Stanley survey that came out, I believe in Q2 of this year. And within MLOps, there's just so much to learn. The boilerplate really abstracts a lot of that complexity. And for any startup or any open source tool that's trying to get into an enterprise, they should think about, “What are the different gates that can obstruct it?” Because otherwise, if an open source tool has to go to market with five different data science teams just to really see that they have all of the data science in a large enterprise, that's just a significant amount of effort. It's these common SDKs that have to emerge to simplify that handoff process. 31:54 Vishnu **That’s a really impressive concept. 
This idea almost of a – I don't know if I’d call it a “master library,” but it sounds like that – that sits between where data scientists are trying to do machine learning and where platform teams are almost trying to control machine learning or get a grasp on all the different demands that can come from a decentralized (often) function like data science, where each scientist themselves can be a consumer. Is it correct, Sri, that this project came out of Intuit as an open source project?** 32:32 Srivathsan Yes. Intuit open sourced mlctl and a few other tools in that ecosystem – one is a tool called Baklava. In the open community, we built another tool called Sriracha. Alex was one of the primary drivers of that. So we’ve already started to see how the ecosystem can be built and open, and how we can make more value. 33:02 Demetrios **I’ve just got to give props to whoever named the tool Baklava after I spent the weekend in Greece, eating a lot of Baklava. I thank you for whoever that was that named it and I'll kick it back over to Vishnu.** 33:17 Vishnu **So [laughs] that’s making me hungry – Baklava, Sriracha, mlctl. Okay, and all of these…** 33:24 Alex We’ve got a food thing going on. Omakase – that's coming down the line. A preset menu. 33:30 Vishnu **[laughs] You need one called Pizza to confuse people. Okay, so all of these projects were open sourced out of Intuit. And I'd love to understand, particularly with mlctl, where – I think Sri, you alluded to this with certain demos and the ability to import a model and work with it – where is mlctl? Where are these projects today? And where do you see them going tomorrow? If community members in the MLOps community or elsewhere want to get involved, how can they do that? Where would be the best places to get started with each of these projects?** 34:05 Srivathsan Absolutely. So, mlctl is on GitHub and I can share the link with you. 
We have a Slack channel – that way the community members can come in and start participating. We have a roadmap of enabling quite a few use cases around Azure and KFP pipelines and so on. And we _need_ community participation. We really want to have more people from the community coming in and helping us build that pipeline. It's a very small team of people who are very passionate about this. So we're kind of building it out, but I would really love to have a community around this. So, talk to Alex or me and we can get you started on it. 34:58 Alex I would say if you can't find our emails just add us on LinkedIn, we’re very open to having a lot of these MLOps-related conversations. But let me just take it from my own lens – from a lot of our late night conversations around the world of MLOps. There has to be the standardization framework at some point, whether we call it a “standard” or an “opinionated framework,” I'm open to suggestions as the dialogues unfold. But the piece around gluing different systems together that are fragmented, it just makes too much sense. It just _has_ to happen – at least an effort around that – for a good period of time. Because otherwise, there's just too many tools to manage for any one individual. I think if we were to say six months from now, “Where should this be?” I think it would be premier support for two commonly used ML platforms. So I think right now we're triangulating on SageMaker, Azure ML, Kubeflow pipelines, and then also Databricks. I think those are probably the four that we hear the most across the board, time and time again, “We'd like support for the X, Y, Z.” Surprisingly, GCP hasn't really popped up as high as we thought it might. But between those four, I think that's really where we see a sweet spot right now. 
For the end user, in that six month time period, you should be able to define your job and you can swap out Databricks with SageMaker, or Kubeflow pipelines for orchestration with another tool with minimal effort – i.e., your code doesn't change, it should just be config settings that change. That would be the target state. And I think in a year's time, we'd really like to see enterprises that have resource scale, once again, that 25-50+ data scientists that have multiple tools within their MLOps ecosystem internally – they can use a standardization library. I know, having talked to probably 25+ large enterprises, that they all have some flavor of this. Some of it is minimal capabilities, while others are actually significant, to the point of a mlctl-like feature system. I don't think there's that much value for these teams to build it all themselves because the core tenets are all the same in terms of how we're thinking about this: 1) portability of the underlying infrastructure, 2) great developer experience, really providing abstraction for the data scientist developer, and then 3), I think it's finally this idea of flexibility in terms of how you compose these tools together. The best way to get involved, if that vision resonates with any members of this MLOps community, is to reach out to us. We'd love to talk more about where there's overlap and really see if we can create some common tasks in terms of knocking out use cases that make sense for any of the community members – in their settings – and also in the larger enterprises that we're actively taking this to market with. 37:56 Vishnu **Very, very, very cool, guys. Thank you so much for sharing that about mlctl. I think one thing that really stands out to me – that you guys have reiterated throughout this conversation – is that how we work in ML just has to be… it has to kind of coalesce, right? That process element, that boilerplate element – it's got to look more similar. It _will_ look more similar. 
Going back to what Sri said about how cloud infrastructure ultimately started to have a similar workflow. And I think the idea that mlctl can be one of those tools that helps with that level of coalescing is really great. I think if there's one more question that I have to kick it to Sri, it's “Can you tell us an example of how mlctl helped your workflow at Intuit or how it's in production at Intuit?” So that other companies or other people that are out there who are interested in adopting it, can understand the value to your organization in the past.** 39:01 Srivathsan I’ll take this conversation back to the top, where Alex was alluding to these different groups at Lyft. Similarly, Intuit tends to have different business units and one of them happens to use Airflow, because they love to orchestrate their workflows using Airflow, while the other group loves to use Kubeflow. So now you have these two different technologies that are doing workflow orchestration. However, as a company, we want to make sure that our models are running with a certain level of compliance, and security, and logging, monitoring, and all the good stuff that goes with operationalizing any code. And that's where the platform team comes in and says “Hey! You use the orchestration tool of your choice, but we would like to make sure that these certain basic tenets are followed.” So that's where mlctl comes in. Mlctl is deeply integrated into both the Airflow and the Kubeflow pipelines and it provides a consistent interface no matter where you're coming from. In fact, we have a third group, which is using it from Jenkins – just orchestrating stuff from Jenkins – and it all goes through the same rigor that you would go through if you went through a platform provider to approach your tools. I mean, the ML platform itself provides a nice UI and so on to do all of the MLOps stuff and makes it easy. 
But it also provides a choice by providing this library, which allows you to achieve the same level of productionalization, rigor, and maturity with the tool chain of your choice. 41:01 Demetrios **I'm wondering, Alex, about when you look at the evolution of this, and you laid out your goals, and I'm going back to the Kubernetes world, and I know that they have SIGs in the Kubernetes world. And those SIGs are mainly headed up by different companies who have a stake – more tooling companies, I would say than anything. Do you feel like it's going to adapt into that? Where it's going to be the different MLOps tooling companies who will be leading different SIGs to try and help evolve the different spaces?** 41:38 Alex I can see that definitely being the case. I'd say, what we've seen is a little different in MLOps is – from the perspective of open source, because this is where _all_ this falls under. I think that's an implicit assumption here. There's multiple forms of open source. There's what I would consider open source that is highly tied to a platform – so if you think of the AWS CLI or the Boto3 API – technically speaking, it's open source, but it's directly tied with AWS. There's another version, which is called “exhaust open source,” which is, for example, Metaflow or Feast – it came out of an enterprise that had a specific problem that could not be addressed from the existing vendors and tooling out there. They made it for themselves and they pushed it out into open source to give to other users. There's minimal to some degree of support. Airflow is another great example of this at Airbnb. And then there's what I consider “commercial open source” that's like the Prefects of the world, or Seldon Core from the Seldon model-serving startup. And I think you'll see a confluence of both vendors and startups that are advocating for a vertical to grow, and as a natural extension, their product within that vertical will definitely play a role. 
We have different participants like Seldon actually come to our working group meetings and video with some of their open source projects. I think the other side of this equation is projects like mlctl – there's no vendor or startup behind mlctl. Maybe the answer is “yet.” But the framing here is, for an enterprise that has to leverage multiple tools, they need to have a solution. And they don't necessarily want to maintain all of it themselves, or add all the features themselves because it's a significant amount of effort and ML platform teams tend to be very small relative to the number of data scientists within an organization. I think it'd be a mix of both. And I think SGT aims to really be a place where we focus on MLOps as a nonprofit forum where we can incubate these ideas and projects, and really host an avenue to dive deep into this space. 43:58 Demetrios **That's so cool. Yeah, I've drank the Kool-Aid. [chuckles] I'm fully on board with this vision. I love it. Guys, you know – I mean, we've talked about it a few times and we have the different working groups that are going on right now. We're throwing all of the recordings onto the YouTube channel. So if anyone wants to catch up on what they may have missed, you can go and check those out. And if you want to jump into a working group, they happen once a month on Tuesdays (?).** 44:31 Alex That's right, it's once a month on Tuesdays. It's really an area where different people present ideas and also, now that mlctl has really aimed to be a connection point, I think we'll be seeing a number of different presentations coming into it as well in some of these working groups for the foreseeable future. But really, I think if there's others in the community that have ideas around this – that want to present and discuss this with the group – we'd love to have conversation and really see if there's an avenue for taking new ideas that aggregate the different pieces of the tooling stack together. 
That coalescing concept. 45:10 Demetrios **Yes. So cool. Well, thank you both for coming on here. Thank you, Vishnu as always, for being the incredible copilot. Or I think pilot, maybe, and I'll be the copilot. [chuckles]** 45:23 Vishnu **I don’t think so. [chuckles] Okay. All right.** 45:25 Demetrios **[laughs] So, this has been awesome. And this is one that I've been wanting to talk to you about for a while now, Alex, ever since you sold me on the vision months ago. It's finally coming to fruition and I'm very happy about it. So we got it done. We did the podcast, and you had mlctl to show for it. So it's good that we waited a little while and were able to talk about that. Great work, Sri, and all of the cool stuff that you all are doing. I am _super_ stoked to see, like, where this goes, and to be the biggest cheerleader. [chuckles] So that's about all the time we’ve got for today. I think that is it. Unless anyone else has anything else to say, we're going to end it there.** 46:12 Alex I think MLOps is an early field. And there's a lot of different ideas that we're going to see, whether it be folks like us, who are presenting on this podcast, or in very bespoke different parts of the community – all the way down to random startups that literally went from a small seed round, like Snorkel, to a $1 billion valuation in a little under two years. Very, very quick, rapid development. And I think the fact that this early community is here means a lot of us are exploring and testing new ideas and theses. The biggest thought that always crosses my mind is really longevity to see the space, which will really develop the wisdom within the members who decide to remain active in the community in terms of “How do we really add value for society with ML?” I think that's ultimately the goal here. 
It's not just to take models into production, it’s to move a business metric, which hopefully results in more efficiency or a whole new experience that can drive what the human condition can do, using technology. And thank you for having us, in terms of letting us show our pitch, and we should definitely continue having these types of conversations. Maybe even do like some type of Coffee Session where Srivathsan’s team can show mlctl and its current state in a couple of months. 47:24 Demetrios **Yeah, that would be really cool. That would be very cool. So we will keep that for the next one. So, everyone out here that's listening has to wait patiently. Thanks again, guys. This was awesome.**
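
The target state Alex describes in this conversation, where switching platforms changes only config settings and not code, can be sketched roughly as follows. This is a minimal illustration and not mlctl's actual API: `JobSpec`, `BACKENDS`, `submit`, and the adapter functions are hypothetical names, and the adapters only return a description string instead of calling any real platform.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class JobSpec:
    """Hypothetical, backend-agnostic description of a training job."""
    name: str
    image: str
    backend: str  # the only field that changes when swapping platforms


# Hypothetical adapters: in a real library each one would translate the
# JobSpec into a platform-specific call. Here they just describe the job.
def _submit_sagemaker(job: JobSpec) -> str:
    return f"[sagemaker] training job '{job.name}' using {job.image}"


def _submit_databricks(job: JobSpec) -> str:
    return f"[databricks] training job '{job.name}' using {job.image}"


# Registry mapping a config value to the adapter that handles it.
BACKENDS: Dict[str, Callable[[JobSpec], str]] = {
    "sagemaker": _submit_sagemaker,
    "databricks": _submit_databricks,
}


def submit(config: dict) -> str:
    """Build a JobSpec from plain config and dispatch to the chosen backend."""
    job = JobSpec(**config)
    try:
        adapter = BACKENDS[job.backend]
    except KeyError:
        raise ValueError(f"unsupported backend: {job.backend!r}") from None
    return adapter(job)


config = {"name": "churn-model", "image": "trainer:1.0", "backend": "sagemaker"}
print(submit(config))
# Swapping platforms touches only the config, not the calling code.
print(submit({**config, "backend": "databricks"}))
```

The calling code is identical for both platforms; a platform team could also put shared compliance or logging checks inside `submit` so every orchestrator goes through the same path.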

In this episode

Alex Chung

Build MLOps Interoperability, Social Good Tech

Alex is a former Senior Product Manager at AWS SageMaker and an ML Data Strategy and Ops lead at Facebook. He's passionate about interoperability of MLOps tooling for enterprises as an avenue to accelerate the industry.

Srivathsan Canchi

Head of Engineering, Machine Learning Platform, Intuit

Demetrios Brinkmann

Host

Demetrios is one of the main organizers of the MLOps community and currently resides in a small town outside Frankfurt, Germany. He is an avid traveller who taught English as a second language to see the world and learn about new cultures. Demetrios fell into the Machine Learning Operations world, and has since interviewed the leading names around MLOps, Data Science, and ML. Since diving into the nitty-gritty of Machine Learning Operations, he has felt a strong calling to explore the ethical issues surrounding ML. When he is not conducting interviews you can find him stacking stones with his daughter in the woods or playing the ukulele by the campfire.

Vishnu Rachakonda

Host

Vishnu Rachakonda is the operations lead for the MLOps Community and co-hosts the MLOps Coffee Sessions podcast. He is a machine learning engineer at Tesseract Health, a 4Catalyzer company focused on retinal imaging. In this role, he builds machine learning models for clinical workflow augmentation and diagnostics in on-device and cloud use cases. Since studying bioengineering at Penn, Vishnu has been actively working in the fields of computational biomedicine and MLOps. In his spare time, Vishnu enjoys suspending all logic to watch Indian action movies, playing chess, and writing.