Coffee Sessions #86

Building ML/Data Platform on Top of Kubernetes

When building a platform, a good start would be to define the goals and features of that platform, knowing it will evolve. Kubernetes is established as the de facto standard for scalable platforms but it is not a fully-fledged data platform. Do ML engineers have to learn and use Kubernetes directly? They probably shouldn't. So it is up to the data engineering team to provide the tools and abstraction necessary to allow ML engineers to do their work.   The time, effort, and knowledge it takes to build a data platform is already quite an achievement. When it is built, one has to maintain it, monitor it, train people to on-call rotation, implement escalation policies and disaster recovery, optimize for usage and costs, secure it and build a whole ecosystem of tools around it (front-end, CLI, dashboards). That cost might be too high and time-consuming for some companies to consider building their own ML platform as opposed to cloud offering alternatives. Note that cloud offerings still require some of those points but most of the work is already done.  

Take-aways

* Things to know before building a platform  * ML engineers should not have to learn Kubernetes Other potential topics, side note: reliability costs money (read replica, cost of networking, monitoring) logging and distributed systems, please don't do it. Architecture decision review, why those decisions were made. Speed (the time it takes to make a change) is the one factor to measure developer experience matters Infrastructure-as-code is harder than we think multi-cloud is nice until we do the security/IAM management part number of dev/teams vs number of customers for impacting the cloud bill definition of an error, when things go wrong or not really mono repo is great staging clusters are mostly useless

In this episode

Julien Bisconti

Julien Bisconti

Site Reliability Engineer in Data Infrastructure, Spotify

Julien is a software engineer turned Site Reliability Engineer. He is a Google developer expert, certified Data Engineer on Google Cloud and Kubernetes Administrator, mentor for Woman Developer Academy and Google For Startups program. He is working on building and maintaining data/ML platform.

Twitter

LinkedIn

Demetrios Brinkmann

Demetrios Brinkmann

Host

Demetrios is one of the main organizers of the MLOps community and currently resides in a small town outside Frankfurt, Germany. He is an avid traveller who taught English as a second language to see the world and learn about new cultures. Demetrios fell into the Machine Learning Operations world, and since, has interviewed the leading names around MLOps, Data Science, and ML. Since diving into the nitty-gritty of Machine Learning Operations he felt a strong calling to explore the ethical issues surrounding ML. When he is not conducting interviews you can find him making stone stacking with his daughter in the woods or playing the ukulele by the campfire.

Vishnu Rachakonda

Vishnu Rachakonda

Host

Vishnu Rachakonda is the operations lead for the MLOps Community and co-hosts the MLOps Coffee Sessions podcast. He is a machine learning engineer at Tesseract Health, a 4Catalyzer company focused on retinal imaging. In this role, he builds machine learning models for clinical workflow augmentation and diagnostics in on-device and cloud use cases. Since studying bioengineering at Penn, Vishnu has been actively working in the fields of computational biomedicine and MLOps. In his spare time, Vishnu enjoys suspending all logic to watch Indian action movies, playing chess, and writing.