SRE is making its way into the machine learning world. Software engineering for machine learning requires reliability, performance, and maintainability, and site reliability engineering (SRE) is the discipline concerned with keeping systems reliable and performing consistently in real time. Niall Murphy, most recently Global Head of SRE at Microsoft Azure, helps us understand what SRE can do for modern ML products and teams. Building machine learning teams requires a diverse set of technical experiences, and Niall shares his thoughts on how to do that most effectively. He also walks through SRE best practices, such as service level objectives (SLOs), that machine learning organizations should start adopting. Production machine learning depends on high-quality software engineering, and we get Niall's take on how to ensure that in a machine learning context.
In this episode
Niall Murphy has been interested in Internet infrastructure since the mid-1990s. He has worked with all of the major cloud providers from their Dublin, Ireland offices, most recently at Microsoft, where he was global head of Azure Site Reliability Engineering (SRE). His books have sold approximately a quarter of a million copies worldwide, most notably the award-winning Site Reliability Engineering, and he is probably one of the few people in the world to hold degrees in Computer Science, Mathematics, and Poetry Studies. He lives in Dublin, Ireland, with his wife and two children.
Demetrios is one of the main organizers of the MLOps Community and currently resides in a small town outside Frankfurt, Germany. He is an avid traveller who taught English as a second language to see the world and learn about new cultures. Demetrios fell into the Machine Learning Operations world and has since interviewed the leading names in MLOps, Data Science, and ML. Since diving into the nitty-gritty of Machine Learning Operations, he has felt a strong calling to explore the ethical issues surrounding ML. When he is not conducting interviews, you can find him stacking stones with his daughter in the woods or playing the ukulele by the campfire.
Vishnu Rachakonda is the operations lead for the MLOps Community and co-hosts the MLOps Coffee Sessions podcast. He is a machine learning engineer at Tesseract Health, a 4Catalyzer company focused on retinal imaging. In this role, he builds machine learning models for clinical workflow augmentation and diagnostics in on-device and cloud use cases. Since studying bioengineering at Penn, Vishnu has been actively working in the fields of computational biomedicine and MLOps. In his spare time, Vishnu enjoys suspending all logic to watch Indian action movies, playing chess, and writing.
David is one of the organizers of the MLOps Community. He is an engineer, teacher, and lifelong student. He loves to build solutions to tough problems and share what he learns with others. He works out of NYC and loves to hike and box for fun. He enjoys meeting new people, so feel free to reach out to him!