Demetrios is the founder of the MLOps Community. He likes interviewing people and playing improvised songs on the guitar.
What does a Machine Learning Engineer do at Etsy?
This is a project for the MLOps Community to fully understand what different people touching ML do at their jobs. We want to find out what their day-to-day looks like.
From the most granular to the most mundane, they tell us everything! This is our chance to bring clarity to the different parts of MLOps ranging from big companies to small startups. Check out our other posts like this here and here.
Name: Kyle Gallatin (https://github.com/kylegallatin)
Official Title: Senior Software Engineer, Machine Learning
Years in the game (years in the labor market): 6-7
Years specifically working on ML: 5
Direct reports: 0
What was your path into Machine Learning?
I kind of tripped and fell into machine learning (ML) while I was getting my master’s (in molecular biology – total waste). I ended up getting a job as a data analyst at a biotech company.
In addition to menial data entry tasks, my manager asked if I could try using R and Python to analyze our internal data. I started using techniques like PCA and fostered an interest in data science. This led me to attend a data boot camp in NYC (NYC Data Science Academy), and from there it was all ML and software.
What interests you about your current position?
My current position deals with a lot of distinctive challenges in the ML infrastructure space. It’s a blend of software engineering, systems design, and ML – which means I get to develop skills across a wide array of disciplines. Combined with evolving best practices, it’s a space where I get to learn and grow in a number of different dimensions.
In contrast, as a data scientist, I’d often just kind of sit around waiting for models to finish training, only to be subsequently disappointed with the model performance. Recently, one of my colleagues was blocked on her work for an entire day just because there were no cloud TPUs available for her training job. As much as I appreciate some of the inherent downtime of being an ML practitioner, I do not envy the long cycles of iteration combined with the constant uncertainty of delivering any promising results.
What are some things that drive you crazy about your position?
As many folks know, I can be a little bit of an ML pessimist. I think ML in industry needs more of the software engineering mindset and best practices, as opposed to the theoretical mindset that is more prevalent in academia. As someone who came from a data science background, I was subject to (and struggled with) this mindset for a long time as well. However, I’m lucky in that many of the ML practitioners I work with now have both the scientist and engineering skill sets – I definitely work with a number of ML practitioners who are better engineers than me!
What does your company do?
Etsy is the global marketplace for unique and creative goods. We connect creative entrepreneurs from nearly every country around the world with buyers shopping for something special.
In my opinion, applied scientists and ML engineers at Etsy solve some of the most interesting and complex challenges in e-commerce due to Etsy’s diverse catalog of unique content. With millions of unstructured listings, improving search, ads, recommendations, and risk mitigation with ML requires developing and applying state-of-the-art algorithms and systems. Personally, one of my favorite things about working in ML at Etsy is that it drives real, tangible value for Etsy’s buyers and sellers – and as a result, Etsy supports a number of interesting and novel ML initiatives.
What is your team responsible for?
I am on the ML Platform team at Etsy, which supports our ML experiments by developing and maintaining the technical infrastructure that Etsy’s ML practitioners rely on to prototype, train, and deploy ML models at scale. Specifically, I am on the Model Serving squad within ML Platform, which, as the name implies, focuses on our real-time model deployment and inference platform.
What are some use cases you have with ML?
Etsy has many ML use cases, but some of the larger initiatives are around search, ads, recommendations, and risk. Within each of those are more focused initiatives on ranking, retrieval, NLP, computer vision, and many other areas of ML and ML systems. So basically like…everything under the sun.
On the job
What projects are you working on in the next 6 months?
For the next 6 months, I am primarily working on initiatives involving scalable approximate nearest neighbors (ANN) for candidate retrieval (a common component of information retrieval pipelines) in the search, ads, recommendations and computer vision space. Intelligent search at scale often involves multi-stage pipelines due to the sheer number of potential results. For a given query, we first fetch a set of candidates from an index and then apply a more intelligent ML model to score that set of candidates and rank them accordingly. ANN indices allow us to improve the quality of that initial set of candidates at scale over simpler solutions such as KNN or a Solr index.
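The two-stage pattern described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Etsy's implementation: `ToyAnnIndex` is a hypothetical random-hyperplane LSH index standing in for a production ANN index, the embeddings are synthetic, and a plain dot product stands in for the "more intelligent ML model" used in the ranking stage.

```python
import math
import random
from collections import defaultdict


def normalize(vec):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class ToyAnnIndex:
    """Hypothetical ANN index: random-hyperplane LSH.

    Vectors that fall on the same side of every random hyperplane share a
    bucket, so a lookup scans one bucket instead of the whole catalog.
    """

    def __init__(self, dim, num_planes=4, seed=0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]
        self.buckets = defaultdict(list)

    def _key(self, vec):
        return tuple(dot(vec, plane) >= 0 for plane in self.planes)

    def add(self, item_id, vec):
        self.buckets[self._key(vec)].append(item_id)

    def candidates(self, query_vec):
        return list(self.buckets.get(self._key(query_vec), []))


def search(index, embeddings, query_vec, scorer, top_k=5):
    # Stage 1: cheap, approximate candidate retrieval from the index.
    candidate_ids = index.candidates(query_vec)
    # Stage 2: a more expensive scorer ranks only the retrieved candidates.
    scored = [(item_id, scorer(query_vec, embeddings[item_id])) for item_id in candidate_ids]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Synthetic listing embeddings (illustrative data only).
DIM = 8
rng = random.Random(42)
embeddings = {
    f"listing-{i}": normalize([rng.gauss(0, 1) for _ in range(DIM)])
    for i in range(200)
}

index = ToyAnnIndex(DIM)
for item_id, vec in embeddings.items():
    index.add(item_id, vec)

query = embeddings["listing-7"]
results = search(index, embeddings, query, scorer=dot, top_k=5)
for item_id, score in results:
    print(f"{item_id}: {score:.3f}")
```

The trade-off in this sketch is the usual one for approximate retrieval: scanning a single bucket is much cheaper than scoring every listing, at the cost of occasionally missing a true nearest neighbor that hashed into a different bucket.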
What tech do you touch on a daily basis? For what?
For software engineering, I work mostly with Python and Scala. For infrastructure, I mostly work with Kubernetes, Terraform, and a number of Google Cloud services. Across the entire system, I end up working with quite a few build tools, infrastructure technologies, and markup languages. Like honestly – you name it and I’ve probably touched it somewhere in Etsy’s stack.
What are your main responsibilities?
The Model Serving squad builds and maintains the platform that enables ML practitioners to deploy ML models at a massive scale for real-time inference. Using a unique blend of ML, software, and infrastructure engineering, this team enables the efficient and reliable serving of the most impactful ML models at Etsy used for powering search, advertising, and recommendations. By partnering closely with ML teams and building state-of-the-art tooling, the Model Serving squad helps enable fast iteration on complex models that drive a better experience for both buyers and sellers on Etsy.
What do your days consist of?
Most of my week (65%?) is probably spent programming or working directly with technology in some way. This includes developing new features, fixing bugs, but also setting up build pipelines, terraforming infrastructure, investigating the root causes of errors or performance degradation, running performance tests, and other such activities. Some 15% is probably spent writing documentation or architectural design documents.
Finally, I’d say there are about 20% meetings. Most of these are agile ceremonies like standup, retro, and sprint planning. However, there are always additional meetings specific to various initiatives or discussions.
What was the last bug you smashed?
Lol, are we still saying we “smash bugs”? I think the last bug I fixed was an issue with ANN index creation in one of our Python codebases as a result of a Google API upgrade. “Smashed” is probably a little too aggressive a term for the amount of effort and scope of the fix. Most of the bugs I troubleshoot, however, are in our real-time model serving platform or individual deployed models.
What are you most proud of in your current position?
It’s a bit high-level, but I’ve been proud of how much I’ve been able to improve my technical domain expertise over the past ~2 years in my current position. I’ve gotten to supplement my half-decent communication skills with lots of lower-level software engineering and systems engineering experience I didn’t have previously. The result is that I’ve become a more well-rounded software engineer, which has made me significantly better at delivering value in the ML infrastructure space.
What did you expect the job to be like vs reality?
I think my current role exceeded my expectations in terms of the technologies I’d be exposed to and the experience I would get. Sure, some days I’m just editing a bit of YAML or Jsonnet and don’t feel like much of an engineer – but I also get to build and work with ML systems at a massive scale – and you can’t beat that kind of hands-on experience.
What are some things you enjoy most about your current position?
There’s a lot to love. In general, I love the people I work with, I love my office, and I love working at a company where ML drives tangible value – and really helps to connect Etsy buyers and sellers. More specifically though, I really do love working in the ML infrastructure space. Our teams need to have a unique blend of ML, software engineering and infrastructure skills that I think can be hard to come by or otherwise develop. The breadth of content to learn and techniques to implement is massive, and in that regard, I’d be hard-pressed to ever get bored in this position.
What kind of metrics do you follow closely?
The ML platform team is in the Machine Learning Infrastructure, Platform and Systems (MIPS) org within Etsy. This team is classified as an enablement team. We serve ML practitioner teams internally, and as an enablement team, our success is determined by the success of those internal customers. Some of our common KPIs are things like the number of experiments an ML practitioner team was able to run on our platform or the amount of time it takes an ML practitioner to bring a model from an idea to an online experiment. Any metric that captures the effectiveness of our customers using our platform is a valuable way to capture the value we drive as a team.
Can you talk to us about how you interface with the ML teams?
As a centralized group that enables ML, the MIPS team builds infrastructure and takes user requests from all ML practitioner teams within Etsy. The requests we get vary widely depending on the ML infrastructure or system in question. Folks could request platform features, guidance on leveraging ML features available in our feature service or feature streaming service, general help training or deploying a model to production, or low-level support on debugging code infrastructure issues. Given our organization’s wide range of expertise, we consult and build just as wide a range of ML tooling. We typically source and prioritize these requests via Slack, but will also have meetings with individual teams (or multiple teams) during specific initiatives.
Of course, a key challenge can be developing generalizable, reusable tooling that meets the needs of all our customer teams and ML practitioners. As I learned during our initiative to improve ML observability, even the ML ranking use cases use different metrics and have different workflows, tech stacks, and sometimes even data formats. Down to the individual applied ML scientist, folks have different skills and preferences that merit different layers of abstraction or what would constitute an “optimal” ML platform interface. We need to think critically about the tooling we develop to ensure it meets the current and future needs of all of our users.
What do you wish you knew before getting into ML?
That if ML is not driving some kind of concrete value within your organization, it can easily end up as a total waste of time and effort. Not every technical problem is an ML problem just because you want it to be, and there is nothing more wasteful than abusing ML for something that is fundamentally a software problem. In other organizations, I’ve seen folks squander fairly large budgets on various ML proofs-of-concept with almost nothing to show for it. You can see my entire pessimistic rant on ML here, but I really wish that I (and every ML practitioner) had been taught software engineering fundamentals before ML fundamentals. I think this would’ve made me far more effective at delivering good software. Fortunately, organizations like Etsy do a great job of defining and measuring the impact of ML on our ability to actually move the needle – which in our case means connecting buyers and sellers. One of Etsy’s guiding principles is to “minimize waste”, which I think aligns really well with my perspective on ML in an enterprise setting and ensures that we don’t needlessly squander resources that could be spent improving our marketplace for those who depend on it.
Any random stories from the job?
I think one of my funnier screw-ups was the time I took down our entire dev Kubernetes cluster. I was experimenting with a tool to scale our workloads to 0 replicas overnight and deliver significant cost savings. Since it was dev, I threw caution to the wind, removed a --dry-run flag, and started playing with it. Little did I know the default timers would scale everything to 0 (with the exception of core Kubernetes workloads) across the whole cluster. Ingress, ML deployments, observability tools like Prometheus – all 0 replicas and all completely useless. It only took me an hour or so to get everything back up, but it was a good lesson. Since there was no customer impact (and again – it was dev) on that specific day, I still find it pretty funny.
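The lesson in that story can be sketched as a small guard: default to dry-run and scale only namespaces that are explicitly allowed. This is a hypothetical illustration, not the tool from the story. `Workload`, `scale_to_zero`, and the namespace names are all made up, and nothing here talks to a real Kubernetes API.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    namespace: str
    name: str
    replicas: int


def scale_to_zero(workloads, allowed_namespaces, dry_run=True):
    """Return the planned changes; replicas are modified only when dry_run is False."""
    planned = []
    for workload in workloads:
        # Skip anything outside the explicit allowlist or already at zero.
        if workload.namespace not in allowed_namespaces or workload.replicas == 0:
            continue
        planned.append((workload.namespace, workload.name, workload.replicas, 0))
        if not dry_run:
            workload.replicas = 0
    return planned


# In-memory stand-in for a dev cluster (illustrative names only).
cluster = [
    Workload("ingress", "nginx", 3),
    Workload("monitoring", "prometheus", 2),
    Workload("ml-sandbox", "toy-model", 4),
]

# Dry-run is the default, so this call only reports what would change.
plan = scale_to_zero(cluster, allowed_namespaces={"ml-sandbox"})
for namespace, name, before, after in plan:
    print(f"[dry-run] {namespace}/{name}: {before} -> {after}")
```

With an allowlist, ingress and observability workloads are never touched unless someone names their namespaces on purpose, and the destructive path requires passing `dry_run=False` explicitly.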
The Model Serving squad is, has been, and will be at war with the efficient serving of increasingly complex models for some time. As our ML practitioners build larger and larger models, there’s more pressure on our team to support infrastructure that can scale these workloads. Of course, there’s a limit to what infrastructure can do in the face of an increasing number of features, transformations, request size, parameters, and model complexity – we can’t exactly overcome Moore’s Law with expertise alone. That being said, the ML Platform team is investing a lot in observability and automated tooling to help folks identify and solve potential bottlenecks in their model-serving solutions.
Who do you admire?
Danny DeVito. But I’d also say some of the people I admire most in the ML world are folks like Chip Huyen, who consistently puts forward amazing, industry-aware ML content that speaks directly to the way I think about the ML world (she’s just smarter and better at it). I also admire a ton of my co-workers. I’ve met many folks at Etsy who dwarf my intellect, emotional intelligence, communication skills, and domain knowledge – and I’m lucky to have had the opportunity to do so. My squad in particular is composed entirely of folks I vehemently respect (if you can “vehemently respect” someone) and that I love working with. Shoutout to Harshita Meena, Hassan Shamji, Sallie Walecka, Derrick Kondo, Rob Miles, and our current intern Rahul Dharani for being an absolutely amazing team with a wide breadth of talent.
Where do you want to take your career next and why?
Right now, I want to focus on some of my relatively weaker engineering competencies (such as domain expertise and low-level technical knowledge) and continue to refine my existing skills to move up the IC ladder. Although I presume the title of “Staff Engineer” might still be some ways away for me, I am actively trying to think about the kind of Staff Engineer I want to be and what skills, in particular, I need to acquire and demonstrate to get there.
What advice do you have for someone starting now?
This may be specific to me, but I would’ve just been a software engineer from the start as opposed to my roundabout trip through data science and ML. It definitely gives me a unique perspective, but I’ve certainly had to play catch-up in the realm of software development. My “hot take” is you could give a few experienced software engineers a one-month crash course in ML and have a group of extremely effective ML engineers – whereas teaching software engineering to someone who only knows ML will take significantly longer.
So my advice is learn to code. Learn data structures and algorithms. Learn systems design, and learn it all first. When you come back to modeling, you’ll view it differently – and it’ll be a cakewalk (is that a term? where is it from? it sounds odd…).