Text embeddings are very popular, but there are plenty of reasons to be concerned about their applications. There are concerns around algorithmic fairness and compute requirements, as well as issues with the datasets they're typically trained on.
In this session, Vincent gives an overview of some of these properties while also talking about an underappreciated use-case for the embeddings: labeling!
ML systems can still totally fail. Maybe we should be less optimistic about 'em.
**All right, Skylar. We just had a conversation with Vincent, formerly of Rasa, now at Explosion. I keep confusing it – either Explosion or Explosive – which is a data quality company that was started by the same people that did spaCy. So they've got a little bit of a track record. But dude, I'm blown away by Vincent. I mean, how he's able to do everything that he does is beyond me.**
**I agree. He's a prolific speaker, blog post author, Python package pusher – across the data science space. And he's put out a lot of great work there.**
**So much good stuff, man. And I really appreciate it. I mean, we got right into it. We talked about how machine learning fails, why it is a problem to over-optimize for certain metrics. He's got so much out there that I felt like we could have talked to him for another two hours and not really scratched the surface of everything that he talks about. But I really appreciate his way of looking at the whole space. Some of the key takeaways for me – this idea of trying to… I loved his fraud example. He gave two fraud examples. But I love the fraud example of thinking, “Oh, yeah. I've got this model and it was trained on some very easy fraud examples. And then I put that into production and it catches some fraud. So I'm great. I'm getting a raise. I'm the good guy. Everything's good.” But then you start to question, “Well, yeah. But is there more that we could do here?” So that was one, but I know you two kind of geeked out there a lot on recommender systems and stuff. What were some of your takeaways?**
**Yeah. So I think the broad takeaways that I had – really, the underlying themes here were: if you're going to apply machine learning, first start with understanding the system and the data. That's what's going to help you make the right kinds of assumptions that will actually hold later. Once you have that, that helps you pick the right metric or metrics for you to focus on. And lastly, now that you can make a lot of assumptions that will hold, you can work with simpler tools and you don't have to jump straight to the very complex all the time. I think a combination of this allows you to better iterate and create things that don't have unnecessary complexity. And we geeked out on recommender systems, like you said, because this is a place where a lot of those failures pop out, because we're applying very complex techniques. And so I very much appreciated that vein of understanding the system and keeping things as simple as you can.**
**So he was pretty candid when we asked him about the cliché – what is it? – data-centric AI these days and data-centric ML. I loved how he referenced his current boss, Ines, who said “Iterate on data, not on models.” I think that is _so_ much better than data-centric AI. Data-centric AI just sounds like marketing to me. So this “iterate on data, not models” is very clear and I loved that. I hope everybody enjoys it. We're gonna jump into the conversation right now. Just as a side note, Skylar and I will both – I mean, Skylar took the lead on this – but we're going to be doing something that we have lovingly named the “pancake stacks roundtable discussions” where we get together – and it is not recorded at all. And we're just deep diving on certain people's choices and design decisions that they've made. So if you want to get involved in these (they’re monthly talks that we're having behind closed doors) get into the channel on Slack called “pancake stacks”. Without further ado, we're going to talk with Vincent. [intro music]**
**I want to ask the both of you – starting with Vincent – what was your best purchase since the pandemic has started?**
Um, I like to think the mic, definitely. I mean, you're doing Zoom meetings more, so a really good mic tends to make a big difference. Other than that, I bought the cheapest Wacom tablet where you can still see where you're drawing. I will say that's proven to be amazing just for writing better GitHub issues, because you can add images and stuff. I don't know which one of those two I would pick, but either one of those two.
**What about you, Skylar?**
**Probably similarly, these Bose headphones. They do a pretty good job of filtering out background noise, so I can take meetings even in noisy coffee shops.**
**So, mine was a milk foamer, in case you didn't catch that. That was my favorite purchase. Now, let's talk shop. Vincent, I know that you've got _a lot_ to talk about. I mean, you're a prolific man. You’ve got a ton of stuff on two websites. Actually, I don't know how many websites you have now, but you collaborate with all kinds of people like spaCy and PyData. You've got your really cool different websites, one of which – the one that sprung the question I asked earlier – is thismonth.rocks, which you started at the beginning of the pandemic, and it's just a place to get inspired to do different and new things in a month, and learn new skills when you're stuck inside.**
**I just want to – let's just get into it, man. Because I know you think and you talk a lot about how machine learning can fail. And that's really what I want to center this conversation around. You have some awesome blog posts – one which is a little bit heart wrenching, because I've been in your situation, when you talked about having a stillborn, and then continually getting ads for baby stuff after the fact and how that is a really bad use case in how machine learning fails on us on some of that. Then you extrapolate it out in different ways that it can fail with people like alcoholics. So maybe we can just start there. What drives you to get into how machine learning can fail?**
It's a bit of an elaborate question, but… okay, part of me thinks, “Hey, if we understand how things fail, it's also easier for us to figure out how things can be fixed.” So naturally, I think if you're building systems, understanding failures is just part of your job, in a way. But what I've mainly noticed is that if you go to conferences and you're looking for, “what are the better tech talks?” usually seeing someone on stage giving an epic demo of something working – that's interesting. But it's usually much more interesting to sort of pay attention to someone that says, “Hey, this website went down for a moment – here's what happened and here's how you can prevent this from happening on your side.”
I've always found those stories to be just a little bit more inspiring and also just a little bit more of a take-home message – something that you can actually apply in your day-to-day. So that's typically also what I try to focus on in my blog. Whenever I see a common practice that can still break, I think that's a very interesting thing to just sort of zoom into. One thing I should say in general, though: whenever I talk about these somewhat critical topics, as long as you're not going on a rant but just trying to be honest but strict, people tend to be quite receptive. The blog post that you mentioned earlier – the website in question has actually reached out to me. Their senior science tech lead actually came up to me and said, “Hey, Vincent, you're right. These are things that we'd love to dive into a bit more, because this is almost a poster child of stuff that could go wrong that we didn't intend to go wrong.”
I'm actually speaking there (I think in a week or two) with our engineers about this. So I've actually found, as long as you can… don't rant – that's the bad path – but as long as you can clearly point out that this is a failure scenario, odds are that people will listen and also just go on a fixing path. So that's kind of the vibe that I'm also going for whenever I'm doing tech talks at PyData. I try to be critical, but the intent is that we fix things and that's also why I like failures. Because if you talk about them, it becomes easier to fix them.
**Yeah, that's super interesting. It's warming to hear that they reached out to you and they're working to get that fixed. It's very cool that they were able to discover your blog post and find that. One thing I'd love to dig in a little bit more is – I think a lot of your talks kind of center around failure, but failures specifically around having complex systems. So I’d love to just kind of dig into how you keep systems simple. I think a lot of your talks tend to have this running theme of “We can do things simpler.” And so I'd love to just hear you unpack that a bit. How and why do you keep things simpler?**
I mean, the simple answer is: just don't introduce any complexity unless you need to. But I guess a more overarching goal is… a bit of a weird story, I guess. My background is in operations research. My college degree is in econometrics, but the major is operations research, which is this field of math where people try to find optimizations. And if you get good grades in that, then you understand one common theme, and that is that you can optimize a system to, like, infinity – in the sense that your factory is allocated such that you're most effectively using all the raw goods that come in. But the problem usually with those systems is that the price of raw goods fluctuates.
You can hyper-optimize towards a given price, but the lesson of operations research is that if it deviates even a little bit, then you've got a non-optimal system. The moment that you understand that, “Hey, optimality is a moving target,” then you’ve got to wonder, “Okay, can I just understand the system then? Because then, if I understand the system, it's easier to make assumptions that actually hold.” I like to think that I'm not necessarily going for like, “Oh, let's keep things super, super simple.” I think I'm a little bit more in favor of, “Let's just try to understand what we're doing. Because usually, if we understand what we're doing, we can get by with a heuristic.” If nothing else, that's usually a good benchmark to start with.
I think it's perfectly fine if you start with a simple system to predict – I don't know – bad behavior in an online chat or something like that. Let's say you're eBay and the conversation back and forth seems to be performed by bots. Okay. There's probably some heuristics that you can start with. And if maybe later down the line, we find that a neural system works better, I think that's great. But you should probably start with a simple thing first, if only because you can test your assumptions, you can have a good baseline, and then you have something to build _on_. What I'm less of a fan of is that people sort of say, “Oh, let's try out all these algorithms that require 16 GPUs [chuckles] and hope that we sort of magically end up in the place where I want to be.” That seems like a bit of a silly approach to me. But then again, it also depends. But I prefer to start simple, because I want to understand the system. That's usually the gist of it.
**Yeah, that definitely makes a lot of sense, and it definitely resonates with me. One of the talks that I was thinking of when I asked the opening question is – you had a talk titled “Playing by the Rules-Based System.” And actually, that's the talk that led me to recommend you to come on this podcast. But I'd love to just maybe have you speak a little bit to that for folks who maybe haven't seen it but are watching this. You talked a little bit about, “Hey, let's start with heuristics, based on an understanding of the system. Maybe later, we’ll move on to something more sophisticated that works better.” But I feel like you presented a very interesting technique, or methodology, in that talk. So, would you care to speak a little bit to that?**
Sure. There were two examples, if I recall correctly, in the talk. There was the one with detecting programming languages, and there was another one concerning fraud. I'll start with the fraud one, because I think that's the funnier one. So, suppose you work at some financial institution and management comes up to you and says, “We need a system against fraud.” Then, without looking at any data, you can probably come up with a couple of things that are pretty good to just check. So if you're younger than 16, but you earn over a million dollars a year, then I don't really need an algorithm to say, “Well, let's check that.” I mean, it's kind of fishy. If you have more than 20 bank accounts and you're a private individual, and you have five addresses spread over three countries. “Okay, yeah. Let's check that out.” We don't need an algorithm. In fact, it might be better not to start with an algorithm here because algorithms need a person who lives in five different countries with 16 different addresses to be in the dataset in order to be able to learn from it.
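Those obvious checks are literally just if-statements. A minimal sketch of that triage idea, with made-up field names and the thresholds from the story (not from any real fraud system):

```python
# Hypothetical rule-based fraud triage: obvious cases get flagged for
# review directly; everything else can fall through to a model (or a
# human) later. Field names and thresholds are illustrative only.

def obvious_fraud_flags(account: dict) -> list:
    flags = []
    if account["age"] < 16 and account["yearly_income"] > 1_000_000:
        flags.append("minor with very high income")
    if account["n_bank_accounts"] > 20:
        flags.append("unusually many bank accounts")
    if account["n_addresses"] >= 5 and account["n_countries"] >= 3:
        flags.append("addresses spread over many countries")
    return flags

suspicious = {"age": 14, "yearly_income": 2_000_000,
              "n_bank_accounts": 1, "n_addresses": 1, "n_countries": 1}
ordinary = {"age": 40, "yearly_income": 55_000,
            "n_bank_accounts": 2, "n_addresses": 1, "n_countries": 1}

print(obvious_fraud_flags(suspicious))  # ['minor with very high income']
print(obvious_fraud_flags(ordinary))    # []
```

The point is that no labeled dataset is needed for any of this, and each flag is trivially explainable to a reviewer.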
My proposal here was: maybe what we need is a couple of these simple if/else statements where – if this ever happens, we know for sure that we can trigger the ‘fraud schedule,’ so to say – and only when the obvious cases don't trigger, then we can resort to maybe a machine learning-type model. But the benefit is that this is a well-understood system. You can just say “if this, then that.” That's pretty easy to code up, it's pretty easy to maintain, and if something goes wrong, that part is relatively well-understood. But the funny thing is, you can do these sorts of things in NLP too. Usually, it's a bit different. The other example was – I have this spaCy course online that I made in collaboration with the Explosion people.
The thing I'm trying to solve there is – I'm trying to detect programming languages in text. A heuristic that you could use there is, you could say, “Well, Go is a really hard programming language to detect because the word ‘go’ is in the top 10 most frequent words in English. 99.9% of the time, it's not about a programming language.” However, that word is usually meant as a verb, and spaCy can detect parts of speech. So what you _can_ do is say, “Hey, if I ever see the word ‘go’ and it's a noun – that's kind of a rule.” I mean, part-of-speech detection is also a hard problem, so spaCy doesn't always get it correct. But I can say something like, “Okay, if ‘go’ is a verb, then it's not about the programming language.” It's also a pretty easy rule that I can just go ahead and add. And I'm a pretty big advocate of starting with these kinds of systems, mainly because they're nice and well-understood, they're pretty easy to debug, and they're also kind of easy to explain to upper management, usually, because you don't have to be a tech wizard – you don't have to appreciate machine learning so much.
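In spaCy itself you'd express this rule with token attributes from a trained pipeline. As a rough stand-in that doesn't need a model download, the same heuristic can be sketched over hand-supplied (word, part-of-speech) pairs:

```python
# Simplified sketch of the "'go' as a noun" heuristic. Real code would
# use POS tags from a spaCy pipeline; here the tags are supplied by hand.

def mentions_go_language(tagged_tokens):
    """Flag 'go'/'golang' only when tagged as a noun-like token."""
    for word, pos in tagged_tokens:
        if word.lower() in {"go", "golang"} and pos in {"NOUN", "PROPN"}:
            return True
    return False

# "I program in Go" -> 'Go' is a proper noun: likely the language.
print(mentions_go_language([("I", "PRON"), ("program", "VERB"),
                            ("in", "ADP"), ("Go", "PROPN")]))   # True
# "Let's go home" -> 'go' is a verb: not the language.
print(mentions_go_language([("Let's", "VERB"), ("go", "VERB"),
                            ("home", "ADV")]))                  # False
```

Because POS tagging itself is imperfect, this rule is a precision-oriented filter, not a complete detector – exactly the kind of baseline the talk advocates.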
So that talk was really about advocating this line of work. I made a package called Human Learn that makes it easy for you to build these rule-based systems for scikit-learn, so that you can use grid search and all that. And recently, I've been adding some more features to that library such that you can use things like visual UI elements to do more machine learning stuff with. The next version will also have more support for what I like to call ‘looking at machine learning more as a UI problem’. Because part of me thinks, if you understand the patterns in your data quite well, maybe you don't need the machine learning algorithm – maybe you just need to figure out better ways to find good heuristics.
**Wait, can we double click on that real fast? Machine learning as a UI problem?**
Sorry, I might not know what the slang means here – what does it mean to double click on something? As in you want to zoom in on what I mean by that?
**Yeah, sorry. Zoom in. **
Sure. Actually, there is a demo if you're interested – there's a demo on the website. You've probably seen the Iris dataset where there's like a red blob and a green blob, and like a blue blob somewhere. It's a great machine learning demo. What you can also do is just draw a circle around it and everything that falls within the circle you call ‘blue class.’ So that's _a_ thing that Human Learn currently allows you to do. But there's also some tricks you can do with parallel coordinates charts. That's a really hard thing to explain on the podcast, though. It's a really interactive visualization with lots of widgets. On Calm Code I have a really cool demo with it. But for podcasts, I'm afraid I'm gonna have to maybe refer you to the video, because that one's really hard to explain without a visual aid.
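For readers following along in text, the "draw a circle around the blob" idea can be sketched in a few lines. This is not the Human Learn API – just an illustration of the concept, where the centre and radius stand in for what a UI would let you draw:

```python
# Sketch of a hand-drawn-region classifier: points inside the circled
# region get the class, points outside do not. Centre, radius, and
# labels are made-up stand-ins for a drawing done in a UI.

def make_circle_classifier(cx, cy, radius, label):
    def predict(points):
        return [label if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2
                else "other"
                for x, y in points]
    return predict

# Pretend the user circled a blob around (5, 1) with radius 1.5.
classify = make_circle_classifier(5.0, 1.0, 1.5, "blue class")
print(classify([(5.2, 1.1), (1.0, 4.0)]))  # ['blue class', 'other']
```

The "model" here is just the geometry of what the human drew, which is the sense in which machine learning becomes a UI problem.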
**Yeah, no worries. We'll put a link to that in the description for anybody that wants to go and have fun and play around with it. But the general theme is just – instead of trying to code something, you're interacting with the data in a way that you wouldn't necessarily interact with when you're writing different lines of code? Is that what it is?**
It's a little bit more like, “Hey, what are just some common patterns in my data? And if I understand those patterns, maybe I can just figure out a rule.” The fraud might be a nice example. So for example, “Hey, most people that have a median income and one bank account are probably fine.” So, yes, I can look at the data, but I can do a couple of SQL queries – I could figure this out in, like, 50% of my data, maybe. That's also _a way_ of approaching your first machine learning model that I've just noticed in practice is very interesting. But another way of looking at it – and this is also kind of related to the data quality – the trick that I like to think has been moving my career forward is: maybe it helps to consider two models. Just as a thought experiment, let's say we have one model that is the very complex model. Like the gradient-boosted tree, deep learning whatsawhoseit hyper-parameter – that thing. And then we have another model, which we'll just consider to be the relatively simple model. Maybe logistic regression or something with heuristics that I just mentioned – something with very much domain knowledge.
In the case of spaCy with the Go example, I had this rule – I have this rule-based system that can detect the Go programming language – and I also have a deep learning system that's learning from labels, let's say. Now, what's very interesting to do is to train those two models and to see where they disagree. Like, just look at the examples where both models are – they're reasonably good, both models have a way of working and they arguably are not bad – but if two models that are good disagree, something interesting is happening. And sometimes it's because you find some bad labels, which are good to fix. But it's also sometimes that the deep learning model has figured out something that my rule-based system hasn't. And in the case of Go, I've noticed, sometimes people write version numbers as well, for example in StackOverflow questions – like ‘go’ and then a number to indicate it’s go version 7 or something.
So sometimes, you can also look at the data and sort of say, “Oh, wait, there's another heuristic that I can add to my rule-based system.” By that time, you'll have better labels, you'll have better rules, and by golly, you'll have version two of your deep learning system and version two of your heuristics. From there, you can once again see why they disagree. And this is a thing you can kind of just keep looping and keep iterating on and I've just noticed this is way better than grid search in terms of getting a proper model that actually can do something meaningful in a business. These kinds of approaches, where it's a little bit more iterating on data and trying to figure out the patterns in it, I honestly think that there's a good chunk of future data science if you work more along those kinds of methods.
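The disagreement check at the heart of this loop is tiny. A hedged sketch, assuming you already have predictions from both a rule-based system and a learned model on the same examples (the example texts and predictions are made up):

```python
# Surface examples where two otherwise-decent models disagree. Those
# rows are where you look for bad labels, or for a pattern one of the
# systems has missed.

def disagreements(texts, rule_preds, model_preds):
    return [(i, text, r, m)
            for i, (text, r, m) in enumerate(zip(texts, rule_preds, model_preds))
            if r != m]

texts = ["go 1.7 is out", "let's go outside", "golang rocks"]
rule_preds = [False, False, True]   # rule-based system missed "go 1.7"
model_preds = [True, False, True]   # learned model caught it

for i, text, r, m in disagreements(texts, rule_preds, model_preds):
    print(i, text, "rules:", r, "model:", m)
```

Each disagreement either fixes a label or suggests a new rule (here: "go followed by a version number"), which is what makes the loop iterative.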
**You know, one of the interesting things you just briefly touched on, you said, “When these two models disagree, it could be because there's maybe bad labels and whatnot.” I think, in my experience, that is a very common problem. Even detecting bad labels can be sort of difficult. And I think, especially a lot of new data scientists, maybe kind of take their labels at face value and just assume that these are correct. So, I would love to just hear more about your experience of detecting bad labels. I know you have another package, too, to do this.**
Yeah, that was package number 16. [chuckles] I think.
**[cross-talk] There was a great talk where you were saying how people don't get to label – like, the data scientists don't actually get to label. They have to have it outsourced because having data scientists doing labels is too expensive, right?**
Well, that was a funny story. There's a talk I did at PyData Eindhoven called “Optimal on Paper, Broken in Reality,” and one of the main gists of the talk is “Hey, bad labels are everywhere. This is a problem, folks.” That was like the main thing I wanted to drive at. But I wanted to practice that talk, so I went to a consultancy company in the Netherlands, and they were more than happy to have me for my trial talk. Afterwards, you know, you start talking to some of the consultants – folks who do data – and I met this one person who was doing computer vision for infrastructure/preventive maintenance kind of stuff. Decent application – sounds good. But he was complaining to me, like, “Hey, everything you said is true, but I’m not allowed to look at the labels, because the boss man thinks I'm too expensive for that.” Which kind of shocked me, but I've met a couple of senior professionals who get consultancy rates, who can’t deliver good models because they're not allowed to look at the labels. It's a bit strange. But yeah, it's also that I've seen this website at some point – you might have seen it as well – LabelErrors.com. Have you seen that?
Just for fun, go to www.labelerrors.com. Basically, there are these very famous benchmarking datasets from academia – CIFAR, MNIST, QuickDraw, the Amazon Sentiment corpus, etc. – and as a research project, the authors tried to figure out how many label errors there are in those datasets. They have a sound mathematical trick that allows you to say something sensible about that. With that, as well as with some volunteers on Mechanical Turk (well, not volunteers, I suppose – they paid them), they found out that QuickDraw has like a 10% label error and Amazon Sentiment has like a 2% label error. So “if you have state-of-the-art performance, you might be overfitting on bad labels” was one of the conclusions of that paper.
But labelerrors.com is a website you can go to right now. It's a pretty cool website, just because you can also see just how ridiculous some of the label errors are. I think you see something like a plumber and supposedly that's a hot tub, according to the label. I mean, there are some really crazy ones in there. But then, of course, you look at such a website and you start wondering, “Gee, is this happening to my data as well?” And the benefit of working in NLP is that you can usually – not necessarily as a domain expert, but kind of as a layman – just sort of look at text and say, “Hey, is that positive or negative sentiment?” It's a lot harder to do with a CSV file.
Then I started running my own experiments – just grabbed the first paper I found on Google that had emotions data in it – and it was just too easy to find bad labels. I was actually kind of shocked and surprised. I ended up making a package just because I thought it should not be this easy to find bad labels, but apparently just a few tricks do it. I have some examples if folks are interested. But I was genuinely shocked how easy it is to find bad labels in public datasets.
**You know, a lot of the examples that you gave just now seemed like cases where most reasonable people could look at it and say, “Yeah, that's obviously bad.” Demetrios mentioned that I work in mental health, and one of the challenges there is that there’s low inter-rater reliability – meaning you have two people label the same thing and they're gonna say different things. Do you see cases like that popping up, too? Do these methods detect that kind of situation?**
Yes and no. I also want to acknowledge that labeling isn't necessarily easy. Right? And especially if you consider things like “Let's find people with a low hourly wage on the other side of the world who have a completely different culture than us and let's see if they can label sentiment datasets for us and see what happens.” I mean, that's what happened in a bunch of these situations. Detecting emotion or sentiment – that's like a super cultural phenomenon. [chuckles] You should not expect people across the world to interpret words the same way. That’s also not how language works, by the way. So in that sense, it's also not that surprising that you're going to have some label errors. And also in medicine, I imagine, there's a reason why you get a second opinion sometimes. [chuckles] In your case, I'm assuming you're dealing with mental health a little bit, so then you're probably using text that comes in – that can be hard to parse. But even with medical images, if you go to a doctor who specialized in, I don't know, MS or something like that, and you go to another doctor who specialized in cancer, you give them both the same photos of a person's head, the MS person is gonna say, “I see MS,” and the cancer person is going to say “I see cancer,” because that's their specialization. To some extent, that's just human nature. But it is still a problem if you're going to train machine learning models on it. But I don't want to suggest that labeling is inherently easy. It's actually quite tricky.
**There was something that we talked about right before we hit ‘record,’ and you were giving another fraud use case. Maybe we can jump into that again. Because I wanted to ask you a question, but I said, “Wait until we're actually recording to ask you this question.”**
So it’s a bit of a fable, this one, but I do think it has a lot of value. As the story goes, you are in a financial company and you're interested in analyzing fraud. Let's say that you're doing this on transactions. Then there's a bunch of interesting challenges, because the dataset is imbalanced. Fraud cases – those are probably well labeled – but the non-fraud cases, eh, there might be some fraud in there, it just hasn't been labeled that way yet. So there’s class imbalance from multiple angles. But let's say, despite all of that, we end up training a model that performs quite well. We get metrics we like, we convince upper management that it needs to go into production, engineers like it, etc. It’s in production and, lo and behold, after two months, we actually catch two crooks. Let's say another month passes, we catch another crook. So everyone is getting promoted, because we apparently solved the problem. [chuckles]
Unfortunately, in this situation, where everything seems to go right, tons of stuff can still go wrong. The main thing that goes wrong in these kinds of scenarios is that, typically, the labels that you have – you should always remember – those are probably the easy ones. The fraud cases that pop up in your labeled set are probably the fraud cases that were easy to detect in the first place. By focusing in on the easy cases, you might develop a blind spot for the much harder cases. But then you have a system in production and it's doing super well. No one's gonna complain, right? This is also yet another reason why labeling is just super hard, because there's bias coming at you from all sorts of directions and it's very much a human problem. We, humans, are not perfect. But yeah – there are tons of these stories around in the IT industry. Tons of them.
**Yeah, we talked about, basically, how “there may be a lot of fraud happening, we just have no idea.” It's like, if a tree falls in the woods, does it make a sound? My question – my really big question for you is, “How do you not go insane just trying to look for more and more and more?” You can always think – and you're probably right to do so – that there's probably someone that is getting past you. But when is ‘good’ good enough?**
I have some friends who are in the InfoSec world, and some of the practices I hear of are that they say something like, “Well, this month we are concerned with _this_ attack surface. So we're just going to do a sprint making sure that it’s not our SSH keys, let's say.” Okay. But the next month, they say, “Well, it might also be social engineering. So now we're gonna poke around the social engineering department.” And sure, you're always looking for all the things at the same time, but you can apply a similar thing here where you sort of say, “Well, there are different types of fraud that happen. There's identity fraud, and I'm pretty sure there's stolen credit cards, and there's a couple of these things.” And I can imagine you just try to round-robin all of these separate things that are worth investigating from the domain knowledge that you have. I think that's a sensible way of going about it. Not necessarily perfect, but at least it sounds sensible. [chuckles] So that's the best thing I can come up with on the spot on that one. But it's hard. Yeah.
**Backtracking a little bit. When I mentioned the package that you released to find bad labels, you mentioned that it was package number 16. That kind of blows my mind a little bit as someone who hasn't released a package at all. So I just want to dig into, first of all, how do you produce so many packages? Do you have a system that helps you do that? How does that go?**
I need to take a little bit of a step back there. Not every package that I make is on PyPI and not all of them are meant to be used by everyone. But I will say – I have made a package where there's basically one or two functions in them. Maintaining one or two functions is relatively easy. So that's trick one. Then, usually, what I try to do is try to scratch my own itch, and if other people like it, that's fine – but I'm only interested in scratching my own itch. So if too many feature requests come in, I politely just say “no,” or ask them to fork it.
The other trick that I tend to use a lot is, I usually make stuff that is scikit-learn-compatible. The really cool thing about scikit-learn is that they have a testing suite that you can go ahead and use, so it's fairly easy for you to reuse their testing tools. And if you use their testing tools, you're good, because you know your stuff is scikit-learn-compatible. In that sense, if you make packages in this little ecosystem this way, maintaining them is relatively easy. But I do try to make it a point to keep some of these packages just a little bit more minimal. They don't do _all_ the stuff – they usually provide, like, one or two tricks. The doubtlab package, for example, basically just wraps around scikit-learn stuff.
The way I tried to find bad labels is just by trying to predict the classes, predict the probability estimates that come out, and I just try to figure out, “Hey, when does the classifier have a very low confidence?” That's basically one line of code, _really_. And yes, that line has a couple of tricks, but it has a couple of these one-liner tricks. So if you build packages this way, it is relatively easy. That's the main advice I have for people who want to maybe do open source stuff.
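That low-confidence trick can be sketched without any library at all. Assuming you already have per-class probability estimates for each row (e.g. from a classifier's `predict_proba`), flagging the uncertain ones really is about one line:

```python
# Flag rows where the classifier's highest class probability is below
# a threshold - a simple proxy for "the model is unsure here, maybe a
# human should double-check the label". Threshold is illustrative.

def low_confidence_idx(probas, threshold=0.55):
    return [i for i, row in enumerate(probas) if max(row) < threshold]

probas = [
    [0.97, 0.03],  # confident
    [0.51, 0.49],  # borderline - worth a human look
    [0.10, 0.90],  # confident
]
print(low_confidence_idx(probas))  # [1]
```

Real packages layer several such one-liners (low confidence, disagreement, high loss) and take the union of the flagged rows.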
**Yeah! That's super cool to hear. So considering you release a lot of packages, even if they're very minimal, is there one that stands out to you? Sort of your “favorite” at the moment?**
Um. Definitely Human Learn – the UI stuff – is what I have in mind. That's got a special place. I like the idea behind doubtlab because, as I announced this morning, I'm joining Explosion, so data quality is going to be more of my day-to-day soon, and so I'm very much interested in exploring what tricks we can add there. If I'm honest, though, the package that I really started with, together with a former colleague, is called scikit-lego. That's just a couple of these weird little Lego bricks that you sometimes need in a scikit-learn pipeline. But what's really cool about that one is that most of the ideas don't even come from me anymore. There's a little community around like, “Hey, I've got this weird little component, but there's this moment where it's super useful. Can I please plug that in?”
There's this one component that a person made a while ago, (I forget the name) but basically, it does zero-inflated regression. What it does – a lot of these regression tasks often have the number zero coming out. So if the shop is closed on the weekend, let's say, the sales are zero. But that messes up most regression algorithms. If you're trying to predict sales, having days where it's exactly zero just messes things up. So what this component allows you to do is add either a rule or a classifier, and if that's triggered, you predict zero, and if not, then it goes to the normal regressor.
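scikit-lego ships a polished component along these lines (`ZeroInflatedRegressor`); the toy class below is a minimal homemade sketch of the two-stage idea, not the library's implementation:

```python
# Two-stage zero-inflated regression: a classifier decides "is this row
# exactly zero?", and a regressor (fitted only on nonzero rows) handles
# the rest. The classifier's verdict overrides the regressor.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

class ZeroInflatedSketch:
    def __init__(self, classifier, regressor):
        self.classifier = classifier  # learns "is this row exactly zero?"
        self.regressor = regressor    # fitted only on the nonzero rows

    def fit(self, X, y):
        is_zero = (y == 0)
        self.classifier.fit(X, is_zero)
        self.regressor.fit(X[~is_zero], y[~is_zero])
        return self

    def predict(self, X):
        pred = self.regressor.predict(X)
        pred[self.classifier.predict(X).astype(bool)] = 0.0  # the rule wins
        return pred

# Toy example: sales are zero on weekends (weekdays 5 and 6), linear otherwise.
weekday = np.arange(700) % 7
X = weekday.reshape(-1, 1).astype(float)
y = np.where(weekday >= 5, 0.0, 10.0 + 2.0 * weekday)

model = ZeroInflatedSketch(DecisionTreeClassifier(), LinearRegression()).fit(X, y)
```

Without the classifier stage, the plain regressor would be dragged down by all the zero-sales days; with it, weekends come out as exactly zero.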
So it's tricks like that, and scikit-lego is a little collection of some of these tricks that are a little bit weird, but if you're from industry, they actually make a lot of sense. There are also some fairness-type algorithm tricks in there that are pretty interesting. Some good outlier tricks as well. I would argue that that's the most fun to maintain.
**Awesome, yeah. I've played with that package a little bit. There are a lot of good tidbits in it. So folks listening, if you haven't checked it out, we'll leave a link to that in the description. But yeah – one of the things you briefly touched on was that you announced that you're joining Explosion.ai. I wanted us to just dive into that. Why Explosion? What are you going to do there? Why are you excited to join?**
So it helps to kind of explain how I met them. A couple of years ago, there was a “spaCy in real life conference” and I kind of figured, “Hey, I've never done NLP before. This sounds interesting. spaCy’s got a cool website, better check it out.” And as an excuse to go to Berlin – Berlin's a fun town. So I went to Berlin and I met Matt and Ines there. I met them before at conferences, and I was hanging out with them for a bit. And this is _my_ version of the story, they will tell it differently. But basically what happened was, after a bit – I was there during the hackathon and all that – we went to a bar with a large group. That bar was super full, so we went to another bar. Then I walked up to Ines and I said, “Hey, this bar is way more spaCy. It's better.” Like, the worst joke ever, basically.
But I like to think that somehow that joke stuck and later that evening, Matt and Ines came up to me and said, “Hey, Vincent. You're kind of a funny guy. We think you would YouTube well. We're looking for someone to make some tutorials around spaCy, and we're very much interested in someone just showing from A to Z, not how spaCy would work, but how would you use spaCy to solve a problem?” So I did a couple of calls with them and I basically said, “I know nothing about NLP. I don't mind trying this whole YouTube thing out, but the deal is, I’ve gotta be able to ask you questions. Because I want to learn this NLP stuff. And if I can learn it from you by doing this, that's the deal.” I guess suddenly I've got the two maintainers of one of _the_ packages for NLP that can teach me stuff. So I was like, “Sounds great.” And that's how I met them.
In my free time, once in a while, I would make progress on the Programming Language Detector that I was making with spaCy. I've interacted with them on and off for two years – whenever I had time and felt like it, I would make another episode. I think there are like five or six now on YouTube on their channel. At some point, I was investigating all this bad data stuff and I kind of figured this was a bigger issue than I anticipated, like “Bad labels are a _huge_ problem. Huh. We need a better tool for this.”
Then I got reminded, of course, that the folks over at Explosion made this great tool called Prodigy. So then I kind of just walked up to them and said, “Hey, I might be ready for a switch. I think it might be interesting if I do more demos on Prodigy. Here's just some ideas. Does this sound cool?” And that was the shortest job interview I've ever had. But that's it. [chuckles] That's basically how I went. I've always been impressed with what the Explosion team has done. I know Matt and Ines a little bit informally because of conferences and bars and all that. But this is the story, basically.
**You told me beforehand that you were also stoked because it's Python all the way down. Can you explain that a little bit more?**
Yeah. So I should admit, I'm joining Explosion right now and my first day hasn't started yet. So as far as the pitch goes, I might be a little bit unprepared. But I will say one part about this Prodigy product that they made that I do think deserves more attention: it is scriptable. In the end, the stream of things you're going to label, and the order in which they appear, is something that you can code yourself with Python. So if you want to do some stuff with active learning, nothing is stopping you from hooking into your favorite machine learning algorithm, your favorite backend, or your favorite database to provide the labels in the order that you like. You don't have to ask anyone's permission.
That's pretty epic, because that means that I am free to do whatever weird human-learn trick I have at my disposal. Also, I've noticed that when you want to find bad labels, usually there's a custom thing in there. There's usually a trick that you want to just be able to plug in, and that doesn't necessarily happen with a UI – you kind of want to code that. That's always been my favorite feature of Prodigy – besides the good UI components and the nice developer experience – just the fact that I can code that. That's proven to be a boon.
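The "code the order yourself" idea is easy to picture outside of any particular tool. The sketch below is not Prodigy's API – the function name is made up for illustration – it just shows the classic active-learning ordering: serve the examples the current model is least sure about first.

```python
# Hypothetical uncertainty-first labeling queue. Given a model trained on
# what's labeled so far, yield unlabeled pool indices in order of rising
# model confidence, so a human labels the most doubtful examples first.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X[:50], y[:50])  # labeled so far
pool = X[50:]                                                  # still unlabeled

def uncertainty_first(model, pool):
    """Yield pool row indices, least-confident predictions first."""
    confidence = model.predict_proba(pool).max(axis=1)
    yield from confidence.argsort()

queue = list(uncertainty_first(model, pool))
```

Swapping in a different model, a database query, or a business rule only means changing what the generator yields – which is the point being made about scriptability.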
**That's awesome. I think we've covered a lot of things about packages you've written, bad data, bad labels – something I just wanted to get your quick thoughts on, as we've been talking about bad data and bad labels, the theme running in my head has been this shift, so to speak, from model-centric to data-centric AI. I want to know if you have any thoughts on that, whether you think your work is complementary to that, whether you think that this is maybe hype that's unnecessary, or whether you have any insights at all on it.**
Yeah. I remember seeing a talk from Ines, who is one of the cofounders of Explosion, and I remember her coming up with the phrase “Iterating not on models, but iterating on data.” That was kind of the first time I heard about this way of thinking. The industry term for me has always been “Iterate on data instead of models,” but I suppose, nowadays, there are people who call it “data-centric versus model-centric.” It sounds cool to me, assuming it's not a marketing term – that's always the thing I'm a little bit afraid of. But I do, in general, sympathize with the whole idea of, “Hey, let's move a little bit more towards the data. Maybe that's where the real problem is.” And also these ideas of, “Hey, if our data quality is twice as good, maybe a linear model will be fine.” Those ideas sound good to me.
But you've always got to be a little bit careful with industry terms, because sometimes “data-centric AI” might mean a lot of stuff that I'm not aware of. [chuckles] That's also kind of the thing. I believe [inaudible] had a challenge around it a while ago – that was my impression, I think. I didn't participate in it, but I do think the move of, “Hey, let's not worry too much about the tensors that are flowing, maybe worry a bit more about the stuff that's in our database” – yeah, that sounds fine. But I am not the kind of person who would use marketing terms too much, I suppose. That's my only complaint.
**So for you, it would be more interesting just to talk about the problem and not try to slap a little cliché whatever on it. Is that what it is?**
Yeah, I think that's maybe it. Maybe a good example is – I've worked with recommenders in the past. I've implemented a few – one for the Dutch BBC, I was also at the Dutch eBay for a bit and I was doing recommender work there. One thing you notice is that the recommender at the BBC is completely different from the recommender at the Dutch eBay, for the simple reason that if you buy something off of eBay, like someone's second hand bike – the moment you buy it, it's gone and you cannot recommend it to anyone anymore. So the way you implement a recommender is fundamentally different every place you go. [chuckles] Usually, like, you can read a book on recommenders but that's not necessarily going to help you find the best solution for the problem. Usually understanding the business and maybe diving into data quality is a better idea, yeah.
**Okay. So we've talked a lot about how machine learning fails, and I love to ask people who come on here if they have any war stories. Do you have anything? I’ve gotta imagine you got a ton, man. [cross-talk] **
I'm contemplating writing a book at this point – so yeah. [chuckles] I have a few war stories. I guess the main one I like to talk about the most is just grid search in general – like the “auto NLP” stuff. But the way I like to talk about that is as a little thought experiment that I think is interesting. So, Skylar, I'm just gonna do a little thought experiment with you, if you don't mind. Let's suppose that we both have some dice, right? You've got a die with 100 sides, like one of these mega Dungeons & Dragons dice. And I've got one of them too. Then we're both gonna roll, and the person with the highest number wins. If we each roll one of these dice – so far, so good. We both have an equal chance of winning. But now let's say that I say, “Ah, but I'm gonna roll two dice instead of one, and I'm going to pick the highest number.” Then you're going to say, “Well, that's a bit of cheating.” And I would say, “Well, but what if I do like 1000 of them? Would that be cheating?” Then you would rightly say, “Vincent, that's totally cheating, because you're using more dice.”
So now, let's say we're doing grid search instead. You're allowed to train one model and I'm allowed to train 1000 different models with 1000 different hyperparameter settings – like different seeds, let's say, just to keep it in the ridiculous realm. Because I'm running 1000 hyperparameter settings, if there's something stochastic in my algorithm, I will always win, just because I'm trying out more parameters. But I also feel like if you brag about how good your model is because you've done auto NLP on millions upon millions of grids, then you're measuring the wrong thing. In that sense, I don't think grid search is going to be awesome. I have had lots of stories where grid search has gone wrong, basically because someone had been pumping hyperparameters into the thing, hoping it would become better.
**Yeah, that certainly resonates. I remember my days at LinkedIn, watching people spend a lot of their time playing with hyperparameters and filling out very complete spreadsheets with their results. You know, I always seemed to come across this challenge, because I worked on recommendations at LinkedIn as well. People would essentially have great offline results, and then we'd put it in production – doesn't work so well.**
Doesn't correlate. Yeah. [cross-talk]
The thing is, people say that you have this dimensionality reduction you can do as a trick in the pipeline, right? But if we talk about the biggest hyperparameter – the biggest reduction you can ever do in dimensionality is to say, “Well, we've got this recommender problem. But we're gonna boil all that down to one single metric. That's the thing we're going to optimize for.” Because in the end, I think that's also where a large part of the problem is. Algorithms usually favor optimizing a single number, and if you're interested in optimizing a system, your concerns usually are very well beyond that. But yeah, especially if you're at LinkedIn, I can imagine just because of the sheer volume, it's hard to keep track of _everything_. So I can definitely imagine the appeal of resorting just to a single number. But yeah, it's hard. That's for sure.
**Yeah. I don't know if you've seen this blog post – there's a blog post from someone from DeepMind, I think, where they talk about one of the problems with hyperparameter tuning, specifically in deep learning: we often have multiple terms in the loss, and people think that the alpha you have is a knob that trades off between those two – how strongly you want each one. And in practice, that's not how it works. It kind of digs into how you need to really understand multi-objective optimization, and how we're really not getting anything on the Pareto front, so we're getting suboptimal solutions all over the place. It's clear that there's a lot more work we need to do on having a robust methodology for working with hyperparameters. It's definitely missing.**
I'm curious what you would think of this one, then. What's your background? If I were to say “linear programming,” does that ring a bell?
Like you have constraints, as well, in a system?
When you do operations research, this is all you do. You're constantly worried about constraints in your system. And part of me is wondering, “Well, wouldn’t it be great if in machine learning, we were just able to add more constraints?” So if you're Netflix, for example, I would love to add a constraint that says “No romantic comedies for Vincent.” [chuckles] Even if I never watched one, Netflix, here's just a constraint a user telling you, “I want this to be customized to me,” because that should be more valuable to Netflix than a click, right?
It'd be nice. I think that's also a very interesting way of thinking about models. Like, maybe we can express that knowledge not as a loss function, but as a constraint. That sounds very interesting. It's numerically quite hard, but from a design perspective, that seems like a very interesting thing to explore. Right?
**I haven't looked into this too closely, but I believe that the TensorFlow Lattice package was intended to do this. There's always a question of, “Hey, the constraints we want to impose are often like high level.” Like in the example you just gave, that’s a high level thing. Can you actually distill that down? The math is an open question. I'm not sure like in practice how well this package works, but I’ve definitely been meaning to take a closer look at it because that definitely resonates with me – putting constraints in. Especially coming back to a lot of what you spoke to on applying heuristics. There's a lot of places where, for a certain segment, we have a really strong heuristic, and it's like, “Why do you even need to use a model for that?” You know? So, very broadly.**
[chuckles] Yeah. So I've actually played with TensorFlow Lattice for a bit. It's pretty cool. The main constraint that it allows you to add is monotonicity. So you can say stuff like, “If you smoke more, that's always bad for your health.” So some causal stuff you can actually capture with that. I do know some folks in the fraud industry like using it, because they can say, “Well, this is always a risk factor – no matter what.” Projects like Fairlearn are an interesting example too. In ‘fairness algorithm land,’ so to say, constraints are often used, because they want to say, “Well, we want the true positive rates between these two subgroups to never differ by more than a certain number.” So constraints definitely make an entrance in that field. But there you can also wonder, what definition of fairness can you put in a constraint that way? It's not going to be perfect, but I do believe it's an interesting area of research. That's for sure.
**Absolutely. Absolutely. Well, great. I think we've covered a lot of ground here. I really want to thank you for all of your time – all of your insights. I had a lot of fun. Hopefully you did, too. **
**I loved it. **
**We'll be looking out for what you do and your new role at Explosion. Very excited to see what new things come out from you and the other folks there. But I think with that, I think we can wrap this up. [outro music]**
In this episode
Research Advocate, Rasa
Vincent D. Warmerdam is a senior data professional who worked as an engineer, researcher, team lead, and educator in the past. He's especially interested in understanding algorithmic systems so that one may prevent failure. As such, he has a preference for simpler solutions that scale, as opposed to the latest and greatest from the hype cycle. He currently works as a Research Advocate at Rasa where he collaborates with the research team to explain and understand conversational systems better.
Outside of Rasa, Vincent is also well known for his open-source projects (scikit-lego, human-learn, doubtlab, and more), collaborations with open source projects like spaCy, his blog over at koaning.io, and his calm code educational project.
Demetrios is one of the main organizers of the MLOps community and currently resides in a small town outside Frankfurt, Germany. He is an avid traveller who taught English as a second language to see the world and learn about new cultures. Demetrios fell into the Machine Learning Operations world, and since then has interviewed the leading names in MLOps, Data Science, and ML. Since diving into the nitty-gritty of Machine Learning Operations he has felt a strong calling to explore the ethical issues surrounding ML. When he is not conducting interviews you can find him stacking stones with his daughter in the woods or playing the ukulele by the campfire.
Data is a superpower, and Skylar has been passionate about applying it to solve important problems across society. For several years, Skylar worked on large-scale, personalized search and recommendation at LinkedIn -- leading teams to make step-function improvements in their machine learning systems to help people find the best-fit role. Since then, he has shifted his focus to applying machine learning to mental health care to ensure the best access and quality for all. To decompress from his workaholism, Skylar loves lifting weights, writing music, and hanging out at the beach!