Demetrios is the founder of the MLOps Community. He likes interviewing people and playing improvised songs on the guitar.
The easiest way to get a good thread going on the community Slack is to ask about notebooks. Here are a few of my favorite threads on Jupyter notebooks and the love/hate relationship we have with them in the MLOps community.
Should you put notebooks in production?
I find them more challenging to change manage compared to standard OOP/FP Python
I’ve heard of people maintaining notebooks with full unit test suites in production, to me it sounds like nonsense
ohhhhhh boy! this should be a good thread…
papermill is great for creating “reports” easily. As usual, though, whether it’s a good idea depends what you are using it for:
- is it just glue code to stitch some things together? that is, not much to unit test here.
- are you using it for visualizing something and that artifact?
- or are you using it instead of diving into logs?
- or does it contain source code that should be unit tested? etc.
The Netflix documentation looks pretty complicated. That looks like a pretty advanced system to stand up just to skirt around writing code outside of a notebook
yep org. size matters, what the common dev environment is, etc.
maybe @Savin Goyal knows someone that is part of the team that wrote that netflix blog to comment on it a bit?
Org size is huge, yes. If you have hundreds of scientists who have never written a unit test then something like this makes sense imo
This has all the hallmarks of Netflix engineers challenging each other to do it at a Friday pub crawl. My general question is: Why??
But if you’re a small team trying to ship productionalized AI systems, I’d stick to PyCharm and OOP. But I’m a simpleton so ¯\_(ツ)_/¯
omg, I am just halfway through the article and they just brought in docker and airflow… (I keep going)
Done! Pretty easy, isn't it? <- no, it's bloody not…
In Hopsworks, we support notebooks as Jobs. Jobs can be scheduled, orchestrated by airflow. And, yes, customers run them in production. We find they are most useful after the feature store, where data scientists often follow this pattern:
(1) notebook to select features and create a train/test dataset
(1a) (optional hparam tuning notebook to find good hparams)
(2) notebook to train model
(3) notebook to evaluate model (generate reports)
(4) notebook to deploy to model serving
The Jobs for these notebooks can be put in a simple DAG in airflow (we have UI support to generate that DAG).
Typically, these notebooks do not require unit tests for code. They are version-controlled in github/gitlab. They use the github plugin for Jupyterlab, and an extension (nbdime) to diff notebooks visually.
Papermill is also great as a way to generate reports from these notebook workflows.
Typically, these notebooks do not require unit tests for code
@Jim Dowling I disagree with this premise, no production code should ever be untested.
I thought this was an already settled issue and everyone agrees that notebooks should not be anywhere near the production cycle.
I’m not speaking for @Jim Dowling but I think he’s saying that there is no “logic” to be unit tested, which conforms to what I see people using notebooks in production for.
we use notebooks in production at PayPal for things we can afford to fail. i think the definition of production means different things to different people and that’s mostly where these discussions go awry.
in the most primitive definition, “production” should just mean code that is being consumed and relied upon by some end user or other service. with this definition in mind, we have some notebooks that fit and aren’t the end of the world if they break. for example, a notebook runs an ETL job on a schedule and dumps the resulting dataset to Tableau. If the Tableau dataset is only used in one Dashboard and there are only a couple dozen people relying on that dashboard, going through the effort of converting the notebook to python source code, adding tests, and setting up monitoring is overkill. the dashboard users will probably be able to tell if the data looks messed up and if no one notices it stopped working, no need to fix it until someone does.
we have over 10k of these types of jobs running on airflow. clearly not all are created equal. there’s probably 20% that are unnecessary and should be deleted, another 20% that should be converted to source code and have tests defined for them because they are more critical. but that middle ground of 60% is where i think notebooks in production might be OK. maybe not IDEAL, but FINE.
That’s correct @Stefan Krawczyk. How would you test the logic of “select these features” or “train this model” or “evaluate this model”?
I think many people on here have a DevOps view on MLOps that is very far from what actual Data Scientists do and want to do. They are not always software engineers and do not aspire to become one either. You can be a holier-than-thou Googler and insist they become good software engineers: the PRs will not be accepted until the testing improves (more like the beatings will continue until morale improves). But if they are able to train a model and evaluate it, and the inputs and outputs are validated by other teams (input data validated pre feature store, and output by ML Engineers before deployment, although i don’t think that is always necessary if they are good enough), then help them push their code to production. If it’s easier for everyone to let them push with notebooks, I’m fine with that. Just like we added a PyCharm plugin to Hopsworks, because many engineers are not fine with notebooks. Why can’t we all just get along?
I don’t have working experience as a Data Scientist, but as someone walking his first steps through Model Experimentation, Notebooks seem to be a great tool. One may prototype and validate a thing (or two) quickly and most DS people I talk with use them a lot… Probably because NBs fit into their activities perfectly.
When we think about a well designed software component, things like SOLID principles, scalability, testability, reusability, etc. instantly come to mind. Those aren’t the main goal of people when trying to find a model that fits their problem (that’s when NBs excel). I mean, when you’re developing traditional software you are used to those kinds of concepts, but that’s not common talk among DS guys.
That said, what’s the real pain point: the NBs or the mindset when experimenting to find the model?
I think the reasons against notebooks in production are the very core of what mlops is all about. The reason you want to test your end to end data pipelines is so your predictions don’t start leading to bad things happening. For example, say I’m predicting whether to approve an ecommerce order based on whether I think it’s fraudulent or not. If the data being passed into the model at prediction time stops looking like the data used to train the model I’m going to get in a lot of trouble.
Giving data scientists permission to skip testing for that on an ongoing basis because they don’t want to be bothered writing tests is net negative for everyone.
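A minimal sketch of the kind of check described above, assuming a single numeric feature and a crude rule of thumb: flag a batch when its mean moves more than a chosen number of training standard deviations. The function names, the threshold and the sample values are hypothetical; real pipelines usually rely on proper statistical tests or a data validation tool.

```python
from statistics import mean, stdev


def mean_shift_in_stds(train_values, live_values):
    """Return how far the live mean sits from the training mean, in training std devs."""
    train_mean = mean(train_values)
    train_std = stdev(train_values)
    if train_std == 0:
        raise ValueError("training values have zero variance; shift is undefined")
    return abs(mean(live_values) - train_mean) / train_std


def looks_like_training_data(train_values, live_values, max_shift=3.0):
    """True if the live batch mean is within max_shift training std devs."""
    return mean_shift_in_stds(train_values, live_values) <= max_shift


# Hypothetical order amounts seen at training time and at prediction time.
training_amounts = [10, 12, 11, 13, 9]
todays_amounts = [40, 42, 41]
print(looks_like_training_data(training_amounts, todays_amounts))
```

A check like this can run as an ordinary automated test on a schedule, whether the model code itself lives in a notebook or in a Python module.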
I don’t think any of us are 100% against notebooks for anyone ever. They are really good at being a scratchpad or documenting a set of thought processes. The pain point is “getting to production” which is frankly a lot more than automating the execution of code on a schedule.
I’ve run into notebooks that are “too far gone” to productionalize as well. I think notebooks are great for getting signal if a problem is solvable with the data. But if there’s signal, I try to fast track that code to CI/CD with unit tests
I think the reasons against notebooks in production are the very core of what mlops is all about. The reason you want to test your end to end data pipelines is so your predictions don’t start leading to bad things happening.
Since I’m not really experienced with ML development, this may sound naive but… How do NBs prevent a proper testing discipline? I mean, if we remove that from the equation, would we have testable code \ models? It still seems the problem lies in the fact that most people experimenting to find a proper model aren’t worried about that at all.
Fair point. Most people experimenting aren’t concerned with what to do after the experiment looks promising… Except to complain that their idea isn’t getting used.
The idea behind tests is that you execute your code on a CI server and get the PR rejected if tests fail. Not sure how to achieve that with notebooks.
Another common source of errors is non-linear execution of cells, which does not help with reproducibility.
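One way to catch non-linear execution on a CI server is to inspect the saved notebook file itself. The sketch below assumes the nbformat 4 JSON layout, where each code cell carries an `execution_count`; the function name is hypothetical. It only detects out-of-order or missing runs in the saved file and does not re-execute anything.

```python
import json


def executed_in_order(notebook_json):
    """Check that code cells ran top to bottom, once each, starting from count 1.

    Takes the text of an .ipynb file (nbformat 4 JSON) and returns True only
    if the non-empty code cells have execution counts 1, 2, 3, ... in order.
    """
    notebook = json.loads(notebook_json)
    counts = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):  # source may be stored as a list of lines
            source = "".join(source)
        if not source.strip():
            continue  # empty cells are never executed, so they carry no count
        counts.append(cell.get("execution_count"))
    return counts == list(range(1, len(counts) + 1))
```

A CI job could read each committed notebook, call this function, and fail the build when it returns False, which forces a "restart and run all" before a notebook is merged.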
re: Netflix’s usage of Notebooks: we do not execute ML models (train, deploy yada yada) in production as notebooks. Our research scientists use notebooks for prototyping just like most people on this thread. Our ETL pipelines are executed as templatized notebooks (users write code as usual outside a notebook, but the workflow scheduler copies the code into a templatized notebook and executes it via papermill). In case of failure, you can just open the notebook to see the input and the output, which is handy.
source: I work on some of this stuff at Netflix.
Our data scientists train & productize their models using Metaflow. We have tooling that helps them set up notebooks as dashboards (via papermill) to monitor their ML pipelines (not yet in open source).
Some reasons I think notebooks can be a bad idea to use in a production setting
- no proper code versioning
- state dependent execution (reproducibility can be an issue)
- no unit testing
- no linters
- no CI/CD
- no dependencies management
- everything gets cached which may lead to oversized infrastructure
- reusability of code is often done by copying, pasting and editing the same code in different notebooks
I understand it is possible to circumvent a lot of these issues and for some use cases people might just ignore them, but trying to maintain a large codebase based on notebooks can be a nightmare. With these things said, I think notebooks are a great tool for exploration, PoCs, and analyses in general. They can really speed up model development at the early stages, and by simply importing your model pipeline as a module into your notebook you can get the best of both worlds: working with a codebase in a repo while performing lots of testing and exploration with it in a notebook, and iterating between the two.
I am just really glad to hear that Netflix does not use notebooks in production. That has been a fallacy used by many data scientists to avoid stepping up their game and acquire more software engineering skills.
I’ve used both a full Notebook and Python approach in production. My feedback is that notebooks shouldn’t be used in production. The main reason is that they don’t integrate well with most cloud ecosystems and existing quality check tools (e.g., pylint, mypy …).
However, we encourage our users to leverage notebooks for their prototypes. We also created a user guide to easily migrate from notebook to python (mostly by creating as many reusable functions as possible in their notebooks).
Personally, I would still use a notebook in a scenario where I need to “get it done”. But if I need to “get it right”, I would immediately switch to Python code.
What’s your definition of “getting it done”? If I understand correctly, a significant amount of software engineering in the last decade went into showing that the two things you mentioned are the same.
That has been a fallacy used by many data scientists to avoid stepping up their game and acquire more software engineering skills.
As a piece of career advice: when you hear “DSes don’t want to write code” or “DSes have this workflow”, do understand that some DSes want to write better code, eventually get better at it and have more opportunities in the future. Also understand that the company can force you into a workflow that limits and silos you. Your incentives are not aligned.
The way I see from the above is that DSes are often seen as model making machines and a lot of tooling is just to make them faster to churn out these models. Notebooks are used as UI for them in which they hand in their work and the system takes care of the rest. Indeed if it is a UI you don’t need to (unit) test it, there are no new moving components written by the DS. Of course, you need to have anomaly detection or monitoring or other observability feature but that’s not a unit testing question. My worry with these “Data Science pin-factories” is that at no point their activity is linked to some economic incentive or product? What if the workflow cannot solve their problem? What if they consistently create bad models? You can’t solve problems purely by repeating an activity faster.
Many people here have seen the Joel Grus talk against notebooks (i see many of his arguments repeated here). Most arguments no longer hold. I recommend people to see this talk by Jeremy Howard on the current state of Notebooks and how they can be used in production pipelines. And, of course, I believe in testing. But no, i do not believe in end-to-end ML pipelines. ML pipelines have 2 parts: before the feature store (unit tested, data tested, DataOps) and after the feature store (data scientists and notebooks or Python).
https://www.youtube.com/watch?v=9Q6sLbz37gk&feature=emb_logo
I personally don’t originate my aversions to notebooks from that talk. I am watching the Jeremy Howard talk and in 8 minutes he has drawn up numerous points that are not compatible with industrial-scale ML.
Afaik his mission is to democratise ML but that doesn’t mean his workflow is compatible with MLEs who work 10 hours a day for weeks on models.
This is my biggest problem with 1 to 1 application of academic studies to the real world and the expectations around it: courses are taught that way because of the limitations of the academic environment (time pressure, resources pressure, no need for organisational structure, short term goals). Industrial settings are completely different and that difference will be removed through onboarding/vocational training. The problem starts when the academic setting is onboarded because it looks appealing from a business perspective: “just a couple of lines, instant deployment etc etc etc”.
and when I say industrial-scale ML I don’t mean scalable, efficient, lots of users etc, but I mean an ML solution framework that can solve industrial problems.
@Laszlo Sragner can you be specific in the points you disagree with Jeremy? Is it versioning, testing, etc?
I keep watching but he definitely has a very good point:
The problem is we are not living in a world of nuanced opinions. Every statement in this thread should come with a lot of qualifying statements about context and caveats.
@Jim Dowling main points: a lot of these features need an external tool: nbdev and reviewnb
A lot of content is “look, we can do this as well in a notebook”: sharing into a gist, e.g. a webserver (educational)
my experience is that literate programming and no-hidden-state notebook use is far less frequent. You end up with a notebook which is not version controlled and even you don’t know how to get to the result you are plotting, and no docs whatsoever. Kernel Restart and Run all will not help you then. And these are indeed bad practices.
In general, he is an educator who is convinced that this is the best way to teach DS/ML to most people (which I agree with). Everything in that video was educational material, which is quite understandable if your audience is students.
His mission is to encourage people to get on the path and more importantly reduce discouragement from getting on the path. I consider this video part of this partly because Joel Grus’s video (as they talk about it on Twitter) is discouraging.
@Laszlo Sragner I agree, the incentives for high-quality code in a production env vs getting the job done are often misaligned. I believe an MLOps culture should aim at maximising business impact while decreasing total cost of ownership of ML software. This is a new discipline though and sometimes we tend to simply adopt principles learned from different contexts. Devops principles were introduced based on experience from classical software engineering where people have learned that good practices such as code versioning, reproducibility, reusability, governance ultimately led to increased productivity. The question is then whether these principles still hold for ML engineering. One way to address this would be to understand from first principles the lifecycle of ML code and how it differs from standard software engineering.
For a start, ML code evolves and degrades very differently. Say that a ML model should be the closest representation of reality as possible. Since this objective is unattainable (for most problems), you will be in a constant need to investigate new approaches, use new data, develop new features. It will also degrade with time as the reality changes. This alone implies that fast iteration and experimentation is very critical to maximise business impact, perhaps more critical than the total cost of ownership of the code itself. This seems rather different from standard SE, where the utility of the code seems to be less ephemeral and thus the TCO might weigh more toward maintaining the codebase.
I think one of the main reasons why MLOps emerged as a discipline is because organizations have accumulated so much debt in ML that the equation needs to be rebalanced.
I think because ML came from academia and it wasn’t very successful, it didn’t get enough attention from business and proper engineering. Now it does, but one needs to be careful to optimise the ML pipeline to fit real-world problems not just to make the old pipeline faster and more scalable/repeatable.
This alone implies that fast iteration and experimentation is very critical to maximise business impact
The speed of experimentation is not as important compared to experimenting in the right direction. At some point, problems become so hard that speed is irrelevant and organisedness and deliberatedness will be more important.
@Laszlo Sragner those external tools are now part of many platforms. We provide built-in git and the nbdiff plugins in Jupyterlab with Hopsworks. Same for Sagemaker, i believe. So, version control is not an issue. Unit testing is still a problem, and I don’t advocate unit testing yet. Making reusable modules is also not the best use of notebooks. But for reporting and visualizations, they are second to none. I don’t think it has anything to do with academia. Most of the notebook tooling is being developed by industry, like papermill.
I think the point is not that notebooks are not useful, they just should not be a necessary step between data and deployed model. Findings from a notebook can be ported to an internal library with tests and documentation, so the next person doing exploration can benefit from this quickly, something like an internal fastai library.
But for reporting and visualizations, they are second to none.
This is exactly where we are using them: write good code, run a pipeline, load the data into an NB, visualise, repeat
My previous team met with the notebook team at Netflix (sadly they were re-orged/disbanded) and a bunch of Jupyter core contributors in 2019 to talk about the gaps between notebooks, or in this case the Jupyter ecosystem, and productionizing them. Sadly there was quite a bit of proprietary info so there’s no video recording, but we identified a bunch of gaps in the space and talked about a Jupyter JEP process. You can take a peek at a few of them here:
https://github.com/jupyter/enhancement-proposals
To put it short, there are many missing pieces (code versioning, human-friendly diff, execution lineage, dependency management etc…) that make notebooks, or at least the Jupyter environment, challenging. And some of these limitations unfortunately come from Jupyter’s core decisions themselves.
Ultimately reorg happened to both my previous team and the Netflix team so the current state of these JEPs is kinda in limbo. You can take a look at these proposals here though (and I would love to see them executed): https://github.com/jupyter/enhancement-proposals
I think folks at Noteable.io might be trying to solve that since Michelle was the lead of that Netflix team.
This is such a good thread! So many nuggets and perspectives. We can all agree that there are pros/cons to notebooks, but I think the area of opportunity for notebooks in “Production” is highly affected by the internal processes (or lack thereof) for your specific teams with respect to the level of expertise/talent available.
Secondly, I strongly believe there is a place for clarifying the boundaries/guidelines of notebook design and usage IF they were to be put into a pseudo or a more formal production flow (Maybe a community challenge!?!?) which if you coalesce the comments above you start to get closer to that, again, with respect to your team composition, expertise, etc.
Now, we could get really strategic and look behind the veil and attempt to answer the true underlying question…. “what is your tactical definition of production and how do notebooks fit into it?”. I almost guarantee, the team definitions of what constitutes “production” will vary.
How do you guys operationalize training a model back and forth between notebooks and an automated pipeline? I can imagine there’s going to be a lot of versioning issues in this case.
We solved it with the radical idea of not using notebooks.
There are two types of notebooks — those where you have to run cells 1–7 and then cell 2 again and only then cells 8–32….
And those where you can “run all”. The first kind are a problem, for the latter there are many approaches which I’m sure others will be happy to suggest. Personally I’m with Laszlo Sragner
What’s the difference between a “run all” notebook and copying the code into a script and running it with python-fire?
Oh, I know the second one is not a pain in the *** to version control
Overall, I don’t think there’s a great way to do it. I think there’s a lot of value in notebooks, especially in the early stages of understanding your problem, and iterating with the data to gain some level of human understanding around what’s going on. But once you know how you’re going to try to solve your problem (at least initially), this is where it’s a good idea to move away from notebooks.
Personally, my development process is to explore my data and script in notebooks until I know what my data looks like and maybe run a couple of experiments for a proof-of-concept. At that point, I actually move my development to something that’s structured as a python (or whatever language) project in Git, with pipelines to execute my code against my data.
But basically, I always treat my notebooks as temporary and the script versioned in github as the source of truth. (working through some of this stuff for Pachyderm’s IDE rn)
@Demetrios you wanted stuff to ask Jeremy Howard ^
I call notebooks “sketches”. A co-worker recently called them “doodles”. Both encapsulate how I personally view them. They’re scratchpads that are best used to organize thoughts and experiments. Once that’s done then transition to a more rigorous development process (git, VS Code, etc).
- Notebook as scratch book
- Notebook as scratch book with a config object holding all params
- Migrate all codes to python modules and config object in a
- Migrate params to YAML, keep static configs in config.py (e.g., key pairs, URIs, random seeds, logging, etc).
And then train using the python script only, controlling the inputs and hyperparams using YAML files.
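The "config object" stage of the migration above can be sketched as follows. This is an illustration under assumptions, not the poster's actual code: the class, field names and defaults are hypothetical, and the overrides are passed as a plain dict standing in for whatever a YAML parser would return.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TrainConfig:
    """All tunable parameters in one place (field names are hypothetical)."""
    learning_rate: float = 0.01
    n_epochs: int = 10
    random_seed: int = 42


def config_from_mapping(overrides):
    """Build a config from a plain dict, e.g. the result of parsing a YAML file.

    Unknown keys raise TypeError, so a typo in the file is not silently ignored.
    """
    return replace(TrainConfig(), **overrides)


def train(config):
    """Stand-in for a real training entry point; it only reports what it would do."""
    return (
        f"training for {config.n_epochs} epochs "
        f"at lr={config.learning_rate} (seed {config.random_seed})"
    )


print(train(config_from_mapping({"n_epochs": 3})))
```

If PyYAML is available, the mapping could come from `yaml.safe_load` applied to an open parameters file, so the same `train` function runs unchanged from a notebook or from a script.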
Or, you can versioning notebooks as
No versioning issue, just keep all the runs in a separate notebook
One pattern that has worked well for Netflix data science has been around using Notebooks as a way to inspect the output/intermediate state of ML pipelines — kind of like a quick dashboarding/analysis tool.
Notebooks are great for exploring. Can’t imagine operationalizing notebooks though.
It’s the hidden state and non-linear execution of notebooks that cause a lot of headaches while debugging them.
But as an educational tool and a scratch pad, Notebooks are fantastic
Notebooks/Jupyter are a format (ipynb) and a dev environment (jupyter notebook/lab). git versioning problems go away if you change the underlying format. You can use jupytext for that: it allows you to open scripts as notebooks, so you can open a train.py file in jupyter and do some interactive development (which is extremely useful when dealing with a new dataset and you want to get some descriptive stats, plot some distributions, etc). Then you can orchestrate a production pipeline with those scripts.
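As a rough illustration of what such a script can look like, here is a small file in the "percent" cell format that jupytext understands. To plain Python the `# %%` markers are ordinary comments, so the same file runs from the command line or in a pipeline. The function and the sample data are hypothetical.

```python
# %% [markdown]
# # Quick look at a new dataset
# The `# %%` lines mark cell boundaries when the file is opened as a notebook.

# %%
from statistics import mean, median


def describe(values):
    """Return a few descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }


# %%
# Hypothetical sample standing in for a real column of data.
order_values = [12.5, 8.0, 23.1, 8.0, 15.4]
summary = describe(order_values)
print(summary)
```

Because the file is plain text with no embedded outputs, it diffs and reviews like any other Python source.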
@Savin Goyal That’s exactly what we do: write your pipeline in a shell script observing code quality principles (because the entire team will look at it in the future) then load the data into a notebook for visualisation. If there is something that needs to be permanently monitored move it to metabase or a plotly-dash app.
But notebooks are strictly your eyes only not to be shared
@Oleh Kuchuk Metaflow is used for all data science work — iterating on models, building/debugging/productizing training pipelines and integrating the models with the surrounding business ecosystem.
Sharing Joel Grus’ presentation “I don’t like notebooks” in case someone hasn’t seen it yet: https://docs.google.com/presentation/d/1n2RlMdmv1p25Xy5thJUhkKGvjtV-dkAIsUXP-AL4ffI/edit#slide=id.g362da58057_0_1
I’m on Joel’s side but here’s the other side
You can’t possibly put these specific links up without the third
Do you version control your Jupyter notebooks?
was thinking of trying https://github.com/mwouts/jupytext
But never got around to doing it.
How’s it ^^ different from the natively available nbconvert: https://github.com/jupyter/nbconvert
Our teams are doing things with nbconvert, to be able to push Notebooks to the repo and converting to Markdown to be able to do code-review… But it is really a pain
it has to do with the Notebook format, which includes code+data. Some isolated efforts have been made to implement a different format, which splits both. but i haven’t seen anything robust yet
and AFAIK Jupyter’s core team is not working on that
But, I think most of the code versioning tools render .ipynb files to HTML format by default on the web. So, why the need for conversion before pushing to repo?
I’ve moved to using git-lfs for all *.ipynb … basically just treats them like objects. No diffing.
You will hit problems with repo size if you don’t do this or strip the content on git hook.
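A minimal sketch of the "strip the content" option, assuming the nbformat 4 JSON layout in which code cells hold `outputs` and `execution_count`. The function name is hypothetical, and dedicated tools such as nbstripout handle more cases (cell metadata, widgets state) than this does.

```python
import json


def strip_outputs(notebook_json):
    """Return notebook JSON text with code-cell outputs and execution counts removed."""
    notebook = json.loads(notebook_json)
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []              # drop images, tables, printed text
            cell["execution_count"] = None    # drop the In[n] counters
    return json.dumps(notebook, indent=1) + "\n"
```

A git hook could read each staged .ipynb file, pass its text through this function and write the result back before the commit is recorded, which keeps embedded images out of the repository history.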
That’s the moment when we should really miss RMarkdown in Python stack
Even things like making some cells hidden is not obvious in Jupyters, and in general — Markdown >>> ipynb, this format is a total nightmare. Have you seen nbconvert custom templates?
Tell Data Scientists to work with that… And of course we have some nice extensions, but when you want to export your notebook — you have to find a way to include them. And they can still not work in things like NBViewer… Jupyter environment is one big mess.
fclesio tbh, abolished notebooks for production code. We use only for EDA.
Yeah, I hear that quite a bit. What do you use instead @fclesio?
After exploration, we take the notebook from the DS, transpose it to a python script (or application), put all versioned data + code inside DVC for version control and branching, and include everything in a docker machine.
For new variables (that need some analysis or graphs) we still use notebooks, but for training we stopped completely due to several problems with reproducibility, code review and so on.
In our Cubonacci platform we came up with a somewhat unique solution to notebooks where our notebook environment also contains an IDE at the top which contains the production code, but in the Notebook you can interact with this code dynamically. This allows the normal advantages of notebooks but keeps the production code used for training and serving separate. The production code is stored in git while the notebooks are stored (and in the near future versioned) inside our platform.
This flow is based on how many data scientists use notebooks for development: when a piece of code is finished it is removed from the notebook and added to a normal Python project, which you then import from directly in your notebook again.
Just resurrecting an old thread since my question is related here.
We have notebooks that serve as artifacts, or outputs from a study. The executed notebooks may take a LONG time to run such that we don’t want to repeat it. But the output of the notebook is what is important, particularly for non-technical people that want to read a report.
Some of these notebooks can be 50+ MB when executed due to lots of images. So a direct commit of the executed notebook can be a bit heavy on the github repo. We’d like to version-control the output as a snapshot that people can revisit, but haven’t converged on a clean process for it.
Right now we:
- Write and run the notebook
- Save a HTML or PDF
- Stick HTML/PDF in a shared wiki (i.e. Confluence)
- Clear the notebook
- Commit the notebook to github
It’s a lot of steps to remember and can disincentivize DSes from reporting and committing every small study they do.
Any thoughts/ideas?
No production code on notebooks: that’s a must. Nothing to discuss about that (in my perspective)
We are working on “good practices” where the production-ready code is written (and tested) as any other normal piece of software. Of course, you can import and use it from a notebook.
Whatever is written on notebooks, must be “made productive” (taken out of the notebook). We have mixed teams, with data-scientists, MLDevs, backend devs (and whatever mix of those roles), so they work together to make it happen.
Still, as I said before, when you want to store/version/track a rendered Notebook as an asset, it’s not trivial.
Besides, given our organizational context, it is not easy to adopt some frameworks or platforms (such as the one mentioned Cubonacci).
@Tom Szumowski that workflow sounds very similar to the one some of our teams implement. Only thing I can say is write an automation tool (aka script) to do all that and reduce the probability of errors.
Convert all the 5 steps in one:
- when the notebook is ready, run
Can’t be that bad, even for DSes
Hi all … curious how people who work in notebooks get their code into source control like git without all the extra jazz for the notebook itself to run. Using pycharm or something else seems straightforward but just curious how people do it in the notebook world to have something ready to be built and run in production.
My usual answer: If you need to version control something, write the code in python scripts and then you don’t have this problem. Use python-fire to simplify shell script and parameter handling.
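A sketch of the script-first approach described above. To stay runnable without extra installs it uses the standard library's argparse instead of python-fire; with python-fire the parser boilerplate in `main` would collapse to a single `fire.Fire(train)` call. The `train` function, its parameters and the file name in the usage note are hypothetical.

```python
import argparse


def train(data_path, learning_rate=0.01, epochs=10):
    """Stand-in for a real training function; returns a description of the run."""
    return f"train on {data_path}: lr={learning_rate}, epochs={epochs}"


def main(argv):
    """Parse command-line style arguments and call train() with them."""
    parser = argparse.ArgumentParser(description="Train a model from the shell.")
    parser.add_argument("data_path")
    parser.add_argument("--learning-rate", type=float, default=0.01)
    parser.add_argument("--epochs", type=int, default=10)
    args = parser.parse_args(argv)
    return train(args.data_path, args.learning_rate, args.epochs)


# Equivalent to a shell call with the arguments: data.csv --epochs 3
print(main(["data.csv", "--epochs", "3"]))
```

In a real script, `main(sys.argv[1:])` would be called under an `if __name__ == "__main__":` guard, and the same `train` function can still be imported into a notebook for exploration.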
I use jupytext, which allows me to open .py files in jupyter. I can develop them as notebooks and store them in version control without issues. for orchestration, I use a tool I built, which uses papermill under the hood to run multi-stage pipelines (each stage can be a function or script). I’m very happy with this workflow and been using it in production for some time. Feel free to DM me if you’d like to know more details.
For a quick and dirty method, you can clear the notebook output and commit the notebook directly. You could automate that process if needed.