Dask What is it? Parallelism for analytics What is parallelism? Doing a lot at once by splitting tasks into smaller subtasks which can be processed in parallel (at the same time) Distributed work across multiple machines and then combining the results Helpful for CPU bound - doing a bunch of calculations on the CPU. The rate at which process progresses is limited by the speed of the CPU Concurrency? Similar but a but things don’t have to happen at the same time, they can happen asynchronously. They can overlap. Shared state Helpful to I/O bound - networking, reading from disk, etc. The rate at which a process progresses is limited by the speed of the I/O subsystem. Multi-core vs distributed Multi-core is a single processor with 2 or more cores that can cooperate through threads - multithreading Distributed is across multiple nodes communicating via HTTP or RPC Why is this hard? Python has its challenges due to GIL, other languages don't have this problem The shared state can lead to potential race conditions, deadlocks, etc Coordination work across the machines For analytics? Calculating some statistics on a large dataset can be tricky if it can’t fit in memory
In this episode
CEO, Coiled, Maintainer of Dask
Matthew is an open source software developer in the numeric Python ecosystem. He maintains several PyData libraries, but today focuses mostly on Dask a library for scalable computing. Matthew worked for Anaconda Inc for several years, then built out the Dask team at NVIDIA for RAPIDS, and most recently founded Coiled Computing to improve Python's scalability with Dask for large organizations. Matthew has given talks at a variety of technical, academic, and industry conferences. A list of talks and keynotes is available at (https://matthewrocklin.com/talks). Matthew holds a bachelor’s degree from UC Berkeley in physics and mathematics, and a PhD in computer science from the University of Chicago.
Head of Data Science Evangelism and Marketing, Coiled
I'm a data scientist, writer, educator & podcaster. My interests include promoting data & AI literacy/fluency, helping to spread data skills through organizations and society and doing amateur stand up comedy in NYC. I do many of these at DataCamp, a data science training company educating over 3 million learners worldwide through interactive courses on the use of Python, R, SQL, Git, Bash and Spreadsheets in a data science context. I have spearheaded the development of over 25 courses in DataCamp’s Python curriculum, impacting over 170,000 learners worldwide through my own courses. I host and produce the data science podcast DataFramed, in which I use long-format interviews with working data scientists to delve into what actually happens in the space and what impact it can and does have. I earned my PhD in Mathematics from the University of New South Wales, Australia and have conducted biomedical research at the Max Planck Institute in Germany and Yale University, New Haven. For more information, check out my website here: http://hugobowne.github.io/
Demetrios is one of the main organizers of the MLOps community and currently resides in a small town outside Frankfurt, Germany. He is an avid traveller who taught English as a second language to see the world and learn about new cultures. Demetrios fell into the Machine Learning Operations world, and since, has interviewed the leading names around MLOps, Data Science, and ML. Since diving into the nitty-gritty of Machine Learning Operations he felt a strong calling to explore the ethical issues surrounding ML. When he is not conducting interviews you can find him making stone stacking with his daughter in the woods or playing the ukulele by the campfire.
David is one of the organizers of the MLOps Community. He is an engineer, teacher, and lifelong student. He loves to build solutions to tough problems and share his learnings with others. He works out of NYC and loves to hike and box for fun. He enjoys meeting new people so feel free to reach out to him!