MLA 007 Jupyter Notebooks

Oct 16, 2018

Jupyter Notebooks, originally conceived as IPython Notebooks, enable data scientists to combine code, documentation, and visual outputs in an interactive, browser-based environment supporting multiple languages like Python, Julia, and R. This episode details how Jupyter Notebooks structure workflows into executable cells - mixing markdown explanations and inline charts - which is essential for documenting, demonstrating, and sharing data analysis and machine learning pipelines step by step.

Resources
Resources best viewed here
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (3rd Edition)
Fast.ai Practical Deep Learning for Coders
Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython (3rd Edition)
Show Notes
Try a walking desk

Sitting for hours drains energy and focus. A walking desk boosts alertness, helping you retain complex ML topics more effectively. Discover the benefits.

Overview of Jupyter Notebooks

  • Historical Context and Scope

    • Jupyter Notebooks began as IPython Notebooks focused solely on Python.
    • The project was renamed Jupyter to support additional languages - namely Julia ("JU"), Python ("PY"), and R ("R") - broadening its applicability for data science and machine learning across multiple languages.
  • Interactive, Narrative-Driven Coding

    • Jupyter Notebooks allow for the mixing of executable code, markdown documentation, and rich media outputs within a browser-based interface.
    • The coding environment is structured as a sequence of cells where each cell can independently run code and display its output directly underneath.
    • Unlike traditional Python scripts, which print results to the terminal and then discard them, Jupyter Notebooks preserve the stepwise development process and its outputs for later review or publication.

Typical Workflow Example

  • Stepwise Data Science Pipeline Construction (a minimal code sketch follows this list)
    • Import necessary libraries: Each new notebook usually starts with a cell for imports (e.g., matplotlib, scikit-learn, keras, pandas).
    • Data ingestion phase: Read data into a pandas DataFrame via read_csv for CSVs or read_sql for databases.
    • Exploratory analysis steps: Use DataFrame methods like .info() and .describe() to inspect the dataset; results are rendered below the respective cell.
    • Model development: Train a machine learning model - for example using Keras - and output performance metrics such as loss, mean squared error, or classification accuracy directly beneath the executed cell.
    • Data visualization: Leverage charting libraries like matplotlib to produce inline plots (e.g., histograms, correlation matrices), which remain visible as part of the notebook for later reference.
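The steps above map naturally onto notebook cells. Below is a minimal sketch of such a pipeline, written as one script with comments marking where each notebook cell would begin. The file name data.csv, the price column, and the tiny Keras architecture are placeholders for illustration, not anything prescribed by the episode.

```python
# Cell 1: imports (each "Cell" comment marks what would be its own notebook cell)
import pandas as pd
import matplotlib.pyplot as plt
from tensorflow import keras

# Cell 2: data ingestion -- "data.csv" and its columns are hypothetical
df = pd.read_csv("data.csv")

# Cell 3: quick exploratory checks; in a notebook the output renders below the cell
df.info()
df.describe()

# Cell 4: train a small Keras regression model (assumes the non-target columns are numeric)
features = df.drop(columns=["price"])
target = df["price"]
model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(features.shape[1],)),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(features, target, epochs=10, validation_split=0.2)

# Cell 5: inline visualization -- a histogram of the target column stays in the notebook
df["price"].hist(bins=30)
plt.title("Distribution of price")
plt.show()
```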

Publishing and Documentation Features

  • Markdown Support and Storytelling

    • Markdown cells enable the inclusion of formatted explanations, section headings, bullet points, and even inline images and videos, allowing for clear documentation and instructional content interleaved with code.
    • This format makes it simple to delineate different phases of a pipeline (e.g., "Data Ingestion", "Data Cleaning", "Model Evaluation") with descriptive context.
  • Inline Visual Outputs

    • Outputs from code cells, such as tables, charts, and model training logs, are preserved within the notebook interface, making it easy to communicate findings and reasoning steps alongside the code.
    • Visualization libraries (like matplotlib) can render charts directly in the notebook without the need to generate separate files (see the correlation-matrix sketch after this list).
  • Reproducibility and Sharing

    • Notebooks can be published to platforms like GitHub, where the full code, markdown, and most recent cell outputs are viewable in-browser.
    • This enables transparent workflow documentation and facilitates tutorials, blog posts, and collaborative analysis.
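As a concrete, hypothetical example of inline output, the snippet below computes a correlation matrix with df.corr() and draws it with matplotlib so the chart appears directly beneath the cell that produced it. The data.csv file and its columns are placeholders; on older Jupyter setups you may also need the %matplotlib inline magic at the top of the notebook.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical DataFrame; in practice this is the data loaded earlier in the notebook
df = pd.read_csv("data.csv")

# Compute the correlation matrix (numeric columns only) and render it inline
corr = df.corr(numeric_only=True)
fig, ax = plt.subplots()
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=90)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax)
plt.show()
```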

Practical Considerations and Limitations

  • Cell-based Execution Flexibility

    • Each cell can be run independently, so developers can repeatedly rerun specific steps (e.g., re-trying a modeling cell after code fixes) without needing to rerun the entire notebook.
    • This is especially useful for iterative experimentation with large or slow-to-load datasets (see the sketch after this list).
  • Primary Use Cases

    • Jupyter Notebooks excel at "storytelling" - presenting an analytical or modeling process along with its rationale and findings, primarily for publication or demonstration.
    • For regular development, many practitioners prefer traditional editors or IDEs (like PyCharm or Vim) due to advanced features such as debugging, code navigation, and project organization.
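A rough sketch of that cell-based pattern, written here as a single script where the # %% comments stand in for separate notebook cells. The database URI, table, and column names are all hypothetical, and scikit-learn is used only to keep the "cheap" cell short: the expensive load runs once, while the model cell can be re-run repeatedly without touching it.

```python
# %% Cell A: expensive, one-time data load -- run once, then leave it alone
import pandas as pd
# Hypothetical connection string and table; a real notebook would use its own source
df = pd.read_sql("SELECT * FROM big_table", "postgresql://localhost/mydb")

# %% Cell B: cheap, iterate freely -- re-run this cell as the model code changes
from sklearn.linear_model import LinearRegression
X = df[["feature_a", "feature_b"]]   # hypothetical feature columns
y = df["target"]                      # hypothetical target column
model = LinearRegression()
model.fit(X, y)
print(model.score(X, y))
```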

Summary

Jupyter Notebooks serve as a central tool for documenting, presenting, and sharing the entirety of a machine learning or data analysis pipeline - combining code, output, narrative, and visualizations into a single, comprehensible document ideally suited for tutorials, reports, and reproducible workflows.


Go from concept to action plan. Get expert, confidential guidance on your specific AI implementation challenges in a private, one-hour strategy session with Tyler. Book Your Session with Tyler.

Transcript
You're listening to Machine Learning Applied. It's been a while. I'm very sorry everyone. I got very busy for a minute there, but I'm getting back on the horse. The next three episodes in this series are gonna be on visualization, and in particular on exploratory data analysis, or EDA. Before we get into the visualization stuff, this episode is going to cover Jupyter. Jupyter Notebooks. J-U-P-Y-T-E-R, and I'm sure most, if not all of you, are already pretty well versed in Jupyter Notebooks. They're pretty difficult to avoid when immersing yourself in machine learning. Lots of blog posts, GitHub repositories, tutorials, et cetera, are done by way of Jupyter Notebooks, but I haven't covered them officially. So let's get that out of the way here. The Jupyter project was previously called IPython Notebooks. They renamed it to Jupyter, J-U-P-Y-T-E-R, because now they're supporting multiple languages and they have a much broader scope of what they're trying to accomplish. The name itself sort of encapsulates some of the languages that it supports. For example, JU for Julia (Julia is a programming language for data science), PY for Python. I don't know what TER is. I imagine the R at the end is for the R programming language, but that just shows that the Jupyter project is now compatible with multiple languages. We'll be focusing on Python because that's our focus in this podcast series. And the idea of a Jupyter Notebook is that you write a code file, like a Python file, but you can publish it to the web or to GitHub, or in other locations where people might be viewing your Python file. And that Python file tells a story. It's more than just a script that you run on the command line. A Jupyter Notebook takes that Python file and turns it from a script that you run into a piece-by-piece, visual storytelling experience. So let me explain that. If you were to write a Python file where you're ingesting data from a spreadsheet, like a CSV file, or from a database like Postgres, you might use pandas.read_csv or pandas.read_sql. Now you might write some Python to do some exploratory data analysis of your own. You might render some charts and graphs using matplotlib, and when you run your script to perform some simple EDA (and we will talk about exploratory data analysis, EDA, in the next episode), matplotlib might pop open a window for you to look at some charts and graphs. Okay? You'll close that window, you'll delete that code or comment it out, and you'll perform some data munging. Maybe you'll replace some null values, or use RobustScaler or StandardScaler to scale your data. Then you'll pipe your data into a machine learning model and run the model. And when you run the script, you'll see the output of your model being run, whether it's in Keras or TensorFlow or scikit-learn. Maybe the output of that script might be the loss value of the loss function over time, and then eventually a mean squared error or an accuracy metric if you're doing a classification task. That'll all be printed on the command line, on the terminal. You'll see these values, and things might look good to you. So you take the results of your model and you pipe them into some charts and graphs that you're going to show some business decision makers. Well, you render those images as PNG files. Those PNG files have the charts and graphs: bar charts, pie charts, whatever.
And then you email those things to the business users. So I went pretty fast with that process, but you get the idea that you're writing Python code in a Python script, and there are multiple phases to this process. And at various phases in the process there might be visualizations, but those are done by popping open a window for you to look at, and then you close the window and you delete the code and you replace it with some other code. And what you have in the end is a Python script minus all the visualizations, minus the sort of story that took you through the process of pulling in the data, exploring the data visually, running the model a few times until you're satisfied with your metrics, and then generating a report by way of visualizations that you're going to send to a business user. That whole story from beginning to end, well, that's a pretty important story to capture, to capture all of it. It's unfortunate that writing a script like that, beginning to end, sort of loses a lot of information along the way. Maybe in the future somebody wants to pick up your file and sort of see the start to finish. Or let's say you wanted to write this whole thing as a tutorial on a blog post, or you wanted to present the visualizations and outputs of the various parts of your script on a webpage so people can see just how you went about the whole process, beginning to end. That's where Jupyter comes in handy. Jupyter Notebooks. A Jupyter Notebook is basically a Python file that runs in the browser as a sort of dedicated webpage. What you'll do is you'll install Jupyter (pip install jupyter, or conda install jupyter if you're using Anaconda), and then in the terminal you'll type jupyter notebook, and that'll open up a Chrome tab. And in that Chrome tab will be sort of your dashboard, where you can create a new Jupyter Notebook. And the way Jupyter Notebooks are structured is as cells, individual cells. In each cell you can write some code and then execute it. You can write Python code, R code, Julia, Scala, et cetera. So you write some Python code. Let's say that the first cell you write is a bunch of import statements. You import matplotlib, you import scikit-learn, Keras, TensorFlow, blah, blah, blah. And then you hit Ctrl-Enter, it'll execute that cell, and if there's any output from that cell, it will print the output directly under the cell. Now, when you have a bunch of import statements, there's no output, so you hit Ctrl-Enter, it shows the cell as having been run, and then you move on to create a new cell. In that new cell, maybe you'll ingest a data source. You'll read a CSV file, pandas.read_csv, or you'll read data from a database, pandas.read_sql. Now you may want to do some basic exploratory data analysis, which we'll talk about in the next episode. Maybe you want to see if there are any null values in your DataFrame, or you want to see the mean, min, max, median, et cetera. You might type df.info() or df.describe() and hit Ctrl-Enter. So now this is your second cell, your second chunk of code. When you hit Ctrl-Enter, it executes that code and it prints the results of that execution below the cell. So you'll actually see the output of running df.info(); it'll actually print the results underneath the cell. Now let's say you create a new cell. Okay?
So you have your imports, you ingested your data into a DataFrame, and it printed some information about that data under that second cell, which stays visible while you continue to work on your third cell. Maybe in the third cell you build up a neural network using Keras, and you pipe in your DataFrame to train your model, and you hit Ctrl-Enter. And the output of training a Keras model is, line by line, the loss as it improves over time, including some speed metrics, how long it took to run over a batch of data, and then by the end, either you've reached the maximum epoch or you have some early stopping callback or such, it'll print your main metric like mean squared error or accuracy. So all that stuff will be printed under your third cell when you hit Ctrl-Enter. So you step back and you look at your Jupyter Notebook, and what you see is each chunk of code that's relevant to a specific step in your pipeline, in your process. First you imported your stuff, then you ingested your data, then you built and ran a model, and under each chunk of code, under each cell, is the output of running that cell. Now this has a lot of value, because rather than writing multiple scripts that you have to run independently and discard the output, you get to keep the output after running each phase of your pipeline. And if you wanted to publish this Jupyter Notebook online, put it on a blog post or a tutorial, or even just commit it to GitHub: GitHub actually has a Jupyter Notebook rendering engine, in the same way that they have syntax highlighting for Python and JavaScript and all that. If you were to open a Jupyter Notebook on GitHub, it'll actually render the Jupyter Notebook properly in your browser, and you'll get to see not only the code in the cells of that Jupyter Notebook, but the last output of each cell's run in that Jupyter Notebook. So if you were to write a Jupyter Notebook file, cell by cell, where you ingest the data, you pipe it through a machine learning model, and you output the results of the machine learning model, and all that output is captured under each cell, and you commit that to GitHub, somebody else could come along to your GitHub repository, click one of your Jupyter Notebooks, and they'll get to see the stepwise process, including the output from each cell's execution. So you can sort of build up a story of the process of your data pipeline in one of these Jupyter Notebooks. And it's not just limited to code. There are a few other things that you can do. One is that these cells don't have to be code cells. They can be markdown cells. So rather than just having a code block in Python, you know, a whole bunch of lines with hashtags marking them as comments, where you're talking your way through the process, you can actually have markdown, which allows you to have a much more elegant presentation of each of the phases of your file. So if you wanted to separate the data ingestion from the EDA from the model building, you might have a markdown cell above each of those cells describing what you're doing in each of those phases. And you're allowed to use markdown formatting, such as double hashtags for an H2, and bolded list items, and even inline images and videos if necessary. Another very important piece of Jupyter Notebooks is that they allow you to inline-render charts and graphs using charting libraries like matplotlib. We'll talk about matplotlib in one of the next two episodes in this three-part series.
But in short, matplotlib is the core charting library for charting your data, which may be data that you imported from an Excel spreadsheet into your pandas DataFrame. You can use matplotlib to plot the data in your DataFrame. You could do line charts, bar charts, histograms, correlation matrices. We'll talk about all this stuff in a bit. One command you might use is df.corr(), which will give you a correlation matrix, and Jupyter Notebooks allow you to render that plot, render that chart, inline, directly as the output of executing that cell. So captured in the output of your Jupyter Notebook are these charts and graphs. So you're telling the story of developing your machine learning model, from beginning with the data to ending with running your machine learning model, including, in the exploratory data analysis phase, rendering charts and graphs that allow you to visualize your data. Are there outliers? How do different features interact with each other? How is one feature distributed in a histogram, et cetera? So that's Jupyter Notebook. A Jupyter Notebook is like writing a Python file, a Python script, but rather than just executing it beginning to end, as you would with a Python script, you execute it in chunks in a Jupyter Notebook, and each chunk or cell will capture the output right under the cell, so that you can build up this story that took you from beginning to end in this file. And that includes markdown cells, which allow you to describe what's coming in subsequent cells. It includes inline matplotlib charts, which allow you to visually perform exploratory data analysis. And then you can publish your results to a blog post, or to GitHub, or a tutorial, or what have you. And one final nicety of Jupyter Notebooks is that you can execute these cells independently of each other. They don't have to be sequential. So let's say you do your imports up here, you execute that cell, okay, now all of your imports are imported, and then you do your data ingestion. Maybe you're reading SQL from a database into your DataFrame, and let's say that that data is really heavy; it takes a very long time to pull down. So you execute that cell, it takes five, ten minutes, and now that cell has been executed. Well, as you continue to write cells down below, let's say you're building up your model and piping the data into your model to train it. Well, you had a bug. Something went wrong with the model. So you delete the model code, you delete that cell. But your cell with the data has still already run, which allows you to create a new cell to try again with your model and just execute that one cell. That way you can sort of build up this flow piecemeal, one cell at a time, without having to execute it beginning to end every time you want to run your file. Now, I should mention that Jupyter Notebooks are primarily used for storytelling. I keep using that word: telling a story, piece by piece, through the development of your model, from data to execution, all that stuff. Jupyter Notebooks are primarily used for storytelling, so they're primarily used for publishing, so somebody else can look at it. I find that most machine learning developers or data analysts, when they're doing their work just for themselves, they'll be using PyCharm or Vim or Emacs or whatever IDE or text editor, and running these Python scripts in the traditional sense on their computer. It's just easier that way. And these IDEs, like PyCharm in particular (I'm a huge fan of PyCharm),
have tons of tooling built into them, where you can get signatures of functions, you can drill into functions with Ctrl-click, you can set debugger breakpoints and run the Python debugger on a file, all these things that really make your personal development workflow work for you. That's what I find: a lot of machine learning developers will do their own thing in a traditional Python script in PyCharm, but then once they want to build up their story and publish their results for other eyes to look at, they'll create a Jupyter Notebook and commit that to Git or publish it to a blog. So that's Jupyter Notebook. Next episode, we'll unpack this idea of exploratory data analysis and start introducing some of these charts, like correlation matrices and histograms.
Comments temporarily disabled because Disqus started showing ads (and rough ones). I'll have to migrate the commenting system.