This episode covers Python charting libraries (Matplotlib, Seaborn, and Bokeh), explaining their strengths from quick EDA to interactive, HTML-exported visualizations, and clarifies where D3.js fits as a JavaScript alternative for end-user applications. It also evaluates major software solutions like Tableau, Power BI, QlikView, and Excel, detailing how modern BI tools now integrate drag-and-drop analytics with embedded machine learning, potentially allowing business users to automate entire workflows without coding.
You're listening to Machine Learning Applied. In this episode, we're gonna talk about specific charting, plotting, and graphing utilities: in Python, outside of Python, dedicated software, all that stuff. We'll cover Matplotlib, Seaborn, Bokeh, and D3. Those are code libraries for generating plots and graphs.
And then we'll talk about software packages for charting and graphing, things like Tableau, QlikView, Power BI, and Excel. So in the last couple episodes, we worked through the EDA part of the BI pipeline, the exploratory data analysis phase of the business intelligence pipeline. Remember, EDA comes right before cleaning your data, or some people put those two together: you've got EDA and munging, and that's all just EDA.
And then your cleaned-up data from that phase goes into your machine learning model. Oftentimes a developer will build this story from beginning to end of the BI pipeline, where they're ingesting the data, they're performing EDA, they're designing the machine learning model.
They'll do this all in a Jupyter Notebook, and they'll execute it cell by cell. Each cell's output is captured under that cell, and then you can save that Jupyter Notebook and publish it online. Other people can look at the entire process, including each output from each phase, on GitHub, on a blog post, on a tutorial, whatever.
Now, in the steps of EDA that chart and plot and graph, that render charts and graphs, these developers will use any number of libraries. The most common library used for plotting is called Matplotlib, M-A-T-P-L-O-T-L-I-B, Matplotlib. Matplotlib is a library that lets you chart and graph basic charts.
It's pretty low level. You have to hand-hold and guide Matplotlib through the process of creating a plot. There are a lot of lines of code that go into generating a single plot, but it gets the job done. It is sort of the base library for other wrapper libraries that we'll talk about in a bit. So it's sort of the core plotting library in the Python ecosystem.
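To give a feel for what "low level" means here, the following is a minimal sketch of a single histogram built directly with Matplotlib. The data is made up for illustration, and the `Agg` backend line is only there so the sketch runs as a plain script; in a notebook you would typically leave it out.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; not needed in Jupyter
import matplotlib.pyplot as plt
import numpy as np

# Made-up data for illustration: 500 draws from a normal distribution
rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=500)

# Each piece of the chart is set up by hand
fig, ax = plt.subplots(figsize=(6, 4))
counts, bin_edges, patches = ax.hist(
    values, bins=20, color="steelblue", edgecolor="white"
)
ax.set_title("Distribution of values")
ax.set_xlabel("value")
ax.set_ylabel("count")
ax.grid(axis="y", alpha=0.3)
fig.tight_layout()

# Write the chart to disk (the file name is arbitrary)
fig.savefig("histogram.png", dpi=100)
```

None of the individual calls is hard, but the title, labels, grid, and layout are all separate steps, and that is the boilerplate the wrapper libraries take off your hands.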
Most tutorials, when they're doing EDA or they're just teaching you machine learning basics, they'll use Matplotlib. Most developers, when they're designing their machine learning model and they just want some quick and dirty EDA plots, a correlation matrix, something like that, they'll use Matplotlib.
Now this is very, very nice: Pandas wraps around Matplotlib. It will call Matplotlib functions for you so that you don't have to write all this boilerplate code, all this scaffolding for generating a plot, because like I said, generating a plot in Matplotlib is no small feat. So for example, if you want to generate a correlation matrix, which we talked about in the last episode, you would just type df.corr(), C-O-R-R, parentheses, and you hit Control-Enter.
Pandas will take your data, it'll pop it into a bunch of boilerplate Matplotlib code for generating a correlation matrix with some sensible defaults around the font and the size of the plot and all that stuff, and it will render it for you. Really nice. Very quick and dirty. Most machine learning developers doing the EDA step, they just want quick and dirty.
They just wanna look at some correlation matrices. They just wanna look at some scatter plots, some histograms. Most of the bare necessities that you could ever imagine using Matplotlib for in your EDA phase, preparing for machine learning, are available directly on your Pandas DataFrame. So three of the ones that we mentioned from the last episode: corr will generate a correlation matrix, hist will generate a histogram, scatter will generate a scatter plot.
All these things. So, you know, what I recommend is getting into the swing of EDA, getting used to the whole process by using Pandas directly so that you don't get bogged down trying to learn to write Matplotlib code, because it can get pretty hairy. Get into the swing of things just calling the functions on Pandas DataFrames.
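As a rough sketch, the three one-liners look like this in practice. The DataFrame and its column names are invented for illustration. One caveat: in current pandas, df.corr() hands back a table of coefficients rather than a picture (a notebook displays it as a table under the cell), while df.hist() and df.plot.scatter() are the calls that actually draw through Matplotlib.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; not needed in Jupyter
import numpy as np
import pandas as pd

# Made-up housing-style data for illustration
rng = np.random.default_rng(0)
df = pd.DataFrame({"sqft": rng.normal(1500, 300, 200)})
df["price"] = df["sqft"] * 200 + rng.normal(0, 20000, 200)
df["age"] = rng.integers(0, 50, 200)

# Pairwise correlations between numeric columns, returned as a DataFrame
corr = df.corr()

# One histogram per numeric column, drawn with Matplotlib under the hood
axes = df.hist(bins=20)

# Scatter plot of one column against another
ax = df.plot.scatter(x="sqft", y="price")
```

Each call is a single line, which is the whole appeal for a first pass over a new dataset.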
It's really simple, quick and dirty, and it will help you get into the habit of exploring your data before piping it into a machine learning model, which is actually a habit that a lot of machine learning developers lack. A lot of machine learning developers will skip the EDA phase. It's kind of like how many web developers tend to skip unit testing even though it has so much value, and they'll get bit in the butt later if they don't do that step. Well, machine learning developers will benefit greatly from having an EDA phase before they write their model.
And a lot of times machine learning developers will skip that step because it's so hairy. It takes so long to do. Well, if you just call these functions on Pandas, it will help you get into the habit because it's just so simple. It's just so easy to write df.corr(), df.hist(), df.plot.scatter(). That way you can get into the habit, and then later, if you want to get better at designing these charts, you know, you want to really dive into the fonts or the colors or the sizing, whatever the case may be, then you can learn Matplotlib separately and write your own custom plots.
And that might be handy, for example, if you're generating some reports, some charts for business users down the pike. So Matplotlib is the core, and Pandas wraps around it, which provides some high-level functions.
So it's really easy to call Matplotlib without writing boilerplate code. Another wrapper around Matplotlib is Seaborn, S-E-A-B-O-R-N, Seaborn. Seaborn is to Matplotlib what Keras is to TensorFlow. So Seaborn wraps around Matplotlib and just makes it easier to work with. Simple as that. So if you want to design your own custom charts, you know, you need something more than what Pandas provides as one-line function calls, you still want to design your own charts, but Matplotlib is a little bit too hairy for you, well, you can step it up a level and use a wrapper library called Seaborn, which exposes the Matplotlib API in simpler terms.
It sets some saner defaults, things around fonts and sizes. In fact, a lot of times what you'll see developers do is just import Seaborn after importing Matplotlib and never use Seaborn itself.
They just use Matplotlib function calls while importing Seaborn. Just the act of importing it actually sets a lot of these defaults on fonts and color palettes and chart sizes and stuff like that. So the mere act of importing Seaborn already sets a bunch of defaults, but then additionally, you can use Seaborn function calls, which sort of wrap up your Matplotlib code, so it's easier to work with, with fewer lines of code, when using Seaborn.
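Here is a minimal sketch of both halves of that idea: Seaborn's styling defaults, and a Seaborn function call that replaces a block of Matplotlib code. The data is made up. One caveat to the import trick: since Seaborn 0.8, importing the library no longer changes Matplotlib's defaults on its own, so the sketch opts in explicitly with sns.set_theme() (available in Seaborn 0.11 and later).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; not needed in Jupyter
import numpy as np
import pandas as pd
import seaborn as sns

# Opt in to Seaborn's default style, color palette, and font scaling.
# Plain Matplotlib calls made after this line pick up the same look.
sns.set_theme()

# Made-up data: three independent columns plus one correlated with "a"
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
df["d"] = df["a"] * 0.8 + rng.normal(scale=0.5, size=200)

# One call draws an annotated, color-coded correlation matrix
ax = sns.heatmap(df.corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_title("Correlation matrix")
```

The heatmap returns an ordinary Matplotlib Axes, so anything you already know about Matplotlib (titles, labels, savefig) still applies on top of it.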
So Matplotlib, Pandas, Seaborn. This is sort of the holy trinity. A lot of times people will use all three together: Seaborn for wrapping Matplotlib, with sane defaults, better colors, better fonts, and easier-to-write custom plotting code, and Pandas wrapper functions for the real quick and dirty common EDA tasks like correlation matrices and histograms.
You may see all three used together, or you may see people not use Seaborn. They don't care if things are pretty, they just want a quick look at the data distribution: I don't need Seaborn, I'm just gonna use Pandas wrappers on Matplotlib functions, et cetera. Alright, that's over here, all three of those.
We take those and we push them aside for now, because the next one we're talking about is a much bigger, fatter, heavy-hitter framework for plotting and charting. It's called Bokeh. I don't know if that's how you pronounce it. That's how I always pronounce it. It's B-O-K-E-H, Bokeh. And Bokeh is a much more powerful charting library than Matplotlib, Seaborn, and the Pandas wrapper functions.
Bokeh allows you to have interactive, beautiful charts and graphs with cross filtering and all that stuff in your Jupyter Notebooks, or Bokeh can actually generate HTML files. It will export these charts and graphs into HTML files that you can then publish to the web.
Or maybe interaction with your data is a very core component, very essential. We're not just doing EDA anymore. Now, maybe we want the business users to be able to explore the data on their own. You might use Bokeh: you can export graphs into HTML, tie it to a Bokeh server that is connected to your database directly or to a spreadsheet, and now you have a dedicated app for exploring the data that business users can use, data analysts can use, et cetera.
So Bokeh has all the standard plotting, charting, graphing utilities that you might get in Seaborn and Matplotlib, but in addition, it has interactivity. So if you wanted to generate a histogram in your Jupyter Notebook using Bokeh, you would write the function call for generating the histogram and hit Control-Enter.
It will generate a histogram below the cell, but now you can actually hover over those bars with your mouse and the bar will, like, pop out a little bit, change color. It might display a number, maybe the number of items in that bar or the number that that bar represents, et cetera. You can zoom into that graph with the mouse wheel, you can pan the graph around.
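A minimal sketch of that interactive histogram, assuming Bokeh is installed. The data, column names, and output file name are all made up for illustration. Bokeh has no dedicated histogram call in this sketch; the bins are computed with NumPy and drawn as rectangles, with a hover tool, pan, and wheel zoom attached.

```python
import numpy as np
from bokeh.embed import file_html
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure
from bokeh.resources import CDN

# Made-up data, binned with NumPy
rng = np.random.default_rng(0)
values = rng.normal(50, 10, 500)
counts, edges = np.histogram(values, bins=20)

# One row per bar: its count plus left and right edges
source = ColumnDataSource(
    data={"count": counts, "left": edges[:-1], "right": edges[1:]}
)

p = figure(
    title="Distribution of values",
    tools="pan,wheel_zoom,reset,hover",
    active_scroll="wheel_zoom",  # mouse wheel zooms without clicking a tool first
    tooltips=[("range", "@left{0.0} to @right{0.0}"), ("count", "@count")],
)

# Bars change color while the mouse is over them
p.quad(
    top="count", bottom=0, left="left", right="right", source=source,
    fill_color="steelblue", line_color="white", hover_fill_color="orange",
)

# Export a standalone HTML page that can be published as-is
html = file_html(p, CDN, "Histogram")
with open("histogram.html", "w", encoding="utf-8") as f:
    f.write(html)
```

In a notebook you would more likely call output_notebook() and show(p) from bokeh.io to render under the cell; the HTML export is shown here because it is the part that sets Bokeh apart from the static libraries.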
So Bokeh tends to be for more permanent data analysis fixtures rather than simple EDA tasks. So Bokeh is less common for: I need to just look at my data to determine how I'm going to handle outliers and null values. Bokeh is more for: I need to connect a data visualization utility to a live data source and from time to time slice and dice my data, and multiple people might be using this, stuff like that.
There's a very powerful component built into Bokeh called cross filtering. You can imagine it like a facets filter on amazon.com, where you say, you know, I want items in this department. I'm looking for a computer. It has to have a Core i7, between this price range, anything which was released in the last year and a half.
So cross filtering is sort of slicing and dicing your data on multiple columns, and it will hone in, filtering your data based on multiple column filters. So Bokeh: a very, very powerful utility. So the Matplotlib, Pandas, Seaborn triumvirate over here, I call that like step one.
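Setting Bokeh's widgets aside, the filtering logic behind the Amazon-style example can be sketched in plain pandas. This is an illustration of the idea, not Bokeh's own API; the product table, column names, and reference date are all invented.

```python
import pandas as pd

# Invented product catalog for illustration
products = pd.DataFrame({
    "name": ["Laptop A", "Laptop B", "Laptop C", "Novel D", "Laptop E"],
    "department": ["Computers", "Computers", "Computers", "Books", "Computers"],
    "cpu": ["Core i7", "Core i5", "Core i7", None, "Core i7"],
    "price": [1200, 800, 2500, 30, 1500],
    "released": pd.to_datetime(
        ["2024-03-01", "2024-06-15", "2024-01-10", "2023-05-01", "2021-09-01"]
    ),
})

# Fixed reference date so the example gives the same result whenever it runs
today = pd.Timestamp("2024-12-31")
cutoff = today - pd.DateOffset(months=18)  # "the last year and a half"

# Each filter is a boolean mask on one column; combining them narrows the rows
mask = (
    (products["department"] == "Computers")
    & (products["cpu"] == "Core i7")
    & products["price"].between(1000, 2000)
    & (products["released"] >= cutoff)
)
matches = products[mask]
print(matches["name"].tolist())  # ['Laptop A']
```

An interactive tool does the same thing, except each mask is driven by a dropdown, slider, or selection on a linked chart, and every chart redraws from the filtered rows.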
This is simple tasks: EDA, rendering graphs, maybe even publishing reports. I don't mean to say that they're very limited. They have very powerful APIs and you can generate any graph under the sun that you can imagine using their APIs. They're not limited in that sense. What I mean to say is that they're static.
You might use them for a one-off thing, publishing a report to your Jupyter Notebook or some simple EDA. Bokeh is more for really fundamental, ongoing, permanent data exploration. You would use it to sort of generate an app that is built around your data source so that various people can slice and dice data. And then, take another step in the on-steroids direction.
We have D3.js. Now, D3 may be a little bit of a weird one to pull into this conversation. D3 is a charting library. It's kind of like Matplotlib, but it's not in Python, it's in JavaScript. So Matplotlib, Seaborn, and Bokeh are all Python. You write Python and they generate plots, whether that's generating a plot inline in a Jupyter Notebook, like Matplotlib or Bokeh, or exporting an HTML app connected to a data source, which Bokeh can do.
Well, D3 is a bit different than that. D3 is intended for writing charts and graphs for an app. I mean, D3 is meant for users, for end users, if you're gonna write a mobile app or a website that's connected to a data source in the traditional sense of web development.
That's where you have an app server written in Node.js or Python, connected to a data source in Postgres, and your client page, whether it's written in React or Vue.js or something, is interacting with the server by way of REST calls or GraphQL, pulling the data down into the client webpage, and then D3 takes that data and generates graphs.
Interactive, not interactive, the sky's the limit, because you actually write the code for generating the charts in D3. You hand-write the code. It is a lot of boilerplate. You're gonna write hundreds of lines of code to generate plots and graphs in D3, but it's not intended for EDA.
It's not intended for your business users, it's intended for end users. So take your data pipeline: you take in data, you look at it with EDA, you clean it up, you pipe it into a machine learning model, and then that machine learning model does something for users, in the same way that Pandora uses machine learning to find new songs for users, or Amazon uses machine learning to recommend new products to users, and stuff like that.
I'm talking end users here. D3 is intended for writing the front-end code of an app that is going to be used by an end user. So it seems a little bit weird that I brought that in here.
The only reason I mention it is I think a lot of people see the word D3, D3.js, thrown around the internet. It's a very popular plotting library, and they're over here doing their machine learning code and they're wondering: well, I'm using Matplotlib, but these guys are talking about D3.
Should I be using D3? Very likely not. D3, like I said, is intended for end users. It's intended for building a full-fledged app: mobile app, web app, what have you. It's not meant for exploring data in your data analysis pipeline. It's meant for much more heavy-hitting, production-ready, beautiful end-user charts and graphs.
Okay, those are some charting and plotting libraries available to you in the code world, and there are many more, of course. Those are the most popular ones that I see: Matplotlib, Seaborn, Bokeh, and D3. These are libraries in code for coding up a plot, a chart. You write code to generate a chart.
Push all that over here and enter a new world of software, software packages that you download that help you generate charts. Click and drag, point and click. Some common software packages are Tableau, T-A-B-L-E-A-U, Tableau. That's probably the most popular, in my opinion. Another one is Power BI.
Power BI, by Microsoft: BI as in business intelligence. Remember, the business intelligence pipeline is the beginning-to-end purpose of data science in the first place. You take in your data, you look at it, you munge it, you machine-learn it, and then you generate reports for your business users.
So the name Power BI, Microsoft's Power BI software package, implies that these software packages, Tableau, Power BI, QlikView, et cetera, are more than just for EDA. They're more than just looking at data and generating graphs. A lot of these tools have built into them very powerful tooling for the whole BI pipeline, including in some cases machine learning.
Some of these software packages will let you actually generate a machine learning model without having to write any code. The idea of these software packages is that you download them. Let's say Tableau. You download Tableau Community Edition. You double-click the icon, it opens up, it's a software package.
It's a drag-and-drop software package where you type in your data source. Maybe it's a spreadsheet or a database. It connects to the data source and then it shows you all of your columns and their data types and some basic information on those columns, including number of nulls and some basics on the data distribution, mean, median, max, all that stuff, over here on the left.
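For comparison, roughly the same per-column summary (data type, null count, mean, median, max) can be put together in pandas. This is a sketch on an invented table, not a description of how Tableau computes its panel.

```python
import numpy as np
import pandas as pd

# Invented table with some missing values
df = pd.DataFrame({
    "age": [34, 45, np.nan, 29, 51],
    "income": [52000.0, 61000.0, 58000.0, np.nan, np.nan],
    "city": ["Austin", "Boston", None, "Denver", "Austin"],
})

# Data type and number of nulls for every column
summary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "nulls": df.isna().sum(),
})

# Distribution basics; describe() only covers the numeric columns here
stats = df.describe().T[["mean", "50%", "max"]].rename(columns={"50%": "median"})

# Non-numeric columns such as "city" end up with NaN in the stats columns
summary = summary.join(stats)
print(summary)
```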
And then you can click and drag some of those columns into the middle pane of this software. In Tableau, say, you drag it into the middle and it will sort of populate a chart, and maybe it'll automatically decide, based on the distribution of this data, to create a histogram.
And there's something over here on the top right: you can click a dropdown and you can select what type of chart you actually want to look at. I didn't want a histogram, I wanted a scatter plot: click, it transforms into a scatter plot. And then you can go back over to your columns on the left and you can drag another column and drop it into the middle, and it will interact with the previous column that you had dropped there.
Maybe it will generate a correlation plot, or it will create a cross-filter sort of situation. Very powerful tools. No code necessary. These can be used by the business users. They can be used by the data analysts. If you just wanted to be quick and nimble, even as a machine learning engineer in the exploratory data analysis phase, you could use this software.
If you didn't like writing Matplotlib code, you can load up your data into one of these software packages and mess around with it. Look at its distribution. Look for outliers, look for holes. Cross-filter your data, column against column, in order to sort of explore, zoom in, and figure out what kind of data points you have based on certain filters and things like this.
And again, like I mentioned, implied by the name Power BI, which is one of the software packages, is that exploring data in this fashion is not the only piece of these software packages. A lot of these packages, especially now in this age of machine learning, the reign of machine learning models and neural networks, have machine learning models built into them, so you can just pull in your data source.
You can choose to look at some graphs like I just described, or you can choose not to, and you can skip to the next step in the software package, which is to pipe your data into a machine learning model so that you can predict some target column from your data source. It will maybe try a grab bag, a handful of these machine learning models, like gradient boosting, neural networks, linear regression, try to find the model with the best metric, maybe lowest mean squared error, and generate some reports for you automatically.
And now you have a machine learning model that you can use for predicting test data in the wild. It's very powerful, very interesting stuff. You know, it sort of implies that our jobs are on the line here, where we're writing code to code our way through the BI pipeline, including EDA and machine learning.
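The "grab bag" step described above can be sketched in code with scikit-learn. This is an illustration of the general idea (try several model families, keep the one with the lowest cross-validated mean squared error), not how Tableau, Power BI, or QlikView implement it. The dataset is synthetic and the candidate settings are arbitrary choices.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for "your data source"
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# The grab bag: one candidate per model family mentioned above
candidates = {
    "linear_regression": LinearRegression(),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
    "neural_network": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
    ),
}

# Score each candidate with 5-fold cross-validated mean squared error.
# scikit-learn reports the negated MSE, so flip the sign back.
mse = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    mse[name] = -scores.mean()

# Keep the candidate with the lowest error and refit it on all the data
best_name = min(mse, key=mse.get)
best_model = candidates[best_name].fit(X, y)

for name, value in sorted(mse.items(), key=lambda item: item[1]):
    print(f"{name}: MSE = {value:.1f}")
print("selected:", best_name)
```

The selected model can then be used with best_model.predict(...) on new rows, which corresponds to the "predicting in the wild" step.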
Well, these software packages like Tableau, Power BI, and QlikView (by the way, that's Q-L-I-K View) may in the near future be able to do the entire BI pipeline for you. Click and drag, business users can do this, no code necessary. I should also mention Excel, by the way. Excel has some EDA tooling inside of it.
I mean, a lot of the data sources that you're gonna be pulling into your machine learning project are gonna be coming from CSV files or Excel files. Many data sources will come from a database, like a Postgres database or maybe some Spark or Hadoop setup. But most of the time, as has been my experience, you're working with CSV or Excel files.
Well, you can pull these files into Pandas, you know, pd.read_csv, and then call things like df.corr or df.hist in order to generate some of these plots inline in Jupyter. Or you can open up the Excel file itself, your data source, and Excel has a bunch of charting utilities built right into it.
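A small sketch of the loading step. The CSV content is invented and read from an in-memory string so the example is self-contained; with a real file you would pass its path to pd.read_csv instead. For an actual .xlsx workbook, the corresponding call is pd.read_excel, which needs an Excel engine such as openpyxl installed.

```python
import io

import pandas as pd

# Invented CSV content; in practice this would be a file on disk
csv_text = (
    "sqft,price,age\n"
    "1200,240000,10\n"
    "1500,310000,5\n"
    "900,180000,30\n"
    "2000,405000,2\n"
)

# read_csv accepts a path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# From here the usual quick looks apply
corr = df.corr()  # table of pairwise correlations
print(df.dtypes)
print(corr.round(2))
# df.hist() would then draw one histogram per column in a notebook
```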
Now, these will be very simplistic. You're not gonna get a whole lot of power out of Excel in the charting and EDA space. But if what you're doing is some simple histograms or scatter plots, yeah, you could do that directly in Excel. So, charting tools. I talked about coding libraries: Python libraries like Matplotlib, Bokeh, and Seaborn.
And a JavaScript library called D3.js, which is more on the consumer side of things, building an app for users. Those are the programming libraries. And then over here we have software packages that you download. They're click-and-drag utilities for EDA, for machine learning, for BI, all that stuff.
In a future episode, I'll dive more deeply into these BI packages and the BI pipeline. So I'll talk more specifically about the separate steps of the pipeline. I talked about EDA in these last couple episodes, and we all know the machine learning steps, so that's two steps of the pipeline.
So I'll talk about the other steps in a future episode, and I will talk more deeply about these software packages like Tableau and compare them against each other and see where one shines over the other. So stay tuned for that.