MLA 008 Exploratory Data Analysis (EDA)

Oct 26, 2018

Exploratory data analysis (EDA) sits at the critical pre-modeling stage of the data science pipeline, focusing on uncovering missing values, detecting outliers, and understanding feature distributions through both statistical summaries and visualizations, such as Pandas' info(), describe(), histograms, and box plots. Visualization tools like Matplotlib, along with processes including imputation and feature correlation analysis, allow practitioners to decide how best to prepare, clean, or transform data before it enters a machine learning model.

Resources
StatQuest - Machine Learning
Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython 3rd Edition
Show Notes

EDA in the Data Science Pipeline

  • Position in Pipeline: EDA is an essential pre-processing step in the business intelligence (BI) or data science pipeline, occurring after data acquisition but before model training.
  • Purpose: The goal of EDA is to understand the data by identifying:
    • Missing values (nulls)
    • Outliers
    • Feature distributions
    • Relationships or correlations between variables

Data Acquisition and Initial Inspection

  • Data Sources: Data may arrive from various streams (e.g., Twitter, sensors) and is typically stored in structured formats such as databases or spreadsheets.
  • Loading Data: In Python, data is often loaded into a Pandas DataFrame using commands like pd.read_csv('filename.csv').
  • Initial Review (a minimal code sketch follows this list):
    • df.info(): Displays data types and counts of non-null entries by column, quickly highlighting missing values.
    • df.describe(): Provides summary statistics for each column, including count, mean, standard deviation, min/max, and quartiles.
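A minimal sketch of this first pass, assuming a hypothetical housing.csv file (the file name and columns are illustrative, not from the episode):

```python
import pandas as pd

# Load a CSV into a DataFrame (hypothetical file name)
df = pd.read_csv('housing.csv')

# Data types and non-null counts per column -- a quick scan for missing values
df.info()

# Count, mean, standard deviation, min/max, and quartiles for numeric columns
print(df.describe())
```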

Handling Missing Data and Outliers

  • Imputation:
    • Missing values must often be filled (imputed), as most machine learning algorithms cannot handle nulls.
    • Common strategies: impute with mean, median, or another context-appropriate value.
    • For example, missing ages can be filled with the column's average rather than zero, to avoid introducing skew.
  • Outlier Strategy:
    • Outliers can be removed, replaced (e.g., by nulls and subsequently imputed), or left as-is if legitimate.
    • Treatment depends on whether outliers represent true data points or data errors (see the sketch after this list).
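A hedged sketch of both steps, assuming a numeric age column in a hypothetical DataFrame; the 120-year cutoff is an arbitrary illustration of an outlier rule, not something prescribed in the episode:

```python
import numpy as np
import pandas as pd

df = pd.read_csv('housing.csv')  # hypothetical file

# Treat implausible ages as data errors: replace them with NaN first
# (the 120-year cutoff is an arbitrary example threshold)
df.loc[df['age'] > 120, 'age'] = np.nan

# Then impute all missing ages with the column mean rather than zero,
# so the model isn't skewed by fake zero-year-old rows
df['age'] = df['age'].fillna(df['age'].mean())

# Alternatively, drop the offending rows entirely:
# df = df.dropna(subset=['age'])
```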

Visualization Techniques

  • Purpose: Visualizations help reveal data distributions, outliers, and relationships that may not be apparent from raw statistics.
  • Common Visualization Tools:
    • Matplotlib: The primary Python library for static data visualizations.
    • Visualization Methods (illustrated in the code sketch after this list):
      • Histogram: Ideal for visualizing the distribution of a single variable (e.g., age), making outliers visible as isolated bars.
      • Box Plot: Summarizes quartiles, median, and range, with 'whiskers' showing min/max; useful for spotting outliers and understanding data spread.
      • Line Chart: Used for time-series data, highlighting trends and anomalies (e.g., sudden spikes in stock price).
      • Correlation Matrix: Visual grid (often of scatterplots) comparing each feature against every other, helping to detect strong or weak linear relationships between features.
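A short Matplotlib/Pandas sketch of these plot types, again with hypothetical column names:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('housing.csv')  # hypothetical file and columns

# Histogram: distribution of a single column; outliers appear as isolated bars
df['age'].hist(bins=30)
plt.title('Age distribution')
plt.show()

# Box plot: median, quartiles, and whiskers make outliers easy to spot
df.boxplot(column='age')
plt.show()

# Scatter-matrix form of the correlation grid: every feature against every other
pd.plotting.scatter_matrix(df[['bedrooms', 'bathrooms', 'sqft', 'price']],
                           figsize=(8, 8))
plt.show()
```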

Feature Correlation and Dimensionality

  • Correlation Plot:
    • Compute pairwise linear correlations with df.corr() in Pandas; the result can be inspected directly or visualized (e.g., as a heatmap or scatter matrix). See the sketch after this list.
    • High correlation between features may suggest redundancy (e.g., number of bedrooms and square footage) and inform feature selection or removal.
  • Limitations:
    • While correlation plots provide intuition, automated approaches like Principal Component Analysis (PCA) or autoencoders are typically superior for feature reduction and target prediction tasks.
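A minimal sketch of the numeric side of this, assuming a hypothetical price target column:

```python
import pandas as pd

df = pd.read_csv('housing.csv')  # hypothetical file and columns

# Pairwise linear correlations between the numeric features
corr = df.select_dtypes('number').corr()
print(corr)

# Features most (and least) correlated with the target, e.g. 'price'
print(corr['price'].sort_values(ascending=False))
```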

Data Transformation Prior to Modeling

  • Scaling:
    • Machine learning models, especially neural networks, often require input features to be scaled (normalized or standardized).
    • StandardScaler (from scikit-learn): Standardizes features, but is sensitive to outliers.
    • RobustScaler: Centers on the median and scales by the interquartile range, so outliers have far less influence on the scaling; this can simplify preprocessing (see the sketch after this list).
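A small sketch contrasting the two scalers on a toy feature with one extreme outlier (the numbers are made up purely for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# Toy feature column with one extreme outlier
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

# StandardScaler uses the mean and standard deviation, so the outlier
# dominates the scale and squashes the ordinary values together
print(StandardScaler().fit_transform(X).ravel())

# RobustScaler centers on the median and scales by the IQR, so the bulk
# of the data keeps a sensible spread despite the outlier
print(RobustScaler().fit_transform(X).ravel())
```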

Summary of EDA Workflow

  • Initial Steps:
    • Load data into a DataFrame.
    • Examine data types and missing values with df.info().
    • Review summary statistics with df.describe().
  • Visualization:
    • Use histograms and box plots to explore feature distributions and detect anomalies.
    • Leverage correlation matrices to identify related features.
  • Data Preparation (see the pipeline sketch after this list):
    • Impute missing values thoughtfully (e.g., with means or medians).
    • Decide on treatment for outliers: removal, imputation, or scaling with tools like RobustScaler.
  • Outcome:
    • Proper EDA ensures that data is cleaned, features are well-understood, and inputs are suitable for effective machine learning model training.
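One hedged way to package these preparation steps so they are applied consistently, using scikit-learn's imputer, scaler, and pipeline utilities (file and column names are hypothetical):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

df = pd.read_csv('housing.csv')                           # hypothetical file
numeric_cols = ['age', 'bedrooms', 'bathrooms', 'sqft']   # hypothetical columns

# Median imputation followed by outlier-robust scaling, fit once and reused
numeric_prep = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', RobustScaler()),
])
prep = ColumnTransformer([('num', numeric_prep, numeric_cols)])

X = prep.fit_transform(df)  # cleaned, scaled inputs ready for a model
```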

Transcript
You're listening to Machine Learning Applied. This is the second of the visualization episodes. In this one we're going to talk about exploratory data analysis, a.k.a. EDA, as well as some charting fundamentals. So, exploratory data analysis: I threw that phrase around a lot in the last episode without describing it. EDA is part of a larger pipeline for your machine learning or data science process, under this whole umbrella of what's called business intelligence, BI, or even just data science. It's the A-to-Z, beginning-to-end pipeline of what you're working on. The whole reason you have a machine learning model in the first place is as part of a pipeline. The first part of this pipeline is getting your data from some data source, maybe a data stream like Twitter or some sensors, and then converting it into something that can be stored in a database. You might be cleaning up your data. You would be visualizing your data and determining how it will fit into your machine learning model. That's what's called exploratory data analysis. EDA is looking at your data, figuring out if there are holes in it, how it's distributed, how you're going to have to fix it up, tidy it up, et cetera, before it hits the machine learning model. Then it hits the machine learning model, and then you have some results, maybe some information that's going to go to the business decision makers. So you output results with your machine learning model, maybe generate some visualizations or reports on those results, and then deliver them to the business people. This whole pipeline from beginning to end is called the business intelligence pipeline, BI. You'll see that word a lot, and we'll talk about BI in a future episode, along with each of the steps of this pipeline. The umbrella term for all the tasks involved in the BI pipeline is data science. We've got the database people, the database administrators: they're data scientists. We have the data mungers and the cleaner-uppers, and the charting and graphing and all that stuff: that's all data science. We have the people building the machine learning models: that's data science. And finally we have the reporting and the analytics, the data analysis: that's data science. So the conceptual umbrella of all of this is data science, and the stepwise pipeline of each of these parts of the process is called business intelligence. BI and data science are kind of synonymous terms with a different spin: data science is the concept, the field, and BI is the process, the steps. And like I said, the step we're most familiar with by way of this podcast series is machine learning, building the machine learning model. The model-building step of the BI pipeline sits right in the middle of this whole pipeline, and the step before it is called EDA, exploratory data analysis. Now, EDA is looking at the data that came to you from these streams or from the database or the spreadsheet. You load up your spreadsheet, you load up your data from the database, and you look at it. You try to figure out: are there holes in the data? Are there nulls that you have to fill? We call this imputing data, filling in the holes. Are there outliers in the data? Let's say one feature in your spreadsheet, okay, rows and columns.
One of the columns has data where a handful of entries are way over here on the left or way over here on the right. Maybe that's corrupted data. Maybe those outliers are bad entries: somebody added an extra zero by accident, or there was a corrupted reading from a sensor. Maybe those outliers are legitimate and should be taken under consideration. All that stuff is part of the data cleaning, data munging step of the BI pipeline; it's still sort of under the EDA umbrella. Another major aspect of EDA is not just determining whether there are holes and outliers, but visualizing them, visualizing where the outliers are. There are different types of charts and graphs you can generate and render. If you're using a Jupyter Notebook, for example, you would use a library called Matplotlib, which we'll cover in the next episode, to generate a chart of a feature in your dataset. So if you ingested a spreadsheet or a database into a DataFrame, you have a DataFrame now, and that DataFrame has columns. Let's say it's the housing dataset: we've got distance to downtown, number of bedrooms, number of bathrooms, square footage, et cetera. If you wanted to visualize the distance to downtown on a graph, you might render a histogram. And if you render that with Matplotlib inline in your Jupyter Notebook, you just write df.distance_to_downtown.hist() — we'll cover that next episode — and it will render a histogram of that column underneath your cell in the notebook. It'll be a bell curve, a Gaussian distribution, but in bars. So that's a broad overview of EDA, but let's talk about some of these specifics and how you would handle them in Python code and Matplotlib and such. Again, the broad picture of EDA is looking at your data and determining whether you need to do something about it before it hits the machine learning model, because your machine learning model typically wants data cleaned up and in a specific format. For example, a neural network generally wants your features to be standard scaled, with outliers removed if necessary, and no null values. So you have to scale things, you have to impute things: that's all part of the EDA process. So what I like to do, first things first, is load up a spreadsheet into a DataFrame: pd.read_csv, that's Pandas read_csv, with the file name in parentheses, and you assign that to a DataFrame, df = pd.read_csv(...). Now you have a DataFrame, df, and that DataFrame has a whole bunch of functions you can call as part of this EDA step. One of those functions is info: df.info(), and then you hit Ctrl+Enter inside your Jupyter Notebook cell and it will print, right under your cell, the information of your DataFrame. That information includes the data types of the columns and the number of null entries in each of those columns, and that's the thing I'm looking for right away: the number of nulls in each column. That's very important, because almost always you've got to clean up those nulls; you have to impute them with something. Okay, but what are you going to fill them with? We don't know just yet; we have to continue this EDA process. So then I'll create a new Jupyter Notebook cell and type df.describe(), Ctrl+Enter, and it will describe the DataFrame in a new way.
What this will show, for each column, is the number of entries — basically the number of rows — the mean value of numerical columns, the standard deviation, the min and the max, and the quartiles: the 25th percentile, 50th percentile, 75th percentile, et cetera. So now you have some statistics on your DataFrame columns. info() gives you the data types and number of nulls; describe() gives you some statistics. I think it's a bit confusing; they should all be wrapped up into one function call that gives you everything at once, but there you have it: info and describe. So that gives you some numbers and details about your DataFrame, and that can be useful. If you have a bunch of nulls in your DataFrame, a bunch of holes, you've got to fill them with something. Well, there are a few options. Maybe you can fill them with the mean value of that column. Let's say we have people and we have an age column, and a handful of ages are missing. We need to fill those ages in with something — you can't have nulls most of the time in your machine learning pipeline — and we don't want to fill them in with zero. That's probably a bad idea, because zero is going to actually mean something to the machine learning model: it'll mean this person is zero years old, and that might skew the way the model interprets that particular row. So maybe a smarter value to fill the nulls with is the mean, the average age, which is probably 30 or somewhere in the middle. So printing the information about your DataFrame — whether there are holes in the data and what statistics are available on that data — gives you a sense of how you might want to fill those holes. Now, if you're good at reading statistics, at looking at these numbers and visualizing what they imply, you might also be able to determine, just from that information, what you're going to do with outliers in the data. You might be able to look at the 25th or 75th percentile, compare that to the min and the max, and decide what to do with outliers. I think very few people are that proficient at reading raw numbers, and so most people turn to charts and graphs. It's much easier to visualize the distribution of your data than to eyeball the statistics and decide what to do from there. So we turn to charts and graphs, and for this we use a library called Matplotlib. We'll talk about Matplotlib in the next episode and compare it to some alternatives like Seaborn and Bokeh, but just assume for now that the primary library you use in Python for charts and graphs is Matplotlib. Matplotlib allows you to chart things, and one thing you might chart is the way a column is distributed, the data distribution. The way you might chart this is with a histogram, which bins up the values in your data and then plots them. The most common data distribution you'll see in the wild is the normal distribution, the bell curve, the Gaussian distribution. And so when you type df.age.hist(), that will call the underlying Matplotlib function for generating a histogram and rendering it under your Jupyter Notebook cell.
It will render a histogram, and that histogram will probably look like a bell curve, but instead of being a smooth curve it'll be a bunch of bars, like a bar chart shaped like a bell curve. That's what a histogram is: a binned-up version of a smooth distribution, in most cases a normal distribution. And what you might see in that histogram is that it looks normal, a healthy-looking bell curve, but way over on the right is a bar at, like, 150 years old. Wait a minute, what's that? What's going on over there? Well, that might indicate there were some errors in data entry, so those outliers should probably be removed. So using a histogram you can visually spot the outliers and then decide what you're going to do with them. Another way of looking at data distributions in order to spot outliers is a box plot, which you may have seen in a statistics class. It's a box in the middle with a line separating the top half of the box from the bottom half. These are the quartiles: the top of the box is the third quartile, the bottom of the box is the first quartile, and the separator line is the median of your data. Then there's a whisker going out of the top of the box to the maximum value and a whisker going down from the bottom of the box to the minimum value. So this visualization also helps you look at the data distribution and figure out what the situation is with the outliers. Now, what are you going to do with the outliers? You looked at your data distribution on a plot in your Jupyter Notebook by way of Matplotlib — a histogram or a box plot or something like this — and you determined that you have outliers, or that your data is distributed in some way you need to fix. What do you do? Well, there are a handful of ways of fixing data. One way is to remove the outliers: completely remove the rows themselves, so if there's an outlier we just don't even consider that data point. Another is to replace the outlier with null and then use your imputing strategy from before to replace the null — you cut out the outlier and replace it with the mean value or something like that. And sometimes the outliers are legitimate: sometimes you may look at the distribution in a plot and say, no, actually that means something, I'm going to keep it as is. One more thing: when you are piping data into a neural network, you generally want that data to be standard scaled. You'll see this done a lot with a scikit-learn class called StandardScaler; you can just standard scale all your data. Well, there's another scikit-learn class called RobustScaler. It's very similar to StandardScaler, except that it centers on the median and scales by the interquartile range, so outliers don't dominate the scaling. Where StandardScaler gets thrown off by outliers — and, like we were just talking about, you'd maybe cut out the rows or replace the outliers with the mean value and then standard scale, so there are like three steps in that process — RobustScaler from scikit-learn lets you skip a lot of that. When it encounters an outlier, the outlier doesn't stretch the scale for everything else: it's still a high value, yes, but the rest of the data keeps a sensible range. A very powerful tool.
I use scikit-learn's RobustScaler all the time. It's kind of a quick-and-dirty approach to handling outliers and scaling, all in a one-two punch, so you don't have to do too much munging. Okay, so box plots and histograms: those are visualization tools that let you look at your data and help you decide what to do with it. What other types of plots can we use? Well, if you have a time series, you're definitely going to use a line chart. Everybody knows what a line chart is. Of the standard charts almost everybody is familiar with, we've got line charts, pie charts — everybody knows what a pie chart is — and bar charts. Pie and bar are generally not useful charts for EDA; they're a little too simplistic. But a line chart, even though it's one of those 101 charts, is very valuable. You'll use it for time-series information, you know, stock tickers or weather changes and stuff like that. So you might use it to visualize your time-series data and determine whether there are outliers there. Let's say you have a stock that generally trends in an upward direction, but there's this crazy spike halfway through the chart that goes to a million dollars. Well, you know that's an outlier; that almost certainly means there was a recording error. So you can learn from that visualization that you have an outlier or two, and you might choose what to do with them: either robust-scale the data or cut the outliers out and replace them with some mean value. And then, finally, there's this thing called a correlation plot. This one's a little hard to describe in audio; you'll have to look it up. A correlation plot takes the features of your data — your columns — and plots them in a grid of cells, number of columns by number of columns. So let's say you have 10 features: distance to downtown, number of bedrooms, number of bathrooms, square footage, and so on. It'll create a 10-by-10 grid, with features going across and features going down, and each cell in the grid compares one feature to another feature. So every feature is compared to every feature, including itself. In a correlation matrix you're visualizing the correlation between features, and each cell of this grid will be a scatterplot. So what does a correlation look like? A perfect correlation is a correlation of one, and what that looks like on a scatterplot is a line going at a 45-degree angle from the bottom left of the cell to the top right: a straight line with a slope of one. More likely, when you have a decent correlation, one feature generally trends with another feature. So, for example, number of bedrooms, number of bathrooms, and square footage of the house: all three of those are generally going to trend with each other; they'll have strong correlation with each other. What you'll see is a cloud of dots that looks to be going generally at a 45-degree angle from bottom left to top right. No correlation — something that has nothing to do with the other thing — will just be a blob of dots, a scatterplot blob. So, you know, distance to downtown and number of bedrooms: maybe that has very little correlation.
Maybe that's not true in reality, but pretend there's no correlation between distance to downtown and number of bedrooms: what you'll see is just a blob of dots. So that's what a correlation matrix gets you: it shows you in what way certain features are correlated with other features. And the way you generate that in Pandas is df.corr() — C-O-R-R — then you hit Ctrl+Enter in your Jupyter Notebook cell and it displays the grid of pairwise correlation values under that cell. And what might you do with this correlation grid? Well, determining which features correlate with which other features is sort of half the battle of machine learning in the first place: you're trying to determine what features can directly predict the target column, the price of the house in this case, for example. If number of bedrooms, number of bathrooms, square footage, and distance to downtown are all directly correlated with the price column, that gives you some information. It might not mean much for how you're going to munge the data before it gets piped into your machine learning model, but it can be very informative on an intuitive level, on the data analysis side of data science: just eyeballing that correlation matrix and saying, hey, price is pretty much correlated with this, that, and the other thing. Or, you know, number of bedrooms is pretty much correlated with number of bathrooms; maybe we can remove one of those features, because we need to slim down the number of features for the performance of our model. If one feature is so strongly correlated with another that they pretty much determine each other, you can choose to remove features, for example. Now, I personally haven't found correlation matrices very valuable. If I wanted to pare down features, I would use an automated approach like principal component analysis (PCA) or an autoencoder — this is called dimensionality reduction — and if I wanted to determine a target variable from features, I'd use a machine learning model, like a neural network or a linear regression model. So what a correlation matrix does for you can be automated with other machine learning tools, but it still might be handy to eyeball in a graphical sense, and that's what the correlation matrix does for you. So that's EDA. Exploratory data analysis is the step before machine learning where you're preparing your data. First you're looking at your data to determine what to do with it, and then you do something about it. That's the data wrangling, data munging step, where you're filling in the nulls, scaling your features, and removing or replacing outliers; you might even remove features completely if they have strong correlation with other features, and things like this. To summarize some of what we covered in EDA: df.info() and df.describe() are Pandas DataFrame functions that give you information on the statistics of the DataFrame, the number of nulls, and the data types of the columns. Then you have various Matplotlib plotting utilities for visualizing your data: line charts, bar charts, pie, scatter; and histograms are very valuable for viewing the distribution of your columns. You'll see histograms used very frequently, as well as correlation plots, correlation matrices.
These are all plotting functions you can call directly from the Matplotlib library, or straight from your Pandas DataFrame; Pandas wraps around Matplotlib so that it calls Matplotlib for you when you call a plotting function on a DataFrame. We'll get into that in the next episode, the interaction between Pandas and Matplotlib. And again, EDA is one step of the overarching business intelligence pipeline, BI. There's a whole bunch of stuff at the beginning around getting your data; then you look at your data, that's EDA; and then you munge, clean, process — whatever you want to call it — your data. Some people consider that part of the EDA phase, and some people separate it out into its own dedicated phase. Then you put your cleaned-up data into the machine learning model, that generates some information, you create some reports, and you send those reports to the business users. That's BI. Next episode we'll talk about some of these specific charting utilities like Matplotlib, Bokeh, Seaborn, D3, Tableau, all those things.