4.5. Exploratory data analysis (EDA)#
Exploratory data analysis, or EDA, means just that. We are going to look at and summarize our data. The data cleaning that we’ve already done, like looking for missing values and understanding our variables, is actually part of EDA. In these notes, however, we’ll focus on grouping, or aggregating, and summarizing our data.
This type of data work is sometimes called split-apply-combine: we split, or group, our data, apply a function (e.g. the mean), and then combine the summary statistics back together.
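To make the pattern concrete before we get to real data, here’s a toy sketch (my example, not the brewery data we’ll use below):
# Split-apply-combine in miniature
import pandas as pd
df = pd.DataFrame({'group': ['a', 'a', 'b'], 'x': [1, 2, 10]})
df.groupby('group')['x'].mean()  # split by group, apply the mean, combine the results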
We are starting to ask questions of our data. For example,
What variables do we have and what are their data types? Are they numeric or text?
Which variables are continuous? Which are discrete?
Are any variables categorical? These variables might be either numeric or text.
Do we have any datetime variables?
How are our variables related to each other?
How can we summarize them?
We’ll graphically summarize our data in the next section. We’ll look more at datetime variables when we get to our work on financial time series.
We want to eventually be able to open up a new data set, see what questions can be answered, and then have the tools to actually answer those questions. Sometimes we want to ask questions that the data can’t answer - we need to recognize when that’s the case too.
DataCamp, of course, has a tutorial on some EDA ideas, perhaps more related to what people are doing in machine learning.
Chapter 10 of Python for Data Analysis discusses aggregation and summarizing our data.
Exploratory Data Analysis from Coding for Economists has data aggregation and summary examples. That chapter also covers the logic of opening up a data set and seeing what you have to work with. I am following along with a lot of that material in these notes.
4.5.1. Set-up#
Let’s bring in some data on NC breweries. We’ll again use my Github URL to download the data directly: https://raw.githubusercontent.com/aaiken1/fin-data-analysis-python/main/data/ncbreweries.csv.
We’ll bring in our usual libraries. I’ll also use pyjanitor, in case we need to clean anything.
I’ll name the DataFrame nc.
# Set-up
import numpy as np
import pandas as pd
import janitor
nc = pd.read_csv('https://raw.githubusercontent.com/aaiken1/fin-data-analysis-python/main/data/ncbreweries.csv')
nc.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 251 entries, 0 to 250
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 251 non-null object
1 City 251 non-null object
2 Type 251 non-null object
3 Beer Count 251 non-null int64
4 Est 251 non-null int64
5 Status 251 non-null object
6 URL 251 non-null object
dtypes: int64(2), object(5)
memory usage: 13.9+ KB
Looks pretty good. I’ll use pyjanitor to clean up the variable names and get rid of any spaces and capitalization. We’re transforming them into something called snake case, which gets rid of capital letters and uses “_” instead of spaces.
nc = nc.clean_names()
nc.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 251 entries, 0 to 250
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 251 non-null object
1 city 251 non-null object
2 type 251 non-null object
3 beer_count 251 non-null int64
4 est 251 non-null int64
5 status 251 non-null object
6 url 251 non-null object
dtypes: int64(2), object(5)
memory usage: 13.9+ KB
See what it did? Let’s take a peek at the first few rows.
nc.head()
name | city | type | beer_count | est | status | url | |
---|---|---|---|---|---|---|---|
0 | 217 Brew Works | Wilson | Microbrewery | 10 | 2017 | Active | https://www.ratebeer.com//brewers/217-brew-wor... |
1 | 3rd Rock Brewing Company | Trenton | Microbrewery | 12 | 2016 | Active | https://www.ratebeer.com//brewers/3rd-rock-bre... |
2 | 7 Clans Brewing | Cherokee | Client Brewer | 1 | 2018 | Active | https://www.ratebeer.com//brewers/7-clans-brew... |
3 | Andrews Brewing Company | Andrews | Microbrewery | 18 | 2014 | Active | https://www.ratebeer.com//brewers/andrews-brew... |
4 | Angry Troll Brewing | Elkin | Microbrewery | 8 | 2017 | Active | https://www.ratebeer.com//brewers/angry-troll-... |
We can also change the data types. For example, let’s get brewery type to be something called a category. Categories tell pandas that some of your variables are in discrete groups. This will be helpful when working with our data. I’ll also change a few others to be strings. They were objects before, which pandas can work with as text, but we might as well tell pandas that we want text.
nc = nc.assign(
name=nc['name'].astype('string'),
city=nc['city'].astype('string'),
type=nc['type'].astype('category'),
status=nc['status'].astype('category'),
url=nc['url'].astype('string'),
)
nc.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 251 entries, 0 to 250
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 251 non-null string
1 city 251 non-null string
2 type 251 non-null category
3 beer_count 251 non-null int64
4 est 251 non-null int64
5 status 251 non-null category
6 url 251 non-null string
dtypes: category(2), int64(2), string(3)
memory usage: 10.7 KB
I’m using a method called assign() that creates (assigns) a new column. In this case, I’m creating new name, city, etc. variables from the existing ones, but am changing their type as I go.
Notice this new type of syntax I’m using. assign() can create more than one new column at once. I’ve put each new column on a new line with proper indentation. It will still run without that indentation, but the spacing makes it much easier to read.
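As an aside, an equivalent way to do this (a sketch of an alternative, not the approach these notes use) is to pass a dictionary of column names and data types to .astype():
# Alternative: convert several columns at once with .astype()
nc = nc.astype({
    'name': 'string',
    'city': 'string',
    'type': 'category',
    'status': 'category',
    'url': 'string',
})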
Finally, let’s just check for missing data. Doesn’t seem like we have any from that info().
nc.isna().sum()
name 0
city 0
type 0
beer_count 0
est 0
status 0
url 0
dtype: int64
4.5.2. Exploring NC breweries#
We’ll start with describe(). This will give us summary statistics for our numeric variables.
nc.describe()
beer_count | est | |
---|---|---|
count | 251.000000 | 251.000000 |
mean | 32.960159 | 2012.155378 |
std | 43.723385 | 8.749158 |
min | 1.000000 | 1900.000000 |
25% | 10.000000 | 2011.000000 |
50% | 18.000000 | 2014.000000 |
75% | 38.000000 | 2016.000000 |
max | 424.000000 | 2018.000000 |
Anything stand out to you? The variable est is the year that the brewery was established. The minimum is 1900. That seems very early. Is that right? A beer count of 424 seems high, but could be right.
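One nice thing about working in a notebook: we can check right away. A quick sketch (my addition) that pulls the row(s) with the earliest est year:
nc[nc['est'] == nc['est'].min()]  # which brewery claims to be established in 1900?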
Going back to the describe() output, we can make that summary table look a little better.
sum_table = nc.describe().round(1)
sum_table
beer_count | est | |
---|---|---|
count | 251.0 | 251.0 |
mean | 33.0 | 2012.2 |
std | 43.7 | 8.7 |
min | 1.0 | 1900.0 |
25% | 10.0 | 2011.0 |
50% | 18.0 | 2014.0 |
75% | 38.0 | 2016.0 |
max | 424.0 | 2018.0 |
We can select just a single column to describe.
sum_table_beer = nc['beer_count'].describe().round(1)
sum_table_beer
count 251.0
mean 33.0
std 43.7
min 1.0
25% 10.0
50% 18.0
75% 38.0
max 424.0
Name: beer_count, dtype: float64
This doesn’t look as good, right? It’s because, by picking out just that one column, we’ve created a pandas series, rather than a DataFrame.
type(sum_table_beer)
pandas.core.series.Series
We can make it a DataFrame again using .to_frame().
sum_table_beer.to_frame()
beer_count | |
---|---|
count | 251.0 |
mean | 33.0 |
std | 43.7 |
min | 1.0 |
25% | 10.0 |
50% | 18.0 |
75% | 38.0 |
max | 424.0 |
Going back to the original summary DataFrame, we can flip things around, or transpose the table, to make it look better. This is usually how summary statistics are presented.
sum_table = sum_table.T
sum_table
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
beer_count | 251.0 | 33.0 | 43.7 | 1.0 | 10.0 | 18.0 | 38.0 | 424.0 |
est | 251.0 | 2012.2 | 8.7 | 1900.0 | 2011.0 | 2014.0 | 2016.0 | 2018.0 |
Let’s look at the variable type. What types of breweries are in our data? We can use .value_counts() to do counts by category.
nc['type'].value_counts()
Microbrewery 165
Brewpub/Brewery 41
Brewpub 33
Client Brewer 9
Commercial Brewery 3
Name: type, dtype: int64
Mainly microbreweries.
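If you’d rather see shares than raw counts, .value_counts() takes a normalize argument. A quick sketch (my addition):
nc['type'].value_counts(normalize=True).round(2)  # proportion of breweries in each type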
4.5.2.1. Group and .agg#
I wonder what type of brewery has the largest number of beers on average? This is where we can start using the idea of split-apply-combine.
nc.groupby('type')['beer_count'].mean()
type
Brewpub 44.939394
Brewpub/Brewery 37.707317
Client Brewer 5.000000
Commercial Brewery 10.666667
Microbrewery 31.315152
Name: beer_count, dtype: float64
This code is grouping our data by type, pulling out beer_count, and calculating the mean.
Note
You first group by your variables. Then, you need to tell Python what variable you want to aggregate. For grouping by more than one variable, you need to use a list inside of the ().
We can use the .agg method to include more than one summary statistic. The np. means that the function is being pulled from the numpy package. Pay attention to the syntax - there are no () after the np.mean, np.median, and np.std functions now, because they are being used inside of the .agg function. They are also grouped together with [] as a list.
The .agg function is the same thing as .aggregate, which you’ll see in our textbook.
You could also pull multiple columns to summarize, but this data set really only has one numeric variable of interest.
nc.groupby('type')['beer_count'].agg([np.mean, np.median, np.std]).round(1)
mean | median | std | |
---|---|---|---|
type | |||
Brewpub | 44.9 | 21.0 | 77.3 |
Brewpub/Brewery | 37.7 | 28.0 | 36.3 |
Client Brewer | 5.0 | 2.0 | 7.0 |
Commercial Brewery | 10.7 | 9.0 | 6.7 |
Microbrewery | 31.3 | 18.0 | 36.6 |
An interesting Python note: See how I used np. inside of .agg? This tells .agg that I am passing a function from numpy, like np.mean. You can also use the name of the function and pass the arguments 'mean', 'median', and 'std' to .agg. You might see that if you’re looking at other examples.
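For instance, this version (a sketch using string names instead of numpy functions) should produce the same table:
nc.groupby('type')['beer_count'].agg(['mean', 'median', 'std']).round(1)  # strings instead of np. functions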
I also added that round(1) to get only one decimal place. You can also make your table and then style it on separate lines. This code will style just that table. You can also set global style options that will affect all of your results; there’s a short sketch of that below.
sum_table_group = nc.groupby('type')['beer_count'].agg([np.mean, np.median, np.std])
sum_table_group.style.format(precision=0, na_rep='-')
mean | median | std | |
---|---|---|---|
type | |||
Brewpub | 45 | 21 | 77 |
Brewpub/Brewery | 38 | 28 | 36 |
Client Brewer | 5 | 2 | 7 |
Commercial Brewery | 11 | 9 | 7 |
Microbrewery | 31 | 18 | 37 |
The code above reads from left to right. You do something, have a ., do something else, etc. For example, we are taking the DataFrame nc, grouping it by type, pulling out beer_count, and then aggregating using three different statistics. I am saving this to a new DataFrame, called sum_table_group. Finally, I am using style.format to change the look of the DataFrame that is being displayed as a table in our notebook.
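On that global-options point: pandas has display settings that affect every DataFrame you print, separate from styling any one table. A minimal sketch (my example; this changes how values are displayed, not the underlying data):
pd.set_option('display.precision', 1)  # all printed DataFrames now show one decimal place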
We can also change the column header names if we want. Here’s the same code, but with .rename().
sum_table_group = nc.groupby('type')['beer_count'].agg([np.mean, np.median, np.std]).rename(columns={'mean': 'avg', 'median': '50th', 'std': 'sigma'})
sum_table_group.style.format(precision=0, na_rep='-')
avg | 50th | sigma | |
---|---|---|---|
type | |||
Brewpub | 45 | 21 | 77 |
Brewpub/Brewery | 38 | 28 | 36 |
Client Brewer | 5 | 2 | 7 |
Commercial Brewery | 11 | 9 | 7 |
Microbrewery | 31 | 18 | 37 |
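pandas also has a named aggregation syntax that renames columns as you aggregate, so you can skip the .rename() step. A sketch of the same table (my example; note that the keyword names must be valid Python identifiers, so I use p50 rather than 50th):
sum_table_group = nc.groupby('type').agg(
    avg=('beer_count', 'mean'),    # new name = (column, function)
    p50=('beer_count', 'median'),
    sigma=('beer_count', 'std'),
)
sum_table_group.style.format(precision=0, na_rep='-')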
Let’s look at how to do this using some different syntax - something more vertical. Note the () surrounding all of the code. This is called method chaining.
(
nc.groupby('type')
.mean()['beer_count']
.to_frame()
.style
.format(precision=0)
.set_caption('Beer Count by Brewery Type')
)
beer_count | |
---|---|
type | |
Brewpub | 45 |
Brewpub/Brewery | 38 |
Client Brewer | 5 |
Commercial Brewery | 11 |
Microbrewery | 31 |
We are starting with the nc DataFrame and grouping by type. I am calculating the mean of beer_count. This actually creates a series, not a DataFrame. You can’t style a series and make it look good in your output. So, I use .to_frame() to put it back into a DataFrame that can be styled. I then use different style functions on that DataFrame. Both format and set_caption are functions found as part of style.
I can also use the .agg function in this way. I’ll again pass the list of numpy functions for .agg to use when summarizing the beer_count variable. Now, note that .agg creates a DataFrame, not a series, so I don’t need the .to_frame() in there to get back to a DataFrame for styling.
Note
It’s important to keep track of the data types that functions expect and the types that they return. What goes in and what comes out?
(
nc.groupby('type')
.agg([np.mean, np.median, np.std])['beer_count']
.rename(columns={'mean': 'avg', 'median': '50th', 'std': 'sigma'})
.style
.format(precision=0)
.set_caption('Beer Count by Brewery Type')
)
/var/folders/kx/y8vj3n6n5kq_d74vj24jsnh40000gn/T/ipykernel_38483/3939152139.py:2: FutureWarning: ['name', 'city', 'status', 'url'] did not aggregate successfully. If any error is raised this will raise in a future version of pandas. Drop these columns/ops to avoid this warning.
nc.groupby('type')
avg | 50th | sigma | |
---|---|---|---|
type | |||
Brewpub | 45 | 21 | 77 |
Brewpub/Brewery | 38 | 28 | 36 |
Client Brewer | 5 | 2 | 7 |
Commercial Brewery | 11 | 9 | 7 |
Microbrewery | 31 | 18 | 37 |
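To make the note above concrete, here’s a quick sketch (my addition) that checks what each step returns:
print(type(nc.groupby('type')['beer_count'].mean()))  # a pandas Series
print(type(nc.groupby('type')['beer_count'].agg([np.mean, np.median])))  # a pandas DataFrame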
4.5.2.2. Grouping by more than one variable#
Let’s go back to just the mean. I’ll group by two variables now. We give .groupby() a list of what to group by. Note the square brackets.
nc.groupby(['type', 'status'])['beer_count'].mean().to_frame()
beer_count | ||
---|---|---|
type | status | |
Brewpub | Active | 49.206897 |
Closed | 14.000000 | |
Brewpub/Brewery | Active | 40.405405 |
Closed | 12.750000 | |
Client Brewer | Active | 2.857143 |
Closed | 12.500000 | |
Commercial Brewery | Active | 10.666667 |
Closed | NaN | |
Microbrewery | Active | 33.173333 |
Closed | 12.733333 |
That looks pretty good. You can see the two columns being used as groups.
Let’s style this table a bit more. The first line groups, creates the means, and puts all of this back into a DataFrame. The second line formats the DataFrame.
The summary DataFrame, sum_table_mean, is our table.
sum_table_mean = nc.groupby(['type', 'status'])['beer_count'].mean().to_frame()
sum_table_mean.fillna('-').style.format(precision=0).set_caption('Beer Count by Brewery Type and Status')
beer_count | ||
---|---|---|
type | status | |
Brewpub | Active | 49 |
Closed | 14 | |
Brewpub/Brewery | Active | 40 |
Closed | 13 | |
Client Brewer | Active | 3 |
Closed | 12 | |
Commercial Brewery | Active | 11 |
Closed | - | |
Microbrewery | Active | 33 |
Closed | 13 |
We can try our more vertical style too.
(
nc.groupby(['type', 'status'])
.mean()['beer_count']
)
type status
Brewpub Active 49.206897
Closed 14.000000
Brewpub/Brewery Active 40.405405
Closed 12.750000
Client Brewer Active 2.857143
Closed 12.500000
Commercial Brewery Active 10.666667
Closed NaN
Microbrewery Active 33.173333
Closed 12.733333
Name: beer_count, dtype: float64
See how that doesn’t look as good? We’ve made a series again. We need to make it a DataFrame and style it.
(
nc.groupby(['type', 'status'])
.mean()['beer_count']
.to_frame()
.fillna('-')
.style
.format(precision=0)
.set_caption('Beer Count by Brewery Type and Status')
)
beer_count | ||
---|---|---|
type | status | |
Brewpub | Active | 49 |
Closed | 14 | |
Brewpub/Brewery | Active | 40 |
Closed | 13 | |
Client Brewer | Active | 3 |
Closed | 12 | |
Commercial Brewery | Active | 11 |
Closed | - | |
Microbrewery | Active | 33 |
Closed | 13 |
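One more presentation option (my sketch, not from the original notes): .unstack() pivots the inner index level, status here, into columns, which can make a two-way table easier to scan.
(
    nc.groupby(['type', 'status'])['beer_count']
    .mean()
    .unstack()  # status categories become columns
    .style
    .format(precision=0, na_rep='-')
    .set_caption('Beer Count by Brewery Type and Status')
)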
4.5.2.3. GroupBy objects#
We’ve been splitting our data into groups. Let’s look at GroupBy
objects themselves. These are created by the groupby
function. Our textbook discusses these, starting pg. 130. Essentially, it’s another way of doing what we’ve already done in these notes.
Let’s create a GroupBy
object with the brewery data and take an look.
nc_groups = nc.groupby('type')
print(type(nc_groups))
print(nc_groups.size())
<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
type
Brewpub 33
Brewpub/Brewery 41
Client Brewer 9
Commercial Brewery 3
Microbrewery 165
dtype: int64
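Before summarizing, note that a GroupBy object can also be iterated over, or asked for a single group. A quick sketch (my addition):
for group_name, group_df in nc_groups:
    print(group_name, group_df.shape)  # each group is its own DataFrame
nc_groups.get_group('Brewpub').head()  # just the Brewpub rows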
This GroupBy object has all of our data, but grouped by type. We can calculate the mean of all numeric data in the GroupBy.
nc_groups.mean()
beer_count | est | |
---|---|---|
type | ||
Brewpub | 44.939394 | 2010.333333 |
Brewpub/Brewery | 37.707317 | 2013.487805 |
Client Brewer | 5.000000 | 2009.555556 |
Commercial Brewery | 10.666667 | 2016.000000 |
Microbrewery | 31.315152 | 2012.260606 |
We can also use .agg again to get multiple statistics. I’ll pull out just beer_count and get the sum, mean, min, and max by group type. So, very similar to above, but now we are dealing with a DataFrame that’s “pre-grouped” in a sense. Also, note again how I use np.mean, but just sum. The sum function is base Python, but mean comes from numpy, so I need the np.mean.
nc_groups['beer_count'].agg([sum, np.mean, min, max])
sum | mean | min | max | |
---|---|---|---|---|
type | ||||
Brewpub | 1483 | 44.939394 | 1 | 424 |
Brewpub/Brewery | 1546 | 37.707317 | 3 | 165 |
Client Brewer | 45 | 5.000000 | 1 | 23 |
Commercial Brewery | 32 | 10.666667 | 5 | 18 |
Microbrewery | 5167 | 31.315152 | 1 | 279 |
4.5.3. Outliers#
Our summary statistics can give us a sense of the distribution of the data. We’ll be plotting the actual distributions in the coming chapters. Sometimes, we might want to work with just a portion of the data, after removing some extreme examples.
We’ll go back to the original nc data, before we grouped it. I’ll mask, or screen, for only active breweries and then only keep the breweries with a beer count less than the 95th percentile threshold.
nc = nc[nc['status'] == 'Active'] # Active breweries only
outliers = nc['beer_count'].quantile(.95) # Outlier threshold is at the 95th percentile of beer count
outliers
123.75
The 95th percentile of beer count is 123.75. Let’s only keep breweries with fewer beers than that. I won’t save the DataFrame, but I’ll sort it so that you can see that the outliers have been removed.
nc[nc['beer_count'] < outliers].sort_values(by='beer_count', ascending=False) # Remove outliers
name | city | type | beer_count | est | status | url | |
---|---|---|---|---|---|---|---|
96 | Highland Brewing Company | Asheville | Microbrewery | 123 | 1994 | Active | https://www.ratebeer.com//brewers/highland-bre... |
98 | Hi-Wire Brewing | Asheville | Microbrewery | 113 | 2013 | Active | https://www.ratebeer.com//brewers/hi-wire-brew... |
148 | Olde Hickory Brewery | Hickory | Microbrewery | 108 | 1996 | Active | https://www.ratebeer.com//brewers/olde-hickory... |
116 | Legion Brewing | Charlotte | Brewpub | 107 | 2016 | Active | https://www.ratebeer.com//brewers/legion-brewi... |
8 | Ass Clown Brewing Company | Cornelius | Microbrewery | 106 | 2011 | Active | https://www.ratebeer.com//brewers/ass-clown-br... |
... | ... | ... | ... | ... | ... | ... | ... |
101 | Hoppy Trout Brewing Company | Andrews | Brewpub | 1 | 2016 | Active | https://www.ratebeer.com//brewers/hoppy-trout-... |
2 | 7 Clans Brewing | Cherokee | Client Brewer | 1 | 2018 | Active | https://www.ratebeer.com//brewers/7-clans-brew... |
168 | Running Brewing Company | Raleigh | Microbrewery | 1 | 2016 | Active | https://www.ratebeer.com//brewers/running-brew... |
165 | Riverbend Malt House | Asheville | Client Brewer | 1 | 2012 | Active | https://www.ratebeer.com//brewers/riverbend-ma... |
68 | Eurisko Beer Company | Asheville | Microbrewery | 1 | 2018 | Active | https://www.ratebeer.com//brewers/eurisko-beer... |
214 rows × 7 columns
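Dropping observations isn’t the only way to handle outliers. Another common approach (a sketch of an alternative, not what these notes do) is winsorizing - capping extreme values at a threshold instead of removing them - which pandas supports with .clip():
nc_capped = nc.assign(beer_count=nc['beer_count'].clip(upper=outliers))  # cap at the 95th percentile
nc_capped['beer_count'].max()  # no brewery exceeds 123.75 now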
4.5.4. skimpy#
Adding in some new libraries really lets us see why using something like Python can improve our workflow. The skimpy library gives us a nice summary of our data. We can install skimpy into our Anaconda distribution of Python using pip. To use this package, you’ll need to type the following in the terminal (Mac) or command prompt (Windows). You’ll need to restart VS Code.
python -m pip install skimpy
This should install skimpy in your Anaconda distribution of Python. Remember, you actually have several versions of Python on your computer. This is why you have that choice of Python kernel when you’re running code.
I’m going to bring back the Zillow uw DataFrame, just because it has more variables to look at.
You can read about this package here.
import pandas as pd
from skimpy import skim
uw = pd.read_csv('https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/zestimatesAndCutoffs_byGeo_uw_2017-10-10_forDataPage.csv')
skim(uw)
[skim(uw) output: a terminal summary table. Data summary: 2610 rows, 24 columns. Data types: 19 float64, 4 string, 1 int64. For each numeric variable, skimpy reports the NA count and percent, mean, standard deviation, percentiles (p0, p25, p75, p100), and a small histogram. For each string variable (RegionType, RegionName, StateName, MSAName), it reports the NA count and percent, words per row, and total words. Note that MSAName is missing in 59% of rows.]
4.5.5. ydata profiling#
If you want something more in-depth, try ydata-profiling. We can install ydata-profiling using pip. To use this package, you’ll need to type the following in the terminal (Mac) or command prompt (Windows). You’ll need to restart VS Code.
python -m pip install ydata_profiling
You can read all about it on their Github page. It’s ridiculous what it can do.
Let’s add the import needed to bring ProfileReport from ydata_profiling.
from ydata_profiling import ProfileReport
/opt/anaconda3/lib/python3.9/site-packages/seaborn/rcmod.py:82: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
if LooseVersion(mpl.__version__) >= "3.0":
/opt/anaconda3/lib/python3.9/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
other = LooseVersion(other)
This will create the report.
profile = ProfileReport(uw, title='Zillow Housing Data for Flood Risk', minimal=True)
We can now view the report in a Jupyter widget. It will take a couple of minutes to summarize everything.
#profile.to_widgets()
You can also create an HTML file. Using the code below, the file will end up in the same folder as your .ipynb notebook. I’ve commented it out so that it doesn’t run and create a large file every time.
#profile.to_file("nc_profile.html")
Finally, I’ll create a profile that will appear in this .ipynb.
profile.to_notebook_iframe()
/opt/anaconda3/lib/python3.9/site-packages/ydata_profiling/profile_report.py:479: DeprecationWarning: Importing display from IPython.core.display is deprecated since IPython 7.14, please import from IPython display
from IPython.core.display import display
/opt/anaconda3/lib/python3.9/site-packages/ydata_profiling/report/presentation/flavours/widget/correlation_table.py:1: DeprecationWarning: Importing display from IPython.core.display is deprecated since IPython 7.14, please import from IPython display
from IPython.core.display import display
/opt/anaconda3/lib/python3.9/site-packages/ydata_profiling/report/presentation/flavours/widget/duplicate.py:1: DeprecationWarning: Importing display from IPython.core.display is deprecated since IPython 7.14, please import from IPython display
from IPython.core.display import display
/opt/anaconda3/lib/python3.9/site-packages/ydata_profiling/report/presentation/flavours/widget/sample.py:1: DeprecationWarning: Importing display from IPython.core.display is deprecated since IPython 7.14, please import from IPython display
from IPython.core.display import display