4.4. Cleaning our data#

Cleaning your data and dealing with dates. That’s the life of anyone who deals with data, especially in finance. Most data sets are messy. This is especially true if you or your firm is the one collecting the data and not just purchasing something.

Then, once your data is (reasonably) clean and error free, you start creating the variables that will go into your statistical/machine learning models. This is called feature engineering.

DataCamp has an article on cleaning data in pandas.

See Chapter 7 of Python for Data Analysis for more data cleaning tips.

And, as always, I like the Coding for Economists text.

You'll see all of these steps again when we get to our actual models.

4.4.1. Data Integrity#

Before you even start to clean your data, you have to know how your data were created.

What does this mean in practice? Read the data manual if you are using data constructed by someone else! Did you pull data from FRED? How did the BLS construct that time series? Did anything change? Can you really compare a number from 2023 to 2019, or did a definition change?

Can you compare Super Bowl viewership over time?

4.4.2. Underwater exploration#

We’ll start by bringing in our Zillow data again.

import numpy as np
import pandas as pd

uw = pd.read_csv('https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/zestimatesAndCutoffs_byGeo_uw_2017-10-10_forDataPage.csv')
uw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2610 entries, 0 to 2609
Data columns (total 24 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   RegionType                    2610 non-null   object 
 1   RegionName                    2610 non-null   object 
 2   StateName                     2609 non-null   object 
 3   MSAName                       1071 non-null   object 
 4   AllHomes_Tier1                2610 non-null   float64
 5   AllHomes_Tier2                2610 non-null   float64
 6   AllHomes_Tier3                2610 non-null   float64
 7   AllHomes_AllTiers             2610 non-null   float64
 8   UWHomes_Tier1                 2610 non-null   float64
 9   UWHomes_Tier2                 2610 non-null   int64  
 10  UWHomes_Tier3                 2610 non-null   float64
 11  UWHomes_AllTiers              2610 non-null   float64
 12  UWHomes_TotalValue_Tier1      2610 non-null   float64
 13  UWHomes_TotalValue_Tier2      2610 non-null   float64
 14  UWHomes_TotalValue_Tier3      2610 non-null   float64
 15  UWHomes_TotalValue_AllTiers   2610 non-null   float64
 16  UWHomes_MedianValue_AllTiers  2610 non-null   float64
 17  AllHomes_Tier1_ShareUW        2610 non-null   float64
 18  AllHomes_Tier2_ShareUW        2610 non-null   float64
 19  AllHomes_Tier3_ShareUW        2610 non-null   float64
 20  AllHomes_AllTiers_ShareUW     2610 non-null   float64
 21  UWHomes_ShareInTier1          2610 non-null   float64
 22  UWHomes_ShareInTier2          2610 non-null   float64
 23  UWHomes_ShareInTier3          2610 non-null   float64
dtypes: float64(19), int64(1), object(4)
memory usage: 489.5+ KB

One of the most important steps in data cleaning is just looking at what we have. What are the variables? What are their types? How many unique values of each variable do we have? Any missings? Do we see anything unexpected?

This page has a nice summary.
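For a quick first pass, a few standard pandas methods cover most of these questions. This is just a sketch of the checks I'd run; nothing here changes the data.

uw.head()        # peek at the first five rows
uw.dtypes        # variable types
uw.nunique()     # number of unique values per column
uw.isna().sum()  # number of missing values per column
uw.describe()    # summary statistics for the numeric columns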

We can select a column and change the variable type using .astype().

uw['UWHomes_Tier2'] = uw['UWHomes_Tier2'].astype('float64')

This code is actually taking the column, changing its type, and saving it back to the DataFrame under the same column name. Note that we could have saved the result under a new column name instead.
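For example, something like this would keep the original column and store the converted values in a new column. The new column name here is just made up for illustration.

uw['UWHomes_Tier2_float'] = uw['UWHomes_Tier2'].astype('float64')  # hypothetical new column name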

4.4.3. Counting, missings, and dupes#

When you first bring in some data, you often want to do some simple checks. How many observations do you have? Are you missing anything? Are there duplicate observations? How do we define duplicate observations?

You can’t do any data analysis until you understand the structure of your data. What do you have? Why do you have it? How do you want to use it?
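A quick way to answer the "how many observations" question is to check the dimensions of the DataFrame.

uw.shape  # (2610, 24) - rows and columns
len(uw)   # 2610 - just the number of rows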

Let’s start with counts. We can count the number of times a category appears in a variable. Note that this is only useful for variables that have a sense of category. You wouldn’t do this for the home count or dollar value variables, for example.

We can select a categorical variable, like RegionType and then use .value_counts() from pandas.

uw['RegionType'].value_counts()
Zip       1247
City      1017
County     227
MSA         95
State       23
Nation       1
Name: RegionType, dtype: int64

We should look for missing values for each variable. isna() returns True or False for each value, depending on whether or not it is NaN, or missing. We can then take those True/False values, which act like 1s and 0s, and add them up with sum().

uw.isna().sum()
RegionType                         0
RegionName                         0
StateName                          1
MSAName                         1539
AllHomes_Tier1                     0
AllHomes_Tier2                     0
AllHomes_Tier3                     0
AllHomes_AllTiers                  0
UWHomes_Tier1                      0
UWHomes_Tier2                      0
UWHomes_Tier3                      0
UWHomes_AllTiers                   0
UWHomes_TotalValue_Tier1           0
UWHomes_TotalValue_Tier2           0
UWHomes_TotalValue_Tier3           0
UWHomes_TotalValue_AllTiers        0
UWHomes_MedianValue_AllTiers       0
AllHomes_Tier1_ShareUW             0
AllHomes_Tier2_ShareUW             0
AllHomes_Tier3_ShareUW             0
AllHomes_AllTiers_ShareUW          0
UWHomes_ShareInTier1               0
UWHomes_ShareInTier2               0
UWHomes_ShareInTier3               0
dtype: int64

You can see the logic in how these functions work. We take our DataFrame uw and we send it to isna(). This actually creates another DataFrame, that we then pass to sum(). Check it out.

uw.isna()
RegionType RegionName StateName MSAName AllHomes_Tier1 AllHomes_Tier2 AllHomes_Tier3 AllHomes_AllTiers UWHomes_Tier1 UWHomes_Tier2 ... UWHomes_TotalValue_Tier3 UWHomes_TotalValue_AllTiers UWHomes_MedianValue_AllTiers AllHomes_Tier1_ShareUW AllHomes_Tier2_ShareUW AllHomes_Tier3_ShareUW AllHomes_AllTiers_ShareUW UWHomes_ShareInTier1 UWHomes_ShareInTier2 UWHomes_ShareInTier3
0 False False True True False False False False False False ... False False False False False False False False False False
1 False False False True False False False False False False ... False False False False False False False False False False
2 False False False True False False False False False False ... False False False False False False False False False False
3 False False False True False False False False False False ... False False False False False False False False False False
4 False False False True False False False False False False ... False False False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2605 False False False True False False False False False False ... False False False False False False False False False False
2606 False False False True False False False False False False ... False False False False False False False False False False
2607 False False False True False False False False False False ... False False False False False False False False False False
2608 False False False True False False False False False False ... False False False False False False False False False False
2609 False False False True False False False False False False ... False False False False False False False False False False

2610 rows × 24 columns

The same sort of logic applies to the method unique(). This one gives an array of each unique value in a column or set of columns. This will save an array of unique values that appear in the RegionName column.

unique_regions = uw.RegionName.unique()

I’m using the .ColumnName construction to pull the variable.
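The bracket syntax does the same thing; pick whichever you find clearer.

unique_regions = uw['RegionName'].unique()  # equivalent to uw.RegionName.unique()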

We can count the number of unique values for a variable.

uw.RegionName.nunique()
2496

We can filter on one variable and count another. We’re looking for unexpected things, just trying to get a sense for what we have.

uw[uw['RegionType'] == 'MSA'].MSAName.nunique()
95

This syntax works too! Instead of the .ColumnName construction, we are pulling the column out by putting its name inside [].

uw[uw['RegionType'] == 'MSA']['MSAName'].nunique()
95

We have 95 unique MSAs in our data. Seems reasonable. MSAs are like city-regions.

Finally, we can just drop any rows with missing values using .dropna() from pandas. If we don’t specify a column, then it will drop a row if ANY value is missing. I don’t actually want to drop missings here, so I’m not going to save my work with a uw = .

You can read more here.

uw.dropna(axis = 0)
RegionType RegionName StateName MSAName AllHomes_Tier1 AllHomes_Tier2 AllHomes_Tier3 AllHomes_AllTiers UWHomes_Tier1 UWHomes_Tier2 ... UWHomes_TotalValue_Tier3 UWHomes_TotalValue_AllTiers UWHomes_MedianValue_AllTiers AllHomes_Tier1_ShareUW AllHomes_Tier2_ShareUW AllHomes_Tier3_ShareUW AllHomes_AllTiers_ShareUW UWHomes_ShareInTier1 UWHomes_ShareInTier2 UWHomes_ShareInTier3
251 MSA Aberdeen, WA Washington Aberdeen, WA 11606.0 11751.0 11749.0 35106.0 4119.0 2450.0 ... 1.734113e+08 7.791588e+08 88327.0 0.354903 0.208493 0.065537 0.209053 0.561248 0.333833 0.104919
252 MSA Astoria, OR Oregon Astoria, OR 4745.0 5228.0 5568.0 15541.0 811.0 376.0 ... 9.138231e+07 3.620604e+08 237912.0 0.170917 0.071920 0.040409 0.090856 0.574363 0.266289 0.159348
253 MSA Egg Harbor Township, NJ New Jersey Egg Harbor Township, NJ 34185.0 34117.0 34088.0 102390.0 9116.0 9575.0 ... 5.576646e+09 1.030631e+10 264441.0 0.266667 0.280652 0.252200 0.266510 0.334066 0.350887 0.315047
254 MSA Baltimore, MD Maryland Baltimore, MD 298318.0 312603.0 316502.0 927423.0 2293.0 2910.0 ... 3.698818e+09 5.153664e+09 336801.0 0.007686 0.009309 0.021861 0.013071 0.189160 0.240059 0.570780
255 MSA East Falmouth, MA Massachusetts East Falmouth, MA 45345.0 45358.0 45258.0 135961.0 1651.0 1395.0 ... 4.205784e+09 5.349099e+09 588650.0 0.036410 0.030755 0.073004 0.046705 0.260000 0.219685 0.520315
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1358 City Town of Salisbury Massachusetts Boston, MA 1207.0 1207.0 1207.0 3621.0 397.0 402.0 ... 1.484370e+08 3.900701e+08 342026.0 0.328915 0.333057 0.221210 0.294394 0.372420 0.377111 0.250469
1359 City Town of Duxbury Massachusetts Boston, MA 1835.0 1835.0 1837.0 5507.0 68.0 33.0 ... 1.909504e+08 2.389202e+08 749341.0 0.037057 0.017984 0.056614 0.037225 0.331707 0.160976 0.507317
1360 City New Shoreham Rhode Island Providence, RI 506.0 506.0 506.0 1518.0 57.0 25.0 ... 9.383493e+07 1.654886e+08 1072092.0 0.112648 0.049407 0.094862 0.085639 0.438462 0.192308 0.369231
1361 City Yarmouth Massachusetts East Falmouth, MA 5178.0 5178.0 5180.0 15536.0 472.0 184.0 ... 4.531706e+08 6.061486e+08 332409.0 0.091155 0.035535 0.102124 0.076274 0.398312 0.155274 0.446414
1362 City Town of Hingham Massachusetts Boston, MA 2462.0 2462.0 2462.0 7386.0 48.0 33.0 ... 8.266223e+07 1.309589e+08 731420.0 0.019496 0.013404 0.021121 0.018007 0.360902 0.248120 0.390977

1071 rows × 24 columns

We now have 1071 rows, since MSAName had so many missing values. But that was OK, given how these data are constructed. You'll see another example with stock data below.

Now, I’ll just drop observations if AllHomes_Tier1 is missing.

uw.dropna(subset = ['AllHomes_Tier1'], axis = 0)
RegionType RegionName StateName MSAName AllHomes_Tier1 AllHomes_Tier2 AllHomes_Tier3 AllHomes_AllTiers UWHomes_Tier1 UWHomes_Tier2 ... UWHomes_TotalValue_Tier3 UWHomes_TotalValue_AllTiers UWHomes_MedianValue_AllTiers AllHomes_Tier1_ShareUW AllHomes_Tier2_ShareUW AllHomes_Tier3_ShareUW AllHomes_AllTiers_ShareUW UWHomes_ShareInTier1 UWHomes_ShareInTier2 UWHomes_ShareInTier3
0 Nation United States NaN NaN 35461549.0 35452941.0 35484532.0 106399022.0 594672.0 542681.0 ... 5.974490e+11 9.159540e+11 310306.0 0.016769 0.015307 0.020458 0.017512 0.319149 0.291246 0.389605
1 State Alabama Alabama NaN 546670.0 520247.0 491300.0 1558217.0 2890.0 2766.0 ... 2.583039e+09 3.578203e+09 270254.0 0.005287 0.005317 0.010271 0.006868 0.270043 0.258456 0.471501
2 State California California NaN 3060171.0 3076238.0 3169584.0 9305993.0 8090.0 14266.0 ... 2.666638e+10 4.449536e+10 860373.0 0.002644 0.004637 0.005611 0.004313 0.201550 0.355415 0.443035
3 State Connecticut Connecticut NaN 333904.0 334688.0 336254.0 1004846.0 4434.0 3807.0 ... 8.689480e+09 1.152170e+10 442036.0 0.013279 0.011375 0.021677 0.015455 0.285512 0.245138 0.469350
4 State Delaware Delaware NaN 102983.0 127573.0 127983.0 358539.0 4105.0 2574.0 ... 1.498013e+09 2.801847e+09 241344.5 0.039861 0.020177 0.025675 0.027793 0.411942 0.258304 0.329754
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2605 Zip 98592 Washington NaN 464.0 470.0 493.0 1427.0 8.0 29.0 ... 7.340711e+07 8.271541e+07 436490.5 0.017241 0.061702 0.273834 0.120533 0.046512 0.168605 0.784884
2606 Zip 98595 Washington NaN 558.0 571.0 598.0 1727.0 119.0 104.0 ... 1.376119e+07 4.036570e+07 131441.0 0.213262 0.182137 0.095318 0.162131 0.425000 0.371429 0.203571
2607 Zip 98612 Washington NaN 365.0 376.0 409.0 1150.0 91.0 79.0 ... 1.440674e+07 5.028400e+07 230495.0 0.249315 0.210106 0.092910 0.180870 0.437500 0.379808 0.182692
2608 Zip 32081 Florida NaN 1318.0 1328.0 1350.0 3996.0 91.0 42.0 ... 9.111212e+07 1.325055e+08 385196.0 0.069044 0.031627 0.097778 0.066316 0.343396 0.158491 0.498113
2609 Zip 33578 Florida NaN 5975.0 4537.0 3232.0 13744.0 8.0 6.0 ... 5.685202e+07 5.911551e+07 363355.0 0.001339 0.001322 0.038366 0.010041 0.057971 0.043478 0.898551

2610 rows × 24 columns

Nothing gets dropped, since there were no missing values.

Unique and missing values are important. So is the idea of duplicates. Does a row (an observation) contain the same values as another row? That could be all of the values across all of the variables, or just a particular column (e.g. ticker), or a set of columns (e.g. ticker and date).

We can see if our data has any duplicates by using .duplicated() from pandas.

You can read more here.

I'm going to filter my data to only include RegionType='Zip' and then look for duplicate Zip codes. The Zip codes are found in RegionName. I'll save this to a new DataFrame called dupes. The argument keep='first' flags repeated values as duplicates but leaves the first occurrence unflagged.

I am then going to mask my subset of the uw data. So, I filter to get just Zip codes. Then, I mask on the dupes DataFrame. This DataFrame is just an array of True/False values. By masking, I’m telling Python to only keep values that are True. There are no True values, so no duplicate Zips and the resulting DataFrame is empty.

dupes = uw[uw['RegionType'] == 'Zip'].duplicated(subset=['RegionName'], keep='first')

uw[uw['RegionType'] == 'Zip'][dupes]
RegionType RegionName StateName MSAName AllHomes_Tier1 AllHomes_Tier2 AllHomes_Tier3 AllHomes_AllTiers UWHomes_Tier1 UWHomes_Tier2 ... UWHomes_TotalValue_Tier3 UWHomes_TotalValue_AllTiers UWHomes_MedianValue_AllTiers AllHomes_Tier1_ShareUW AllHomes_Tier2_ShareUW AllHomes_Tier3_ShareUW AllHomes_AllTiers_ShareUW UWHomes_ShareInTier1 UWHomes_ShareInTier2 UWHomes_ShareInTier3

0 rows × 24 columns

We can use drop_duplicates() to look at our data and drop rows that are duplicate observations. We use the subset= argument to tell it what columns to look at for duplicates. We can leave out this argument if we want to look across all columns.

The argument keep= will tell it which duplicate to keep in the data (e.g. the first one or the last one).

You can read more about the syntax here.

uw.drop_duplicates()
RegionType RegionName StateName MSAName AllHomes_Tier1 AllHomes_Tier2 AllHomes_Tier3 AllHomes_AllTiers UWHomes_Tier1 UWHomes_Tier2 ... UWHomes_TotalValue_Tier3 UWHomes_TotalValue_AllTiers UWHomes_MedianValue_AllTiers AllHomes_Tier1_ShareUW AllHomes_Tier2_ShareUW AllHomes_Tier3_ShareUW AllHomes_AllTiers_ShareUW UWHomes_ShareInTier1 UWHomes_ShareInTier2 UWHomes_ShareInTier3
0 Nation United States NaN NaN 35461549.0 35452941.0 35484532.0 106399022.0 594672.0 542681.0 ... 5.974490e+11 9.159540e+11 310306.0 0.016769 0.015307 0.020458 0.017512 0.319149 0.291246 0.389605
1 State Alabama Alabama NaN 546670.0 520247.0 491300.0 1558217.0 2890.0 2766.0 ... 2.583039e+09 3.578203e+09 270254.0 0.005287 0.005317 0.010271 0.006868 0.270043 0.258456 0.471501
2 State California California NaN 3060171.0 3076238.0 3169584.0 9305993.0 8090.0 14266.0 ... 2.666638e+10 4.449536e+10 860373.0 0.002644 0.004637 0.005611 0.004313 0.201550 0.355415 0.443035
3 State Connecticut Connecticut NaN 333904.0 334688.0 336254.0 1004846.0 4434.0 3807.0 ... 8.689480e+09 1.152170e+10 442036.0 0.013279 0.011375 0.021677 0.015455 0.285512 0.245138 0.469350
4 State Delaware Delaware NaN 102983.0 127573.0 127983.0 358539.0 4105.0 2574.0 ... 1.498013e+09 2.801847e+09 241344.5 0.039861 0.020177 0.025675 0.027793 0.411942 0.258304 0.329754
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2605 Zip 98592 Washington NaN 464.0 470.0 493.0 1427.0 8.0 29.0 ... 7.340711e+07 8.271541e+07 436490.5 0.017241 0.061702 0.273834 0.120533 0.046512 0.168605 0.784884
2606 Zip 98595 Washington NaN 558.0 571.0 598.0 1727.0 119.0 104.0 ... 1.376119e+07 4.036570e+07 131441.0 0.213262 0.182137 0.095318 0.162131 0.425000 0.371429 0.203571
2607 Zip 98612 Washington NaN 365.0 376.0 409.0 1150.0 91.0 79.0 ... 1.440674e+07 5.028400e+07 230495.0 0.249315 0.210106 0.092910 0.180870 0.437500 0.379808 0.182692
2608 Zip 32081 Florida NaN 1318.0 1328.0 1350.0 3996.0 91.0 42.0 ... 9.111212e+07 1.325055e+08 385196.0 0.069044 0.031627 0.097778 0.066316 0.343396 0.158491 0.498113
2609 Zip 33578 Florida NaN 5975.0 4537.0 3232.0 13744.0 8.0 6.0 ... 5.685202e+07 5.911551e+07 363355.0 0.001339 0.001322 0.038366 0.010041 0.057971 0.043478 0.898551

2610 rows × 24 columns

uw.drop_duplicates(subset=['RegionName', 'StateName'], keep='last')
RegionType RegionName StateName MSAName AllHomes_Tier1 AllHomes_Tier2 AllHomes_Tier3 AllHomes_AllTiers UWHomes_Tier1 UWHomes_Tier2 ... UWHomes_TotalValue_Tier3 UWHomes_TotalValue_AllTiers UWHomes_MedianValue_AllTiers AllHomes_Tier1_ShareUW AllHomes_Tier2_ShareUW AllHomes_Tier3_ShareUW AllHomes_AllTiers_ShareUW UWHomes_ShareInTier1 UWHomes_ShareInTier2 UWHomes_ShareInTier3
0 Nation United States NaN NaN 35461549.0 35452941.0 35484532.0 106399022.0 594672.0 542681.0 ... 5.974490e+11 9.159540e+11 310306.0 0.016769 0.015307 0.020458 0.017512 0.319149 0.291246 0.389605
1 State Alabama Alabama NaN 546670.0 520247.0 491300.0 1558217.0 2890.0 2766.0 ... 2.583039e+09 3.578203e+09 270254.0 0.005287 0.005317 0.010271 0.006868 0.270043 0.258456 0.471501
2 State California California NaN 3060171.0 3076238.0 3169584.0 9305993.0 8090.0 14266.0 ... 2.666638e+10 4.449536e+10 860373.0 0.002644 0.004637 0.005611 0.004313 0.201550 0.355415 0.443035
3 State Connecticut Connecticut NaN 333904.0 334688.0 336254.0 1004846.0 4434.0 3807.0 ... 8.689480e+09 1.152170e+10 442036.0 0.013279 0.011375 0.021677 0.015455 0.285512 0.245138 0.469350
4 State Delaware Delaware NaN 102983.0 127573.0 127983.0 358539.0 4105.0 2574.0 ... 1.498013e+09 2.801847e+09 241344.5 0.039861 0.020177 0.025675 0.027793 0.411942 0.258304 0.329754
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2605 Zip 98592 Washington NaN 464.0 470.0 493.0 1427.0 8.0 29.0 ... 7.340711e+07 8.271541e+07 436490.5 0.017241 0.061702 0.273834 0.120533 0.046512 0.168605 0.784884
2606 Zip 98595 Washington NaN 558.0 571.0 598.0 1727.0 119.0 104.0 ... 1.376119e+07 4.036570e+07 131441.0 0.213262 0.182137 0.095318 0.162131 0.425000 0.371429 0.203571
2607 Zip 98612 Washington NaN 365.0 376.0 409.0 1150.0 91.0 79.0 ... 1.440674e+07 5.028400e+07 230495.0 0.249315 0.210106 0.092910 0.180870 0.437500 0.379808 0.182692
2608 Zip 32081 Florida NaN 1318.0 1328.0 1350.0 3996.0 91.0 42.0 ... 9.111212e+07 1.325055e+08 385196.0 0.069044 0.031627 0.097778 0.066316 0.343396 0.158491 0.498113
2609 Zip 33578 Florida NaN 5975.0 4537.0 3232.0 13744.0 8.0 6.0 ... 5.685202e+07 5.911551e+07 363355.0 0.001339 0.001322 0.038366 0.010041 0.057971 0.043478 0.898551

2571 rows × 24 columns

You want to keep track of the number of observations (rows) when you’re dropping duplicates. In this case, we go from 2,610 rows to 2,571.
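One easy way to keep track is to check the number of rows before and after. A small sketch:

print(len(uw))                                                                    # 2610 rows before
print(len(uw.drop_duplicates(subset=['RegionName', 'StateName'], keep='last')))   # 2571 rows after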

4.4.4. Sorting our data#

Another basic task that we often face is sorting our data. You can sort numeric or string data by one or more variables using .sort_values() from pandas.

The ascending= argument can be True or False and tells us how to sort. You can pass it a list of how to sort if you’re sorting on multiple variables.

I'll sort the data by RegionType first. Then, I'll sort the data by RegionType, RegionName, and AllHomes_Tier1.

You can read more here.

uw.sort_values(by = 'RegionType', ascending = True)
RegionType RegionName StateName MSAName AllHomes_Tier1 AllHomes_Tier2 AllHomes_Tier3 AllHomes_AllTiers UWHomes_Tier1 UWHomes_Tier2 ... UWHomes_TotalValue_Tier3 UWHomes_TotalValue_AllTiers UWHomes_MedianValue_AllTiers AllHomes_Tier1_ShareUW AllHomes_Tier2_ShareUW AllHomes_Tier3_ShareUW AllHomes_AllTiers_ShareUW UWHomes_ShareInTier1 UWHomes_ShareInTier2 UWHomes_ShareInTier3
1304 City Ortley Beach New Jersey New York, NY 767.0 763.0 766.0 2296.0 746.0 724.0 ... 447937059.0 879315914.0 348092.0 0.972621 0.948886 0.953003 0.958188 0.339091 0.329091 0.331818
896 City Smyrna Delaware Dover, DE 4464.0 4464.0 4464.0 13392.0 128.0 39.0 ... 9078029.0 29936443.0 110029.0 0.028674 0.008737 0.005376 0.014262 0.670157 0.204188 0.125654
895 City Smithfield Virginia Virginia Beach, VA 2251.0 2251.0 2252.0 6754.0 20.0 20.0 ... 68564829.0 76200890.0 379356.5 0.008885 0.008885 0.065719 0.027835 0.106383 0.106383 0.787234
894 City Secaucus New Jersey New York, NY 1586.0 1584.0 1586.0 4756.0 1106.0 537.0 ... 277046340.0 864994299.0 391859.0 0.697352 0.339015 0.288777 0.441758 0.526416 0.255593 0.217991
893 City Seaside Park New Jersey New York, NY 672.0 672.0 672.0 2016.0 443.0 538.0 ... 359297149.0 827539382.0 558632.0 0.659226 0.800595 0.595238 0.685020 0.320782 0.389573 0.289645
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1775 Zip 22307 Virginia NaN 1483.0 1097.0 1970.0 4550.0 181.0 215.0 ... 45692041.0 153788304.0 248578.0 0.122050 0.195989 0.031472 0.100659 0.395197 0.469432 0.135371
1774 Zip 21921 Maryland NaN 5270.0 5284.0 5403.0 15957.0 12.0 47.0 ... 41149066.0 53223597.0 316059.0 0.002277 0.008895 0.017583 0.009651 0.077922 0.305195 0.616883
1773 Zip 21903 Maryland NaN 784.0 851.0 856.0 2491.0 10.0 37.0 ... 21432532.0 30859486.0 260047.5 0.012755 0.043478 0.064252 0.040947 0.098039 0.362745 0.539216
1780 Zip 22578 Virginia NaN 458.0 448.0 462.0 1368.0 73.0 159.0 ... 118131703.0 184974026.0 452094.0 0.159389 0.354911 0.370130 0.294591 0.181141 0.394541 0.424318
2609 Zip 33578 Florida NaN 5975.0 4537.0 3232.0 13744.0 8.0 6.0 ... 56852018.0 59115506.0 363355.0 0.001339 0.001322 0.038366 0.010041 0.057971 0.043478 0.898551

2610 rows × 24 columns

uw.sort_values(by = ['RegionType', 'RegionName', 'AllHomes_Tier1'], ascending = [True, True, False])
RegionType RegionName StateName MSAName AllHomes_Tier1 AllHomes_Tier2 AllHomes_Tier3 AllHomes_AllTiers UWHomes_Tier1 UWHomes_Tier2 ... UWHomes_TotalValue_Tier3 UWHomes_TotalValue_AllTiers UWHomes_MedianValue_AllTiers AllHomes_Tier1_ShareUW AllHomes_Tier2_ShareUW AllHomes_Tier3_ShareUW AllHomes_AllTiers_ShareUW UWHomes_ShareInTier1 UWHomes_ShareInTier2 UWHomes_ShareInTier3
690 City Abbeville Louisiana Lafayette, LA 1995.0 1995.0 1997.0 5987.0 196.0 512.0 ... 30967445.0 100185101.0 103417.0 0.098246 0.256642 0.095643 0.150159 0.218020 0.569522 0.212458
811 City Aberdeen Washington Aberdeen, WA 2799.0 2799.0 2799.0 8397.0 2412.0 1145.0 ... 64053930.0 389836601.0 87694.0 0.861736 0.409075 0.091104 0.453972 0.632739 0.300367 0.066894
346 City Absecon New Jersey Egg Harbor Township, NJ 1097.0 1097.0 1097.0 3291.0 40.0 55.0 ... 18714933.0 33321640.0 180979.0 0.036463 0.050137 0.060164 0.048921 0.248447 0.341615 0.409938
939 City Aiea Hawaii Honolulu, HI 4209.0 4209.0 4211.0 12629.0 0.0 146.0 ... 35306798.0 137032636.0 708881.0 0.000000 0.034688 0.007837 0.014174 0.000000 0.815642 0.184358
1151 City Alva Florida Fort Myers, FL 589.0 589.0 590.0 1768.0 6.0 38.0 ... 59575578.0 69446593.0 369560.0 0.010187 0.064516 0.164407 0.079751 0.042553 0.269504 0.687943
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2603 Zip 98584 Washington NaN 4708.0 4694.0 4766.0 14168.0 8.0 41.0 ... 98174201.0 108265864.0 366002.0 0.001699 0.008735 0.047839 0.019551 0.028881 0.148014 0.823105
2604 Zip 98588 Washington NaN 469.0 479.0 479.0 1427.0 17.0 35.0 ... 51494123.0 60982770.0 351648.0 0.036247 0.073069 0.229645 0.113525 0.104938 0.216049 0.679012
2605 Zip 98592 Washington NaN 464.0 470.0 493.0 1427.0 8.0 29.0 ... 73407109.0 82715412.0 436490.5 0.017241 0.061702 0.273834 0.120533 0.046512 0.168605 0.784884
2606 Zip 98595 Washington NaN 558.0 571.0 598.0 1727.0 119.0 104.0 ... 13761194.0 40365702.0 131441.0 0.213262 0.182137 0.095318 0.162131 0.425000 0.371429 0.203571
2607 Zip 98612 Washington NaN 365.0 376.0 409.0 1150.0 91.0 79.0 ... 14406737.0 50284001.0 230495.0 0.249315 0.210106 0.092910 0.180870 0.437500 0.379808 0.182692

2610 rows × 24 columns

4.4.5. Renaming columns#

While cleaning our data, we might want to rename our columns, or variables. You can do this automatically with something like pyjanitor (see below). You can also do it when importing the data using .read_csv from pandas, if you know that you’ll want to change the name.

It is also easy enough to do using .rename from pandas. You give it a mapping of old name to new name. I’ll rename MSAName for fun.

Note the inplace=True argument. You'll see this a lot for pandas methods. This makes it so that we don't have to do a uw = uw.rename(). Instead, pandas changes the existing DataFrame in place, so the column gets its new name without us saving anything.

You can read more here.

uw.rename(columns = {'MSAName':'msaname'}, inplace = True)
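Without inplace=True, the equivalent would be to assign the result back to uw ourselves. You would run one version or the other, not both.

uw = uw.rename(columns = {'MSAName':'msaname'})  # same outcome as the inplace version above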

4.4.6. Dropping columns#

We can also simply drop a column of data. You could do this via data subsets (e.g. .iloc), but this method lets you use .drop() to specify the name of the column. It’s a little clearer doing things this way.

The axis=1 option means that you are dropping the column named. I’ve picked a column at random to drop. Notice that I’m saving my work back to the uw DataFrame.

uw = uw.drop('UWHomes_TotalValue_Tier3', axis=1)
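.drop() also accepts a list of names, or the columns= keyword, if you want to drop several columns at once. Just a sketch using a couple of columns from our data; I'm not saving this one.

uw.drop(columns = ['UWHomes_Tier2', 'UWHomes_Tier3'])  # returns a new DataFrame without these columns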

4.4.7. Missing data example#

Finally, let’s try some more complicated code. I found this example here.

First, we will create a DataFrame that has the total number of missing values for each variable. We can sort the data using sort_values(). The ascending=False option will have the variable with the largest number of missings at the top.

total = uw.isna().sum().sort_values(ascending=False)

By the way, you'll also see an isnull() method that does the same thing as isna().

Next, let's create a DataFrame that has the percentage of values that are missing for each variable. This is a neat one - we are creating a DataFrame of values (total number missing) for the numerator and another DataFrame of values (total number) for the denominator. Then, we divide the two DataFrames, giving us another DataFrame with the result. We then sort.

percent = (uw.isnull().sum()/uw.isnull().count()).sort_values(ascending = False)

We can use a new function called concat from pandas that combines data, either as rows (stacking) or columns (combining side by side). We'll combine columns, which means concatenating along axis=1. We'll name both columns. We can do this because each DataFrame has the same index created by pandas, all of our variable names. So, there's a one-to-one correspondence between the two DataFrames.

missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Missing Percent'])

Let’s take the percents and multiply all of them by 100, just to make them look like percents. And to show you, again, that you can.

missing_data['Missing Percent'] = missing_data['Missing Percent'] * 100

For the last step, we can filter and just get the variable names where more than 10% of our data are missing.

missing_data[missing_data['Missing Percent'] > 10]
Total Missing Percent
msaname 1539 58.965517

4.4.8. Back to stocks#

We can bring back the stock data too, as that data has some missing values.

prices = pd.read_csv('https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/tr_eikon_eod_data.csv',
                      index_col = 0, parse_dates = True)

Why are there missing values? Holidays and weekends, when trading doesn't take place. Notice that the counts differ across columns, since different markets close on different days.

prices.isna().sum()
AAPL.O    78
MSFT.O    78
INTC.O    78
AMZN.O    78
GS.N      78
SPY       78
.SPX      78
.VIX      78
EUR=       0
XAU=       5
GDX       78
GLD       78
dtype: int64

We can drop these rows. We’ll specify axis=0, or rows.

prices = prices.dropna(axis=0)
prices.isna().sum()
AAPL.O    0
MSFT.O    0
INTC.O    0
AMZN.O    0
GS.N      0
SPY       0
.SPX      0
.VIX      0
EUR=      0
XAU=      0
GDX       0
GLD       0
dtype: int64
prices.head(15)
AAPL.O MSFT.O INTC.O AMZN.O GS.N SPY .SPX .VIX EUR= XAU= GDX GLD
Date
2010-01-04 30.572827 30.950 20.880 133.900 173.08 113.33 1132.99 20.04 1.4411 1120.00 47.71 109.80
2010-01-05 30.625684 30.960 20.870 134.690 176.14 113.63 1136.52 19.35 1.4368 1118.65 48.17 109.70
2010-01-06 30.138541 30.770 20.800 132.250 174.26 113.71 1137.14 19.16 1.4412 1138.50 49.34 111.51
2010-01-07 30.082827 30.452 20.600 130.000 177.67 114.19 1141.69 19.06 1.4318 1131.90 49.10 110.82
2010-01-08 30.282827 30.660 20.830 133.520 174.31 114.57 1144.98 18.13 1.4412 1136.10 49.84 111.37
2010-01-11 30.015684 30.270 20.950 130.308 171.56 114.73 1146.98 17.55 1.4513 1152.60 50.17 112.85
2010-01-12 29.674256 30.070 20.608 127.350 167.82 113.66 1136.22 18.25 1.4494 1127.30 48.35 110.49
2010-01-13 30.092827 30.350 20.960 129.110 169.07 114.62 1145.68 17.85 1.4510 1138.40 48.86 111.54
2010-01-14 29.918542 30.960 21.480 127.350 168.53 114.93 1148.46 17.63 1.4502 1142.85 48.60 112.03
2010-01-15 29.418542 30.860 20.800 127.140 165.21 113.64 1136.03 17.91 1.4382 1129.90 47.42 110.86
2010-01-19 30.719969 31.100 21.050 127.610 166.86 115.06 1150.23 17.58 1.4298 1137.70 47.69 111.52
2010-01-20 30.246398 30.585 21.080 125.780 167.79 113.89 1138.04 18.68 1.4101 1111.30 45.73 108.94
2010-01-21 29.724542 30.010 20.840 126.620 160.87 111.70 1116.48 22.27 1.4090 1094.75 43.75 107.37
2010-01-22 28.249972 28.960 19.910 121.430 154.12 109.21 1091.76 27.31 1.4137 1092.60 43.79 107.17
2010-01-25 29.010685 29.320 20.320 120.310 154.98 109.77 1096.78 25.41 1.4150 1097.85 43.19 107.48

4.4.9. Pyjanitor#

We are going to look at a fun package called pyjanitor, which is based on the janitor package from the R statistical programming language.

To use this package, you’ll need to type the following in the terminal (Mac) or cmd terminal (Windows).

pip install pyjanitor

This will install pyjanitor using pip. Again, if you're using Windows, make sure that your terminal is cmd and not Powershell. pip works in Powershell, but you have to change a system setting first. You'll need to restart VS Code once you've installed it.

If you’re using Google Colab, there’s currently a bug in pyjanitor that requires you to install a previous version. You can do this in a Jupyter cell. Note that you need to use !pip when using a terminal command inside of a Jupyter notebook.

!pip install pyjanitor==0.23.1

There are even finance-specific tools.

We can bring in janitor with an import statement and then treat its methods as if they were part of pandas.

import janitor

pyjanitor lets us have an interesting workflow. We can read in our data set, remove columns, drop missings, and rename columns, all in one series of steps.

prices = (
    pd.read_csv('https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/tr_eikon_eod_data.csv',
                      index_col = 0, parse_dates = True)
    .remove_columns(['GLD'])
    .dropna()
    .rename_column('AAPL.O', 'AAPL')
    .rename_column('MSFT.O', 'MSFT')
)

There are also some built-in, general functions. clean_names() does what it says. For example, it sets all characters in a variable name to lower case.

prices = pd.read_csv('https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/tr_eikon_eod_data.csv',
                      index_col = 0, parse_dates = True)

prices = prices.clean_names()

Note the syntax - you start with the DataFrame and then apply the function to it.

You might see another syntax, though. You can import the functions directly and then include the DataFrame as an argument to that function. Both ways are detailed in the pyjanitor instructions.

from janitor import clean_names, remove_empty
 
prices = pd.read_csv('https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/tr_eikon_eod_data.csv',
                      index_col = 0, parse_dates = True)

prices = clean_names(prices) # See the difference?

The method flag_nulls creates a new variable that will have a 1 if any of the variables specified are missing. In this case, I didn’t specify anything, so it will look across all of the variables. If any variable is NaN, then that row gets a 1. Notice the any.

prices = pd.read_csv('https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/tr_eikon_eod_data.csv',
                      index_col = 0, parse_dates = True)

prices.flag_nulls()
AAPL.O MSFT.O INTC.O AMZN.O GS.N SPY .SPX .VIX EUR= XAU= GDX GLD null_flag
Date
2010-01-01 NaN NaN NaN NaN NaN NaN NaN NaN 1.4323 1096.35 NaN NaN 1
2010-01-04 30.572827 30.950 20.88 133.90 173.08 113.33 1132.99 20.04 1.4411 1120.00 47.71 109.80 0
2010-01-05 30.625684 30.960 20.87 134.69 176.14 113.63 1136.52 19.35 1.4368 1118.65 48.17 109.70 0
2010-01-06 30.138541 30.770 20.80 132.25 174.26 113.71 1137.14 19.16 1.4412 1138.50 49.34 111.51 0
2010-01-07 30.082827 30.452 20.60 130.00 177.67 114.19 1141.69 19.06 1.4318 1131.90 49.10 110.82 0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2018-06-25 182.170000 98.390 50.71 1663.15 221.54 271.00 2717.07 17.33 1.1702 1265.00 22.01 119.89 0
2018-06-26 184.430000 99.080 49.67 1691.09 221.58 271.60 2723.06 15.92 1.1645 1258.64 21.95 119.26 0
2018-06-27 184.160000 97.540 48.76 1660.51 220.18 269.35 2699.63 17.91 1.1552 1251.62 21.81 118.58 0
2018-06-28 185.500000 98.630 49.25 1701.45 223.42 270.89 2716.31 16.85 1.1567 1247.88 21.93 118.22 0
2018-06-29 185.110000 98.610 49.71 1699.80 220.57 271.28 2718.37 16.09 1.1683 1252.25 22.31 118.65 0

2216 rows × 13 columns

Finally, here's a simple way to see if we have any rows of duplicate data. This will happen surprisingly (and unfortunately) often when we start merging data together.

prices.get_dupes()
AAPL.O MSFT.O INTC.O AMZN.O GS.N SPY .SPX .VIX EUR= XAU= GDX GLD
Date