Cleaning our data#

Cleaning your data and dealing with dates. That’s the life of anyone who deals with data, especially in finance. Most data sets are messy. This is especially true if you or your firm is the one collecting the data and not just purchasing something.

Then, once your data is (reasonably) clean and error free, you start creating the variables that will go into your statistical/machine learning models. This is called feature engineering.

DataCamp has an article on cleaning data in pandas.

Chapter 7 of Python for Data Analysis for more data cleaning tips.

And, as always, I like the Coding for Economists text.

You’ll see all of these steps when we get our actual models.

Data Integrity#

Before you even start to clean your data, you have to know how your data were created.

What does this mean in practice? Read the data manual if you are using data constructed by someone else! Did you pull data from FRED? How did the BLS construct that time series? Did anything change? Can you really compare a number from 2023 to 2019, or did a definition change?

Can you compare Super Bowl viewership over time?

Underwater exploration#

We’ll start by bringing in our Zillow data again.

import numpy as np
import pandas as pd

uw = pd.read_csv('https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/zestimatesAndCutoffs_byGeo_uw_2017-10-10_forDataPage.csv')
uw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2610 entries, 0 to 2609
Data columns (total 24 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   RegionType                    2610 non-null   object 
 1   RegionName                    2610 non-null   object 
 2   StateName                     2609 non-null   object 
 3   MSAName                       1071 non-null   object 
 4   AllHomes_Tier1                2610 non-null   float64
 5   AllHomes_Tier2                2610 non-null   float64
 6   AllHomes_Tier3                2610 non-null   float64
 7   AllHomes_AllTiers             2610 non-null   float64
 8   UWHomes_Tier1                 2610 non-null   float64
 9   UWHomes_Tier2                 2610 non-null   int64  
 10  UWHomes_Tier3                 2610 non-null   float64
 11  UWHomes_AllTiers              2610 non-null   float64
 12  UWHomes_TotalValue_Tier1      2610 non-null   float64
 13  UWHomes_TotalValue_Tier2      2610 non-null   float64
 14  UWHomes_TotalValue_Tier3      2610 non-null   float64
 15  UWHomes_TotalValue_AllTiers   2610 non-null   float64
 16  UWHomes_MedianValue_AllTiers  2610 non-null   float64
 17  AllHomes_Tier1_ShareUW        2610 non-null   float64
 18  AllHomes_Tier2_ShareUW        2610 non-null   float64
 19  AllHomes_Tier3_ShareUW        2610 non-null   float64
 20  AllHomes_AllTiers_ShareUW     2610 non-null   float64
 21  UWHomes_ShareInTier1          2610 non-null   float64
 22  UWHomes_ShareInTier2          2610 non-null   float64
 23  UWHomes_ShareInTier3          2610 non-null   float64
dtypes: float64(19), int64(1), object(4)
memory usage: 489.5+ KB

Once of the most important steps in data cleaning is just looking at what we have. What are the variables? What are their types? How many unique values of each variable do we have? Any missings? Do we see anything unexpected?

This page has a nice summary.

We can select a column and change the variable type using .astype().

uw['UWHomes_Tier2'] = uw['UWHomes_Tier2'].astype('float64')

This code is actually taking the column, changing its type, and saving it to the DataFrame with the same column name. Note that we could changed the name of the column.

Counting, missings, and dupes#

When you first bring in some data, you often want to do some simple checks. How many observations do you have? Are you missing anything? Are their duplicate observations? How do we define duplicate observations?

You can’t do any data analysis until you understand the structure of your data. What do you have? Why do you have it? How do you want to use it?

Let’s start with counts. We can count the number of times a category appears in a variable. Note that this is only useful for variables that have an sense of category. You would do this for any of the home count or dollar value variables, for example.

We can select a categorical variable, like RegionType and then use .value_counts() from pandas.

uw['RegionType'].value_counts()
Zip       1247
City      1017
County     227
MSA         95
State       23
Nation       1
Name: RegionType, dtype: int64

We should look for missing values for each variable. isna() returns a TRUE or FALSE for each value, depending on whether or not it is NaN, or missing. We can then take those 1/0, true or false, values and add them up with sum().

uw.isna().sum()
RegionType                         0
RegionName                         0
StateName                          1
MSAName                         1539
AllHomes_Tier1                     0
AllHomes_Tier2                     0
AllHomes_Tier3                     0
AllHomes_AllTiers                  0
UWHomes_Tier1                      0
UWHomes_Tier2                      0
UWHomes_Tier3                      0
UWHomes_AllTiers                   0
UWHomes_TotalValue_Tier1           0
UWHomes_TotalValue_Tier2           0
UWHomes_TotalValue_Tier3           0
UWHomes_TotalValue_AllTiers        0
UWHomes_MedianValue_AllTiers       0
AllHomes_Tier1_ShareUW             0
AllHomes_Tier2_ShareUW             0
AllHomes_Tier3_ShareUW             0
AllHomes_AllTiers_ShareUW          0
UWHomes_ShareInTier1               0
UWHomes_ShareInTier2               0
UWHomes_ShareInTier3               0
dtype: int64

You can see the logic in how these functions work. We take our DataFrame uw and we send it to isna(). This actually creates another DataFrame, that we then pass to sum(). Check it out.

uw.isna()
RegionType RegionName StateName MSAName AllHomes_Tier1 AllHomes_Tier2 AllHomes_Tier3 AllHomes_AllTiers UWHomes_Tier1 UWHomes_Tier2 ... UWHomes_TotalValue_Tier3 UWHomes_TotalValue_AllTiers UWHomes_MedianValue_AllTiers AllHomes_Tier1_ShareUW AllHomes_Tier2_ShareUW AllHomes_Tier3_ShareUW AllHomes_AllTiers_ShareUW UWHomes_ShareInTier1 UWHomes_ShareInTier2 UWHomes_ShareInTier3
0 False False True True False False False False False False ... False False False False False False False False False False
1 False False False True False False False False False False ... False False False False False False False False False False
2 False False False True False False False False False False ... False False False False False False False False False False
3 False False False True False False False False False False ... False False False False False False False False False False
4 False False False True False False False False False False ... False False False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2605 False False False True False False False False False False ... False False False False False False False False False False
2606 False False False True False False False False False False ... False False False False False False False False False False
2607 False False False True False False False False False False ... False False False False False False False False False False
2608 False False False True False False False False False False ... False False False False False False False False False False
2609 False False False True False False False False False False ... False False False False False False False False False False

2610 rows × 24 columns

The same sort of logic applies to the method unique(). This one gives an array of each unique value in a column or set of columns. This will save an array of unique values that appear in the RegionName column.

unique_regions = uw.RegionName.unique()

I’m using the .ColumnName construction to pull the variable.

We can count the number of unique values for a variable.

uw.RegionName.nunique()
2496

We can filter on one variable and count another. We’re looking for unexpected things, just trying to get a sense for what we have.

uw[uw['RegionType'] == 'MSA'].MSAName.nunique()
95

This syntax works too! We are pulling the column name out using a list defined by the column names inside of [].

uw[uw['RegionType'] == 'MSA']['MSAName'].nunique()
95

We have 95 unique MSA is our data. Seems reasonable. MSAs are like city-regions.

Finally, we can just drop any rows with missing values using .dropna() from pandas. If we don’t specify a column, then it will drop a row if ANY value is missing. I don’t actually want to drop missings here, so I’m not going to save my work with a uw = .

You can read more here.

uw.dropna(axis = 0)
RegionType RegionName StateName MSAName AllHomes_Tier1 AllHomes_Tier2 AllHomes_Tier3 AllHomes_AllTiers UWHomes_Tier1 UWHomes_Tier2 ... UWHomes_TotalValue_Tier3 UWHomes_TotalValue_AllTiers UWHomes_MedianValue_AllTiers AllHomes_Tier1_ShareUW AllHomes_Tier2_ShareUW AllHomes_Tier3_ShareUW AllHomes_AllTiers_ShareUW UWHomes_ShareInTier1 UWHomes_ShareInTier2 UWHomes_ShareInTier3
251 MSA Aberdeen, WA Washington Aberdeen, WA 11606.0 11751.0 11749.0 35106.0 4119.0 2450.0 ... 1.734113e+08 7.791588e+08 88327.0 0.354903 0.208493 0.065537 0.209053 0.561248 0.333833 0.104919
252 MSA Astoria, OR Oregon Astoria, OR 4745.0 5228.0 5568.0 15541.0 811.0 376.0 ... 9.138231e+07 3.620604e+08 237912.0 0.170917 0.071920 0.040409 0.090856 0.574363 0.266289 0.159348
253 MSA Egg Harbor Township, NJ New Jersey Egg Harbor Township, NJ 34185.0 34117.0 34088.0 102390.0 9116.0 9575.0 ... 5.576646e+09 1.030631e+10 264441.0 0.266667 0.280652 0.252200 0.266510 0.334066 0.350887 0.315047
254 MSA Baltimore, MD Maryland Baltimore, MD 298318.0 312603.0 316502.0 927423.0 2293.0 2910.0 ... 3.698818e+09 5.153664e+09 336801.0 0.007686 0.009309 0.021861 0.013071 0.189160 0.240059 0.570780
255 MSA East Falmouth, MA Massachusetts East Falmouth, MA 45345.0 45358.0 45258.0 135961.0 1651.0 1395.0 ... 4.205784e+09 5.349099e+09 588650.0 0.036410 0.030755 0.073004 0.046705 0.260000 0.219685 0.520315
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1358 City Town of Salisbury Massachusetts Boston, MA 1207.0 1207.0 1207.0 3621.0 397.0 402.0 ... 1.484370e+08 3.900701e+08 342026.0 0.328915 0.333057 0.221210 0.294394 0.372420 0.377111 0.250469
1359 City Town of Duxbury Massachusetts Boston, MA 1835.0 1835.0 1837.0 5507.0 68.0 33.0 ... 1.909504e+08 2.389202e+08 749341.0 0.037057 0.017984 0.056614 0.037225 0.331707 0.160976 0.507317
1360 City New Shoreham Rhode Island Providence, RI 506.0 506.0 506.0 1518.0 57.0 25.0 ... 9.383493e+07 1.654886e+08 1072092.0 0.112648 0.049407 0.094862 0.085639 0.438462 0.192308 0.369231
1361 City Yarmouth Massachusetts East Falmouth, MA 5178.0 5178.0 5180.0 15536.0 472.0 184.0 ... 4.531706e+08 6.061486e+08 332409.0 0.091155 0.035535 0.102124 0.076274 0.398312 0.155274 0.446414
1362 City Town of Hingham Massachusetts Boston, MA 2462.0 2462.0 2462.0 7386.0 48.0 33.0 ... 8.266223e+07 1.309589e+08 731420.0 0.019496 0.013404 0.021121 0.018007 0.360902 0.248120 0.390977

1071 rows × 24 columns

We now have 1071 rows, since MSAName had so many missing values. But - that was OK given how this data are constructed. You’ll see another example with stock data below.

Now, I’ll just drop observations if AllHomes_Tier1 is missing.

uw.dropna(subset = ['AllHomes_Tier1'], axis = 0)
RegionType RegionName StateName MSAName AllHomes_Tier1 AllHomes_Tier2 AllHomes_Tier3 AllHomes_AllTiers UWHomes_Tier1 UWHomes_Tier2 ... UWHomes_TotalValue_Tier3 UWHomes_TotalValue_AllTiers UWHomes_MedianValue_AllTiers AllHomes_Tier1_ShareUW AllHomes_Tier2_ShareUW AllHomes_Tier3_ShareUW AllHomes_AllTiers_ShareUW UWHomes_ShareInTier1 UWHomes_ShareInTier2 UWHomes_ShareInTier3
0 Nation United States NaN NaN 35461549.0 35452941.0 35484532.0 106399022.0 594672.0 542681.0 ... 5.974490e+11 9.159540e+11 310306.0 0.016769 0.015307 0.020458 0.017512 0.319149 0.291246 0.389605
1 State Alabama Alabama NaN 546670.0 520247.0 491300.0 1558217.0 2890.0 2766.0 ... 2.583039e+09 3.578203e+09 270254.0 0.005287 0.005317 0.010271 0.006868 0.270043 0.258456 0.471501
2 State California California NaN 3060171.0 3076238.0 3169584.0 9305993.0 8090.0 14266.0 ... 2.666638e+10 4.449536e+10 860373.0 0.002644 0.004637 0.005611 0.004313 0.201550 0.355415 0.443035
3 State Connecticut Connecticut NaN 333904.0 334688.0 336254.0 1004846.0 4434.0 3807.0 ... 8.689480e+09 1.152170e+10 442036.0 0.013279 0.011375 0.021677 0.015455 0.285512 0.245138 0.469350
4 State Delaware Delaware NaN 102983.0 127573.0 127983.0 358539.0 4105.0 2574.0 ... 1.498013e+09 2.801847e+09 241344.5 0.039861 0.020177 0.025675 0.027793 0.411942 0.258304 0.329754
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2605 Zip 98592 Washington NaN 464.0 470.0 493.0 1427.0 8.0 29.0 ... 7.340711e+07 8.271541e+07 436490.5 0.017241 0.061702 0.273834 0.120533 0.046512 0.168605 0.784884
2606 Zip 98595 Washington NaN 558.0 571.0 598.0 1727.0 119.0 104.0 ... 1.376119e+07 4.036570e+07 131441.0 0.213262 0.182137 0.095318 0.162131 0.425000 0.371429 0.203571
2607 Zip 98612 Washington NaN 365.0 376.0 409.0 1150.0 91.0 79.0 ... 1.440674e+07 5.028400e+07 230495.0 0.249315 0.210106 0.092910 0.180870 0.437500 0.379808 0.182692
2608 Zip 32081 Florida NaN 1318.0 1328.0 1350.0 3996.0 91.0 42.0 ... 9.111212e+07 1.325055e+08 385196.0 0.069044 0.031627 0.097778 0.066316 0.343396 0.158491 0.498113
2609 Zip 33578 Florida NaN 5975.0 4537.0 3232.0 13744.0 8.0 6.0 ... 5.685202e+07 5.911551e+07 363355.0 0.001339 0.001322 0.038366 0.010041 0.057971 0.043478 0.898551

2610 rows × 24 columns

Nothing gets dropped, since there were no missing values.

Unique and missing values are important. So is the idea of duplicates. Does a row (an observation) contain the same values as another row? That could be all of the values across all of the variables, or just a particular column (e.g. ticker), or a set of columns (e.g. ticker and date).

We can see if our data has any duplicates by using .duplicated() from pandas.

You can read more here.

I’m going to filter my data to only include RegionType='Zip' and then look for duplicate Zip codes. The Zip codes are found in RegionName. I’ll save this to a new DataFrame called dupes. The argument keep='first will keep the first duplicate, if any.

I am then going to mask my subset of the uw data. So, I filter to get just Zip codes. Then, I mask on the dupes DataFrame. This DataFrame is just an array of True/False values. By masking, I’m telling Python to only keep values that are True. There are no True values, so no duplicate Zips and the resulting DataFrame is empty.

dupes = uw[uw['RegionType'] == 'Zip'].duplicated(subset=['RegionName'], keep='first')

uw[uw['RegionType'] == 'Zip'][dupes]
RegionType RegionName StateName MSAName AllHomes_Tier1 AllHomes_Tier2 AllHomes_Tier3 AllHomes_AllTiers UWHomes_Tier1 UWHomes_Tier2 ... UWHomes_TotalValue_Tier3 UWHomes_TotalValue_AllTiers UWHomes_MedianValue_AllTiers AllHomes_Tier1_ShareUW AllHomes_Tier2_ShareUW AllHomes_Tier3_ShareUW AllHomes_AllTiers_ShareUW UWHomes_ShareInTier1 UWHomes_ShareInTier2 UWHomes_ShareInTier3

0 rows × 24 columns

We can use drop_duplicates() to look at our data and drop rows that are duplicate observations. We use the subset= argument to tell it what columns to look at for duplicates. We can leave out this argument if we want to look across all columns.

The argument keep= will tell it which duplicate to keep in the data (e.g. the first one or the last one).

You can read more about the syntax here.

uw.drop_duplicates()
RegionType RegionName StateName MSAName AllHomes_Tier1 AllHomes_Tier2 AllHomes_Tier3 AllHomes_AllTiers UWHomes_Tier1 UWHomes_Tier2 ... UWHomes_TotalValue_Tier3 UWHomes_TotalValue_AllTiers UWHomes_MedianValue_AllTiers AllHomes_Tier1_ShareUW AllHomes_Tier2_ShareUW AllHomes_Tier3_ShareUW AllHomes_AllTiers_ShareUW UWHomes_ShareInTier1 UWHomes_ShareInTier2 UWHomes_ShareInTier3
0 Nation United States NaN NaN 35461549.0 35452941.0 35484532.0 106399022.0 594672.0 542681.0 ... 5.974490e+11 9.159540e+11 310306.0 0.016769 0.015307 0.020458 0.017512 0.319149 0.291246 0.389605
1 State Alabama Alabama NaN 546670.0 520247.0 491300.0 1558217.0 2890.0 2766.0 ... 2.583039e+09 3.578203e+09 270254.0 0.005287 0.005317 0.010271 0.006868 0.270043 0.258456 0.471501
2 State California California NaN 3060171.0 3076238.0 3169584.0 9305993.0 8090.0 14266.0 ... 2.666638e+10 4.449536e+10 860373.0 0.002644 0.004637 0.005611 0.004313 0.201550 0.355415 0.443035
3 State Connecticut Connecticut NaN 333904.0 334688.0 336254.0 1004846.0 4434.0 3807.0 ... 8.689480e+09 1.152170e+10 442036.0 0.013279 0.011375 0.021677 0.015455 0.285512 0.245138 0.469350
4 State Delaware Delaware NaN 102983.0 127573.0 127983.0 358539.0 4105.0 2574.0 ... 1.498013e+09 2.801847e+09 241344.5 0.039861 0.020177 0.025675 0.027793 0.411942 0.258304 0.329754
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2605 Zip 98592 Washington NaN 464.0 470.0 493.0 1427.0 8.0 29.0 ... 7.340711e+07 8.271541e+07 436490.5 0.017241 0.061702 0.273834 0.120533 0.046512 0.168605 0.784884
2606 Zip 98595 Washington NaN 558.0 571.0 598.0 1727.0 119.0 104.0 ... 1.376119e+07 4.036570e+07 131441.0 0.213262 0.182137 0.095318 0.162131 0.425000 0.371429 0.203571
2607 Zip 98612 Washington NaN 365.0 376.0 409.0 1150.0 91.0 79.0 ... 1.440674e+07 5.028400e+07 230495.0 0.249315 0.210106 0.092910 0.180870 0.437500 0.379808 0.182692
2608 Zip 32081 Florida NaN 1318.0 1328.0 1350.0 3996.0 91.0 42.0 ... 9.111212e+07 1.325055e+08 385196.0 0.069044 0.031627 0.097778 0.066316 0.343396 0.158491 0.498113
2609 Zip 33578 Florida NaN 5975.0 4537.0 3232.0 13744.0 8.0 6.0 ... 5.685202e+07 5.911551e+07 363355.0 0.001339 0.001322 0.038366 0.010041 0.057971 0.043478 0.898551

2610 rows × 24 columns

uw.drop_duplicates(subset=['RegionName', 'StateName'], keep='last')
RegionType RegionName StateName MSAName AllHomes_Tier1 AllHomes_Tier2 AllHomes_Tier3 AllHomes_AllTiers UWHomes_Tier1 UWHomes_Tier2 ... UWHomes_TotalValue_Tier3 UWHomes_TotalValue_AllTiers UWHomes_MedianValue_AllTiers AllHomes_Tier1_ShareUW AllHomes_Tier2_ShareUW AllHomes_Tier3_ShareUW AllHomes_AllTiers_ShareUW UWHomes_ShareInTier1 UWHomes_ShareInTier2 UWHomes_ShareInTier3
0 Nation United States NaN NaN 35461549.0 35452941.0 35484532.0 106399022.0 594672.0 542681.0 ... 5.974490e+11 9.159540e+11 310306.0 0.016769 0.015307 0.020458 0.017512 0.319149 0.291246 0.389605
1 State Alabama Alabama NaN 546670.0 520247.0 491300.0 1558217.0 2890.0 2766.0 ... 2.583039e+09 3.578203e+09 270254.0 0.005287 0.005317 0.010271 0.006868 0.270043 0.258456 0.471501
2 State California California NaN 3060171.0 3076238.0 3169584.0 9305993.0 8090.0 14266.0 ... 2.666638e+10 4.449536e+10 860373.0 0.002644 0.004637 0.005611 0.004313 0.201550 0.355415 0.443035
3 State Connecticut Connecticut NaN 333904.0 334688.0 336254.0 1004846.0 4434.0 3807.0 ... 8.689480e+09 1.152170e+10 442036.0 0.013279 0.011375 0.021677 0.015455 0.285512 0.245138 0.469350
4 State Delaware Delaware NaN 102983.0 127573.0 127983.0 358539.0 4105.0 2574.0 ... 1.498013e+09 2.801847e+09 241344.5 0.039861 0.020177 0.025675 0.027793 0.411942 0.258304 0.329754
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2605 Zip 98592 Washington NaN 464.0 470.0 493.0 1427.0 8.0 29.0 ... 7.340711e+07 8.271541e+07 436490.5 0.017241 0.061702 0.273834 0.120533 0.046512 0.168605 0.784884
2606 Zip 98595 Washington NaN 558.0 571.0 598.0 1727.0 119.0 104.0 ... 1.376119e+07 4.036570e+07 131441.0 0.213262 0.182137 0.095318 0.162131 0.425000 0.371429 0.203571
2607 Zip 98612 Washington NaN 365.0 376.0 409.0 1150.0 91.0 79.0 ... 1.440674e+07 5.028400e+07 230495.0 0.249315 0.210106 0.092910 0.180870 0.437500 0.379808 0.182692
2608 Zip 32081 Florida NaN 1318.0 1328.0 1350.0 3996.0 91.0 42.0 ... 9.111212e+07 1.325055e+08 385196.0 0.069044 0.031627 0.097778 0.066316 0.343396 0.158491 0.498113
2609 Zip 33578 Florida NaN 5975.0 4537.0 3232.0 13744.0 8.0 6.0 ... 5.685202e+07 5.911551e+07 363355.0 0.001339 0.001322 0.038366 0.010041 0.057971 0.043478 0.898551

2571 rows × 24 columns

You want to keep track of the number of observations (rows) when you’re dropping duplicates. In this case, we go from 2,610 rows to 2,571.

Sorting our data#

Another basic task that we often face is sorting our data. You can sort numeric or string data, by variable using .sort_values() from pandas.

The ascending= argument can be True or False and tells us how to sort. You can pass it a list of how to sort if you’re sorting on multiple variables.

I’l sort the data by RegionType first. Then, I’ll sort the data by RegionType, RegionName, and AllHomes_Tier1

You can read more here.

uw.sort_values(by = 'RegionType', ascending = True)
RegionType RegionName StateName MSAName AllHomes_Tier1 AllHomes_Tier2 AllHomes_Tier3 AllHomes_AllTiers UWHomes_Tier1 UWHomes_Tier2 ... UWHomes_TotalValue_Tier3 UWHomes_TotalValue_AllTiers UWHomes_MedianValue_AllTiers AllHomes_Tier1_ShareUW AllHomes_Tier2_ShareUW AllHomes_Tier3_ShareUW AllHomes_AllTiers_ShareUW UWHomes_ShareInTier1 UWHomes_ShareInTier2 UWHomes_ShareInTier3
1304 City Ortley Beach New Jersey New York, NY 767.0 763.0 766.0 2296.0 746.0 724.0 ... 447937059.0 879315914.0 348092.0 0.972621 0.948886 0.953003 0.958188 0.339091 0.329091 0.331818
896 City Smyrna Delaware Dover, DE 4464.0 4464.0 4464.0 13392.0 128.0 39.0 ... 9078029.0 29936443.0 110029.0 0.028674 0.008737 0.005376 0.014262 0.670157 0.204188 0.125654
895 City Smithfield Virginia Virginia Beach, VA 2251.0 2251.0 2252.0 6754.0 20.0 20.0 ... 68564829.0 76200890.0 379356.5 0.008885 0.008885 0.065719 0.027835 0.106383 0.106383 0.787234
894 City Secaucus New Jersey New York, NY 1586.0 1584.0 1586.0 4756.0 1106.0 537.0 ... 277046340.0 864994299.0 391859.0 0.697352 0.339015 0.288777 0.441758 0.526416 0.255593 0.217991
893 City Seaside Park New Jersey New York, NY 672.0 672.0 672.0 2016.0 443.0 538.0 ... 359297149.0 827539382.0 558632.0 0.659226 0.800595 0.595238 0.685020 0.320782 0.389573 0.289645
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1775 Zip 22307 Virginia NaN 1483.0 1097.0 1970.0 4550.0 181.0 215.0 ... 45692041.0 153788304.0 248578.0 0.122050 0.195989 0.031472 0.100659 0.395197 0.469432 0.135371
1774 Zip 21921 Maryland NaN 5270.0 5284.0 5403.0 15957.0 12.0 47.0 ... 41149066.0 53223597.0 316059.0 0.002277 0.008895 0.017583 0.009651 0.077922 0.305195 0.616883
1773 Zip 21903 Maryland NaN 784.0 851.0 856.0 2491.0 10.0 37.0 ... 21432532.0 30859486.0 260047.5 0.012755 0.043478 0.064252 0.040947 0.098039 0.362745 0.539216
1780 Zip 22578 Virginia NaN 458.0 448.0 462.0 1368.0 73.0 159.0 ... 118131703.0 184974026.0 452094.0 0.159389 0.354911 0.370130 0.294591 0.181141 0.394541 0.424318
2609 Zip 33578 Florida NaN 5975.0 4537.0 3232.0 13744.0 8.0 6.0 ... 56852018.0 59115506.0 363355.0 0.001339 0.001322 0.038366 0.010041 0.057971 0.043478 0.898551

2610 rows × 24 columns

uw.sort_values(by = ['RegionType', 'RegionName', 'AllHomes_Tier1'], ascending = [True, True, False])
RegionType RegionName StateName MSAName AllHomes_Tier1 AllHomes_Tier2 AllHomes_Tier3 AllHomes_AllTiers UWHomes_Tier1 UWHomes_Tier2 ... UWHomes_TotalValue_Tier3 UWHomes_TotalValue_AllTiers UWHomes_MedianValue_AllTiers AllHomes_Tier1_ShareUW AllHomes_Tier2_ShareUW AllHomes_Tier3_ShareUW AllHomes_AllTiers_ShareUW UWHomes_ShareInTier1 UWHomes_ShareInTier2 UWHomes_ShareInTier3
690 City Abbeville Louisiana Lafayette, LA 1995.0 1995.0 1997.0 5987.0 196.0 512.0 ... 30967445.0 100185101.0 103417.0 0.098246 0.256642 0.095643 0.150159 0.218020 0.569522 0.212458
811 City Aberdeen Washington Aberdeen, WA 2799.0 2799.0 2799.0 8397.0 2412.0 1145.0 ... 64053930.0 389836601.0 87694.0 0.861736 0.409075 0.091104 0.453972 0.632739 0.300367 0.066894
346 City Absecon New Jersey Egg Harbor Township, NJ 1097.0 1097.0 1097.0 3291.0 40.0 55.0 ... 18714933.0 33321640.0 180979.0 0.036463 0.050137 0.060164 0.048921 0.248447 0.341615 0.409938
939 City Aiea Hawaii Honolulu, HI 4209.0 4209.0 4211.0 12629.0 0.0 146.0 ... 35306798.0 137032636.0 708881.0 0.000000 0.034688 0.007837 0.014174 0.000000 0.815642 0.184358
1151 City Alva Florida Fort Myers, FL 589.0 589.0 590.0 1768.0 6.0 38.0 ... 59575578.0 69446593.0 369560.0 0.010187 0.064516 0.164407 0.079751 0.042553 0.269504 0.687943
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2603 Zip 98584 Washington NaN 4708.0 4694.0 4766.0 14168.0 8.0 41.0 ... 98174201.0 108265864.0 366002.0 0.001699 0.008735 0.047839 0.019551 0.028881 0.148014 0.823105
2604 Zip 98588 Washington NaN 469.0 479.0 479.0 1427.0 17.0 35.0 ... 51494123.0 60982770.0 351648.0 0.036247 0.073069 0.229645 0.113525 0.104938 0.216049 0.679012
2605 Zip 98592 Washington NaN 464.0 470.0 493.0 1427.0 8.0 29.0 ... 73407109.0 82715412.0 436490.5 0.017241 0.061702 0.273834 0.120533 0.046512 0.168605 0.784884
2606 Zip 98595 Washington NaN 558.0 571.0 598.0 1727.0 119.0 104.0 ... 13761194.0 40365702.0 131441.0 0.213262 0.182137 0.095318 0.162131 0.425000 0.371429 0.203571
2607 Zip 98612 Washington NaN 365.0 376.0 409.0 1150.0 91.0 79.0 ... 14406737.0 50284001.0 230495.0 0.249315 0.210106 0.092910 0.180870 0.437500 0.379808 0.182692

2610 rows × 24 columns

Renaming columns#

While cleaning our data, we might want to rename our columns, or variables. You can do this automatically with something like pyjanitor (see below). You can also do it when importing the data using .read_csv from pandas, if you know that you’ll want to change the name.

It is also easy enough to do using .rename from pandas. You give it a mapping of old name to new name. I’ll rename MSAName for fun.

Note the inplace=True argument. You’ll see this a lot for pandas methods. This makes it so that we don’t have to do a uw = uw.rename(). Instead, we save over the old DataFrame with the new one with the new column name.

You can read more here.

uw.rename(columns = {'MSAName':'msaname'}, inplace = True)

Dropping columns#

We can also simply drop a column of data. You could do this via data subsets (e.g. .iloc), but this method lets you use .drop() to specify the name of the column. It’s a little clearer doing things this way.

The axis=1 option means that you are dropping the column named. I’ve picked a column at random to drop. Notice that I’m saving my work back to the uw DataFrame.

uw = uw.drop('UWHomes_TotalValue_Tier3', axis=1)

Missing data example#

Finally, let’s try some more complicated code. I found this example here.

First, we will create a DataFrame that has the total number of missing values for each variable. We can sort the data using sort_values(). The ascending=False option will have the variable with the largest number of missings at the top.

total = uw.isna().sum().sort_values(ascending=False)

By the way, you’ll also see a isnull() method that does the same thing as isna().

Next, let’s create a DataFrame that that has the percentage of values that are missing for each variable. This is neat one - we are creating a DataFrame of values (total number missing) for the numerator and another DataFrame of values (total number) for the denominator. Then, we are dividing two DataFrames, giving us another DataFrame of the resulting division. We then sort.

percent = (uw.isnull().sum()/uw.isnull().count()).sort_values(ascending = False)

We can use a new function called concat from pandas that combines data, either as rows (stacking) or columns (combining). We’ll combine columns, with means concatenating along axis=1. We’ll name both columns. We can do this because each DataFrame has the same index created by pandas, all of our variable names. So, there’s a one-to-one correspondence between the two DataFrames.

missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Missing Percent'])

Let’s take the percents and multiply all of them by 100, just to make them look like percents. And to show you, again, that you can.

missing_data['Missing Percent'] = missing_data['Missing Percent'] * 100

For the last step, we can filter and just get the variable names where more than 10% of our data are missing.

missing_data[missing_data['Missing Percent'] > 10]
Total Missing Percent
msaname 1539 58.965517

Back to stocks#

We can bring back the stock data too, as that data has some missing values.

prices = pd.read_csv('https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/tr_eikon_eod_data.csv',
                      index_col = 0, parse_dates = True)

Why are there missing values? Holidays and weekends, when trading doesn’t take place.

prices.isna().sum()
AAPL.O    78
MSFT.O    78
INTC.O    78
AMZN.O    78
GS.N      78
SPY       78
.SPX      78
.VIX      78
EUR=       0
XAU=       5
GDX       78
GLD       78
dtype: int64

We can drop these rows. We’ll specify axis=0, or rows.

prices = prices.dropna(axis=0)
prices.isna().sum()
AAPL.O    0
MSFT.O    0
INTC.O    0
AMZN.O    0
GS.N      0
SPY       0
.SPX      0
.VIX      0
EUR=      0
XAU=      0
GDX       0
GLD       0
dtype: int64
prices.head(15)
AAPL.O MSFT.O INTC.O AMZN.O GS.N SPY .SPX .VIX EUR= XAU= GDX GLD
Date
2010-01-04 30.572827 30.950 20.880 133.900 173.08 113.33 1132.99 20.04 1.4411 1120.00 47.71 109.80
2010-01-05 30.625684 30.960 20.870 134.690 176.14 113.63 1136.52 19.35 1.4368 1118.65 48.17 109.70
2010-01-06 30.138541 30.770 20.800 132.250 174.26 113.71 1137.14 19.16 1.4412 1138.50 49.34 111.51
2010-01-07 30.082827 30.452 20.600 130.000 177.67 114.19 1141.69 19.06 1.4318 1131.90 49.10 110.82
2010-01-08 30.282827 30.660 20.830 133.520 174.31 114.57 1144.98 18.13 1.4412 1136.10 49.84 111.37
2010-01-11 30.015684 30.270 20.950 130.308 171.56 114.73 1146.98 17.55 1.4513 1152.60 50.17 112.85
2010-01-12 29.674256 30.070 20.608 127.350 167.82 113.66 1136.22 18.25 1.4494 1127.30 48.35 110.49
2010-01-13 30.092827 30.350 20.960 129.110 169.07 114.62 1145.68 17.85 1.4510 1138.40 48.86 111.54
2010-01-14 29.918542 30.960 21.480 127.350 168.53 114.93 1148.46 17.63 1.4502 1142.85 48.60 112.03
2010-01-15 29.418542 30.860 20.800 127.140 165.21 113.64 1136.03 17.91 1.4382 1129.90 47.42 110.86
2010-01-19 30.719969 31.100 21.050 127.610 166.86 115.06 1150.23 17.58 1.4298 1137.70 47.69 111.52
2010-01-20 30.246398 30.585 21.080 125.780 167.79 113.89 1138.04 18.68 1.4101 1111.30 45.73 108.94
2010-01-21 29.724542 30.010 20.840 126.620 160.87 111.70 1116.48 22.27 1.4090 1094.75 43.75 107.37
2010-01-22 28.249972 28.960 19.910 121.430 154.12 109.21 1091.76 27.31 1.4137 1092.60 43.79 107.17
2010-01-25 29.010685 29.320 20.320 120.310 154.98 109.77 1096.78 25.41 1.4150 1097.85 43.19 107.48

Pyjanitor#

We are going to look at a fun package that is based on something from the R statistical programming language, called pyjanitor.

To use this package, you’ll need to type the following in the terminal (Mac) or cmd terminal (Windows).

pip install pyjanitor

This will install pyjanitor using pip. Again, if you’re using Windows, make sure that your terminal is cmd and not Powershell. pip works in Powershell, but you have to change a system setting first.You’ll need to restart VS Code once you’ve installed it.

If you’re using Google Colab, there’s currently a bug in pyjanitor that requires you to install a previous version. You can do this in a Jupyter cell. Note that you need to use !pip when using a terminal command inside of a Jupyter notebook.

!pip install pyjanitor==0.23.1

There are even finance specific tools.

We can bring in janitor and treat it as part of pandas using the import function.

import janitor

pyjanitor lets us have an interesting workflow. We can read in our data set, remove columns, drop missings, and rename columns, all in one series of steps.

prices = (
    pd.read_csv('https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/tr_eikon_eod_data.csv',
                      index_col = 0, parse_dates = True)
    .remove_columns(['GLD'])
    .dropna()
    .rename_column('AAPL.O', 'AAPL')
    .rename_column('MSFT.O', 'MSFT')
)

There are also some built-in, general functions. clean_names() does what it says. For example, it sets all characters in a variable name to lower case.

prices = pd.read_csv('https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/tr_eikon_eod_data.csv',
                      index_col = 0, parse_dates = True)

prices = prices.clean_names()

Note the syntax - you start with the DataFrame and then apply the function to it.

You might see another syntax, though. You can import the functions directly and then include the DataFrame as an argument to that function. Both ways are detailed in the pyjanitor instructions.

from janitor import clean_names, remove_empty
 
prices = pd.read_csv('https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/tr_eikon_eod_data.csv',
                      index_col = 0, parse_dates = True)

prices = clean_names(prices) # See the difference?

The method flag_nulls creates a new variable that will have a 1 if any of the variables specified are missing. In this case, I didn’t specify anything, so it will look across all of the variables. If any variable is NaN, then that row gets a 1. Notice the any.

prices = pd.read_csv('https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/tr_eikon_eod_data.csv',
                      index_col = 0, parse_dates = True)

prices.flag_nulls()
AAPL.O MSFT.O INTC.O AMZN.O GS.N SPY .SPX .VIX EUR= XAU= GDX GLD null_flag
Date
2010-01-01 NaN NaN NaN NaN NaN NaN NaN NaN 1.4323 1096.35 NaN NaN 1
2010-01-04 30.572827 30.950 20.88 133.90 173.08 113.33 1132.99 20.04 1.4411 1120.00 47.71 109.80 0
2010-01-05 30.625684 30.960 20.87 134.69 176.14 113.63 1136.52 19.35 1.4368 1118.65 48.17 109.70 0
2010-01-06 30.138541 30.770 20.80 132.25 174.26 113.71 1137.14 19.16 1.4412 1138.50 49.34 111.51 0
2010-01-07 30.082827 30.452 20.60 130.00 177.67 114.19 1141.69 19.06 1.4318 1131.90 49.10 110.82 0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2018-06-25 182.170000 98.390 50.71 1663.15 221.54 271.00 2717.07 17.33 1.1702 1265.00 22.01 119.89 0
2018-06-26 184.430000 99.080 49.67 1691.09 221.58 271.60 2723.06 15.92 1.1645 1258.64 21.95 119.26 0
2018-06-27 184.160000 97.540 48.76 1660.51 220.18 269.35 2699.63 17.91 1.1552 1251.62 21.81 118.58 0
2018-06-28 185.500000 98.630 49.25 1701.45 223.42 270.89 2716.31 16.85 1.1567 1247.88 21.93 118.22 0
2018-06-29 185.110000 98.610 49.71 1699.80 220.57 271.28 2718.37 16.09 1.1683 1252.25 22.31 118.65 0

2216 rows × 13 columns

Finally, simple way to see if we have any rows of duplicate data. This will happen surprisingly (and unfortunately) often when we start merging data together.

prices.get_dupes()
AAPL.O MSFT.O INTC.O AMZN.O GS.N SPY .SPX .VIX EUR= XAU= GDX GLD
Date