4.3. pandas#

Now that we know how to get our data in, let’s look a little more at pandas and what it can do. To do this, let’s bring in some data from Zillow.

The DataFrame we will be working with contains data from Zillow.com, the housing web site, on the number of homes, the median value of homes, etc. that are at risk for flooding due to rising sea levels.

You can find many different Zillow data sets here. The “uw” here stands for “under water” in a very literal sense. I first thought they were talking about mortgages!

As mentioned, take a look at Chapter 5 of Python for Data Analysis for an introduction to pandas.

import numpy as np
import pandas as pd

uw = pd.read_csv('https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/zestimatesAndCutoffs_byGeo_uw_2017-10-10_forDataPage.csv')

We are going to use read_csv to get this data off of my GitHub page. pandas also comes with read_excel, read_json, read_html, and other functions for getting formatted data into Python.
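Just to show the pattern, here's a sketch of those other readers. The file names and URL are hypothetical – each function works much like read_csv and returns a DataFrame, except read_html, which returns a list of DataFrames, one per table it finds on the page.

df_excel = pd.read_excel('data/my_data.xlsx')      # hypothetical Excel file
df_json = pd.read_json('data/my_data.json')        # hypothetical JSON file
tables = pd.read_html('https://example.com/page')  # hypothetical page; returns a list of DataFrames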

4.3.1. Looking at our data#

Let’s peek at this data. You can do that below, or using the .info method. We’ll get variable names, the number of non-missing (non-null) observations, and the type of each variable. One numeric variable, UWHomes_Tier2, was read in as an integer, rather than a float. When you look at the data below, notice how that one column doesn’t have a .0 on any of its numbers. Integers can’t have decimal places.

uw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2610 entries, 0 to 2609
Data columns (total 24 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   RegionType                    2610 non-null   object 
 1   RegionName                    2610 non-null   object 
 2   StateName                     2609 non-null   object 
 3   MSAName                       1071 non-null   object 
 4   AllHomes_Tier1                2610 non-null   float64
 5   AllHomes_Tier2                2610 non-null   float64
 6   AllHomes_Tier3                2610 non-null   float64
 7   AllHomes_AllTiers             2610 non-null   float64
 8   UWHomes_Tier1                 2610 non-null   float64
 9   UWHomes_Tier2                 2610 non-null   int64  
 10  UWHomes_Tier3                 2610 non-null   float64
 11  UWHomes_AllTiers              2610 non-null   float64
 12  UWHomes_TotalValue_Tier1      2610 non-null   float64
 13  UWHomes_TotalValue_Tier2      2610 non-null   float64
 14  UWHomes_TotalValue_Tier3      2610 non-null   float64
 15  UWHomes_TotalValue_AllTiers   2610 non-null   float64
 16  UWHomes_MedianValue_AllTiers  2610 non-null   float64
 17  AllHomes_Tier1_ShareUW        2610 non-null   float64
 18  AllHomes_Tier2_ShareUW        2610 non-null   float64
 19  AllHomes_Tier3_ShareUW        2610 non-null   float64
 20  AllHomes_AllTiers_ShareUW     2610 non-null   float64
 21  UWHomes_ShareInTier1          2610 non-null   float64
 22  UWHomes_ShareInTier2          2610 non-null   float64
 23  UWHomes_ShareInTier3          2610 non-null   float64
dtypes: float64(19), int64(1), object(4)
memory usage: 489.5+ KB

Just to show you, I’m going to reimport the data, but force that one variable to come in as a float like the others. I don’t want anything unexpected to happen, so let’s get all numerics as floats. Notice how I do this – I give the column, or variable, name and the data type I want. I’m using the 64-bit float type that comes with the numpy package.

uw = pd.read_csv('https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/zestimatesAndCutoffs_byGeo_uw_2017-10-10_forDataPage.csv', dtype = {'UWHomes_Tier2':np.float64})
uw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2610 entries, 0 to 2609
Data columns (total 24 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   RegionType                    2610 non-null   object 
 1   RegionName                    2610 non-null   object 
 2   StateName                     2609 non-null   object 
 3   MSAName                       1071 non-null   object 
 4   AllHomes_Tier1                2610 non-null   float64
 5   AllHomes_Tier2                2610 non-null   float64
 6   AllHomes_Tier3                2610 non-null   float64
 7   AllHomes_AllTiers             2610 non-null   float64
 8   UWHomes_Tier1                 2610 non-null   float64
 9   UWHomes_Tier2                 2610 non-null   float64
 10  UWHomes_Tier3                 2610 non-null   float64
 11  UWHomes_AllTiers              2610 non-null   float64
 12  UWHomes_TotalValue_Tier1      2610 non-null   float64
 13  UWHomes_TotalValue_Tier2      2610 non-null   float64
 14  UWHomes_TotalValue_Tier3      2610 non-null   float64
 15  UWHomes_TotalValue_AllTiers   2610 non-null   float64
 16  UWHomes_MedianValue_AllTiers  2610 non-null   float64
 17  AllHomes_Tier1_ShareUW        2610 non-null   float64
 18  AllHomes_Tier2_ShareUW        2610 non-null   float64
 19  AllHomes_Tier3_ShareUW        2610 non-null   float64
 20  AllHomes_AllTiers_ShareUW     2610 non-null   float64
 21  UWHomes_ShareInTier1          2610 non-null   float64
 22  UWHomes_ShareInTier2          2610 non-null   float64
 23  UWHomes_ShareInTier3          2610 non-null   float64
dtypes: float64(20), object(4)
memory usage: 489.5+ KB

OK, that seems better. All of these functions that we’re using have many different options. The only way to learn about them is to play around, read help documents, and search online for solutions to our problems.
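For example, here's a sketch of a few common read_csv options. The options themselves (usecols, nrows, na_values) are real; this particular combination is just for illustration.

url = 'https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/zestimatesAndCutoffs_byGeo_uw_2017-10-10_forDataPage.csv'
uw_small = pd.read_csv(url,
                       usecols = ['RegionType', 'RegionName', 'UWHomes_Tier2'], # keep only these columns
                       nrows = 100,                                             # read only the first 100 rows
                       na_values = ['NA'])                                      # treat this string as missing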

There’s a concept in data analysis called tidy data. The characteristics of tidy data are:

  • Each variable forms a column.

  • Each observation forms a row.

  • Each type of observational unit forms a table.

Fig. 4.1 Tidy data.#

The first thing to note about this data is that there are several levels of aggregation. Some numbers are at the national level, some at the region, some at the state, and some at the MSA. Tidy data refers to one row, one observation, across different variables (columns) and, possibly, across time. When you have multiple observations of the same firm, state, person, etc. over time, we call that panel data. Panel data is tidy too.
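To make the tidy idea concrete, here's a minimal sketch with made-up numbers. The wide table has one column per year; melt reshapes it so that each row is one state-year observation – tidy, panel-style data.

wide = pd.DataFrame({'state': ['NC', 'SC'],
                     '2016': [100, 80],
                     '2017': [110, 85]})
tidy = wide.melt(id_vars = 'state', var_name = 'year', value_name = 'homes') # one row per state-year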

Note that some RegionTypes (e.g. Nation, State) are actually summaries of other rows. This is a bit confusing, and we need to be mindful of these different levels when using the data.
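A quick way to see these levels is to count the rows at each one. Just a sketch – we haven't covered value_counts yet, but it tallies how many times each value appears in a column.

uw['RegionType'].value_counts() # e.g. how many Nation, State, MSA, and Zip rows there are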

Before we get to that, just a few more methods to look at the data.

First, let’s look at the first five rows.

uw.head()
RegionType RegionName StateName MSAName AllHomes_Tier1 AllHomes_Tier2 AllHomes_Tier3 AllHomes_AllTiers UWHomes_Tier1 UWHomes_Tier2 ... UWHomes_TotalValue_Tier3 UWHomes_TotalValue_AllTiers UWHomes_MedianValue_AllTiers AllHomes_Tier1_ShareUW AllHomes_Tier2_ShareUW AllHomes_Tier3_ShareUW AllHomes_AllTiers_ShareUW UWHomes_ShareInTier1 UWHomes_ShareInTier2 UWHomes_ShareInTier3
0 Nation United States NaN NaN 35461549.0 35452941.0 35484532.0 106399022.0 594672.0 542681.0 ... 5.974490e+11 9.159540e+11 310306.0 0.016769 0.015307 0.020458 0.017512 0.319149 0.291246 0.389605
1 State Alabama Alabama NaN 546670.0 520247.0 491300.0 1558217.0 2890.0 2766.0 ... 2.583039e+09 3.578203e+09 270254.0 0.005287 0.005317 0.010271 0.006868 0.270043 0.258456 0.471501
2 State California California NaN 3060171.0 3076238.0 3169584.0 9305993.0 8090.0 14266.0 ... 2.666638e+10 4.449536e+10 860373.0 0.002644 0.004637 0.005611 0.004313 0.201550 0.355415 0.443035
3 State Connecticut Connecticut NaN 333904.0 334688.0 336254.0 1004846.0 4434.0 3807.0 ... 8.689480e+09 1.152170e+10 442036.0 0.013279 0.011375 0.021677 0.015455 0.285512 0.245138 0.469350
4 State Delaware Delaware NaN 102983.0 127573.0 127983.0 358539.0 4105.0 2574.0 ... 1.498013e+09 2.801847e+09 241344.5 0.039861 0.020177 0.025675 0.027793 0.411942 0.258304 0.329754

5 rows × 24 columns

NaN is how pandas shows missing values. So, that first row has RegionType == 'Nation' and RegionName == 'United States'. StateName and MSAName are undefined, since this row is all of the data added up. We usually don’t want to have summary statistics in the same DataFrame as the underlying data! What a mess!
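If we only wanted the underlying data, we could drop the summary rows. Here's a sketch using the filtering syntax covered below – keeping, say, just the zip-level rows.

uw_zip = uw[uw['RegionType'] == 'Zip'] # keep only Zip rows; drops the Nation/State/MSA summaries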

We can also look at the last five rows.

uw.tail()
RegionType RegionName StateName MSAName AllHomes_Tier1 AllHomes_Tier2 AllHomes_Tier3 AllHomes_AllTiers UWHomes_Tier1 UWHomes_Tier2 ... UWHomes_TotalValue_Tier3 UWHomes_TotalValue_AllTiers UWHomes_MedianValue_AllTiers AllHomes_Tier1_ShareUW AllHomes_Tier2_ShareUW AllHomes_Tier3_ShareUW AllHomes_AllTiers_ShareUW UWHomes_ShareInTier1 UWHomes_ShareInTier2 UWHomes_ShareInTier3
2605 Zip 98592 Washington NaN 464.0 470.0 493.0 1427.0 8.0 29.0 ... 73407109.0 82715412.0 436490.5 0.017241 0.061702 0.273834 0.120533 0.046512 0.168605 0.784884
2606 Zip 98595 Washington NaN 558.0 571.0 598.0 1727.0 119.0 104.0 ... 13761194.0 40365702.0 131441.0 0.213262 0.182137 0.095318 0.162131 0.425000 0.371429 0.203571
2607 Zip 98612 Washington NaN 365.0 376.0 409.0 1150.0 91.0 79.0 ... 14406737.0 50284001.0 230495.0 0.249315 0.210106 0.092910 0.180870 0.437500 0.379808 0.182692
2608 Zip 32081 Florida NaN 1318.0 1328.0 1350.0 3996.0 91.0 42.0 ... 91112117.0 132505542.0 385196.0 0.069044 0.031627 0.097778 0.066316 0.343396 0.158491 0.498113
2609 Zip 33578 Florida NaN 5975.0 4537.0 3232.0 13744.0 8.0 6.0 ... 56852018.0 59115506.0 363355.0 0.001339 0.001322 0.038366 0.010041 0.057971 0.043478 0.898551

5 rows × 24 columns

Looking at this data, I think it really has four identifiers that describe the numbers. The columns RegionType, RegionName, StateName, and MSAName together define unique rows of data. Again, note that these values can be missing. For example, row 2606 is Zip, 98595, Washington, but has no MSAName. Why? MSA stands for Metropolitan Statistical Area and covers multiple zip codes. This data is not tidy.
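We can check that claim. Here's a sketch using the duplicated method, which flags rows whose values in the given columns have already appeared; if none are flagged, those four columns do identify unique rows.

id_cols = ['RegionType', 'RegionName', 'StateName', 'MSAName']
uw.duplicated(subset = id_cols).any() # False means no two rows share all four values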

Let’s think about the other variable definitions a bit. Zillow places homes into three tiers based on the home’s estimated market value, with Tier3 being the highest value. AllHomes refers to all homes in that tier or across all tiers. UWHomes refers to homes that are at risk for flooding. Note that there are some variables that are counts, some that are dollar values, and others that are percentages.

Finally, when using a Jupyter notebook, you can just put the name of the DataFrame in a cell and run the cell. You’ll get a peek at the DataFrame too. If you’re using a Python script, not a Jupyter notebook, then you have to do things like use print() to get the data to display. The Jupyter notebook does this automatically for us.

uw
RegionType RegionName StateName MSAName AllHomes_Tier1 AllHomes_Tier2 AllHomes_Tier3 AllHomes_AllTiers UWHomes_Tier1 UWHomes_Tier2 ... UWHomes_TotalValue_Tier3 UWHomes_TotalValue_AllTiers UWHomes_MedianValue_AllTiers AllHomes_Tier1_ShareUW AllHomes_Tier2_ShareUW AllHomes_Tier3_ShareUW AllHomes_AllTiers_ShareUW UWHomes_ShareInTier1 UWHomes_ShareInTier2 UWHomes_ShareInTier3
0 Nation United States NaN NaN 35461549.0 35452941.0 35484532.0 106399022.0 594672.0 542681.0 ... 5.974490e+11 9.159540e+11 310306.0 0.016769 0.015307 0.020458 0.017512 0.319149 0.291246 0.389605
1 State Alabama Alabama NaN 546670.0 520247.0 491300.0 1558217.0 2890.0 2766.0 ... 2.583039e+09 3.578203e+09 270254.0 0.005287 0.005317 0.010271 0.006868 0.270043 0.258456 0.471501
2 State California California NaN 3060171.0 3076238.0 3169584.0 9305993.0 8090.0 14266.0 ... 2.666638e+10 4.449536e+10 860373.0 0.002644 0.004637 0.005611 0.004313 0.201550 0.355415 0.443035
3 State Connecticut Connecticut NaN 333904.0 334688.0 336254.0 1004846.0 4434.0 3807.0 ... 8.689480e+09 1.152170e+10 442036.0 0.013279 0.011375 0.021677 0.015455 0.285512 0.245138 0.469350
4 State Delaware Delaware NaN 102983.0 127573.0 127983.0 358539.0 4105.0 2574.0 ... 1.498013e+09 2.801847e+09 241344.5 0.039861 0.020177 0.025675 0.027793 0.411942 0.258304 0.329754
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2605 Zip 98592 Washington NaN 464.0 470.0 493.0 1427.0 8.0 29.0 ... 7.340711e+07 8.271541e+07 436490.5 0.017241 0.061702 0.273834 0.120533 0.046512 0.168605 0.784884
2606 Zip 98595 Washington NaN 558.0 571.0 598.0 1727.0 119.0 104.0 ... 1.376119e+07 4.036570e+07 131441.0 0.213262 0.182137 0.095318 0.162131 0.425000 0.371429 0.203571
2607 Zip 98612 Washington NaN 365.0 376.0 409.0 1150.0 91.0 79.0 ... 1.440674e+07 5.028400e+07 230495.0 0.249315 0.210106 0.092910 0.180870 0.437500 0.379808 0.182692
2608 Zip 32081 Florida NaN 1318.0 1328.0 1350.0 3996.0 91.0 42.0 ... 9.111212e+07 1.325055e+08 385196.0 0.069044 0.031627 0.097778 0.066316 0.343396 0.158491 0.498113
2609 Zip 33578 Florida NaN 5975.0 4537.0 3232.0 13744.0 8.0 6.0 ... 5.685202e+07 5.911551e+07 363355.0 0.001339 0.001322 0.038366 0.010041 0.057971 0.043478 0.898551

2610 rows × 24 columns

4.3.2. Saving our work#

You’re going to see syntax like:

df = df.some_stuff

This means that our work is getting assigned to something in memory. This is kind of like saving our work. In this case, I’m replacing my original DataFrame df with another DataFrame, also called df. I could have changed the name of the DataFrame.

You’ll also see the inplace = True argument in some functions. This means that the existing DataFrame gets overwritten by whatever you just did – no assignment needed.

Note

If you don’t assign your work to a name in memory, then the Jupyter notebook will show you a result, but it doesn’t get reflected in the DataFrame. So, if you try to use what you just did, Python won’t know about it!
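Here's a minimal sketch of the two styles, using sort_values, which supports both. The variable names are just for illustration.

uw_sorted = uw.sort_values('RegionName')     # result assigned to a new name; uw is unchanged
uw.sort_values('RegionName', inplace = True) # overwrites uw itself; nothing to assign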

4.3.3. Selecting our data#

You access columns in pandas like this, using [] and ''. A column is a variable in our data set.

uw['RegionType']
0       Nation
1        State
2        State
3        State
4        State
         ...  
2605       Zip
2606       Zip
2607       Zip
2608       Zip
2609       Zip
Name: RegionType, Length: 2610, dtype: object

But, wait, you can also do this! Only if the variable name doesn’t have spaces, though. The method with ['NAME'] is more general and can handle variable names with spaces, like ['Var Name'].

uw.RegionType
0       Nation
1        State
2        State
3        State
4        State
         ...  
2605       Zip
2606       Zip
2607       Zip
2608       Zip
2609       Zip
Name: RegionType, Length: 2610, dtype: object

Notice how as you’re typing, VS Code and Google Colab are trying to autocomplete for you? Helpful!

You can also do multiple columns. Use a , to separate the column names in []. There are now two pairs of [].

uw[['RegionType', 'RegionName', 'UWHomes_Tier2']]
RegionType RegionName UWHomes_Tier2
0 Nation United States 542681.0
1 State Alabama 2766.0
2 State California 14266.0
3 State Connecticut 3807.0
4 State Delaware 2574.0
... ... ... ...
2605 Zip 98592 29.0
2606 Zip 98595 104.0
2607 Zip 98612 79.0
2608 Zip 32081 42.0
2609 Zip 33578 6.0

2610 rows × 3 columns

You can save the new DataFrame with just those three columns if you want to use it elsewhere.

uw_only_three_columns = uw[['RegionType', 'RegionName', 'UWHomes_Tier2']]
uw_only_three_columns
RegionType RegionName UWHomes_Tier2
0 Nation United States 542681.0
1 State Alabama 2766.0
2 State California 14266.0
3 State Connecticut 3807.0
4 State Delaware 2574.0
... ... ... ...
2605 Zip 98592 29.0
2606 Zip 98595 104.0
2607 Zip 98612 79.0
2608 Zip 32081 42.0
2609 Zip 33578 6.0

2610 rows × 3 columns

We are using a list - the column names are in a list defined using []. This is why we had to go through lists earlier - they show up a lot.
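Since the selection is just a list, we can also store it in a variable first – a small sketch:

my_cols = ['RegionType', 'RegionName', 'UWHomes_Tier2'] # a plain Python list of column names
uw_only_three_columns = uw[my_cols]                     # same result as above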

You can also pull out just certain rows. We used head and tail above to do this in an “automated” way.

uw[0:10]
RegionType RegionName StateName MSAName AllHomes_Tier1 AllHomes_Tier2 AllHomes_Tier3 AllHomes_AllTiers UWHomes_Tier1 UWHomes_Tier2 ... UWHomes_TotalValue_Tier3 UWHomes_TotalValue_AllTiers UWHomes_MedianValue_AllTiers AllHomes_Tier1_ShareUW AllHomes_Tier2_ShareUW AllHomes_Tier3_ShareUW AllHomes_AllTiers_ShareUW UWHomes_ShareInTier1 UWHomes_ShareInTier2 UWHomes_ShareInTier3
0 Nation United States NaN NaN 35461549.0 35452941.0 35484532.0 106399022.0 594672.0 542681.0 ... 5.974490e+11 9.159540e+11 310306.0 0.016769 0.015307 0.020458 0.017512 0.319149 0.291246 0.389605
1 State Alabama Alabama NaN 546670.0 520247.0 491300.0 1558217.0 2890.0 2766.0 ... 2.583039e+09 3.578203e+09 270254.0 0.005287 0.005317 0.010271 0.006868 0.270043 0.258456 0.471501
2 State California California NaN 3060171.0 3076238.0 3169584.0 9305993.0 8090.0 14266.0 ... 2.666638e+10 4.449536e+10 860373.0 0.002644 0.004637 0.005611 0.004313 0.201550 0.355415 0.443035
3 State Connecticut Connecticut NaN 333904.0 334688.0 336254.0 1004846.0 4434.0 3807.0 ... 8.689480e+09 1.152170e+10 442036.0 0.013279 0.011375 0.021677 0.015455 0.285512 0.245138 0.469350
4 State Delaware Delaware NaN 102983.0 127573.0 127983.0 358539.0 4105.0 2574.0 ... 1.498013e+09 2.801847e+09 241344.5 0.039861 0.020177 0.025675 0.027793 0.411942 0.258304 0.329754
5 State Florida Florida NaN 2391491.0 2435575.0 2456637.0 7283703.0 313090.0 263276.0 ... 2.782440e+11 4.163480e+11 276690.0 0.130918 0.108096 0.135070 0.124687 0.344743 0.289893 0.365364
6 State Georgia Georgia NaN 1003890.0 986416.0 920843.0 2911149.0 5433.0 6904.0 ... 6.102691e+09 9.565510e+09 295405.0 0.005412 0.006999 0.010676 0.007615 0.245083 0.311440 0.443477
7 State Hawaii Hawaii NaN 132718.0 132760.0 139397.0 404875.0 16716.0 12536.0 ... 1.134778e+10 2.516682e+10 500666.0 0.125951 0.094426 0.051658 0.090035 0.458563 0.343895 0.197542
8 State Louisiana Louisiana NaN 439606.0 424665.0 417773.0 1282044.0 25678.0 23280.0 ... 6.755054e+09 1.309180e+10 149888.0 0.058411 0.054820 0.060408 0.057872 0.346088 0.313768 0.340144
9 State Massachusetts Massachusetts NaN 610378.0 611776.0 611920.0 1834074.0 16830.0 16491.0 ... 3.436795e+10 4.988792e+10 582126.0 0.027573 0.026956 0.040749 0.031763 0.288897 0.283078 0.428025

10 rows × 24 columns

Again, notice how Python starts counting from 0. That 0, 1, 2… shows the current index for this DataFrame. We could change this and will later.
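As a preview, here's a sketch of set_index, which swaps that default 0, 1, 2… index for row labels built from one or more columns. More on this later.

uw_indexed = uw.set_index(['RegionType', 'RegionName']) # rows now labelled by these two columns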

4.3.3.1. Using loc#

pandas has two location methods: loc and iloc. They are going to let us select parts of our data. The main difference: loc is label-based and iloc is integer-location based.

Since our columns have labels (our rows do not, beyond the default integer index), we can use loc to specify columns by name. If we created an index for this data, then our rows would have labels too. We’ll do that later. Any index for this data will be complicated, since no one column uniquely identifies a row!

You can read more about the two methods here.

We’ll start with loc. Again, our rows don’t have labels beyond the default integer index, so we can use the number of the row, even with loc. We can pull just the first row of data.

uw.loc[0]
RegionType                              Nation
RegionName                       United States
StateName                                  NaN
MSAName                                    NaN
AllHomes_Tier1                      35461549.0
AllHomes_Tier2                      35452941.0
AllHomes_Tier3                      35484532.0
AllHomes_AllTiers                  106399022.0
UWHomes_Tier1                         594672.0
UWHomes_Tier2                         542681.0
UWHomes_Tier3                         725955.0
UWHomes_AllTiers                     1863308.0
UWHomes_TotalValue_Tier1        122992000000.0
UWHomes_TotalValue_Tier2        195513000000.0
UWHomes_TotalValue_Tier3        597449000000.0
UWHomes_TotalValue_AllTiers     915954000000.0
UWHomes_MedianValue_AllTiers          310306.0
AllHomes_Tier1_ShareUW                0.016769
AllHomes_Tier2_ShareUW                0.015307
AllHomes_Tier3_ShareUW                0.020458
AllHomes_AllTiers_ShareUW             0.017512
UWHomes_ShareInTier1                  0.319149
UWHomes_ShareInTier2                  0.291246
UWHomes_ShareInTier3                  0.389605
Name: 0, dtype: object

That returned a Series with just the first row and all of the columns. You can make the output in a Jupyter notebook look a little nicer by turning the Series into a DataFrame using .to_frame(). Again, this is the first row and all of the column headers, just turned sideways.

uw.loc[0].to_frame()
0
RegionType Nation
RegionName United States
StateName NaN
MSAName NaN
AllHomes_Tier1 35461549.0
AllHomes_Tier2 35452941.0
AllHomes_Tier3 35484532.0
AllHomes_AllTiers 106399022.0
UWHomes_Tier1 594672.0
UWHomes_Tier2 542681.0
UWHomes_Tier3 725955.0
UWHomes_AllTiers 1863308.0
UWHomes_TotalValue_Tier1 122992000000.0
UWHomes_TotalValue_Tier2 195513000000.0
UWHomes_TotalValue_Tier3 597449000000.0
UWHomes_TotalValue_AllTiers 915954000000.0
UWHomes_MedianValue_AllTiers 310306.0
AllHomes_Tier1_ShareUW 0.016769
AllHomes_Tier2_ShareUW 0.015307
AllHomes_Tier3_ShareUW 0.020458
AllHomes_AllTiers_ShareUW 0.017512
UWHomes_ShareInTier1 0.319149
UWHomes_ShareInTier2 0.291246
UWHomes_ShareInTier3 0.389605

Things look more like what you’d expect if you pull multiple rows. This also means that we get a DataFrame and not just a Series.

We can pull the first ten rows and just three columns. But, now we are going to use the column names. And - notice how 0:9 actually gets us all ten rows, 0 through 9! Because we are using loc, we include the last item in the range.

uw.loc[0:9,['RegionType', 'RegionName', 'UWHomes_Tier2']]
RegionType RegionName UWHomes_Tier2
0 Nation United States 542681.0
1 State Alabama 2766.0
2 State California 14266.0
3 State Connecticut 3807.0
4 State Delaware 2574.0
5 State Florida 263276.0
6 State Georgia 6904.0
7 State Hawaii 12536.0
8 State Louisiana 23280.0
9 State Massachusetts 16491.0

We could pull all of the columns and rows 0:9 like this.

uw.loc[0:9,:]
RegionType RegionName StateName MSAName AllHomes_Tier1 AllHomes_Tier2 AllHomes_Tier3 AllHomes_AllTiers UWHomes_Tier1 UWHomes_Tier2 ... UWHomes_TotalValue_Tier3 UWHomes_TotalValue_AllTiers UWHomes_MedianValue_AllTiers AllHomes_Tier1_ShareUW AllHomes_Tier2_ShareUW AllHomes_Tier3_ShareUW AllHomes_AllTiers_ShareUW UWHomes_ShareInTier1 UWHomes_ShareInTier2 UWHomes_ShareInTier3
0 Nation United States NaN NaN 35461549.0 35452941.0 35484532.0 106399022.0 594672.0 542681.0 ... 5.974490e+11 9.159540e+11 310306.0 0.016769 0.015307 0.020458 0.017512 0.319149 0.291246 0.389605
1 State Alabama Alabama NaN 546670.0 520247.0 491300.0 1558217.0 2890.0 2766.0 ... 2.583039e+09 3.578203e+09 270254.0 0.005287 0.005317 0.010271 0.006868 0.270043 0.258456 0.471501
2 State California California NaN 3060171.0 3076238.0 3169584.0 9305993.0 8090.0 14266.0 ... 2.666638e+10 4.449536e+10 860373.0 0.002644 0.004637 0.005611 0.004313 0.201550 0.355415 0.443035
3 State Connecticut Connecticut NaN 333904.0 334688.0 336254.0 1004846.0 4434.0 3807.0 ... 8.689480e+09 1.152170e+10 442036.0 0.013279 0.011375 0.021677 0.015455 0.285512 0.245138 0.469350
4 State Delaware Delaware NaN 102983.0 127573.0 127983.0 358539.0 4105.0 2574.0 ... 1.498013e+09 2.801847e+09 241344.5 0.039861 0.020177 0.025675 0.027793 0.411942 0.258304 0.329754
5 State Florida Florida NaN 2391491.0 2435575.0 2456637.0 7283703.0 313090.0 263276.0 ... 2.782440e+11 4.163480e+11 276690.0 0.130918 0.108096 0.135070 0.124687 0.344743 0.289893 0.365364
6 State Georgia Georgia NaN 1003890.0 986416.0 920843.0 2911149.0 5433.0 6904.0 ... 6.102691e+09 9.565510e+09 295405.0 0.005412 0.006999 0.010676 0.007615 0.245083 0.311440 0.443477
7 State Hawaii Hawaii NaN 132718.0 132760.0 139397.0 404875.0 16716.0 12536.0 ... 1.134778e+10 2.516682e+10 500666.0 0.125951 0.094426 0.051658 0.090035 0.458563 0.343895 0.197542
8 State Louisiana Louisiana NaN 439606.0 424665.0 417773.0 1282044.0 25678.0 23280.0 ... 6.755054e+09 1.309180e+10 149888.0 0.058411 0.054820 0.060408 0.057872 0.346088 0.313768 0.340144
9 State Massachusetts Massachusetts NaN 610378.0 611776.0 611920.0 1834074.0 16830.0 16491.0 ... 3.436795e+10 4.988792e+10 582126.0 0.027573 0.026956 0.040749 0.031763 0.288897 0.283078 0.428025

10 rows × 24 columns

We can pull a range for columns too. Again, notice that the column name range is inclusive of the last column, unlike with slicing by element number.

uw.loc[0:9,'RegionType':'UWHomes_Tier2']
RegionType RegionName StateName MSAName AllHomes_Tier1 AllHomes_Tier2 AllHomes_Tier3 AllHomes_AllTiers UWHomes_Tier1 UWHomes_Tier2
0 Nation United States NaN NaN 35461549.0 35452941.0 35484532.0 106399022.0 594672.0 542681.0
1 State Alabama Alabama NaN 546670.0 520247.0 491300.0 1558217.0 2890.0 2766.0
2 State California California NaN 3060171.0 3076238.0 3169584.0 9305993.0 8090.0 14266.0
3 State Connecticut Connecticut NaN 333904.0 334688.0 336254.0 1004846.0 4434.0 3807.0
4 State Delaware Delaware NaN 102983.0 127573.0 127983.0 358539.0 4105.0 2574.0
5 State Florida Florida NaN 2391491.0 2435575.0 2456637.0 7283703.0 313090.0 263276.0
6 State Georgia Georgia NaN 1003890.0 986416.0 920843.0 2911149.0 5433.0 6904.0
7 State Hawaii Hawaii NaN 132718.0 132760.0 139397.0 404875.0 16716.0 12536.0
8 State Louisiana Louisiana NaN 439606.0 424665.0 417773.0 1282044.0 25678.0 23280.0
9 State Massachusetts Massachusetts NaN 610378.0 611776.0 611920.0 1834074.0 16830.0 16491.0

4.3.3.2. Using iloc#

Now, let’s try iloc. I’ll pull the exact same data as above. The numbers used are going to work like slicing. We want to use 0:10 for the rows, because we want the first 10 rows. Same for the columns. Inclusive of the first number, but up to and excluding the second.

uw.iloc[0:10,0:10]
RegionType RegionName StateName MSAName AllHomes_Tier1 AllHomes_Tier2 AllHomes_Tier3 AllHomes_AllTiers UWHomes_Tier1 UWHomes_Tier2
0 Nation United States NaN NaN 35461549.0 35452941.0 35484532.0 106399022.0 594672.0 542681.0
1 State Alabama Alabama NaN 546670.0 520247.0 491300.0 1558217.0 2890.0 2766.0
2 State California California NaN 3060171.0 3076238.0 3169584.0 9305993.0 8090.0 14266.0
3 State Connecticut Connecticut NaN 333904.0 334688.0 336254.0 1004846.0 4434.0 3807.0
4 State Delaware Delaware NaN 102983.0 127573.0 127983.0 358539.0 4105.0 2574.0
5 State Florida Florida NaN 2391491.0 2435575.0 2456637.0 7283703.0 313090.0 263276.0
6 State Georgia Georgia NaN 1003890.0 986416.0 920843.0 2911149.0 5433.0 6904.0
7 State Hawaii Hawaii NaN 132718.0 132760.0 139397.0 404875.0 16716.0 12536.0
8 State Louisiana Louisiana NaN 439606.0 424665.0 417773.0 1282044.0 25678.0 23280.0
9 State Massachusetts Massachusetts NaN 610378.0 611776.0 611920.0 1834074.0 16830.0 16491.0

We can also select columns in various locations, like above.

uw.iloc[0:10,[0, 1, 9]]
RegionType RegionName UWHomes_Tier2
0 Nation United States 542681.0
1 State Alabama 2766.0
2 State California 14266.0
3 State Connecticut 3807.0
4 State Delaware 2574.0
5 State Florida 263276.0
6 State Georgia 6904.0
7 State Hawaii 12536.0
8 State Louisiana 23280.0
9 State Massachusetts 16491.0

4.3.3.3. Comparing four selection methods#

You can use these methods to select and save columns to a new DataFrame. Each of the following does the same thing.

uw_sub1 = uw[['RegionType', 'RegionName', 'UWHomes_Tier2']]
uw_sub2 = uw.loc[:,['RegionType', 'RegionName', 'UWHomes_Tier2']]
uw_sub3 = uw.iloc[:,[0, 1, 9]]
uw_sub4 = pd.DataFrame(uw, columns = ('RegionType', 'RegionName', 'UWHomes_Tier2'))

print(uw_sub1.equals(uw_sub2))
print(uw_sub2.equals(uw_sub3))
print(uw_sub3.equals(uw_sub4))
True
True
True

I can use the .equals() method to test whether each DataFrame is equal to the next.

Again, each way is correct. We are going to see the first way – passing a list of variable names – used in the next section, where we filter our data based on different criteria.

4.3.3.4. Filtering#

Let’s try filtering our data and keeping just the MSA observations. This is sometimes called complex selection.

Here’s some help on filtering data with pandas.

Our first example simply selects the rows where the variable/column 'RegionType' is equal to the string 'MSA'.

uw[uw['RegionType'] == 'MSA'].head(15)
RegionType RegionName StateName MSAName AllHomes_Tier1 AllHomes_Tier2 AllHomes_Tier3 AllHomes_AllTiers UWHomes_Tier1 UWHomes_Tier2 ... UWHomes_TotalValue_Tier3 UWHomes_TotalValue_AllTiers UWHomes_MedianValue_AllTiers AllHomes_Tier1_ShareUW AllHomes_Tier2_ShareUW AllHomes_Tier3_ShareUW AllHomes_AllTiers_ShareUW UWHomes_ShareInTier1 UWHomes_ShareInTier2 UWHomes_ShareInTier3
251 MSA Aberdeen, WA Washington Aberdeen, WA 11606.0 11751.0 11749.0 35106.0 4119.0 2450.0 ... 1.734113e+08 7.791588e+08 88327.0 0.354903 0.208493 0.065537 0.209053 0.561248 0.333833 0.104919
252 MSA Astoria, OR Oregon Astoria, OR 4745.0 5228.0 5568.0 15541.0 811.0 376.0 ... 9.138231e+07 3.620604e+08 237912.0 0.170917 0.071920 0.040409 0.090856 0.574363 0.266289 0.159348
253 MSA Egg Harbor Township, NJ New Jersey Egg Harbor Township, NJ 34185.0 34117.0 34088.0 102390.0 9116.0 9575.0 ... 5.576646e+09 1.030631e+10 264441.0 0.266667 0.280652 0.252200 0.266510 0.334066 0.350887 0.315047
254 MSA Baltimore, MD Maryland Baltimore, MD 298318.0 312603.0 316502.0 927423.0 2293.0 2910.0 ... 3.698818e+09 5.153664e+09 336801.0 0.007686 0.009309 0.021861 0.013071 0.189160 0.240059 0.570780
255 MSA East Falmouth, MA Massachusetts East Falmouth, MA 45345.0 45358.0 45258.0 135961.0 1651.0 1395.0 ... 4.205784e+09 5.349099e+09 588650.0 0.036410 0.030755 0.073004 0.046705 0.260000 0.219685 0.520315
256 MSA Baton Rouge, LA Louisiana Baton Rouge, LA 78815.0 79581.0 78731.0 237127.0 2214.0 1904.0 ... 5.392566e+08 1.219648e+09 180953.5 0.028091 0.023925 0.024895 0.025632 0.364265 0.313261 0.322474
257 MSA Bay City, TX Texas Bay City, TX 4161.0 4252.0 4030.0 12443.0 195.0 351.0 ... 1.175738e+08 1.964820e+08 179753.5 0.046864 0.082549 0.107692 0.078759 0.198980 0.358163 0.442857
258 MSA Beaumont, TX Texas Beaumont, TX 49842.0 47829.0 44616.0 142287.0 1182.0 1043.0 ... 1.466850e+08 3.249066e+08 83413.0 0.023715 0.021807 0.018312 0.021379 0.388560 0.342867 0.268573
259 MSA Bellingham, WA Washington Bellingham, WA 29239.0 30413.0 30956.0 90608.0 2033.0 812.0 ... 7.000724e+08 1.184436e+09 211919.0 0.069530 0.026699 0.040283 0.045162 0.496823 0.198436 0.304741
260 MSA Boston, MA Massachusetts Boston, MA 409592.0 409966.0 410140.0 1229698.0 15948.0 15210.0 ... 2.851453e+10 4.266696e+10 570168.0 0.038936 0.037101 0.052509 0.042851 0.302653 0.288648 0.408699
261 MSA Bremerton, WA Washington Bremerton, WA 27865.0 27856.0 28119.0 83840.0 206.0 245.0 ... 2.076948e+09 2.275970e+09 708642.0 0.007393 0.008795 0.074896 0.030499 0.080563 0.095815 0.823621
262 MSA Bridgeport, CT Connecticut Bridgeport, CT 84535.0 84952.0 85028.0 254515.0 2326.0 1895.0 ... 6.041231e+09 7.812512e+09 593282.5 0.027515 0.022307 0.040869 0.030238 0.302235 0.246232 0.451533
263 MSA Brownsville, TX Texas Brownsville, TX 47084.0 37642.0 35503.0 120229.0 1621.0 1430.0 ... 5.327020e+08 1.036937e+09 171088.5 0.034428 0.037989 0.034448 0.035549 0.379270 0.334581 0.286149
264 MSA Brunswick, GA Georgia Brunswick, GA 14827.0 14885.0 15532.0 45244.0 2532.0 2924.0 ... 3.395779e+09 5.249447e+09 291962.5 0.170770 0.196439 0.300670 0.223809 0.250049 0.288762 0.461189
265 MSA Cambridge, MD Maryland Cambridge, MD 6193.0 6251.0 6118.0 18562.0 757.0 1119.0 ... 7.392582e+08 1.007092e+09 208306.0 0.122235 0.179011 0.350605 0.216625 0.188262 0.278289 0.533449

15 rows × 24 columns

Pay attention to the syntax. You are selecting a column from the DataFrame and asking to keep only the rows where that column (variable) is equal to ‘MSA’. I’m then displaying just the first 15 rows. And, I’m not saving this filtered data set anywhere – I’m just looking at it.

How about just North Carolina MSAs? I’ll again use the bigger DataFrame and just show the observations, rather than create a new DataFrame.

And, I’m now using compound conditions joined with an &. This boolean operator means and – both conditions need to be true.

You can also use | for “or” – see the sketch at the end of this section.

uw[(uw['RegionType'] == 'MSA') & (uw['StateName'] == 'North Carolina')].head(15)
RegionType RegionName StateName MSAName AllHomes_Tier1 AllHomes_Tier2 AllHomes_Tier3 AllHomes_AllTiers UWHomes_Tier1 UWHomes_Tier2 ... UWHomes_TotalValue_Tier3 UWHomes_TotalValue_AllTiers UWHomes_MedianValue_AllTiers AllHomes_Tier1_ShareUW AllHomes_Tier2_ShareUW AllHomes_Tier3_ShareUW AllHomes_AllTiers_ShareUW UWHomes_ShareInTier1 UWHomes_ShareInTier2 UWHomes_ShareInTier3
274 MSA Elizabeth City, NC North Carolina Elizabeth City, NC 5887.0 5882.0 5868.0 17637.0 2198.0 1364.0 ... 4.784827e+08 8.833370e+08 138246.0 0.373365 0.231894 0.283401 0.296252 0.420670 0.261053 0.318278
277 MSA Greenville, NC North Carolina Greenville, NC 22042.0 19844.0 18836.0 60722.0 71.0 34.0 ... 6.786239e+06 1.776378e+07 99179.5 0.003221 0.001713 0.001221 0.002108 0.554688 0.265625 0.179688
288 MSA Jacksonville, NC North Carolina Jacksonville, NC 17623.0 19413.0 17989.0 55025.0 792.0 1123.0 ... 6.370269e+08 1.029168e+09 264070.5 0.044941 0.057848 0.083329 0.062045 0.231986 0.328940 0.439074
292 MSA Kill Devil Hills, NC North Carolina Kill Devil Hills, NC 3194.0 3213.0 3486.0 9893.0 1461.0 1584.0 ... 9.128899e+08 1.650730e+09 280409.0 0.457420 0.492997 0.542456 0.498939 0.295989 0.320908 0.383104
297 MSA Morehead City, NC North Carolina Morehead City, NC 13555.0 13843.0 12927.0 40325.0 3195.0 3308.0 ... 2.316937e+09 3.449861e+09 244691.0 0.235706 0.238966 0.359712 0.276578 0.286470 0.296602 0.416928
302 MSA New Bern, NC North Carolina New Bern, NC 13637.0 13409.0 12440.0 39486.0 1364.0 1138.0 ... 5.442292e+08 8.553945e+08 163643.0 0.100022 0.084868 0.132235 0.105025 0.328912 0.274415 0.396672
334 MSA Washington, NC North Carolina Washington, NC 1289.0 1069.0 1043.0 3401.0 255.0 201.0 ... 5.513269e+07 9.990351e+07 116902.5 0.197828 0.188026 0.182167 0.189944 0.394737 0.311146 0.294118
336 MSA Wilmington, NC North Carolina Wilmington, NC 37100.0 40034.0 37475.0 114609.0 3367.0 3339.0 ... 4.265842e+09 6.761845e+09 383067.0 0.090755 0.083404 0.127578 0.100228 0.293114 0.290676 0.416210

8 rows × 24 columns

Notice how I needed to put () around each condition. Python is testing to see whether each of these is True or False and then filtering our data accordingly. This is also called masking. Here’s an example.

in_nc = (uw['RegionType'] == 'MSA') & (uw['StateName'] == 'North Carolina')
uw[in_nc].head(15)
RegionType RegionName StateName MSAName AllHomes_Tier1 AllHomes_Tier2 AllHomes_Tier3 AllHomes_AllTiers UWHomes_Tier1 UWHomes_Tier2 ... UWHomes_TotalValue_Tier3 UWHomes_TotalValue_AllTiers UWHomes_MedianValue_AllTiers AllHomes_Tier1_ShareUW AllHomes_Tier2_ShareUW AllHomes_Tier3_ShareUW AllHomes_AllTiers_ShareUW UWHomes_ShareInTier1 UWHomes_ShareInTier2 UWHomes_ShareInTier3
274 MSA Elizabeth City, NC North Carolina Elizabeth City, NC 5887.0 5882.0 5868.0 17637.0 2198.0 1364.0 ... 4.784827e+08 8.833370e+08 138246.0 0.373365 0.231894 0.283401 0.296252 0.420670 0.261053 0.318278
277 MSA Greenville, NC North Carolina Greenville, NC 22042.0 19844.0 18836.0 60722.0 71.0 34.0 ... 6.786239e+06 1.776378e+07 99179.5 0.003221 0.001713 0.001221 0.002108 0.554688 0.265625 0.179688
288 MSA Jacksonville, NC North Carolina Jacksonville, NC 17623.0 19413.0 17989.0 55025.0 792.0 1123.0 ... 6.370269e+08 1.029168e+09 264070.5 0.044941 0.057848 0.083329 0.062045 0.231986 0.328940 0.439074
292 MSA Kill Devil Hills, NC North Carolina Kill Devil Hills, NC 3194.0 3213.0 3486.0 9893.0 1461.0 1584.0 ... 9.128899e+08 1.650730e+09 280409.0 0.457420 0.492997 0.542456 0.498939 0.295989 0.320908 0.383104
297 MSA Morehead City, NC North Carolina Morehead City, NC 13555.0 13843.0 12927.0 40325.0 3195.0 3308.0 ... 2.316937e+09 3.449861e+09 244691.0 0.235706 0.238966 0.359712 0.276578 0.286470 0.296602 0.416928
302 MSA New Bern, NC North Carolina New Bern, NC 13637.0 13409.0 12440.0 39486.0 1364.0 1138.0 ... 5.442292e+08 8.553945e+08 163643.0 0.100022 0.084868 0.132235 0.105025 0.328912 0.274415 0.396672
334 MSA Washington, NC North Carolina Washington, NC 1289.0 1069.0 1043.0 3401.0 255.0 201.0 ... 5.513269e+07 9.990351e+07 116902.5 0.197828 0.188026 0.182167 0.189944 0.394737 0.311146 0.294118
336 MSA Wilmington, NC North Carolina Wilmington, NC 37100.0 40034.0 37475.0 114609.0 3367.0 3339.0 ... 4.265842e+09 6.761845e+09 383067.0 0.090755 0.083404 0.127578 0.100228 0.293114 0.290676 0.416210

8 rows × 24 columns

The Series in_nc holds True/False values for whether or not an observation meets the criteria. I can then pass that Series to the DataFrame uw, which will keep only the observations marked True. We are masking the big DataFrame, selecting just the observations that have a True associated with them, based on the criteria specified. Essentially, this is just another way of filtering our data.
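One nice side effect: since in_nc is just a True/False Series, we can flip it with ~ ("not") to get every row that fails the criteria. A quick sketch:

uw[~in_nc].head() # everything that is *not* a North Carolina MSA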

I can also use .loc to filter and then select certain columns.

uw.loc[(uw['RegionType'] == 'MSA') & (uw['StateName'] == 'North Carolina'), ['RegionType', 'RegionName', 'MSAName', 'AllHomes_Tier1']].head(15)
RegionType RegionName MSAName AllHomes_Tier1
274 MSA Elizabeth City, NC Elizabeth City, NC 5887.0
277 MSA Greenville, NC Greenville, NC 22042.0
288 MSA Jacksonville, NC Jacksonville, NC 17623.0
292 MSA Kill Devil Hills, NC Kill Devil Hills, NC 3194.0
297 MSA Morehead City, NC Morehead City, NC 13555.0
302 MSA New Bern, NC New Bern, NC 13637.0
334 MSA Washington, NC Washington, NC 1289.0
336 MSA Wilmington, NC Wilmington, NC 37100.0

As with the mask above, we can also define our criteria separately and then pass them in.

my_filter = (uw['RegionType'] == 'MSA') & (uw['StateName'] == 'North Carolina')
my_columns = ['RegionType', 'RegionName', 'MSAName', 'AllHomes_Tier1']
uw.loc[my_filter, my_columns].head(15)
RegionType RegionName MSAName AllHomes_Tier1
274 MSA Elizabeth City, NC Elizabeth City, NC 5887.0
277 MSA Greenville, NC Greenville, NC 22042.0
288 MSA Jacksonville, NC Jacksonville, NC 17623.0
292 MSA Kill Devil Hills, NC Kill Devil Hills, NC 3194.0
297 MSA Morehead City, NC Morehead City, NC 13555.0
302 MSA New Bern, NC New Bern, NC 13637.0
334 MSA Washington, NC Washington, NC 1289.0
336 MSA Wilmington, NC Wilmington, NC 37100.0
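Before moving on, here's the sketch of | ("or") promised above, with states chosen just for illustration. The .isin() method is a tidier way to write an "or" across several values of one column.

uw[(uw['StateName'] == 'Florida') | (uw['StateName'] == 'Georgia')].head() # either state works
uw[uw['StateName'].isin(['Florida', 'Georgia'])].head()                    # same idea, less typing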

4.3.4. First look at an index#

Let’s go back to the stock data and work with indices a bit more. Remember how that one had the Date as the index?

prices = pd.read_csv('https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/tr_eikon_eod_data.csv',
                      index_col = 0, parse_dates = True)

prices.index
DatetimeIndex(['2010-01-01', '2010-01-04', '2010-01-05', '2010-01-06',
               '2010-01-07', '2010-01-08', '2010-01-11', '2010-01-12',
               '2010-01-13', '2010-01-14',
               ...
               '2018-06-18', '2018-06-19', '2018-06-20', '2018-06-21',
               '2018-06-22', '2018-06-25', '2018-06-26', '2018-06-27',
               '2018-06-28', '2018-06-29'],
              dtype='datetime64[ns]', name='Date', length=2216, freq=None)

Now, let’s try loc, but using our datetime index (note the dtype above). We’ll pull in the SPX and VIX for just January 2010.

prices.loc['2010-01-01':'2010-01-31',['.SPX', '.VIX']]
.SPX .VIX
Date
2010-01-01 NaN NaN
2010-01-04 1132.99 20.04
2010-01-05 1136.52 19.35
2010-01-06 1137.14 19.16
2010-01-07 1141.69 19.06
2010-01-08 1144.98 18.13
2010-01-11 1146.98 17.55
2010-01-12 1136.22 18.25
2010-01-13 1145.68 17.85
2010-01-14 1148.46 17.63
2010-01-15 1136.03 17.91
2010-01-18 NaN NaN
2010-01-19 1150.23 17.58
2010-01-20 1138.04 18.68
2010-01-21 1116.48 22.27
2010-01-22 1091.76 27.31
2010-01-25 1096.78 25.41
2010-01-26 1092.17 24.55
2010-01-27 1097.50 23.14
2010-01-28 1084.53 23.73
2010-01-29 1073.87 24.62

You can even see in the output above how Date is in a different row than .SPX and .VIX. The date is not a variable anymore. It is an index.
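Because the index holds datetimes, pandas also supports partial string selection with loc. A sketch, assuming the same columns – this should grab all of January 2010 without spelling out both endpoints.

prices.loc['2010-01', ['.SPX', '.VIX']] # all rows whose date falls in January 2010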

This is a much simpler example than the Zillow data, which had multiple levels of data (e.g. MSA vs. Region vs. Zip), but no dates.

I do a more complicated multi-index, or multi-level, example when looking at reshaping our data.