4.3. pandas#

Now that we know how to get our data in, let’s look a little more at pandas and what it can do. To do this, let’s bring in some data from Zillow.

The DataFrame we will be working with contains data from Zillow.com, the housing web site, on the number of homes, the median value of homes, etc. that are at risk for flooding due to rising sea levels.

You can find many different Zillow data sets here. The “uw” here stands for “under water” in a very literal sense. I first thought they were talking about mortgages!

As mentioned, take a look at Chapter 5 of Python for Data Analysis for an introduction to pandas.

import numpy as np
import pandas as pd

uw = pd.read_csv('https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/zestimatesAndCutoffs_byGeo_uw_2017-10-10_forDataPage.csv')

We are going to use read_csv to get this data off of my GitHub page. pandas also comes with read_excel, read_json, read_html, and other functions for getting formatted data into Python.
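Just to show the pattern, here's a sketch of those other readers. The file names and URL are hypothetical – each function works much like read_csv and returns a DataFrame, except read_html, which returns a list of DataFrames, one per table it finds on the page.

df_excel = pd.read_excel('data/my_data.xlsx')      # hypothetical Excel file
df_json = pd.read_json('data/my_data.json')        # hypothetical JSON file
tables = pd.read_html('https://example.com/page')  # hypothetical page; returns a list of DataFrames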

4.3.1. Looking at our data#

Let’s peek at this data. You can do that below, or using the .info method. We’ll get variable names, the number of non-missing (non-null) observations, and the type of each variable. One numeric variable, UWHomes_Tier2, was read in as an integer, rather than a float. When you look at the data below, notice how that one column doesn’t have a .0 on any of its numbers. Integers can’t have decimal places.

uw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2610 entries, 0 to 2609
Data columns (total 24 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   RegionType                    2610 non-null   object 
 1   RegionName                    2610 non-null   object 
 2   StateName                     2609 non-null   object 
 3   MSAName                       1071 non-null   object 
 4   AllHomes_Tier1                2610 non-null   float64
 5   AllHomes_Tier2                2610 non-null   float64
 6   AllHomes_Tier3                2610 non-null   float64
 7   AllHomes_AllTiers             2610 non-null   float64
 8   UWHomes_Tier1                 2610 non-null   float64
 9   UWHomes_Tier2                 2610 non-null   int64  
 10  UWHomes_Tier3                 2610 non-null   float64
 11  UWHomes_AllTiers              2610 non-null   float64
 12  UWHomes_TotalValue_Tier1      2610 non-null   float64
 13  UWHomes_TotalValue_Tier2      2610 non-null   float64
 14  UWHomes_TotalValue_Tier3      2610 non-null   float64
 15  UWHomes_TotalValue_AllTiers   2610 non-null   float64
 16  UWHomes_MedianValue_AllTiers  2610 non-null   float64
 17  AllHomes_Tier1_ShareUW        2610 non-null   float64
 18  AllHomes_Tier2_ShareUW        2610 non-null   float64
 19  AllHomes_Tier3_ShareUW        2610 non-null   float64
 20  AllHomes_AllTiers_ShareUW     2610 non-null   float64
 21  UWHomes_ShareInTier1          2610 non-null   float64
 22  UWHomes_ShareInTier2          2610 non-null   float64
 23  UWHomes_ShareInTier3          2610 non-null   float64
dtypes: float64(19), int64(1), object(4)
memory usage: 489.5+ KB

Just to show you, I’m going to reimport the data, but force that one variable to come in as a float like the others. I don’t want anything unexpected to happen, so let’s get all numerics as floats. Notice how I do this – I give the column, or variable, name and the data type I want. I’m using the 64-bit float type that comes with the numpy package.

uw = pd.read_csv('https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/zestimatesAndCutoffs_byGeo_uw_2017-10-10_forDataPage.csv', dtype = {'UWHomes_Tier2':np.float64})
uw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2610 entries, 0 to 2609
Data columns (total 24 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   RegionType                    2610 non-null   object 
 1   RegionName                    2610 non-null   object 
 2   StateName                     2609 non-null   object 
 3   MSAName                       1071 non-null   object 
 4   AllHomes_Tier1                2610 non-null   float64
 5   AllHomes_Tier2                2610 non-null   float64
 6   AllHomes_Tier3                2610 non-null   float64
 7   AllHomes_AllTiers             2610 non-null   float64
 8   UWHomes_Tier1                 2610 non-null   float64
 9   UWHomes_Tier2                 2610 non-null   float64
 10  UWHomes_Tier3                 2610 non-null   float64
 11  UWHomes_AllTiers              2610 non-null   float64
 12  UWHomes_TotalValue_Tier1      2610 non-null   float64
 13  UWHomes_TotalValue_Tier2      2610 non-null   float64
 14  UWHomes_TotalValue_Tier3      2610 non-null   float64
 15  UWHomes_TotalValue_AllTiers   2610 non-null   float64
 16  UWHomes_MedianValue_AllTiers  2610 non-null   float64
 17  AllHomes_Tier1_ShareUW        2610 non-null   float64
 18  AllHomes_Tier2_ShareUW        2610 non-null   float64
 19  AllHomes_Tier3_ShareUW        2610 non-null   float64
 20  AllHomes_AllTiers_ShareUW     2610 non-null   float64
 21  UWHomes_ShareInTier1          2610 non-null   float64
 22  UWHomes_ShareInTier2          2610 non-null   float64
 23  UWHomes_ShareInTier3          2610 non-null   float64
dtypes: float64(20), object(4)
memory usage: 489.5+ KB

OK, that seems better. All of these functions that we’re using have many different options. The only way to learn about them is to play around, read help documents, and search online for solutions to our problems.
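For example, here's a sketch of a few common read_csv options. The options themselves (usecols, nrows, na_values) are real; this particular combination is just for illustration.

url = 'https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/zestimatesAndCutoffs_byGeo_uw_2017-10-10_forDataPage.csv'
uw_small = pd.read_csv(url,
                       usecols = ['RegionType', 'RegionName', 'UWHomes_Tier2'], # keep only these columns
                       nrows = 100,                                             # read only the first 100 rows
                       na_values = ['NA'])                                      # treat this string as missing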

There’s a concept in data analysis called tidy data. The characteristics of tidy data are:

  • Each variable forms a column.

  • Each observation forms a row.

  • Each type of observational unit forms a table.

Fig. 4.1 Tidy data.#

The first thing to note about this data is that there are several levels of aggregation. Some numbers are at the national level, some at the region, some at the state, and some at the MSA. Tidy data refers to one row, one observation, across different variables (columns) and, possibly, across time. When you have multiple observations of the same firm, state, person, etc. over time, we call that panel data. Panel data is tidy too.
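To make the tidy idea concrete, here's a minimal sketch with made-up numbers. The wide table has one column per year; melt reshapes it so that each row is one state-year observation – tidy, panel-style data.

wide = pd.DataFrame({'state': ['NC', 'SC'],
                     '2016': [100, 80],
                     '2017': [110, 85]})
tidy = wide.melt(id_vars = 'state', var_name = 'year', value_name = 'homes') # one row per state-year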

Note that some RegionTypes (e.g. Nation, State) are actually summaries of other rows. This is a bit confusing, and we need to be mindful of these different levels when using the data.
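A quick way to see these levels is to count the rows at each one. Just a sketch – we haven't covered value_counts yet, but it tallies how many times each value appears in a column.

uw['RegionType'].value_counts() # e.g. how many Nation, State, MSA, and Zip rows there are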

Before we get to that, just a few more methods to look at the data.

First, let’s look at the first five rows.

uw.head()
RegionType RegionName StateName MSAName AllHomes_Tier1 AllHomes_Tier2 AllHomes_Tier3 AllHomes_AllTiers UWHomes_Tier1 UWHomes_Tier2 ... UWHomes_TotalValue_Tier3 UWHomes_TotalValue_AllTiers UWHomes_MedianValue_AllTiers AllHomes_Tier1_ShareUW AllHomes_Tier2_ShareUW AllHomes_Tier3_ShareUW AllHomes_AllTiers_ShareUW UWHomes_ShareInTier1 UWHomes_ShareInTier2 UWHomes_ShareInTier3
0 Nation United States NaN NaN 35461549.0 35452941.0 35484532.0 106399022.0 594672.0 542681.0 ... 5.974490e+11 9.159540e+11 310306.0 0.016769 0.015307 0.020458 0.017512 0.319149 0.291246 0.389605
1 State Alabama Alabama NaN 546670.0 520247.0 491300.0 1558217.0 2890.0 2766.0 ... 2.583039e+09 3.578203e+09 270254.0 0.005287 0.005317 0.010271 0.006868 0.270043 0.258456 0.471501
2 State California California NaN 3060171.0 3076238.0 3169584.0 9305993.0 8090.0 14266.0 ... 2.666638e+10 4.449536e+10 860373.0 0.002644 0.004637 0.005611 0.004313 0.201550 0.355415 0.443035
3 State Connecticut Connecticut NaN 333904.0 334688.0 336254.0 1004846.0 4434.0 3807.0 ... 8.689480e+09 1.152170e+10 442036.0 0.013279 0.011375 0.021677 0.015455 0.285512 0.245138 0.469350
4 State Delaware Delaware NaN 102983.0 127573.0 127983.0 358539.0 4105.0 2574.0 ... 1.498013e+09 2.801847e+09 241344.5 0.039861 0.020177 0.025675 0.027793 0.411942 0.258304 0.329754

5 rows × 24 columns

NaN is how pandas shows missing values. So, that first row has RegionType == 'Nation' and RegionName == 'United States'. StateName and MSAName are undefined, since this row is all of the data added up. We usually don’t want to have summary statistics in the same DataFrame as the underlying data! What a mess!
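If we only wanted the underlying data, we could drop the summary rows. Here's a sketch using the filtering syntax covered below – keeping, say, just the zip-level rows.

uw_zip = uw[uw['RegionType'] == 'Zip'] # keep only Zip rows; drops the Nation/State/MSA summaries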

We can also look at the last five rows.

uw.tail()
RegionType RegionName StateName MSAName AllHomes_Tier1 AllHomes_Tier2 AllHomes_Tier3 AllHomes_AllTiers UWHomes_Tier1 UWHomes_Tier2 ... UWHomes_TotalValue_Tier3 UWHomes_TotalValue_AllTiers UWHomes_MedianValue_AllTiers AllHomes_Tier1_ShareUW AllHomes_Tier2_ShareUW AllHomes_Tier3_ShareUW AllHomes_AllTiers_ShareUW UWHomes_ShareInTier1 UWHomes_ShareInTier2 UWHomes_ShareInTier3
2605 Zip 98592 Washington NaN 464.0 470.0 493.0 1427.0 8.0 29.0 ... 73407109.0 82715412.0 436490.5 0.017241 0.061702 0.273834 0.120533 0.046512 0.168605 0.784884
2606 Zip 98595 Washington NaN 558.0 571.0 598.0 1727.0 119.0 104.0 ... 13761194.0 40365702.0 131441.0 0.213262 0.182137 0.095318 0.162131 0.425000 0.371429 0.203571
2607 Zip 98612 Washington NaN 365.0 376.0 409.0 1150.0 91.0 79.0 ... 14406737.0 50284001.0 230495.0 0.249315 0.210106 0.092910 0.180870 0.437500 0.379808 0.182692
2608 Zip 32081 Florida NaN 1318.0 1328.0 1350.0 3996.0 91.0 42.0 ... 91112117.0 132505542.0 385196.0 0.069044 0.031627 0.097778 0.066316 0.343396 0.158491 0.498113
2609 Zip 33578 Florida NaN 5975.0 4537.0 3232.0 13744.0 8.0 6.0 ... 56852018.0 59115506.0 363355.0 0.001339 0.001322 0.038366 0.010041 0.057971 0.043478 0.898551

5 rows × 24 columns

Looking at this data, I think it really has four identifiers that describe the numbers. The columns RegionType, RegionName, StateName, and MSAName together define unique rows of data. Again, note that these values can be missing. For example, row 2606 is Zip, 98595, Washington, but has no MSAName. Why? MSA stands for Metropolitan Statistical Area and covers multiple zip codes. This data is not tidy.
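We can check that claim. Here's a sketch using the duplicated method, which flags rows whose values in the given columns have already appeared; if none are flagged, those four columns do identify unique rows.

id_cols = ['RegionType', 'RegionName', 'StateName', 'MSAName']
uw.duplicated(subset = id_cols).any() # False means no two rows share all four values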

Let’s think about the other variable definitions a bit. Zillow places homes into three tiers based on the home’s estimated market value, with Tier3 being the highest value. AllHomes refers to all homes in that tier or across all tiers. UWHomes refers to homes that are at risk for flooding. Note that there are some variables that are counts, some that are dollar values, and others that are percentages.

Finally, when using a Jupyter notebook, you can just put the name of the DataFrame in a cell and run the cell. You’ll get a peek at the DataFrame too. If you’re using a Python script, not a Jupyter notebook, then you have to do things like use print() to get the data to display. The Jupyter notebook does this automatically for us.

uw
RegionType RegionName StateName MSAName AllHomes_Tier1 AllHomes_Tier2 AllHomes_Tier3 AllHomes_AllTiers UWHomes_Tier1 UWHomes_Tier2 ... UWHomes_TotalValue_Tier3 UWHomes_TotalValue_AllTiers UWHomes_MedianValue_AllTiers AllHomes_Tier1_ShareUW AllHomes_Tier2_ShareUW AllHomes_Tier3_ShareUW AllHomes_AllTiers_ShareUW UWHomes_ShareInTier1 UWHomes_ShareInTier2 UWHomes_ShareInTier3
0 Nation United States NaN NaN 35461549.0 35452941.0 35484532.0 106399022.0 594672.0 542681.0 ... 5.974490e+11 9.159540e+11 310306.0 0.016769 0.015307 0.020458 0.017512 0.319149 0.291246 0.389605
1 State Alabama Alabama NaN 546670.0 520247.0 491300.0 1558217.0 2890.0 2766.0 ... 2.583039e+09 3.578203e+09 270254.0 0.005287 0.005317 0.010271 0.006868 0.270043 0.258456 0.471501
2 State California California NaN 3060171.0 3076238.0 3169584.0 9305993.0 8090.0 14266.0 ... 2.666638e+10 4.449536e+10 860373.0 0.002644 0.004637 0.005611 0.004313 0.201550 0.355415 0.443035
3 State Connecticut Connecticut NaN 333904.0 334688.0 336254.0 1004846.0 4434.0 3807.0 ... 8.689480e+09 1.152170e+10 442036.0 0.013279 0.011375 0.021677 0.015455 0.285512 0.245138 0.469350
4 State Delaware Delaware NaN 102983.0 127573.0 127983.0 358539.0 4105.0 2574.0 ... 1.498013e+09 2.801847e+09 241344.5 0.039861 0.020177 0.025675 0.027793 0.411942 0.258304 0.329754
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2605 Zip 98592 Washington NaN 464.0 470.0 493.0 1427.0 8.0 29.0 ... 7.340711e+07 8.271541e+07 436490.5 0.017241 0.061702 0.273834 0.120533 0.046512 0.168605 0.784884
2606 Zip 98595 Washington NaN 558.0 571.0 598.0 1727.0 119.0 104.0 ... 1.376119e+07 4.036570e+07 131441.0 0.213262 0.182137 0.095318 0.162131 0.425000 0.371429 0.203571
2607 Zip 98612 Washington NaN 365.0 376.0 409.0 1150.0 91.0 79.0 ... 1.440674e+07 5.028400e+07 230495.0 0.249315 0.210106 0.092910 0.180870 0.437500 0.379808 0.182692
2608 Zip 32081 Florida NaN 1318.0 1328.0 1350.0 3996.0 91.0 42.0 ... 9.111212e+07 1.325055e+08 385196.0 0.069044 0.031627 0.097778 0.066316 0.343396 0.158491 0.498113
2609 Zip 33578 Florida NaN 5975.0 4537.0 3232.0 13744.0 8.0 6.0 ... 5.685202e+07 5.911551e+07 363355.0 0.001339 0.001322 0.038366 0.010041 0.057971 0.043478 0.898551

2610 rows × 24 columns

4.3.2. Saving our work#

You’re going to see syntax like:

df = df.some_stuff

This means that our work is getting assigned to something in memory. This is kind of like saving our work. In this case, I’m replacing my original DataFrame df with another DataFrame, also called df. I could have changed the name of the DataFrame.

You’ll also see the inplace = True argument in some functions. This means that the existing DataFrame gets overwritten by whatever you just did – no assignment needed.

Note

If you don’t assign your work to a name in memory, then the Jupyter notebook will show you a result, but it doesn’t get reflected in the DataFrame. So, if you try to use what you just did, Python won’t know about it!
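Here's a minimal sketch of the two styles, using sort_values, which supports both. The variable names are just for illustration.

uw_sorted = uw.sort_values('RegionName')     # result assigned to a new name; uw is unchanged
uw.sort_values('RegionName', inplace = True) # overwrites uw itself; nothing to assign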

4.3.3. Selecting our data#

You access columns in pandas like this, using [] and ''. A column is a variable in our data set.

uw['RegionType']
0       Nation
1        State
2        State
3        State
4        State
         ...  
2605       Zip
2606       Zip
2607       Zip
2608       Zip
2609       Zip
Name: RegionType, Length: 2610, dtype: object

But, wait, you can also do this! Only if the variable name doesn’t have spaces, though. The method with ['NAME'] is more general and can handle variable names with spaces, like ['Var Name'].

uw.RegionType
0       Nation
1        State
2        State
3        State
4        State
         ...  
2605       Zip
2606       Zip
2607       Zip
2608       Zip
2609       Zip
Name: RegionType, Length: 2610, dtype: object

Notice how as you’re typing, VS Code and Google Colab are trying to autocomplete for you? Helpful!

You can also do multiple columns. Use a , to separate the column names in []. There are now two pairs of [].

uw[['RegionType', 'RegionName', 'UWHomes_Tier2']]
RegionType RegionName UWHomes_Tier2
0 Nation United States 542681.0
1 State Alabama 2766.0
2 State California 14266.0
3 State Connecticut 3807.0
4 State Delaware 2574.0
... ... ... ...
2605 Zip 98592 29.0
2606 Zip 98595 104.0
2607 Zip 98612 79.0
2608 Zip 32081 42.0
2609 Zip 33578 6.0

2610 rows × 3 columns

You can save the new DataFrame with just those three columns if you want to use it elsewhere.

uw_only_three_columns = uw[['RegionType', 'RegionName', 'UWHomes_Tier2']]
uw_only_three_columns
RegionType RegionName UWHomes_Tier2
0 Nation United States 542681.0
1 State Alabama 2766.0
2 State California 14266.0
3 State Connecticut 3807.0
4 State Delaware 2574.0
... ... ... ...
2605 Zip 98592 29.0
2606 Zip 98595 104.0
2607 Zip 98612 79.0
2608 Zip 32081 42.0
2609 Zip 33578 6.0

2610 rows × 3 columns

We are using a list - the column names are in a list defined using []. This is why we had to go through lists earlier - they show up a lot.
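Since the selection is just a list, we can also store it in a variable first – a small sketch:

my_cols = ['RegionType', 'RegionName', 'UWHomes_Tier2'] # a plain Python list of column names
uw_only_three_columns = uw[my_cols]                     # same result as above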

You can also pull out just certain rows. We used head and tail above to do this in an “automated” way.

uw[0:10]
RegionType RegionName StateName MSAName AllHomes_Tier1 AllHomes_Tier2 AllHomes_Tier3 AllHomes_AllTiers UWHomes_Tier1 UWHomes_Tier2 ... UWHomes_TotalValue_Tier3 UWHomes_TotalValue_AllTiers UWHomes_MedianValue_AllTiers AllHomes_Tier1_ShareUW AllHomes_Tier2_ShareUW AllHomes_Tier3_ShareUW AllHomes_AllTiers_ShareUW UWHomes_ShareInTier1 UWHomes_ShareInTier2 UWHomes_ShareInTier3
0 Nation United States NaN NaN 35461549.0 35452941.0 35484532.0 106399022.0 594672.0 542681.0 ... 5.974490e+11 9.159540e+11 310306.0 0.016769 0.015307 0.020458 0.017512 0.319149 0.291246 0.389605
1 State Alabama Alabama NaN 546670.0 520247.0 491300.0 1558217.0 2890.0 2766.0 ... 2.583039e+09 3.578203e+09 270254.0 0.005287 0.005317 0.010271 0.006868 0.270043 0.258456 0.471501
2 State California California NaN 3060171.0 3076238.0 3169584.0 9305993.0 8090.0 14266.0 ... 2.666638e+10 4.449536e+10 860373.0 0.002644 0.004637 0.005611 0.004313 0.201550 0.355415 0.443035
3 State Connecticut Connecticut NaN 333904.0 334688.0 336254.0 1004846.0 4434.0 3807.0 ... 8.689480e+09 1.152170e+10 442036.0 0.013279 0.011375 0.021677 0.015455 0.285512 0.245138 0.469350
4 State Delaware Delaware NaN 102983.0 127573.0 127983.0 358539.0 4105.0 2574.0 ... 1.498013e+09 2.801847e+09 241344.5 0.039861 0.020177 0.025675 0.027793 0.411942 0.258304 0.329754
5 State Florida Florida NaN 2391491.0 2435575.0 2456637.0 7283703.0 313090.0 263276.0 ... 2.782440e+11 4.163480e+11 276690.0 0.130918 0.108096 0.135070 0.124687 0.344743 0.289893 0.365364
6 State Georgia Georgia NaN 1003890.0 986416.0 920843.0 2911149.0 5433.0 6904.0 ... 6.102691e+09 9.565510e+09 295405.0 0.005412 0.006999 0.010676 0.007615 0.245083 0.311440 0.443477
7 State Hawaii Hawaii NaN 132718.0 132760.0 139397.0 404875.0 16716.0 12536.0 ... 1.134778e+10 2.516682e+10 500666.0 0.125951 0.094426 0.051658 0.090035 0.458563 0.343895 0.197542
8 State Louisiana Louisiana NaN 439606.0 424665.0 417773.0 1282044.0 25678.0 23280.0 ... 6.755054e+09 1.309180e+10 149888.0 0.058411 0.054820 0.060408 0.057872 0.346088 0.313768 0.340144
9 State Massachusetts Massachusetts NaN 610378.0 611776.0 611920.0 1834074.0 16830.0 16491.0 ... 3.436795e+10 4.988792e+10 582126.0 0.027573 0.026956 0.040749 0.031763 0.288897 0.283078 0.428025

10 rows × 24 columns

Again, notice how Python starts counting from 0. That 0, 1, 2… shows the current index for this DataFrame. We could change this and will later.
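As a preview, here's a sketch of set_index, which swaps that default 0, 1, 2… index for row labels built from one or more columns. More on this later.

uw_indexed = uw.set_index(['RegionType', 'RegionName']) # rows now labelled by these two columns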

4.3.3.1. Using loc#

pandas has two location methods: loc and iloc. They are going to let us select parts of our data. The main difference: loc is label-based and iloc is integer-location based.

Since our columns have labels (our rows do not, beyond the default integer index), we can use loc to specify columns by name. If we created an index for this data, then our rows would have labels too. We’ll do that later. Any index for this data will be complicated, since no one column uniquely identifies a row!

You can read more about the two methods here.

We’ll start with loc. Again, our rows don’t have labels beyond the default integer index, so we can use the number of the row, even with loc. We can pull just the first row of data.

uw.loc[0]
RegionType                              Nation
RegionName                       United States
StateName                                  NaN
MSAName                                    NaN
AllHomes_Tier1                      35461549.0
AllHomes_Tier2                      35452941.0
AllHomes_Tier3                      35484532.0
AllHomes_AllTiers                  106399022.0
UWHomes_Tier1                         594672.0
UWHomes_Tier2                         542681.0
UWHomes_Tier3                         725955.0
UWHomes_AllTiers                     1863308.0
UWHomes_TotalValue_Tier1        122992000000.0
UWHomes_TotalValue_Tier2        195513000000.0
UWHomes_TotalValue_Tier3        597449000000.0
UWHomes_TotalValue_AllTiers     915954000000.0
UWHomes_MedianValue_AllTiers          310306.0
AllHomes_Tier1_ShareUW                0.016769
AllHomes_Tier2_ShareUW                0.015307
AllHomes_Tier3_ShareUW                0.020458
AllHomes_AllTiers_ShareUW             0.017512
UWHomes_ShareInTier1                  0.319149
UWHomes_ShareInTier2                  0.291246
UWHomes_ShareInTier3                  0.389605
Name: 0, dtype: object

That returned a Series with just the first row and all of the columns. You can make the output in a Jupyter notebook look a little nicer by turning the Series into a DataFrame using .to_frame(). Again, this is the first row and all of the column headers, just turned sideways.

uw.loc[0].to_frame()
0
RegionType Nation
RegionName United States
StateName NaN
MSAName NaN
AllHomes_Tier1 35461549.0
AllHomes_Tier2 35452941.0
AllHomes_Tier3 35484532.0
AllHomes_AllTiers 106399022.0
UWHomes_Tier1 594672.0
UWHomes_Tier2 542681.0
UWHomes_Tier3 725955.0
UWHomes_AllTiers 1863308.0
UWHomes_TotalValue_Tier1 122992000000.0
UWHomes_TotalValue_Tier2 195513000000.0
UWHomes_TotalValue_Tier3 597449000000.0
UWHomes_TotalValue_AllTiers 915954000000.0
UWHomes_MedianValue_AllTiers 310306.0
AllHomes_Tier1_ShareUW 0.016769
AllHomes_Tier2_ShareUW 0.015307
AllHomes_Tier3_ShareUW 0.020458
AllHomes_AllTiers_ShareUW 0.017512
UWHomes_ShareInTier1 0.319149
UWHomes_ShareInTier2 0.291246
UWHomes_ShareInTier3 0.389605

Things look more like what you’d expect if you pull multiple rows. This also means that we get a DataFrame and not just a Series.

We can pull the first ten rows and just three columns. But, now we are going to use the column names. And - notice how 0:9 actually gets us all ten rows, 0 through 9! Because we are using loc, we include the last item in the range.

uw.loc[0:9,['RegionType', 'RegionName', 'UWHomes_Tier2']]
RegionType RegionName UWHomes_Tier2
0 Nation United States 542681.0
1 State Alabama 2766.0
2 State California 14266.0
3 State Connecticut 3807.0
4 State Delaware 2574.0
5 State Florida 263276.0
6 State Georgia 6904.0
7 State Hawaii 12536.0
8 State Louisiana 23280.0
9 State Massachusetts 16491.0

We could pull all of the columns and rows 0:9 like this.

uw.loc[0:9,:]
RegionType RegionName StateName MSAName AllHomes_Tier1 AllHomes_Tier2 AllHomes_Tier3 AllHomes_AllTiers UWHomes_Tier1 UWHomes_Tier2 ... UWHomes_TotalValue_Tier3 UWHomes_TotalValue_AllTiers UWHomes_MedianValue_AllTiers AllHomes_Tier1_ShareUW AllHomes_Tier2_ShareUW AllHomes_Tier3_ShareUW AllHomes_AllTiers_ShareUW UWHomes_ShareInTier1 UWHomes_ShareInTier2 UWHomes_ShareInTier3
0 Nation United States NaN NaN 35461549.0 35452941.0 35484532.0 106399022.0 594672.0 542681.0 ... 5.974490e+11 9.159540e+11 310306.0 0.016769 0.015307 0.020458 0.017512 0.319149 0.291246 0.389605
1 State Alabama Alabama NaN 546670.0 520247.0 491300.0 1558217.0 2890.0 2766.0 ... 2.583039e+09 3.578203e+09 270254.0 0.005287 0.005317 0.010271 0.006868 0.270043 0.258456 0.471501
2 State California California NaN 3060171.0 3076238.0 3169584.0 9305993.0 8090.0 14266.0 ... 2.666638e+10 4.449536e+10 860373.0 0.002644 0.004637 0.005611 0.004313 0.201550 0.355415 0.443035
3 State Connecticut Connecticut NaN 333904.0 334688.0 336254.0 1004846.0 4434.0 3807.0 ... 8.689480e+09 1.152170e+10 442036.0 0.013279 0.011375 0.021677 0.015455 0.285512 0.245138 0.469350
4 State Delaware Delaware NaN 102983.0 127573.0 127983.0 358539.0 4105.0 2574.0 ... 1.498013e+09 2.801847e+09 241344.5 0.039861 0.020177 0.025675 0.027793 0.411942 0.258304 0.329754
5 State Florida Florida NaN 2391491.0 2435575.0 2456637.0 7283703.0 313090.0 263276.0 ... 2.782440e+11 4.163480e+11 276690.0 0.130918 0.108096 0.135070 0.124687 0.344743 0.289893 0.365364
6 State Georgia Georgia NaN 1003890.0 986416.0 920843.0 2911149.0 5433.0 6904.0 ... 6.102691e+09 9.565510e+09 295405.0 0.005412 0.006999 0.010676 0.007615 0.245083 0.311440 0.443477
7 State Hawaii Hawaii NaN 132718.0 132760.0 139397.0 404875.0 16716.0 12536.0 ... 1.134778e+10 2.516682e+10 500666.0 0.125951 0.094426 0.051658 0.090035 0.458563 0.343895 0.197542
8 State Louisiana Louisiana NaN 439606.0 424665.0 417773.0 1282044.0 25678.0 23280.0 ... 6.755054e+09 1.309180e+10 149888.0 0.058411 0.054820 0.060408 0.057872 0.346088 0.313768 0.340144
9 State Massachusetts Massachusetts NaN 610378.0 611776.0 611920.0 1834074.0 16830.0 16491.0 ... 3.436795e+10 4.988792e+10 582126.0 0.027573 0.026956 0.040749 0.031763 0.288897 0.283078 0.428025

10 rows × 24 columns

We can pull a range for columns too. Again, notice that the column name range is inclusive of the last column, unlike with slicing by element number.

uw.loc[0:9,'RegionType':'UWHomes_Tier2']
RegionType RegionName StateName MSAName AllHomes_Tier1 AllHomes_Tier2 AllHomes_Tier3 AllHomes_AllTiers UWHomes_Tier1 UWHomes_Tier2
0 Nation United States NaN NaN 35461549.0 35452941.0 35484532.0 106399022.0 594672.0 542681.0
1 State Alabama Alabama NaN 546670.0 520247.0 491300.0 1558217.0 2890.0 2766.0
2 State California California NaN 3060171.0 3076238.0 3169584.0 9305993.0 8090.0 14266.0
3 State Connecticut Connecticut NaN 333904.0 334688.0 336254.0 1004846.0 4434.0 3807.0
4 State Delaware Delaware NaN 102983.0 127573.0 127983.0 358539.0 4105.0 2574.0
5 State Florida Florida NaN 2391491.0 2435575.0 2456637.0 7283703.0 313090.0 263276.0
6 State Georgia Georgia NaN 1003890.0 986416.0 920843.0 2911149.0 5433.0 6904.0
7 State Hawaii Hawaii NaN 132718.0 132760.0 139397.0 404875.0 16716.0 12536.0
8 State Louisiana Louisiana NaN 439606.0 424665.0 417773.0 1282044.0 25678.0 23280.0
9 State Massachusetts Massachusetts NaN 610378.0 611776.0 611920.0 1834074.0 16830.0 16491.0

4.3.3.2. Using iloc#

Now, let’s try iloc. I’ll pull the exact same data as above. The numbers used are going to work like slicing. We want to use 0:10 for the rows, because we want the first 10 rows. Same for the columns. Inclusive of the first number, but up to and excluding the second.

uw.iloc[0:10,0:10]
RegionType RegionName StateName MSAName AllHomes_Tier1 AllHomes_Tier2 AllHomes_Tier3 AllHomes_AllTiers UWHomes_Tier1 UWHomes_Tier2
0 Nation United States NaN NaN 35461549.0 35452941.0 35484532.0 106399022.0 594672.0 542681.0
1 State Alabama Alabama NaN 546670.0 520247.0 491300.0 1558217.0 2890.0 2766.0
2 State California California NaN 3060171.0 3076238.0 3169584.0 9305993.0 8090.0 14266.0
3 State Connecticut Connecticut NaN 333904.0 334688.0 336254.0 1004846.0 4434.0 3807.0
4 State Delaware Delaware NaN 102983.0 127573.0 127983.0 358539.0 4105.0 2574.0
5 State Florida Florida NaN 2391491.0 2435575.0 2456637.0 7283703.0 313090.0 263276.0
6 State Georgia Georgia NaN 1003890.0 986416.0 920843.0 2911149.0 5433.0 6904.0
7 State Hawaii Hawaii NaN 132718.0 132760.0 139397.0 404875.0 16716.0 12536.0
8 State Louisiana Louisiana NaN 439606.0 424665.0 417773.0 1282044.0 25678.0 23280.0
9 State Massachusetts Massachusetts NaN 610378.0 611776.0 611920.0 1834074.0 16830.0 16491.0

We can also select columns in various locations, like above.

uw.iloc[0:10,[0, 1, 9]]
RegionType RegionName UWHomes_Tier2
0 Nation United States 542681.0
1 State Alabama 2766.0
2 State California 14266.0
3 State Connecticut 3807.0
4 State Delaware 2574.0
5 State Florida 263276.0
6 State Georgia 6904.0
7 State Hawaii 12536.0
8 State Louisiana 23280.0
9 State Massachusetts 16491.0

4.3.3.3. Comparing four selection methods#

You can use these methods to select and save columns to a new DataFrame. Each of the following does the same thing.

uw_sub1 = uw[['RegionType', 'RegionName', 'UWHomes_Tier2']]
uw_sub2 = uw.loc[:,['RegionType', 'RegionName', 'UWHomes_Tier2']]
uw_sub3 = uw.iloc[:,[0, 1, 9]]
uw_sub4 = pd.DataFrame(uw, columns = ('RegionType', 'RegionName', 'UWHomes_Tier2'))

print(uw_sub1.equals(uw_sub2))
print(uw_sub2.equals(uw_sub3))
print(uw_sub3.equals(uw_sub4))
True
True
True

I can use the .equals() method to test whether each DataFrame is equal to the next.

Again, each way is correct. We are going to see the first way – passing a list of variable names – used in the next section, where we filter our data based on different criteria.

4.3.3.4. Filtering#

Let’s try filtering our data and keeping just the MSA observations. This is sometimes called complex selection.

Here’s some help on filtering data with pandas.

Our first example simply selects the rows where the variable/column 'RegionType' is equal to the string 'MSA'.

uw[uw['RegionType'] == 'MSA'].head(15)
RegionType RegionName StateName MSAName AllHomes_Tier1 AllHomes_Tier2 AllHomes_Tier3 AllHomes_AllTiers UWHomes_Tier1 UWHomes_Tier2 ... UWHomes_TotalValue_Tier3 UWHomes_TotalValue_AllTiers UWHomes_MedianValue_AllTiers AllHomes_Tier1_ShareUW AllHomes_Tier2_ShareUW AllHomes_Tier3_ShareUW AllHomes_AllTiers_ShareUW UWHomes_ShareInTier1 UWHomes_ShareInTier2 UWHomes_ShareInTier3
251 MSA Aberdeen, WA Washington Aberdeen, WA 11606.0 11751.0 11749.0 35106.0 4119.0 2450.0 ... 1.734113e+08 7.791588e+08 88327.0 0.354903 0.208493 0.065537 0.209053 0.561248 0.333833 0.104919
252 MSA Astoria, OR Oregon Astoria, OR 4745.0 5228.0 5568.0 15541.0 811.0 376.0 ... 9.138231e+07 3.620604e+08 237912.0 0.170917 0.071920 0.040409 0.090856 0.574363 0.266289 0.159348
253 MSA Egg Harbor Township, NJ New Jersey Egg Harbor Township, NJ 34185.0 34117.0 34088.0 102390.0 9116.0 9575.0 ... 5.576646e+09 1.030631e+10 264441.0 0.266667 0.280652 0.252200 0.266510 0.334066 0.350887 0.315047
254 MSA Baltimore, MD Maryland Baltimore, MD 298318.0 312603.0 316502.0 927423.0 2293.0 2910.0 ... 3.698818e+09 5.153664e+09 336801.0 0.007686 0.009309 0.021861 0.013071 0.189160 0.240059 0.570780
255 MSA East Falmouth, MA Massachusetts East Falmouth, MA 45345.0 45358.0 45258.0 135961.0 1651.0 1395.0 ... 4.205784e+09 5.349099e+09 588650.0 0.036410 0.030755 0.073004 0.046705 0.260000 0.219685 0.520315
256 MSA Baton Rouge, LA Louisiana Baton Rouge, LA 78815.0 79581.0 78731.0 237127.0 2214.0 1904.0 ... 5.392566e+08 1.219648e+09 180953.5 0.028091 0.023925 0.024895 0.025632 0.364265 0.313261 0.322474
257 MSA Bay City, TX Texas Bay City, TX 4161.0 4252.0 4030.0 12443.0 195.0 351.0 ... 1.175738e+08 1.964820e+08 179753.5 0.046864 0.082549 0.107692 0.078759 0.198980 0.358163 0.442857
258 MSA Beaumont, TX Texas Beaumont, TX 49842.0 47829.0 44616.0 142287.0 1182.0 1043.0 ... 1.466850e+08 3.249066e+08 83413.0 0.023715 0.021807 0.018312 0.021379 0.388560 0.342867 0.268573
259 MSA Bellingham, WA Washington Bellingham, WA 29239.0 30413.0 30956.0 90608.0 2033.0 812.0 ... 7.000724e+08 1.184436e+09 211919.0 0.069530 0.026699 0.040283 0.045162 0.496823 0.198436 0.304741
260 MSA Boston, MA Massachusetts Boston, MA 409592.0 409966.0 410140.0 1229698.0 15948.0 15210.0 ... 2.851453e+10 4.266696e+10 570168.0 0.038936 0.037101 0.052509 0.042851 0.302653 0.288648 0.408699
261 MSA Bremerton, WA Washington Bremerton, WA 27865.0 27856.0 28119.0 83840.0 206.0 245.0 ... 2.076948e+09 2.275970e+09 708642.0 0.007393 0.008795 0.074896 0.030499 0.080563 0.095815 0.823621
262 MSA Bridgeport, CT Connecticut Bridgeport, CT 84535.0 84952.0 85028.0 254515.0 2326.0 1895.0 ... 6.041231e+09 7.812512e+09 593282.5 0.027515 0.022307 0.040869 0.030238 0.302235 0.246232 0.451533
263 MSA Brownsville, TX Texas Brownsville, TX 47084.0 37642.0 35503.0 120229.0 1621.0 1430.0 ... 5.327020e+08 1.036937e+09 171088.5 0.034428 0.037989 0.034448 0.035549 0.379270 0.334581 0.286149
264 MSA Brunswick, GA Georgia Brunswick, GA 14827.0 14885.0 15532.0 45244.0 2532.0 2924.0 ... 3.395779e+09 5.249447e+09 291962.5 0.170770 0.196439 0.300670 0.223809 0.250049 0.288762 0.461189
265 MSA Cambridge, MD Maryland Cambridge, MD 6193.0 6251.0 6118.0 18562.0 757.0 1119.0 ... 7.392582e+08 1.007092e+09 208306.0 0.122235 0.179011 0.350605 0.216625 0.188262 0.278289 0.533449

15 rows × 24 columns

Pay attention to the syntax. You are selecting a column from the DataFrame and asking to keep only the rows where that column (variable) is equal to ‘MSA’. I’m then displaying just the first 15 rows. And, I’m not saving this filtered data set anywhere – I’m just looking at it.

How about just North Carolina MSAs? I’ll again use the bigger DataFrame and just show the observations, rather than create a new DataFrame.

And, I’m now using compound conditions joined with an &. This boolean operator means and – both conditions need to be true.

You can also use | for “or” – see the sketch at the end of this section.

uw[(uw['RegionType'] == 'MSA') & (uw['StateName'] == 'North Carolina')].head(15)
RegionType RegionName StateName MSAName AllHomes_Tier1 AllHomes_Tier2 AllHomes_Tier3 AllHomes_AllTiers UWHomes_Tier1 UWHomes_Tier2 ... UWHomes_TotalValue_Tier3 UWHomes_TotalValue_AllTiers UWHomes_MedianValue_AllTiers AllHomes_Tier1_ShareUW AllHomes_Tier2_ShareUW AllHomes_Tier3_ShareUW AllHomes_AllTiers_ShareUW UWHomes_ShareInTier1 UWHomes_ShareInTier2 UWHomes_ShareInTier3
274 MSA Elizabeth City, NC North Carolina Elizabeth City, NC 5887.0 5882.0 5868.0 17637.0 2198.0 1364.0 ... 4.784827e+08 8.833370e+08 138246.0 0.373365 0.231894 0.283401 0.296252 0.420670 0.261053 0.318278
277 MSA Greenville, NC North Carolina Greenville, NC 22042.0 19844.0 18836.0 60722.0 71.0 34.0 ... 6.786239e+06 1.776378e+07 99179.5 0.003221 0.001713 0.001221 0.002108 0.554688 0.265625 0.179688
288 MSA Jacksonville, NC North Carolina Jacksonville, NC 17623.0 19413.0 17989.0 55025.0 792.0 1123.0 ... 6.370269e+08 1.029168e+09 264070.5 0.044941 0.057848 0.083329 0.062045 0.231986 0.328940 0.439074
292 MSA Kill Devil Hills, NC North Carolina Kill Devil Hills, NC 3194.0 3213.0 3486.0 9893.0 1461.0 1584.0 ... 9.128899e+08 1.650730e+09 280409.0 0.457420 0.492997 0.542456 0.498939 0.295989 0.320908 0.383104
297 MSA Morehead City, NC North Carolina Morehead City, NC 13555.0 13843.0 12927.0 40325.0 3195.0 3308.0 ... 2.316937e+09 3.449861e+09 244691.0 0.235706 0.238966 0.359712 0.276578 0.286470 0.296602 0.416928
302 MSA New Bern, NC North Carolina New Bern, NC 13637.0 13409.0 12440.0 39486.0 1364.0 1138.0 ... 5.442292e+08 8.553945e+08 163643.0 0.100022 0.084868 0.132235 0.105025 0.328912 0.274415 0.396672
334 MSA Washington, NC North Carolina Washington, NC 1289.0 1069.0 1043.0 3401.0 255.0 201.0 ... 5.513269e+07 9.990351e+07 116902.5 0.197828 0.188026 0.182167 0.189944 0.394737 0.311146 0.294118
336 MSA Wilmington, NC North Carolina Wilmington, NC 37100.0 40034.0 37475.0 114609.0 3367.0 3339.0 ... 4.265842e+09 6.761845e+09 383067.0 0.090755 0.083404 0.127578 0.100228 0.293114 0.290676 0.416210

8 rows × 24 columns

Notice how I needed to put () around each condition. Python is testing to see whether each of these is True or False and then filtering our data accordingly. This is also called masking. Here’s an example.

in_nc = (uw['RegionType'] == 'MSA') & (uw['StateName'] == 'North Carolina')
uw[in_nc].head(15)
RegionType RegionName StateName MSAName AllHomes_Tier1 AllHomes_Tier2 AllHomes_Tier3 AllHomes_AllTiers UWHomes_Tier1 UWHomes_Tier2 ... UWHomes_TotalValue_Tier3 UWHomes_TotalValue_AllTiers UWHomes_MedianValue_AllTiers AllHomes_Tier1_ShareUW AllHomes_Tier2_ShareUW AllHomes_Tier3_ShareUW AllHomes_AllTiers_ShareUW UWHomes_ShareInTier1 UWHomes_ShareInTier2 UWHomes_ShareInTier3
274 MSA Elizabeth City, NC North Carolina Elizabeth City, NC 5887.0 5882.0 5868.0 17637.0 2198.0 1364.0 ... 4.784827e+08 8.833370e+08 138246.0 0.373365 0.231894 0.283401 0.296252 0.420670 0.261053 0.318278
277 MSA Greenville, NC North Carolina Greenville, NC 22042.0 19844.0 18836.0 60722.0 71.0 34.0 ... 6.786239e+06 1.776378e+07 99179.5 0.003221 0.001713 0.001221 0.002108 0.554688 0.265625 0.179688
288 MSA Jacksonville, NC North Carolina Jacksonville, NC 17623.0 19413.0 17989.0 55025.0 792.0 1123.0 ... 6.370269e+08 1.029168e+09 264070.5 0.044941 0.057848 0.083329 0.062045 0.231986 0.328940 0.439074
292 MSA Kill Devil Hills, NC North Carolina Kill Devil Hills, NC 3194.0 3213.0 3486.0 9893.0 1461.0 1584.0 ... 9.128899e+08 1.650730e+09 280409.0 0.457420 0.492997 0.542456 0.498939 0.295989 0.320908 0.383104
297 MSA Morehead City, NC North Carolina Morehead City, NC 13555.0 13843.0 12927.0 40325.0 3195.0 3308.0 ... 2.316937e+09 3.449861e+09 244691.0 0.235706 0.238966 0.359712 0.276578 0.286470 0.296602 0.416928
302 MSA New Bern, NC North Carolina New Bern, NC 13637.0 13409.0 12440.0 39486.0 1364.0 1138.0 ... 5.442292e+08 8.553945e+08 163643.0 0.100022 0.084868 0.132235 0.105025 0.328912 0.274415 0.396672
334 MSA Washington, NC North Carolina Washington, NC 1289.0 1069.0 1043.0 3401.0 255.0 201.0 ... 5.513269e+07 9.990351e+07 116902.5 0.197828 0.188026 0.182167 0.189944 0.394737 0.311146 0.294118
336 MSA Wilmington, NC North Carolina Wilmington, NC 37100.0 40034.0 37475.0 114609.0 3367.0 3339.0 ... 4.265842e+09 6.761845e+09 383067.0 0.090755 0.083404 0.127578 0.100228 0.293114 0.290676 0.416210

8 rows × 24 columns

The Series in_nc holds True/False values for whether or not an observation meets the criteria. I can then pass that Series to the DataFrame uw, which will keep only the observations marked True. We are masking the big DataFrame, selecting just the observations that have a True associated with them, based on the criteria specified. Essentially, this is just another way of filtering our data.
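One nice side effect: since in_nc is just a True/False Series, we can flip it with ~ ("not") to get every row that fails the criteria. A quick sketch:

uw[~in_nc].head() # everything that is *not* a North Carolina MSA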

I can also use .loc to filter and then select certain columns.

uw.loc[(uw['RegionType'] == 'MSA') & (uw['StateName'] == 'North Carolina'), ['RegionType', 'RegionName', 'MSAName', 'AllHomes_Tier1']].head(15)
RegionType RegionName MSAName AllHomes_Tier1
274 MSA Elizabeth City, NC Elizabeth City, NC 5887.0
277 MSA Greenville, NC Greenville, NC 22042.0
288 MSA Jacksonville, NC Jacksonville, NC 17623.0
292 MSA Kill Devil Hills, NC Kill Devil Hills, NC 3194.0
297 MSA Morehead City, NC Morehead City, NC 13555.0
302 MSA New Bern, NC New Bern, NC 13637.0
334 MSA Washington, NC Washington, NC 1289.0
336 MSA Wilmington, NC Wilmington, NC 37100.0

As with the mask above, we can also define our criteria separately and then pass them in.

my_filter = (uw['RegionType'] == 'MSA') & (uw['StateName'] == 'North Carolina')
my_columns = ['RegionType', 'RegionName', 'MSAName', 'AllHomes_Tier1']
uw.loc[my_filter, my_columns].head(15)
RegionType RegionName MSAName AllHomes_Tier1
274 MSA Elizabeth City, NC Elizabeth City, NC 5887.0
277 MSA Greenville, NC Greenville, NC 22042.0
288 MSA Jacksonville, NC Jacksonville, NC 17623.0
292 MSA Kill Devil Hills, NC Kill Devil Hills, NC 3194.0
297 MSA Morehead City, NC Morehead City, NC 13555.0
302 MSA New Bern, NC New Bern, NC 13637.0
334 MSA Washington, NC Washington, NC 1289.0
336 MSA Wilmington, NC Wilmington, NC 37100.0
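Before moving on, here's the sketch of | ("or") promised above, with states chosen just for illustration. The .isin() method is a tidier way to write an "or" across several values of one column.

uw[(uw['StateName'] == 'Florida') | (uw['StateName'] == 'Georgia')].head() # either state works
uw[uw['StateName'].isin(['Florida', 'Georgia'])].head()                    # same idea, less typing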

4.3.4. First look at an index#

Let’s go back to the stock data and work with indices a bit more. Remember how that one had the Date as the index?

prices = pd.read_csv('https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/tr_eikon_eod_data.csv',
                      index_col = 0, parse_dates = True)

prices.index
DatetimeIndex(['2010-01-01', '2010-01-04', '2010-01-05', '2010-01-06',
               '2010-01-07', '2010-01-08', '2010-01-11', '2010-01-12',
               '2010-01-13', '2010-01-14',
               ...
               '2018-06-18', '2018-06-19', '2018-06-20', '2018-06-21',
               '2018-06-22', '2018-06-25', '2018-06-26', '2018-06-27',
               '2018-06-28', '2018-06-29'],
              dtype='datetime64[ns]', name='Date', length=2216, freq=None)

Now, let’s try loc, but using our datetime index (note the dtype above). We’ll pull in the SPX and VIX for just January 2010.

prices.loc['2010-01-01':'2010-01-31',['.SPX', '.VIX']]
.SPX .VIX
Date
2010-01-01 NaN NaN
2010-01-04 1132.99 20.04
2010-01-05 1136.52 19.35
2010-01-06 1137.14 19.16
2010-01-07 1141.69 19.06
2010-01-08 1144.98 18.13
2010-01-11 1146.98 17.55
2010-01-12 1136.22 18.25
2010-01-13 1145.68 17.85
2010-01-14 1148.46 17.63
2010-01-15 1136.03 17.91
2010-01-18 NaN NaN
2010-01-19 1150.23 17.58
2010-01-20 1138.04 18.68
2010-01-21 1116.48 22.27
2010-01-22 1091.76 27.31
2010-01-25 1096.78 25.41
2010-01-26 1092.17 24.55
2010-01-27 1097.50 23.14
2010-01-28 1084.53 23.73
2010-01-29 1073.87 24.62

You can even see in the output above how Date is in a different row than .SPX and .VIX. The date is not a variable anymore. It is an index.
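Because the index holds datetimes, pandas also supports partial string selection with loc. A sketch, assuming the same columns – this should grab all of January 2010 without spelling out both endpoints.

prices.loc['2010-01', ['.SPX', '.VIX']] # all rows whose date falls in January 2010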

This is a much simpler example than the Zillow data, which had multiple levels of data (e.g. MSA vs. Region vs. Zip), but no dates.

I do a more complicated multi-index, or multi-level, example when looking at reshaping our data.