4.4. Cleaning our data#
Cleaning your data and dealing with dates. That’s the life of anyone who deals with data, especially in finance. Most data sets are messy. This is especially true if you or your firm is the one collecting the data and not just purchasing something.
Then, once your data is (reasonably) clean and error free, you start creating the variables that will go into your statistical/machine learning models. This is called feature engineering.
DataCamp has an article on cleaning data in pandas.

See Chapter 7 of Python for Data Analysis for more data cleaning tips.

And, as always, I like the Coding for Economists text. Working with Data has pandas examples; note the discussion of tidy data. Data Transformation has data aggregation and summary examples. Missing Values covers what to do with missing values in your data.
You’ll see all of these steps again when we get to our actual models.
4.4.1. Data Integrity#
Before you even start to clean your data, you have to know how your data were created.
What does this mean in practice? Read the data manual if you are using data constructed by someone else! Did you pull data from FRED? How did the BLS construct that time series? Did anything change? Can you really compare a number from 2023 to 2019, or did a definition change?
4.4.2. Underwater exploration#
We’ll start by bringing in our Zillow data again.
import numpy as np
import pandas as pd
uw = pd.read_csv('https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/zestimatesAndCutoffs_byGeo_uw_2017-10-10_forDataPage.csv')
uw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2610 entries, 0 to 2609
Data columns (total 24 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 RegionType 2610 non-null object
1 RegionName 2610 non-null object
2 StateName 2609 non-null object
3 MSAName 1071 non-null object
4 AllHomes_Tier1 2610 non-null float64
5 AllHomes_Tier2 2610 non-null float64
6 AllHomes_Tier3 2610 non-null float64
7 AllHomes_AllTiers 2610 non-null float64
8 UWHomes_Tier1 2610 non-null float64
9 UWHomes_Tier2 2610 non-null int64
10 UWHomes_Tier3 2610 non-null float64
11 UWHomes_AllTiers 2610 non-null float64
12 UWHomes_TotalValue_Tier1 2610 non-null float64
13 UWHomes_TotalValue_Tier2 2610 non-null float64
14 UWHomes_TotalValue_Tier3 2610 non-null float64
15 UWHomes_TotalValue_AllTiers 2610 non-null float64
16 UWHomes_MedianValue_AllTiers 2610 non-null float64
17 AllHomes_Tier1_ShareUW 2610 non-null float64
18 AllHomes_Tier2_ShareUW 2610 non-null float64
19 AllHomes_Tier3_ShareUW 2610 non-null float64
20 AllHomes_AllTiers_ShareUW 2610 non-null float64
21 UWHomes_ShareInTier1 2610 non-null float64
22 UWHomes_ShareInTier2 2610 non-null float64
23 UWHomes_ShareInTier3 2610 non-null float64
dtypes: float64(19), int64(1), object(4)
memory usage: 489.5+ KB
One of the most important steps in data cleaning is just looking at what we have. What are the variables? What are their types? How many unique values of each variable do we have? Any missings? Do we see anything unexpected?
This page has a nice summary.
We can select a column and change the variable type using .astype().
uw['UWHomes_Tier2'] = uw['UWHomes_Tier2'].astype('float64')
This code takes the column, changes its type, and saves it back to the DataFrame under the same column name. Note that we could have saved it under a different column name instead.
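For example, here is a minimal sketch that keeps the original column and saves the converted values under a new, made-up column name:

# Save the converted values under a new (hypothetical) column name instead
uw['UWHomes_Tier2_float'] = uw['UWHomes_Tier2'].astype('float64')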
4.4.3. Counting, missings, and dupes#
When you first bring in some data, you often want to do some simple checks. How many observations do you have? Are you missing anything? Are there duplicate observations? How do we define duplicate observations?
You can’t do any data analysis until you understand the structure of your data. What do you have? Why do you have it? How do you want to use it?
Let’s start with counts. We can count the number of times a category appears in a variable. Note that this is only useful for variables that have a sense of category. You wouldn’t do this for the home count or dollar value variables, for example.
We can select a categorical variable, like RegionType, and then use .value_counts() from pandas.
uw['RegionType'].value_counts()
Zip 1247
City 1017
County 227
MSA 95
State 23
Nation 1
Name: RegionType, dtype: int64
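If you’d rather see shares than raw counts, .value_counts() also accepts a normalize=True argument. A quick sketch using the same column:

# Proportions instead of raw counts
uw['RegionType'].value_counts(normalize = True)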
We should look for missing values for each variable. isna() returns a True or False for each value, depending on whether or not it is NaN, or missing. We can then take those True/False (1/0) values and add them up with sum().
uw.isna().sum()
RegionType 0
RegionName 0
StateName 1
MSAName 1539
AllHomes_Tier1 0
AllHomes_Tier2 0
AllHomes_Tier3 0
AllHomes_AllTiers 0
UWHomes_Tier1 0
UWHomes_Tier2 0
UWHomes_Tier3 0
UWHomes_AllTiers 0
UWHomes_TotalValue_Tier1 0
UWHomes_TotalValue_Tier2 0
UWHomes_TotalValue_Tier3 0
UWHomes_TotalValue_AllTiers 0
UWHomes_MedianValue_AllTiers 0
AllHomes_Tier1_ShareUW 0
AllHomes_Tier2_ShareUW 0
AllHomes_Tier3_ShareUW 0
AllHomes_AllTiers_ShareUW 0
UWHomes_ShareInTier1 0
UWHomes_ShareInTier2 0
UWHomes_ShareInTier3 0
dtype: int64
You can see the logic in how these functions work. We take our DataFrame uw and send it to isna(). This actually creates another DataFrame, which we then pass to sum(). Check it out.
uw.isna()
RegionType | RegionName | StateName | MSAName | AllHomes_Tier1 | AllHomes_Tier2 | AllHomes_Tier3 | AllHomes_AllTiers | UWHomes_Tier1 | UWHomes_Tier2 | ... | UWHomes_TotalValue_Tier3 | UWHomes_TotalValue_AllTiers | UWHomes_MedianValue_AllTiers | AllHomes_Tier1_ShareUW | AllHomes_Tier2_ShareUW | AllHomes_Tier3_ShareUW | AllHomes_AllTiers_ShareUW | UWHomes_ShareInTier1 | UWHomes_ShareInTier2 | UWHomes_ShareInTier3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | True | True | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
1 | False | False | False | True | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
2 | False | False | False | True | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
3 | False | False | False | True | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
4 | False | False | False | True | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2605 | False | False | False | True | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
2606 | False | False | False | True | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
2607 | False | False | False | True | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
2608 | False | False | False | True | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
2609 | False | False | False | True | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
2610 rows × 24 columns
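You can also sum in the other direction. Here’s a small sketch that counts the number of missing values in each row, rather than in each column:

# axis=1 sums across the columns, giving missing values per row (observation)
uw.isna().sum(axis = 1)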
The same sort of logic applies to the method unique(). This one gives an array of each unique value in a column or set of columns. This will save an array of the unique values that appear in the RegionName column.
unique_regions = uw.RegionName.unique()
I’m using the .ColumnName construction to pull the variable.
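Bracket notation gets you the same thing. A quick equivalent, just for comparison:

# Same as uw.RegionName.unique(), selecting the column with brackets instead
unique_regions = uw['RegionName'].unique()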
We can count the number of unique values for a variable.
uw.RegionName.nunique()
2496
We can filter on one variable and count another. We’re looking for unexpected things, just trying to get a sense for what we have.
uw[uw['RegionType'] == 'MSA'].MSAName.nunique()
95
This syntax works too! Here, we are selecting the column by putting its name inside of [].
uw[uw['RegionType'] == 'MSA']['MSAName'].nunique()
95
We have 95 unique MSAs in our data. Seems reasonable. MSAs are like city-regions.
Finally, we can just drop any rows with missing values using .dropna() from pandas. If we don’t specify a column, then it will drop a row if ANY value is missing. I don’t actually want to drop missings here, so I’m not going to save my work with a uw =.

You can read more here.
uw.dropna(axis = 0)
RegionType | RegionName | StateName | MSAName | AllHomes_Tier1 | AllHomes_Tier2 | AllHomes_Tier3 | AllHomes_AllTiers | UWHomes_Tier1 | UWHomes_Tier2 | ... | UWHomes_TotalValue_Tier3 | UWHomes_TotalValue_AllTiers | UWHomes_MedianValue_AllTiers | AllHomes_Tier1_ShareUW | AllHomes_Tier2_ShareUW | AllHomes_Tier3_ShareUW | AllHomes_AllTiers_ShareUW | UWHomes_ShareInTier1 | UWHomes_ShareInTier2 | UWHomes_ShareInTier3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
251 | MSA | Aberdeen, WA | Washington | Aberdeen, WA | 11606.0 | 11751.0 | 11749.0 | 35106.0 | 4119.0 | 2450.0 | ... | 1.734113e+08 | 7.791588e+08 | 88327.0 | 0.354903 | 0.208493 | 0.065537 | 0.209053 | 0.561248 | 0.333833 | 0.104919 |
252 | MSA | Astoria, OR | Oregon | Astoria, OR | 4745.0 | 5228.0 | 5568.0 | 15541.0 | 811.0 | 376.0 | ... | 9.138231e+07 | 3.620604e+08 | 237912.0 | 0.170917 | 0.071920 | 0.040409 | 0.090856 | 0.574363 | 0.266289 | 0.159348 |
253 | MSA | Egg Harbor Township, NJ | New Jersey | Egg Harbor Township, NJ | 34185.0 | 34117.0 | 34088.0 | 102390.0 | 9116.0 | 9575.0 | ... | 5.576646e+09 | 1.030631e+10 | 264441.0 | 0.266667 | 0.280652 | 0.252200 | 0.266510 | 0.334066 | 0.350887 | 0.315047 |
254 | MSA | Baltimore, MD | Maryland | Baltimore, MD | 298318.0 | 312603.0 | 316502.0 | 927423.0 | 2293.0 | 2910.0 | ... | 3.698818e+09 | 5.153664e+09 | 336801.0 | 0.007686 | 0.009309 | 0.021861 | 0.013071 | 0.189160 | 0.240059 | 0.570780 |
255 | MSA | East Falmouth, MA | Massachusetts | East Falmouth, MA | 45345.0 | 45358.0 | 45258.0 | 135961.0 | 1651.0 | 1395.0 | ... | 4.205784e+09 | 5.349099e+09 | 588650.0 | 0.036410 | 0.030755 | 0.073004 | 0.046705 | 0.260000 | 0.219685 | 0.520315 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1358 | City | Town of Salisbury | Massachusetts | Boston, MA | 1207.0 | 1207.0 | 1207.0 | 3621.0 | 397.0 | 402.0 | ... | 1.484370e+08 | 3.900701e+08 | 342026.0 | 0.328915 | 0.333057 | 0.221210 | 0.294394 | 0.372420 | 0.377111 | 0.250469 |
1359 | City | Town of Duxbury | Massachusetts | Boston, MA | 1835.0 | 1835.0 | 1837.0 | 5507.0 | 68.0 | 33.0 | ... | 1.909504e+08 | 2.389202e+08 | 749341.0 | 0.037057 | 0.017984 | 0.056614 | 0.037225 | 0.331707 | 0.160976 | 0.507317 |
1360 | City | New Shoreham | Rhode Island | Providence, RI | 506.0 | 506.0 | 506.0 | 1518.0 | 57.0 | 25.0 | ... | 9.383493e+07 | 1.654886e+08 | 1072092.0 | 0.112648 | 0.049407 | 0.094862 | 0.085639 | 0.438462 | 0.192308 | 0.369231 |
1361 | City | Yarmouth | Massachusetts | East Falmouth, MA | 5178.0 | 5178.0 | 5180.0 | 15536.0 | 472.0 | 184.0 | ... | 4.531706e+08 | 6.061486e+08 | 332409.0 | 0.091155 | 0.035535 | 0.102124 | 0.076274 | 0.398312 | 0.155274 | 0.446414 |
1362 | City | Town of Hingham | Massachusetts | Boston, MA | 2462.0 | 2462.0 | 2462.0 | 7386.0 | 48.0 | 33.0 | ... | 8.266223e+07 | 1.309589e+08 | 731420.0 | 0.019496 | 0.013404 | 0.021121 | 0.018007 | 0.360902 | 0.248120 | 0.390977 |
1071 rows × 24 columns
We now have 1071 rows, since MSAName had so many missing values. But that’s OK, given how these data are constructed. You’ll see another example with stock data below.
Now, I’ll just drop observations if AllHomes_Tier1 is missing.
uw.dropna(subset = ['AllHomes_Tier1'], axis = 0)
RegionType | RegionName | StateName | MSAName | AllHomes_Tier1 | AllHomes_Tier2 | AllHomes_Tier3 | AllHomes_AllTiers | UWHomes_Tier1 | UWHomes_Tier2 | ... | UWHomes_TotalValue_Tier3 | UWHomes_TotalValue_AllTiers | UWHomes_MedianValue_AllTiers | AllHomes_Tier1_ShareUW | AllHomes_Tier2_ShareUW | AllHomes_Tier3_ShareUW | AllHomes_AllTiers_ShareUW | UWHomes_ShareInTier1 | UWHomes_ShareInTier2 | UWHomes_ShareInTier3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Nation | United States | NaN | NaN | 35461549.0 | 35452941.0 | 35484532.0 | 106399022.0 | 594672.0 | 542681.0 | ... | 5.974490e+11 | 9.159540e+11 | 310306.0 | 0.016769 | 0.015307 | 0.020458 | 0.017512 | 0.319149 | 0.291246 | 0.389605 |
1 | State | Alabama | Alabama | NaN | 546670.0 | 520247.0 | 491300.0 | 1558217.0 | 2890.0 | 2766.0 | ... | 2.583039e+09 | 3.578203e+09 | 270254.0 | 0.005287 | 0.005317 | 0.010271 | 0.006868 | 0.270043 | 0.258456 | 0.471501 |
2 | State | California | California | NaN | 3060171.0 | 3076238.0 | 3169584.0 | 9305993.0 | 8090.0 | 14266.0 | ... | 2.666638e+10 | 4.449536e+10 | 860373.0 | 0.002644 | 0.004637 | 0.005611 | 0.004313 | 0.201550 | 0.355415 | 0.443035 |
3 | State | Connecticut | Connecticut | NaN | 333904.0 | 334688.0 | 336254.0 | 1004846.0 | 4434.0 | 3807.0 | ... | 8.689480e+09 | 1.152170e+10 | 442036.0 | 0.013279 | 0.011375 | 0.021677 | 0.015455 | 0.285512 | 0.245138 | 0.469350 |
4 | State | Delaware | Delaware | NaN | 102983.0 | 127573.0 | 127983.0 | 358539.0 | 4105.0 | 2574.0 | ... | 1.498013e+09 | 2.801847e+09 | 241344.5 | 0.039861 | 0.020177 | 0.025675 | 0.027793 | 0.411942 | 0.258304 | 0.329754 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2605 | Zip | 98592 | Washington | NaN | 464.0 | 470.0 | 493.0 | 1427.0 | 8.0 | 29.0 | ... | 7.340711e+07 | 8.271541e+07 | 436490.5 | 0.017241 | 0.061702 | 0.273834 | 0.120533 | 0.046512 | 0.168605 | 0.784884 |
2606 | Zip | 98595 | Washington | NaN | 558.0 | 571.0 | 598.0 | 1727.0 | 119.0 | 104.0 | ... | 1.376119e+07 | 4.036570e+07 | 131441.0 | 0.213262 | 0.182137 | 0.095318 | 0.162131 | 0.425000 | 0.371429 | 0.203571 |
2607 | Zip | 98612 | Washington | NaN | 365.0 | 376.0 | 409.0 | 1150.0 | 91.0 | 79.0 | ... | 1.440674e+07 | 5.028400e+07 | 230495.0 | 0.249315 | 0.210106 | 0.092910 | 0.180870 | 0.437500 | 0.379808 | 0.182692 |
2608 | Zip | 32081 | Florida | NaN | 1318.0 | 1328.0 | 1350.0 | 3996.0 | 91.0 | 42.0 | ... | 9.111212e+07 | 1.325055e+08 | 385196.0 | 0.069044 | 0.031627 | 0.097778 | 0.066316 | 0.343396 | 0.158491 | 0.498113 |
2609 | Zip | 33578 | Florida | NaN | 5975.0 | 4537.0 | 3232.0 | 13744.0 | 8.0 | 6.0 | ... | 5.685202e+07 | 5.911551e+07 | 363355.0 | 0.001339 | 0.001322 | 0.038366 | 0.010041 | 0.057971 | 0.043478 | 0.898551 |
2610 rows × 24 columns
Nothing gets dropped, since there were no missing values.
Unique and missing values are important. So is the idea of duplicates. Does a row (an observation) contain the same values as another row? That could be all of the values across all of the variables, or just a particular column (e.g. ticker), or a set of columns (e.g. ticker and date).
We can see if our data has any duplicates by using .duplicated() from pandas.

You can read more here.
I’m going to filter my data to only include RegionType='Zip' and then look for duplicate Zip codes. The Zip codes are found in RegionName. I’ll save the result to a new variable called dupes. The argument keep='first' means that the first occurrence of a value is not flagged as a duplicate.

I am then going to mask my subset of the uw data. So, I filter to get just Zip codes. Then, I mask using dupes, which is just a Series of True/False values. By masking, I’m telling Python to only keep rows where the value is True. There are no True values, so there are no duplicate Zips and the resulting DataFrame is empty.
dupes = uw[uw['RegionType'] == 'Zip'].duplicated(subset=['RegionName'], keep='first')
uw[uw['RegionType'] == 'Zip'][dupes]
RegionType | RegionName | StateName | MSAName | AllHomes_Tier1 | AllHomes_Tier2 | AllHomes_Tier3 | AllHomes_AllTiers | UWHomes_Tier1 | UWHomes_Tier2 | ... | UWHomes_TotalValue_Tier3 | UWHomes_TotalValue_AllTiers | UWHomes_MedianValue_AllTiers | AllHomes_Tier1_ShareUW | AllHomes_Tier2_ShareUW | AllHomes_Tier3_ShareUW | AllHomes_AllTiers_ShareUW | UWHomes_ShareInTier1 | UWHomes_ShareInTier2 | UWHomes_ShareInTier3 |
---|
0 rows × 24 columns
We can use drop_duplicates() to look at our data and drop rows that are duplicate observations. We use the subset= argument to tell it what columns to look at for duplicates. We can leave out this argument if we want to look across all columns. The argument keep= tells it which duplicate to keep in the data (e.g. the first one or the last one).

You can read more about the syntax here.
uw.drop_duplicates()
RegionType | RegionName | StateName | MSAName | AllHomes_Tier1 | AllHomes_Tier2 | AllHomes_Tier3 | AllHomes_AllTiers | UWHomes_Tier1 | UWHomes_Tier2 | ... | UWHomes_TotalValue_Tier3 | UWHomes_TotalValue_AllTiers | UWHomes_MedianValue_AllTiers | AllHomes_Tier1_ShareUW | AllHomes_Tier2_ShareUW | AllHomes_Tier3_ShareUW | AllHomes_AllTiers_ShareUW | UWHomes_ShareInTier1 | UWHomes_ShareInTier2 | UWHomes_ShareInTier3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Nation | United States | NaN | NaN | 35461549.0 | 35452941.0 | 35484532.0 | 106399022.0 | 594672.0 | 542681.0 | ... | 5.974490e+11 | 9.159540e+11 | 310306.0 | 0.016769 | 0.015307 | 0.020458 | 0.017512 | 0.319149 | 0.291246 | 0.389605 |
1 | State | Alabama | Alabama | NaN | 546670.0 | 520247.0 | 491300.0 | 1558217.0 | 2890.0 | 2766.0 | ... | 2.583039e+09 | 3.578203e+09 | 270254.0 | 0.005287 | 0.005317 | 0.010271 | 0.006868 | 0.270043 | 0.258456 | 0.471501 |
2 | State | California | California | NaN | 3060171.0 | 3076238.0 | 3169584.0 | 9305993.0 | 8090.0 | 14266.0 | ... | 2.666638e+10 | 4.449536e+10 | 860373.0 | 0.002644 | 0.004637 | 0.005611 | 0.004313 | 0.201550 | 0.355415 | 0.443035 |
3 | State | Connecticut | Connecticut | NaN | 333904.0 | 334688.0 | 336254.0 | 1004846.0 | 4434.0 | 3807.0 | ... | 8.689480e+09 | 1.152170e+10 | 442036.0 | 0.013279 | 0.011375 | 0.021677 | 0.015455 | 0.285512 | 0.245138 | 0.469350 |
4 | State | Delaware | Delaware | NaN | 102983.0 | 127573.0 | 127983.0 | 358539.0 | 4105.0 | 2574.0 | ... | 1.498013e+09 | 2.801847e+09 | 241344.5 | 0.039861 | 0.020177 | 0.025675 | 0.027793 | 0.411942 | 0.258304 | 0.329754 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2605 | Zip | 98592 | Washington | NaN | 464.0 | 470.0 | 493.0 | 1427.0 | 8.0 | 29.0 | ... | 7.340711e+07 | 8.271541e+07 | 436490.5 | 0.017241 | 0.061702 | 0.273834 | 0.120533 | 0.046512 | 0.168605 | 0.784884 |
2606 | Zip | 98595 | Washington | NaN | 558.0 | 571.0 | 598.0 | 1727.0 | 119.0 | 104.0 | ... | 1.376119e+07 | 4.036570e+07 | 131441.0 | 0.213262 | 0.182137 | 0.095318 | 0.162131 | 0.425000 | 0.371429 | 0.203571 |
2607 | Zip | 98612 | Washington | NaN | 365.0 | 376.0 | 409.0 | 1150.0 | 91.0 | 79.0 | ... | 1.440674e+07 | 5.028400e+07 | 230495.0 | 0.249315 | 0.210106 | 0.092910 | 0.180870 | 0.437500 | 0.379808 | 0.182692 |
2608 | Zip | 32081 | Florida | NaN | 1318.0 | 1328.0 | 1350.0 | 3996.0 | 91.0 | 42.0 | ... | 9.111212e+07 | 1.325055e+08 | 385196.0 | 0.069044 | 0.031627 | 0.097778 | 0.066316 | 0.343396 | 0.158491 | 0.498113 |
2609 | Zip | 33578 | Florida | NaN | 5975.0 | 4537.0 | 3232.0 | 13744.0 | 8.0 | 6.0 | ... | 5.685202e+07 | 5.911551e+07 | 363355.0 | 0.001339 | 0.001322 | 0.038366 | 0.010041 | 0.057971 | 0.043478 | 0.898551 |
2610 rows × 24 columns
uw.drop_duplicates(subset=['RegionName', 'StateName'], keep='last')
RegionType | RegionName | StateName | MSAName | AllHomes_Tier1 | AllHomes_Tier2 | AllHomes_Tier3 | AllHomes_AllTiers | UWHomes_Tier1 | UWHomes_Tier2 | ... | UWHomes_TotalValue_Tier3 | UWHomes_TotalValue_AllTiers | UWHomes_MedianValue_AllTiers | AllHomes_Tier1_ShareUW | AllHomes_Tier2_ShareUW | AllHomes_Tier3_ShareUW | AllHomes_AllTiers_ShareUW | UWHomes_ShareInTier1 | UWHomes_ShareInTier2 | UWHomes_ShareInTier3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Nation | United States | NaN | NaN | 35461549.0 | 35452941.0 | 35484532.0 | 106399022.0 | 594672.0 | 542681.0 | ... | 5.974490e+11 | 9.159540e+11 | 310306.0 | 0.016769 | 0.015307 | 0.020458 | 0.017512 | 0.319149 | 0.291246 | 0.389605 |
1 | State | Alabama | Alabama | NaN | 546670.0 | 520247.0 | 491300.0 | 1558217.0 | 2890.0 | 2766.0 | ... | 2.583039e+09 | 3.578203e+09 | 270254.0 | 0.005287 | 0.005317 | 0.010271 | 0.006868 | 0.270043 | 0.258456 | 0.471501 |
2 | State | California | California | NaN | 3060171.0 | 3076238.0 | 3169584.0 | 9305993.0 | 8090.0 | 14266.0 | ... | 2.666638e+10 | 4.449536e+10 | 860373.0 | 0.002644 | 0.004637 | 0.005611 | 0.004313 | 0.201550 | 0.355415 | 0.443035 |
3 | State | Connecticut | Connecticut | NaN | 333904.0 | 334688.0 | 336254.0 | 1004846.0 | 4434.0 | 3807.0 | ... | 8.689480e+09 | 1.152170e+10 | 442036.0 | 0.013279 | 0.011375 | 0.021677 | 0.015455 | 0.285512 | 0.245138 | 0.469350 |
4 | State | Delaware | Delaware | NaN | 102983.0 | 127573.0 | 127983.0 | 358539.0 | 4105.0 | 2574.0 | ... | 1.498013e+09 | 2.801847e+09 | 241344.5 | 0.039861 | 0.020177 | 0.025675 | 0.027793 | 0.411942 | 0.258304 | 0.329754 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2605 | Zip | 98592 | Washington | NaN | 464.0 | 470.0 | 493.0 | 1427.0 | 8.0 | 29.0 | ... | 7.340711e+07 | 8.271541e+07 | 436490.5 | 0.017241 | 0.061702 | 0.273834 | 0.120533 | 0.046512 | 0.168605 | 0.784884 |
2606 | Zip | 98595 | Washington | NaN | 558.0 | 571.0 | 598.0 | 1727.0 | 119.0 | 104.0 | ... | 1.376119e+07 | 4.036570e+07 | 131441.0 | 0.213262 | 0.182137 | 0.095318 | 0.162131 | 0.425000 | 0.371429 | 0.203571 |
2607 | Zip | 98612 | Washington | NaN | 365.0 | 376.0 | 409.0 | 1150.0 | 91.0 | 79.0 | ... | 1.440674e+07 | 5.028400e+07 | 230495.0 | 0.249315 | 0.210106 | 0.092910 | 0.180870 | 0.437500 | 0.379808 | 0.182692 |
2608 | Zip | 32081 | Florida | NaN | 1318.0 | 1328.0 | 1350.0 | 3996.0 | 91.0 | 42.0 | ... | 9.111212e+07 | 1.325055e+08 | 385196.0 | 0.069044 | 0.031627 | 0.097778 | 0.066316 | 0.343396 | 0.158491 | 0.498113 |
2609 | Zip | 33578 | Florida | NaN | 5975.0 | 4537.0 | 3232.0 | 13744.0 | 8.0 | 6.0 | ... | 5.685202e+07 | 5.911551e+07 | 363355.0 | 0.001339 | 0.001322 | 0.038366 | 0.010041 | 0.057971 | 0.043478 | 0.898551 |
2571 rows × 24 columns
You want to keep track of the number of observations (rows) when you’re dropping duplicates. In this case, we go from 2,610 rows to 2,571.
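One simple way to keep track is to compare row counts before and after, sketched here with Python’s built-in len():

# Compare the number of rows before and after dropping duplicates
before = len(uw)
after = len(uw.drop_duplicates(subset = ['RegionName', 'StateName'], keep = 'last'))
print(before, after, before - after)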
4.4.4. Sorting our data#
Another basic task that we often face is sorting our data. You can sort numeric or string data, by variable, using .sort_values() from pandas. The ascending= argument can be True or False and tells us how to sort. You can pass it a list of how to sort if you’re sorting on multiple variables.

I’ll sort the data by RegionType first. Then, I’ll sort the data by RegionType, RegionName, and AllHomes_Tier1.

You can read more here.
uw.sort_values(by = 'RegionType', ascending = True)
RegionType | RegionName | StateName | MSAName | AllHomes_Tier1 | AllHomes_Tier2 | AllHomes_Tier3 | AllHomes_AllTiers | UWHomes_Tier1 | UWHomes_Tier2 | ... | UWHomes_TotalValue_Tier3 | UWHomes_TotalValue_AllTiers | UWHomes_MedianValue_AllTiers | AllHomes_Tier1_ShareUW | AllHomes_Tier2_ShareUW | AllHomes_Tier3_ShareUW | AllHomes_AllTiers_ShareUW | UWHomes_ShareInTier1 | UWHomes_ShareInTier2 | UWHomes_ShareInTier3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1304 | City | Ortley Beach | New Jersey | New York, NY | 767.0 | 763.0 | 766.0 | 2296.0 | 746.0 | 724.0 | ... | 447937059.0 | 879315914.0 | 348092.0 | 0.972621 | 0.948886 | 0.953003 | 0.958188 | 0.339091 | 0.329091 | 0.331818 |
896 | City | Smyrna | Delaware | Dover, DE | 4464.0 | 4464.0 | 4464.0 | 13392.0 | 128.0 | 39.0 | ... | 9078029.0 | 29936443.0 | 110029.0 | 0.028674 | 0.008737 | 0.005376 | 0.014262 | 0.670157 | 0.204188 | 0.125654 |
895 | City | Smithfield | Virginia | Virginia Beach, VA | 2251.0 | 2251.0 | 2252.0 | 6754.0 | 20.0 | 20.0 | ... | 68564829.0 | 76200890.0 | 379356.5 | 0.008885 | 0.008885 | 0.065719 | 0.027835 | 0.106383 | 0.106383 | 0.787234 |
894 | City | Secaucus | New Jersey | New York, NY | 1586.0 | 1584.0 | 1586.0 | 4756.0 | 1106.0 | 537.0 | ... | 277046340.0 | 864994299.0 | 391859.0 | 0.697352 | 0.339015 | 0.288777 | 0.441758 | 0.526416 | 0.255593 | 0.217991 |
893 | City | Seaside Park | New Jersey | New York, NY | 672.0 | 672.0 | 672.0 | 2016.0 | 443.0 | 538.0 | ... | 359297149.0 | 827539382.0 | 558632.0 | 0.659226 | 0.800595 | 0.595238 | 0.685020 | 0.320782 | 0.389573 | 0.289645 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1775 | Zip | 22307 | Virginia | NaN | 1483.0 | 1097.0 | 1970.0 | 4550.0 | 181.0 | 215.0 | ... | 45692041.0 | 153788304.0 | 248578.0 | 0.122050 | 0.195989 | 0.031472 | 0.100659 | 0.395197 | 0.469432 | 0.135371 |
1774 | Zip | 21921 | Maryland | NaN | 5270.0 | 5284.0 | 5403.0 | 15957.0 | 12.0 | 47.0 | ... | 41149066.0 | 53223597.0 | 316059.0 | 0.002277 | 0.008895 | 0.017583 | 0.009651 | 0.077922 | 0.305195 | 0.616883 |
1773 | Zip | 21903 | Maryland | NaN | 784.0 | 851.0 | 856.0 | 2491.0 | 10.0 | 37.0 | ... | 21432532.0 | 30859486.0 | 260047.5 | 0.012755 | 0.043478 | 0.064252 | 0.040947 | 0.098039 | 0.362745 | 0.539216 |
1780 | Zip | 22578 | Virginia | NaN | 458.0 | 448.0 | 462.0 | 1368.0 | 73.0 | 159.0 | ... | 118131703.0 | 184974026.0 | 452094.0 | 0.159389 | 0.354911 | 0.370130 | 0.294591 | 0.181141 | 0.394541 | 0.424318 |
2609 | Zip | 33578 | Florida | NaN | 5975.0 | 4537.0 | 3232.0 | 13744.0 | 8.0 | 6.0 | ... | 56852018.0 | 59115506.0 | 363355.0 | 0.001339 | 0.001322 | 0.038366 | 0.010041 | 0.057971 | 0.043478 | 0.898551 |
2610 rows × 24 columns
uw.sort_values(by = ['RegionType', 'RegionName', 'AllHomes_Tier1'], ascending = [True, True, False])
RegionType | RegionName | StateName | MSAName | AllHomes_Tier1 | AllHomes_Tier2 | AllHomes_Tier3 | AllHomes_AllTiers | UWHomes_Tier1 | UWHomes_Tier2 | ... | UWHomes_TotalValue_Tier3 | UWHomes_TotalValue_AllTiers | UWHomes_MedianValue_AllTiers | AllHomes_Tier1_ShareUW | AllHomes_Tier2_ShareUW | AllHomes_Tier3_ShareUW | AllHomes_AllTiers_ShareUW | UWHomes_ShareInTier1 | UWHomes_ShareInTier2 | UWHomes_ShareInTier3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
690 | City | Abbeville | Louisiana | Lafayette, LA | 1995.0 | 1995.0 | 1997.0 | 5987.0 | 196.0 | 512.0 | ... | 30967445.0 | 100185101.0 | 103417.0 | 0.098246 | 0.256642 | 0.095643 | 0.150159 | 0.218020 | 0.569522 | 0.212458 |
811 | City | Aberdeen | Washington | Aberdeen, WA | 2799.0 | 2799.0 | 2799.0 | 8397.0 | 2412.0 | 1145.0 | ... | 64053930.0 | 389836601.0 | 87694.0 | 0.861736 | 0.409075 | 0.091104 | 0.453972 | 0.632739 | 0.300367 | 0.066894 |
346 | City | Absecon | New Jersey | Egg Harbor Township, NJ | 1097.0 | 1097.0 | 1097.0 | 3291.0 | 40.0 | 55.0 | ... | 18714933.0 | 33321640.0 | 180979.0 | 0.036463 | 0.050137 | 0.060164 | 0.048921 | 0.248447 | 0.341615 | 0.409938 |
939 | City | Aiea | Hawaii | Honolulu, HI | 4209.0 | 4209.0 | 4211.0 | 12629.0 | 0.0 | 146.0 | ... | 35306798.0 | 137032636.0 | 708881.0 | 0.000000 | 0.034688 | 0.007837 | 0.014174 | 0.000000 | 0.815642 | 0.184358 |
1151 | City | Alva | Florida | Fort Myers, FL | 589.0 | 589.0 | 590.0 | 1768.0 | 6.0 | 38.0 | ... | 59575578.0 | 69446593.0 | 369560.0 | 0.010187 | 0.064516 | 0.164407 | 0.079751 | 0.042553 | 0.269504 | 0.687943 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2603 | Zip | 98584 | Washington | NaN | 4708.0 | 4694.0 | 4766.0 | 14168.0 | 8.0 | 41.0 | ... | 98174201.0 | 108265864.0 | 366002.0 | 0.001699 | 0.008735 | 0.047839 | 0.019551 | 0.028881 | 0.148014 | 0.823105 |
2604 | Zip | 98588 | Washington | NaN | 469.0 | 479.0 | 479.0 | 1427.0 | 17.0 | 35.0 | ... | 51494123.0 | 60982770.0 | 351648.0 | 0.036247 | 0.073069 | 0.229645 | 0.113525 | 0.104938 | 0.216049 | 0.679012 |
2605 | Zip | 98592 | Washington | NaN | 464.0 | 470.0 | 493.0 | 1427.0 | 8.0 | 29.0 | ... | 73407109.0 | 82715412.0 | 436490.5 | 0.017241 | 0.061702 | 0.273834 | 0.120533 | 0.046512 | 0.168605 | 0.784884 |
2606 | Zip | 98595 | Washington | NaN | 558.0 | 571.0 | 598.0 | 1727.0 | 119.0 | 104.0 | ... | 13761194.0 | 40365702.0 | 131441.0 | 0.213262 | 0.182137 | 0.095318 | 0.162131 | 0.425000 | 0.371429 | 0.203571 |
2607 | Zip | 98612 | Washington | NaN | 365.0 | 376.0 | 409.0 | 1150.0 | 91.0 | 79.0 | ... | 14406737.0 | 50284001.0 | 230495.0 | 0.249315 | 0.210106 | 0.092910 | 0.180870 | 0.437500 | 0.379808 | 0.182692 |
2610 rows × 24 columns
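Relatedly, if you only want the top few observations by some column, .nlargest() from pandas is a shortcut around sorting and then slicing. A small sketch:

# The five rows with the largest AllHomes_Tier1 values
uw.nlargest(5, 'AllHomes_Tier1')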
4.4.5. Renaming columns#
While cleaning our data, we might want to rename our columns, or variables. You can do this automatically with something like pyjanitor (see below). You can also do it when importing the data using .read_csv from pandas, if you know that you’ll want to change the name.

It is also easy enough to do using .rename from pandas. You give it a mapping of old name to new name. I’ll rename MSAName for fun.
Note the inplace=True argument. You’ll see this a lot with pandas methods. It means that we don’t have to write uw = uw.rename(). Instead, the old DataFrame gets overwritten, with the renamed column applied in place.

You can read more here.
uw.rename(columns = {'MSAName':'msaname'}, inplace = True)
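If you’d rather avoid inplace=True, the assignment version is equivalent. A sketch, just re-saving the result under the same name:

# Same rename, but saving the result back explicitly instead of using inplace=True
uw = uw.rename(columns = {'MSAName': 'msaname'})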
4.4.6. Dropping columns#
We can also simply drop a column of data. You could do this via data subsets (e.g. .iloc), but this method lets you use .drop() to specify the name of the column. It’s a little clearer doing things this way. The axis=1 option means that you are dropping the named column. I’ve picked a column at random to drop. Notice that I’m saving my work back to the uw DataFrame.
uw = uw.drop('UWHomes_TotalValue_Tier3', axis=1)
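An equivalent way to do this uses the columns= argument, which also makes it easy to drop several columns at once. A quick sketch; the column choices here are just for illustration, and I’m not saving the result:

# columns= does the same thing as axis=1; result not saved back to uw
uw.drop(columns = ['UWHomes_TotalValue_Tier1', 'UWHomes_TotalValue_Tier2'])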
4.4.7. Missing data example#
Finally, let’s try some more complicated code. I found this example here.
First, we will create a Series that has the total number of missing values for each variable. We can sort it using sort_values(). The ascending=False option puts the variable with the largest number of missings at the top.
total = uw.isna().sum().sort_values(ascending=False)
By the way, you’ll also see an isnull() method that does the same thing as isna().
Next, let’s create a Series that has the percentage of values that are missing for each variable. This is a neat one - we are creating a Series of values (the total number missing) for the numerator and another Series of values (the total count) for the denominator. Then, we divide the two, giving us another Series with the result. We then sort.
percent = (uw.isnull().sum()/uw.isnull().count()).sort_values(ascending = False)
We can use a new function called concat from pandas that combines data, either as rows (stacking) or columns (combining). We’ll combine columns, which means concatenating along axis=1. We’ll name both columns. We can do this because each Series has the same index created by pandas - all of our variable names. So, there’s a one-to-one correspondence between the two.
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Missing Percent'])
Let’s take the percents and multiply all of them by 100, just to make them look like percents. And to show you, again, that you can.
missing_data['Missing Percent'] = missing_data['Missing Percent'] * 100
For the last step, we can filter and just get the variable names where more than 10% of our data are missing.
missing_data[missing_data['Missing Percent'] > 10]
Total | Missing Percent | |
---|---|---|
msaname | 1539 | 58.965517 |
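If you find yourself doing this a lot, you could bundle the steps above into a small helper function. A sketch, where the name missing_report and the cutoff argument are made up for illustration:

# Hypothetical helper that wraps the steps above
def missing_report(df, cutoff = 0):
    total = df.isna().sum().sort_values(ascending = False)
    percent = (df.isna().sum() / df.isna().count()).sort_values(ascending = False)
    report = pd.concat([total, percent], axis = 1, keys = ['Total', 'Missing Percent'])
    report['Missing Percent'] = report['Missing Percent'] * 100
    return report[report['Missing Percent'] > cutoff]

missing_report(uw, cutoff = 10)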
4.4.8. Back to stocks#
We can bring back the stock data too, as that data has some missing values.
prices = pd.read_csv('https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/tr_eikon_eod_data.csv',
index_col = 0, parse_dates = True)
Why are there missing values? Holidays and weekends, when trading doesn’t take place.
prices.isna().sum()
AAPL.O 78
MSFT.O 78
INTC.O 78
AMZN.O 78
GS.N 78
SPY 78
.SPX 78
.VIX 78
EUR= 0
XAU= 5
GDX 78
GLD 78
dtype: int64
We can drop these rows. We’ll specify axis=0, or rows.
prices = prices.dropna(axis=0)
prices.isna().sum()
AAPL.O 0
MSFT.O 0
INTC.O 0
AMZN.O 0
GS.N 0
SPY 0
.SPX 0
.VIX 0
EUR= 0
XAU= 0
GDX 0
GLD 0
dtype: int64
prices.head(15)
AAPL.O | MSFT.O | INTC.O | AMZN.O | GS.N | SPY | .SPX | .VIX | EUR= | XAU= | GDX | GLD | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Date | ||||||||||||
2010-01-04 | 30.572827 | 30.950 | 20.880 | 133.900 | 173.08 | 113.33 | 1132.99 | 20.04 | 1.4411 | 1120.00 | 47.71 | 109.80 |
2010-01-05 | 30.625684 | 30.960 | 20.870 | 134.690 | 176.14 | 113.63 | 1136.52 | 19.35 | 1.4368 | 1118.65 | 48.17 | 109.70 |
2010-01-06 | 30.138541 | 30.770 | 20.800 | 132.250 | 174.26 | 113.71 | 1137.14 | 19.16 | 1.4412 | 1138.50 | 49.34 | 111.51 |
2010-01-07 | 30.082827 | 30.452 | 20.600 | 130.000 | 177.67 | 114.19 | 1141.69 | 19.06 | 1.4318 | 1131.90 | 49.10 | 110.82 |
2010-01-08 | 30.282827 | 30.660 | 20.830 | 133.520 | 174.31 | 114.57 | 1144.98 | 18.13 | 1.4412 | 1136.10 | 49.84 | 111.37 |
2010-01-11 | 30.015684 | 30.270 | 20.950 | 130.308 | 171.56 | 114.73 | 1146.98 | 17.55 | 1.4513 | 1152.60 | 50.17 | 112.85 |
2010-01-12 | 29.674256 | 30.070 | 20.608 | 127.350 | 167.82 | 113.66 | 1136.22 | 18.25 | 1.4494 | 1127.30 | 48.35 | 110.49 |
2010-01-13 | 30.092827 | 30.350 | 20.960 | 129.110 | 169.07 | 114.62 | 1145.68 | 17.85 | 1.4510 | 1138.40 | 48.86 | 111.54 |
2010-01-14 | 29.918542 | 30.960 | 21.480 | 127.350 | 168.53 | 114.93 | 1148.46 | 17.63 | 1.4502 | 1142.85 | 48.60 | 112.03 |
2010-01-15 | 29.418542 | 30.860 | 20.800 | 127.140 | 165.21 | 113.64 | 1136.03 | 17.91 | 1.4382 | 1129.90 | 47.42 | 110.86 |
2010-01-19 | 30.719969 | 31.100 | 21.050 | 127.610 | 166.86 | 115.06 | 1150.23 | 17.58 | 1.4298 | 1137.70 | 47.69 | 111.52 |
2010-01-20 | 30.246398 | 30.585 | 21.080 | 125.780 | 167.79 | 113.89 | 1138.04 | 18.68 | 1.4101 | 1111.30 | 45.73 | 108.94 |
2010-01-21 | 29.724542 | 30.010 | 20.840 | 126.620 | 160.87 | 111.70 | 1116.48 | 22.27 | 1.4090 | 1094.75 | 43.75 | 107.37 |
2010-01-22 | 28.249972 | 28.960 | 19.910 | 121.430 | 154.12 | 109.21 | 1091.76 | 27.31 | 1.4137 | 1092.60 | 43.79 | 107.17 |
2010-01-25 | 29.010685 | 29.320 | 20.320 | 120.310 | 154.98 | 109.77 | 1096.78 | 25.41 | 1.4150 | 1097.85 | 43.19 | 107.48 |
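Dropping rows is one choice, but it throws away the non-missing prices on those dates too. Another common approach with price data is to carry the last observed price forward instead. A sketch, re-reading the raw file first since prices was already modified above; whether filling forward makes sense depends on what you’re doing with the data:

# Re-read the raw prices, then forward-fill missing values instead of dropping rows
prices_ffill = pd.read_csv('https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/tr_eikon_eod_data.csv',
                           index_col = 0, parse_dates = True)
# Note: NaNs at the very start of the sample can't be filled forward
prices_ffill = prices_ffill.ffill()
prices_ffill.isna().sum()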
4.4.9. Pyjanitor#
We are going to look at a fun package called pyjanitor, which is based on the janitor package from the R statistical programming language.
To use this package, you’ll need to type the following in the terminal (Mac) or cmd terminal (Windows).
pip install pyjanitor
This will install pyjanitor using pip. Again, if you’re using Windows, make sure that your terminal is cmd and not Powershell. pip works in Powershell, but you have to change a system setting first. You’ll need to restart VS Code once you’ve installed it.
If you’re using Google Colab, there’s currently a bug in pyjanitor that requires you to install a previous version. You can do this in a Jupyter cell. Note that you need to use !pip when running a terminal command inside of a Jupyter notebook.
!pip install pyjanitor==0.23.1
There are even finance-specific tools.
We can bring in janitor and treat it as part of pandas using an import statement.
import janitor
pyjanitor lets us have an interesting workflow. We can read in our data set, remove columns, drop missings, and rename columns, all in one series of steps.
prices = (
pd.read_csv('https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/tr_eikon_eod_data.csv',
index_col = 0, parse_dates = True)
.remove_columns(['GLD'])
.dropna()
.rename_column('AAPL.O', 'AAPL')
.rename_column('MSFT.O', 'MSFT')
)
There are also some built-in, general functions. clean_names() does what it says. For example, it sets all characters in a variable name to lower case.
prices = pd.read_csv('https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/tr_eikon_eod_data.csv',
index_col = 0, parse_dates = True)
prices = prices.clean_names()
Note the syntax - you start with the DataFrame and then apply the function to it.
You might see another syntax, though. You can import the functions directly and then include the DataFrame as an argument to that function. Both ways are detailed in the pyjanitor instructions.
from janitor import clean_names, remove_empty
prices = pd.read_csv('https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/tr_eikon_eod_data.csv',
index_col = 0, parse_dates = True)
prices = clean_names(prices) # See the difference?
The method flag_nulls creates a new variable that will have a 1 if any of the variables specified are missing. In this case, I didn’t specify anything, so it will look across all of the variables. If any variable is NaN, then that row gets a 1. Notice the any.
prices = pd.read_csv('https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/tr_eikon_eod_data.csv',
index_col = 0, parse_dates = True)
prices.flag_nulls()
AAPL.O | MSFT.O | INTC.O | AMZN.O | GS.N | SPY | .SPX | .VIX | EUR= | XAU= | GDX | GLD | null_flag | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Date | |||||||||||||
2010-01-01 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.4323 | 1096.35 | NaN | NaN | 1 |
2010-01-04 | 30.572827 | 30.950 | 20.88 | 133.90 | 173.08 | 113.33 | 1132.99 | 20.04 | 1.4411 | 1120.00 | 47.71 | 109.80 | 0 |
2010-01-05 | 30.625684 | 30.960 | 20.87 | 134.69 | 176.14 | 113.63 | 1136.52 | 19.35 | 1.4368 | 1118.65 | 48.17 | 109.70 | 0 |
2010-01-06 | 30.138541 | 30.770 | 20.80 | 132.25 | 174.26 | 113.71 | 1137.14 | 19.16 | 1.4412 | 1138.50 | 49.34 | 111.51 | 0 |
2010-01-07 | 30.082827 | 30.452 | 20.60 | 130.00 | 177.67 | 114.19 | 1141.69 | 19.06 | 1.4318 | 1131.90 | 49.10 | 110.82 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2018-06-25 | 182.170000 | 98.390 | 50.71 | 1663.15 | 221.54 | 271.00 | 2717.07 | 17.33 | 1.1702 | 1265.00 | 22.01 | 119.89 | 0 |
2018-06-26 | 184.430000 | 99.080 | 49.67 | 1691.09 | 221.58 | 271.60 | 2723.06 | 15.92 | 1.1645 | 1258.64 | 21.95 | 119.26 | 0 |
2018-06-27 | 184.160000 | 97.540 | 48.76 | 1660.51 | 220.18 | 269.35 | 2699.63 | 17.91 | 1.1552 | 1251.62 | 21.81 | 118.58 | 0 |
2018-06-28 | 185.500000 | 98.630 | 49.25 | 1701.45 | 223.42 | 270.89 | 2716.31 | 16.85 | 1.1567 | 1247.88 | 21.93 | 118.22 | 0 |
2018-06-29 | 185.110000 | 98.610 | 49.71 | 1699.80 | 220.57 | 271.28 | 2718.37 | 16.09 | 1.1683 | 1252.25 | 22.31 | 118.65 | 0 |
2216 rows × 13 columns
Finally, get_dupes() gives us a simple way to see if we have any rows of duplicate data. This will happen surprisingly (and unfortunately) often when we start merging data together.
prices.get_dupes()
AAPL.O | MSFT.O | INTC.O | AMZN.O | GS.N | SPY | .SPX | .VIX | EUR= | XAU= | GDX | GLD | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Date |