Merging and reshaping data#

We are going to cover a few topics in this brief section. First, we want to learn about merging or joining data sets. Think of this like xlookup in Excel, or, more generally, SQL queries. In fact, you can do SQL in Python, if you’re familiar with it. We will also see concatenating, or stacking data. We will see how to reshape our data. We’ll see wide and long data. We’ll end with using multi-level indices to shape our data.

I’ll work through a few common examples below using stock data.

Here are some visualizations for what these different functions do.

# Set-up

import numpy as np
import pandas as pd

import matplotlib as mpl 

import matplotlib.pyplot as plt

# Load AAPL and XOM stock data from GitHub
aapl = pd.read_csv('https://raw.githubusercontent.com/aaiken1/fin-data-analysis-python/main/data/aapl.csv')
xom = pd.read_csv('https://raw.githubusercontent.com/aaiken1/fin-data-analysis-python/main/data/xom.csv')

# These CSV files contain historical stock prices and returns

aapl = aapl.set_index('date')
xom = xom.set_index('date')

aapl.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 757 entries, 20190102 to 20211231
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   PERMNO   757 non-null    int64  
 1   TICKER   757 non-null    object 
 2   CUSIP    757 non-null    int64  
 3   DISTCD   13 non-null     float64
 4   DIVAMT   13 non-null     float64
 5   FACPR    13 non-null     float64
 6   FACSHR   13 non-null     float64
 7   PRC      757 non-null    float64
 8   VOL      757 non-null    int64  
 9   RET      757 non-null    float64
 10  SHROUT   757 non-null    int64  
 11  CFACPR   757 non-null    int64  
 12  CFACSHR  757 non-null    int64  
 13  sprtrn   757 non-null    float64
dtypes: float64(7), int64(6), object(1)
memory usage: 88.7+ KB

Our usual set-up is above. Notice that I’m setting the index of each DataFrame using .set_index(), instead of doing it inside of the .read_csv(). Either way works.

When we set date as our index using .set_index(), it becomes the primary identifier for each row. This means it no longer appears as a regular column, but instead is used to label each row for easier lookups and time-series operations.

xom.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 756 entries, 20190102 to 20211230
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   PERMNO   756 non-null    int64  
 1   TICKER   756 non-null    object 
 2   CUSIP    756 non-null    object 
 3   DISTCD   12 non-null     float64
 4   DIVAMT   12 non-null     float64
 5   FACPR    12 non-null     float64
 6   FACSHR   12 non-null     float64
 7   PRC      756 non-null    float64
 8   VOL      756 non-null    int64  
 9   RET      756 non-null    float64
 10  SHROUT   756 non-null    int64  
 11  CFACPR   756 non-null    int64  
 12  CFACSHR  756 non-null    int64  
 13  sprtrn   756 non-null    float64
dtypes: float64(7), int64(5), object(2)
memory usage: 88.6+ KB

To confirm that date is now the index, we can use .info().
This command prints:

The total number of rows and columns.
Data types of each column.
Whether an index is set (look at the first line, where it should mention “Index: Int64Index” instead of a column name).

xom.index

Int64Index([20190102, 20190103, 20190104, 20190107, 20190108, 20190109,
            20190110, 20190111, 20190114, 20190115,
            ...
            20211216, 20211217, 20211220, 20211221, 20211222, 20211223,
            20211227, 20211228, 20211229, 20211230],
           dtype='int64', name='date', length=756)

Concatenating data sets#

Appending, or concatenation, or stacking data. This is when you simply put one data set on top of another. However, you need to be careful! Do the data sets share the same variables? How are index values handled?

You can use .concat() from pandas. Notice that our two datasets don’t have exactly the same number of rows. Older code might use .append() from pandas.

pd.concat((aapl, xom), sort=False)  

	PERMNO	TICKER	CUSIP	DISTCD	DIVAMT	FACPR	FACSHR	PRC	VOL	RET	SHROUT	CFACPR	CFACSHR	sprtrn
date
20190102	14593	AAPL	3783310	NaN	NaN	NaN	NaN	157.92000	37066356	0.001141	4729803	4	4	0.001269
20190103	14593	AAPL	3783310	NaN	NaN	NaN	NaN	142.19000	91373695	-0.099607	4729803	4	4	-0.024757
20190104	14593	AAPL	3783310	NaN	NaN	NaN	NaN	148.25999	58603001	0.042689	4729803	4	4	0.034336
20190107	14593	AAPL	3783310	NaN	NaN	NaN	NaN	147.92999	54770364	-0.002226	4729803	4	4	0.007010
20190108	14593	AAPL	3783310	NaN	NaN	NaN	NaN	150.75000	41026062	0.019063	4729803	4	4	0.009695
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
20211223	11850	XOM	30231G10	NaN	NaN	NaN	NaN	61.02000	13543291	0.000492	4233567	1	1	0.006224
20211227	11850	XOM	30231G10	NaN	NaN	NaN	NaN	61.89000	12596340	0.014258	4233567	1	1	0.013839
20211228	11850	XOM	30231G10	NaN	NaN	NaN	NaN	61.69000	12786873	-0.003232	4233567	1	1	-0.001010
20211229	11850	XOM	30231G10	NaN	NaN	NaN	NaN	61.15000	12733601	-0.008753	4233567	1	1	0.001402
20211230	11850	XOM	30231G10	NaN	NaN	NaN	NaN	60.79000	11940294	-0.005887	4233567	1	1	-0.002990

1513 rows × 14 columns

See how there are 1513 rows, but the index for XOM only goes to 755? We didn’t ignore the index, so the original index comes over. The first observation of both DataFrames starts at 0. There are 756 AAPL observations and 755 XOM.

Let’s ignore the index now.

pd.concat((aapl, xom), ignore_index=True, sort=False)  

	PERMNO	TICKER	CUSIP	DISTCD	DIVAMT	FACPR	FACSHR	PRC	VOL	RET	SHROUT	CFACPR	CFACSHR	sprtrn
0	14593	AAPL	3783310	NaN	NaN	NaN	NaN	157.92000	37066356	0.001141	4729803	4	4	0.001269
1	14593	AAPL	3783310	NaN	NaN	NaN	NaN	142.19000	91373695	-0.099607	4729803	4	4	-0.024757
2	14593	AAPL	3783310	NaN	NaN	NaN	NaN	148.25999	58603001	0.042689	4729803	4	4	0.034336
3	14593	AAPL	3783310	NaN	NaN	NaN	NaN	147.92999	54770364	-0.002226	4729803	4	4	0.007010
4	14593	AAPL	3783310	NaN	NaN	NaN	NaN	150.75000	41026062	0.019063	4729803	4	4	0.009695
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1508	11850	XOM	30231G10	NaN	NaN	NaN	NaN	61.02000	13543291	0.000492	4233567	1	1	0.006224
1509	11850	XOM	30231G10	NaN	NaN	NaN	NaN	61.89000	12596340	0.014258	4233567	1	1	0.013839
1510	11850	XOM	30231G10	NaN	NaN	NaN	NaN	61.69000	12786873	-0.003232	4233567	1	1	-0.001010
1511	11850	XOM	30231G10	NaN	NaN	NaN	NaN	61.15000	12733601	-0.008753	4233567	1	1	0.001402
1512	11850	XOM	30231G10	NaN	NaN	NaN	NaN	60.79000	11940294	-0.005887	4233567	1	1	-0.002990

1513 rows × 14 columns

Here we ignored the index, so the new index goes from 0 to 1512. What should you do? I would ignore the index and reset it in this situation. Once this data is stacked, we have long or narrow data, where each observation is uniquely identified by the date and a security ID. We have three security IDs in this data: PERMNO, TICKER, and CUSIP. In this particular situation, they should all three work. The PERMNO is the only one that is guaranteed to not change over time. Stocks change tickers. Tickers get reused. Same with CUSIPs.

Stacking data sets usually occurs when you’re doing fancier looping, which creates some set of data or results for different subgroups (e.g. different years). You then can stack the subgroups on top of each other to get a finished, complete data set.

Reshaping data sets#

We are going to see two types of data, generally speaking: wide data and long, or narrow data. We will be melting our data (going from wide to long) or pivoting our data (going from long to wide).

In general, wide data has a separate column for each variable (e.g. returns, volume, etc.). Each row is then an observation for a security particular unit or firm.

In finance, it can get more complicated, as wide data might have a date column and then columns like aapl_ret, msft_volume, etc. Notice how the firm identifiers are mixed with the variable type?

Long data would have columns like firm, variable, and value. Then, each firm would appear on multiple rows, instead of a single row, and the variable column would have multiple values like return, volume, etc. Finally, the specific value for that variable would be in the value column.

You can also add an extra dimension for time and add a date column. Then, you would multiple variables per firm on specific dates.

Long data is sometimes called tidy data. It is, generally speaking, easier to summarize and deal with. Every variable value (e.g. a return) has a key paired with it (e.g. firm ID and date).

Here’s another explanation of these methods.

Let’s start with going from wide to long. We will use pd.melt to do this.

wide = pd.read_csv('https://raw.githubusercontent.com/aaiken1/fin-data-analysis-python/main/data/wide_example.csv')
wide

	PERMNO	date	COMNAM	CUSIP	PRC	RET	OPENPRC	NUMTRD
0	10107	20190513	MICROSOFT CORP	59491810	123.35000	-0.029733	124.11000	288269.0
1	11850	20190513	EXXON MOBIL CORP	30231G10	75.71000	-0.011102	75.65000	NaN
2	14593	20190513	APPLE INC	03783310	185.72000	-0.058119	187.71001	462090.0
3	93436	20190513	TESLA INC	88160R10	227.00999	-0.052229	232.00999	139166.0

This data has four firms, all on a single date. There are actually three IDs here: PERMNO, COMNAM, and CUSIP. PERMNO is the permanent security ID assigned by CRSP. COMNAM is obviously the firm name. CUSIP is a security-level ID, so multiple securities (e.g. bonds) from the same firm will have different CUSIPs, though the first six digits of the CUSIP will ID the firm. When you look up fund holdings with the SEC, you’ll get a CUSIP for each security held by the mutual fund, hedge fund, etc.

There are then four variables for each security: price (PRC), return (RET), the price at the open (OPENPRC), and the number of trades on the NASDAQ exchange (NUMTRD) Exxon doesn’t trade on the NASDAQ, so that variable is missing, or NaN.

We can use pd.melt to take this wide-like data and make it long. Why do I say “wide-like”? This data is kind of a in-between class wide and classic long data. Classic wide data would have each column as a ticker or firm ID, with a single date column. Then, underneath each ticker, you might have that stock’s return on that day. What if you have both prices and returns? Well, each column might be something like AAPL_ret and AAPL_prc. This is not really ideal. But - note how many shapes the same type of data can have!

I’m sure that there is a more technical term than “wide-like”. You can find examples of multi-index data frames below, which start to explore some of the complexities with how we structure our data.

Let’s make this data even longer. Long data will have a firm identifier (or several), a date, a column that contains the names of the different variables (e.g. PRC, RET), and a column for the values of the variables. I will keep PERMNO as my ID, keep PRC and RET, name my variable column vars and my value column values. I’ll save this to a new DataFrame called long.

long = pd.melt(wide, id_vars = ['PERMNO'], value_vars = ['PRC', 'RET'], var_name = 'vars', value_name = 'values')
long

	PERMNO	vars	values
0	10107	PRC	123.350000
1	11850	PRC	75.710000
2	14593	PRC	185.720000
3	93436	PRC	227.009990
4	10107	RET	-0.029733
5	11850	RET	-0.011102
6	14593	RET	-0.058119
7	93436	RET	-0.052229

You don’t have to explicitly specify everything that I did. Here, I melt and just tell it to create a PERMNO ID column. Keep everything else. You can really see the long part now.

pd.melt(wide, id_vars = ['PERMNO'])

	PERMNO	variable	value
0	10107	date	20190513
1	11850	date	20190513
2	14593	date	20190513
3	93436	date	20190513
4	10107	COMNAM	MICROSOFT CORP
5	11850	COMNAM	EXXON MOBIL CORP
6	14593	COMNAM	APPLE INC
7	93436	COMNAM	TESLA INC
8	10107	CUSIP	59491810
9	11850	CUSIP	30231G10
10	14593	CUSIP	03783310
11	93436	CUSIP	88160R10
12	10107	PRC	123.35
13	11850	PRC	75.71
14	14593	PRC	185.72
15	93436	PRC	227.00999
16	10107	RET	-0.029733
17	11850	RET	-0.011102
18	14593	RET	-0.058119
19	93436	RET	-0.052229
20	10107	OPENPRC	124.11
21	11850	OPENPRC	75.65
22	14593	OPENPRC	187.71001
23	93436	OPENPRC	232.00999
24	10107	NUMTRD	288269.0
25	11850	NUMTRD	NaN
26	14593	NUMTRD	462090.0
27	93436	NUMTRD	139166.0

You can also have two different identifying variables. This is helpful if you have panel data, with multiple firm observations across different dates. This is obviously very common in finance.

long2 = pd.melt(wide, id_vars = ['PERMNO', 'date'], value_vars = ['PRC', 'RET'], var_name = 'vars', value_name = 'values')
long2

	PERMNO	date	vars	values
0	10107	20190513	PRC	123.350000
1	11850	20190513	PRC	75.710000
2	14593	20190513	PRC	185.720000
3	93436	20190513	PRC	227.009990
4	10107	20190513	RET	-0.029733
5	11850	20190513	RET	-0.011102
6	14593	20190513	RET	-0.058119
7	93436	20190513	RET	-0.052229

We can put our long data back to wide using pd.pivot. You need to tell it the column that has the values, the column that has the variable names, and the column to create your index (ID).

wide2 = pd.pivot(long, values = 'values', columns = 'vars', index = 'PERMNO')
wide2

vars	PRC	RET
PERMNO
10107	123.35000	-0.029733
11850	75.71000	-0.011102
14593	185.72000	-0.058119
93436	227.00999	-0.052229

We can use the long data that kept the date to create wide data with multiple indices: the firm ID and the date. This will again be very helpful when we deal with multiple firms over multiple time periods. The firm ID and the date will uniquely identify each observation.

wide3 = pd.pivot(long2, values = 'values', columns = 'vars', index = ['PERMNO', 'date'])
wide3

	vars	PRC	RET
PERMNO	date
10107	20190513	123.35000	-0.029733
11850	20190513	75.71000	-0.011102
14593	20190513	185.72000	-0.058119
93436	20190513	227.00999	-0.052229

Joining data sets#

You will need to join, or merge data sets all of the time. Data sets have identifiers, or keys. Python calls these the data’s index. For us, this could be a stock ticker, a PERMNO, or a CUSIP. Finance data sets also have dates typically. If we have several different data sets, we can merge them together using keys.

To do this, we need to understand the shape of our data, discussed above.

../_images/04-joins.png — Fig. 33 Visualization for the four main ways to merge, or join, data.#

We will use pd.join to do this. I’m going to bring in two sets of data from CRSP. Each data set has the same three securities, XOM, AAPL, and TLT, over the same time period (Jan 2019 - Dec 2021). But, they have some different variables.

crsp1 = pd.read_csv('https://raw.githubusercontent.com/aaiken1/fin-data-analysis-python/main/data/crsp_022722.csv')
crsp2 = pd.read_csv('https://raw.githubusercontent.com/aaiken1/fin-data-analysis-python/main/data/crsp_030622.csv')

We can look at each DataFrame and see what they have.

crsp1

	PERMNO	date	TICKER	CUSIP	DISTCD	DIVAMT	FACPR	FACSHR	PRC	VOL	RET	SHROUT	CFACPR	CFACSHR	sprtrn
0	11850	20190102	XOM	30231G10	NaN	NaN	NaN	NaN	69.69000	16727246	0.021997	4233807	1	1	0.001269
1	11850	20190103	XOM	30231G10	NaN	NaN	NaN	NaN	68.62000	13866115	-0.015354	4233807	1	1	-0.024757
2	11850	20190104	XOM	30231G10	NaN	NaN	NaN	NaN	71.15000	16043642	0.036870	4233807	1	1	0.034336
3	11850	20190107	XOM	30231G10	NaN	NaN	NaN	NaN	71.52000	10844159	0.005200	4233807	1	1	0.007010
4	11850	20190108	XOM	30231G10	NaN	NaN	NaN	NaN	72.04000	11438966	0.007271	4233807	1	1	0.009695
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2266	89468	20211227	TLT	46428743	NaN	NaN	NaN	NaN	148.88000	7854290	0.002424	132000	1	1	0.013839
2267	89468	20211228	TLT	46428743	NaN	NaN	NaN	NaN	148.28999	9173504	-0.003963	133200	1	1	-0.001010
2268	89468	20211229	TLT	46428743	NaN	NaN	NaN	NaN	146.67000	11763496	-0.010925	133400	1	1	0.001402
2269	89468	20211230	TLT	46428743	NaN	NaN	NaN	NaN	147.89999	10340148	0.008386	133400	1	1	-0.002990
2270	89468	20211231	TLT	46428743	NaN	NaN	NaN	NaN	148.19000	13389734	0.001961	132800	1	1	-0.002626

2271 rows × 15 columns

crsp2

	PERMNO	date	TICKER	COMNAM	CUSIP	RETX
0	11850	20190102	XOM	EXXON MOBIL CORP	30231G10	0.021997
1	11850	20190103	XOM	EXXON MOBIL CORP	30231G10	-0.015354
2	11850	20190104	XOM	EXXON MOBIL CORP	30231G10	0.036870
3	11850	20190107	XOM	EXXON MOBIL CORP	30231G10	0.005200
4	11850	20190108	XOM	EXXON MOBIL CORP	30231G10	0.007271
...	...	...	...	...	...	...
2266	89468	20211227	TLT	ISHARES TRUST	46428743	0.002424
2267	89468	20211228	TLT	ISHARES TRUST	46428743	-0.003963
2268	89468	20211229	TLT	ISHARES TRUST	46428743	-0.010925
2269	89468	20211230	TLT	ISHARES TRUST	46428743	0.008386
2270	89468	20211231	TLT	ISHARES TRUST	46428743	0.001961

2271 rows × 6 columns

We can see that they each have the same IDs (e.g. PERMNO) and date. The crsp2 DataFrame has an additional variable, RETX, or returns without dividends, that I want to merge in. I’m going to simplify the crsp2 DataFrame and only keep three variables: PERMNO, date, and retx. I will then merge using PERMNO and date as my keys. These two variables uniquely identify each row.

I will do an inner join. This will only keep PERMNO-date observations that are in both data sets. An outer join would keep all observations, including those that are in one dataset, but not the other. This might be helpful if you had PERMNOs in one data set, but not the other, and you wanted to keep all of your data.

For this particular data, there should be a 1:1 correspondence between the observations.

crsp2_clean = crsp2[['PERMNO', 'date', 'RETX']]
merged = crsp1.merge(crsp2_clean, how = 'inner', on = ['PERMNO', 'date'] )
merged

	PERMNO	date	TICKER	CUSIP	DISTCD	DIVAMT	FACPR	FACSHR	PRC	VOL	RET	SHROUT	CFACPR	CFACSHR	sprtrn	RETX
0	11850	20190102	XOM	30231G10	NaN	NaN	NaN	NaN	69.69000	16727246	0.021997	4233807	1	1	0.001269	0.021997
1	11850	20190103	XOM	30231G10	NaN	NaN	NaN	NaN	68.62000	13866115	-0.015354	4233807	1	1	-0.024757	-0.015354
2	11850	20190104	XOM	30231G10	NaN	NaN	NaN	NaN	71.15000	16043642	0.036870	4233807	1	1	0.034336	0.036870
3	11850	20190107	XOM	30231G10	NaN	NaN	NaN	NaN	71.52000	10844159	0.005200	4233807	1	1	0.007010	0.005200
4	11850	20190108	XOM	30231G10	NaN	NaN	NaN	NaN	72.04000	11438966	0.007271	4233807	1	1	0.009695	0.007271
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2266	89468	20211227	TLT	46428743	NaN	NaN	NaN	NaN	148.88000	7854290	0.002424	132000	1	1	0.013839	0.002424
2267	89468	20211228	TLT	46428743	NaN	NaN	NaN	NaN	148.28999	9173504	-0.003963	133200	1	1	-0.001010	-0.003963
2268	89468	20211229	TLT	46428743	NaN	NaN	NaN	NaN	146.67000	11763496	-0.010925	133400	1	1	0.001402	-0.010925
2269	89468	20211230	TLT	46428743	NaN	NaN	NaN	NaN	147.89999	10340148	0.008386	133400	1	1	-0.002990	0.008386
2270	89468	20211231	TLT	46428743	NaN	NaN	NaN	NaN	148.19000	13389734	0.001961	132800	1	1	-0.002626	0.001961

2271 rows × 16 columns

You can see the RETX variable at the end of the merged DataFrame.

What happens if you don’t clean up with crsp2 data first?

merged = crsp1.merge(crsp2, how = 'inner', on = ['PERMNO', 'date'] )
merged

	PERMNO	date	TICKER_x	CUSIP_x	DISTCD	DIVAMT	FACPR	FACSHR	PRC	VOL	RET	SHROUT	CFACPR	CFACSHR	sprtrn	TICKER_y	COMNAM	CUSIP_y	RETX
0	11850	20190102	XOM	30231G10	NaN	NaN	NaN	NaN	69.69000	16727246	0.021997	4233807	1	1	0.001269	XOM	EXXON MOBIL CORP	30231G10	0.021997
1	11850	20190103	XOM	30231G10	NaN	NaN	NaN	NaN	68.62000	13866115	-0.015354	4233807	1	1	-0.024757	XOM	EXXON MOBIL CORP	30231G10	-0.015354
2	11850	20190104	XOM	30231G10	NaN	NaN	NaN	NaN	71.15000	16043642	0.036870	4233807	1	1	0.034336	XOM	EXXON MOBIL CORP	30231G10	0.036870
3	11850	20190107	XOM	30231G10	NaN	NaN	NaN	NaN	71.52000	10844159	0.005200	4233807	1	1	0.007010	XOM	EXXON MOBIL CORP	30231G10	0.005200
4	11850	20190108	XOM	30231G10	NaN	NaN	NaN	NaN	72.04000	11438966	0.007271	4233807	1	1	0.009695	XOM	EXXON MOBIL CORP	30231G10	0.007271
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2266	89468	20211227	TLT	46428743	NaN	NaN	NaN	NaN	148.88000	7854290	0.002424	132000	1	1	0.013839	TLT	ISHARES TRUST	46428743	0.002424
2267	89468	20211228	TLT	46428743	NaN	NaN	NaN	NaN	148.28999	9173504	-0.003963	133200	1	1	-0.001010	TLT	ISHARES TRUST	46428743	-0.003963
2268	89468	20211229	TLT	46428743	NaN	NaN	NaN	NaN	146.67000	11763496	-0.010925	133400	1	1	0.001402	TLT	ISHARES TRUST	46428743	-0.010925
2269	89468	20211230	TLT	46428743	NaN	NaN	NaN	NaN	147.89999	10340148	0.008386	133400	1	1	-0.002990	TLT	ISHARES TRUST	46428743	0.008386
2270	89468	20211231	TLT	46428743	NaN	NaN	NaN	NaN	148.19000	13389734	0.001961	132800	1	1	-0.002626	TLT	ISHARES TRUST	46428743	0.001961

2271 rows × 19 columns

Do you see the _x and _y suffixes? pd.merge adds that when you have two variables with the same name in both data sets that you are not using as key values. The variables that you merge on, the key values, truly get merged together, and so you only get one column for PERMNO and one column for date, in this example. But, TICKER is in both data sets and is not a key value. So, when you merge the data, you’re going to have two columns names TICKER. The _x means that the variable is from the left (first) data set and the _y means that the variable is from the right (second) data set.

You can read more about pd.merge here. Bad merges that lose observations are a common problem.

You have to understand your data structure before trying to combine DataFrames together.

Multiple Levels#

So far, we’ve only defined data with one index value, like a date or a stock ticker. But - look at the return data above. There is both a stock and a date. When data has an ID and a time dimension, we call that panel data. For example, stock-level or firm-level data through time. pandas can help us deal with this type of data as well.

You can read more about MultiIndex data on the pandas help page.

This type of data is also called hierarchical. Let’s use the crsp2 DataFrame to see what’s going on.

crsp2

	PERMNO	date	TICKER	COMNAM	CUSIP	RETX
0	11850	20190102	XOM	EXXON MOBIL CORP	30231G10	0.021997
1	11850	20190103	XOM	EXXON MOBIL CORP	30231G10	-0.015354
2	11850	20190104	XOM	EXXON MOBIL CORP	30231G10	0.036870
3	11850	20190107	XOM	EXXON MOBIL CORP	30231G10	0.005200
4	11850	20190108	XOM	EXXON MOBIL CORP	30231G10	0.007271
...	...	...	...	...	...	...
2266	89468	20211227	TLT	ISHARES TRUST	46428743	0.002424
2267	89468	20211228	TLT	ISHARES TRUST	46428743	-0.003963
2268	89468	20211229	TLT	ISHARES TRUST	46428743	-0.010925
2269	89468	20211230	TLT	ISHARES TRUST	46428743	0.008386
2270	89468	20211231	TLT	ISHARES TRUST	46428743	0.001961

2271 rows × 6 columns

We have PERMNO, our ID, and date. Together, these each define a unique observation.

We can see that crsp2 doesn’t have an index that we’ve defined. It is just the default number.

crsp2.index

RangeIndex(start=0, stop=2271, step=1)

I’ll now set two indices for the CRSP data and save this as a new DataFrame called crsp3. I’m using the pandas .set_index() function and passing the name of the columns in the list.

I then sort the data using .sort_index(). This does what it says - it sorts by date and then PERMNO.

crsp3 = crsp2.set_index(['date', 'PERMNO'])

crsp3.sort_index()

crsp3

		TICKER	COMNAM	CUSIP	RETX
date	PERMNO
20190102	11850	XOM	EXXON MOBIL CORP	30231G10	0.021997
20190103	11850	XOM	EXXON MOBIL CORP	30231G10	-0.015354
20190104	11850	XOM	EXXON MOBIL CORP	30231G10	0.036870
20190107	11850	XOM	EXXON MOBIL CORP	30231G10	0.005200
20190108	11850	XOM	EXXON MOBIL CORP	30231G10	0.007271
...	...	...	...	...	...
20211227	89468	TLT	ISHARES TRUST	46428743	0.002424
20211228	89468	TLT	ISHARES TRUST	46428743	-0.003963
20211229	89468	TLT	ISHARES TRUST	46428743	-0.010925
20211230	89468	TLT	ISHARES TRUST	46428743	0.008386
20211231	89468	TLT	ISHARES TRUST	46428743	0.001961

2271 rows × 4 columns

You can see the multiple levels now. You can perform operations by using these indices.

Let’s group by PERMNO, or our Level 1 index, and calculate the average return for each stock across all dates.

crsp3.groupby(level = 1)['RETX'].mean()

PERMNO
11850    0.000124
14593    0.002221
89468    0.000315
Name: RETX, dtype: float64

Let’s do the same thing using the first level, or the date. This gives the average return of the three stocks on each date.

crsp3.groupby(level = 0)['RETX'].mean()

date
20190102    0.009468
20190103   -0.034527
20190104    0.022661
20190107    0.000009
20190108    0.007902
              ...   
20211227    0.013219
20211228   -0.004321
20211229   -0.006392
20211230   -0.001360
20211231    0.001669
Name: RETX, Length: 757, dtype: float64

As before, we can use to.frame() to go from a series to a DataFrame and make the output look a little nicer.

crsp3.groupby(level = 1)['RETX'].mean().to_frame()

	RETX
PERMNO
11850	0.000124
14593	0.002221
89468	0.000315

We can use the name of the levels, too. Here I’ll find the min and the max return for each stock and save it as a new DataFrame.

crsp3_agg = crsp3.groupby(level = ['PERMNO'])['RETX'].agg(['max', 'min'])
crsp3_agg

	max	min
PERMNO
11850	0.126868	-0.122248
14593	0.119808	-0.128647
89468	0.075196	-0.066683

With indices defined, I can use .unstack() to from long to wide data. I’ll take the return and create new columns for each stock.

crsp3['RETX'].unstack()

PERMNO	11850	14593	89468
date
20190102	0.021997	0.001141	0.005267
20190103	-0.015354	-0.099607	0.011379
20190104	0.036870	0.042689	-0.011575
20190107	0.005200	-0.002226	-0.002948
20190108	0.007271	0.019063	-0.002628
...	...	...	...
20211227	0.014258	0.022975	0.002424
20211228	-0.003232	-0.005767	-0.003963
20211229	-0.008753	0.000502	-0.010925
20211230	-0.005887	-0.006578	0.008386
20211231	0.006580	-0.003535	0.001961

757 rows × 3 columns

Note

The PERMNOs are now the column headers, so we’re treating each stock like a variable. You’ll see a lot of finance data like this, with tickers or another ID as columns. But, you can no longer merge on the stock if you wanted to bring in additional variables! This is why the organization of your data is so important. Long data is easier to deal with when combining data sets.

If I had defined the PERMNO as the Level 0 index and date as the Level 1 index, then I would have unstacked my data with dates as the columns and the stocks as the rows. That’s not what I wanted, though.

Columns and levels#

The unstacked data above has a PERMNO for each column. Underneath each PERMNO is a return. But, what if we have, say, both return and price data?

We can also add levels to our columns, creating groups of variables when we have wide data. Let’s go back to the wide3 DataFrame from above. We can also go even wider. Let’s get of the two indices, PERMNO and date, in wide3 and turn them both back into columns.

wide3 = wide3.reset_index()
wide3

vars	PERMNO	date	PRC	RET
0	10107	20190513	123.35000	-0.029733
1	11850	20190513	75.71000	-0.011102
2	14593	20190513	185.72000	-0.058119
3	93436	20190513	227.00999	-0.052229

Now, let’s create a DataFrame where just date is an index and go wider, where each column is now either a RET or a PRC for each PERMNO.

wide4 = pd.pivot(wide3, values = ['RET', 'PRC'], columns = 'PERMNO', index = 'date')
wide4

	RET				PRC
PERMNO	10107	11850	14593	93436	10107	11850	14593	93436
date
20190513	-0.029733	-0.011102	-0.058119	-0.052229	123.35	75.71	185.72	227.00999

We only had one date, so this isn’t a very interesting DataFrame.

I can take my DataFrame, select the index, and get the name of my index. Remember, indices are about the rows.

wide4.index.names

FrozenList(['date'])

But, you can tell from the above DataFrame that my columns now look different. They have what pandas calls levels.

wide4.columns.levels

FrozenList([['RET', 'PRC'], [10107, 11850, 14593, 93436]])

See how there are two lists? One for each level of column.

Data Analysis in Finance

Merging and reshaping data

Contents