Importing data#

Let’s start by getting some data into Python. This will usually be the first step in our code, right after we import any of the libraries, like numpy and pandas that we need.

There is a DataCamp tutorial on Excel files and Python that you might find helpful. You can read also more about reading in CSV files at DataCamp.

We’ll typically be either importing our data from CSV and Excel files or using data APIs (Application Programming Interfaces) to pull in data from external sources. We’ll do more of the latter in a separate section of the notes.

There’s some price data up on our course Github page. Let’s import this both as a downloaded CSV file and then directly from Github. No need to download the file to your computer and deal with folders.

First, we need to import pandas. Just like numpy. In fact, at the top of basically all of your code, you will import both of these libraries (and more!).

Everyone imports pandas as pd.

import pandas as pd

Let’s start by making a DataFrame ourself, just to see how they are different from numpy arrays.

df = pd.DataFrame([75,79,82,60], columns = ['Prices'], index =[1,2,3,4])

This is similar to an example from Python for Finance, 2e. I am creating a DataFrame with one column. This column has four stock prices. I am labeling that column as Prices. I then have an index value that refers to each row. I am using numbers. This could easily be a date and, in fact, will be later on when we import real price data.

See how this is kind of like a spreadsheet? And, unlike an array, we have labels or headers for columns. And we have indexes for each row that aren’t really part of the data.

We’ll do more with this in the next section.

Ok, let’s import that CSV file into a DataFrame. You’ll need to data on your local computer (or Google Drive) for code like this to work. We’ll change that in a second.

prices = pd.read_csv('../data/tr_eikon_eod_data.csv',
                  index_col = 0, parse_dates = True)

First, if you’re in Jupyter Notebook in VS Code, click the Variables button at the top. Google Colab has a similar button on the left. You’ll see the two DataFrames that we now have in memory appear below. Click the pop-out button to view them.

Notice how I did the path name. ../ means to start start from my home directory. What’s my home directory? That’s where your .ipynb file lives. Relative to that directory, Python looks for a folder called data. Finally, it looks for the .csv file called tr_eikon_eod_data.csv inside of that folder.

What are those other options doing? index_col=0 tells pandas to create a DataFrame where column 0 (the first column in the .csv file) is the index. Check out the CSV file - that’s the date. With finance data, the date will often be the index, since we are dealing with time series data (e.g. stock prices or returns).

Sometimes data will be multi-index. For example, if we have multiple stocks and dates. More on this later - it depends on how the data are organized.

The option parse_dates is telling pandas to look at the index column and try to turn what it sees into official Python dates. That works in this case. Finance coders will joke that 90% of their time is spent doing date corrections. For example, what happens if you have multiple markets in different time zones and you’re trying to deal with time series at a trade-level frequency (i.e. less than a second)? Have fun!

You can read about the pd.read_csv method on the pandas web page.

You can also write Python data to a .csv or Excel file.

Let’s now read in this file directly from our class Github page. When reading data in from Github, you won’t have to download it first. I’ll call the resulting DataFrame prices2, just so that we can see that we actually have the same thing twice.

prices2 = pd.read_csv('https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/tr_eikon_eod_data.csv')

I made a “mistake” on purpose there. Notice how I didn’t do the index_col? Check out the prices2 DataFrame below. You’ll see an index column with 1, 2, 3,.. and then the date column. The read_csv created an index for me, since I didn’t specify one.

prices2
Date AAPL.O MSFT.O INTC.O AMZN.O GS.N SPY .SPX .VIX EUR= XAU= GDX GLD
0 2010-01-01 NaN NaN NaN NaN NaN NaN NaN NaN 1.4323 1096.35 NaN NaN
1 2010-01-04 30.572827 30.950 20.88 133.90 173.08 113.33 1132.99 20.04 1.4411 1120.00 47.71 109.80
2 2010-01-05 30.625684 30.960 20.87 134.69 176.14 113.63 1136.52 19.35 1.4368 1118.65 48.17 109.70
3 2010-01-06 30.138541 30.770 20.80 132.25 174.26 113.71 1137.14 19.16 1.4412 1138.50 49.34 111.51
4 2010-01-07 30.082827 30.452 20.60 130.00 177.67 114.19 1141.69 19.06 1.4318 1131.90 49.10 110.82
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2211 2018-06-25 182.170000 98.390 50.71 1663.15 221.54 271.00 2717.07 17.33 1.1702 1265.00 22.01 119.89
2212 2018-06-26 184.430000 99.080 49.67 1691.09 221.58 271.60 2723.06 15.92 1.1645 1258.64 21.95 119.26
2213 2018-06-27 184.160000 97.540 48.76 1660.51 220.18 269.35 2699.63 17.91 1.1552 1251.62 21.81 118.58
2214 2018-06-28 185.500000 98.630 49.25 1701.45 223.42 270.89 2716.31 16.85 1.1567 1247.88 21.93 118.22
2215 2018-06-29 185.110000 98.610 49.71 1699.80 220.57 271.28 2718.37 16.09 1.1683 1252.25 22.31 118.65

2216 rows × 13 columns

Let’s redo that.

prices2 = pd.read_csv('https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/tr_eikon_eod_data.csv',
                      index_col = 0, parse_dates = True)

prices2
AAPL.O MSFT.O INTC.O AMZN.O GS.N SPY .SPX .VIX EUR= XAU= GDX GLD
Date
2010-01-01 NaN NaN NaN NaN NaN NaN NaN NaN 1.4323 1096.35 NaN NaN
2010-01-04 30.572827 30.950 20.88 133.90 173.08 113.33 1132.99 20.04 1.4411 1120.00 47.71 109.80
2010-01-05 30.625684 30.960 20.87 134.69 176.14 113.63 1136.52 19.35 1.4368 1118.65 48.17 109.70
2010-01-06 30.138541 30.770 20.80 132.25 174.26 113.71 1137.14 19.16 1.4412 1138.50 49.34 111.51
2010-01-07 30.082827 30.452 20.60 130.00 177.67 114.19 1141.69 19.06 1.4318 1131.90 49.10 110.82
... ... ... ... ... ... ... ... ... ... ... ... ...
2018-06-25 182.170000 98.390 50.71 1663.15 221.54 271.00 2717.07 17.33 1.1702 1265.00 22.01 119.89
2018-06-26 184.430000 99.080 49.67 1691.09 221.58 271.60 2723.06 15.92 1.1645 1258.64 21.95 119.26
2018-06-27 184.160000 97.540 48.76 1660.51 220.18 269.35 2699.63 17.91 1.1552 1251.62 21.81 118.58
2018-06-28 185.500000 98.630 49.25 1701.45 223.42 270.89 2716.31 16.85 1.1567 1247.88 21.93 118.22
2018-06-29 185.110000 98.610 49.71 1699.80 220.57 271.28 2718.37 16.09 1.1683 1252.25 22.31 118.65

2216 rows × 12 columns

That’s better! We’ll end by looking at some of the attributes of our prices DataFrame.

First, let’s check our index. Is it really a date?

prices.index
DatetimeIndex(['2010-01-01', '2010-01-04', '2010-01-05', '2010-01-06',
               '2010-01-07', '2010-01-08', '2010-01-11', '2010-01-12',
               '2010-01-13', '2010-01-14',
               ...
               '2018-06-18', '2018-06-19', '2018-06-20', '2018-06-21',
               '2018-06-22', '2018-06-25', '2018-06-26', '2018-06-27',
               '2018-06-28', '2018-06-29'],
              dtype='datetime64[ns]', name='Date', length=2216, freq=None)

It is! We’ll do more with dates. A lot more.

Let’s look at the columns.

prices.columns
Index(['AAPL.O', 'MSFT.O', 'INTC.O', 'AMZN.O', 'GS.N', 'SPY', '.SPX', '.VIX',
       'EUR=', 'XAU=', 'GDX', 'GLD'],
      dtype='object')

Yup, those are our column headers from the .csv file.

There are many options that can go into the pd.read_csv function. For example, you can load just a certain number of rows.

prices100 = pd.read_csv('https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/tr_eikon_eod_data.csv',
                      index_col = 0, parse_dates = True, nrows = 100)

You can bring in just certain columns. In pandas, we select columns using ['COLUMN'].

prices_spx= pd.read_csv('https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/tr_eikon_eod_data.csv',
                      index_col = 0, parse_dates = True, usecols = (['.SPX']))

You can also specify if a particular value should be set to missing. read_csv() already looks for common items, like ‘NA’ or ‘n/a’. I’ll have it look for ‘-’ as well, though that isn’t a missing value in this data.

prices2 = pd.read_csv('https://github.com/aaiken1/fin-data-analysis-python/raw/main/data/tr_eikon_eod_data.csv',
                      index_col = 0, parse_dates = True, na_values = '-')

We can also read data in from Excel files, using .read_excel(). I’ll read in some HF return data.

hf = pd.read_excel('https://raw.githubusercontent.com/aaiken1/fin-data-analysis-python/main/data/hf_rets.xlsx',
                  index_col = 0, parse_dates = True, sheet_name = 'rets')  

We can use the option sheet_name= to specific a particular sheet from the Excel file. You can read more about the various options here.

We have our data in Python! Now, we can manipulate it, clean it, merge it with something else, summarize it, plot it, and do some actual finance.