Using APIs for Data Imports #
This chapter starts by using NASDAQ Data Link to download some BTC price and return data. We’ll also see our first set of simulations. I’ll then show you how to use Pandas Data Reader.
This is also our first time using an API. Their API, or Application Programming Interface, lets us talk to a remote data storage system and pull in what we need. APIs are more general, though, and are used whenever you need one application to talk to another.
We’ll use the NASDAQ Data Link. They also have Python-specific instructions.
You can read about the install on their package page.
We can again use pip
to install packages via the command line or in your Jupyter notebook. You can type this directly into a code cell in a notebook in Github Codespaces. Run that cell.
pip install nasdaq-data-link
You can also use the ! pip
convention. The ! tells the Jupyter notebook that you want to run a terminal command in that cell. Older notebook environments had to have the !, but Github Codespaces does not.
! pip install nasdaq-data-link
When you sign up for NASDAQ Data Link, you’ll get an API key. You will need to add this key to the set-up to access the NASDAQ data using nasdaqdatalink
.
I have saved my key locally and am bringing it in with nasdaqdatalink.read_key
, so that it isn’t publicly available. You don’t need that bit of code.
You can also install pandas-datareader
using pip
.
pip install pandas-datareader
Finally, for a large set of APIs for accessing data, check out Rapid API. Some are free; others you have to pay for. You’ll need to get an API key for each one. More on this at the end of these notes.
Let’s do our usual sort of set-up code.
# Set-up
import nasdaqdatalink # You could also do something like: import nasdaqdatalink as ndl
import pandas_datareader as pdr
import numpy as np
import pandas as pd
import datetime as dt
import matplotlib as mpl
import matplotlib.pyplot as plt
# Include this to have plots show up in your Jupyter notebook.
%matplotlib inline
import os # to get access to environment variables
NASDAQ_DATA_LINK_API_KEY = os.environ.get('NASDAQ')
nasdaqdatalink.ApiConfig.api_key = NASDAQ_DATA_LINK_API_KEY
nasdaqdatalink.read_key()
#nasdaqdatalink.read_key(filepath="/data/.corporatenasdaqdatalinkapikey")
In order for you to use nasdaqdatalink
, you’ll need to find your API key on their web page. To do this, log in, go to the upper-right corner, and click on the little person icon. You’ll find your key under Account Settings.
Here’s one way. The not-very-safe way. Copy and paste that key into:
nasdaqdatalink.ApiConfig.api_key = 'YOUR_KEY_HERE'
nasdaqdatalink.read_key()
That will work. But, copying and pasting your API keys like this is, in general, a very bad idea! After all, if someone has your API key, they can charge things to your account. Github and Codespaces have a way around this, though.
Let’s use Github Secrets to create an environment variable that you can associate with any of your repositories. Then, within a repo, you can access that “secret” variable, without actually typing it into your code.
Go to your main Github page at Github.com. Click on your image in the upper-right. Go to Settings. Click on Codespaces under Code, planning, and automation.
Click on Update under Codespaces Secrets. You can now name your secret (e.g. NASDAQ) and copy your API key in the box below. Select the repo(s) that you want to associate with this secret. Add your secret.
You’ll now see that secret in the main Codespace settings page. You can refer to the name of that secret in your Codespace now, like I do above. Note that you need to import the os
package to access this environment variable. An environment variable is a variable defined for all of the work in this particular repo. My secret is named NASDAQ
, so I can refer to that.
You’ll need to reload your Codespace if you had it running in another tab in order to access that secret.
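Here is a minimal sketch of reading a secret from an environment variable a bit more defensively. The helper name `get_api_key` is hypothetical, and the default name `NASDAQ` simply mirrors the secret name used above. The idea is to fail with a readable message when the secret is missing, rather than quietly passing `None` along to the API.

```python
import os


def get_api_key(name="NASDAQ"):
    """Return the API key stored in the environment variable `name`.

    Hypothetical helper: raises a clear error if the variable is unset or empty.
    """
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(
            f"Environment variable {name!r} is not set. "
            "Add it as a Codespaces secret and reload the Codespace."
        )
    return key


# Demonstration with a throwaway variable name so no real key is needed.
os.environ["DEMO_API_KEY"] = "not-a-real-key"
demo_key = get_api_key("DEMO_API_KEY")
print(f"Loaded a key with {len(demo_key)} characters")
```

With a real secret in place, you would pass the result to `nasdaqdatalink.ApiConfig.api_key`, as in the set-up code above.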
With my API key read, I can now start downloading data.
gdp = nasdaqdatalink.get('FRED/GDP')
gdp
---------------------------------------------------------------------------
JSONDecodeError Traceback (most recent call last)
File /opt/anaconda3/lib/python3.9/site-packages/requests/models.py:971, in Response.json(self, **kwargs)
970 try:
--> 971 return complexjson.loads(self.text, **kwargs)
972 except JSONDecodeError as e:
973 # Catch JSON-related errors and raise as requests.JSONDecodeError
974 # This aliases json.JSONDecodeError and simplejson.JSONDecodeError
File /opt/anaconda3/lib/python3.9/site-packages/simplejson/__init__.py:525, in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, use_decimal, **kw)
521 if (cls is None and encoding is None and object_hook is None and
522 parse_int is None and parse_float is None and
523 parse_constant is None and object_pairs_hook is None
524 and not use_decimal and not kw):
--> 525 return _default_decoder.decode(s)
526 if cls is None:
File /opt/anaconda3/lib/python3.9/site-packages/simplejson/decoder.py:372, in JSONDecoder.decode(self, s, _w, _PY3)
371 s = str(s, self.encoding)
--> 372 obj, end = self.raw_decode(s)
373 end = _w(s, end).end()
File /opt/anaconda3/lib/python3.9/site-packages/simplejson/decoder.py:402, in JSONDecoder.raw_decode(self, s, idx, _w, _PY3)
401 idx += 3
--> 402 return self.scan_once(s, idx=_w(s, idx).end())
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
During handling of the above exception, another exception occurred:
JSONDecodeError Traceback (most recent call last)
File /opt/anaconda3/lib/python3.9/site-packages/nasdaqdatalink/connection.py:90, in Connection.parse(cls, response)
89 try:
---> 90 return response.json()
91 except ValueError:
File /opt/anaconda3/lib/python3.9/site-packages/requests/models.py:975, in Response.json(self, **kwargs)
972 except JSONDecodeError as e:
973 # Catch JSON-related errors and raise as requests.JSONDecodeError
974 # This aliases json.JSONDecodeError and simplejson.JSONDecodeError
--> 975 raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
During handling of the above exception, another exception occurred:
DataLinkError Traceback (most recent call last)
Cell In[2], line 1
----> 1 gdp = nasdaqdatalink.get('FRED/GDP')
2 gdp
File /opt/anaconda3/lib/python3.9/site-packages/nasdaqdatalink/get.py:48, in get(dataset, **kwargs)
46 if dataset_args['column_index'] is not None:
47 kwargs.update({'column_index': dataset_args['column_index']})
---> 48 data = Dataset(dataset_args['code']).data(params=kwargs, handle_column_not_found=True)
49 # Array
50 elif isinstance(dataset, list):
File /opt/anaconda3/lib/python3.9/site-packages/nasdaqdatalink/model/dataset.py:47, in Dataset.data(self, **options)
45 updated_options = Util.merge_options('params', params, **options)
46 try:
---> 47 return Data.all(**updated_options)
48 except NotFoundError:
49 if handle_not_found_error:
File /opt/anaconda3/lib/python3.9/site-packages/nasdaqdatalink/operations/list.py:15, in ListOperation.all(cls, **options)
13 options['params'] = {}
14 path = Util.constructed_path(cls.list_path(), options['params'])
---> 15 r = Connection.request('get', path, **options)
16 response_data = r.json()
17 Util.convert_to_dates(response_data)
File /opt/anaconda3/lib/python3.9/site-packages/nasdaqdatalink/connection.py:40, in Connection.request(cls, http_verb, url, **options)
36 options['headers'] = headers
38 abs_url = '%s/%s' % (ApiConfig.api_base, url)
---> 40 return cls.execute_request(http_verb, abs_url, **options)
File /opt/anaconda3/lib/python3.9/site-packages/nasdaqdatalink/connection.py:52, in Connection.execute_request(cls, http_verb, url, **options)
47 response = session.request(method=http_verb,
48 url=url,
49 verify=ApiConfig.verify_ssl,
50 **options)
51 if response.status_code < 200 or response.status_code >= 300:
---> 52 cls.handle_api_error(response)
53 else:
54 return response
File /opt/anaconda3/lib/python3.9/site-packages/nasdaqdatalink/connection.py:96, in Connection.handle_api_error(cls, resp)
94 @classmethod
95 def handle_api_error(cls, resp):
---> 96 error_body = cls.parse(resp)
98 # if our app does not form a proper data_link_error response
99 # throw generic error
100 if 'quandl_error' not in error_body:
File /opt/anaconda3/lib/python3.9/site-packages/nasdaqdatalink/connection.py:92, in Connection.parse(cls, response)
90 return response.json()
91 except ValueError:
---> 92 raise DataLinkError(http_status=response.status_code, http_body=response.text)
DataLinkError: (Status 403) Something went wrong. Please try again. If you continue to have problems, please contact us at connect@data.nasdaq.com.
You can explore more FRED data here. Always read the documentation to know what you’re pulling.
btc = nasdaqdatalink.get('BCHAIN/MKPRU')
btc.tail()
Date | Value |
---|---|
2024-01-04 | 42854.95 |
2024-01-05 | 44190.10 |
2024-01-06 | 44181.10 |
2024-01-07 | 43975.63 |
2024-01-08 | 43928.07 |
btc['ret'] = btc['Value'].pct_change() # daily simple return from the price column
btc = btc.loc['2015-01-01':,['Value', 'ret']]
btc.plot()
<AxesSubplot:xlabel='Date'>
Well, that’s not a very good graph. The returns and price levels are in different units. Let’s use an f-string inside print to show and format the average BTC return.
print(f'Average return: {100 * btc.ret.mean():.2f}%')
Average return: 0.22%
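As a quick illustration of the format specifiers you can use inside an f-string, here is a small sketch with a made-up return value rather than the BTC data. Both lines print the same text; the second lets Python do the multiplication by 100.

```python
avg_ret = 0.002213  # made-up daily return, for illustration only

# :.2f -> fixed-point with two decimals; you multiply by 100 yourself
as_fixed = f"Average return: {100 * avg_ret:.2f}%"

# :.2% -> multiplies by 100 and appends the percent sign for you
as_percent = f"Average return: {avg_ret:.2%}"

print(as_fixed)
print(as_percent)
```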
Let’s make a cumulative return chart and daily return chart. We can then stack these on top of each other. I’ll use the .sub(1)
method to subtract 1 from the cumulative product. You see this a lot in the DataCamps.
btc['ret_g'] = btc.ret.add(1) # gross return
btc['ret_c'] = btc.ret_g.cumprod().sub(1) # cumulative return
btc
Date | Value | ret | ret_g | ret_c |
---|---|---|---|---|
2015-01-01 | 316.15 | 0.001425 | 1.001425 | 0.001425 |
2015-01-02 | 314.81 | -0.004238 | 0.995762 | -0.002819 |
2015-01-03 | 270.93 | -0.139386 | 0.860614 | -0.141812 |
2015-01-04 | 276.80 | 0.021666 | 1.021666 | -0.123218 |
2015-01-05 | 263.17 | -0.049241 | 0.950759 | -0.166392 |
... | ... | ... | ... | ... |
2024-01-04 | 42854.95 | -0.046799 | 0.953201 | 134.745803 |
2024-01-05 | 44190.10 | 0.031155 | 1.031155 | 138.974976 |
2024-01-06 | 44181.10 | -0.000204 | 0.999796 | 138.946468 |
2024-01-07 | 43975.63 | -0.004651 | 0.995349 | 138.295629 |
2024-01-08 | 43928.07 | -0.001082 | 0.998918 | 138.144979 |
3295 rows × 4 columns
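To see why the `.add(1)`, `.cumprod()`, `.sub(1)` chain gives a cumulative return, here is a small sketch on made-up prices (not the BTC data). It assumes `pandas` is installed. The cross-check uses the fact that compounding every daily return is the same as dividing the current price by the starting price.

```python
import pandas as pd

# Made-up prices, for illustration only.
prices = pd.Series([100.0, 110.0, 99.0, 108.9])

ret = prices.pct_change()           # simple returns; the first value is NaN
gross = ret.add(1)                  # gross returns: 1 + r
cum_ret = gross.cumprod().sub(1)    # compound the gross returns, then subtract 1

# Cross-check: cumulative return equals price / first price - 1.
check = prices.div(prices.iloc[0]).sub(1)
print(pd.DataFrame({"ret": ret, "cum_ret": cum_ret, "check": check}))
```

The first row differs only because there is no return on the first day, so `cum_ret` starts as NaN while `check` starts at zero.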
We can now make a graph using the fig, axs method. This is good review! Again, notice that semi-colon at the end. This suppresses some annoying output in the Jupyter notebook.
fig, axs = plt.subplots(2, 1, sharex=True, sharey=False, figsize=(10, 6))
axs[0].plot(btc.ret_c, 'g', label = 'BTC Cumulative Return')
axs[1].plot(btc.ret, 'b', label = 'BTC Daily Return')
axs[0].set_title('BTC Cumulative Returns')
axs[1].set_title('BTC Daily Returns')
axs[0].legend()
axs[1].legend();
I can make the same graph using the .add_subplot()
syntax. The method above gives you some more flexibility, since you can give both plots the same x-axis.
fig = plt.figure(figsize=(10, 6))
ax1 = fig.add_subplot(2, 1, 1)
ax1.plot(btc.ret_c, 'g', label = 'BTC Cumulative Return')
ax2 = fig.add_subplot(2, 1, 2)
ax2.plot(btc.ret, 'b', label = 'BTC Daily Return')
ax1.set_title('BTC Cumulative Returns')
ax2.set_title('BTC Daily Returns')
ax1.legend()
ax2.legend()
plt.subplots_adjust(wspace=0.5, hspace=0.5);
Let’s put together some ideas, write a function, and run a simulation. We’ll use something called geometric Brownian motion (GBM). What is GBM? It is a particular stochastic differential equation. But, what’s important for us is the idea, which is fairly simple. Here’s the formula:
$$dS_t = \mu S_t \, dt + \sigma S_t \, dW_t$$
Here, $S_t$ is the price at time $t$, $\mu$ is the drift, $\sigma$ is the standard deviation of returns, and $dW_t$ is the random shock.
This says that the change in the stock price has two components - a drift, or average increase over time, and a shock that is random at each point in time. The shock is scaled by the standard deviation of returns that you use. So, the larger the standard deviation, the bigger the shocks can be. This is basically the simplest way that you can model an asset price.
The shocks are what make the price wiggle around, or else it would just go up over time, based on the drift value that we use.
And, I’ll stress - we aren’t predicting here, so to speak. We are trying to capture some basic reality about how an asset moves and then seeing what is possible in the future. We aren’t making a statement about whether we think an asset is overvalued or undervalued, will go up or down, etc.
You can solve this equation to get the value of the asset at any point in time t. You just need to know the total of all of the shocks up to time t.
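For reference, this is the standard closed-form solution of geometric Brownian motion, written with $W_t$ for the running total of the shocks up to time $t$ (so $W_0 = 0$ and $W_t$ has variance $t$):

$$S_t = S_0 \exp\left(\left(\mu - \tfrac{1}{2}\sigma^2\right) t + \sigma W_t\right)$$

The $-\tfrac{1}{2}\sigma^2$ term is an adjustment that comes from taking the exponential of a random quantity. With it in place, the expected price grows at the drift rate: $E[S_t] = S_0 e^{\mu t}$.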
T = 30 # How long is our simulation? Let's do 31 days (0 to 30 the way Python counts)
N = 30 # number of time points in the prediction time horizon, making this the same as T means that we will simulate daily returns
S_0 = btc.Value.iloc[-1] # initial BTC price (last observed value, selected by position)
N_SIM = 100 # How many simulations to run?
mu = btc.ret.mean()
sigma = btc.ret.std()
This is the basic syntax for writing a function in Python. We saw this earlier, back when doing “Comp 101”. Remember, in Python, indentation matters!
def simulate_gbm(s_0, mu, sigma, n_sims, T, N):
    dt = T/N # One day
    dW = np.random.normal(scale = np.sqrt(dt),
                          size=(n_sims, N)) # The random part
    W = np.cumsum(dW, axis=1) # Running total of the shocks along each path
    time_step = np.linspace(dt, T, N)
    time_steps = np.broadcast_to(time_step, (n_sims, N))
    # W already has a standard deviation of sqrt(t), so it is scaled by sigma only
    S_t = s_0 * np.exp((mu - 0.5 * sigma ** 2) * time_steps + sigma * W)
    S_t = np.insert(S_t, 0, s_0, axis=1)
    return S_t
Nothing happens when we define a function. We’ve just created something called simulate_gbm
that we can now use just like any other Python function.
We can look at each piece of the function code, with some numbers hard-coded, to get a sense of what’s going on. This gets tricky - keep track of the dimensions. I think that’s the hardest part. How many numbers are we creating in each array? What do they mean?
# Creates 100 rows of 30 random numbers from the standard normal distribution.
dW = np.random.normal(scale = np.sqrt(1),
                      size=(100, 30))
# cumulative sum along each row
W = np.cumsum(dW, axis=1)
# Array with numbers from 1 to 30
time_step = np.linspace(1, 30, 30)
# Expands that to be 100 rows of numbers from 1 to 30. This is going to be the t in the formula above. So, for the price on the 30th day, we have t=30.
time_steps = np.broadcast_to(time_step, (100, 30))
# This is the formula from above to find the value of the asset at any point in time t. np.exp raises the natural number e to a power. W is the cumulative sum of all of our random shocks.
S_t = S_0 * np.exp((mu - 0.5 * sigma ** 2) * time_steps + sigma * W)
# This inserts the initial price at the start of each row.
S_t = np.insert(S_t, 0, S_0, axis=1)
We can look at these individually, too.
dW
array([[ 1.85112558, 0.13211658, 0.45139814, ..., 0.72596267,
-1.63582981, -0.03377876],
[ 0.15421336, 0.4148752 , 0.6439595 , ..., -0.47171271,
1.20669713, 2.6471703 ],
[ 1.79628151, -0.26420644, 0.93929834, ..., -1.13225649,
0.05780285, -0.72668984],
...,
[ 0.32724273, 2.16254346, 0.03592569, ..., 1.82984001,
0.09920881, -0.59093787],
[ 0.00503292, 0.07004887, -0.42857624, ..., 2.08370213,
-0.27461578, -1.99069844],
[ 1.12432136, -1.04952278, 0.09237506, ..., -0.25002425,
0.54331545, 1.29904012]])
time_steps
array([[ 1., 2., 3., ..., 28., 29., 30.],
[ 1., 2., 3., ..., 28., 29., 30.],
[ 1., 2., 3., ..., 28., 29., 30.],
...,
[ 1., 2., 3., ..., 28., 29., 30.],
[ 1., 2., 3., ..., 28., 29., 30.],
[ 1., 2., 3., ..., 28., 29., 30.]])
len(time_steps)
100
np.shape(time_steps)
(100, 30)
I do this kind of step-by-step break down all of the time. It’s the only way I can understand what’s going on.
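Here is one way to do that kind of dimension check in a single cell. It is a sketch with small made-up sizes (4 paths, 5 steps) so that every array is easy to print and inspect, and it assumes `numpy` is installed. The seeded generator is only there to make the numbers reproducible.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible
n_sims, n_steps = 4, 5          # small made-up sizes, easy to inspect

dW = rng.normal(scale=1.0, size=(n_sims, n_steps))          # one shock per path per step
W = np.cumsum(dW, axis=1)                                   # running total along each row
time_step = np.linspace(1, n_steps, n_steps)                # 1, 2, ..., n_steps
time_steps = np.broadcast_to(time_step, (n_sims, n_steps))  # same row repeated per path

for name, arr in [("dW", dW), ("W", W),
                  ("time_step", time_step), ("time_steps", time_steps)]:
    print(f"{name:<10} shape = {arr.shape}")

# The last column of W is each path's total shock over all steps.
assert np.allclose(W[:, -1], dW.sum(axis=1))
```

Rows are simulations and columns are time steps, which is why both `np.cumsum` and `np.insert` in the function use `axis=1`.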
We can then use our function. This returns an ndarray
.
gbm_simulations = simulate_gbm(S_0, mu, sigma, N_SIM, T, N)
And, we can plot all of the simulations. I’m going to use pandas
to plot, save to ax
, and then style the ax
.
gbm_simulations_df = pd.DataFrame(np.transpose(gbm_simulations))
# plotting
ax = gbm_simulations_df.plot(alpha=0.2, legend=False)
ax.set_title('BTC Simulations', fontsize=16);
The y-axis has a very wide range, since some extreme values are possible, given this simulation.
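A plot of many paths can be hard to read, so it can help to summarize where the simulated prices end up. The sketch below is self-contained and uses made-up inputs (not the BTC estimates from above) together with the standard GBM solution; it assumes `numpy` is installed. It reports the 5th, 50th, and 95th percentiles of the final simulated price.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up inputs for illustration; these are not the BTC estimates from above.
s_0, mu, sigma = 100.0, 0.001, 0.04
n_sims, n_days = 1000, 30

# Daily shocks, summed along each path, then plugged into the GBM solution.
W = np.cumsum(rng.normal(size=(n_sims, n_days)), axis=1)
t = np.arange(1, n_days + 1)
paths = s_0 * np.exp((mu - 0.5 * sigma**2) * t + sigma * W)

final_prices = paths[:, -1]
p5, p50, p95 = np.percentile(final_prices, [5, 50, 95])
print(f"5th percentile:  {p5:,.2f}")
print(f"Median:          {p50:,.2f}")
print(f"95th percentile: {p95:,.2f}")
```

This is still not a prediction. It only describes the spread of outcomes that the assumed drift and volatility allow.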
Using pandas-datareader#
The pandas data-reader API lets us access additional data sources, such as FRED.
There are also APIs that let you access the same data. For example, Yahoo! Finance has several, like yfinance. I find accessing Yahoo! Finance via an API to be very buggy - Yahoo! actively tries to stop it. So, you can try those instructions, but they may or may not work.
Lots of developers have written APIs to access different data sources.
Note
Different data sources might require API keys. Sometimes you have to pay. Always read the documentation.
Here’s another FRED example, but using pandas-datareader
.
start = dt.datetime(2010, 1, 1)
end = dt.datetime(2013, 1, 27)
gdp = pdr.DataReader('GDP', 'fred', start, end)
gdp.head()
DATE | GDP |
---|---|
2010-01-01 | 14764.610 |
2010-04-01 | 14980.193 |
2010-07-01 | 15141.607 |
2010-10-01 | 15309.474 |
2011-01-01 | 15351.448 |
Data Details - Using APIs#
The notes above use the NASDAQ Data Link API to pull some BTC data. Now, I’ll discuss using this API more generally, as well as using Rapid API, another website with a variety of data options. I’ll also show you an API from Github.
As mentioned above, APIs are ways for one program or piece of software to talk to another. In our case, we’re using them to get data. That data might come in as a pandas
DataFrame, ready to use. Other times, it might come in as something called a JSON file. We’ll have to do a bit more work with this common data structure.
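To make “JSON” a little more concrete, here is a minimal sketch using only Python’s built-in `json` module. The JSON text is made up for illustration. Once parsed, JSON turns into ordinary Python dictionaries and lists that you can index into.

```python
import json

# A made-up JSON string, similar in shape to what an API might send back.
raw = (
    '{"symbol": "BTC", "prices": ['
    '{"date": "2024-01-07", "value": 43975.63}, '
    '{"date": "2024-01-08", "value": 43928.07}]}'
)

data = json.loads(raw)               # JSON text -> Python dict
print(type(data).__name__)           # dict
print(data["symbol"])                # a top-level key
print(data["prices"][-1]["value"])   # a value inside a nested list of dicts
```

The extra work mentioned above is mostly about flattening this kind of nesting into rows and columns.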
NASDAQ API - Another Example#
Let’s look at the NASDAQ API one more time. Once you log in, you’ll see the home page below. Note the strip across the upper-left, that has API, Python, Excel, etc. You can use the NASDAQ API in a variety of settings. There’s a SEARCH FOR DATA box at the top.
If you click EXPLORE next to the search box, you’re taken to a list of all of their data. Much of it is premium - you have to pay. However, you can filter for free data. There’s free data for house prices, gold and silver markets, IMF macro data, the Fed, etc. Much of this free data comes from Quandl, which Nasdaq acquired in 2018.
Quandl has been completely integrated by NASDAQ now, though you will see legacy instructions on the website that refer to its older API commands.
Let’s look at the Zillow data, the first option presented when I look for free data. I’ve used them in labs and exams.
Each of the data APIs shows you samples of what you can access. So, we see an example table with data for a particular indicator and region. We also see a table that has a list of all of the indicators and what they measure. Finally, we see a table with all of the regions and what they represent.
This data structure makes it clear that we can download value data and then merge in ID and region descriptions if needed. But, how do we do that? See the tab in the upper-left, with DATA highlighted? You can click on DOCUMENTATION and USAGE to learn more. We’ll look at a quick example here.
Click USAGE and then the Python icon. You’ll see an example that lets you filter by a single indicator_id and region. It has your API key and the .get_table
method.
However, note the quandl
stuff. They haven’t transitioned this code yet. You’ll need to do a pip
install for quandl.
pip install quandl
Also, we didn’t use .get_table
above for BTC. The Zillow data is stored differently.
Make sure that you include your API key. You can input it directly, using the code that they provide, but this isn’t the preferred method. Instead, use a Github Secret, like we did above.
#! pip install quandl
# Bring in quandl for downloading data
import quandl
# quandl.ApiConfig.api_key = 'YOUR_KEY_HERE'
quandl.read_key()
You need that paginate=True
in there in order to download all of the available data. Without it, it will only pull the first 10,000 rows. Using paginate extends the limit to 1,000,000 rows, or observations. Now, note that this could be a lot of data! You might need to download the data in chunks to get what you want.
Let’s try pulling in the indicator_id ZATT for all regions.
# zillow = quandl.get_table('ZILLOW/DATA', indicator_id = 'ZATT', paginate=True)
I’ve commented out the code above, because I know it will exceed the download limit! So, we need to be more selective.
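One way to stay under a download limit is to request the data in batches. The sketch below only shows the batching step, using the standard library. The helper name `chunked` and the region ids are made up for illustration, and the API call itself is left as a comment so that nothing is downloaded.

```python
def chunked(items, size):
    """Yield successive lists of at most `size` items (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Made-up region ids, for illustration only.
region_ids = [str(i) for i in range(1, 11)]

batches = list(chunked(region_ids, 4))
for batch in batches:
    # In practice each batch would go into one request, for example:
    # quandl.get_table('ZILLOW/DATA', indicator_id='ZATT',
    #                  region_id=batch, paginate=True)
    print(batch)
```

You could then combine the pieces with `pd.concat`. Whether batching is needed, and what batch size works, depends on the data set and on your account’s limits.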
If you look on the NASDAQ Zillow documentation page, you’ll see the three tables that you can download, the variables inside of each, and what you’re allowed to filter on. You unfortunately can’t filter on date in the ZILLOW/DATA table. Other data sets, like FRED, do let you specify start and end dates. Every API is different.
You can find examples of how to filter and sub-select your data on the NASDAQ website: https://docs.data.nasdaq.com/docs/python-tables
However, you can filter on region_id. Let’s pull the ZILLOW/REGIONS table to see what we can use.
regions = quandl.get_table('ZILLOW/REGIONS', paginate=True)
regions
index | region_id | region_type | region |
---|---|---|---|
0 | 99999 | zip | 98847;WA;Wenatchee, WA;Leavenworth;Chelan County |
1 | 99998 | zip | 98846;WA;nan;Pateros;Okanogan County |
2 | 99997 | zip | 98845; WA; Wenatchee; Douglas County; Palisades |
3 | 99996 | zip | 98844;WA;nan;Oroville;Okanogan County |
4 | 99995 | zip | 98843;WA;Wenatchee, WA;Orondo;Douglas County |
... | ... | ... | ... |
89300 | 100000 | zip | 98848;WA;Moses Lake, WA;Quincy;Grant County |
89301 | 10000 | city | Bloomington;MD;nan;Garrett County |
89302 | 1000 | county | Echols County;GA;Valdosta, GA |
89303 | 100 | county | Bibb County;AL;Birmingham-Hoover, AL |
89304 | 10 | state | Colorado |
89305 rows × 3 columns
What if we just want cities?
cities = regions[regions.region_type == 'city']
cities
index | region_id | region_type | region |
---|---|---|---|
10 | 9999 | city | Carrsville;VA;Virginia Beach-Norfolk-Newport N... |
20 | 9998 | city | Birchleaf;VA;nan;Dickenson County |
56 | 9994 | city | Wright;KS;Dodge City, KS;Ford County |
124 | 9987 | city | Weston;CT;Bridgeport-Stamford-Norwalk, CT;Fair... |
168 | 9980 | city | South Wilmington; IL; Chicago-Naperville-Elgin... |
... | ... | ... | ... |
89203 | 10010 | city | Atwood;KS;nan;Rawlins County |
89224 | 10008 | city | Bound Brook;NJ;New York-Newark-Jersey City, NY... |
89254 | 10005 | city | Chanute;KS;nan;Neosho County |
89290 | 10001 | city | Blountsville;AL;Birmingham-Hoover, AL;Blount C... |
89301 | 10000 | city | Bloomington;MD;nan;Garrett County |
28125 rows × 3 columns
cities.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 28125 entries, 10 to 89301
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 region_id 28125 non-null object
1 region_type 28125 non-null object
2 region 28125 non-null object
dtypes: object(3)
memory usage: 878.9+ KB
I like to look and see what things are stored as, too. Remember, the object
type is very generic.
There are 28,125 rows of cities! How about counties?
counties = regions[regions.region_type == 'county']
counties
index | region_id | region_type | region |
---|---|---|---|
94 | 999 | county | Durham County;NC;Durham-Chapel Hill, NC |
169 | 998 | county | Duplin County;NC;nan |
246 | 997 | county | Dubois County;IN;Jasper, IN |
401 | 995 | county | Donley County;TX;nan |
589 | 993 | county | Dimmit County;TX;nan |
... | ... | ... | ... |
89069 | 1003 | county | Elmore County;AL;Montgomery, AL |
89120 | 1002 | county | Elbert County;GA;nan |
89204 | 1001 | county | Elbert County;CO;Denver-Aurora-Lakewood, CO |
89302 | 1000 | county | Echols County;GA;Valdosta, GA |
89303 | 100 | county | Bibb County;AL;Birmingham-Hoover, AL |
3097 rows × 3 columns
Can’t find the regions you want? You could export the whole thing to a CSV file and explore it in Excel. This will show up in whatever folder you currently have as your home in VS Code.
counties.to_csv('counties.csv', index = True)
You can also open up the Variables window in VS Code (or the equivalent in Google Colab) and scroll through the file, looking for the region_id values that you want.
Finally, you can search the text in a column directly. Let’s find counties in NC.
nc_counties = counties[counties['region'].str.contains(";NC")]
nc_counties
index | region_id | region_type | region |
---|---|---|---|
94 | 999 | county | Durham County;NC;Durham-Chapel Hill, NC |
169 | 998 | county | Duplin County;NC;nan |
2683 | 962 | county | Craven County;NC;New Bern, NC |
4637 | 935 | county | Chowan County;NC;nan |
4972 | 93 | county | Ashe County;NC;nan |
... | ... | ... | ... |
87475 | 1180 | county | Martin County;NC;nan |
87821 | 1147 | county | Lenoir County;NC;Kinston, NC |
88578 | 1059 | county | Greene County;NC;nan |
88670 | 1049 | county | Graham County;NC;nan |
88823 | 1032 | county | Gaston County;NC;Charlotte-Concord-Gastonia, N... |
100 rows × 3 columns
nc_counties.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 94 to 88823
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 region_id 100 non-null object
1 region_type 100 non-null object
2 region 100 non-null object
dtypes: object(3)
memory usage: 3.1+ KB
There are 100 counties in NC, so this worked. Now, we can save these regions to a list and use that to pull data.
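The text search above can be reproduced on a handful of rows. This sketch copies four rows from the regions output shown earlier and assumes `pandas` is installed. It also shows a slightly stricter alternative, splitting the region string on `;` and matching the state field exactly, which avoids accidental matches elsewhere in the text.

```python
import pandas as pd

# A few rows copied from the regions output above.
regions = pd.DataFrame({
    "region_id": ["999", "1001", "1000", "1002"],
    "region_type": ["county"] * 4,
    "region": [
        "Durham County;NC;Durham-Chapel Hill, NC",
        "Elbert County;CO;Denver-Aurora-Lakewood, CO",
        "Echols County;GA;Valdosta, GA",
        "Elbert County;GA;nan",
    ],
})

# regex=False treats ";NC" as literal text rather than a regular expression.
nc = regions[regions["region"].str.contains(";NC", regex=False)]

# Alternative: split the string and match the state field exactly.
parts = regions["region"].str.split(";", expand=True)
ga = regions[parts[1] == "GA"]

print(nc["region_id"].to_list())
print(ga["region_id"].to_list())
```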
By exploring the data like this, you can maybe find the region_id values that you want and give them as a list. I’m also going to use the qopts =
option to name the columns that I want to pull. This isn’t necessary here, since I want all of the columns, but I wanted to show you that you could do this.
nc_county_list = nc_counties['region_id'].to_list()
I’m going to pull down just the NC counties. I’ll comment out my code, though, so that I don’t download from the API every time I update my notes. This can cause a time-out error.
#zillow_nc = quandl.get_table('ZILLOW/DATA', indicator_id = 'ZATT', paginate = True, region_id = nc_county_list, qopts = {'columns': ['indicator_id', 'region_id', 'date', 'value']})
#zillow_nc.head(25)
index | indicator_id | region_id | date | value |
---|---|---|---|---|
0 | ZATT | 999 | 2024-01-31 | 557371.760608 |
1 | ZATT | 999 | 2023-12-31 | 557661.809892 |
2 | ZATT | 999 | 2023-11-30 | 556887.747711 |
3 | ZATT | 999 | 2023-10-31 | 555300.176497 |
4 | ZATT | 999 | 2023-09-30 | 553246.874842 |
5 | ZATT | 999 | 2023-08-31 | 550952.109307 |
6 | ZATT | 999 | 2023-07-31 | 547469.076081 |
7 | ZATT | 999 | 2023-06-30 | 542981.400227 |
8 | ZATT | 999 | 2023-05-31 | 538523.149903 |
9 | ZATT | 999 | 2023-04-30 | 536026.114767 |
10 | ZATT | 999 | 2023-03-31 | 535978.017181 |
11 | ZATT | 999 | 2023-02-28 | 537883.027543 |
12 | ZATT | 999 | 2023-01-31 | 541175.601929 |
13 | ZATT | 999 | 2022-12-31 | 545361.903669 |
14 | ZATT | 999 | 2022-11-30 | 549751.208349 |
15 | ZATT | 999 | 2022-10-31 | 554505.626693 |
16 | ZATT | 999 | 2022-09-30 | 559896.493202 |
17 | ZATT | 999 | 2022-08-31 | 564915.670096 |
18 | ZATT | 999 | 2022-07-31 | 565417.726032 |
19 | ZATT | 999 | 2022-06-30 | 559501.205643 |
20 | ZATT | 999 | 2022-05-31 | 547576.105906 |
21 | ZATT | 999 | 2022-04-30 | 533067.734141 |
22 | ZATT | 999 | 2022-03-31 | 518721.114714 |
23 | ZATT | 999 | 2022-02-28 | 506041.692210 |
24 | ZATT | 999 | 2022-01-31 | 495895.806066 |
Hey, there’s Durham County!
#zillow_nc.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27676 entries, 0 to 27675
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 indicator_id 27676 non-null object
1 region_id 27676 non-null object
2 date 27676 non-null datetime64[ns]
3 value 27676 non-null float64
dtypes: datetime64[ns](1), float64(1), object(2)
memory usage: 865.0+ KB
Now you can filter by date if you like. And, you could pull down multiple states this way, change the variable type, etc. You could also merge in the region names using region_id as your key.
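As a sketch of those two follow-up steps, filtering by date and merging in the region names, here is a small example on made-up stand-ins for the downloaded tables. The values for region 998 are invented for illustration, and the example assumes `pandas` is installed.

```python
import pandas as pd

# Small made-up stand-ins for the downloaded tables.
values = pd.DataFrame({
    "region_id": ["999", "999", "998"],
    "date": pd.to_datetime(["2023-12-31", "2024-01-31", "2024-01-31"]),
    "value": [557661.81, 557371.76, 200000.00],
})
names = pd.DataFrame({
    "region_id": ["999", "998"],
    "region": ["Durham County;NC;Durham-Chapel Hill, NC",
               "Duplin County;NC;nan"],
})

# Keep observations from 2024 onward, then attach the region names.
recent = values[values["date"] >= "2024-01-01"]
merged = recent.merge(names, on="region_id", how="left", validate="many_to_one")
print(merged)
```

The `validate="many_to_one"` argument is optional. It makes pandas raise an error if a region_id shows up more than once in the names table, which would otherwise silently duplicate rows.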
Using Rapid API#
Another data option is Rapid API. There’s all types of data here - markets, sports, gambling, housing, etc. People will write their own APIs, perhaps interfacing with the websites that contain the information. They can then publish their APIs on this webpage. Many have free options, some you have to pay for. There are thousands here, so you’ll have to dig around.
Once you have an account, you’ll be able to subscribe to different APIs. You probably want the data to have a free option.
The quick start guide is here.
Luckily, all of the APIs here tend to have the same structures. These are called REST APIs. This stands for “Representational State Transfer” and is just a standardized way for computers to talk to each other. They are going to use a standard data format, like JSON. More on this below.
You can read more on their API Learn page.
We’ll look at one example, Pinnacle Odds, which has some sports gambling information: https://rapidapi.com/tipsters/api/pinnacle-odds/
Once you’ve subscribed, you see the main endpoint screen.
At the top, you’ll see Endpoints, About, Tutorials, Discussions, and Pricing. Click around to read more about the API.
We are currently on Endpoints. Endpoints are basically like URLs. They are where different tables of data live. We are going to use this page to figure out the data that we need. And, the webpage will also create the Python code needed to download the data!
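Since endpoints are basically URLs, you can think of each request as a base address, a path to a table, and some optional filters. The sketch below builds such URLs with the standard library and does not send any request. The helper `build_url` is hypothetical, and the leagues path and its `sport_id` parameter are assumptions for illustration; copy the real ones from the code snippet that the website generates.

```python
from urllib.parse import urlencode

BASE_URL = "https://pinnacle-odds.p.rapidapi.com"  # host for the example API in these notes


def build_url(path, **params):
    """Join the base URL, an endpoint path, and optional query parameters."""
    url = f"{BASE_URL}/{path.lstrip('/')}"
    if params:
        url += "?" + urlencode(params)
    return url


sports_url = build_url("/kit/v1/sports")
# The path and parameter name below are hypothetical, for illustration only.
leagues_url = build_url("/kit/v1/leagues", sport_id=3)

print(sports_url)
print(leagues_url)
```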
You can start on the left of the screen. You’ll see a list of the different tables available. I’ll try List of Sports in this example. You’ll see why in a minute.
You’ll note that the middle section now changed. This is where you can filter and ask for particular types of data from that table. In this case, there are no options to change.
On the right, you’ll see Code Snippets. The default is Node.js, a JavaScript runtime. We don’t want that. Click the dropdown box and look for Python. They have three ways, using three different packages, to interface with the API from Python and download the data. I’ll pick Requests
- it seemed to work below.
This will change the code. You’ll see the package import, your API key, the host, and the data request. You can click Copy Code.
But, before we run this on our end, let’s click Test Endpoint. That’s the blue box in the middle. Then, click Results on the left and Body. By doing this, we essentially just ran that code in the browser. We can see what data we’re going to get. This is a JSON file with 9 items. Each item has 6 keys. You can see what the keys are - they are giving us the ids for each sport. For example, “Soccer” is “id = 1”.
This is very helpful! We need to know these id values if we want to pull particular sports.
For fun, let’s pull this simple JSON file on our end. I’ve copied and pasted the code below. It didn’t like the print
function, so I just dropped it. I am again loading in my API key from a separate file. You’ll use your own.
I am commenting out my code so that it doesn’t run and use my API key every time I update my book.
# import requests
# from dotenv import load_dotenv # For my .env file which contains my API keys locally
# import os # For my .env file which contains my API keys locally
# load_dotenv() # For my .env file which contains my API keys locally
# RAPID_API_KEY = os.getenv('RAPID_API_KEY')
# url = "https://pinnacle-odds.p.rapidapi.com/kit/v1/sports"
# headers = {
# "X-RapidAPI-Key": RAPID_API_KEY,
# "X-RapidAPI-Host": "pinnacle-odds.p.rapidapi.com"
# }
# sports_ids = requests.request("GET", url, headers=headers)
# print(sports_ids.text)
We can turn that response file into a JSON file. This is what it wants to be!
Again, all of the code that follows is also commented out so that it doesn’t run every time I edit this online book. The output from the code is still there, however.
#sports_ids_json = sports_ids.json()
#sports_ids_json
That’s JSON. I was able to show the whole thing in the notebook.
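To see what this conversion does without calling the API, here’s a small sketch using Python’s built-in json module. The text below is made up to look like a short response. It is not real output from the sports endpoint, and the name field is an assumption. Calling .json() on a requests response does this kind of parsing for you.

```python
import json

# Made-up text standing in for the body of an API response (not real data).
response_text = '[{"id": 1, "name": "Soccer"}, {"id": 3, "name": "Basketball"}]'

# json.loads turns JSON text into ordinary Python lists and dictionaries.
parsed = json.loads(response_text)

print(type(parsed).__name__)      # list
print(type(parsed[0]).__name__)   # dict
print(parsed[1]["name"])          # Basketball
```

So “JSON” on the Python side is really just lists and dictionaries that you can index like any others.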
Let’s get that into a pandas DataFrame now. To do that, we have to know a bit about how JSON files are structured. This one is easy. pd.json_normalize is a useful tool here.
#sports_ids_df = pd.json_normalize(data = sports_ids_json)
#sports_ids_df
What do all of those columns mean? I don’t know! You’d want to read the documentation for your API.
Also, note how I’m changing the names of my objects as I go. I want to keep each data structure in memory - what I originally downloaded, the JSON file, the DataFrame. This way, I don’t overwrite anything and I won’t be forced to download the data all over again.
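If you don’t have an API key yet, you can still try the simple case offline. This sketch uses made-up records shaped like a flat list of dictionaries. The values and the name column are assumptions for illustration, not the real columns from the API.

```python
import pandas as pd

# Made-up records shaped like a simple list of dictionaries (not real API output).
example_json = [
    {"id": 1, "name": "Soccer"},
    {"id": 3, "name": "Basketball"},
]

# Each dictionary becomes a row, and each key becomes a column.
example_df = pd.json_normalize(data=example_json)
print(example_df)
```

Note that the original example_json is still in memory after the DataFrame is created, which is the same keep-every-step habit described above.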
Now, let’s see if we can pull some actual data. I notice that id = 3 is Basketball. Cool. Let’s try for some NBA data. Go back to the left of the Endpoint page and click on List of archive events. The middle will change and you’ll have some required and optional inputs. I know I want sport_id to be 3. But I don’t want all basketball. Just the NBA. So, I notice the league_ids option below. But I don’t know the number of the NBA.
OK, back to the left side. See List of leagues? Click that. I put in sport_id = 3. I then click Test Endpoint. I go to Results, select Body, and then Expand All. I do a CTRL-F to look for “NBA”.
And I find a bunch of possibilities! NBA games. Summer League. D-League. Summer League! If you’re betting on NBA Summer League, please seek help. Let’s use the regular NBA. That’s league_id = 487.
Back to List of archive events. I’ll add that league ID to the bottom of the middle. I set the page_num to 1000. I then click Test Endpoint and look at what I get.
Nothing! That’s an empty looking file on the right. Maybe this API doesn’t keep archived NBA? Who knows.
Let’s try another endpoint. Click on List of markets. Let’s see what this one has. In the middle, I’ll again use the codes for basketball and the NBA. I’ll set is_have_odds to True. Let’s test the endpoint and see what we get.
We can expand the result and look at the data structure. This is a more complicated one. I see 8 items under events. These correspond to the 8 games this weekend. Then, under each event, you can keep drilling down. The level 0 is kind of like the header for that event. It has the game, the start time, the teams, etc. You’ll see 4 more keys under periods. Each of these is a different betting line, with money lines, spreads, what I think are over/under point totals, etc.
Anyway, the main thing here is that we have indeed pulled some rather complex looking data. That data is current for upcoming games, not historical. But, we can still pull this in and use it to see how to work with a more complex JSON structure.
I’ll copy and paste the code again.
# import requests
# url = "https://pinnacle-odds.p.rapidapi.com/kit/v1/markets"
# querystring = {"sport_id":"3","league_ids":"487","is_have_odds":"true"}
# headers = {
# "X-RapidAPI-Key": RAPID_API_KEY,
# "X-RapidAPI-Host": "pinnacle-odds.p.rapidapi.com"
# }
# current = requests.request("GET", url, headers=headers, params=querystring)
# print(current.text)
Does that query string above make sense now?
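If it helps, here is roughly what happens to that query string. This sketch uses urlencode from Python’s standard library as a stand-in, only so that nothing gets sent over the network. requests does its own encoding when you pass params, so treat this as an illustration of the idea rather than the exact internals.

```python
from urllib.parse import urlencode

url = "https://pinnacle-odds.p.rapidapi.com/kit/v1/markets"
querystring = {"sport_id": "3", "league_ids": "487", "is_have_odds": "true"}

# Encode the dictionary as key=value pairs joined by "&" and attach it after "?".
full_url = url + "?" + urlencode(querystring)
print(full_url)
# https://pinnacle-odds.p.rapidapi.com/kit/v1/markets?sport_id=3&league_ids=487&is_have_odds=true
```

Each key in the dictionary matches one of the input boxes from the endpoint page: the sport, the league, and whether to return only events that have odds.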
I’ll convert that data to JSON below and peek at it.
#current_json = current.json()
#current_json
Wow, that’s a lot of stuff. OK, now this is the tricky part. How do we get this thing into a pandas DataFrame? This is where we really have to think carefully. What do we actually want? Remember, a DataFrame, at its simplest, looks like a spreadsheet, with rows and columns. How could this thing possibly look like that?
#current_df = pd.json_normalize(data = current_json)
#current_df
We need to flatten this file. JSON files are nested. That’s what all those brackets are doing. Let’s think a little more about that.
JSON files are like dictionaries, as you can see in the picture above. There’s a key and a value. However, they can get complicated where there’s a list of dictionaries embedded in the same data structure. You can think of navigating them like working through the branches of a tree. Which branch do you want?
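Here’s a small sketch of following one branch by hand. The structure is made up and only loosely shaped like the odds data. The key names event_id, home, away, and money_line are assumptions, so check your own output for the real ones.

```python
# Made-up structure loosely shaped like the odds data (not real API output).
example_json = {
    "sport_id": 3,
    "events": [
        {
            "event_id": 101,
            "home": "Team A",
            "away": "Team B",
            "periods": {"num_0": {"money_line": {"home": 1.9, "away": 2.0}}},
        }
    ],
}

# Follow one branch: the events list, its first item, then down through the keys.
first_event = example_json["events"][0]
home_price = first_event["periods"]["num_0"]["money_line"]["home"]

print(first_event["home"])  # Team A
print(home_price)           # 1.9
```

Square brackets with a number pick an item out of a list, and square brackets with a string pick a key out of a dictionary. Chaining them is how you walk down the tree.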
To do this, we’ll use the pd.json_normalize function. We’ve just used it, but that was with a simple JSON file. It didn’t really work with the current odds data, though. For that, we need to add more arguments.
You can read more here.
Everything is packed into that events column. Let’s flatten it. This will take every item in it and convert it into a new column. Keys will be combined together to create compound names that combine different levels.
#current_df_events = pd.json_normalize(data = current_json, record_path=['events'])
#current_df_events
#list(current_df_events)
That’s a lot of columns! Do you see what it did? When I flattened the file, it worked its way through the dictionary keys and values. So, periods to num_0 to totals to the various over/under values, etc. It then combined those paths to create new columns, where each level is separated by a period. The value at the end of the chain is what goes into the DataFrame.
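You can see the same behavior on a tiny example. This is a sketch with made-up data, where the keys event_id, home, over, and under are assumptions chosen to resemble the structure described above. It compares pd.json_normalize with and without record_path.

```python
import pandas as pd

# Made-up nested data loosely shaped like the odds output (not real API output).
example_json = {
    "sport_id": 3,
    "events": [
        {"event_id": 101, "home": "Team A",
         "periods": {"num_0": {"totals": {"over": 1.95, "under": 1.87}}}},
        {"event_id": 102, "home": "Team C",
         "periods": {"num_0": {"totals": {"over": 1.91, "under": 1.91}}}},
    ],
}

# Without record_path, the whole list of events stays packed in a single cell.
packed_df = pd.json_normalize(data=example_json)
print(packed_df.shape)  # (1, 2)

# With record_path, each event becomes a row and nested keys become dotted columns.
flat_df = pd.json_normalize(data=example_json, record_path=["events"])
print(list(flat_df))
```

The flattened version has one row per event, with column names like periods.num_0.totals.over built from the chain of keys.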
That’s a quick introduction to Rapid API and dealing with its JSON output. Every API is different - you’ll have to play around.
Data on Kaggle#
Kaggle is also a great source for data. You can search their data sets here.
Searching for finance, I see one on consumer finance complaints that looks interesting. The Kaggle page describes the data, gives you a data dictionary, and some examples.
The data for Kaggle contests is usually pretty clean already. That said, you’ll usually have to do at least some work to get it ready to look at.