Machine learning and unsupervised learning #

Chapter 1 of Hull starts with a discussion of machine learning, data needs, and how to categorize different types of machine learning. Machine learning is a combination of statistics and computer science. It is a way to use data to make predictions. Much of the math behind machine learning has been around for a while, but it only became truly useful once we had more data, more storage, and more computing power.

Machine learning and artificial intelligence are related. Machine learning is usually viewed as a subset of artificial intelligence: systems that learn patterns from data, rather than following hand-coded rules.

Traditionally, economics, and econometrics more specifically, have been concerned with causality. Causality is a subtle, complicated idea. It is not the same as correlation. Causality is about understanding the mechanism behind a relationship. Machine learning is more about prediction. It is about using data to make predictions, and it does not (necessarily) care about the underlying mechanism. To understand the mechanism, we often want to have a theory.

Think about developing a drug. You want to understand how the drug works. You want to understand the biology. You want to know how it interacts with the body. You want to know how it interacts with other drugs. You want to know how it interacts with different people. If the drug does work, you want to understand why.

Machine learning is about patterns in data. It is traditionally about using those patterns to make predictions. In practice, machine learning and causal inference are often used together now.

Finance is a tricky domain for machine learning. We don’t have as much data as, say, image recognition. Images also don’t know that you are trying to categorize or predict them. The fact that someone is trying to predict a future stock price changes the stock price itself and affects the prediction process.

Of course, some areas of finance have more data than others. For example, if you are trying to understand the limit-order book and your trading costs, that’s more data than looking at quarterly accounting statements.

This is a nice summary of how we are using machine learning in finance: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4501707. The authors are academics, but one of them also works at a hedge fund. The first chapter discusses some of the tensions between prediction and causality that I mention above.

Roughly speaking, there are two main types of machine learning:

  1. Supervised learning: This is where you have a target variable. You are trying to predict something. You have a set of features and a target variable. You are trying to understand the relationship between the features and the target variable. You are trying to predict the target variable using the features.

  2. Unsupervised learning: This is where you do not have a target variable. You are trying to understand the structure of the data. You are trying to find patterns in the data. You are trying to cluster the data.

There are others, but this is a helpful heuristic.

We’ll start with unsupervised learning below.


Unsupervised Learning #

These notes follow along with Chapter 2 of our Hull textbook. My notes are not meant to be comprehensive. Instead, I want to provide you with some additional resources and commentary on what is in this chapter.

As noted in Chapter 2,

… unsupervised learning is concerned with identifying patterns in data. The immediate objective is not to predict the value of a target variable. Instead, it is to understand the structure of data and find clusters.

Feature Scaling#

The chapter starts with a particular transformation of our data. We’ve already seen a bit about how to take our raw data and create new variables from it. This broad topic is called feature engineering. It is a crucial part of data analysis, as you are rarely taking raw data and using it to make a prediction or investigate a causal relationship.

Instead, you need to change your variables. Scale or normalize them. Create categories. Take logs. All to help with interpretation and create better models.

The purpose of feature scaling is to ensure that the features are given equal importance in a model. You’re trying to capture how different a variable is from its typical value and whether or not this relates to some other value, the one that you’re trying to predict.

The text discusses normalization. You’ve probably seen this calculation in your stats class.

\[\begin{align} \text{Scaled Feature Value} = \frac{V - \mu}{\sigma} \end{align}\]

where V is the value of an observation, \(\mu\) is the mean of all of the observations, and \(\sigma\) is the standard deviation of the values. This is called a Z-score normalization.

You can also do min-max scaling.

\[\begin{align} \text{Scaled Feature Value} = \frac{V - \min(V)}{\max(V) - \min(V)} \end{align}\]
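As a quick sketch of both formulas on a toy vector (plain NumPy, not code from Hull):

```python
import numpy as np

# Toy feature: five observations of a single variable
v = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Z-score normalization: subtract the mean, divide by the standard deviation
z = (v - v.mean()) / v.std(ddof=1)  # ddof=1 matches pandas' sample std

# Min-max scaling: squeeze every value into [0, 1]
mm = (v - v.min()) / (v.max() - v.min())

print(z.round(3))  # centered at 0, with a sample std of 1
print(mm)          # 0 at the minimum, 1 at the maximum
```

We will do exactly this to the country risk data below, both by hand and with scikit-learn's built-in scaler.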

Finally, note the use of a training set and a validation set. We are going to start splitting our data up as we use machine learning techniques to make predictions. We use these data sets to build our model and, in particular, to tune hyperparameters, a subject we’ll get to when discussing regression and logit in Chapter 3 of Hull. We’ll also create a testing set, where we take the trained model and see how well it predicts values using data we’ve held out and not used.

Hull points out that, when scaling your data, you should pull the means and standard deviations that you need from the training data. Then, you scale your testing data using the means and standard deviations from the training data. That way, your testing data never touches your training data. If you use the means and standard deviations from all of your data to scale your variables, then observations in your testing data set are altering observations in your training data set, and vice-versa.
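A minimal sketch of that workflow with scikit-learn's StandardScaler, using made-up training and testing arrays:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(80, 3))  # pretend training features
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))   # held-out testing features

# Fit on the training data only: the means and stds come from X_train
scaler = StandardScaler().fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # scaled with the *training* statistics

# The training data is now exactly mean 0, std 1; the test data is only
# approximately so, which is the point -- it never influenced the scaler
print(X_train_scaled.mean(axis=0).round(6))
```

If you instead called `fit` on all of the data, information from the test set would leak into the scaling of the training set.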

Additional materials on feature engineering#

Here are some examples from Datacamp on feature engineering for a Kaggle competition. Again, this is all about variable transformation and, more generally, data engineering. For example, take a look at the part on binning data. This is like when we created categories for house sizes in our problem sets.

Here’s an entire book on feature engineering. It is a work in progress, but gives you a sense for the types of transformations that you can make and why you might want to make them.

The Feature Engineering book has a section on normalization.

K-means Algorithm#

Clustering is about grouping our data. How are some observations like others? To do this, we are going to use the concept of a distance between observations. Are some observations closer to others, in some sense?

We are going to use KMeans from scikit-learn.

Our text discusses Euclidean distance on pgs. 25 and 26. This is like measuring the length of a line, except that you can do this in any number of dimensions.

There is also the concept of a center of a cluster. Suppose that a certain set of observations is regarded as a cluster. The center is calculated by averaging the values of each of the features for the observations in the cluster.
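Both ideas fit in a few lines of NumPy; a toy sketch:

```python
import numpy as np

# Two observations with three features each
a = np.array([1.0, 2.0, 2.0])
b = np.array([4.0, 6.0, 2.0])

# Euclidean distance: the straight-line length, in any number of dimensions
dist = np.sqrt(np.sum((a - b) ** 2))
print(dist)  # 5.0, since sqrt(3^2 + 4^2 + 0^2) = 5

# The center of a cluster is the feature-by-feature average of its members
cluster = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
center = cluster.mean(axis=0)
print(center)  # [3. 4.]
```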

../_images/10-kmeansalgo.png

Fig. 58 How the algo works. From Hull, Chapter 2.#

How do you know if your clustering method is any good? A measure of the performance of a clustering algorithm is the within-cluster sum of squares, also known as the inertia. Define \(d_i\) as the distance of the ith observation from the center of the cluster to which it belongs.

\[\begin{align} \text{Inertia} = \sum_{i=1}^{N} d_i^2 \end{align}\]

The larger the inertia value, the worse the fit. Now, you can always get smaller inertia by adding more clusters. In fact, each observation could be its own cluster! But, that defeats the purpose. At some point, there’s an “optimal” number of clusters to try and divide your data into.
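To connect the formula to the library, here's a sketch (on synthetic data, not the country risk data yet) that recomputes scikit-learn's `inertia_` by hand:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))  # synthetic stand-in for our feature matrix

km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

# Inertia by hand: squared distance from each point to its assigned center, summed
d2 = np.sum((X - km.cluster_centers_[km.labels_]) ** 2, axis=1)
print(np.isclose(d2.sum(), km.inertia_))  # the two calculations agree
```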

The example below will use the elbow method. Sometimes, there’s just a natural number of clusters that you’re looking for.

The text also discusses the silhouette method and gap method for choosing the optimal number of clusters. You want larger silhouette scores as you vary the number of clusters, k. You’ll also see an example of this below.

Finally, the text discusses the curse of dimensionality. Basically, as you have more and more features, forming clusters gets difficult. There are other ways to cluster your data, beyond Euclidean Distance. For example, the text mentions cosine similarity.
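Cosine similarity compares the direction of two feature vectors rather than the distance between them. A small sketch (the helper function here is mine, not from the text):

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: 1 means same direction,
    # 0 means orthogonal, -1 means opposite directions
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
w = np.array([2.0, 0.0])

print(cosine_similarity(u, v))  # 0.0 -- orthogonal
print(cosine_similarity(u, w))  # 1.0 -- same direction, despite different lengths
```

Because it ignores vector length, cosine similarity can behave better than Euclidean distance in high-dimensional settings.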

In the example below, we’ll start with four features. But, two of them are highly correlated, so we drop one. This is where looking at and understanding your data first is crucial.

The intuition#

Here is a simple way to think about K-means. Imagine you’re sorting countries into groups based on their risk characteristics. You don’t know ahead of time which countries belong together — you just have the data and a guess about how many groups (k) there should be.

K-means works like this:

  1. Start: Drop k pins randomly in the space of your data. These are the initial cluster centroids.

  2. Assign: Each country looks at all k centroids and says “I belong to the one closest to me.”

  3. Update: Each centroid moves to the average position of all the countries assigned to it.

  4. Repeat: Keep assigning and updating until no country switches groups.

The algorithm converges when the centroids stop moving. The result is k groups where each member is closer to its own group’s center than to any other group’s center.
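The four steps above can be sketched as a toy NumPy implementation. Treat this as illustration only: scikit-learn's KMeans adds smarter k-means++ initialization and multiple restarts.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Bare-bones K-means: drop pins, assign, update, repeat."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 1: random start
    for _ in range(n_iter):
        # Step 2 (assign): each observation joins its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3 (update): each centroid moves to the mean of its members
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4 (repeat) until the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs of synthetic points
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(5, 0.1, (10, 2))])
labels, centers = kmeans_sketch(X, k=2)
```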

The challenge is choosing k — how many clusters? We’ll use the elbow method and silhouette scores below to help with this decision.

An example with country risk#

I have re-created the code for this section below. We are now using the scikit-learn package. This is where the KMeans algo lives. I have uploaded the raw country risk data to our course Github page.

You can install scikit-learn with pip.


pip install -U scikit-learn

There is a bug with a package called threadpoolctl and scikit-learn. threadpoolctl needs to be version 3 or higher. This may not be an issue when running things in Github Codespaces, but it was on my local Mac install. You can get a later version by specifying the one that you want to install.

 pip install threadpoolctl==3.2.0

# Our set-up

import pandas as pd
import numpy as np

# plotting packages
%matplotlib inline
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as clrs

# Kmeans algorithm from scikit-learn
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.preprocessing import StandardScaler
# load raw data

raw = pd.read_csv('https://raw.githubusercontent.com/aaiken1/fin-data-analysis-python/main/data/Country_Risk_2019_Data.csv')

# check the raw data
print("Size of the dataset (row, col): ", raw.shape)
print("\nFirst 5 rows\n", raw.head(n=5))
Size of the dataset (row, col):  (121, 6)

First 5 rows
      Country Abbrev  Corruption  Peace  Legal  GDP Growth
0    Albania     AL          35  1.821  4.546       2.983
1    Algeria     DZ          35  2.219  4.435       2.553
2  Argentina     AR          45  1.989  5.087      -3.061
3    Armenia     AM          42  2.294  4.812       6.000
4  Australia     AU          77  1.419  8.363       1.713

Simple exploratory analysis#

Plot histogram#

Note that the distribution for GDP Growth is quite skewed.

# plot histograms in a 2x2 grid
fig, axes = plt.subplots(2, 2, figsize=(10, 8))

raw['Corruption'].plot(kind='hist', title='Corruption', alpha=0.6, color='steelblue', edgecolor='white', ax=axes[0, 0])
raw['Peace'].plot(kind='hist', title='Peace', alpha=0.6, color='steelblue', edgecolor='white', ax=axes[0, 1])
raw['Legal'].plot(kind='hist', title='Legal', alpha=0.6, color='steelblue', edgecolor='white', ax=axes[1, 0])
raw['GDP Growth'].plot(kind='hist', title='GDP Growth', alpha=0.6, color='steelblue', edgecolor='white', ax=axes[1, 1])

plt.tight_layout()
plt.show()
../_images/73b5bca379d4711f47e93c252ead948de737a87900b5a43cc3d441fa632d4294.png

K means cluster#

Pick features & normalization#

Since Corruption and Legal are highly correlated, we drop the Corruption variable. That is, we pick three features for this analysis: Peace, Legal, and GDP Growth. Let’s normalize all the features, effectively giving them equal weight.

Ref. Feature normalization.

df_features = raw[['Peace', 'Legal', 'GDP Growth']]
df_features_scaled = (df_features - df_features.mean()) / df_features.std()
df_features_scaled.head(5)
Peace Legal GDP Growth
0 -0.390081 -0.878158 0.126952
1 0.472352 -0.958948 -0.040772
2 -0.026039 -0.484397 -2.230541
3 0.634871 -0.684553 1.303747
4 -1.261182 1.900001 -0.368418

I’ve included some code to drop missing values, just in case. KMeans won’t run if any row has a missing value. Of course, if there were missing values, you’d want to understand why.

df_features_scaled = df_features_scaled.dropna()
df_features_scaled.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 121 entries, 0 to 120
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Peace       121 non-null    float64
 1   Legal       121 non-null    float64
 2   GDP Growth  121 non-null    float64
dtypes: float64(3)
memory usage: 3.0 KB

See, no missing values. Still 121 observations.

scikit-learn comes with built-in scalers. Here, I’m scaling the feature data in a similar manner. The output is an array, not a DataFrame. The numbers differ slightly from the pandas version because StandardScaler divides by the population standard deviation (ddof=0), while the pandas .std() method uses the sample standard deviation (ddof=1) by default.

df_features_scaled2 = StandardScaler().fit_transform(df_features)
df_features_scaled2[:5,:3]
array([[-0.39170288, -0.88180929,  0.12747948],
       [ 0.47431618, -0.96293527, -0.04094156],
       [-0.02614709, -0.48641154, -2.23981529],
       [ 0.63751073, -0.68739931,  1.30916848],
       [-1.26642564,  1.90790093, -0.3699501 ]])

Perform elbow method#

This is a way to visualize how good the fit is. There are decreasing returns to adding more clusters. The marginal gain from adding one more cluster drops quite a bit from k=3 to k=4, so we will choose k=3 (though it’s not clear cut).

Ref. Determining the number of clusters in a dataset.

Ks = range(1, 10)
inertia = [KMeans(n_clusters=i, n_init='auto', random_state=42).fit(df_features_scaled).inertia_ for i in Ks]

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(Ks, inertia, '-bo', markersize=8)
ax.set_xlabel('Number of Clusters (k)')
ax.set_ylabel('Inertia (Within-Cluster Sum of Squares)')
ax.set_title('Elbow Method: Choosing the Number of Clusters')
ax.set_xticks(range(1, 10))

plt.tight_layout()
plt.show()
../_images/a74fd13195eabe2d59ce8068041b7d04f65e5b5634d9c56551fa99dd747c7328.png

K means with k=3#

Each country is getting labeled now. There are three categories (k=3).

k = 3
kmeans = KMeans(n_clusters=k, n_init='auto', random_state=1)
kmeans.fit(df_features_scaled)

# print inertia & cluster center
print("inertia for k=3 is", kmeans.inertia_)
print("cluster centers: ", kmeans.cluster_centers_)

# take a quick look at the result
y = kmeans.labels_
print("cluster labels: ", y)
inertia for k=3 is 161.2893804348309
cluster centers:  [[ 1.22973303 -0.68496051 -0.94125294]
 [ 0.16803028 -0.59729967  0.69048743]
 [-0.85097477  1.02149992 -0.23897931]]
cluster labels:  [1 1 0 1 2 2 1 0 1 2 1 1 1 2 0 1 0 1 2 0 2 1 0 2 1 2 2 0 2 1 0 1 1 2 1 2 2
 1 1 2 1 1 1 1 2 2 1 1 0 2 0 2 2 2 2 1 1 2 2 2 0 0 2 1 1 2 1 1 2 0 1 1 1 1
 1 2 2 0 0 2 2 0 1 0 1 1 2 2 2 2 0 1 0 1 1 1 2 2 2 0 2 1 2 2 2 1 1 1 0 1 0
 1 0 2 2 2 2 1 0 1 0]

Visualize the result (3D plot)#

Three groups of countries, clustered along three dimensions of risk.

Before you make any graph, think about what you want to show and why. Sketch it on a piece of paper. You can use ChatGPT to help you with syntax once you know what you want.

# set up the color
norm = clrs.Normalize(vmin=0., vmax=y.max() + 0.8)
cmap = cm.viridis

fig = plt.figure(figsize=(9, 7))
ax = fig.add_subplot(111, projection='3d')

ax.scatter(df_features_scaled.iloc[:, 0], df_features_scaled.iloc[:, 1], df_features_scaled.iloc[:, 2],
           c=cmap(norm(y)), marker='o', s=40)

centers = kmeans.cluster_centers_
ax.scatter(centers[:, 0], centers[:, 1], centers[:, 2], c='black', s=150, alpha=0.7, marker='X')

ax.set_xlabel('Peace')
ax.set_ylabel('Legal')
ax.set_zlabel('GDP Growth')
ax.set_title('K-Means Clusters (k=3) in 3D')

plt.tight_layout()
plt.show()
../_images/a4ea466781b45dc0aa55d8fbbb3d02a8dd8ad85237379720a6068b6a1bd82262.png
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

figs = [(0, 1), (0, 2), (1, 2)]
labels = ['Peace', 'Legal', 'GDP Growth']

for i, ax in enumerate(axes):
    ax.scatter(df_features_scaled.iloc[:, figs[i][0]], df_features_scaled.iloc[:, figs[i][1]],
               c=cmap(norm(y)), s=50, alpha=0.7)
    ax.scatter(centers[:, figs[i][0]], centers[:, figs[i][1]], c='black', s=200, alpha=0.5, marker='X')
    ax.set_xlabel(labels[figs[i][0]])
    ax.set_ylabel(labels[figs[i][1]])
    ax.set_title(f'{labels[figs[i][0]]} vs {labels[figs[i][1]]}')

plt.tight_layout()
plt.show()
../_images/6567492a13758dc48177bf5828891b76a82e71ffd2e02d15ef19e31c47592129.png

Visualize the result (3 2D plots)#

Let’s plot country abbreviations instead of dots. By being careful with our graphic design, we can show dimensionality in a variety of ways.

figs = [(0, 1), (0, 2), (1, 2)]
labels = ['Peace', 'Legal', 'GDP Growth']
colors = ['blue', 'green', 'red']

for i in range(3):
    fig, ax = plt.subplots(figsize=(8, 8))
    x_1 = figs[i][0]
    x_2 = figs[i][1]
    ax.scatter(df_features_scaled.iloc[:, x_1], df_features_scaled.iloc[:, x_2], c=y, s=0, alpha=0)
    ax.scatter(centers[:, x_1], centers[:, x_2], c='black', s=200, alpha=0.5, marker='X')
    for j in range(df_features_scaled.shape[0]):
        ax.text(df_features_scaled.iloc[j, x_1], df_features_scaled.iloc[j, x_2], raw['Abbrev'].iloc[j],
                color=colors[y[j]], weight='semibold', fontsize=8,
                horizontalalignment='center', verticalalignment='center')
    ax.set_xlabel(labels[x_1])
    ax.set_ylabel(labels[x_2])
    ax.set_title(f'Country Clusters: {labels[x_1]} vs {labels[x_2]}')
    plt.tight_layout()

plt.show()
../_images/bd39c3154d8cad767933a2e4bbbfce6b3a85f76962709661f50295dee78ccef8.png ../_images/dacab3d4fa349caa3dc5cfd934071570f8ba07015b136f8c8a305207b3779be3.png ../_images/485e6e88e4c324700b4ea848514e1561011bfa6f18965d0aa86a98d3a5d4e566.png

List the result#

Let’s see how every country got classified. Who are in the same groups? Does this grouping make sense?

result = pd.DataFrame({'Country':raw['Country'], 'Abbrev':raw['Abbrev'], 'Label':y})
with pd.option_context('display.max_rows', None, 'display.max_columns', 3):
    print(result.sort_values('Label'))
                          Country Abbrev  Label
60                        Lebanon     LB      0
30                        Ecuador     EC      0
48                           Iran     IR      0
50                         Israel     IL      0
61                        Liberia     LR      0
69                         Mexico     MX      0
77                      Nicaragua     NI      0
78                        Nigeria     NG      0
27   Democratic Republic of Congo     CD      0
81                       Pakistan     PK      0
90                         Russia     RU      0
92                   Saudi Arabia     SA      0
99                   South Africa     ZA      0
108           Trinidad and Tobago     TT      0
110                        Turkey     TR      0
112                       Ukraine     UA      0
118                         Yemen     YE      0
83                       Paraguay     PY      0
22                       Colombia     CO      0
120                      Zimbabwe     ZW      0
7                         Bahrain     BH      0
2                       Argentina     AR      0
14                         Brazil     BR      0
19                           Chad     TD      0
16                        Burundi     BI      0
82                         Panama     PA      1
10                          Benin     BJ      1
11                        Bolivia     BO      1
74                          Nepal     NP      1
73                     Mozambique     MZ      1
71                     Montenegro     ME      1
8                      Bangladesh     BD      1
12         Bosnia and Herzegovina     BA      1
67                     Mauritania     MR      1
66                           Mali     ML      1
64                         Malawi     MW      1
72                        Morocco     MA      1
84                           Peru     PE      1
70                        Moldova     FM      1
63                     Madagascar     MG      1
1                         Algeria     DZ      1
117                       Vietnam     VI      1
111                        Uganda     UG      1
3                         Armenia     AM      1
109                       Tunisia     TN      1
107          The FYR of Macedonia     MK      1
85                    Philippines     PH      1
106                      Thailand     TH      1
101                     Sri Lanka     LK      1
95                   Sierra Leone     SL      1
94                         Serbia     RS      1
93                        Senegal     SN      1
6                      Azerbaijan     AZ      1
91                         Rwanda     RW      1
105                      Tanzania     TZ      1
119                        Zambia     ZM      1
0                         Albania     AL      1
29             Dominican Republic     DO      1
37                          Gabon     GA      1
56                          Kenya     KE      1
38                        Georgia     GE      1
40                          Ghana     GH      1
41                         Greece     GR      1
31                          Egypt     EG      1
42                      Guatemala     GT      1
43                       Honduras     HN      1
32                    El Salvador     SV      1
17                       Cameroon     CM      1
46                          India     IN      1
47                      Indonesia     ID      1
34                       Ethiopia     ET      1
21                          China     CN      1
24                        Croatia     HR      1
55                     Kazakhstan     KZ      1
15                       Bulgaria     BG      1
4                       Australia     AU      2
96                      Singapore     SG      2
97                       Slovakia     SK      2
18                         Canada     CA      2
98                       Slovenia     SI      2
5                         Austria     AT      2
100                         Spain     ES      2
116                       Uruguay     UY      2
115                 United States     US      2
102                        Sweden     SE      2
103                   Switzerland     CH      2
113          United Arab Emirates     AE      2
23                     Costa Rica     CR      2
104                        Taiwan     SY      2
28                        Denmark     DK      2
25                         Cyprus     CY      2
26                 Czech Republic     CZ      2
114                United Kingdom     GB      2
57                  Korea (South)     KI      2
88                          Qatar     QA      2
35                        Finland     FI      2
58                         Kuwait     KW      2
59                         Latvia     LV      2
13                       Botswana     BW      2
62                      Lithuania     LT      2
54                         Jordan     JO      2
53                          Japan     JP      2
65                       Malaysia     MY      2
52                        Jamaica     JM      2
51                          Italy     IT      2
68                      Mauritius     MU      2
49                        Ireland     IE      2
45                        Iceland     IS      2
44                        Hungary     HU      2
75                    Netherlands     NL      2
76                    New Zealand     NZ      2
79                         Norway     NO      2
80                           Oman     OM      2
9                         Belgium     BE      2
39                        Germany     DE      2
86                         Poland     PL      2
87                       Portugal     PT      2
89                        Romania     RO      2
36                         France     FR      2
33                        Estonia     EE      2
20                          Chile     CL      2

Silhouette Scores#

A larger score is better. Note how scikit-learn has these features built-in. We don’t have to do the calculation ourselves.

# Silhouette Analysis
range_n_clusters = [2, 3, 4, 5, 6, 7, 8, 9, 10]

for n_clusters in range_n_clusters:
    clusterer = KMeans(n_clusters=n_clusters, n_init='auto', random_state=1)
    cluster_labels = clusterer.fit_predict(df_features_scaled)
    silhouette_avg = silhouette_score(df_features_scaled, cluster_labels)
    print(f"For n_clusters = {n_clusters}, the average silhouette score is: {silhouette_avg:.4f}")
For n_clusters = 2, the average silhouette score is: 0.3509
For n_clusters = 3, the average silhouette score is: 0.3529
For n_clusters = 4, the average silhouette score is: 0.3163
For n_clusters = 5, the average silhouette score is: 0.2770
For n_clusters = 6, the average silhouette score is: 0.3358
For n_clusters = 7, the average silhouette score is: 0.3110
For n_clusters = 8, the average silhouette score is: 0.3501
For n_clusters = 9, the average silhouette score is: 0.3460
For n_clusters = 10, the average silhouette score is: 0.3173

The silhouette scores are relatively flat across different values of k, hovering around 0.31-0.35. This tells us that the data doesn’t have dramatically distinct clusters — the boundaries between groups are somewhat fuzzy. That’s actually common with real-world data, especially country-level economic indicators where there’s a continuum from low-risk to high-risk.

The highest silhouette score is at k=3, which is consistent with our elbow method choice. A silhouette score ranges from -1 to 1:

  • Close to 1: The observation is well-matched to its own cluster and poorly matched to neighboring clusters

  • Close to 0: The observation is on the border between two clusters

  • Negative: The observation may have been assigned to the wrong cluster

A score around 0.35 is moderate — it suggests meaningful but not perfectly separated clusters. In finance, perfectly separated clusters are rare. Markets and economies exist on a spectrum.
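One thing worth knowing: the average silhouette score is literally the mean of the per-observation scores from silhouette_samples, so you can inspect individual border cases. A sketch on synthetic overlapping blobs (made-up data, standing in for our fuzzy country clusters):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(0)
# Two overlapping blobs -- fuzzy boundaries, like countries on a risk continuum
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.5, 1.0, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)

per_obs = silhouette_samples(X, labels)  # one score per observation
avg = silhouette_score(X, labels)        # the single number we printed above
print(np.isclose(per_obs.mean(), avg))   # the overall score is just the mean
print(per_obs.min().round(3))            # the worst-matched (border) observation
```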

Principal Component Analysis (PCA)#

Chapter 2 ends with a discussion of principal component analysis (PCA). PCA is one of the most important tools in data analysis, especially in finance.

PCA takes the data features and creates factors from the data that are uncorrelated with each other. From pg. 42:

PCA works best on normalized data. The first factor accounts for as much of the variability in the data as possible. Each succeeding factor then accounts for as much of the remaining variability in the data subject to the condition that it is uncorrelated to preceding factors. The quantity of a particular factor in a particular observation is the factor score.

What’s the intuition? Imagine you have a cloud of data points in 3D space, shaped like a football (an American football — elongated). If you wanted to summarize the shape of that cloud with just one direction, you’d pick the long axis — that’s where the data spreads out the most. That direction is your first principal component. The second component points across the width, capturing the next most variation while being perpendicular (uncorrelated) to the first. The third captures whatever’s left.

PCA does exactly this, but in any number of dimensions. It finds the directions (called principal components or eigenvectors) along which your data varies the most. The amount of variation captured by each component is the explained variance.

Why is this useful in finance? Suppose you have returns for 50 stocks. That’s 50 dimensions of data. PCA might reveal that just 3-4 principal components explain 80-90% of the variation in returns. Those components often correspond to interpretable factors — the overall market, sector rotation, value vs. growth. Instead of tracking 50 correlated stocks, you can focus on a few uncorrelated components that capture the essential behavior.

The key outputs from PCA are:

  • Components (loadings): The weights that define each principal component as a linear combination of the original features. They tell you what each component “means.”

  • Explained variance ratio: The fraction of total variance captured by each component. This tells you how important each component is.

  • Principal component scores: The transformed data — each observation’s coordinates in the new principal component space.

You can read more about how to do this with Scikit-learn PCA.

What is PCA really doing mathematically? You need some matrix algebra for that. Here’s another explainer, with the bare minimum of math.

This video was recommended by QuantyMacro.

PCA Explained

Let’s start with some simple PCA code using our scaled features.

from sklearn.decomposition import PCA

df_features_scaled_pca = raw[['Corruption', 'Peace', 'Legal', 'GDP Growth']]

df_features_scaled_pca = (df_features_scaled_pca - df_features_scaled_pca.mean()) / df_features_scaled_pca.std()
df_features_scaled_pca.head(5)
Corruption Peace Legal GDP Growth
0 -0.633230 -0.390081 -0.878158 0.126952
1 -0.633230 0.472352 -0.958948 -0.040772
2 -0.098542 -0.026039 -0.484397 -2.230541
3 -0.258948 0.634871 -0.684553 1.303747
4 1.612460 -1.261182 1.900001 -0.368418

The code below defines an object called pca. We then apply the method .fit to this object using the full set of scaled data - I didn’t drop Corruption this time.

pca = PCA(n_components=4)
pca.fit(df_features_scaled_pca)
PCA(n_components=4)
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("\nCumulative explained variance:", np.cumsum(pca.explained_variance_ratio_))

# Create a loadings DataFrame for easier interpretation
features = ['Corruption', 'Peace', 'Legal', 'GDP Growth']
loadings = pd.DataFrame(
    pca.components_.T,
    columns=['PC1', 'PC2', 'PC3', 'PC4'],
    index=features
)

print("\nPCA Loadings (how each feature contributes to each component):\n")
print(loadings.round(3))
Explained variance ratio: [0.64020312 0.25059185 0.09438808 0.01481695]

Cumulative explained variance: [0.64020312 0.89079497 0.98518305 1.        ]

PCA Loadings (how each feature contributes to each component):

              PC1    PC2    PC3    PC4
Corruption  0.602  0.015  0.328  0.728
Peace      -0.524 -0.201  0.825  0.065
Legal       0.594 -0.022  0.425 -0.683
GDP Growth -0.103  0.979  0.174 -0.013

Let’s interpret these results:

  • explained_variance_ratio_ tells us how much of the total variance each component captures. The first component explains about 64% of the variance, and the first two together explain about 89%. This means we could reduce our four features to just two principal components and still retain most of the information.

  • The loadings show how each original feature contributes to each principal component. PC1 loads positively on Corruption and Legal, and negatively on Peace. Remember how these indices are scored: a higher Corruption value means less corruption, and a lower Peace value means more peaceful. So countries with strong legal systems, little corruption, and more peace all score high on PC1, making it essentially a “governance quality” factor.

  • PC2 loads almost entirely on GDP Growth (loading of 0.98). This component captures economic dynamism, independent of governance.

Note how the signs can be reversed and it doesn’t matter, as Hull mentions. What matters is the relative magnitude and direction of the loadings.
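You can verify the sign indeterminacy directly: flipping the signs of a component's loadings and its scores leaves the data reconstruction unchanged. A small sketch with synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

pca = PCA(n_components=3).fit(X)
scores = pca.transform(X)

# Reconstruction multiplies scores by loadings, so negating BOTH cancels out
recon = scores @ pca.components_ + pca.mean_
recon_flipped = (-scores) @ (-pca.components_) + pca.mean_

print(np.allclose(recon, X))             # full-rank PCA reconstructs the data
print(np.allclose(recon, recon_flipped)) # sign flip changes nothing
```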

# Visualize explained variance
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Bar chart of individual explained variance
ax1.bar(range(1, 5), pca.explained_variance_ratio_, alpha=0.7, color='steelblue')
ax1.set_xlabel('Principal Component')
ax1.set_ylabel('Explained Variance Ratio')
ax1.set_title('Variance Explained by Each Component')
ax1.set_xticks(range(1, 5))

# Cumulative explained variance
cumulative = np.cumsum(pca.explained_variance_ratio_)
ax2.plot(range(1, 5), cumulative, '-bo')
ax2.set_xlabel('Number of Components')
ax2.set_ylabel('Cumulative Explained Variance')
ax2.set_title('Cumulative Variance Explained')
ax2.set_xticks(range(1, 5))
ax2.axhline(y=0.9, color='r', linestyle='--', alpha=0.5, label='90% threshold')
ax2.legend()

plt.tight_layout()
plt.show()
[Figure: variance explained by each component (left) and cumulative explained variance with a 90% threshold line (right)]

The left chart is a scree plot — a bar chart of how much variance each principal component explains. PC1 dominates, capturing about 64% of the total variance. PC2 adds another 25%. Together, the first two components explain about 89% of the variation in our four features.

The right chart shows the cumulative explained variance. The red dashed line marks 90%. Two components get us very close to that threshold. In practice, you look at this chart and decide how many components to keep. A common rule of thumb is to keep enough components to explain 80-90% of the variance.
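That rule of thumb can be applied in code two ways: count components from the cumulative ratio yourself, or pass a float in (0, 1) to scikit-learn's `PCA`, which picks the number of components for you. A minimal sketch on synthetic data (not the country risk dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Synthetic stand-in: 200 observations of 4 correlated features
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(200, 4))

# Option 1: fit all components, then count how many reach 90%
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_keep = int(np.argmax(cumulative >= 0.90)) + 1

# Option 2: let scikit-learn apply the same threshold
pca_90 = PCA(n_components=0.90).fit(X)

print(n_keep, pca_90.n_components_)  # both apply the 90% rule
```

Either way, the threshold (80%, 90%, …) is a judgment call, not a law.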

Let’s also visualize the countries in this reduced principal component space.

# Transform data to principal component space
pc_scores = pca.transform(df_features_scaled_pca)

# Plot first two principal components
fig, ax = plt.subplots(figsize=(10, 7))
ax.scatter(pc_scores[:, 0], pc_scores[:, 1], alpha=0.5)

# Label selected countries for context
highlight = ['US', 'CN', 'RU', 'BR', 'IN', 'JP', 'DE', 'GB', 'AU', 'ZW', 'YE', 'CH', 'SG', 'CD']
for i in range(len(raw)):
    if raw['Abbrev'].iloc[i] in highlight:
        ax.annotate(raw['Abbrev'].iloc[i], (pc_scores[i, 0], pc_scores[i, 1]),
                     fontsize=9, fontweight='bold')

ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance explained)')
ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance explained)')
ax.set_title('Countries in Principal Component Space')
ax.axhline(y=0, color='grey', linestyle='--', alpha=0.3)
ax.axvline(x=0, color='grey', linestyle='--', alpha=0.3)

plt.tight_layout()
plt.show()
[Figure: scatter plot of countries in PC1–PC2 space, with selected countries labeled]

This scatter plot shows each country positioned by its first two principal component scores. PC1 (the x-axis) is mostly about governance quality — countries on the left have higher corruption and weaker legal systems, while countries on the right have better governance. PC2 (the y-axis) is mostly about GDP growth — countries lower on the plot are growing faster.

You can see natural groupings emerge. The developed nations (US, GB, JP, AU, CH, SG, DE) cluster on the right — strong governance but varying growth rates. Countries like Zimbabwe (ZW), Yemen (YE), and DR Congo (CD) are on the left — weaker governance. China (CN) and India (IN) sit in the middle on governance but tend toward higher GDP growth.

This is the power of PCA: by reducing four dimensions to two, we can visualize the essential structure of the data and see relationships that would be hard to spot in the raw numbers.

Hull notes several uses of PCA. In particular, we can use it to shrink the number of features in our models by combining correlated features into a smaller set of components.

PCA is sometimes used in supervised learning as well as unsupervised learning. We can use it to replace a long list of features with a much smaller list of manufactured features derived from a PCA. The manufactured features are chosen so that they account for most of the variability in the data we are using for prediction, and they have the nice property that they are uncorrelated with each other.
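One way to set this up is a scikit-learn pipeline: scale the features, compress them with PCA, and fit a model on the manufactured features. A hedged sketch on synthetic data, where the target depends on a latent factor that drives many of the raw features:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
# 10 noisy features secretly driven by 2 latent factors
latent = rng.normal(size=(150, 2))
X = latent @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(150, 10))
y = latent[:, 0] + 0.1 * rng.normal(size=150)

# Scale -> keep 3 uncorrelated manufactured features -> regress
model = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
model.fit(X, y)
print(round(model.score(X, y), 3))  # high R²: the PCs capture the latent drivers
```

Because the latent structure is low-dimensional, a handful of principal components carries almost all of the predictive information in the ten raw features.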

In our country risk example, PCA reduced four correlated features into four uncorrelated components. But the first two components capture about 89% of the total variance. This means we could represent each country with just two numbers (their PC1 and PC2 scores) instead of four, while losing very little information.

This dimensionality reduction becomes even more powerful with larger datasets. Imagine working with hundreds of financial variables — PCA can boil them down to a manageable number of uncorrelated factors that capture the essential patterns.
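To make that concrete, here is a small sketch (again on synthetic data) compressing 100 variables down to 5 components and measuring how little information the round trip loses:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
# 300 observations of 100 variables that secretly share 5 drivers
drivers = rng.normal(size=(300, 5))
X = drivers @ rng.normal(size=(5, 100)) + 0.1 * rng.normal(size=(300, 100))

pca = PCA(n_components=5).fit(X)
Z = pca.transform(X)            # each row: 100 numbers -> 5 numbers
X_back = pca.inverse_transform(Z)  # approximate reconstruction

# Fraction of total variance lost by keeping only 5 components
lost = 1 - pca.explained_variance_ratio_.sum()
print(Z.shape, round(lost, 4))
```

When the data really is driven by a few common factors, the compression is nearly lossless.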

One common application is using PCA to shrink the number of factors in a variance-covariance matrix. This is especially relevant in portfolio risk management, where estimating a large covariance matrix directly is unreliable with limited data.
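A minimal sketch of that idea, using simulated factor-driven returns (asset and factor counts are illustrative): eigen-decompose the sample covariance matrix and keep only the top few components, which yields a low-rank approximation that is more stable than the full matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
# 250 days of returns on 20 assets driven by 3 common factors
factors = rng.normal(size=(250, 3))
betas = rng.normal(size=(3, 20))
returns = factors @ betas + 0.2 * rng.normal(size=(250, 20))

cov = np.cov(returns, rowvar=False)

# Eigen-decompose; eigh returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Low-rank approximation from the top k principal components
k = 3
cov_k = eigvecs[:, :k] @ np.diag(eigvals[:k]) @ eigvecs[:, :k].T

# Share of total variance (trace) captured by the top k components
share = eigvals[:k].sum() / eigvals.sum()
print(round(share, 3))
```

With three genuine factors driving the returns, the top three eigenvalues carry most of the total variance, so the truncated matrix is a good, much lower-dimensional summary of the risk structure.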