Working with data#

Chapter 4 of our course notes covers how to store our data, index it, name columns, convert numpy arrays to pandas DataFrames, work with dates, and summarize our data, especially by group.

We’ll see how to merge and reshape our data. I’ll also discuss SQL and the polars library, two alternatives to pandas.

DataFrames are the main way that we are going to organize our data. They come from the pandas package, which, along with numpy, matplotlib, and a few others, forms the core of our Python and finance toolbox.

DataFrame is a class. A class is like a blueprint. We create a DataFrame object from this class. The object then comes with certain characteristics (its attributes) and things that we can do to it. We call the operations on an object methods.
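To make the blueprint idea concrete, here is a minimal sketch. The price numbers and the ticker names AAA and BBB are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three days of prices for two made-up tickers.
prices = np.array([[100.0, 50.0],
                   [101.0, 49.5],
                   [102.5, 51.0]])

# Calling the class (the blueprint) creates an object (an instance).
df = pd.DataFrame(prices, columns=["AAA", "BBB"])

# Attributes describe the object; no parentheses needed.
print(type(df))          # the class the object was built from
print(df.shape)          # number of rows and columns
print(list(df.columns))  # the column names

# Methods are operations on the object; they are called with parentheses.
avg_price = df.mean()    # column-by-column average, returned as a Series
print(avg_price)
```

Note the difference in syntax: `df.shape` is a characteristic of the object, while `df.mean()` does something with it.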

Note

This section of the notes contains perhaps the most important topics that we’ll cover.

This is also where we can start looking at our Hull textbook. Chapter 1 introduces machine learning, very generally. What types of questions can we ask? And, most importantly for this section, how should we think about the data that we need to answer these questions?

We are going to be using time series, like stock prices and returns. We’ll be using cross-sectional data, like firm or house characteristics at a single point in time. And we’ll use combinations of the two, sometimes called panel data.
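As a rough sketch of what these three shapes look like in pandas, consider the example below. The tickers, dates, and numbers are all hypothetical, and storing the panel in long format with a two-level index is just one common choice.

```python
import pandas as pd

# Time series: one entity observed over many dates.
dates = pd.date_range("2024-01-02", periods=3, freq="D")
time_series = pd.DataFrame({"date": dates, "ret": [0.01, -0.02, 0.005]})

# Cross-section: many entities at a single point in time.
cross_section = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC"],
    "market_cap": [10.0, 25.0, 5.0],
})

# Panel: many entities, each observed over many dates.
# Long format here: one row per ticker-date pair.
panel = pd.DataFrame({
    "ticker": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "date": list(dates) * 2,
    "ret": [0.01, -0.02, 0.005, 0.00, 0.015, -0.01],
}).set_index(["ticker", "date"])

print(panel)
print(panel.loc["AAA"])                  # the time series for one firm
print(panel.xs(dates[0], level="date"))  # the cross-section on one date
```

A panel contains both of the other shapes: fix the firm and you get a time series, fix the date and you get a cross-section.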

Hull starts with discussing the differences between machine learning and statistics. I view this course as a blend - we’re covering both. When we have theory guiding us, we’re more in the traditional economics and statistics camps. When we let the models run the show, we’re more in the world of machine learning. This is a simplification, but has some truth to it.

But, no matter what, we need to understand how to collect and organize our data.

For now, focus on Section 1.5 of Hull on data cleaning. Data cleaning will lead to a discussion of feature engineering, which is about constructing the variables that will go into our models.
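Here is a small sketch of how cleaning leads into feature engineering. The data is made up, and the cleaning choices (dropping duplicates and dropping rows with a missing price) are assumptions for illustration, not a recommendation from Hull.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with two common problems:
# a duplicated row and a missing price.
raw = pd.DataFrame({
    "date": pd.to_datetime([
        "2024-01-02", "2024-01-03", "2024-01-03", "2024-01-04", "2024-01-05",
    ]),
    "price": [100.0, 102.0, 102.0, np.nan, 101.0],
})

# Cleaning: drop exact duplicate rows, sort by date, drop rows with no price.
clean = (
    raw.drop_duplicates()
       .sort_values("date")
       .dropna(subset=["price"])
       .reset_index(drop=True)
)

# Feature engineering: build the variable a model would actually use.
# Here, the simple return from one available price to the next.
clean["ret"] = clean["price"].pct_change()

print(clean)
```

Notice that the cleaning choice affects the feature. Because the row with the missing price was dropped, the last return spans two calendar days instead of one.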

Note

Sketch your data! What do you have? What do you want it to look like? Then, go look for the syntax you need to get that done.

Using AI for pandas#

This is the chapter where AI assistance becomes incredibly valuable. The pandas library has hundreds of methods and many ways to accomplish the same task. You don’t need to memorize them all - you need to know what’s possible and be able to read the code that AI generates.

Why AI excels at pandas#

| Task | Why AI Helps |
| --- | --- |
| Syntax lookup | “How do I filter rows where column X > 5?” |
| Method selection | “Should I use merge, join, or concat?” |
| Reshaping data | “Convert this wide data to long format” |
| Cleaning pipelines | “Remove duplicates and fill missing values” |
| Complex aggregations | “Group by date and calculate multiple statistics” |
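To give a feel for what answers to these prompts tend to look like, here is one possible sketch on made-up data. The column names, the sector labels, and the choice to forward-fill missing values are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical wide data: one row per date, one column per ticker.
wide = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-03", "2024-01-04"]),
    "AAA": [1.0, 6.0, 6.0, np.nan],
    "BBB": [7.0, 2.0, 2.0, 9.0],
})

# Cleaning pipeline: remove duplicate rows, then fill missing values.
# Forward-filling is only one of several reasonable choices.
clean = wide.drop_duplicates().ffill()

# Reshaping: wide to long, one row per date-ticker pair.
long_df = clean.melt(id_vars="date", var_name="ticker", value_name="value")

# Syntax lookup: keep rows where a column is greater than 5.
big = long_df[long_df["value"] > 5]

# Complex aggregation: several statistics per date.
stats = long_df.groupby("date")["value"].agg(["mean", "min", "max"])

# Method selection: merge matches rows on a key column,
# while concat would simply stack tables on top of each other.
labels = pd.DataFrame({"ticker": ["AAA", "BBB"], "sector": ["Tech", "Energy"]})
merged = long_df.merge(labels, on="ticker", how="left")

print(stats)
print(merged)
```

Each line is short, but there are many alternatives for every step. That is why knowing what is possible matters more than memorizing the syntax.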

What you’ll see in AI-generated code#

AI-generated pandas code often uses method chaining - stringing multiple operations together with dots:

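# AI often writes code like this.
# Assumes df is an existing DataFrame with year, sector, and returns columns.
result = (df
    .query('year >= 2020')                  # keep rows from 2020 onward
    .groupby('sector')                      # form one group per sector
    .agg({'returns': ['mean', 'std']})      # mean and standard deviation of returns
    .reset_index()                          # turn sector back into a regular column
)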

This looks different from step-by-step code, but does the same thing. We’ll cover how to read these patterns in this chapter.
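To check that the two styles really match, here is a sketch that runs the same operations both ways on a small made-up DataFrame. The data values are hypothetical; the column names follow the chained example above.

```python
import pandas as pd

# Hypothetical data with the columns used in the chained example.
df = pd.DataFrame({
    "year": [2019, 2020, 2020, 2021, 2021],
    "sector": ["Tech", "Tech", "Energy", "Tech", "Energy"],
    "returns": [0.10, 0.05, -0.02, 0.15, 0.04],
})

# Step by step, saving each intermediate result under its own name.
recent = df.query("year >= 2020")
grouped = recent.groupby("sector")
summary = grouped.agg({"returns": ["mean", "std"]})
step_by_step = summary.reset_index()

# The same operations written as one chain.
chained = (
    df.query("year >= 2020")
      .groupby("sector")
      .agg({"returns": ["mean", "std"]})
      .reset_index()
)

print(chained)
print(step_by_step.equals(chained))  # True if the two results match exactly
```

Each method returns a new object, and the next method in the chain acts on that object. One thing to watch for: passing a list of statistics to `.agg()` produces two-level column names, such as `("returns", "mean")`.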

Tip

When asking AI for pandas help, be specific about your data structure. Say things like “I have a DataFrame with columns date, ticker, and price” rather than just “I have stock data.”
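One way to get that description is to let pandas produce it for you. This is only a sketch on a made-up DataFrame; the idea is to paste the printed output into your prompt.

```python
import pandas as pd

# Hypothetical DataFrame matching the description in the tip.
df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-02", "2024-01-03"]),
    "ticker": ["AAA", "AAA"],
    "price": [100.0, 101.5],
})

# Column names and data types are usually the most useful thing to share.
print(df.dtypes)

# A few rows show how the values are actually formatted.
print(df.head())

# The shape says how many rows and columns there are.
print(df.shape)
```

Sharing the data types matters because a date stored as text behaves very differently from a date stored as a datetime.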

Other important sources#

As always, my notes are not comprehensive - that’s an impossible task.

From Coding for Economists:

An excellent and concise guide for data manipulation in pandas.

The official pandas guide is also very helpful.

From Python Programming for Data Science:

  • Chapter 7 introduces pandas.

  • Chapter 8 is all about pandas.

From Python for Data Analysis:

As you can see, this is a lot of material! That’s because using pandas to import, organize, clean, and summarize our data is about 90% of all analysis work. We could spend the entire semester just working through these chapters. In fact, we have an entire Data Wrangling class that essentially does that!