4.7. Using SQL in Python#

SQL stands for Structured Query Language and is perhaps the number one tool that any data analyst needs to know. Well, after Excel I guess. Everyone assumes that you know Excel.

You can use SQL everywhere. This is the OG database language. There are many different iterations, but they all let you create databases and do the standard ETL (extract, transform, and load) processes. The key: there are standard SQL ways to do things. If you know the terms and ideas, you can use them across a variety of platforms and languages.

SQL is used for relational, structured databases, where there are keys that uniquely identify observations and where the data is the traditional column/variable format. Basically, the data that we are using.

Note

If you’re doing data science, business analytics, anything that involves something bigger than a financial model in Excel, learn SQL.

However, databases have changed a lot over the last 25 years or so, as companies deal with more and more data with limited structure. The tools for this kind of data are, broadly speaking, sometimes called NoSQL. This is beyond the scope of what we’re doing, but would be taught in any data engineering or databases course.

If you’d like to learn more about SQL, there are also some DataCamp tutorials.

And here’s the cheat sheet.

I found this page helpful for some syntax.

Coding for Economists discusses alternatives to pandas.

4.7.1. Pandas SQL#

We are going to use pandasql to use SQL in Python. Using this method will let us use standard SQL queries with our pandas DataFrames. If you already know SQL, you might like this better!

I’ll use pip to install pandasql. Don’t forget the `!` prefix (i.e. `!pip install pandasql`) in Google Colab.

pip install pandasql 

This is a much “lighter weight” way to use SQL. There are others, like pyspark from Apache Spark, that are a whole set of data engineering tools.

You’ll see my import statement below.

I’ll also bring in that NC Brewery data.

# Set-up

from pandasql import sqldf

import numpy as np
import pandas as pd

nc_df = pd.read_csv('https://raw.githubusercontent.com/aaiken1/fin-data-analysis-python/main/data/ncbreweries.csv')

From here, you can use the standard set of SQL queries. I won’t go over all of them. Let me give you a few examples, though, so that you can see what things look like.

Let’s start with the simplest possible query: select everything. Behind the scenes, sqldf copies the DataFrame into a temporary SQLite database, runs the query against it, and returns the result as a new DataFrame.

nc = sqldf("SELECT * FROM nc_df")

nc
Name City Type Beer Count Est Status URL
0 217 Brew Works Wilson Microbrewery 10 2017 Active https://www.ratebeer.com//brewers/217-brew-wor...
1 3rd Rock Brewing Company Trenton Microbrewery 12 2016 Active https://www.ratebeer.com//brewers/3rd-rock-bre...
2 7 Clans Brewing Cherokee Client Brewer 1 2018 Active https://www.ratebeer.com//brewers/7-clans-brew...
3 Andrews Brewing Company Andrews Microbrewery 18 2014 Active https://www.ratebeer.com//brewers/andrews-brew...
4 Angry Troll Brewing Elkin Microbrewery 8 2017 Active https://www.ratebeer.com//brewers/angry-troll-...
... ... ... ... ... ... ... ...
246 Sweet Taters Rocky Mount Brewpub/Brewery 9 2016 Closed https://www.ratebeer.com//brewers/sweet-taters...
247 Triangle Brewing Company Durham Microbrewery 21 2007 Closed https://www.ratebeer.com//brewers/triangle-bre...
248 White Rabbit Brewing (NC) Angier Microbrewery 19 2013 Closed https://www.ratebeer.com//brewers/white-rabbit...
249 Williamsville Brewery (formerly Wilmington Bre... Farmville Brewpub 5 2000 Closed https://www.ratebeer.com//brewers/williamsvill...
250 Wolf Beer Company Wilmington Client Brewer 2 2009 Closed https://www.ratebeer.com//brewers/wolf-beer-co...

251 rows × 7 columns

Can’t tell a difference from the output, right? If you’ve used SQL before, though, you should start to feel at home.

Here’s a standard query. Again, I’m not going over all of SQL here, but you can probably read this and figure out what it is doing. I print the query at the end. It is traditional in SQL to write keywords like SELECT and GROUP BY in all caps. One caveat: Type appears in the SELECT but not in the GROUP BY, so SQLite just picks one Type value for each city. Most other databases would reject this query.

q = """SELECT 
      Type as brewery_type, City as city, AVG("Beer Count") as avg_beer_count 
      FROM 
         nc 
      GROUP BY 
         City
      ORDER BY
         City;"""

print(q)
SELECT 
      Type as brewery_type, City as city, AVG("Beer Count") as avg_beer_count 
      FROM 
         nc 
      GROUP BY 
         City
      ORDER BY
         City;

We can then run this query and save it to a new DataFrame.

nc2 = sqldf(q)
nc2
brewery_type city avg_beer_count
0 Microbrewery Aberdeen 9.000000
1 Microbrewery Andrews 9.666667
2 Microbrewery Angier 10.000000
3 Microbrewery Apex 12.333333
4 Brewpub Arden 5.000000
... ... ... ...
97 Brewpub/Brewery West Jefferson 15.000000
98 Microbrewery Whitsett 12.000000
99 Brewpub/Brewery Wilmington 15.357143
100 Microbrewery Wilson 10.000000
101 Microbrewery Winston-Salem 50.400000

102 rows × 3 columns

type(nc2)
pandas.core.frame.DataFrame

In short, if you have SQL that you want to run, it is easy enough to copy and paste it into Python!
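If you’d rather not install anything extra, you can get the same effect with pandas and the standard library’s sqlite3 module, which is what pandasql uses under the hood anyway. Here’s a sketch of the same GROUP BY idea, again with a small made-up DataFrame rather than the full brewery data:

```python
import sqlite3
import pandas as pd

# Hypothetical mini-DataFrame for illustration
df = pd.DataFrame({
    'City': ['Durham', 'Durham', 'Wilson'],
    'Beer Count': [21, 9, 10],
})

con = sqlite3.connect(':memory:')     # temporary, in-memory SQLite database
df.to_sql('nc', con, index=False)     # load the DataFrame in as a table

q = 'SELECT City, AVG("Beer Count") AS avg_beer_count FROM nc GROUP BY City ORDER BY City;'
result = pd.read_sql_query(q, con)    # run the query, get a DataFrame back
print(result)

con.close()
```

The trade-off: you manage the connection and table loading yourself, but you only need pandas and the standard library.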