4.7. Using SQL in Python#

SQL stands for Structured Query Language and is perhaps the number one tool that any data analyst needs to know. Well, after Excel I guess. Everyone assumes that you know Excel.

You can use SQL everywhere. This is the OG database language. There are many different iterations, but they all let you create databases and do the standard ETL (extract, transform, and load) processes. The key: there are standard SQL ways to do things. If you know the terms and ideas, you can use them across a variety of platforms and languages.

SQL is used for relational, structured databases, where there are keys that uniquely identify observations and where the data is the traditional column/variable format. Basically, the data that we are using.

Note

If you’re doing data science, business analytics, anything that involves something bigger than a financial model in Excel, learn SQL.

However, databases have changed a lot over the last 25 years or so, as companies deal with more and more data with limited structure. The tools for this kind of data are, broadly speaking, sometimes called NoSQL. This is beyond the scope of what we’re doing, but would be taught in any data engineering or databases course.

If you’d like to learn more about SQL, there are also some DataCamp tutorials.

And here’s the cheat sheet.

I found this page helpful for some syntax.

Coding for Economists discusses alternatives to pandas.

4.7.1. Pandas SQL#

We are going to use pandasql to use SQL in Python. Using this method will let us use standard SQL queries with our pandas DataFrames. If you already know SQL, you might like this better!

I’ll use pip to install pandasql. Don’t forget the `!` prefix (i.e. `!pip install pandasql`) in Google Colab.

pip install pandasql 

This is a much “lighter weight” way to use SQL. There are others, like pyspark from Apache Spark, that are a whole set of data engineering tools.

You’ll see my import statement below.

I’ll also bring in that NC Brewery data.

# Set-up

from pandasql import sqldf

import numpy as np
import pandas as pd

nc_df = pd.read_csv('https://raw.githubusercontent.com/aaiken1/fin-data-analysis-python/main/data/ncbreweries.csv')

From here, you can use the standard set of SQL queries. I won’t go over all of them. Let me give you a few examples, though, so that you can see what things look like.

Let’s start with the simplest possible query: select everything. Behind the scenes, sqldf copies the DataFrame into a temporary SQLite database, runs the query against it, and returns the result as a new DataFrame.

nc = sqldf("SELECT * FROM nc_df")

nc
Name City Type Beer Count Est Status URL
0 217 Brew Works Wilson Microbrewery 10 2017 Active https://www.ratebeer.com//brewers/217-brew-wor...
1 3rd Rock Brewing Company Trenton Microbrewery 12 2016 Active https://www.ratebeer.com//brewers/3rd-rock-bre...
2 7 Clans Brewing Cherokee Client Brewer 1 2018 Active https://www.ratebeer.com//brewers/7-clans-brew...
3 Andrews Brewing Company Andrews Microbrewery 18 2014 Active https://www.ratebeer.com//brewers/andrews-brew...
4 Angry Troll Brewing Elkin Microbrewery 8 2017 Active https://www.ratebeer.com//brewers/angry-troll-...
... ... ... ... ... ... ... ...
246 Sweet Taters Rocky Mount Brewpub/Brewery 9 2016 Closed https://www.ratebeer.com//brewers/sweet-taters...
247 Triangle Brewing Company Durham Microbrewery 21 2007 Closed https://www.ratebeer.com//brewers/triangle-bre...
248 White Rabbit Brewing (NC) Angier Microbrewery 19 2013 Closed https://www.ratebeer.com//brewers/white-rabbit...
249 Williamsville Brewery (formerly Wilmington Bre... Farmville Brewpub 5 2000 Closed https://www.ratebeer.com//brewers/williamsvill...
250 Wolf Beer Company Wilmington Client Brewer 2 2009 Closed https://www.ratebeer.com//brewers/wolf-beer-co...

251 rows × 7 columns

Can’t tell a difference from the output, right? If you’ve used SQL before, though, you should start to feel at home.

Here’s a standard query. Again, I’m not going over all of SQL here, but you can probably read this and figure out what it is doing. I print the query at the end. It is traditional in SQL to write keywords like SELECT and GROUP BY in all caps. One caveat: Type appears in the SELECT but not in the GROUP BY, so SQLite just picks one Type value for each city. Most other databases would reject this query.

q = """SELECT 
      Type as brewery_type, City as city, AVG("Beer Count") as avg_beer_count 
      FROM 
         nc 
      GROUP BY 
         City
      ORDER BY
         City;"""

print(q)
SELECT 
      Type as brewery_type, City as city, AVG("Beer Count") as avg_beer_count 
      FROM 
         nc 
      GROUP BY 
         City
      ORDER BY
         City;

We can then run this query and save it to a new DataFrame.

nc2 = sqldf(q)
nc2
brewery_type city avg_beer_count
0 Microbrewery Aberdeen 9.000000
1 Microbrewery Andrews 9.666667
2 Microbrewery Angier 10.000000
3 Microbrewery Apex 12.333333
4 Brewpub Arden 5.000000
... ... ... ...
97 Brewpub/Brewery West Jefferson 15.000000
98 Microbrewery Whitsett 12.000000
99 Brewpub/Brewery Wilmington 15.357143
100 Microbrewery Wilson 10.000000
101 Microbrewery Winston-Salem 50.400000

102 rows × 3 columns

type(nc2)
pandas.core.frame.DataFrame

In short, if you have SQL that you want to run, it is easy enough to copy and paste it into Python!
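If you’d rather not install anything extra, you can get the same effect with pandas and the standard library’s sqlite3 module, which is what pandasql uses under the hood anyway. Here’s a sketch of the same GROUP BY idea, again with a small made-up DataFrame rather than the full brewery data:

```python
import sqlite3
import pandas as pd

# Hypothetical mini-DataFrame for illustration
df = pd.DataFrame({
    'City': ['Durham', 'Durham', 'Wilson'],
    'Beer Count': [21, 9, 10],
})

con = sqlite3.connect(':memory:')     # temporary, in-memory SQLite database
df.to_sql('nc', con, index=False)     # load the DataFrame in as a table

q = 'SELECT City, AVG("Beer Count") AS avg_beer_count FROM nc GROUP BY City ORDER BY City;'
result = pd.read_sql_query(q, con)    # run the query, get a DataFrame back
print(result)

con.close()
```

The trade-off: you manage the connection and table loading yourself, but you only need pandas and the standard library.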