## 4.7. Using SQL in Python
SQL stands for Structured Query Language and is perhaps the number one tool that any data analyst needs to know. Well, after Excel I guess. Everyone assumes that you know Excel.
You can use SQL everywhere. This is the OG database language. There are many different iterations, but they all let you create databases and do the standard ETL (extract, transform, and load) processes. The key: there are standard SQL ways to do things. If you know the terms and ideas, you can use them across a variety of platforms and languages.
SQL is used for relational, structured databases, where there are keys that uniquely identify observations and where the data is the traditional column/variable format. Basically, the data that we are using.
Note
If you’re doing data science, business analytics, anything that involves something bigger than a financial model in Excel, learn SQL.
However, databases have changed a lot over the last 25 years or so, as companies deal with more and more data with limited structure. These newer tools are, broadly speaking, sometimes called NoSQL. This is beyond the scope of what we’re doing, but would be taught in any data engineering or databases course.
If you’d like to learn more about SQL, there are also some DataCamp tutorials.
And here’s the cheat sheet.
I found this page helpful for some syntax.
Coding for Economists discusses alternatives to pandas.
### 4.7.1. Pandas SQL
We are going to use pandasql to use SQL in Python. This method will let us use standard SQL queries with our pandas DataFrames. If you already know SQL, you might like this better!

I’ll use pip to install pandasql. Don’t forget to use `!pip` in Google Colab.
```
pip install pandasql
```
This is a much “lighter weight” way to use SQL. There are others, like pyspark from Apache Spark, which is a whole set of data engineering tools.
You’ll see my import statement below.
I’ll also bring in that NC Brewery data.
```python
# Set-up
from pandasql import sqldf
import numpy as np
import pandas as pd

nc_df = pd.read_csv('https://raw.githubusercontent.com/aaiken1/fin-data-analysis-python/main/data/ncbreweries.csv')
```
From here, you can use the standard set of SQL queries. I won’t go over all of them. Let me give you a few examples, though, so that you can see what things look like.
Let’s start by turning this DataFrame into a SQL database.
```python
nc = sqldf("SELECT * FROM nc_df")
nc
```
|     | Name | City | Type | Beer Count | Est | Status | URL |
|-----|------|------|------|------------|-----|--------|-----|
| 0   | 217 Brew Works | Wilson | Microbrewery | 10 | 2017 | Active | https://www.ratebeer.com//brewers/217-brew-wor... |
| 1   | 3rd Rock Brewing Company | Trenton | Microbrewery | 12 | 2016 | Active | https://www.ratebeer.com//brewers/3rd-rock-bre... |
| 2   | 7 Clans Brewing | Cherokee | Client Brewer | 1 | 2018 | Active | https://www.ratebeer.com//brewers/7-clans-brew... |
| 3   | Andrews Brewing Company | Andrews | Microbrewery | 18 | 2014 | Active | https://www.ratebeer.com//brewers/andrews-brew... |
| 4   | Angry Troll Brewing | Elkin | Microbrewery | 8 | 2017 | Active | https://www.ratebeer.com//brewers/angry-troll-... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 246 | Sweet Taters | Rocky Mount | Brewpub/Brewery | 9 | 2016 | Closed | https://www.ratebeer.com//brewers/sweet-taters... |
| 247 | Triangle Brewing Company | Durham | Microbrewery | 21 | 2007 | Closed | https://www.ratebeer.com//brewers/triangle-bre... |
| 248 | White Rabbit Brewing (NC) | Angier | Microbrewery | 19 | 2013 | Closed | https://www.ratebeer.com//brewers/white-rabbit... |
| 249 | Williamsville Brewery (formerly Wilmington Bre... | Farmville | Brewpub | 5 | 2000 | Closed | https://www.ratebeer.com//brewers/williamsvill... |
| 250 | Wolf Beer Company | Wilmington | Client Brewer | 2 | 2009 | Closed | https://www.ratebeer.com//brewers/wolf-beer-co... |

251 rows × 7 columns
Can’t tell a difference from the output, right? If you’ve used SQL before, though, you should start to feel at home.
Here’s a standard query. Again, I’m not going over all of SQL here, but you can probably read this and figure out what it is doing. I print the query at the end. It is traditional in SQL to put the keywords in all caps.
```python
q = """SELECT
   Type as brewery_type, City as city, AVG("Beer Count") as avg_beer_count
FROM
   nc
GROUP BY
   City
ORDER BY
   City;"""

print(q)
```
```
SELECT
   Type as brewery_type, City as city, AVG("Beer Count") as avg_beer_count
FROM
   nc
GROUP BY
   City
ORDER BY
   City;
```
We can then run this query and save it to a new DataFrame.
```python
nc2 = sqldf(q)
nc2
```
|     | brewery_type | city | avg_beer_count |
|-----|--------------|------|----------------|
| 0   | Microbrewery | Aberdeen | 9.000000 |
| 1   | Microbrewery | Andrews | 9.666667 |
| 2   | Microbrewery | Angier | 10.000000 |
| 3   | Microbrewery | Apex | 12.333333 |
| 4   | Brewpub | Arden | 5.000000 |
| ... | ... | ... | ... |
| 97  | Brewpub/Brewery | West Jefferson | 15.000000 |
| 98  | Microbrewery | Whitsett | 12.000000 |
| 99  | Brewpub/Brewery | Wilmington | 15.357143 |
| 100 | Microbrewery | Wilson | 10.000000 |
| 101 | Microbrewery | Winston-Salem | 50.400000 |

102 rows × 3 columns
```python
type(nc2)
```

```
pandas.core.frame.DataFrame
```
In short, if you have SQL that you want to run, it is easy enough to copy and paste it into Python!