{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Using SQL in Python\n", "\n", "SQL stands for [Structured Query Language](https://en.wikipedia.org/wiki/SQL) and is perhaps the number one tool that any data analyst needs to know. Well, after Excel I guess. Everyone assumes that you know Excel.\n", "\n", "You can use SQL everywhere. This is the OG database language. There are many different iterations. But, they all let you create databases and do the standard ETL (extract, transform, and load) [processes](https://blog.panoply.io/sql-and-etl-the-dynamic-data-duo). They key - there are standard SQL ways to do things. If you know the terms and ideas, you can use them across a variety of platforms and languages.\n", "\n", "SQL is used for [relational, structured databases](https://en.wikipedia.org/wiki/Relational_database), where there are keys that uniquely identify observations and where the data is the traditional column/variable format. Basically, the data that we are using.\n", "\n", "```{note}\n", "If you're doing data science, business analytics, anything that involves something bigger than a financial model in Excel, learn SQL.\n", "```\n", "\n", "However, databases have changed a lot the last 25 years or so, as companies deal with more and more data with limited structure. These tools, broadly speaking, are sometimes called [NoSQL](https://en.wikipedia.org/wiki/NoSQL). This is beyond the scope of what we're doing, but would be taught in any data engineering or databases course.\n", "\n", "If you'd like to learn more about SQL, there are also some [DataCamp](https://www.datacamp.com/learn/sql) tutorials. \n", "\n", "And here's the [cheat sheet](https://res.cloudinary.com/dyd911kmh/image/upload/v1675360372/Marketing/Blog/SQL_Basics_For_Data_Science.pdf).\n", "\n", "I found this [page helpful for some syntax](https://towardsdatascience.com/pandasql-interesting-way-to-run-sql-queries-in-python-18a4fc36406a). \n", "\n", "[Coding for Economists](https://aeturrell.github.io/coding-for-economists/data-advanced.html) discusses alternatives to `pandas`.\n", "\n", "## Pandas SQL\n", "\n", "We are going to use [pandassql](https://pypi.org/project/pandasql/) to use SQL in Python. Using this method will let use use **standard SQL queries** with our `pandas` DataFrames. If you already know SQL, you might like this better!\n", "\n", "I'll use `pip` to install `pandassql`. Don't forget to use `! pip` in Google Colab.\n", "\n", "```\n", "pip install pandasql \n", "```\n", "\n", "This is a much \"lighter weight\" way to use SQL. There are others, like [pyspark](https://spark.apache.org/docs/latest/api/python/index.html) from [Apache Spark](https://spark.apache.org) that are whole set of data engineering tools.\n", "\n", "You'll see my import statement below. \n", "\n", "I'll also bring in that NC Brewery data." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# Set-up\n", "\n", "from pandasql import sqldf\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "nc_df = pd.read_csv('https://raw.githubusercontent.com/aaiken1/fin-data-analysis-python/main/data/ncbreweries.csv')\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "From here, you can use the standard set of SQL queries. I won't go over all of them. Let me give you a few examples, though, so that you can see what things look like. \n", "\n", "Let's start by turning this DataFrame into a SQL database." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameCityTypeBeer CountEstStatusURL
0217 Brew WorksWilsonMicrobrewery102017Activehttps://www.ratebeer.com//brewers/217-brew-wor...
13rd Rock Brewing CompanyTrentonMicrobrewery122016Activehttps://www.ratebeer.com//brewers/3rd-rock-bre...
27 Clans BrewingCherokeeClient Brewer12018Activehttps://www.ratebeer.com//brewers/7-clans-brew...
3Andrews Brewing CompanyAndrewsMicrobrewery182014Activehttps://www.ratebeer.com//brewers/andrews-brew...
4Angry Troll BrewingElkinMicrobrewery82017Activehttps://www.ratebeer.com//brewers/angry-troll-...
........................
246Sweet TatersRocky MountBrewpub/Brewery92016Closedhttps://www.ratebeer.com//brewers/sweet-taters...
247Triangle Brewing CompanyDurhamMicrobrewery212007Closedhttps://www.ratebeer.com//brewers/triangle-bre...
248White Rabbit Brewing (NC)AngierMicrobrewery192013Closedhttps://www.ratebeer.com//brewers/white-rabbit...
249Williamsville Brewery (formerly Wilmington Bre...FarmvilleBrewpub52000Closedhttps://www.ratebeer.com//brewers/williamsvill...
250Wolf Beer CompanyWilmingtonClient Brewer22009Closedhttps://www.ratebeer.com//brewers/wolf-beer-co...
\n", "

251 rows × 7 columns

\n", "
" ], "text/plain": [ " Name City \\\n", "0 217 Brew Works Wilson \n", "1 3rd Rock Brewing Company Trenton \n", "2 7 Clans Brewing Cherokee \n", "3 Andrews Brewing Company Andrews \n", "4 Angry Troll Brewing Elkin \n", ".. ... ... \n", "246 Sweet Taters Rocky Mount \n", "247 Triangle Brewing Company Durham \n", "248 White Rabbit Brewing (NC) Angier \n", "249 Williamsville Brewery (formerly Wilmington Bre... Farmville \n", "250 Wolf Beer Company Wilmington \n", "\n", " Type Beer Count Est Status \\\n", "0 Microbrewery 10 2017 Active \n", "1 Microbrewery 12 2016 Active \n", "2 Client Brewer 1 2018 Active \n", "3 Microbrewery 18 2014 Active \n", "4 Microbrewery 8 2017 Active \n", ".. ... ... ... ... \n", "246 Brewpub/Brewery 9 2016 Closed \n", "247 Microbrewery 21 2007 Closed \n", "248 Microbrewery 19 2013 Closed \n", "249 Brewpub 5 2000 Closed \n", "250 Client Brewer 2 2009 Closed \n", "\n", " URL \n", "0 https://www.ratebeer.com//brewers/217-brew-wor... \n", "1 https://www.ratebeer.com//brewers/3rd-rock-bre... \n", "2 https://www.ratebeer.com//brewers/7-clans-brew... \n", "3 https://www.ratebeer.com//brewers/andrews-brew... \n", "4 https://www.ratebeer.com//brewers/angry-troll-... \n", ".. ... \n", "246 https://www.ratebeer.com//brewers/sweet-taters... \n", "247 https://www.ratebeer.com//brewers/triangle-bre... \n", "248 https://www.ratebeer.com//brewers/white-rabbit... \n", "249 https://www.ratebeer.com//brewers/williamsvill... \n", "250 https://www.ratebeer.com//brewers/wolf-beer-co... \n", "\n", "[251 rows x 7 columns]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nc = sqldf(\"SELECT * FROM nc_df\")\n", "\n", "nc" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Can't tell a difference from the output, right? If you've used SQL before, though, you should start to feel at home. \n", "\n", "Here's a standard query. Again, I'm not going over all of SQL here, but you can probably read this and figure out what it is doing. I print the query at the end. It is traditional in SQL to put the **actions** in all caps." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SELECT \n", " Type as brewery_type, City as city, AVG(\"Beer Count\") as avg_beer_count \n", " FROM \n", " nc \n", " GROUP BY \n", " City\n", " ORDER BY\n", " City;\n" ] } ], "source": [ "q = \"\"\"SELECT \n", " Type as brewery_type, City as city, AVG(\"Beer Count\") as avg_beer_count \n", " FROM \n", " nc \n", " GROUP BY \n", " City\n", " ORDER BY\n", " City;\"\"\"\n", "\n", "print(q)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We can then run this query and save it to a new DataFrame." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
brewery_typecityavg_beer_count
0MicrobreweryAberdeen9.000000
1MicrobreweryAndrews9.666667
2MicrobreweryAngier10.000000
3MicrobreweryApex12.333333
4BrewpubArden5.000000
............
97Brewpub/BreweryWest Jefferson15.000000
98MicrobreweryWhitsett12.000000
99Brewpub/BreweryWilmington15.357143
100MicrobreweryWilson10.000000
101MicrobreweryWinston-Salem50.400000
\n", "

102 rows × 3 columns

\n", "
" ], "text/plain": [ " brewery_type city avg_beer_count\n", "0 Microbrewery Aberdeen 9.000000\n", "1 Microbrewery Andrews 9.666667\n", "2 Microbrewery Angier 10.000000\n", "3 Microbrewery Apex 12.333333\n", "4 Brewpub Arden 5.000000\n", ".. ... ... ...\n", "97 Brewpub/Brewery West Jefferson 15.000000\n", "98 Microbrewery Whitsett 12.000000\n", "99 Brewpub/Brewery Wilmington 15.357143\n", "100 Microbrewery Wilson 10.000000\n", "101 Microbrewery Winston-Salem 50.400000\n", "\n", "[102 rows x 3 columns]" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nc2 = sqldf(q)\n", "nc2" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "pandas.core.frame.DataFrame" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(nc2)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "In short, if you have SQL that you want to run, it is easy enough to copy and paste it into Python!" ] } ], "metadata": { "kernelspec": { "display_name": "base", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "40d3a090f54c6569ab1632332b64b2c03c39dcf918b08424e98f38b5ae0af88f" } } }, "nbformat": 4, "nbformat_minor": 2 }