{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Using SQL in Python\n", "\n", "SQL stands for [Structured Query Language](https://en.wikipedia.org/wiki/SQL) and is perhaps the number one tool that any data analyst needs to know. Well, after Excel I guess. Everyone assumes that you know Excel.\n", "\n", "You can use SQL everywhere. This is the OG database language. There are many different iterations. But, they all let you create databases and do the standard ETL (extract, transform, and load) [processes](https://blog.panoply.io/sql-and-etl-the-dynamic-data-duo). They key - there are standard SQL ways to do things. If you know the terms and ideas, you can use them across a variety of platforms and languages.\n", "\n", "SQL is used for [relational, structured databases](https://en.wikipedia.org/wiki/Relational_database), where there are keys that uniquely identify observations and where the data is the traditional column/variable format. Basically, the data that we are using.\n", "\n", "```{note}\n", "If you're doing data science, business analytics, anything that involves something bigger than a financial model in Excel, learn SQL.\n", "```\n", "\n", "However, databases have changed a lot the last 25 years or so, as companies deal with more and more data with limited structure. These tools, broadly speaking, are sometimes called [NoSQL](https://en.wikipedia.org/wiki/NoSQL). This is beyond the scope of what we're doing, but would be taught in any data engineering or databases course.\n", "\n", "If you'd like to learn more about SQL, there are also some [DataCamp](https://www.datacamp.com/learn/sql) tutorials. \n", "\n", "And here's the [cheat sheet](https://res.cloudinary.com/dyd911kmh/image/upload/v1675360372/Marketing/Blog/SQL_Basics_For_Data_Science.pdf).\n", "\n", "I found this [page helpful for some syntax](https://towardsdatascience.com/pandasql-interesting-way-to-run-sql-queries-in-python-18a4fc36406a). \n", "\n", "[Coding for Economists](https://aeturrell.github.io/coding-for-economists/data-advanced.html) discusses alternatives to `pandas`.\n", "\n", "## Pandas SQL\n", "\n", "We are going to use [pandassql](https://pypi.org/project/pandasql/) to use SQL in Python. Using this method will let use use **standard SQL queries** with our `pandas` DataFrames. If you already know SQL, you might like this better!\n", "\n", "I'll use `pip` to install `pandassql`. Don't forget to use `! pip` in Google Colab.\n", "\n", "```\n", "pip install pandasql \n", "```\n", "\n", "This is a much \"lighter weight\" way to use SQL. There are others, like [pyspark](https://spark.apache.org/docs/latest/api/python/index.html) from [Apache Spark](https://spark.apache.org) that are whole set of data engineering tools.\n", "\n", "You'll see my import statement below. \n", "\n", "I'll also bring in that NC Brewery data." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# Set-up\n", "\n", "from pandasql import sqldf\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "nc_df = pd.read_csv('https://raw.githubusercontent.com/aaiken1/fin-data-analysis-python/main/data/ncbreweries.csv')\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "From here, you can use the standard set of SQL queries. I won't go over all of them. Let me give you a few examples, though, so that you can see what things look like. \n", "\n", "Let's start by turning this DataFrame into a SQL database." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | Name | \n", "City | \n", "Type | \n", "Beer Count | \n", "Est | \n", "Status | \n", "URL | \n", "
---|---|---|---|---|---|---|---|
0 | \n", "217 Brew Works | \n", "Wilson | \n", "Microbrewery | \n", "10 | \n", "2017 | \n", "Active | \n", "https://www.ratebeer.com//brewers/217-brew-wor... | \n", "
1 | \n", "3rd Rock Brewing Company | \n", "Trenton | \n", "Microbrewery | \n", "12 | \n", "2016 | \n", "Active | \n", "https://www.ratebeer.com//brewers/3rd-rock-bre... | \n", "
2 | \n", "7 Clans Brewing | \n", "Cherokee | \n", "Client Brewer | \n", "1 | \n", "2018 | \n", "Active | \n", "https://www.ratebeer.com//brewers/7-clans-brew... | \n", "
3 | \n", "Andrews Brewing Company | \n", "Andrews | \n", "Microbrewery | \n", "18 | \n", "2014 | \n", "Active | \n", "https://www.ratebeer.com//brewers/andrews-brew... | \n", "
4 | \n", "Angry Troll Brewing | \n", "Elkin | \n", "Microbrewery | \n", "8 | \n", "2017 | \n", "Active | \n", "https://www.ratebeer.com//brewers/angry-troll-... | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
246 | \n", "Sweet Taters | \n", "Rocky Mount | \n", "Brewpub/Brewery | \n", "9 | \n", "2016 | \n", "Closed | \n", "https://www.ratebeer.com//brewers/sweet-taters... | \n", "
247 | \n", "Triangle Brewing Company | \n", "Durham | \n", "Microbrewery | \n", "21 | \n", "2007 | \n", "Closed | \n", "https://www.ratebeer.com//brewers/triangle-bre... | \n", "
248 | \n", "White Rabbit Brewing (NC) | \n", "Angier | \n", "Microbrewery | \n", "19 | \n", "2013 | \n", "Closed | \n", "https://www.ratebeer.com//brewers/white-rabbit... | \n", "
249 | \n", "Williamsville Brewery (formerly Wilmington Bre... | \n", "Farmville | \n", "Brewpub | \n", "5 | \n", "2000 | \n", "Closed | \n", "https://www.ratebeer.com//brewers/williamsvill... | \n", "
250 | \n", "Wolf Beer Company | \n", "Wilmington | \n", "Client Brewer | \n", "2 | \n", "2009 | \n", "Closed | \n", "https://www.ratebeer.com//brewers/wolf-beer-co... | \n", "
251 rows × 7 columns
\n", "\n", " | brewery_type | \n", "city | \n", "avg_beer_count | \n", "
---|---|---|---|
0 | \n", "Microbrewery | \n", "Aberdeen | \n", "9.000000 | \n", "
1 | \n", "Microbrewery | \n", "Andrews | \n", "9.666667 | \n", "
2 | \n", "Microbrewery | \n", "Angier | \n", "10.000000 | \n", "
3 | \n", "Microbrewery | \n", "Apex | \n", "12.333333 | \n", "
4 | \n", "Brewpub | \n", "Arden | \n", "5.000000 | \n", "
... | \n", "... | \n", "... | \n", "... | \n", "
97 | \n", "Brewpub/Brewery | \n", "West Jefferson | \n", "15.000000 | \n", "
98 | \n", "Microbrewery | \n", "Whitsett | \n", "12.000000 | \n", "
99 | \n", "Brewpub/Brewery | \n", "Wilmington | \n", "15.357143 | \n", "
100 | \n", "Microbrewery | \n", "Wilson | \n", "10.000000 | \n", "
101 | \n", "Microbrewery | \n", "Winston-Salem | \n", "50.400000 | \n", "
102 rows × 3 columns
\n", "