# pandas Quick Reference for Grammar of Data

## select columns

```python
colleges[['name', 'undergrads']]
```

Documentation on indexing.

Getting a single column as a Series:

```python
colleges['name']
```

Remember: column names are strings (not `colleges[name]`)!
## select rows (query)

```python
colleges.query("state == 'AK'")
```

Documentation: pandas.DataFrame.query
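`query` also supports combined conditions and `@`-references to Python variables. A minimal sketch, using a small hypothetical stand-in for the `colleges` data:

```python
import pandas as pd

# Hypothetical mini-dataset standing in for the colleges data used above.
colleges = pd.DataFrame({
    'name': ['A College', 'B University', 'C College'],
    'state': ['AK', 'MI', 'AK'],
    'undergrads': [500, 20000, 1500],
})

# Reference a Python variable inside the query string with @
target_state = 'AK'
result = colleges.query("state == @target_state and undergrads > 1000")
print(result['name'].tolist())  # ['C College']
```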
## top and bottom

```python
colleges.head(n=2)
colleges.tail(n=3)
```

## arrange rows (sort_values)

```python
colleges.sort_values("undergrads").head()
colleges.sort_values("undergrads", ascending=False).head()
```

Documentation: pandas.DataFrame.sort_values
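`sort_values` also accepts a list of columns (and a matching list for `ascending`). A sketch with toy data, since the real `colleges` table isn't loaded here:

```python
import pandas as pd

df = pd.DataFrame({'state': ['MI', 'AK', 'MI'],
                   'undergrads': [200, 300, 100]})

# Sort by state A-to-Z, then by undergrads largest-first within each state
out = df.sort_values(['state', 'undergrads'], ascending=[True, False])
print(out['undergrads'].tolist())  # [300, 200, 100]
```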
## counting

```python
colleges.groupby("state", as_index=False).size()
```

## grouping / summarizing (general)
```python
students_by_state = (
    colleges
    .groupby("state", as_index=False)                    # 1
    .agg(total_undergrads=('undergrads', 'sum'))         # 2
    .sort_values("total_undergrads", ascending=False)    # 3
)
students_by_state
```

1. Group by state
2. Aggregate: compute a `total_undergrads` column by (a) extracting the `undergrads` column, and (b) summing it
3. Sort the result by `total_undergrads`

Documentation on grouping.
grouping by multiple columns
(
colleges
.groupby(["state", "online_only"], as_index=False)
.agg(total_undergrads=('undergrads', 'sum'))
)random sampling
Shuffle rows:

```python
colleges.sample(frac=1.0).head()
```

Select random rows:

```python
colleges.sample(n=5)
```

## Deriving new columns
```python
colleges['named_university'] = colleges['name'].str.contains('University')
colleges.groupby('named_university', as_index=False).agg(
    mean_undergrads=('undergrads', 'mean'),
    max_undergrads=('undergrads', 'max')
)
```

Note: you can also chain these operations using `assign`, which returns a new DataFrame instead of modifying the original.
```python
(
    colleges
    .assign(named_university=colleges['name'].str.contains('University'))
    .groupby('named_university', as_index=False).agg(
        mean_undergrads=('undergrads', 'mean'),
        max_undergrads=('undergrads', 'max')
    )
)
```

## SettingWithCopyWarning
Deriving columns on a subset of rows can cause a warning. For example:

```python
mi_colleges = colleges.query("state == 'MI'")
mi_colleges['named_college'] = mi_colleges['name'].str.contains('College')
```

This is pandas trying to keep you from making certain subtle mistakes; see the documentation. To avoid it, use `assign`:

```python
mi_colleges = colleges.query("state == 'MI'")
mi_colleges = mi_colleges.assign(
    named_college=mi_colleges['name'].str.contains('College'))
```

or `.copy()`:

```python
mi_colleges = colleges.query("state == 'MI'").copy()
mi_colleges['named_college'] = mi_colleges['name'].str.contains('College')
```

## "Broadcast" operations
Apply an operation to an entire column at a time.

```python
colleges['undergrads'] / 1000
pd.isna(colleges['undergrads'])
```

## Pairs of columns

When broadcasting with a pair of columns, the operation is applied element-wise.

```python
colleges['city'] + ', ' + colleges['state']
```

For more advanced string concatenation, you may want to use `str.cat`:

```python
colleges['city'].str.cat(colleges['state'], sep=', ')
```

## Deriving new columns, advanced
If you need an operation that isn't built into pandas, you can apply your own function. First let's define a function:

```python
def classify_name(school_name):
    school_name = school_name.lower()
    if 'university' in school_name:
        return 'university'
    elif 'college' in school_name:
        return 'college'
    else:
        return 'other'

# example:
classify_name('University of California, Berkeley')  # 'university'
```

Then we apply that to each row:

```python
colleges['name_type'] = colleges['name'].apply(classify_name)
colleges.groupby('name_type', as_index=False).agg(
    mean_undergrads=('undergrads', 'mean'),
    max_undergrads=('undergrads', 'max')
)
```

Note: this is slower because it runs unoptimized Python code for each row. First make it correct, then make it fast (e.g., with `np.select`).
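As a sketch of the `np.select` route just mentioned (vectorized, no per-row Python calls), using a toy `names` Series rather than the real data:

```python
import numpy as np
import pandas as pd

names = pd.Series(['University of X', 'Y College', 'Z Institute'])
lowered = names.str.lower()

# Conditions are checked in order; the first match wins, like the if/elif chain
name_type = np.select(
    [lowered.str.contains('university'), lowered.str.contains('college')],
    ['university', 'college'],
    default='other',
)
print(list(name_type))  # ['university', 'college', 'other']
```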
## Joining (merge)

If you think you want to:

```
for each row in table A:
    look up value in table B
```

then you want a join.

Example: looking up state names.

```python
states = pd.read_csv("../static/data/states.csv")
states.head()
```

|   | name | postal_code |
|---|---|---|
| 0 | Alabama | AL |
| 1 | Alaska | AK |
| 2 | Arizona | AZ |
| 3 | Arkansas | AR |
| 4 | California | CA |
```python
students_by_state_with_names = (
    students_by_state
    .merge(
        states,
        left_on='state', right_on='postal_code',
        how='left', indicator=True
    )
)
students_by_state_with_names.head()
```

|   | state | total_undergrads | name | postal_code | _merge |
|---|---|---|---|---|---|
| 0 | CA | 2256934.0 | California | CA | both |
| 1 | TX | 1287654.0 | Texas | TX | both |
| 2 | NY | 914704.0 | New York | NY | both |
| 3 | FL | 888334.0 | Florida | FL | both |
| 4 | PA | 535743.0 | Pennsylvania | PA | both |
```python
students_by_state_with_names.tail()
```

|   | state | total_undergrads | name | postal_code | _merge |
|---|---|---|---|---|---|
| 54 | VI | 1744.0 | U.S. Virgin Islands | VI | both |
| 55 | MP | 1228.0 | Northern Mariana Islands | MP | both |
| 56 | AS | 992.0 | American Samoa | AS | both |
| 57 | MH | 985.0 | NaN | NaN | left_only |
| 58 | PW | 441.0 | NaN | NaN | left_only |
## Join types

- Left join: use keys from the left table
- Right join: use keys from the right table
- Outer join: use the union of both sets of keys
- Inner join: use the intersection of both sets of keys

See documentation
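A tiny illustration of how the four types differ in which keys survive (hypothetical two-row tables, not the colleges data):

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['b', 'c'], 'y': [3, 4]})

print(len(left.merge(right, on='key', how='inner')))  # 1: just 'b'
print(len(left.merge(right, on='key', how='left')))   # 2: 'a', 'b'
print(len(left.merge(right, on='key', how='right')))  # 2: 'b', 'c'
print(len(left.merge(right, on='key', how='outer')))  # 3: 'a', 'b', 'c'
```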
## Debugging joins

Always check your joins!

- Check the number of rows in the result
- Use `indicator=True` to see which rows came from which table
- Use `validate` to check that the join is correct. Options:
  - `'one_to_one'` or `'1:1'`: check if merge keys are unique in both left and right datasets.
  - `'one_to_many'` or `'1:m'`: check if merge keys are unique in the left dataset.
  - `'many_to_one'` or `'m:1'`: check if merge keys are unique in the right dataset.
  - `'many_to_many'` or `'m:m'`: allowed, but does not result in checks.
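For instance, `validate='m:1'` raises a `MergeError` if the right table's keys turn out not to be unique. A sketch with a deliberately duplicated toy key:

```python
import pandas as pd

# Toy tables; the right table has a duplicated key on purpose
states = pd.DataFrame({'postal_code': ['AK', 'AK'],
                       'name': ['Alaska', 'Alaska (duplicate)']})
data = pd.DataFrame({'state': ['AK'], 'total': [100]})

try:
    data.merge(states, left_on='state', right_on='postal_code',
               how='left', validate='m:1')
except pd.errors.MergeError as e:
    print('validation failed:', e)
```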
Debugging tool: do an outer join, then `groupby` your primary key and the `_merge` indicator:
```python
students_by_state_with_names = (
    students_by_state
    .merge(
        states,
        left_on='state', right_on='postal_code',
        how='outer', indicator=True
    )
)
(
    students_by_state_with_names
    .groupby(['state', '_merge'], as_index=False)
    .size()
    .query("_merge != 'both' and size > 0")
)
```

|   | state | _merge | size |
|---|---|---|---|
| 33 | FM | left_only | 1 |
| 75 | MH | left_only | 1 |
| 135 | PW | left_only | 1 |
Things that might be wrong:

- You mis-specified the primary key (e.g., it actually should be two columns)
- The keys don't match (e.g., one has "US" and the other has "United States")

Tools to fix key mismatches:

- `replace` to fix a small number of key mismatches
- when you have a large number of mismatches, `merge` with a translation table, then `fillna` from the translation table
- to make a translation table, you might use fuzzy matching, with tools like thefuzz (but be careful)
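A sketch of the `replace` approach for a handful of mismatches (hypothetical key values):

```python
import pandas as pd

keys = pd.Series(['US', 'CA', 'MX'])

# Map the odd spellings onto the ones the other table uses
fixed = keys.replace({'US': 'United States', 'MX': 'Mexico'})
print(fixed.tolist())  # ['United States', 'CA', 'Mexico']
```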
## Web Scraping

You can use `pd.read_html` to read tables from web pages. This is actually how I made the states.csv file that we just read.

```python
states_tables_raw = pd.read_html(
    "https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States",
    match="postal abbr")
print("Found", len(states_tables_raw), "tables")
states_tables_raw[-1]
```

Expect the result to be messy. Cleaning it up often requires doing things that are beyond the scope of this class, but here's one example:
```python
def clean_states_table(df):
    return (
        df
        .iloc[:, :2]  # we only care about the first two columns
        .set_axis(['name', 'postal_code'], axis=1)  # assign column names
        .dropna()  # drop rows with missing values
        # strip footnote markers like "[B]" from the names
        .assign(name=lambda df: df['name'].str.replace(r'\[.*\]', '', regex=True))
    )

cleaned_territories = clean_states_table(states_tables_raw[-1])
cleaned_territories
```

|   | name | postal_code |
|---|---|---|
| 0 | American Samoa | AS |
| 2 | Guam | GU |
| 4 | Northern Mariana Islands | MP |
| 6 | Puerto Rico | PR |
| 8 | U.S. Virgin Islands | VI |
This is a good case for using `concat`:

```python
state_names = pd.concat([
    clean_states_table(table) for table in states_tables_raw
], ignore_index=True)
state_names
```

|   | name | postal_code |
|---|---|---|
| 0 | Alabama | AL |
| 1 | Alaska | AK |
| 2 | Arizona | AZ |
| 3 | Arkansas | AR |
| 4 | California | CA |
| 5 | Colorado | CO |
| 6 | Connecticut | CT |
| 7 | Delaware | DE |
| 8 | Florida | FL |
| 9 | Georgia | GA |
| 10 | Hawaii | HI |
| 11 | Idaho | ID |
| 12 | Illinois | IL |
| 13 | Indiana | IN |
| 14 | Iowa | IA |
| 15 | Kansas | KS |
| 16 | Kentucky | KY |
| 17 | Louisiana | LA |
| 18 | Maine | ME |
| 19 | Maryland | MD |
| 20 | Massachusetts | MA |
| 21 | Michigan | MI |
| 22 | Minnesota | MN |
| 23 | Mississippi | MS |
| 24 | Missouri | MO |
| 25 | Montana | MT |
| 26 | Nebraska | NE |
| 27 | Nevada | NV |
| 28 | New Hampshire | NH |
| 29 | New Jersey | NJ |
| 30 | New Mexico | NM |
| 31 | New York | NY |
| 32 | North Carolina | NC |
| 33 | North Dakota | ND |
| 34 | Ohio | OH |
| 35 | Oklahoma | OK |
| 36 | Oregon | OR |
| 37 | Pennsylvania | PA |
| 38 | Rhode Island | RI |
| 39 | South Carolina | SC |
| 40 | South Dakota | SD |
| 41 | Tennessee | TN |
| 42 | Texas | TX |
| 43 | Utah | UT |
| 44 | Vermont | VT |
| 45 | Virginia | VA |
| 46 | Washington | WA |
| 47 | West Virginia | WV |
| 48 | Wisconsin | WI |
| 49 | Wyoming | WY |
| 50 | District of Columbia | DC |
| 51 | American Samoa | AS |
| 52 | Guam | GU |
| 53 | Northern Mariana Islands | MP |
| 54 | Puerto Rico | PR |
| 55 | U.S. Virgin Islands | VI |
## Style notes

We'll use a consistent pandas style in this class.

- access columns with `df['col']`, not `df.col`
- no index: `groupby` with `as_index=False`
- explicit aggregation
- `.size` of groups, not `value_counts`

Know the rules in order to know when to break them.
## No index

- pandas has indexing, an unusual concept
  - not used in SQL, tidyverse, spreadsheets, etc.
  - even Ibis, the next-gen dataframe library by Wes McKinney (the creator of pandas), doesn't use it
  - it's sometimes useful, but often confusing, so we'll avoid it
- so we'll use `as_index=False` in all our `groupby` calls
  - so all our aggregation results will be `DataFrame`s, not `Series`
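A quick sketch of the difference, with toy data:

```python
import pandas as pd

df = pd.DataFrame({'state': ['AK', 'AK', 'MI'],
                   'undergrads': [1, 2, 3]})

with_index = df.groupby('state').agg(total=('undergrads', 'sum'))
flat = df.groupby('state', as_index=False).agg(total=('undergrads', 'sum'))

print(with_index.index.name)  # 'state': the grouping key became the index
print(list(flat.columns))     # ['state', 'total']: the key stays a column
```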
## Explicit aggregation

Lazy method:

```python
colleges.groupby("state", as_index=False).undergrads.mean()
```

Explicit method:

```python
colleges.groupby("state", as_index=False).agg(
    mean_undergrads=('undergrads', 'mean'),
    max_undergrads=('undergrads', 'max')
)
```

## Size of a group
Preferred to `value_counts`:

- works with multiple columns
- returns a `DataFrame`, not a `Series`

```python
(
    colleges
    .groupby("state", as_index=False)
    .size()
)
```

```python
(
    colleges
    .groupby(["state", "online_only"], as_index=False)
    .size()
)
```

Compare with `value_counts`:

```python
colleges["state"].value_counts()
```