pandas Quick Reference for Grammar of Data

select columns

```python
colleges[['name', 'undergrads']]
```

Documentation on indexing

Getting a single column as a Series:

```python
colleges['name']
```

Remember: column names are strings (not `colleges[name]`)!
select rows (query)

```python
colleges.query("state == 'AK'")
```

Documentation: pandas.DataFrame.query
top and bottom

```python
colleges.head(n=2)
colleges.tail(n=3)
```
arrange rows (sort_values)

```python
colleges.sort_values("undergrads").head()
colleges.sort_values("undergrads", ascending=False).head()
```

Documentation: pandas.DataFrame.sort_values
counting

```python
colleges.groupby("state", as_index=False).size()
```
grouping / summarizing (general)

```python
students_by_state = (
    colleges
    .groupby("state", as_index=False)                  # (1)
    .agg(total_undergrads=('undergrads', 'sum'))       # (2)
    .sort_values("total_undergrads", ascending=False)  # (3)
)
students_by_state
```

1. Group by state
2. `agg`regate: compute a `total_undergrads` column by (a) extracting the `undergrads` column, and (b) `sum`ming it
3. Sort the result by `total_undergrads`

Documentation on grouping.
grouping by multiple columns

```python
(
    colleges
    .groupby(["state", "online_only"], as_index=False)
    .agg(total_undergrads=('undergrads', 'sum'))
)
```
random sampling

Shuffle rows:

```python
colleges.sample(frac=1.0).head()
```

Select random rows:

```python
colleges.sample(n=5)
```
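Note: `sample` draws different rows on every run. If you need a reproducible sample (e.g., while debugging), the standard `random_state` parameter fixes the seed; a quick sketch:

```python
# Same "random" 5 rows on every run; 42 is an arbitrary seed
colleges.sample(n=5, random_state=42)
```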
Deriving new columns

```python
colleges['named_university'] = colleges['name'].str.contains('University')
colleges.groupby('named_university', as_index=False).agg(
    mean_undergrads=('undergrads', 'mean'),
    max_undergrads=('undergrads', 'max')
)
```
Note: you can also chain these operations using `assign`, which returns a new DataFrame instead of modifying the original.
```python
(
    colleges
    .assign(named_university=colleges['name'].str.contains('University'))
    .groupby('named_university', as_index=False).agg(
        mean_undergrads=('undergrads', 'mean'),
        max_undergrads=('undergrads', 'max')
    )
)
```
SettingWithCopyWarning

Deriving columns on a subset of rows can cause a warning. For example:

```python
mi_colleges = colleges.query("state == 'MI'")
mi_colleges['named_college'] = mi_colleges['name'].str.contains('College')
```

This is pandas trying to keep you from making certain types of mistakes that can be subtle; see the documentation. To avoid it, use `assign`:

```python
mi_colleges = colleges.query("state == 'MI'")
mi_colleges = mi_colleges.assign(
    named_college=mi_colleges['name'].str.contains('College'))
```

or `.copy`:

```python
mi_colleges = colleges.query("state == 'MI'").copy()
mi_colleges['named_college'] = mi_colleges['name'].str.contains('College')
```
“Broadcast” operations

Apply an operation to an entire column at a time.

```python
colleges['undergrads'] / 1000
pd.isna(colleges['undergrads'])
```

Pairs of columns

When broadcasting with a pair of columns, the operation is applied element-wise.

```python
colleges['city'] + ', ' + colleges['state']
```

For more advanced string concatenation, you may want to use `str.cat`:

```python
colleges['city'].str.cat(colleges['state'], sep=', ')
```
Deriving new columns, advanced
If you need an operation that isn’t built into pandas, you can `apply` your own function. First let’s define a function:
```python
def classify_name(school_name):
    school_name = school_name.lower()
    if 'university' in school_name:
        return 'university'
    elif 'college' in school_name:
        return 'college'
    else:
        return 'other'

# example:
classify_name('University of California, Berkeley')
```

```
'university'
```
Then we `apply` that to each row:
```python
colleges['name_type'] = colleges['name'].apply(classify_name)
colleges.groupby('name_type', as_index=False).agg(
    mean_undergrads=('undergrads', 'mean'),
    max_undergrads=('undergrads', 'max')
)
```
Note: this is slower because it runs unoptimized Python code for each row. First, make it correct, then make it fast (e.g., `np.select`).
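For instance, a vectorized version of `classify_name` using `np.select` might look like this (a sketch; it assumes the same `name` column as above):

```python
import numpy as np

# Each condition is a boolean Series; np.select picks the label for the
# first condition that matches, falling back to 'other'.
name_lower = colleges['name'].str.lower()
colleges['name_type'] = np.select(
    [
        name_lower.str.contains('university', na=False),
        name_lower.str.contains('college', na=False),
    ],
    ['university', 'college'],
    default='other',
)
```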
Joining (merge)

If you think you want to:

    for each row in table A:
        look up value in table B

then you want a join.
Example: looking up state names.
```python
states = pd.read_csv("../static/data/states.csv")
states.head()
```
| | name | postal_code |
|---|---|---|
0 | Alabama | AL |
1 | Alaska | AK |
2 | Arizona | AZ |
3 | Arkansas | AR |
4 | California | CA |
```python
students_by_state_with_names = (
    students_by_state
    .merge(
        states,
        left_on='state', right_on='postal_code',
        how='left', indicator=True
    )
)
students_by_state_with_names.head()
```
| | state | total_undergrads | name | postal_code | _merge |
|---|---|---|---|---|---|
0 | CA | 2256934.0 | California | CA | both |
1 | TX | 1287654.0 | Texas | TX | both |
2 | NY | 914704.0 | New York | NY | both |
3 | FL | 888334.0 | Florida | FL | both |
4 | PA | 535743.0 | Pennsylvania | PA | both |
```python
students_by_state_with_names.tail()
```

| | state | total_undergrads | name | postal_code | _merge |
|---|---|---|---|---|---|
54 | VI | 1744.0 | U.S. Virgin Islands | VI | both |
55 | MP | 1228.0 | Northern Mariana Islands | MP | both |
56 | AS | 992.0 | American Samoa | AS | both |
57 | MH | 985.0 | NaN | NaN | left_only |
58 | PW | 441.0 | NaN | NaN | left_only |
Join types
- Left join: use keys from left table
- Right join: use keys from right table
- Outer join: use union of both sets of keys
- Inner join: use intersection of both sets of keys
See documentation
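To make the four types concrete, here is a minimal sketch on two toy tables (invented for illustration):

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['b', 'c'], 'y': [3, 4]})

left.merge(right, on='key', how='left')   # keys a, b (y is NaN for a)
left.merge(right, on='key', how='right')  # keys b, c (x is NaN for c)
left.merge(right, on='key', how='outer')  # keys a, b, c
left.merge(right, on='key', how='inner')  # key b only
```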
Debugging joins
Always check your joins!
- Check the number of rows in the result
- Use `indicator=True` to see which rows came from which table
- Use `validate` to check that the join is correct (see the sketch after this list). Options:
  - 'one_to_one' or '1:1': check if merge keys are unique in both left and right datasets.
  - 'one_to_many' or '1:m': check if merge keys are unique in left dataset.
  - 'many_to_one' or 'm:1': check if merge keys are unique in right dataset.
  - 'many_to_many' or 'm:m': allowed, but does not result in checks.
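For example, since each abbreviation should appear only once in `states`, the merge above could assert a many-to-one relationship; pandas raises a `MergeError` if the check fails (a sketch):

```python
# Fails loudly if 'postal_code' is not unique in states
students_by_state.merge(
    states,
    left_on='state', right_on='postal_code',
    how='left', validate='m:1',
)
```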
Debugging tool: Do an `outer` join, then `groupby` your primary key and the `_merge` indicator:
```python
students_by_state_with_names = (
    students_by_state
    .merge(
        states,
        left_on='state', right_on='postal_code',
        how='outer', indicator=True
    )
)
(
    students_by_state_with_names
    .groupby(['state', '_merge'], as_index=False)
    .size()
    .query("_merge != 'both' and size > 0")
)
```
| | state | _merge | size |
|---|---|---|---|
33 | FM | left_only | 1 |
75 | MH | left_only | 1 |
135 | PW | left_only | 1 |
Things that might be wrong:
- You mis-specified the primary key (e.g., it actually should be two columns)
- The keys don’t match (e.g., one has “US” and the other has “United States”)
Tools to fix key mismatches:

- `replace` to fix a small number of key mismatches (see the sketch below)
- when you have a large number of mismatches, `merge` with a translation table, then `fillna` from the translation table
- to make a translation table, you might use fuzzy matching, using tools like thefuzz (but be careful)
Web Scraping
You can use `pd.read_html` to read tables from web pages. This is actually how I made the `states.csv` file that we just read.
```python
states_tables_raw = pd.read_html(
    "https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States",
    match="postal abbr"
)
print("Found", len(states_tables_raw), "tables")
states_tables_raw[-1]
```
Expect the result to be messy. Cleaning it up often requires doing things that are beyond the scope of this class, but here’s one example:
```python
def clean_states_table(df):
    return (
        df
        .iloc[:, :2]  # we only care about the first two columns
        .set_axis(['name', 'postal_code'], axis=1)  # assign column names
        .dropna()  # drop rows with missing values
        # strip bracketed footnote markers; regex=True is needed in pandas >= 2.0
        .assign(name=lambda df: df['name'].str.replace(r'\[.*\]', '', regex=True))
    )

cleaned_territories = clean_states_table(states_tables_raw[-1])
cleaned_territories
```
| | name | postal_code |
|---|---|---|
0 | American Samoa | AS |
2 | Guam | GU |
4 | Northern Mariana Islands | MP |
6 | Puerto Rico | PR |
8 | U.S. Virgin Islands | VI |
This is a good case for using `concat`:
```python
state_names = pd.concat([
    clean_states_table(table) for table in states_tables_raw
], ignore_index=True)
state_names
```
| | name | postal_code |
|---|---|---|
0 | Alabama | AL |
1 | Alaska | AK |
2 | Arizona | AZ |
3 | Arkansas | AR |
4 | California | CA |
5 | Colorado | CO |
6 | Connecticut | CT |
7 | Delaware | DE |
8 | Florida | FL |
9 | Georgia | GA |
10 | Hawaii | HI |
11 | Idaho | ID |
12 | Illinois | IL |
13 | Indiana | IN |
14 | Iowa | IA |
15 | Kansas | KS |
16 | Kentucky | KY |
17 | Louisiana | LA |
18 | Maine | ME |
19 | Maryland | MD |
20 | Massachusetts | MA |
21 | Michigan | MI |
22 | Minnesota | MN |
23 | Mississippi | MS |
24 | Missouri | MO |
25 | Montana | MT |
26 | Nebraska | NE |
27 | Nevada | NV |
28 | New Hampshire | NH |
29 | New Jersey | NJ |
30 | New Mexico | NM |
31 | New York | NY |
32 | North Carolina | NC |
33 | North Dakota | ND |
34 | Ohio | OH |
35 | Oklahoma | OK |
36 | Oregon | OR |
37 | Pennsylvania | PA |
38 | Rhode Island | RI |
39 | South Carolina | SC |
40 | South Dakota | SD |
41 | Tennessee | TN |
42 | Texas | TX |
43 | Utah | UT |
44 | Vermont | VT |
45 | Virginia | VA |
46 | Washington | WA |
47 | West Virginia | WV |
48 | Wisconsin | WI |
49 | Wyoming | WY |
50 | District of Columbia | DC |
51 | American Samoa | AS |
52 | Guam | GU |
53 | Northern Mariana Islands | MP |
54 | Puerto Rico | PR |
55 | U.S. Virgin Islands | VI |
Style notes
We’ll use a consistent pandas style in this class.

- access columns with `df['col']`, not `df.col`
- no index: `groupby` with `as_index=False`
- explicit aggregation
- `.size` of groups, not `value_counts`
Know the rules in order to know when to break them.
No index
- pandas has `index`ing, an unusual concept
  - not used in SQL, tidyverse, spreadsheets, etc.
  - even Ibis, the next-gen dataframe library by Wes McKinney (pandas creator), doesn’t use it
  - it’s sometimes useful, but often confusing, so we’ll avoid it
- so we’ll use `as_index=False` in all our `groupby` calls
  - so all our aggregation results will be DataFrames, not Series (a sketch follows)
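A quick sketch of the difference:

```python
# Default (as_index=True): a Series indexed by state
colleges.groupby("state")['undergrads'].sum()

# With as_index=False: a DataFrame with a plain 'state' column
colleges.groupby("state", as_index=False)['undergrads'].sum()
```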
Explicit aggregation
Lazy method:
"state", as_index=False).undergrads.mean() colleges.groupby(
Explicit method:
"state", as_index=False).agg(
colleges.groupby(=('undergrads', 'mean'),
mean_undergrads=('undergrads', 'max')
max_undergrads )
Size of a group
Preferred to `value_counts`:

- works with multiple columns
- returns a DataFrame, not a Series
```python
(
    colleges
    .groupby("state", as_index=False)
    .size()
)
```

```python
(
    colleges
    .groupby(["state", "online_only"], as_index=False)
    .size()
)
```
Compare with `value_counts`:

```python
colleges["state"].value_counts()
```