pandas

Quick Reference for Grammar of Data

select columns

colleges[['name', 'undergrads']]

Documentation on indexing

Getting a single column as a Series:

colleges['name']

Remember: column names are strings: write colleges['name'], not colleges[name]!

select rows (query)

colleges.query("state == 'AK'")

Documentation: pandas.DataFrame.query
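
query can also combine conditions and reference Python variables with the @ prefix. A minimal sketch, using a made-up toy DataFrame in place of the real colleges data:

```python
import pandas as pd

# Toy data standing in for the colleges DataFrame (illustrative values).
colleges = pd.DataFrame({
    'name': ['A College', 'B University', 'C College'],
    'state': ['AK', 'MI', 'MI'],
    'undergrads': [500, 20000, 1500],
})

# Combine conditions with `and` / `or` inside the query string:
big_mi = colleges.query("state == 'MI' and undergrads > 1000")

# Reference a Python variable with the @ prefix:
cutoff = 1000
also_big_mi = colleges.query("state == 'MI' and undergrads > @cutoff")
```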

top and bottom

colleges.head(n=2)
colleges.tail(n=3)

arrange rows (sort_values)

colleges.sort_values("undergrads").head()
colleges.sort_values("undergrads", ascending=False).head()

Documentation: pandas.DataFrame.sort_values
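
sort_values also accepts a list of columns, with a matching list for ascending. A sketch with toy data (same column names as above):

```python
import pandas as pd

# Toy data with the columns assumed above.
colleges = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'state': ['MI', 'AK', 'MI', 'AK'],
    'undergrads': [100, 300, 200, 50],
})

# Sort by state first, then by undergrads descending within each state.
result = colleges.sort_values(
    ['state', 'undergrads'], ascending=[True, False]
)
```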

counting

colleges.groupby("state", as_index=False).size()

grouping / summarizing (general)

students_by_state = (
  colleges
  .groupby("state", as_index=False)                   # (1)
  .agg(total_undergrads=('undergrads', 'sum'))        # (2)
  .sort_values("total_undergrads", ascending=False)   # (3)
)
students_by_state

  1. Group by state.
  2. Aggregate: compute a total_undergrads column by (a) extracting the undergrads column, and (b) summing it.
  3. Sort the result by total_undergrads.

Documentation on grouping.

grouping by multiple columns

(
  colleges
  .groupby(["state", "online_only"], as_index=False)
  .agg(total_undergrads=('undergrads', 'sum'))
)

random sampling

Shuffle rows:

colleges.sample(frac=1.0).head()

Select random rows:

colleges.sample(n=5)
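
By default, sample gives different rows on every run. Passing random_state makes the sample reproducible; a sketch with toy data:

```python
import pandas as pd

# Toy data (illustrative).
colleges = pd.DataFrame({'name': list('ABCDEF'), 'undergrads': range(6)})

# With the same random_state, the "random" rows are the same every run.
s1 = colleges.sample(n=3, random_state=42)
s2 = colleges.sample(n=3, random_state=42)
```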

Deriving new columns

colleges['named_university'] = colleges['name'].str.contains('University')
colleges.groupby('named_university', as_index=False).agg(
  mean_undergrads=('undergrads', 'mean'),
  max_undergrads=('undergrads', 'max')
)

Note: you can also chain these operations using assign, which returns a new DataFrame instead of modifying the original.

(
  colleges
  .assign(named_university = colleges['name'].str.contains('University'))
  .groupby('named_university', as_index=False).agg(
    mean_undergrads=('undergrads', 'mean'),
    max_undergrads=('undergrads', 'max')
  )
)

SettingWithCopyWarning

Deriving columns on a subset of rows can cause a warning. For example:

mi_colleges = colleges.query("state == 'MI'")
mi_colleges['named_college'] = mi_colleges['name'].str.contains('College')

This is pandas trying to keep you from making certain types of mistakes that can be subtle; see the documentation. To avoid it, use assign:

mi_colleges = colleges.query("state == 'MI'")
mi_colleges = mi_colleges.assign(
  named_college = mi_colleges['name'].str.contains('College'))

or .copy:

mi_colleges = colleges.query("state == 'MI'").copy()
mi_colleges['named_college'] = mi_colleges['name'].str.contains('College')

“Broadcast” operations

Apply an operation to an entire column at a time.

colleges['undergrads'] / 1000
pd.isna(colleges['undergrads'])
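
A broadcast operation returns a new Series, which you can store as a column. A small sketch with toy data (the None models a missing undergrads value):

```python
import pandas as pd

# Toy data; None becomes NaN in the numeric column.
colleges = pd.DataFrame({'name': ['A', 'B'], 'undergrads': [1500, None]})

# Broadcast arithmetic over the whole column, stored as a new column:
colleges['undergrads_thousands'] = colleges['undergrads'] / 1000

# pd.isna flags missing values element-wise:
missing = pd.isna(colleges['undergrads'])
```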

Pairs of columns

When broadcasting with a pair of columns, the operation is applied element-wise.

colleges['city'] + ', ' + colleges['state']

Note: for more advanced string concatenation, you may want to use str.cat:

colleges['city'].str.cat(colleges['state'], sep=', ')

Deriving new columns, advanced

If you need an operation that isn’t built into pandas, you can apply your own function. First let’s define a function:

def classify_name(school_name):
    school_name = school_name.lower()
    if 'university' in school_name:
        return 'university'
    elif 'college' in school_name:
        return 'college'
    else:
        return 'other'

# example:
classify_name('University of California, Berkeley')
'university'

Then we apply that to each row:

colleges['name_type'] = colleges['name'].apply(classify_name)
colleges.groupby('name_type', as_index=False).agg(
  mean_undergrads=('undergrads', 'mean'),
  max_undergrads=('undergrads', 'max')
)

Note: this is slower than the built-in operations because it runs plain Python code for each row. First make it correct, then, if needed, make it fast (e.g., with np.select).
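
A sketch of the np.select approach, using toy data: conditions are checked in order and the first match wins, mirroring the if/elif/else in classify_name but operating on whole columns at once.

```python
import numpy as np
import pandas as pd

# Toy data; the real colleges table has the same 'name' column.
colleges = pd.DataFrame({
    'name': ['University of X', 'Y College', 'Z Institute'],
})

lowered = colleges['name'].str.lower()
# Conditions are checked in order; the first True wins,
# like the if/elif/else chain in classify_name.
colleges['name_type'] = np.select(
    [lowered.str.contains('university'), lowered.str.contains('college')],
    ['university', 'college'],
    default='other',
)
```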

Joining (merge)

If you think you want to:

for each row in table A:
  look up value in table B

then you want a join.

Example: looking up state names.

states = pd.read_csv("../static/data/states.csv")
states.head()
name postal_code
0 Alabama AL
1 Alaska AK
2 Arizona AZ
3 Arkansas AR
4 California CA
students_by_state_with_names = (
  students_by_state
  .merge(
    states,
    left_on='state', right_on='postal_code',
    how='left', indicator=True
  )
)

students_by_state_with_names.head()
state total_undergrads name postal_code _merge
0 CA 2256934.0 California CA both
1 TX 1287654.0 Texas TX both
2 NY 914704.0 New York NY both
3 FL 888334.0 Florida FL both
4 PA 535743.0 Pennsylvania PA both
students_by_state_with_names.tail()
state total_undergrads name postal_code _merge
54 VI 1744.0 U.S. Virgin Islands VI both
55 MP 1228.0 Northern Mariana Islands MP both
56 AS 992.0 American Samoa AS both
57 MH 985.0 NaN NaN left_only
58 PW 441.0 NaN NaN left_only

Join types

  • Left join: use keys from left table
  • Right join: use keys from right table
  • Outer join: use union of both sets of keys
  • Inner join: use intersection of both sets of keys

See documentation

Debugging joins

Always check your joins!

  • Check the number of rows in the result
  • Use indicator=True to see which rows came from which table
  • Use validate to check that the join is correct. Options:
    • ‘one_to_one’ or ‘1:1’: check if merge keys are unique in both left and right datasets.
    • ‘one_to_many’ or ‘1:m’: check if merge keys are unique in left dataset.
    • ‘many_to_one’ or ‘m:1’: check if merge keys are unique in right dataset.
    • ‘many_to_many’ or ‘m:m’: allowed, but does not result in checks.
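
For example, validate='m:1' raises pandas.errors.MergeError when the right table's keys are not unique. A sketch with made-up tables and a deliberately duplicated 'CA' key:

```python
import pandas as pd

left = pd.DataFrame({'state': ['CA', 'TX'], 'total': [10, 20]})
right = pd.DataFrame({
    'postal_code': ['CA', 'TX', 'CA'],   # 'CA' deliberately duplicated
    'name': ['California', 'Texas', 'California (dup)'],
})

# Without validate, the merge silently produces 3 rows instead of 2:
silently_duplicated = left.merge(
    right, left_on='state', right_on='postal_code', how='left')

# validate='m:1' requires the right table's keys to be unique,
# so the duplicate 'CA' makes the merge fail loudly instead.
try:
    left.merge(right, left_on='state', right_on='postal_code',
               how='left', validate='m:1')
    merge_failed = False
except pd.errors.MergeError:
    merge_failed = True
```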

Debugging tool: Do an outer join, then groupby your primary key and the _merge indicator:

students_by_state_with_names = (
  students_by_state
  .merge(
    states,
    left_on='state', right_on='postal_code',
    how='outer', indicator=True
  )
)

(
  students_by_state_with_names
  .groupby(['state', '_merge'], as_index=False)
  .size()
  .query("_merge != 'both' and size > 0")
)
state _merge size
33 FM left_only 1
75 MH left_only 1
135 PW left_only 1

Things that might be wrong:

  • You mis-specified the primary key (e.g., it actually should be two columns)
  • The keys don’t match (e.g., one has “US” and the other has “United States”)

Tools to fix key mismatches:

  • Use replace to fix a small number of key mismatches.
  • When you have a large number of mismatches, merge with a translation table, then fillna from the translation table.
  • To make a translation table, you might use fuzzy matching, using tools like thefuzz (but be careful).
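
A minimal sketch of the replace approach, using made-up tables and a made-up mismatched key:

```python
import pandas as pd

# Toy tables with one mismatched key ('US Virgin Is.' vs 'VI').
students_by_state = pd.DataFrame({
    'state': ['CA', 'US Virgin Is.'],
    'total_undergrads': [100, 5],
})
states = pd.DataFrame({
    'postal_code': ['CA', 'VI'],
    'name': ['California', 'U.S. Virgin Islands'],
})

# Fix the key with replace, then merge as usual:
fixed = students_by_state.assign(
    state=students_by_state['state'].replace({'US Virgin Is.': 'VI'})
)
merged = fixed.merge(states, left_on='state', right_on='postal_code',
                     how='left')
```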

Web Scraping

You can use pd.read_html to read tables from web pages. This is actually how I made the states.csv file that we just read.

states_tables_raw = pd.read_html("https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States", match="postal abbr")
print("Found", len(states_tables_raw), "tables")
states_tables_raw[-1]

Expect the result to be messy. Cleaning it up often requires doing things that are beyond the scope of this class, but here’s one example:

def clean_states_table(df):
  return (
    df
    .iloc[:, :2] # we only care about the first two columns
    .set_axis(['name', 'postal_code'], axis=1) # assign column names
    .dropna() # drop rows with missing values
    .assign(name = lambda df: df['name'].str.replace(r'\[.*\]', '', regex=True))
  )
cleaned_territories = clean_states_table(states_tables_raw[-1])
cleaned_territories
name postal_code
0 American Samoa AS
2 Guam GU
4 Northern Mariana Islands MP
6 Puerto Rico PR
8 U.S. Virgin Islands VI

This is a good case for using concat:

state_names = pd.concat([
  clean_states_table(table) for table in states_tables_raw
], ignore_index=True)
state_names
name postal_code
0 Alabama AL
1 Alaska AK
2 Arizona AZ
3 Arkansas AR
4 California CA
5 Colorado CO
6 Connecticut CT
7 Delaware DE
8 Florida FL
9 Georgia GA
10 Hawaii HI
11 Idaho ID
12 Illinois IL
13 Indiana IN
14 Iowa IA
15 Kansas KS
16 Kentucky KY
17 Louisiana LA
18 Maine ME
19 Maryland MD
20 Massachusetts MA
21 Michigan MI
22 Minnesota MN
23 Mississippi MS
24 Missouri MO
25 Montana MT
26 Nebraska NE
27 Nevada NV
28 New Hampshire NH
29 New Jersey NJ
30 New Mexico NM
31 New York NY
32 North Carolina NC
33 North Dakota ND
34 Ohio OH
35 Oklahoma OK
36 Oregon OR
37 Pennsylvania PA
38 Rhode Island RI
39 South Carolina SC
40 South Dakota SD
41 Tennessee TN
42 Texas TX
43 Utah UT
44 Vermont VT
45 Virginia VA
46 Washington WA
47 West Virginia WV
48 Wisconsin WI
49 Wyoming WY
50 District of Columbia DC
51 American Samoa AS
52 Guam GU
53 Northern Mariana Islands MP
54 Puerto Rico PR
55 U.S. Virgin Islands VI

Style notes

We’ll use a consistent pandas style in this class.

  • access columns with df['col'], not df.col
  • no index
  • groupby with as_index=False
  • explicit aggregation
  • .size of groups, not value_counts

Know the rules in order to know when to break them.

No index

  • pandas has indexing, an unusual concept
    • not used in SQL, the tidyverse, spreadsheets, etc.
    • even Ibis, the next-generation dataframe library by Wes McKinney (the creator of pandas), doesn't use it.
  • it's sometimes useful, but often confusing, so we'll avoid it.
  • we'll use as_index=False in all our groupby calls.
    • that way, all our aggregation results will be DataFrames, not Series.

Explicit aggregation

Lazy method:

colleges.groupby("state", as_index=False).undergrads.mean()

Explicit method:

colleges.groupby("state", as_index=False).agg(
  mean_undergrads=('undergrads', 'mean'),
  max_undergrads=('undergrads', 'max')
)

Size of a group

Preferred over value_counts because it:

  • works with multiple columns
  • returns a DataFrame, not a Series

(
  colleges
  .groupby("state", as_index=False)
  .size()
)
(
  colleges
  .groupby(["state", "online_only"], as_index=False)
  .size()
)

Compare with value_counts:

colleges["state"].value_counts()
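
The difference in return types, sketched with toy data:

```python
import pandas as pd

# Toy data (illustrative).
colleges = pd.DataFrame({'state': ['MI', 'MI', 'AK']})

# groupby(...).size() returns a DataFrame with 'state' and 'size' columns:
counts_df = colleges.groupby('state', as_index=False).size()

# value_counts returns a Series indexed by the state values:
counts_series = colleges['state'].value_counts()
```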