pandas

Quick Reference for Grammar of Data

select columns

colleges[['name', 'undergrads']]

Documentation on indexing

Getting a single column as a Series:

colleges['name']

Remember: column names are strings: write colleges['name'], not colleges[name]!

select rows (query)

colleges.query("state == 'AK'")

Documentation: pandas.DataFrame.query
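
query can also combine conditions and reference Python variables with the @ prefix. A minimal sketch, using a made-up toy DataFrame in place of the real colleges data:

```python
import pandas as pd

# Toy data standing in for the colleges DataFrame (illustrative values).
colleges = pd.DataFrame({
    'name': ['A College', 'B University', 'C College'],
    'state': ['AK', 'MI', 'MI'],
    'undergrads': [500, 20000, 1500],
})

# Combine conditions with `and` / `or` inside the query string:
big_mi = colleges.query("state == 'MI' and undergrads > 1000")

# Reference a Python variable with the @ prefix:
cutoff = 1000
also_big_mi = colleges.query("state == 'MI' and undergrads > @cutoff")
```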

top and bottom

colleges.head(n=2)
colleges.tail(n=3)

arrange rows (sort_values)

colleges.sort_values("undergrads").head()
colleges.sort_values("undergrads", ascending=False).head()

Documentation: pandas.DataFrame.sort_values
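
sort_values also accepts a list of columns, with a matching list for ascending. A sketch with toy data (same column names as above):

```python
import pandas as pd

# Toy data with the columns assumed above.
colleges = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'state': ['MI', 'AK', 'MI', 'AK'],
    'undergrads': [100, 300, 200, 50],
})

# Sort by state first, then by undergrads descending within each state.
result = colleges.sort_values(
    ['state', 'undergrads'], ascending=[True, False]
)
```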

counting

colleges.groupby("state", as_index=False).size()

grouping / summarizing (general)

students_by_state = (
  colleges
  .groupby("state", as_index=False)                   # (1)
  .agg(total_undergrads=('undergrads', 'sum'))        # (2)
  .sort_values("total_undergrads", ascending=False)   # (3)
)
students_by_state

  1. Group by state.
  2. Aggregate: compute a total_undergrads column by (a) extracting the undergrads column, and (b) summing it.
  3. Sort the result by total_undergrads.

Documentation on grouping.

grouping by multiple columns

(
  colleges
  .groupby(["state", "online_only"], as_index=False)
  .agg(total_undergrads=('undergrads', 'sum'))
)

random sampling

Shuffle rows:

colleges.sample(frac=1.0).head()

Select random rows:

colleges.sample(n=5)
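
By default, sample gives different rows on every run. Passing random_state makes the sample reproducible; a sketch with toy data:

```python
import pandas as pd

# Toy data (illustrative).
colleges = pd.DataFrame({'name': list('ABCDEF'), 'undergrads': range(6)})

# With the same random_state, the "random" rows are the same every run.
s1 = colleges.sample(n=3, random_state=42)
s2 = colleges.sample(n=3, random_state=42)
```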

Deriving new columns

colleges['named_university'] = colleges['name'].str.contains('University')
colleges.groupby('named_university', as_index=False).agg(
  mean_undergrads=('undergrads', 'mean'),
  max_undergrads=('undergrads', 'max')
)

Note: you can also chain these operations using assign, which returns a new DataFrame instead of modifying the original.

(
  colleges
  .assign(named_university = colleges['name'].str.contains('University'))
  .groupby('named_university', as_index=False).agg(
    mean_undergrads=('undergrads', 'mean'),
    max_undergrads=('undergrads', 'max')
  )
)

SettingWithCopyWarning

Deriving columns on a subset of rows can cause a warning. For example:

mi_colleges = colleges.query("state == 'MI'")
mi_colleges['named_college'] = mi_colleges['name'].str.contains('College')

This is pandas trying to keep you from making certain types of mistakes that can be subtle; see the documentation. To avoid it, use assign:

mi_colleges = colleges.query("state == 'MI'")
mi_colleges = mi_colleges.assign(
  named_college = mi_colleges['name'].str.contains('College'))

or .copy:

mi_colleges = colleges.query("state == 'MI'").copy()
mi_colleges['named_college'] = mi_colleges['name'].str.contains('College')

“Broadcast” operations

Apply an operation to an entire column at a time.

colleges['undergrads'] / 1000
pd.isna(colleges['undergrads'])
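
A broadcast operation returns a new Series, which you can store as a column. A small sketch with toy data (the None models a missing undergrads value):

```python
import pandas as pd

# Toy data; None becomes NaN in the numeric column.
colleges = pd.DataFrame({'name': ['A', 'B'], 'undergrads': [1500, None]})

# Broadcast arithmetic over the whole column, stored as a new column:
colleges['undergrads_thousands'] = colleges['undergrads'] / 1000

# pd.isna flags missing values element-wise:
missing = pd.isna(colleges['undergrads'])
```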

Pairs of columns

When broadcasting with a pair of columns, the operation is applied element-wise.

colleges['city'] + ', ' + colleges['state']

Note: for more advanced string concatenation, you may want to use str.cat:

colleges['city'].str.cat(colleges['state'], sep=', ')

Deriving new columns, advanced

If you need an operation that isn’t built into pandas, you can apply your own function. First let’s define a function:

def classify_name(school_name):
    school_name = school_name.lower()
    if 'university' in school_name:
        return 'university'
    elif 'college' in school_name:
        return 'college'
    else:
        return 'other'

# example:
classify_name('University of California, Berkeley')
'university'

Then we apply that to each row:

colleges['name_type'] = colleges['name'].apply(classify_name)
colleges.groupby('name_type', as_index=False).agg(
  mean_undergrads=('undergrads', 'mean'),
  max_undergrads=('undergrads', 'max')
)

Note: this is slower than the built-in operations because it runs plain Python code for each row. First make it correct, then, if needed, make it fast (e.g., with np.select).
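
A sketch of the np.select approach, using toy data: conditions are checked in order and the first match wins, mirroring the if/elif/else in classify_name but operating on whole columns at once.

```python
import numpy as np
import pandas as pd

# Toy data; the real colleges table has the same 'name' column.
colleges = pd.DataFrame({
    'name': ['University of X', 'Y College', 'Z Institute'],
})

lowered = colleges['name'].str.lower()
# Conditions are checked in order; the first True wins,
# like the if/elif/else chain in classify_name.
colleges['name_type'] = np.select(
    [lowered.str.contains('university'), lowered.str.contains('college')],
    ['university', 'college'],
    default='other',
)
```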

Joining (merge)

If you think you want to:

for each row in table A:
  look up value in table B

then you want a join.

Example: looking up state names.

states = pd.read_csv("../static/data/states.csv")
states.head()
name postal_code
0 Alabama AL
1 Alaska AK
2 Arizona AZ
3 Arkansas AR
4 California CA
students_by_state_with_names = (
  students_by_state
  .merge(
    states,
    left_on='state', right_on='postal_code',
    how='left', indicator=True
  )
)

students_by_state_with_names.head()
state total_undergrads name postal_code _merge
0 CA 2256934.0 California CA both
1 TX 1287654.0 Texas TX both
2 NY 914704.0 New York NY both
3 FL 888334.0 Florida FL both
4 PA 535743.0 Pennsylvania PA both
students_by_state_with_names.tail()
state total_undergrads name postal_code _merge
54 VI 1744.0 U.S. Virgin Islands VI both
55 MP 1228.0 Northern Mariana Islands MP both
56 AS 992.0 American Samoa AS both
57 MH 985.0 NaN NaN left_only
58 PW 441.0 NaN NaN left_only

Join types

  • Left join: use keys from left table
  • Right join: use keys from right table
  • Outer join: use union of both sets of keys
  • Inner join: use intersection of both sets of keys

See documentation

Debugging joins

Always check your joins!

  • Check the number of rows in the result
  • Use indicator=True to see which rows came from which table
  • Use validate to check that the join is correct. Options:
    • ‘one_to_one’ or ‘1:1’: check if merge keys are unique in both left and right datasets.
    • ‘one_to_many’ or ‘1:m’: check if merge keys are unique in left dataset.
    • ‘many_to_one’ or ‘m:1’: check if merge keys are unique in right dataset.
    • ‘many_to_many’ or ‘m:m’: allowed, but does not result in checks.
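
For example, validate='m:1' raises pandas.errors.MergeError when the right table's keys are not unique. A sketch with made-up tables and a deliberately duplicated 'CA' key:

```python
import pandas as pd

left = pd.DataFrame({'state': ['CA', 'TX'], 'total': [10, 20]})
right = pd.DataFrame({
    'postal_code': ['CA', 'TX', 'CA'],   # 'CA' deliberately duplicated
    'name': ['California', 'Texas', 'California (dup)'],
})

# Without validate, the merge silently produces 3 rows instead of 2:
silently_duplicated = left.merge(
    right, left_on='state', right_on='postal_code', how='left')

# validate='m:1' requires the right table's keys to be unique,
# so the duplicate 'CA' makes the merge fail loudly instead.
try:
    left.merge(right, left_on='state', right_on='postal_code',
               how='left', validate='m:1')
    merge_failed = False
except pd.errors.MergeError:
    merge_failed = True
```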

Debugging tool: Do an outer join, then groupby your primary key and the _merge indicator:

students_by_state_with_names = (
  students_by_state
  .merge(
    states,
    left_on='state', right_on='postal_code',
    how='outer', indicator=True
  )
)

(
  students_by_state_with_names
  .groupby(['state', '_merge'], as_index=False)
  .size()
  .query("_merge != 'both' and size > 0")
)
state _merge size
33 FM left_only 1
75 MH left_only 1
135 PW left_only 1

Things that might be wrong:

  • You mis-specified the primary key (e.g., it actually should be two columns)
  • The keys don’t match (e.g., one has “US” and the other has “United States”)

Tools to fix key mismatches:

  • Use replace to fix a small number of key mismatches.
  • When you have a large number of mismatches, merge with a translation table, then fillna from the translation table.
  • To make a translation table, you might use fuzzy matching, using tools like thefuzz (but be careful).
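
A minimal sketch of the replace approach, using made-up tables and a made-up mismatched key:

```python
import pandas as pd

# Toy tables with one mismatched key ('US Virgin Is.' vs 'VI').
students_by_state = pd.DataFrame({
    'state': ['CA', 'US Virgin Is.'],
    'total_undergrads': [100, 5],
})
states = pd.DataFrame({
    'postal_code': ['CA', 'VI'],
    'name': ['California', 'U.S. Virgin Islands'],
})

# Fix the key with replace, then merge as usual:
fixed = students_by_state.assign(
    state=students_by_state['state'].replace({'US Virgin Is.': 'VI'})
)
merged = fixed.merge(states, left_on='state', right_on='postal_code',
                     how='left')
```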

Web Scraping

You can use pd.read_html to read tables from web pages. This is actually how I made the states.csv file that we just read.

states_tables_raw = pd.read_html("https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States", match="postal abbr")
print("Found", len(states_tables_raw), "tables")
states_tables_raw[-1]

Expect the result to be messy. Cleaning it up often requires doing things that are beyond the scope of this class, but here’s one example:

def clean_states_table(df):
  return (
    df
    .iloc[:, :2] # we only care about the first two columns
    .set_axis(['name', 'postal_code'], axis=1) # assign column names
    .dropna() # drop rows with missing values
    .assign(name = lambda df: df['name'].str.replace(r'\[.*\]', '', regex=True))
  )
cleaned_territories = clean_states_table(states_tables_raw[-1])
cleaned_territories
name postal_code
0 American Samoa AS
2 Guam GU
4 Northern Mariana Islands MP
6 Puerto Rico PR
8 U.S. Virgin Islands VI

This is a good case for using concat:

state_names = pd.concat([
  clean_states_table(table) for table in states_tables_raw
], ignore_index=True)
state_names
name postal_code
0 Alabama AL
1 Alaska AK
2 Arizona AZ
3 Arkansas AR
4 California CA
5 Colorado CO
6 Connecticut CT
7 Delaware DE
8 Florida FL
9 Georgia GA
10 Hawaii HI
11 Idaho ID
12 Illinois IL
13 Indiana IN
14 Iowa IA
15 Kansas KS
16 Kentucky KY
17 Louisiana LA
18 Maine ME
19 Maryland MD
20 Massachusetts MA
21 Michigan MI
22 Minnesota MN
23 Mississippi MS
24 Missouri MO
25 Montana MT
26 Nebraska NE
27 Nevada NV
28 New Hampshire NH
29 New Jersey NJ
30 New Mexico NM
31 New York NY
32 North Carolina NC
33 North Dakota ND
34 Ohio OH
35 Oklahoma OK
36 Oregon OR
37 Pennsylvania PA
38 Rhode Island RI
39 South Carolina SC
40 South Dakota SD
41 Tennessee TN
42 Texas TX
43 Utah UT
44 Vermont VT
45 Virginia VA
46 Washington WA
47 West Virginia WV
48 Wisconsin WI
49 Wyoming WY
50 District of Columbia DC
51 American Samoa AS
52 Guam GU
53 Northern Mariana Islands MP
54 Puerto Rico PR
55 U.S. Virgin Islands VI

Style notes

We’ll use a consistent pandas style in this class.

  • access columns with df['col'], not df.col
  • no index
  • groupby with as_index=False
  • explicit aggregation
  • .size of groups, not value_counts

Know the rules in order to know when to break them.

No index

  • pandas has indexing, an unusual concept
    • not used in SQL, the tidyverse, spreadsheets, etc.
    • even Ibis, the next-generation dataframe library by Wes McKinney (the creator of pandas), doesn't use it.
  • it's sometimes useful, but often confusing, so we'll avoid it.
  • we'll use as_index=False in all our groupby calls.
    • that way, all our aggregation results will be DataFrames, not Series.

Explicit aggregation

Lazy method:

colleges.groupby("state", as_index=False).undergrads.mean()

Explicit method:

colleges.groupby("state", as_index=False).agg(
  mean_undergrads=('undergrads', 'mean'),
  max_undergrads=('undergrads', 'max')
)

Size of a group

Preferred over value_counts because it:

  • works with multiple columns
  • returns a DataFrame, not a Series

(
  colleges
  .groupby("state", as_index=False)
  .size()
)
(
  colleges
  .groupby(["state", "online_only"], as_index=False)
  .size()
)

Compare with value_counts:

colleges["state"].value_counts()
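
The difference in return types, sketched with toy data:

```python
import pandas as pd

# Toy data (illustrative).
colleges = pd.DataFrame({'state': ['MI', 'MI', 'AK']})

# groupby(...).size() returns a DataFrame with 'state' and 'size' columns:
counts_df = colleges.groupby('state', as_index=False).size()

# value_counts returns a Series indexed by the state values:
counts_series = colleges['state'].value_counts()
```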