Slides 5.1: Tabular Data with pandas

Introduction

Objectives

  • Describe the organization of tabular data into rows and columns
  • Subset rows of data using boolean expressions
  • Use the split-apply-combine pattern to compute summary statistics

Why?

  • Data doesn’t come neatly wrapped.
  • Wrangling decisions have big impacts on analysis and conclusions.

pandas

pandas logo

pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.” https://pandas.pydata.org/

import pandas as pd
pd.options.plotting.backend = "plotly"

but not just pandas

  • Like visualization, there’s a “grammar” of data wrangling.
  • Operations:
    • filter or query: subset of rows
    • select: subset of columns
    • derive new columns
    • aggregate: summarize groups of rows
    • pivot: reshape data
  • We’ll use pandas, but the ideas apply to other tools.
    • R: tidyverse
    • SQL: SELECT, WHERE, GROUP BY
    • even Excel / Google Sheets

Tabular data Example

Source: Department of Education College Scorecard

Simplified dataset for this exercise: https://calvin-data-science.github.io/data202/data/college_scorecard.csv

colleges = pd.read_csv("https://calvin-data-science.github.io/data202/data/college_scorecard.csv")
colleges.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6694 entries, 0 to 6693
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   name         6694 non-null   object 
 1   city         6694 non-null   object 
 2   state        6694 non-null   object 
 3   online_only  6189 non-null   float64
 4   undergrads   5902 non-null   float64
dtypes: float64(2), object(3)
memory usage: 261.6+ KB

colleges
name city state online_only undergrads
0 Alabama A & M University Normal AL 0.0 5271.0
1 University of Alabama... Birmingham AL 0.0 13328.0
2 Amridge University Montgomery AL 1.0 365.0
3 University of Alabama... Huntsville AL 0.0 7785.0
4 Alabama State University Montgomery AL 0.0 3750.0
... ... ... ... ... ...
6689 Georgia Military Coll... Eastman GA NaN NaN
6690 American College of B... Florence KY NaN NaN
6691 HCI College - Fort La... Fort Lauderdale FL NaN NaN
6692 ABC Adult School - Ca... Cerritos CA NaN NaN
6693 Urban Barber College ... San Jose CA NaN NaN

6694 rows × 5 columns

Selecting a row

iloc = locate by integer index

colleges.iloc[0]
name           Alabama A & M University
city                             Normal
state                                AL
online_only                         0.0
undergrads                       5271.0
Name: 0, dtype: object

Selecting a column

colleges["undergrads"]
0        5271.0
1       13328.0
2         365.0
3        7785.0
4        3750.0
         ...   
6689        NaN
6690        NaN
6691        NaN
6692        NaN
6693        NaN
Name: undergrads, Length: 6694, dtype: float64
type(colleges["undergrads"])
pandas.core.series.Series

Calculating on columns

colleges["undergrads"].mean()
2559.6435106743475
colleges["undergrads"].max()
98630.0
colleges["undergrads"].describe()
count     5902.000000
mean      2559.643511
std       5801.780841
min          0.000000
25%        108.000000
50%        475.000000
75%       2151.500000
max      98630.000000
Name: undergrads, dtype: float64

Querying

is_small = colleges["undergrads"] < 10000
is_small
0        True
1       False
2        True
3        True
4        True
        ...  
6689    False
6690    False
6691    False
6692    False
6693    False
Name: undergrads, Length: 6694, dtype: bool

Making new columns

colleges["online_only"] = colleges["online_only"] == 1.0
colleges["is_small"] = colleges["undergrads"] < 10000
colleges.head()
name city state online_only undergrads is_small
0 Alabama A & M University Normal AL False 5271.0 True
1 University of Alabama... Birmingham AL False 13328.0 False
2 Amridge University Montgomery AL True 365.0 True
3 University of Alabama... Huntsville AL False 7785.0 True
4 Alabama State University Montgomery AL False 3750.0 True

Exercise: data type

What data type is is_small?

is_small = colleges["undergrads"] < 10000
type(is_small)
  • bool
  • int
  • DataFrame
  • Series

Filtering

colleges[
  colleges["undergrads"] < 10000
]
name city state online_only undergrads is_small
0 Alabama A & M University Normal AL False 5271.0 True
2 Amridge University Montgomery AL True 365.0 True
3 University of Alabama... Huntsville AL False 7785.0 True
4 Alabama State University Montgomery AL False 3750.0 True
6 Central Alabama Commu... Alexander City AL False 1201.0 True
... ... ... ... ... ... ...
6184 Educational Technical... Lares PR False 81.0 True
6185 Fortis College-Landover Landover MD False 625.0 True
6186 Stautzenberger Colleg... Rockford IL False 472.0 True
6187 WellSpring School of ... Wichita KS False 38.0 True
6188 Access Careers-Islandia Islandia NY False 63.0 True

5520 rows × 6 columns

Filtering, using a column instead

colleges[
  colleges['is_small']
]
name city state online_only undergrads is_small
0 Alabama A & M University Normal AL False 5271.0 True
2 Amridge University Montgomery AL True 365.0 True
3 University of Alabama... Huntsville AL False 7785.0 True
4 Alabama State University Montgomery AL False 3750.0 True
6 Central Alabama Commu... Alexander City AL False 1201.0 True
... ... ... ... ... ... ...
6184 Educational Technical... Lares PR False 81.0 True
6185 Fortis College-Landover Landover MD False 625.0 True
6186 Stautzenberger Colleg... Rockford IL False 472.0 True
6187 WellSpring School of ... Wichita KS False 38.0 True
6188 Access Careers-Islandia Islandia NY False 63.0 True

5520 rows × 6 columns

query method

colleges.query("undergrads < 10000")
name city state online_only undergrads is_small
0 Alabama A & M University Normal AL False 5271.0 True
2 Amridge University Montgomery AL True 365.0 True
3 University of Alabama... Huntsville AL False 7785.0 True
4 Alabama State University Montgomery AL False 3750.0 True
6 Central Alabama Commu... Alexander City AL False 1201.0 True
... ... ... ... ... ... ...
6184 Educational Technical... Lares PR False 81.0 True
6185 Fortis College-Landover Landover MD False 625.0 True
6186 Stautzenberger Colleg... Rockford IL False 472.0 True
6187 WellSpring School of ... Wichita KS False 38.0 True
6188 Access Careers-Islandia Islandia NY False 63.0 True

5520 rows × 6 columns

or

colleges.query("is_small")
name city state online_only undergrads is_small
0 Alabama A & M University Normal AL False 5271.0 True
2 Amridge University Montgomery AL True 365.0 True
3 University of Alabama... Huntsville AL False 7785.0 True
4 Alabama State University Montgomery AL False 3750.0 True
6 Central Alabama Commu... Alexander City AL False 1201.0 True
... ... ... ... ... ... ...
6184 Educational Technical... Lares PR False 81.0 True
6185 Fortis College-Landover Landover MD False 625.0 True
6186 Stautzenberger Colleg... Rockford IL False 472.0 True
6187 WellSpring School of ... Wichita KS False 38.0 True
6188 Access Careers-Islandia Islandia NY False 63.0 True

5520 rows × 6 columns

Exercise: How many colleges are there in Michigan?

colleges.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6694 entries, 0 to 6693
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   name         6694 non-null   object 
 1   city         6694 non-null   object 
 2   state        6694 non-null   object 
 3   online_only  6694 non-null   bool   
 4   undergrads   5902 non-null   float64
 5   is_small     6694 non-null   bool   
dtypes: bool(2), float64(1), object(3)
memory usage: 222.4+ KB

colleges.query("state == 'MI'")
name city state online_only undergrads is_small
1065 West Michigan College... Kalamazoo MI False 56.0 True
1531 Adrian College Adrian MI False 1727.0 True
1532 Albion College Albion MI False 1451.0 True
1533 Hillsdale Beauty College Hillsdale MI False 22.0 True
1534 Northwestern Technolo... Southfield MI False 1000.0 True
... ... ... ... ... ... ...
6346 Davenport University ... Clinton Township MI False NaN False
6347 Davenport University ... Livonia MI False NaN False
6348 Davenport University ... Detroit MI False NaN False
6618 L'esprit Academy - Ro... Royal Oak MI False NaN False
6640 Keweenaw Bay Ojibwa C... L'Anse MI False NaN False

192 rows × 6 columns

alternative style:

colleges[
  colleges["state"] == "MI"
]
name city state online_only undergrads is_small
1065 West Michigan College... Kalamazoo MI False 56.0 True
1531 Adrian College Adrian MI False 1727.0 True
1532 Albion College Albion MI False 1451.0 True
1533 Hillsdale Beauty College Hillsdale MI False 22.0 True
1534 Northwestern Technolo... Southfield MI False 1000.0 True
... ... ... ... ... ... ...
6346 Davenport University ... Clinton Township MI False NaN False
6347 Davenport University ... Livonia MI False NaN False
6348 Davenport University ... Detroit MI False NaN False
6618 L'esprit Academy - Ro... Royal Oak MI False NaN False
6640 Keweenaw Bay Ojibwa C... L'Anse MI False NaN False

192 rows × 6 columns

String operations

colleges.query(
  "name.str.startswith('Calvin')"
)
name city state online_only undergrads is_small
1542 Calvin University Grand Rapids MI False 3382.0 True
1543 Calvin Theological Se... Grand Rapids MI False NaN False

alternative style:

colleges[
  colleges["name"]
  .str
  .startswith("Calvin")
]
name city state online_only undergrads is_small
1542 Calvin University Grand Rapids MI False 3382.0 True
1543 Calvin Theological Se... Grand Rapids MI False NaN False

Aggregation

  • Split the data into groups
  • Apply some operation to each group
  • Combine the results

Goal

Total number of undergrads in each state:

state total_undergrads
5 CA 2256934.0
50 TX 1287654.0
39 NY 914704.0
10 FL 888334.0
43 PA 535743.0
40 OH 491978.0
17 IL 488680.0
4 AZ 454182.0
32 NC 433710.0
12 GA 422079.0
26 MI 416196.0
52 VA 395748.0
36 NJ 339620.0
22 MA 332261.0
18 IN 316037.0
51 UT 291101.0
6 CO 270335.0
23 MD 266875.0
49 TN 257396.0
55 WA 257173.0
56 WI 248426.0
28 MO 245277.0
27 MN 235812.0
1 AL 233872.0
47 SC 196690.0
21 LA 190803.0
42 OR 173710.0
20 KY 172354.0
44 PR 167645.0
41 OK 164129.0
7 CT 158636.0
15 IA 147794.0
19 KS 147272.0
35 NH 134509.0
30 MS 131455.0
2 AR 118326.0
57 WV 103472.0
38 NV 98145.0
16 ID 92240.0
34 NE 89930.0
37 NM 84696.0
46 RI 67738.0
8 DC 51925.0
24 ME 49911.0
9 DE 47011.0
14 HI 44035.0
31 MT 38086.0
33 ND 37780.0
48 SD 37398.0
54 VT 31605.0
58 WY 21175.0
0 AK 16390.0
13 GU 4739.0
11 FM 1900.0
53 VI 1744.0
29 MP 1228.0
3 AS 992.0
25 MH 985.0
45 PW 441.0

Approach

  • Split the data into groups
  • Apply some operation to each group
  • Combine the results

We’ll build up the code step-by-step. (The earlier steps will include code that we won’t usually write.)

Split, apply, combine

for state, group in colleges.groupby("state"):
  print(state)
  print(group)
  break # stop early, so we can see just the first group
AK
                          name       city state  online_only  undergrads  \
56    University of Alaska ...  Anchorage    AK        False      9257.0   
57        Alaska Bible College     Palmer    AK        False        30.0   
58    University of Alaska ...  Fairbanks    AK        False      4410.0   
59    University of Alaska ...     Juneau    AK        False      1156.0   
60    Alaska Pacific Univer...  Anchorage    AK        False       318.0   
61    Alaska Vocational Tec...     Seward    AK        False       631.0   
63       Alaska Career College  Anchorage    AK        False       361.0   
4433         Ilisagvik College     Barrow    AK        False       132.0   
4589  Alaska Christian College   Soldotna    AK        False        95.0   

      is_small  
56        True  
57        True  
58        True  
59        True  
60        True  
61        True  
63        True  
4433      True  
4589      True  

Split, apply, combine

for state, group in colleges.groupby("state"):
  print(state, "has", len(group))
AK has 9
AL has 89
AR has 92
AS has 1
AZ has 114
CA has 693
CO has 101
CT has 79
DC has 25
DE has 21
FL has 378
FM has 1
GA has 175
GU has 3
HI has 21
IA has 80
ID has 38
IL has 258
IN has 139
KS has 80
KY has 90
LA has 121
MA has 156
MD has 85
ME has 38
MH has 1
MI has 192
MN has 112
MO has 158
MP has 1
MS has 59
MT has 32
NC has 179
ND has 26
NE has 42
NH has 37
NJ has 168
NM has 49
NV has 38
NY has 455
OH has 289
OK has 108
OR has 78
PA has 353
PR has 144
PW has 1
RI has 23
SC has 100
SD has 28
TN has 155
TX has 433
UT has 71
VA has 168
VI has 2
VT has 22
WA has 105
WI has 95
WV has 73
WY has 10

Split, apply, combine

for state, group in colleges.groupby("state"):
  print(state, "totals", group["undergrads"].sum())
AK totals 16390.0
AL totals 233872.0
AR totals 118326.0
AS totals 992.0
AZ totals 454182.0
CA totals 2256934.0
CO totals 270335.0
CT totals 158636.0
DC totals 51925.0
DE totals 47011.0
FL totals 888334.0
FM totals 1900.0
GA totals 422079.0
GU totals 4739.0
HI totals 44035.0
IA totals 147794.0
ID totals 92240.0
IL totals 488680.0
IN totals 316037.0
KS totals 147272.0
KY totals 172354.0
LA totals 190803.0
MA totals 332261.0
MD totals 266875.0
ME totals 49911.0
MH totals 985.0
MI totals 416196.0
MN totals 235812.0
MO totals 245277.0
MP totals 1228.0
MS totals 131455.0
MT totals 38086.0
NC totals 433710.0
ND totals 37780.0
NE totals 89930.0
NH totals 134509.0
NJ totals 339620.0
NM totals 84696.0
NV totals 98145.0
NY totals 914704.0
OH totals 491978.0
OK totals 164129.0
OR totals 173710.0
PA totals 535743.0
PR totals 167645.0
PW totals 441.0
RI totals 67738.0
SC totals 196690.0
SD totals 37398.0
TN totals 257396.0
TX totals 1287654.0
UT totals 291101.0
VA totals 395748.0
VI totals 1744.0
VT totals 31605.0
WA totals 257173.0
WI totals 248426.0
WV totals 103472.0
WY totals 21175.0

Split, apply, combine

This is the code you’ll typically write.

students_by_state = (
  colleges
1  .groupby("state", as_index=False)
2  .agg(total_undergrads=('undergrads', 'sum'))
)
students_by_state
1
Group by state
2
aggregate: compute a total_undergrads column by (a) extracting the undergrads column, and (b) summing it
state total_undergrads
0 AK 16390.0
1 AL 233872.0
2 AR 118326.0
3 AS 992.0
4 AZ 454182.0
5 CA 2256934.0
6 CO 270335.0
7 CT 158636.0
8 DC 51925.0
9 DE 47011.0
10 FL 888334.0
11 FM 1900.0
12 GA 422079.0
13 GU 4739.0
14 HI 44035.0
15 IA 147794.0
16 ID 92240.0
17 IL 488680.0
18 IN 316037.0
19 KS 147272.0
20 KY 172354.0
21 LA 190803.0
22 MA 332261.0
23 MD 266875.0
24 ME 49911.0
25 MH 985.0
26 MI 416196.0
27 MN 235812.0
28 MO 245277.0
29 MP 1228.0
30 MS 131455.0
31 MT 38086.0
32 NC 433710.0
33 ND 37780.0
34 NE 89930.0
35 NH 134509.0
36 NJ 339620.0
37 NM 84696.0
38 NV 98145.0
39 NY 914704.0
40 OH 491978.0
41 OK 164129.0
42 OR 173710.0
43 PA 535743.0
44 PR 167645.0
45 PW 441.0
46 RI 67738.0
47 SC 196690.0
48 SD 37398.0
49 TN 257396.0
50 TX 1287654.0
51 UT 291101.0
52 VA 395748.0
53 VI 1744.0
54 VT 31605.0
55 WA 257173.0
56 WI 248426.0
57 WV 103472.0
58 WY 21175.0

Style notes

We’ll use a consistent pandas style in this class.

  • access columns with df['col'], not df.col
  • no index
  • groupby with as_index=False
  • explicit aggregation
  • .size of groups, not value_counts
    • Caution: the size attribute on a groupby has a different meaning from size on a DataFrame.

Know the rules in order to know when to break them.

Class style: no index

  • pandas has indexing, an unusual concept
    • not used in SQL, tidyverse, spreadsheets, etc.
    • even Ibis, next-gen dataframe library by Wes McKinney (pandas creator), doesn’t use it.
  • it’s sometimes useful, but often confusing, so we’ll avoid it.
  • so we’ll use as_index=False in all our groupby calls.
    • so all our aggregation results will be DataFrames, not Series.

Class style: explicit aggregation

Lazy method:

colleges.groupby("state", as_index=False)['undergrads'].mean()
state undergrads
0 AK 1821.111111
1 AL 2960.405063
2 AR 1425.614458
3 AS 992.000000
4 AZ 4587.696970
5 CA 3918.288194
6 CO 3037.471910
7 CT 2299.072464
8 DC 3054.411765
9 DE 2611.722222
10 FL 2767.395639
11 FM 1900.000000
12 GA 3014.850000
13 GU 1579.666667
14 HI 2201.750000
15 IA 1919.402597
16 ID 2492.972973
17 IL 2191.390135
18 IN 3068.320388
19 KS 1990.162162
20 KY 2076.554217
21 LA 1718.945946
22 MA 2339.866197
23 MD 3421.474359
24 ME 1426.028571
25 MH 985.000000
26 MI 2650.929936
27 MN 2591.340659
28 MO 1816.866667
29 MP 1228.000000
30 MS 2480.283019
31 MT 1269.533333
32 NC 2693.850932
33 ND 1453.076923
34 NE 2141.190476
35 NH 3956.147059
36 NJ 2135.974843
37 NM 1882.133333
38 NV 2726.250000
39 NY 2275.383085
40 OH 1842.614232
41 OK 1641.290000
42 OR 2285.657895
43 PA 1695.389241
44 PR 1279.732824
45 PW 441.000000
46 RI 3225.619048
47 SC 2114.946237
48 SD 1438.384615
49 TN 1812.647887
50 TX 3370.821990
51 UT 4410.621212
52 VA 2867.739130
53 VI 1744.000000
54 VT 1755.833333
55 WA 2678.885417
56 WI 2888.674419
57 WV 1457.352113
58 WY 2117.500000

Explicit method:

colleges.groupby("state", as_index=False).agg(
  mean_undergrads=('undergrads', 'mean'),
  max_undergrads=('undergrads', 'max')
)
state mean_undergrads max_undergrads
0 AK 1821.111111 9257.0
1 AL 2960.405063 31900.0
2 AR 1425.614458 22766.0
3 AS 992.000000 992.0
4 AZ 4587.696970 68833.0
5 CA 3918.288194 35087.0
6 CO 3037.471910 30630.0
7 CT 2299.072464 18585.0
8 DC 3054.411765 12031.0
9 DE 2611.722222 19047.0
10 FL 2767.395639 58940.0
11 FM 1900.000000 1900.0
12 GA 3014.850000 33439.0
13 GU 1579.666667 3169.0
14 HI 2201.750000 12255.0
15 IA 1919.402597 27930.0
16 ID 2492.972973 38672.0
17 IL 2191.390135 33080.0
18 IN 3068.320388 58978.0
19 KS 1990.162162 19003.0
20 KY 2076.554217 21864.0
21 LA 1718.945946 23706.0
22 MA 2339.866197 23907.0
23 MD 3421.474359 42888.0
24 ME 1426.028571 8832.0
25 MH 985.000000 985.0
26 MI 2650.929936 38950.0
27 MN 2591.340659 31367.0
28 MO 1816.866667 21933.0
29 MP 1228.000000 1228.0
30 MS 2480.283019 18388.0
31 MT 1269.533333 14402.0
32 NC 2693.850932 24239.0
33 ND 1453.076923 10579.0
34 NE 2141.190476 20252.0
35 NH 3956.147059 90196.0
36 NJ 2135.974843 35760.0
37 NM 1882.133333 18120.0
38 NV 2726.250000 30090.0
39 NY 2275.383085 26613.0
40 OH 1842.614232 45657.0
41 OK 1641.290000 21329.0
42 OR 2285.657895 25339.0
43 PA 1695.389241 74630.0
44 PR 1279.732824 12508.0
45 PW 441.000000 441.0
46 RI 3225.619048 13826.0
47 SC 2114.946237 27066.0
48 SD 1438.384615 9034.0
49 TN 1812.647887 23152.0
50 TX 3370.821990 55990.0
51 UT 4410.621212 98630.0
52 VA 2867.739130 46438.0
53 VI 1744.000000 1744.0
54 VT 1755.833333 10700.0
55 WA 2678.885417 30839.0
56 WI 2888.674419 31185.0
57 WV 1457.352113 35656.0
58 WY 2117.500000 9646.0

Size of a group

Preferred to value_counts:

  • works with multiple columns
  • returns a DataFrame, not a Series
(
  colleges
  .groupby("state", as_index=False)
  .size()
)
state size
0 AK 9
1 AL 89
2 AR 92
3 AS 1
4 AZ 114
5 CA 693
6 CO 101
7 CT 79
8 DC 25
9 DE 21
10 FL 378
11 FM 1
12 GA 175
13 GU 3
14 HI 21
15 IA 80
16 ID 38
17 IL 258
18 IN 139
19 KS 80
20 KY 90
21 LA 121
22 MA 156
23 MD 85
24 ME 38
25 MH 1
26 MI 192
27 MN 112
28 MO 158
29 MP 1
30 MS 59
31 MT 32
32 NC 179
33 ND 26
34 NE 42
35 NH 37
36 NJ 168
37 NM 49
38 NV 38
39 NY 455
40 OH 289
41 OK 108
42 OR 78
43 PA 353
44 PR 144
45 PW 1
46 RI 23
47 SC 100
48 SD 28
49 TN 155
50 TX 433
51 UT 71
52 VA 168
53 VI 2
54 VT 22
55 WA 105
56 WI 95
57 WV 73
58 WY 10
(
  colleges
  .groupby(["state", "online_only"], as_index=False)
  .size()
)
state online_only size
0 AK False 9
1 AL False 87
2 AL True 2
3 AR False 91
4 AR True 1
... ... ... ...
79 WI False 93
80 WI True 2
81 WV False 70
82 WV True 3
83 WY False 10

84 rows × 3 columns

Compare with value_counts:

colleges["state"].value_counts()
CA    693
NY    455
TX    433
FL    378
PA    353
OH    289
IL    258
MI    192
NC    179
GA    175
VA    168
NJ    168
MO    158
MA    156
TN    155
PR    144
IN    139
LA    121
AZ    114
MN    112
OK    108
WA    105
CO    101
SC    100
WI     95
AR     92
KY     90
AL     89
MD     85
IA     80
KS     80
CT     79
OR     78
WV     73
UT     71
MS     59
NM     49
NE     42
NV     38
ME     38
ID     38
NH     37
MT     32
SD     28
ND     26
DC     25
RI     23
VT     22
HI     21
DE     21
WY     10
AK      9
GU      3
VI      2
AS      1
MP      1
FM      1
PW      1
MH      1
Name: state, dtype: int64

Exercise: are online-only colleges bigger?

Make a table of the average (mean) number of undergrads broken down by online_only.

(
  colleges
  .groupby("online_only", as_index=False)
  .agg(mean_undergrads=('undergrads', 'mean'))
)
online_only mean_undergrads
0 False 2534.801128
1 True 5527.040816

Plots integrated

colleges['undergrads'].hist()
students_by_state.plot.bar(x='total_undergrads', y='state')