Slides 5.2: Wrangling Example

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.io as pio
pd.options.plotting.backend = "plotly"
pio.templates.default = "plotly_white"

Review: pandas and the grammar of data

colleges = pd.read_csv("https://calvin-data-science.github.io/data202/data/college_scorecard.csv")
colleges.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6694 entries, 0 to 6693
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   name         6694 non-null   object 
 1   city         6694 non-null   object 
 2   state        6694 non-null   object 
 3   online_only  6189 non-null   float64
 4   undergrads   5902 non-null   float64
dtypes: float64(2), object(3)
memory usage: 261.6+ KB

Grammar of Data Wrangling

Generally useful conceptual operations, common to many tools (pandas, SQL, R, …).

  • select columns
  • filter rows
  • sort rows
  • group rows and summarize groups
  • derive new columns

next week:

  • join tables
  • reshape tables

select columns

colleges[['name', 'undergrads']]
name undergrads
0 Alabama A & M University 5271.0
1 University of Alabama at Birmingham 13328.0
2 Amridge University 365.0
3 University of Alabama in Huntsville 7785.0
4 Alabama State University 3750.0
... ... ...
6689 Georgia Military College - Eastman NaN
6690 American College of Barbering - Florence NaN
6691 HCI College - Fort Lauderdale Campus NaN
6692 ABC Adult School - Cabrillo Lane NaN
6693 Urban Barber College - San Jose NaN

6694 rows × 2 columns

Documentation on indexing

Getting a single column as a Series:

colleges['name']
0                       Alabama A & M University
1            University of Alabama at Birmingham
2                             Amridge University
3            University of Alabama in Huntsville
4                       Alabama State University
                          ...                   
6689          Georgia Military College - Eastman
6690    American College of Barbering - Florence
6691        HCI College - Fort Lauderdale Campus
6692            ABC Adult School - Cabrillo Lane
6693             Urban Barber College - San Jose
Name: name, Length: 6694, dtype: object

Remember: column names are strings (not colleges[name])!

select rows (query)

colleges.query("state == 'AK'")
name city state online_only undergrads
56 University of Alaska Anchorage Anchorage AK 0.0 9257.0
57 Alaska Bible College Palmer AK 0.0 30.0
58 University of Alaska Fairbanks Fairbanks AK 0.0 4410.0
59 University of Alaska Southeast Juneau AK 0.0 1156.0
60 Alaska Pacific University Anchorage AK 0.0 318.0
61 Alaska Vocational Technical Center Seward AK 0.0 631.0
63 Alaska Career College Anchorage AK 0.0 361.0
4433 Ilisagvik College Barrow AK 0.0 132.0
4589 Alaska Christian College Soldotna AK 0.0 95.0

Documentation: pandas.DataFrame.query

top and bottom

colleges.head()
name city state online_only undergrads
0 Alabama A & M University Normal AL 0.0 5271.0
1 University of Alabama at Birmingham Birmingham AL 0.0 13328.0
2 Amridge University Montgomery AL 1.0 365.0
3 University of Alabama in Huntsville Huntsville AL 0.0 7785.0
4 Alabama State University Montgomery AL 0.0 3750.0
colleges.tail()
name city state online_only undergrads
6689 Georgia Military College - Eastman Eastman GA NaN NaN
6690 American College of Barbering - Florence Florence KY NaN NaN
6691 HCI College - Fort Lauderdale Campus Fort Lauderdale FL NaN NaN
6692 ABC Adult School - Cabrillo Lane Cerritos CA NaN NaN
6693 Urban Barber College - San Jose San Jose CA NaN NaN

arrange rows (sort_values)

colleges.sort_values("undergrads").head()
name city state online_only undergrads
6092 National American University-Killeen Killeen TX 0.0 0.0
5542 Ukiah Adult School Ukiah CA 0.0 0.0
2912 Pennsylvania State University-Penn State Brand... Media PA 0.0 0.0
2919 Pennsylvania State University-Penn State Abington Abington PA 0.0 0.0
2903 Pennsylvania State University-Penn State New K... New Kensington PA 0.0 0.0
colleges.sort_values("undergrads", ascending=False).head()
name city state online_only undergrads
4412 Western Governors University Salt Lake City UT 1.0 98630.0
1928 Southern New Hampshire University Manchester NH 0.0 90196.0
2916 Pennsylvania State University-Main Campus University Park PA 0.0 74630.0
5700 University of Phoenix-Arizona Phoenix AZ 0.0 68833.0
1040 Ivy Tech Community College Indianapolis IN 0.0 58978.0

Documentation: pandas.DataFrame.sort_values

counting

colleges.groupby("state", as_index=False).size()
state size
0 AK 9
1 AL 89
2 AR 92
3 AS 1
4 AZ 114
5 CA 693
6 CO 101
7 CT 79
8 DC 25
9 DE 21
10 FL 378
11 FM 1
12 GA 175
13 GU 3
14 HI 21
15 IA 80
16 ID 38
17 IL 258
18 IN 139
19 KS 80
20 KY 90
21 LA 121
22 MA 156
23 MD 85
24 ME 38
25 MH 1
26 MI 192
27 MN 112
28 MO 158
29 MP 1
30 MS 59
31 MT 32
32 NC 179
33 ND 26
34 NE 42
35 NH 37
36 NJ 168
37 NM 49
38 NV 38
39 NY 455
40 OH 289
41 OK 108
42 OR 78
43 PA 353
44 PR 144
45 PW 1
46 RI 23
47 SC 100
48 SD 28
49 TN 155
50 TX 433
51 UT 71
52 VA 168
53 VI 2
54 VT 22
55 WA 105
56 WI 95
57 WV 73
58 WY 10

grouping / summarizing (general)

colleges.groupby("state", as_index=False).agg(
  count=('undergrads', 'count'),
  mean_undergrads=('undergrads', 'mean'),
  max_undergrads=('undergrads', 'max')
)
state count mean_undergrads max_undergrads
0 AK 9 1821.111111 9257.0
1 AL 79 2960.405063 31900.0
2 AR 83 1425.614458 22766.0
3 AS 1 992.000000 992.0
4 AZ 99 4587.696970 68833.0
5 CA 576 3918.288194 35087.0
6 CO 89 3037.471910 30630.0
7 CT 69 2299.072464 18585.0
8 DC 17 3054.411765 12031.0
9 DE 18 2611.722222 19047.0
10 FL 321 2767.395639 58940.0
11 FM 1 1900.000000 1900.0
12 GA 140 3014.850000 33439.0
13 GU 3 1579.666667 3169.0
14 HI 20 2201.750000 12255.0
15 IA 77 1919.402597 27930.0
16 ID 37 2492.972973 38672.0
17 IL 223 2191.390135 33080.0
18 IN 103 3068.320388 58978.0
19 KS 74 1990.162162 19003.0
20 KY 83 2076.554217 21864.0
21 LA 111 1718.945946 23706.0
22 MA 142 2339.866197 23907.0
23 MD 78 3421.474359 42888.0
24 ME 35 1426.028571 8832.0
25 MH 1 985.000000 985.0
26 MI 157 2650.929936 38950.0
27 MN 91 2591.340659 31367.0
28 MO 135 1816.866667 21933.0
29 MP 1 1228.000000 1228.0
30 MS 53 2480.283019 18388.0
31 MT 30 1269.533333 14402.0
32 NC 161 2693.850932 24239.0
33 ND 26 1453.076923 10579.0
34 NE 42 2141.190476 20252.0
35 NH 34 3956.147059 90196.0
36 NJ 159 2135.974843 35760.0
37 NM 45 1882.133333 18120.0
38 NV 36 2726.250000 30090.0
39 NY 402 2275.383085 26613.0
40 OH 267 1842.614232 45657.0
41 OK 100 1641.290000 21329.0
42 OR 76 2285.657895 25339.0
43 PA 316 1695.389241 74630.0
44 PR 131 1279.732824 12508.0
45 PW 1 441.000000 441.0
46 RI 21 3225.619048 13826.0
47 SC 93 2114.946237 27066.0
48 SD 26 1438.384615 9034.0
49 TN 142 1812.647887 23152.0
50 TX 382 3370.821990 55990.0
51 UT 66 4410.621212 98630.0
52 VA 138 2867.739130 46438.0
53 VI 1 1744.000000 1744.0
54 VT 18 1755.833333 10700.0
55 WA 96 2678.885417 30839.0
56 WI 86 2888.674419 31185.0
57 WV 71 1457.352113 35656.0
58 WY 10 2117.500000 9646.0

random sampling

Shuffle rows:

colleges.sample(frac=1.0).head()
name city state online_only undergrads
5721 Paul Mitchell the School-Toledo Sylvania OH 0.0 143.0
5725 Beau Monde Academy of Cosmetology and Barbering Portland OR 0.0 115.0
5044 Oliver Finley Academy of Cosmetology Boise ID 0.0 254.0
2160 Hilbert College Hamburg NY 0.0 711.0
6250 Porter and Chester Institute of Enfield Enfield CT NaN NaN

Select random rows:

colleges.sample(n=5)
name city state online_only undergrads
4094 ASM Beauty World Academy Hollywood FL 0.0 90.0
1179 Hays Academy of Hair Design Hays KS 0.0 70.0
104 Universal Technical Institute of Arizona Inc Avondale AZ 0.0 1873.0
3328 South Texas Barber College Inc Corpus Christi TX 0.0 50.0
5573 Empire Beauty School-Rochester Rochester NY 0.0 73.0

Deriving new columns

colleges['named_university'] = colleges['name'].str.contains('University')
colleges.groupby('named_university', as_index=False).agg(
  mean_undergrads=('undergrads', 'mean'),
  max_undergrads=('undergrads', 'max')
)
named_university mean_undergrads max_undergrads
0 False 1578.685231 58978.0
1 True 5525.253579 98630.0

“Broadcast” operations

Apply an operation to an entire column at a time.

colleges['undergrads'] / 1000
0        5.271
1       13.328
2        0.365
3        7.785
4        3.750
         ...  
6689       NaN
6690       NaN
6691       NaN
6692       NaN
6693       NaN
Name: undergrads, Length: 6694, dtype: float64
pd.isna(colleges['undergrads'])
0       False
1       False
2       False
3       False
4       False
        ...  
6689     True
6690     True
6691     True
6692     True
6693     True
Name: undergrads, Length: 6694, dtype: bool

Pairs of columns

When broadcasting with a pair of columns, the operation is applied element-wise.

colleges['city'].str.cat(colleges['state'], sep=', ')
0                Normal, AL
1            Birmingham, AL
2            Montgomery, AL
3            Huntsville, AL
4            Montgomery, AL
               ...         
6689            Eastman, GA
6690           Florence, KY
6691    Fort Lauderdale, FL
6692           Cerritos, CA
6693           San Jose, CA
Name: city, Length: 6694, dtype: object

Deriving new columns, advanced

def classify_name(school_name):
    school_name = school_name.lower()
    if 'university' in school_name:
        return 'university'
    elif 'college' in school_name:
        return 'college'
    else:
        return 'other'
colleges['name_type'] = colleges['name'].apply(classify_name)
colleges.groupby('name_type', as_index=False).agg(
  mean_undergrads=('undergrads', 'mean'),
  max_undergrads=('undergrads', 'max')
)
name_type mean_undergrads max_undergrads
0 college 2668.155462 58978.0
1 other 316.914355 18195.0
2 university 5525.253579 98630.0

Note: this is slower because it runs unoptimized Python code for each row. First, make it correct, then make it fast (e.g., np.select).

Wrangling example

Data: Hotel bookings

  • Data from two hotels: one resort and one city hotel
  • Observations: Each row represents a hotel booking
  • Goal for original data collection: Development of prediction models to classify a hotel booking’s likelihood to be cancelled (Antonia et al., 2019)

Data download

hotels_url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv"
import pathlib, requests

cache_filename = pathlib.Path('data/hotels.csv')
if not cache_filename.exists():
    cache_filename.parent.mkdir(parents=True, exist_ok=True)
    cache_filename.write_bytes(requests.get(hotels_url).content)
hotels = pd.read_csv(cache_filename)
hotels.head()
hotel is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults ... deposit_type agent company days_in_waiting_list customer_type adr required_car_parking_spaces total_of_special_requests reservation_status reservation_status_date
0 Resort Hotel 0 342 2015 July 27 1 0 0 2 ... No Deposit NaN NaN 0 Transient 0.0 0 0 Check-Out 2015-07-01
1 Resort Hotel 0 737 2015 July 27 1 0 0 2 ... No Deposit NaN NaN 0 Transient 0.0 0 0 Check-Out 2015-07-01
2 Resort Hotel 0 7 2015 July 27 1 0 1 1 ... No Deposit NaN NaN 0 Transient 75.0 0 0 Check-Out 2015-07-02
3 Resort Hotel 0 13 2015 July 27 1 0 1 1 ... No Deposit 304.0 NaN 0 Transient 75.0 0 0 Check-Out 2015-07-02
4 Resort Hotel 0 14 2015 July 27 1 0 2 2 ... No Deposit 240.0 NaN 0 Transient 98.0 0 1 Check-Out 2015-07-03

5 rows × 32 columns

Template code

---
title: "A title"
author: "Your name"
format:
  html:
    embed-resources: true
    code-tools: true
    code-fold: true
---

```{python}
#| echo: true
import numpy as np
import pandas as pd
import plotly.express as px
pd.options.plotting.backend = "plotly"
```

```{python}
#| echo: false
# Hack to make plotly work in RStudio
if 'r' in globals() and r['.Platform$GUI'] == "RStudio" and r['suppressMessages(requireNamespace("htmltools"))']:
  r[".GlobalEnv$to_html <- function(x) { print(htmltools::HTML(x)) }"] and None
  def show_plot(p): r.to_html(p._repr_html_())
else:
  def show_plot(p): return p
```

```{python}
hotels_url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv"
import pathlib, requests

cache_filename = pathlib.Path('data/hotels.csv')
if not cache_filename.exists():
    cache_filename.parent.mkdir(parents=True, exist_ok=True)
    cache_filename.write_bytes(requests.get(hotels_url).content)
```

```{python}
hotels = pd.read_csv(cache_filename)
hotels.head()
```

Suggested questions

Try each of these yourself, before looking at a possible solution in the following slides.

  • What are some of the variables and their data types?
  • How many different hotels are there? Which hotel has more bookings?
  • Which hotel has longer stays?
  • Which market segment is most common?
  • How much in advance (lead_time) are bookings made?
  • Which market segments book most in advance?
  • How long are typical stays for each market segment?

What are some of the variables and their data types?

hotels.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   hotel                           119390 non-null  object 
 1   is_canceled                     119390 non-null  int64  
 2   lead_time                       119390 non-null  int64  
 3   arrival_date_year               119390 non-null  int64  
 4   arrival_date_month              119390 non-null  object 
 5   arrival_date_week_number        119390 non-null  int64  
 6   arrival_date_day_of_month       119390 non-null  int64  
 7   stays_in_weekend_nights         119390 non-null  int64  
 8   stays_in_week_nights            119390 non-null  int64  
 9   adults                          119390 non-null  int64  
 10  children                        119386 non-null  float64
 11  babies                          119390 non-null  int64  
 12  meal                            119390 non-null  object 
 13  country                         118902 non-null  object 
 14  market_segment                  119390 non-null  object 
 15  distribution_channel            119390 non-null  object 
 16  is_repeated_guest               119390 non-null  int64  
 17  previous_cancellations          119390 non-null  int64  
 18  previous_bookings_not_canceled  119390 non-null  int64  
 19  reserved_room_type              119390 non-null  object 
 20  assigned_room_type              119390 non-null  object 
 21  booking_changes                 119390 non-null  int64  
 22  deposit_type                    119390 non-null  object 
 23  agent                           103050 non-null  float64
 24  company                         6797 non-null    float64
 25  days_in_waiting_list            119390 non-null  int64  
 26  customer_type                   119390 non-null  object 
 27  adr                             119390 non-null  float64
 28  required_car_parking_spaces     119390 non-null  int64  
 29  total_of_special_requests       119390 non-null  int64  
 30  reservation_status              119390 non-null  object 
 31  reservation_status_date         119390 non-null  object 
dtypes: float64(4), int64(16), object(12)
memory usage: 29.1+ MB
hotels.describe()
is_canceled lead_time arrival_date_year arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults children babies is_repeated_guest previous_cancellations previous_bookings_not_canceled booking_changes agent company days_in_waiting_list adr required_car_parking_spaces total_of_special_requests
count 119390.000000 119390.000000 119390.000000 119390.000000 119390.000000 119390.000000 119390.000000 119390.000000 119386.000000 119390.000000 119390.000000 119390.000000 119390.000000 119390.000000 103050.000000 6797.000000 119390.000000 119390.000000 119390.000000 119390.000000
mean 0.370416 104.011416 2016.156554 27.165173 15.798241 0.927599 2.500302 1.856403 0.103890 0.007949 0.031912 0.087118 0.137097 0.221124 86.693382 189.266735 2.321149 101.831122 0.062518 0.571363
std 0.482918 106.863097 0.707476 13.605138 8.780829 0.998613 1.908286 0.579261 0.398561 0.097436 0.175767 0.844336 1.497437 0.652306 110.774548 131.655015 17.594721 50.535790 0.245291 0.792798
min 0.000000 0.000000 2015.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 6.000000 0.000000 -6.380000 0.000000 0.000000
25% 0.000000 18.000000 2016.000000 16.000000 8.000000 0.000000 1.000000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 9.000000 62.000000 0.000000 69.290000 0.000000 0.000000
50% 0.000000 69.000000 2016.000000 28.000000 16.000000 1.000000 2.000000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 14.000000 179.000000 0.000000 94.575000 0.000000 0.000000
75% 1.000000 160.000000 2017.000000 38.000000 23.000000 2.000000 3.000000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 229.000000 270.000000 0.000000 126.000000 0.000000 1.000000
max 1.000000 737.000000 2017.000000 53.000000 31.000000 19.000000 50.000000 55.000000 10.000000 10.000000 1.000000 26.000000 72.000000 21.000000 535.000000 543.000000 391.000000 5400.000000 8.000000 5.000000

How many different hotels are there? Which hotel has more bookings?

Approach: group by hotel, compute size of each group.

hotels.groupby('hotel', as_index=False).size()
hotel size
0 City Hotel 79330
1 Resort Hotel 40060
# alternative approach
hotels.groupby('hotel', as_index=False).agg(size=('hotel', 'count'))
hotel size
0 City Hotel 79330
1 Resort Hotel 40060

Which hotel has longer stays?

Approach:

  • Add stays_in_week_nights and stays_in_weekend_nights to get total length of stay.
  • Group by hotel, compute mean, median, and max.
hotels['total_stay'] = hotels['stays_in_week_nights'] + hotels['stays_in_weekend_nights']
hotels.groupby('hotel').agg(
  mean_stay=('total_stay', 'mean'),
  median_stay=('total_stay', 'median'),
  max_stay=('total_stay', 'max')
)
mean_stay median_stay max_stay
hotel
City Hotel 2.978142 3.0 57
Resort Hotel 4.318547 3.0 69

Which market segment is most common?

Approach: group by market segment, compute size of each group, sort by size.

(
  hotels.groupby('market_segment', as_index=False)
  .size()
  .sort_values('size', ascending=False)
)
market_segment size
6 Online TA 56477
5 Offline TA/TO 24219
4 Groups 19811
3 Direct 12606
2 Corporate 5295
1 Complementary 743
0 Aviation 237
7 Undefined 2

How much in advance (lead_time) are bookings made?

Approach: compute mean and max of lead_time.

# quick way
hotels['lead_time'].describe()
count    119390.000000
mean        104.011416
std         106.863097
min           0.000000
25%          18.000000
50%          69.000000
75%         160.000000
max         737.000000
Name: lead_time, dtype: float64
# more explicit way
print("Lead time: mean={:.1f} days, max={:.1f} days".format(
  hotels['lead_time'].mean(),
  hotels['lead_time'].max())
)
Lead time: mean=104.0 days, max=737.0 days

Which market segments book most in advance?

Approach: group by market segment, compute mean and max of lead_time.

hotels.groupby('market_segment', as_index=False).agg(
  mean_lead_time=('lead_time', 'mean'),
  max_lead_time=('lead_time', 'max')
)
market_segment mean_lead_time max_lead_time
0 Aviation 4.443038 23
1 Complementary 13.286676 386
2 Corporate 22.125590 343
3 Direct 49.859115 737
4 Groups 186.973096 629
5 Offline TA/TO 135.004459 532
6 Online TA 82.998725 403
7 Undefined 1.500000 2

How long are typical stays for each market segment?

Approach: compute total stay length, group by market segment, compute mean and max, sort by mean.

(
  hotels
  .assign(total_stay=hotels['stays_in_week_nights'] + hotels['stays_in_weekend_nights'])
  .groupby('market_segment', as_index=False)
  .agg(
    mean_stay=('total_stay', 'mean'),
    max_stay=('total_stay', 'max')
  )
  .sort_values('mean_stay', ascending=False)
)
market_segment mean_stay max_stay
5 Offline TA/TO 3.903877 56
0 Aviation 3.607595 16
6 Online TA 3.573986 57
3 Direct 3.205775 69
4 Groups 2.992529 38
2 Corporate 2.092918 49
1 Complementary 1.647376 30
7 Undefined 1.500000 2