Slides 5.2: Wrangling Example

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.io as pio
pd.options.plotting.backend = "plotly"
pio.templates.default = "plotly_white"

Review: pandas and the grammar of data

colleges = pd.read_csv("https://calvin-data-science.github.io/data202/data/college_scorecard.csv")
colleges.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6694 entries, 0 to 6693
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   name         6694 non-null   object 
 1   city         6694 non-null   object 
 2   state        6694 non-null   object 
 3   online_only  6189 non-null   float64
 4   undergrads   5902 non-null   float64
dtypes: float64(2), object(3)
memory usage: 261.6+ KB

Grammar of Data Wrangling

Generally useful conceptual operations, common to many tools (pandas, SQL, R, …).

select columns
filter rows
sort rows
group rows and summarize groups
derive new columns

next week:

join tables
reshape tables

select columns

colleges[['name', 'undergrads']]

	name	undergrads
0	Alabama A & M University	5271.0
1	University of Alabama at Birmingham	13328.0
2	Amridge University	365.0
3	University of Alabama in Huntsville	7785.0
4	Alabama State University	3750.0
...	...	...
6689	Georgia Military College - Eastman	NaN
6690	American College of Barbering - Florence	NaN
6691	HCI College - Fort Lauderdale Campus	NaN
6692	ABC Adult School - Cabrillo Lane	NaN
6693	Urban Barber College - San Jose	NaN

6694 rows × 2 columns

Documentation on indexing

Getting a single column as a Series:

colleges['name']

0                       Alabama A & M University
1            University of Alabama at Birmingham
2                             Amridge University
3            University of Alabama in Huntsville
4                       Alabama State University
                          ...                   
6689          Georgia Military College - Eastman
6690    American College of Barbering - Florence
6691        HCI College - Fort Lauderdale Campus
6692            ABC Adult School - Cabrillo Lane
6693             Urban Barber College - San Jose
Name: name, Length: 6694, dtype: object

Remember: column names are strings (not colleges[name])!

select rows (`query`)

colleges.query("state == 'AK'")

	name	city	state	undergrads
56	University of Alaska Anchorage	Anchorage	AK	9257.0
57	Alaska Bible College	Palmer	AK	30.0
58	University of Alaska Fairbanks	Fairbanks	AK	4410.0
59	University of Alaska Southeast	Juneau	AK	1156.0
60	Alaska Pacific University	Anchorage	AK	318.0
61	Alaska Vocational Technical Center	Seward	AK	631.0
63	Alaska Career College	Anchorage	AK	361.0
4433	Ilisagvik College	Barrow	AK	132.0
4589	Alaska Christian College	Soldotna	AK	95.0

Documentation: pandas.DataFrame.query

top and bottom

colleges.head()

	name	city	state	online_only	undergrads
0	Alabama A & M University	Normal	AL	0.0	5271.0
1	University of Alabama at Birmingham	Birmingham	AL	0.0	13328.0
2	Amridge University	Montgomery	AL	1.0	365.0
3	University of Alabama in Huntsville	Huntsville	AL	0.0	7785.0
4	Alabama State University	Montgomery	AL	0.0	3750.0

colleges.tail()

	name	city	state	online_only	undergrads
6689	Georgia Military College - Eastman	Eastman	GA	NaN	NaN
6690	American College of Barbering - Florence	Florence	KY	NaN	NaN
6691	HCI College - Fort Lauderdale Campus	Fort Lauderdale	FL	NaN	NaN
6692	ABC Adult School - Cabrillo Lane	Cerritos	CA	NaN	NaN
6693	Urban Barber College - San Jose	San Jose	CA	NaN	NaN

arrange rows (`sort_values`)

colleges.sort_values("undergrads").head()

	name	city	state
6092	National American University-Killeen	Killeen	TX
5542	Ukiah Adult School	Ukiah	CA
2912	Pennsylvania State University-Penn State Brand...	Media	PA
2919	Pennsylvania State University-Penn State Abington	Abington	PA
2903	Pennsylvania State University-Penn State New K...	New Kensington	PA

colleges.sort_values("undergrads", ascending=False).head()

	name	city	state	online_only	undergrads
4412	Western Governors University	Salt Lake City	UT	1.0	98630.0
1928	Southern New Hampshire University	Manchester	NH	0.0	90196.0
2916	Pennsylvania State University-Main Campus	University Park	PA	0.0	74630.0
5700	University of Phoenix-Arizona	Phoenix	AZ	0.0	68833.0
1040	Ivy Tech Community College	Indianapolis	IN	0.0	58978.0

Documentation: pandas.DataFrame.sort_values

counting

colleges.groupby("state", as_index=False).size()

	state	size
0	AK	9
1	AL	89
2	AR	92
3	AS	1
4	AZ	114
5	CA	693
6	CO	101
7	CT	79
8	DC	25
9	DE	21
10	FL	378
11	FM	1
12	GA	175
13	GU	3
14	HI	21
15	IA	80
16	ID	38
17	IL	258
18	IN	139
19	KS	80
20	KY	90
21	LA	121
22	MA	156
23	MD	85
24	ME	38
25	MH	1
26	MI	192
27	MN	112
28	MO	158
29	MP	1
30	MS	59
31	MT	32
32	NC	179
33	ND	26
34	NE	42
35	NH	37
36	NJ	168
37	NM	49
38	NV	38
39	NY	455
40	OH	289
41	OK	108
42	OR	78
43	PA	353
44	PR	144
45	PW	1
46	RI	23
47	SC	100
48	SD	28
49	TN	155
50	TX	433
51	UT	71
52	VA	168
53	VI	2
54	VT	22
55	WA	105
56	WI	95
57	WV	73
58	WY	10

grouping / summarizing (general)

colleges.groupby("state", as_index=False).agg(
  count=('undergrads', 'count'),
  mean_undergrads=('undergrads', 'mean'),
  max_undergrads=('undergrads', 'max')
)

	state	count	mean_undergrads	max_undergrads
0	AK	9	1821.111111	9257.0
1	AL	79	2960.405063	31900.0
2	AR	83	1425.614458	22766.0
3	AS	1	992.000000	992.0
4	AZ	99	4587.696970	68833.0
5	CA	576	3918.288194	35087.0
6	CO	89	3037.471910	30630.0
7	CT	69	2299.072464	18585.0
8	DC	17	3054.411765	12031.0
9	DE	18	2611.722222	19047.0
10	FL	321	2767.395639	58940.0
11	FM	1	1900.000000	1900.0
12	GA	140	3014.850000	33439.0
13	GU	3	1579.666667	3169.0
14	HI	20	2201.750000	12255.0
15	IA	77	1919.402597	27930.0
16	ID	37	2492.972973	38672.0
17	IL	223	2191.390135	33080.0
18	IN	103	3068.320388	58978.0
19	KS	74	1990.162162	19003.0
20	KY	83	2076.554217	21864.0
21	LA	111	1718.945946	23706.0
22	MA	142	2339.866197	23907.0
23	MD	78	3421.474359	42888.0
24	ME	35	1426.028571	8832.0
25	MH	1	985.000000	985.0
26	MI	157	2650.929936	38950.0
27	MN	91	2591.340659	31367.0
28	MO	135	1816.866667	21933.0
29	MP	1	1228.000000	1228.0
30	MS	53	2480.283019	18388.0
31	MT	30	1269.533333	14402.0
32	NC	161	2693.850932	24239.0
33	ND	26	1453.076923	10579.0
34	NE	42	2141.190476	20252.0
35	NH	34	3956.147059	90196.0
36	NJ	159	2135.974843	35760.0
37	NM	45	1882.133333	18120.0
38	NV	36	2726.250000	30090.0
39	NY	402	2275.383085	26613.0
40	OH	267	1842.614232	45657.0
41	OK	100	1641.290000	21329.0
42	OR	76	2285.657895	25339.0
43	PA	316	1695.389241	74630.0
44	PR	131	1279.732824	12508.0
45	PW	1	441.000000	441.0
46	RI	21	3225.619048	13826.0
47	SC	93	2114.946237	27066.0
48	SD	26	1438.384615	9034.0
49	TN	142	1812.647887	23152.0
50	TX	382	3370.821990	55990.0
51	UT	66	4410.621212	98630.0
52	VA	138	2867.739130	46438.0
53	VI	1	1744.000000	1744.0
54	VT	18	1755.833333	10700.0
55	WA	96	2678.885417	30839.0
56	WI	86	2888.674419	31185.0
57	WV	71	1457.352113	35656.0
58	WY	10	2117.500000	9646.0

random sampling

Shuffle rows:

colleges.sample(frac=1.0).head()

	name	city	state	online_only	undergrads
5721	Paul Mitchell the School-Toledo	Sylvania	OH	0.0	143.0
5725	Beau Monde Academy of Cosmetology and Barbering	Portland	OR	0.0	115.0
5044	Oliver Finley Academy of Cosmetology	Boise	ID	0.0	254.0
2160	Hilbert College	Hamburg	NY	0.0	711.0
6250	Porter and Chester Institute of Enfield	Enfield	CT	NaN	NaN

Select random rows:

colleges.sample(n=5)

	name	city	state	undergrads
4094	ASM Beauty World Academy	Hollywood	FL	90.0
1179	Hays Academy of Hair Design	Hays	KS	70.0
104	Universal Technical Institute of Arizona Inc	Avondale	AZ	1873.0
3328	South Texas Barber College Inc	Corpus Christi	TX	50.0
5573	Empire Beauty School-Rochester	Rochester	NY	73.0

Deriving new columns

colleges['named_university'] = colleges['name'].str.contains('University')
colleges.groupby('named_university', as_index=False).agg(
  mean_undergrads=('undergrads', 'mean'),
  max_undergrads=('undergrads', 'max')
)

	named_university	mean_undergrads	max_undergrads
0	False	1578.685231	58978.0
1	True	5525.253579	98630.0

“Broadcast” operations

Apply an operation to an entire column at a time.

colleges['undergrads'] / 1000

0        5.271
1       13.328
2        0.365
3        7.785
4        3.750
         ...  
6689       NaN
6690       NaN
6691       NaN
6692       NaN
6693       NaN
Name: undergrads, Length: 6694, dtype: float64

pd.isna(colleges['undergrads'])

0       False
1       False
2       False
3       False
4       False
        ...  
6689     True
6690     True
6691     True
6692     True
6693     True
Name: undergrads, Length: 6694, dtype: bool

Pairs of columns

When broadcasting with a pair of columns, the operation is applied element-wise.

colleges['city'].str.cat(colleges['state'], sep=', ')

0                Normal, AL
1            Birmingham, AL
2            Montgomery, AL
3            Huntsville, AL
4            Montgomery, AL
               ...         
6689            Eastman, GA
6690           Florence, KY
6691    Fort Lauderdale, FL
6692           Cerritos, CA
6693           San Jose, CA
Name: city, Length: 6694, dtype: object

Deriving new columns, advanced

def classify_name(school_name):
    school_name = school_name.lower()
    if 'university' in school_name:
        return 'university'
    elif 'college' in school_name:
        return 'college'
    else:
        return 'other'
colleges['name_type'] = colleges['name'].apply(classify_name)
colleges.groupby('name_type', as_index=False).agg(
  mean_undergrads=('undergrads', 'mean'),
  max_undergrads=('undergrads', 'max')
)

	name_type	mean_undergrads	max_undergrads
0	college	2668.155462	58978.0
1	other	316.914355	18195.0
2	university	5525.253579	98630.0

Note: this is slower because it runs unoptimized Python code for each row. First, make it correct, then make it fast (e.g., np.select).

Wrangling example

Data: Hotel bookings

Data from two hotels: one resort and one city hotel
Observations: Each row represents a hotel booking
Goal for original data collection: Development of prediction models to classify a hotel booking’s likelihood to be cancelled (Antonia et al., 2019)

Data download

hotels_url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv"
import pathlib, requests

cache_filename = pathlib.Path('data/hotels.csv')
if not cache_filename.exists():
    cache_filename.parent.mkdir(parents=True, exist_ok=True)
    cache_filename.write_bytes(requests.get(hotels_url).content)

hotels = pd.read_csv(cache_filename)
hotels.head()

	hotel	lead_time	arrival_date_year	arrival_date_month	arrival_date_week_number	arrival_date_day_of_month	stays_in_week_nights	adults	...	deposit_type	agent	company	customer_type	adr	total_of_special_requests	reservation_status	reservation_status_date
0	Resort Hotel	342	2015	July	27	1	0	2	...	No Deposit	NaN	NaN	Transient	0.0	0	Check-Out	2015-07-01
1	Resort Hotel	737	2015	July	27	1	0	2	...	No Deposit	NaN	NaN	Transient	0.0	0	Check-Out	2015-07-01
2	Resort Hotel	7	2015	July	27	1	1	1	...	No Deposit	NaN	NaN	Transient	75.0	0	Check-Out	2015-07-02
3	Resort Hotel	13	2015	July	27	1	1	1	...	No Deposit	304.0	NaN	Transient	75.0	0	Check-Out	2015-07-02
4	Resort Hotel	14	2015	July	27	1	2	2	...	No Deposit	240.0	NaN	Transient	98.0	1	Check-Out	2015-07-03

5 rows × 32 columns

Template code

---
title: "A title"
author: "Your name"
format:
  html:
    embed-resources: true
    code-tools: true
    code-fold: true
---

```{python}
#| echo: true
import numpy as np
import pandas as pd
import plotly.express as px
pd.options.plotting.backend = "plotly"
```

```{python}
#| echo: false
# Hack to make plotly work in RStudio
if 'r' in globals() and r['.Platform$GUI'] == "RStudio" and r['suppressMessages(requireNamespace("htmltools"))']:
  r[".GlobalEnv$to_html <- function(x) { print(htmltools::HTML(x)) }"] and None
  def show_plot(p): r.to_html(p._repr_html_())
else:
  def show_plot(p): return p
```

```{python}
hotels_url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv"
import pathlib, requests

cache_filename = pathlib.Path('data/hotels.csv')
if not cache_filename.exists():
    cache_filename.parent.mkdir(parents=True, exist_ok=True)
    cache_filename.write_bytes(requests.get(hotels_url).content)
```

```{python}
hotels = pd.read_csv(cache_filename)
hotels.head()
```

What are some of the variables and their data types?

hotels.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   hotel                           119390 non-null  object 
 1   is_canceled                     119390 non-null  int64  
 2   lead_time                       119390 non-null  int64  
 3   arrival_date_year               119390 non-null  int64  
 4   arrival_date_month              119390 non-null  object 
 5   arrival_date_week_number        119390 non-null  int64  
 6   arrival_date_day_of_month       119390 non-null  int64  
 7   stays_in_weekend_nights         119390 non-null  int64  
 8   stays_in_week_nights            119390 non-null  int64  
 9   adults                          119390 non-null  int64  
 10  children                        119386 non-null  float64
 11  babies                          119390 non-null  int64  
 12  meal                            119390 non-null  object 
 13  country                         118902 non-null  object 
 14  market_segment                  119390 non-null  object 
 15  distribution_channel            119390 non-null  object 
 16  is_repeated_guest               119390 non-null  int64  
 17  previous_cancellations          119390 non-null  int64  
 18  previous_bookings_not_canceled  119390 non-null  int64  
 19  reserved_room_type              119390 non-null  object 
 20  assigned_room_type              119390 non-null  object 
 21  booking_changes                 119390 non-null  int64  
 22  deposit_type                    119390 non-null  object 
 23  agent                           103050 non-null  float64
 24  company                         6797 non-null    float64
 25  days_in_waiting_list            119390 non-null  int64  
 26  customer_type                   119390 non-null  object 
 27  adr                             119390 non-null  float64
 28  required_car_parking_spaces     119390 non-null  int64  
 29  total_of_special_requests       119390 non-null  int64  
 30  reservation_status              119390 non-null  object 
 31  reservation_status_date         119390 non-null  object 
dtypes: float64(4), int64(16), object(12)
memory usage: 29.1+ MB

hotels.describe()

	is_canceled	lead_time	arrival_date_year	arrival_date_week_number	arrival_date_day_of_month	stays_in_weekend_nights	stays_in_week_nights	adults	children	babies	is_repeated_guest	previous_cancellations	previous_bookings_not_canceled	booking_changes	agent	company	days_in_waiting_list	adr	required_car_parking_spaces	total_of_special_requests
count	119390.000000	119390.000000	119390.000000	119390.000000	119390.000000	119390.000000	119390.000000	119390.000000	119386.000000	119390.000000	119390.000000	119390.000000	119390.000000	119390.000000	103050.000000	6797.000000	119390.000000	119390.000000	119390.000000	119390.000000
mean	0.370416	104.011416	2016.156554	27.165173	15.798241	0.927599	2.500302	1.856403	0.103890	0.007949	0.031912	0.087118	0.137097	0.221124	86.693382	189.266735	2.321149	101.831122	0.062518	0.571363
std	0.482918	106.863097	0.707476	13.605138	8.780829	0.998613	1.908286	0.579261	0.398561	0.097436	0.175767	0.844336	1.497437	0.652306	110.774548	131.655015	17.594721	50.535790	0.245291	0.792798
min	0.000000	0.000000	2015.000000	1.000000	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000	6.000000	0.000000	-6.380000	0.000000	0.000000
25%	0.000000	18.000000	2016.000000	16.000000	8.000000	0.000000	1.000000	2.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	9.000000	62.000000	0.000000	69.290000	0.000000	0.000000
50%	0.000000	69.000000	2016.000000	28.000000	16.000000	1.000000	2.000000	2.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	14.000000	179.000000	0.000000	94.575000	0.000000	0.000000
75%	1.000000	160.000000	2017.000000	38.000000	23.000000	2.000000	3.000000	2.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	229.000000	270.000000	0.000000	126.000000	0.000000	1.000000
max	1.000000	737.000000	2017.000000	53.000000	31.000000	19.000000	50.000000	55.000000	10.000000	10.000000	1.000000	26.000000	72.000000	21.000000	535.000000	543.000000	391.000000	5400.000000	8.000000	5.000000

How many different hotels are there? Which hotel has more bookings?

Approach: group by hotel, compute size of each group.

hotels.groupby('hotel', as_index=False).size()

	hotel	size
0	City Hotel	79330
1	Resort Hotel	40060

# alternative approach
hotels.groupby('hotel', as_index=False).agg(size=('hotel', 'count'))

	hotel	size
0	City Hotel	79330
1	Resort Hotel	40060

Which hotel has longer stays?

Approach:

Add stays_in_week_nights and stays_in_weekend_nights to get total length of stay.
Group by hotel, compute mean, median, and max.

hotels['total_stay'] = hotels['stays_in_week_nights'] + hotels['stays_in_weekend_nights']
hotels.groupby('hotel').agg(
  mean_stay=('total_stay', 'mean'),
  median_stay=('total_stay', 'median'),
  max_stay=('total_stay', 'max')
)

	mean_stay	median_stay	max_stay
hotel
City Hotel	2.978142	3.0	57
Resort Hotel	4.318547	3.0	69

Which market segment is most common?

Approach: group by market segment, compute size of each group, sort by size.

(
  hotels.groupby('market_segment', as_index=False)
  .size()
  .sort_values('size', ascending=False)
)

	market_segment	size
6	Online TA	56477
5	Offline TA/TO	24219
4	Groups	19811
3	Direct	12606
2	Corporate	5295
1	Complementary	743
0	Aviation	237
7	Undefined	2

How much in advance (`lead_time`) are bookings made?

Approach: compute mean and max of lead_time.

# quick way
hotels['lead_time'].describe()

count    119390.000000
mean        104.011416
std         106.863097
min           0.000000
25%          18.000000
50%          69.000000
75%         160.000000
max         737.000000
Name: lead_time, dtype: float64

# more explicit way
print("Lead time: mean={:.1f} days, max={:.1f} days".format(
  hotels['lead_time'].mean(),
  hotels['lead_time'].max())
)

Lead time: mean=104.0 days, max=737.0 days

Which market segments book most in advance?

Approach: group by market segment, compute mean and max of lead_time.

hotels.groupby('market_segment', as_index=False).agg(
  mean_lead_time=('lead_time', 'mean'),
  max_lead_time=('lead_time', 'max')
)

	market_segment	mean_lead_time	max_lead_time
0	Aviation	4.443038	23
1	Complementary	13.286676	386
2	Corporate	22.125590	343
3	Direct	49.859115	737
4	Groups	186.973096	629
5	Offline TA/TO	135.004459	532
6	Online TA	82.998725	403
7	Undefined	1.500000	2

How long are typical stays for each market segment?

Approach: compute total stay length, group by market segment, compute mean and max, sort by mean.

(
  hotels
  .assign(total_stay=hotels['stays_in_week_nights'] + hotels['stays_in_weekend_nights'])
  .groupby('market_segment', as_index=False)
  .agg(
    mean_stay=('total_stay', 'mean'),
    max_stay=('total_stay', 'max')
  )
  .sort_values('mean_stay', ascending=False)
)

	market_segment	mean_stay	max_stay
5	Offline TA/TO	3.903877	56
0	Aviation	3.607595	16
6	Online TA	3.573986	57
3	Direct	3.205775	69
4	Groups	2.992529	38
2	Corporate	2.092918	49
1	Complementary	1.647376	30
7	Undefined	1.500000	2

	state	size
0	AK	9
1	AL	89
2	AR	92
3	AS	1
4	AZ	114
5	CA	693
6	CO	101
7	CT	79
8	DC	25
9	DE	21
10	FL	378
11	FM	1
12	GA	175
13	GU	3
14	HI	21
15	IA	80
16	ID	38
17	IL	258
18	IN	139
19	KS	80
20	KY	90
21	LA	121
22	MA	156
23	MD	85
24	ME	38
25	MH	1
26	MI	192
27	MN	112
28	MO	158
29	MP	1
30	MS	59
31	MT	32
32	NC	179
33	ND	26
34	NE	42
35	NH	37
36	NJ	168
37	NM	49
38	NV	38
39	NY	455
40	OH	289
41	OK	108
42	OR	78
43	PA	353
44	PR	144
45	PW	1
46	RI	23
47	SC	100
48	SD	28
49	TN	155
50	TX	433
51	UT	71
52	VA	168
53	VI	2
54	VT	22
55	WA	105
56	WI	95
57	WV	73
58	WY	10

	state	size
0	AK	9
1	AL	89
2	AR	92
3	AS	1
4	AZ	114
5	CA	693
6	CO	101
7	CT	79
8	DC	25
9	DE	21
10	FL	378
11	FM	1
12	GA	175
13	GU	3
14	HI	21
15	IA	80
16	ID	38
17	IL	258
18	IN	139
19	KS	80
20	KY	90
21	LA	121
22	MA	156
23	MD	85
24	ME	38
25	MH	1
26	MI	192
27	MN	112
28	MO	158
29	MP	1
30	MS	59
31	MT	32
32	NC	179
33	ND	26
34	NE	42
35	NH	37
36	NJ	168
37	NM	49
38	NV	38
39	NY	455
40	OH	289
41	OK	108
42	OR	78
43	PA	353
44	PR	144
45	PW	1
46	RI	23
47	SC	100
48	SD	28
49	TN	155
50	TX	433
51	UT	71
52	VA	168
53	VI	2
54	VT	22
55	WA	105
56	WI	95
57	WV	73
58	WY	10