colleges = pd.read_csv("https://calvin-data-science.github.io/data202/data/college_scorecard.csv")
colleges.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6694 entries, 0 to 6693
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 6694 non-null object
1 city 6694 non-null object
2 state 6694 non-null object
3 online_only 6189 non-null float64
4 undergrads 5902 non-null float64
dtypes: float64(2), object(3)
memory usage: 261.6+ KB
Generally useful conceptual operations, common to many tools (pandas, SQL, R, …).
next week:
name | undergrads | |
---|---|---|
0 | Alabama A & M University | 5271.0 |
1 | University of Alabama at Birmingham | 13328.0 |
2 | Amridge University | 365.0 |
3 | University of Alabama in Huntsville | 7785.0 |
4 | Alabama State University | 3750.0 |
... | ... | ... |
6689 | Georgia Military College - Eastman | NaN |
6690 | American College of Barbering - Florence | NaN |
6691 | HCI College - Fort Lauderdale Campus | NaN |
6692 | ABC Adult School - Cabrillo Lane | NaN |
6693 | Urban Barber College - San Jose | NaN |
6694 rows × 2 columns
Documentation on indexing
Getting a single column as a Series
:
0 Alabama A & M University
1 University of Alabama at Birmingham
2 Amridge University
3 University of Alabama in Huntsville
4 Alabama State University
...
6689 Georgia Military College - Eastman
6690 American College of Barbering - Florence
6691 HCI College - Fort Lauderdale Campus
6692 ABC Adult School - Cabrillo Lane
6693 Urban Barber College - San Jose
Name: name, Length: 6694, dtype: object
Remember: column names are strings (not colleges[name]
)!
query
)name | city | state | online_only | undergrads | |
---|---|---|---|---|---|
56 | University of Alaska Anchorage | Anchorage | AK | 0.0 | 9257.0 |
57 | Alaska Bible College | Palmer | AK | 0.0 | 30.0 |
58 | University of Alaska Fairbanks | Fairbanks | AK | 0.0 | 4410.0 |
59 | University of Alaska Southeast | Juneau | AK | 0.0 | 1156.0 |
60 | Alaska Pacific University | Anchorage | AK | 0.0 | 318.0 |
61 | Alaska Vocational Technical Center | Seward | AK | 0.0 | 631.0 |
63 | Alaska Career College | Anchorage | AK | 0.0 | 361.0 |
4433 | Ilisagvik College | Barrow | AK | 0.0 | 132.0 |
4589 | Alaska Christian College | Soldotna | AK | 0.0 | 95.0 |
Documentation: pandas.DataFrame.query
name | city | state | online_only | undergrads | |
---|---|---|---|---|---|
0 | Alabama A & M University | Normal | AL | 0.0 | 5271.0 |
1 | University of Alabama at Birmingham | Birmingham | AL | 0.0 | 13328.0 |
2 | Amridge University | Montgomery | AL | 1.0 | 365.0 |
3 | University of Alabama in Huntsville | Huntsville | AL | 0.0 | 7785.0 |
4 | Alabama State University | Montgomery | AL | 0.0 | 3750.0 |
name | city | state | online_only | undergrads | |
---|---|---|---|---|---|
6689 | Georgia Military College - Eastman | Eastman | GA | NaN | NaN |
6690 | American College of Barbering - Florence | Florence | KY | NaN | NaN |
6691 | HCI College - Fort Lauderdale Campus | Fort Lauderdale | FL | NaN | NaN |
6692 | ABC Adult School - Cabrillo Lane | Cerritos | CA | NaN | NaN |
6693 | Urban Barber College - San Jose | San Jose | CA | NaN | NaN |
sort_values
)name | city | state | online_only | undergrads | |
---|---|---|---|---|---|
6092 | National American University-Killeen | Killeen | TX | 0.0 | 0.0 |
5542 | Ukiah Adult School | Ukiah | CA | 0.0 | 0.0 |
2912 | Pennsylvania State University-Penn State Brand... | Media | PA | 0.0 | 0.0 |
2919 | Pennsylvania State University-Penn State Abington | Abington | PA | 0.0 | 0.0 |
2903 | Pennsylvania State University-Penn State New K... | New Kensington | PA | 0.0 | 0.0 |
name | city | state | online_only | undergrads | |
---|---|---|---|---|---|
4412 | Western Governors University | Salt Lake City | UT | 1.0 | 98630.0 |
1928 | Southern New Hampshire University | Manchester | NH | 0.0 | 90196.0 |
2916 | Pennsylvania State University-Main Campus | University Park | PA | 0.0 | 74630.0 |
5700 | University of Phoenix-Arizona | Phoenix | AZ | 0.0 | 68833.0 |
1040 | Ivy Tech Community College | Indianapolis | IN | 0.0 | 58978.0 |
Documentation: pandas.DataFrame.sort_values
state | size | |
---|---|---|
0 | AK | 9 |
1 | AL | 89 |
2 | AR | 92 |
3 | AS | 1 |
4 | AZ | 114 |
5 | CA | 693 |
6 | CO | 101 |
7 | CT | 79 |
8 | DC | 25 |
9 | DE | 21 |
10 | FL | 378 |
11 | FM | 1 |
12 | GA | 175 |
13 | GU | 3 |
14 | HI | 21 |
15 | IA | 80 |
16 | ID | 38 |
17 | IL | 258 |
18 | IN | 139 |
19 | KS | 80 |
20 | KY | 90 |
21 | LA | 121 |
22 | MA | 156 |
23 | MD | 85 |
24 | ME | 38 |
25 | MH | 1 |
26 | MI | 192 |
27 | MN | 112 |
28 | MO | 158 |
29 | MP | 1 |
30 | MS | 59 |
31 | MT | 32 |
32 | NC | 179 |
33 | ND | 26 |
34 | NE | 42 |
35 | NH | 37 |
36 | NJ | 168 |
37 | NM | 49 |
38 | NV | 38 |
39 | NY | 455 |
40 | OH | 289 |
41 | OK | 108 |
42 | OR | 78 |
43 | PA | 353 |
44 | PR | 144 |
45 | PW | 1 |
46 | RI | 23 |
47 | SC | 100 |
48 | SD | 28 |
49 | TN | 155 |
50 | TX | 433 |
51 | UT | 71 |
52 | VA | 168 |
53 | VI | 2 |
54 | VT | 22 |
55 | WA | 105 |
56 | WI | 95 |
57 | WV | 73 |
58 | WY | 10 |
colleges.groupby("state", as_index=False).agg(
count=('undergrads', 'count'),
mean_undergrads=('undergrads', 'mean'),
max_undergrads=('undergrads', 'max')
)
state | count | mean_undergrads | max_undergrads | |
---|---|---|---|---|
0 | AK | 9 | 1821.111111 | 9257.0 |
1 | AL | 79 | 2960.405063 | 31900.0 |
2 | AR | 83 | 1425.614458 | 22766.0 |
3 | AS | 1 | 992.000000 | 992.0 |
4 | AZ | 99 | 4587.696970 | 68833.0 |
5 | CA | 576 | 3918.288194 | 35087.0 |
6 | CO | 89 | 3037.471910 | 30630.0 |
7 | CT | 69 | 2299.072464 | 18585.0 |
8 | DC | 17 | 3054.411765 | 12031.0 |
9 | DE | 18 | 2611.722222 | 19047.0 |
10 | FL | 321 | 2767.395639 | 58940.0 |
11 | FM | 1 | 1900.000000 | 1900.0 |
12 | GA | 140 | 3014.850000 | 33439.0 |
13 | GU | 3 | 1579.666667 | 3169.0 |
14 | HI | 20 | 2201.750000 | 12255.0 |
15 | IA | 77 | 1919.402597 | 27930.0 |
16 | ID | 37 | 2492.972973 | 38672.0 |
17 | IL | 223 | 2191.390135 | 33080.0 |
18 | IN | 103 | 3068.320388 | 58978.0 |
19 | KS | 74 | 1990.162162 | 19003.0 |
20 | KY | 83 | 2076.554217 | 21864.0 |
21 | LA | 111 | 1718.945946 | 23706.0 |
22 | MA | 142 | 2339.866197 | 23907.0 |
23 | MD | 78 | 3421.474359 | 42888.0 |
24 | ME | 35 | 1426.028571 | 8832.0 |
25 | MH | 1 | 985.000000 | 985.0 |
26 | MI | 157 | 2650.929936 | 38950.0 |
27 | MN | 91 | 2591.340659 | 31367.0 |
28 | MO | 135 | 1816.866667 | 21933.0 |
29 | MP | 1 | 1228.000000 | 1228.0 |
30 | MS | 53 | 2480.283019 | 18388.0 |
31 | MT | 30 | 1269.533333 | 14402.0 |
32 | NC | 161 | 2693.850932 | 24239.0 |
33 | ND | 26 | 1453.076923 | 10579.0 |
34 | NE | 42 | 2141.190476 | 20252.0 |
35 | NH | 34 | 3956.147059 | 90196.0 |
36 | NJ | 159 | 2135.974843 | 35760.0 |
37 | NM | 45 | 1882.133333 | 18120.0 |
38 | NV | 36 | 2726.250000 | 30090.0 |
39 | NY | 402 | 2275.383085 | 26613.0 |
40 | OH | 267 | 1842.614232 | 45657.0 |
41 | OK | 100 | 1641.290000 | 21329.0 |
42 | OR | 76 | 2285.657895 | 25339.0 |
43 | PA | 316 | 1695.389241 | 74630.0 |
44 | PR | 131 | 1279.732824 | 12508.0 |
45 | PW | 1 | 441.000000 | 441.0 |
46 | RI | 21 | 3225.619048 | 13826.0 |
47 | SC | 93 | 2114.946237 | 27066.0 |
48 | SD | 26 | 1438.384615 | 9034.0 |
49 | TN | 142 | 1812.647887 | 23152.0 |
50 | TX | 382 | 3370.821990 | 55990.0 |
51 | UT | 66 | 4410.621212 | 98630.0 |
52 | VA | 138 | 2867.739130 | 46438.0 |
53 | VI | 1 | 1744.000000 | 1744.0 |
54 | VT | 18 | 1755.833333 | 10700.0 |
55 | WA | 96 | 2678.885417 | 30839.0 |
56 | WI | 86 | 2888.674419 | 31185.0 |
57 | WV | 71 | 1457.352113 | 35656.0 |
58 | WY | 10 | 2117.500000 | 9646.0 |
Shuffle rows:
name | city | state | online_only | undergrads | |
---|---|---|---|---|---|
5721 | Paul Mitchell the School-Toledo | Sylvania | OH | 0.0 | 143.0 |
5725 | Beau Monde Academy of Cosmetology and Barbering | Portland | OR | 0.0 | 115.0 |
5044 | Oliver Finley Academy of Cosmetology | Boise | ID | 0.0 | 254.0 |
2160 | Hilbert College | Hamburg | NY | 0.0 | 711.0 |
6250 | Porter and Chester Institute of Enfield | Enfield | CT | NaN | NaN |
Select random rows:
name | city | state | online_only | undergrads | |
---|---|---|---|---|---|
4094 | ASM Beauty World Academy | Hollywood | FL | 0.0 | 90.0 |
1179 | Hays Academy of Hair Design | Hays | KS | 0.0 | 70.0 |
104 | Universal Technical Institute of Arizona Inc | Avondale | AZ | 0.0 | 1873.0 |
3328 | South Texas Barber College Inc | Corpus Christi | TX | 0.0 | 50.0 |
5573 | Empire Beauty School-Rochester | Rochester | NY | 0.0 | 73.0 |
colleges['named_university'] = colleges['name'].str.contains('University')
colleges.groupby('named_university', as_index=False).agg(
mean_undergrads=('undergrads', 'mean'),
max_undergrads=('undergrads', 'max')
)
named_university | mean_undergrads | max_undergrads | |
---|---|---|---|
0 | False | 1578.685231 | 58978.0 |
1 | True | 5525.253579 | 98630.0 |
Apply an operation to an entire column at a time.
0 5.271
1 13.328
2 0.365
3 7.785
4 3.750
...
6689 NaN
6690 NaN
6691 NaN
6692 NaN
6693 NaN
Name: undergrads, Length: 6694, dtype: float64
When broadcasting with a pair of columns, the operation is applied element-wise.
def classify_name(school_name):
school_name = school_name.lower()
if 'university' in school_name:
return 'university'
elif 'college' in school_name:
return 'college'
else:
return 'other'
colleges['name_type'] = colleges['name'].apply(classify_name)
colleges.groupby('name_type', as_index=False).agg(
mean_undergrads=('undergrads', 'mean'),
max_undergrads=('undergrads', 'max')
)
name_type | mean_undergrads | max_undergrads | |
---|---|---|---|
0 | college | 2668.155462 | 58978.0 |
1 | other | 316.914355 | 18195.0 |
2 | university | 5525.253579 | 98630.0 |
Note: this is slower because it runs unoptimized Python code for each row. First, make it correct, then make it fast (e.g., np.select
).
hotels_url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv"
import pathlib, requests
cache_filename = pathlib.Path('data/hotels.csv')
if not cache_filename.exists():
cache_filename.parent.mkdir(parents=True, exist_ok=True)
cache_filename.write_bytes(requests.get(hotels_url).content)
hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | NaN | NaN | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | NaN | NaN | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | NaN | NaN | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | 304.0 | NaN | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | ... | No Deposit | 240.0 | NaN | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 |
5 rows × 32 columns
---
title: "A title"
author: "Your name"
format:
html:
embed-resources: true
code-tools: true
code-fold: true
---
```{python}
#| echo: true
import numpy as np
import pandas as pd
import plotly.express as px
pd.options.plotting.backend = "plotly"
```
```{python}
#| echo: false
# Hack to make plotly work in RStudio
if 'r' in globals() and r['.Platform$GUI'] == "RStudio" and r['suppressMessages(requireNamespace("htmltools"))']:
r[".GlobalEnv$to_html <- function(x) { print(htmltools::HTML(x)) }"] and None
def show_plot(p): r.to_html(p._repr_html_())
else:
def show_plot(p): return p
```
```{python}
hotels_url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv"
import pathlib, requests
cache_filename = pathlib.Path('data/hotels.csv')
if not cache_filename.exists():
cache_filename.parent.mkdir(parents=True, exist_ok=True)
cache_filename.write_bytes(requests.get(hotels_url).content)
```
```{python}
hotels = pd.read_csv(cache_filename)
hotels.head()
```
Try each of these yourself, before looking at a possible solution in the following slides.
lead_time
) are bookings made?<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 hotel 119390 non-null object
1 is_canceled 119390 non-null int64
2 lead_time 119390 non-null int64
3 arrival_date_year 119390 non-null int64
4 arrival_date_month 119390 non-null object
5 arrival_date_week_number 119390 non-null int64
6 arrival_date_day_of_month 119390 non-null int64
7 stays_in_weekend_nights 119390 non-null int64
8 stays_in_week_nights 119390 non-null int64
9 adults 119390 non-null int64
10 children 119386 non-null float64
11 babies 119390 non-null int64
12 meal 119390 non-null object
13 country 118902 non-null object
14 market_segment 119390 non-null object
15 distribution_channel 119390 non-null object
16 is_repeated_guest 119390 non-null int64
17 previous_cancellations 119390 non-null int64
18 previous_bookings_not_canceled 119390 non-null int64
19 reserved_room_type 119390 non-null object
20 assigned_room_type 119390 non-null object
21 booking_changes 119390 non-null int64
22 deposit_type 119390 non-null object
23 agent 103050 non-null float64
24 company 6797 non-null float64
25 days_in_waiting_list 119390 non-null int64
26 customer_type 119390 non-null object
27 adr 119390 non-null float64
28 required_car_parking_spaces 119390 non-null int64
29 total_of_special_requests 119390 non-null int64
30 reservation_status 119390 non-null object
31 reservation_status_date 119390 non-null object
dtypes: float64(4), int64(16), object(12)
memory usage: 29.1+ MB
is_canceled | lead_time | arrival_date_year | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | booking_changes | agent | company | days_in_waiting_list | adr | required_car_parking_spaces | total_of_special_requests | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119386.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 103050.000000 | 6797.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 |
mean | 0.370416 | 104.011416 | 2016.156554 | 27.165173 | 15.798241 | 0.927599 | 2.500302 | 1.856403 | 0.103890 | 0.007949 | 0.031912 | 0.087118 | 0.137097 | 0.221124 | 86.693382 | 189.266735 | 2.321149 | 101.831122 | 0.062518 | 0.571363 |
std | 0.482918 | 106.863097 | 0.707476 | 13.605138 | 8.780829 | 0.998613 | 1.908286 | 0.579261 | 0.398561 | 0.097436 | 0.175767 | 0.844336 | 1.497437 | 0.652306 | 110.774548 | 131.655015 | 17.594721 | 50.535790 | 0.245291 | 0.792798 |
min | 0.000000 | 0.000000 | 2015.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 6.000000 | 0.000000 | -6.380000 | 0.000000 | 0.000000 |
25% | 0.000000 | 18.000000 | 2016.000000 | 16.000000 | 8.000000 | 0.000000 | 1.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 9.000000 | 62.000000 | 0.000000 | 69.290000 | 0.000000 | 0.000000 |
50% | 0.000000 | 69.000000 | 2016.000000 | 28.000000 | 16.000000 | 1.000000 | 2.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 14.000000 | 179.000000 | 0.000000 | 94.575000 | 0.000000 | 0.000000 |
75% | 1.000000 | 160.000000 | 2017.000000 | 38.000000 | 23.000000 | 2.000000 | 3.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 229.000000 | 270.000000 | 0.000000 | 126.000000 | 0.000000 | 1.000000 |
max | 1.000000 | 737.000000 | 2017.000000 | 53.000000 | 31.000000 | 19.000000 | 50.000000 | 55.000000 | 10.000000 | 10.000000 | 1.000000 | 26.000000 | 72.000000 | 21.000000 | 535.000000 | 543.000000 | 391.000000 | 5400.000000 | 8.000000 | 5.000000 |
Approach: group by hotel, compute size of each group.
Approach:
stays_in_week_nights
and stays_in_weekend_nights
to get total length of stay.hotels['total_stay'] = hotels['stays_in_week_nights'] + hotels['stays_in_weekend_nights']
hotels.groupby('hotel').agg(
mean_stay=('total_stay', 'mean'),
median_stay=('total_stay', 'median'),
max_stay=('total_stay', 'max')
)
mean_stay | median_stay | max_stay | |
---|---|---|---|
hotel | |||
City Hotel | 2.978142 | 3.0 | 57 |
Resort Hotel | 4.318547 | 3.0 | 69 |
Approach: group by market segment, compute size of each group, sort by size.
lead_time
) are bookings made?Approach: compute mean and max of lead_time
.
count 119390.000000
mean 104.011416
std 106.863097
min 0.000000
25% 18.000000
50% 69.000000
75% 160.000000
max 737.000000
Name: lead_time, dtype: float64
Approach: group by market segment, compute mean and max of lead_time
.
hotels.groupby('market_segment', as_index=False).agg(
mean_lead_time=('lead_time', 'mean'),
max_lead_time=('lead_time', 'max')
)
market_segment | mean_lead_time | max_lead_time | |
---|---|---|---|
0 | Aviation | 4.443038 | 23 |
1 | Complementary | 13.286676 | 386 |
2 | Corporate | 22.125590 | 343 |
3 | Direct | 49.859115 | 737 |
4 | Groups | 186.973096 | 629 |
5 | Offline TA/TO | 135.004459 | 532 |
6 | Online TA | 82.998725 | 403 |
7 | Undefined | 1.500000 | 2 |
Approach: compute total stay length, group by market segment, compute mean and max, sort by mean.
(
hotels
.assign(total_stay=hotels['stays_in_week_nights'] + hotels['stays_in_weekend_nights'])
.groupby('market_segment', as_index=False)
.agg(
mean_stay=('total_stay', 'mean'),
max_stay=('total_stay', 'max')
)
.sort_values('mean_stay', ascending=False)
)
market_segment | mean_stay | max_stay | |
---|---|---|---|
5 | Offline TA/TO | 3.903877 | 56 |
0 | Aviation | 3.607595 | 16 |
6 | Online TA | 3.573986 | 57 |
3 | Direct | 3.205775 | 69 |
4 | Groups | 2.992529 | 38 |
2 | Corporate | 2.092918 | 49 |
1 | Complementary | 1.647376 | 30 |
7 | Undefined | 1.500000 | 2 |