import requests, zipfile, io, pathlib
import pandas as pd
import numpy as np
import glob

for year in [2011, 2012]:
    parquet_filename = f'data/bikeshare_{year}.parquet'
    if pathlib.Path(parquet_filename).exists():
        continue
    filenames = sorted(glob.glob(f"data/{year}*tripdata.csv"))
    if not filenames:
        print("Downloading", year, "data")
        url = f"https://s3.amazonaws.com/capitalbikeshare-data/{year}-capitalbikeshare-tripdata.zip"
        r = requests.get(url)
        z = zipfile.ZipFile(io.BytesIO(r.content))
        z.extractall("data")
        filenames = sorted(glob.glob(f"data/{year}*tripdata.csv"))

    df = pd.concat([
        pd.read_csv(filename, usecols=["Duration", "Start date", "Member type"], parse_dates=["Start date"])
        for filename in filenames], ignore_index=True)
    df.rename(columns={"Start date": "start_time", "Duration": "duration", "Member type": "rider_type"}, inplace=True)
    df.to_parquet(parquet_filename, index=False)
Exercise 5: Bikeshare Wrangling
In this exercise, we’ll continue our study of the bike sharing data, but we’re starting to take off the training wheels: rather than using pre-wrangled data, you’ll be doing some of the data wrangling yourself!
In this exercise you’ll:
- refine your skills in creating reproducible documents (Quarto) and graphics (Plotly)
- practice grouping and aggregation functions in pandas
- see some ways to work with dates
Getting Started
Create a folder for this exercise and create a new Quarto file in that folder named `ex05-bikeshare-wrangling.qmd`. Start with the usual setup code, given in the course notes.
Load Data
We downloaded the data from the Capital BikeShare program. Note the license agreement there. We removed some columns of the data that we aren’t using to reduce the size of the dataset you have to download.
If you’re really curious, the code I used to make the parquet files is included at the top of this document. For this exercise, though, just use the provided data files.
We’ll distribute the data for this exercise in a format called Parquet, an efficient emerging standard for dataframe interchange. You can read Parquet files with pandas.
The data files are broken up by year. Here we read in the data for 2011, which is the first of the two years included in the dataset for previous exercises. Download the data file here and upload it to a `data` folder in your exercise folder.
rides_2011 = pd.read_parquet("data/bikeshare_2011.parquet")
rides = rides_2011
rides.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1226767 entries, 0 to 1226766
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 duration 1226767 non-null int64
1 start_time 1226767 non-null datetime64[ns]
2 rider_type 1226767 non-null object
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 28.1+ MB
Once you’ve finished with this exercise, you can use the following instructions to load the 2012 data also.
- Download the data file here
- Upload it to the `data` folder.
- Use the following code to load it and concatenate it with the 2011 data (replacing the `rides = rides_2011` line above):
```{python}
rides_2012 = pd.read_parquet("data/bikeshare_2012.parquet")
rides = pd.concat([rides_2011, rides_2012], ignore_index=True)
```
Each row is a ride. For each ride we’re given the start time of the trip, its duration, and rider type.
According to the website:
Member Type – Indicates whether user was a “registered” member (Annual Member, 30-Day Member or Day Key Member) or a “casual” rider (Single Trip, 24-Hour Pass, 3-Day Pass or 5-Day Pass)
Part 1: Inspecting the data
Part 1A: Basic information
Report some basic size and content information about the `rides` data frame. For example, how many rows and columns does it have? What are the data types of the columns? What fact does the first row of the data tell us? You may wish to refer to the data dictionary given on the Capital BikeShare program website.
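If you're not sure where to start, here is a minimal sketch (run each expression in its own cell, or wrap it in `print`, to see every result):

```{python}
# Number of rows and columns
rides.shape

# Data type of each column
rides.dtypes

# The first row: what fact does it tell us?
rides.head(1)
```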
Checklist:
Part 1B: `duration` column
Note that the data dictionary does not specify the units of the `duration` column. It turns out that this is in seconds. Confirm this by computing summary statistics of the `duration` column.
If you wish to make a plot, use `rides.sample(n=1000)` instead of `rides` to get a random sample of 1000 rows, since the full dataset pushes the limits of what Plotly can plot efficiently.
Use your domain knowledge about bike riding and cities to check whether the summary statistics make sense. For example, what is the longest ride you would expect to see? What is the shortest ride you would expect to see? What is the typical ride length?
Finally, for practice, make a `duration_minutes` column that expresses the duration in minutes instead of seconds. Remember that there are 60 seconds in a minute.
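For example, a sketch of both steps (assuming, as stated above, that `duration` is in seconds):

```{python}
# Summary statistics of ride duration (values are in seconds)
rides['duration'].describe()

# For practice: express duration in minutes (60 seconds per minute)
rides['duration_minutes'] = rides['duration'] / 60
```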
Checklist:
Part 1C: `start_time` column
What are the earliest and latest start times in the dataset? Look at the first and last rows in the dataframe, and check your observations using the `min` and `max` methods on the column.
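One way to do that check (a sketch):

```{python}
# Earliest and latest ride start times in the dataset
rides['start_time'].min(), rides['start_time'].max()
```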
Checklist:
Part 1D: `rider_type` column
Make a table that shows how many rides were taken by registered riders and how many were taken by non-registered riders. Example result:
|   | rider_type | size   |
|---|------------|--------|
| 0 | Casual     | 246949 |
| 1 | Member     | 979814 |
| 2 | Unknown    | 4      |
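One way to produce a table like this is an explicit groupby (a sketch; `value_counts` would also work):

```{python}
# Count the number of rides for each rider type
rides.groupby('rider_type').size().reset_index(name='size')
```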
What does this table tell you that the data dictionary did not tell you?
Checklist:
Part 2: Counting Rides By Day
Our goal is to create a dataframe like the one used in Exercise 1 that includes the number of rides in each day. We’ll do this in two steps:
Part 2A: Date Wrangling
The `start_time` column gives the timestamp (date and time) of when the ride started. We want to count the number of rides per day, so we need to extract the date from the timestamp. We can use the `dt` accessor to access the date part of the timestamp:
rides['date'] = rides['start_time'].dt.date
The original dataframe also had columns for year, month, and day broken out. We won’t need these (you can derive them all from `start_time`), but if you want them, you can use the `dt` accessor (short for “datetime”). For example:
rides['year'] = rides['start_time'].dt.year
Check that the first few rows look like:
|   | duration | start_time          | rider_type | date       |
|---|----------|---------------------|------------|------------|
| 0 | 3548     | 2011-01-01 00:01:29 | Member     | 2011-01-01 |
| 1 | 346      | 2011-01-01 00:02:46 | Casual     | 2011-01-01 |
| 2 | 562      | 2011-01-01 00:06:13 | Member     | 2011-01-01 |
(All of the code for this part has been provided for you above; you just need to include it in your document.)
Part 2B: Counting
Count the number of rides for each date. Call the result `daily_rides`.
The first few rows of the result should look like:
|   | date       | total_rides |
|---|------------|-------------|
| 0 | 2011-01-01 | 959  |
| 1 | 2011-01-02 | 781  |
| 2 | 2011-01-03 | 1301 |
| 3 | 2011-01-04 | 1536 |
| 4 | 2011-01-05 | 1571 |
To get the column to be named `total_rides` instead of `size`, you can either use an explicit aggregation, or rename the result using `.rename(columns={'size': 'total_rides'})`. Remember that by default, `DataFrame` methods return a new `DataFrame`, so you can just include the `rename` in the method chain.
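For instance, the rename route might look like this (a sketch; an explicit aggregation works equally well):

```{python}
# Count rides per date, then rename the count column
daily_rides = (
    rides
    .groupby('date', as_index=False)
    .size()
    .rename(columns={'size': 'total_rides'})
)
```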
The resulting `date` column is a `datetime.date` object, which is a Python object that represents a date (but not a time). pandas and Plotly don’t handle this type of object well, so let’s convert it to a `datetime` object. You can convert a column to a `datetime` object using `pd.to_datetime`:
daily_rides['date'] = pd.to_datetime(daily_rides['date'])
Checklist:
Part 2C: Validating the result
How many rows does `daily_rides` have? How many rows should it have, given your knowledge of how calendars work? If the number of rows is different, what might be the cause?
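A quick way to compare the row count against the calendar (a sketch, assuming you've loaded only the 2011 data):

```{python}
# Number of dates that actually appear vs. number of calendar days in 2011
expected_days = len(pd.date_range('2011-01-01', '2011-12-31'))
len(daily_rides), expected_days
```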
Checklist:
Part 2D: Recreating a plot from Exercise 1
Recreate the “total number of rides by day” plot we made in Exercise 1:
When I first tried adding a trendline like we did, I hit a Plotly / pandas bug with `date` (not `datetime`) columns. If you also get a `TypeError`, see the code above for how to use `pd.to_datetime` to convert the column to a `datetime` column.
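If you need a starting point, here is a Plotly Express sketch (the exact trendline choice from Exercise 1 may differ; adjust as needed):

```{python}
import plotly.express as px

# Total rides per day, with a smoothed trendline
# (trendline='lowess' requires the statsmodels package)
fig = px.scatter(daily_rides, x='date', y='total_rides', trendline='lowess')
fig.show()
```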
Checklist:
(anything beyond that is nice to have, but not required)
Part 3: Breaking down by rider type
Repeat Part 2, but now create a `daily_rides_by_type` data frame where you break down the counts by rider type, like we did in Exercise 2. For example:
|   | date       | rider_type | rides |
|---|------------|------------|-------|
| 0 | 2011-01-01 | Casual     | 330 |
| 1 | 2011-01-01 | Member     | 629 |
| 2 | 2011-01-02 | Casual     | 130 |
| 3 | 2011-01-02 | Member     | 651 |
| 4 | 2011-01-03 | Casual     | 120 |
Hint: you can group by multiple columns by passing a list of column names to `groupby`.
Make sure you validate the shape of this data frame also; there is a difference between the number of rows you expect and the number of rows you get. Fix the issue by using a `query` to remove a small number of rows.
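Here is a sketch of one possible chain (column names chosen to match the example table above; yours may differ):

```{python}
# Count rides per (date, rider_type) pair, then drop the tiny "Unknown" group
daily_rides_by_type = (
    rides
    .groupby(['date', 'rider_type'], as_index=False)
    .size()
    .rename(columns={'size': 'rides'})
    .query("rider_type != 'Unknown'")
)
```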
Use the `daily_rides_by_type` data frame to recreate the plot from Exercise 2:
pandas does not give a warning if the length of a column isn’t correct. I think this is a design bug, since you can silently get wrong outputs. For example:
daily_rides_by_type['date_broken'] = pd.to_datetime(daily_rides['date'])
daily_rides_by_type.tail()
|     | date       | rider_type | rides | date_broken |
|-----|------------|------------|-------|-------------|
| 725 | 2011-12-29 | Member     | 2149  | NaT |
| 726 | 2011-12-30 | Casual     | 491   | NaT |
| 727 | 2011-12-30 | Member     | 2476  | NaT |
| 728 | 2011-12-31 | Casual     | 665   | NaT |
| 729 | 2011-12-31 | Member     | 1792  | NaT |
For a smaller example:
"x": [1, 2, 3]}).assign(y=pd.Series([1,2])) pd.DataFrame({
|   | x | y   |
|---|---|-----|
| 0 | 1 | 1.0 |
| 1 | 2 | 2.0 |
| 2 | 3 | NaN |
The root cause here is a copy-paste bug: you’re now working with `daily_rides_by_type`, but you copied code that used `daily_rides` and forgot to change both occurrences of the name. Here’s an alternative style that avoids this problem:
def fixed_date(df):
    return pd.to_datetime(df['date'])

(
    daily_rides
    .assign(date=fixed_date)
    .all_the_rest_of_the_chain()
)
Notice that the function takes the current dataframe as an argument, so you can use it in a method chain.
You’ll often see this written in a more compact way using `lambda` functions:
(
    daily_rides
    .assign(date=lambda df: pd.to_datetime(df['date']))
    .all_the_rest_of_the_chain()
)
Checklist:
Part 4: Breaking down by day of week
We’ll recreate the “How does ridership vary over a typical week?” plot from the previous exercise using this dataset.
You’ll need to add a column for the day of the week. Use the `strftime` method to get the name of the day of the week. We also need to convert the resulting column to a `Categorical` column so that Plotly will plot the days in the correct order and pandas will order the rows correctly when we group by day of week. Here’s how I did it (admittedly with some help from GitHub Copilot!):
# Extract day of week
daily_rides_by_type['day_of_week'] = daily_rides_by_type['date'].dt.strftime("%a")
# Convert to an ordered Categorical
daily_rides_by_type['day_of_week'] = pd.Categorical(
    daily_rides_by_type['day_of_week'],
    categories=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    ordered=True
)
We’re not going to study Categorical columns very much, but they’re useful when dealing with qualitative data, especially when it’s ordered. You can read more about them here.
- Omit the “Unknown” rider type (if you haven’t already)
- If you’re only using the 2011 rides, the specific values will be a bit different than what you saw in Exercise 2 because that also included 2012. You can add the 2012 data too if you like; see the notes at the beginning of the exercise.
Part 4A: Plot
Make a box plot of the distribution of rides by day of week, broken down by rider type. Example:
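If you need a starting point, here is a Plotly Express sketch (assuming the column names used above):

```{python}
import plotly.express as px

# Distribution of daily ride counts by day of week, split by rider type
fig = px.box(daily_rides_by_type, x='day_of_week', y='rides', color='rider_type')
fig.show()
```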
Checklist:
Part 4B: Table
Make a table showing the median number of rides by day of week, broken down by rider type. Example:
|    | day_of_week | rider_type | median_rides |
|----|-------------|------------|--------------|
| 0  | Mon | Casual | 669.0  |
| 1  | Mon | Member | 3236.5 |
| 2  | Tue | Casual | 454.5  |
| 3  | Tue | Member | 3522.0 |
| 4  | Wed | Casual | 344.0  |
| 5  | Wed | Member | 3270.5 |
| 6  | Thu | Casual | 453.0  |
| 7  | Thu | Member | 3131.0 |
| 8  | Fri | Casual | 569.0  |
| 9  | Fri | Member | 3201.5 |
| 10 | Sat | Casual | 1154.0 |
| 11 | Sat | Member | 2510.0 |
| 12 | Sun | Casual | 1217.5 |
| 13 | Sun | Member | 2405.5 |
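One way to produce such a table (a sketch; the name `median_table` is just a suggestion):

```{python}
# Median daily rides for each (day_of_week, rider_type) combination
median_table = (
    daily_rides_by_type
    .groupby(['day_of_week', 'rider_type'], as_index=False)
    .agg(median_rides=('rides', 'median'))
)
median_table
```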
Checklist:
Optional: pivot!
Once you learn about pivoting, try this one: pivot the `rider_type` column to get a table with one row per day of week, like this:
| rider_type  | Casual | Member |
|-------------|--------|--------|
| day_of_week |        |        |
| Mon | 669.0  | 3236.5 |
| Tue | 454.5  | 3522.0 |
| Wed | 344.0  | 3270.5 |
| Thu | 453.0  | 3131.0 |
| Fri | 569.0  | 3201.5 |
| Sat | 1154.0 | 2510.0 |
| Sun | 1217.5 | 2405.5 |
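A sketch using `pivot` on the Part 4B result (assuming it's named `median_table` as in the sketch above; adjust to your own names):

```{python}
# One row per day of week, one column per rider type
median_table.pivot(index='day_of_week', columns='rider_type', values='median_rides')
```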
Submitting your work
Final checklist:
- If you have a long line of wrangling code, surround it in parentheses (`(` and `)`) and put each method on its own line.
When you’re finished, Render your `qmd` and submit the HTML as usual.