import requests, zipfile, io, pathlib
import pandas as pd
import numpy as np
import glob

for year in [2011, 2012]:
    parquet_filename = f'data/bikeshare_{year}.parquet'
    if pathlib.Path(parquet_filename).exists():
        continue
    filenames = sorted(glob.glob(f"data/{year}*tripdata.csv"))
    if not filenames:
        print("Downloading", year, "data")
        url = f"https://s3.amazonaws.com/capitalbikeshare-data/{year}-capitalbikeshare-tripdata.zip"
        r = requests.get(url)
        z = zipfile.ZipFile(io.BytesIO(r.content))
        z.extractall("data")
        filenames = sorted(glob.glob(f"data/{year}*tripdata.csv"))

    df = pd.concat([
        pd.read_csv(filename, usecols=["Duration", "Start date", "Member type"], parse_dates=["Start date"])
        for filename in filenames], ignore_index=True)
    df.rename(columns={"Start date": "start_time", "Duration": "duration", "Member type": "rider_type"}, inplace=True)
    df.to_parquet(parquet_filename, index=False)
Exercise 5: Bikeshare Wrangling
In this exercise, we’ll continue our study of the bike sharing data, but we’re starting to take off the training wheels: rather than using pre-wrangled data, you’ll be doing some of the data wrangling yourself!
In this exercise you’ll:
- refine your skills in creating reproducible documents (Quarto) and graphics (Plotly)
- practice grouping and aggregation functions in pandas
- see some ways to work with dates
Getting Started
Create a folder for this exercise and create a new Quarto file in that folder named `ex05-bikeshare-wrangling.qmd`. Start with the usual setup code, given in the course notes.
Load Data
We downloaded the data from the Capital BikeShare program. Note the license agreement there. We removed some columns of the data that we aren’t using to reduce the size of the dataset you have to download.
If you’re really curious, the code I used to make the parquet files is included at the top of this document. For this exercise, though, just use the provided data files.
We’ll distribute the data for this exercise in a format called Parquet, an efficient emerging standard for dataframe interchange. You can read Parquet files with pandas.
The data files are broken up by year. Here we read in the data for 2011, which is the first of the two years included in the dataset for previous exercises. Download the data file here and upload it to a `data` folder in your exercise folder.
rides_2011 = pd.read_parquet("data/bikeshare_2011.parquet")
rides = rides_2011
rides.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1226767 entries, 0 to 1226766
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 duration 1226767 non-null int64
1 start_time 1226767 non-null datetime64[ns]
2 rider_type 1226767 non-null object
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 28.1+ MB
Once you’ve finished with this exercise, you can use the following instructions to load the 2012 data also.
- Download the data file here
- Upload it to the `data` folder.
- Use the following code to load it and concatenate it with the 2011 data (replacing the `rides = rides_2011` line above):
```{python}
rides_2012 = pd.read_parquet("data/bikeshare_2012.parquet")
rides = pd.concat([rides_2011, rides_2012], ignore_index=True)
```
Each row is a ride. For each ride we’re given the start time of the trip, its duration, and rider type.
According to the website:
Member Type – Indicates whether user was a “registered” member (Annual Member, 30-Day Member or Day Key Member) or a “casual” rider (Single Trip, 24-Hour Pass, 3-Day Pass or 5-Day Pass)
Part 1: Inspecting the data
Part 1A: Basic information
Report some basic size and content information about the `rides` data frame. For example, how many rows and columns does it have? What are the data types of the columns? What fact does the first row of the data tell us? You may wish to refer to the data dictionary given on the Capital BikeShare program website.
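If you're not sure where to start, here is a minimal sketch (run each expression in its own cell, or wrap it in `print`, to see every result):

```{python}
# Number of rows and columns
rides.shape

# Data type of each column
rides.dtypes

# The first row: what fact does it tell us?
rides.head(1)
```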
Checklist:
Part 1B: `duration` column
Note that the data dictionary does not specify the units of the `duration` column. It turns out that this is in seconds. Confirm this by computing summary statistics of the `duration` column.
If you wish to make a plot, use `rides.sample(n=1000)` instead of `rides` to get a random sample of 1000 rows, since the full dataset pushes the limits of what Plotly can plot efficiently.
Use your domain knowledge about bike riding and cities to check whether the summary statistics make sense. For example, what is the longest ride you would expect to see? What is the shortest ride you would expect to see? What is the typical ride length?
Finally, for practice, make a `duration_minutes` column that expresses the duration in minutes instead of seconds. Remember that there are 60 seconds in a minute.
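For example, a sketch of both steps (assuming, as stated above, that `duration` is in seconds):

```{python}
# Summary statistics of ride duration (values are in seconds)
rides['duration'].describe()

# For practice: express duration in minutes (60 seconds per minute)
rides['duration_minutes'] = rides['duration'] / 60
```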
Checklist:
Part 1C: `start_time` column
What are the earliest and latest start times in the dataset? Look at the first and last rows in the dataframe, and check your observations using the `min` and `max` methods on the column.
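One way to do that check (a sketch):

```{python}
# Earliest and latest ride start times in the dataset
rides['start_time'].min(), rides['start_time'].max()
```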
Checklist:
Part 1D: `rider_type` column
Make a table that shows how many rides were taken by registered riders and how many were taken by non-registered riders. Example result:
|   | rider_type | size   |
|---|------------|--------|
| 0 | Casual     | 246949 |
| 1 | Member     | 979814 |
| 2 | Unknown    | 4      |
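One way to produce a table like this is an explicit groupby (a sketch; `value_counts` would also work):

```{python}
# Count the number of rides for each rider type
rides.groupby('rider_type').size().reset_index(name='size')
```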
What does this table tell you that the data dictionary did not tell you?
Checklist:
Part 2: Counting Rides By Day
Our goal is to create a dataframe like the one used in Exercise 1 that includes the number of rides in each day. We’ll do this in two steps:
Part 2A: Date Wrangling
The `start_time` column gives the timestamp (date and time) of when the ride started. We want to count the number of rides per day, so we need to extract the date from the timestamp. We can use the `dt` accessor to access the date part of the timestamp:
rides['date'] = rides['start_time'].dt.date
The original dataframe also had columns for year, month, and day broken out. We won’t need these (you can derive them all from `start_time`), but if you want them, you can use the `dt` accessor (short for “datetime”). For example:
rides['year'] = rides['start_time'].dt.year
Check that the first few rows look like:
|   | duration | start_time          | rider_type | date       |
|---|----------|---------------------|------------|------------|
| 0 | 3548     | 2011-01-01 00:01:29 | Member     | 2011-01-01 |
| 1 | 346      | 2011-01-01 00:02:46 | Casual     | 2011-01-01 |
| 2 | 562      | 2011-01-01 00:06:13 | Member     | 2011-01-01 |
(All of the code for this part has been provided for you above; you just need to include it in your document.)
Part 2B: Counting
Count the number of rides for each date. Call the result `daily_rides`.
The first few rows of the result should look like:
|   | date       | total_rides |
|---|------------|-------------|
| 0 | 2011-01-01 | 959  |
| 1 | 2011-01-02 | 781  |
| 2 | 2011-01-03 | 1301 |
| 3 | 2011-01-04 | 1536 |
| 4 | 2011-01-05 | 1571 |
To get the column to be named `total_rides` instead of `size`, you can either use an explicit aggregation, or rename the result using `.rename(columns={'size': 'total_rides'})`. Remember that by default, `DataFrame` methods return a new `DataFrame`, so you can just include the `rename` in the method chain.
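For instance, the rename route might look like this (a sketch; an explicit aggregation works equally well):

```{python}
# Count rides per date, then rename the count column
daily_rides = (
    rides
    .groupby('date', as_index=False)
    .size()
    .rename(columns={'size': 'total_rides'})
)
```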
The resulting `date` column is a `datetime.date` object, which is a Python object that represents a date (but not a time). pandas and Plotly don’t handle this type of object well, so let’s convert it to a `datetime` object. You can convert a column to a `datetime` object using `pd.to_datetime`:
daily_rides['date'] = pd.to_datetime(daily_rides['date'])
Checklist:
Part 2C: Validating the result
How many rows does `daily_rides` have? How many rows should it have, given your knowledge of how calendars work? If the number of rows is different, what might be the cause?
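A quick way to compare the row count against the calendar (a sketch, assuming you've loaded only the 2011 data):

```{python}
# Number of dates that actually appear vs. number of calendar days in 2011
expected_days = len(pd.date_range('2011-01-01', '2011-12-31'))
len(daily_rides), expected_days
```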
Checklist:
Part 2D: Recreating a plot from Exercise 1
Recreate the “total number of rides by day” plot we made in Exercise 1:
When I first tried adding a trendline like we did, I hit a Plotly / pandas bug with `date` (not `datetime`) columns. If you also get a `TypeError`, see the code above for how to use `pd.to_datetime` to convert the column to a `datetime` column.
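If you need a starting point, here is a Plotly Express sketch (the exact trendline choice from Exercise 1 may differ; adjust as needed):

```{python}
import plotly.express as px

# Total rides per day, with a smoothed trendline
# (trendline='lowess' requires the statsmodels package)
fig = px.scatter(daily_rides, x='date', y='total_rides', trendline='lowess')
fig.show()
```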
Checklist:
(anything beyond that is nice to have, but not required)
Part 3: Breaking down by rider type
Repeat Part 2, but now create a `daily_rides_by_type` data frame where you break down the counts by rider type, like we did in Exercise 2. For example:
|   | date       | rider_type | rides |
|---|------------|------------|-------|
| 0 | 2011-01-01 | Casual     | 330 |
| 1 | 2011-01-01 | Member     | 629 |
| 2 | 2011-01-02 | Casual     | 130 |
| 3 | 2011-01-02 | Member     | 651 |
| 4 | 2011-01-03 | Casual     | 120 |
Hint: you can group by multiple columns by passing a list of column names to `groupby`.
Make sure you validate the shape of this data frame also; there is a difference between the number of rows you expect and the number of rows you get. Fix the issue by using a `query` to remove a small number of rows.
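Here is a sketch of one possible chain (column names chosen to match the example table above; yours may differ):

```{python}
# Count rides per (date, rider_type) pair, then drop the tiny "Unknown" group
daily_rides_by_type = (
    rides
    .groupby(['date', 'rider_type'], as_index=False)
    .size()
    .rename(columns={'size': 'rides'})
    .query("rider_type != 'Unknown'")
)
```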
Use the `daily_rides_by_type` data frame to recreate the plot from Exercise 2:
pandas does not give a warning if the length of a column isn’t correct. I think this is a design bug, since you can silently get wrong outputs. For example:
daily_rides_by_type['date_broken'] = pd.to_datetime(daily_rides['date'])
daily_rides_by_type.tail()
|     | date       | rider_type | rides | date_broken |
|-----|------------|------------|-------|-------------|
| 725 | 2011-12-29 | Member     | 2149  | NaT |
| 726 | 2011-12-30 | Casual     | 491   | NaT |
| 727 | 2011-12-30 | Member     | 2476  | NaT |
| 728 | 2011-12-31 | Casual     | 665   | NaT |
| 729 | 2011-12-31 | Member     | 1792  | NaT |
For a smaller example:
"x": [1, 2, 3]}).assign(y=pd.Series([1,2])) pd.DataFrame({
|   | x | y   |
|---|---|-----|
| 0 | 1 | 1.0 |
| 1 | 2 | 2.0 |
| 2 | 3 | NaN |
The root cause here is a copy-paste bug: you’re now working with `daily_rides_by_type`, but you copied code that used `daily_rides` and forgot to change both occurrences of the name. Here’s an alternative style that avoids this problem:
def fixed_date(df):
    return pd.to_datetime(df['date'])

(
    daily_rides
    .assign(date=fixed_date)
    .all_the_rest_of_the_chain()
)
Notice that the function takes the current dataframe as an argument, so you can use it in a method chain.
You’ll often see this written in a more compact way using `lambda` functions:
(
    daily_rides
    .assign(date=lambda df: pd.to_datetime(df['date']))
    .all_the_rest_of_the_chain()
)
Checklist:
Part 4: Breaking down by day of week
We’ll recreate the “How does ridership vary over a typical week?” plot from the previous exercise using this dataset.
You’ll need to add a column for the day of the week. Use the `strftime` method to get the name of the day of the week. We also need to convert the resulting column to a `Categorical` column so that Plotly will plot the days in the correct order and pandas will order the rows correctly when we group by day of week. Here’s how I did it (admittedly with some help from GitHub Copilot!):
# Extract day of week
daily_rides_by_type['day_of_week'] = daily_rides_by_type['date'].dt.strftime("%a")
# Convert to an ordered Categorical
daily_rides_by_type['day_of_week'] = pd.Categorical(
    daily_rides_by_type['day_of_week'],
    categories=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    ordered=True
)
We’re not going to study Categorical columns very much, but they’re useful when dealing with qualitative data, especially when it’s ordered. You can read more about them here.
- Omit the “Unknown” rider type (if you haven’t already)
- If you’re only using the 2011 rides, the specific values will be a bit different than what you saw in Exercise 2 because that also included 2012. You can add the 2012 data too if you like; see the notes at the beginning of the exercise.
Part 4A: Plot
Make a box plot of the distribution of rides by day of week, broken down by rider type. Example:
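If you need a starting point, here is a Plotly Express sketch (assuming the column names used above):

```{python}
import plotly.express as px

# Distribution of daily ride counts by day of week, split by rider type
fig = px.box(daily_rides_by_type, x='day_of_week', y='rides', color='rider_type')
fig.show()
```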
Checklist:
Part 4B: Table
Make a table showing the median number of rides by day of week, broken down by rider type. Example:
|    | day_of_week | rider_type | median_rides |
|----|-------------|------------|--------------|
| 0  | Mon | Casual | 669.0  |
| 1  | Mon | Member | 3236.5 |
| 2  | Tue | Casual | 454.5  |
| 3  | Tue | Member | 3522.0 |
| 4  | Wed | Casual | 344.0  |
| 5  | Wed | Member | 3270.5 |
| 6  | Thu | Casual | 453.0  |
| 7  | Thu | Member | 3131.0 |
| 8  | Fri | Casual | 569.0  |
| 9  | Fri | Member | 3201.5 |
| 10 | Sat | Casual | 1154.0 |
| 11 | Sat | Member | 2510.0 |
| 12 | Sun | Casual | 1217.5 |
| 13 | Sun | Member | 2405.5 |
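One way to produce such a table (a sketch; the name `median_table` is just a suggestion):

```{python}
# Median daily rides for each (day_of_week, rider_type) combination
median_table = (
    daily_rides_by_type
    .groupby(['day_of_week', 'rider_type'], as_index=False)
    .agg(median_rides=('rides', 'median'))
)
median_table
```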
Checklist:
Optional: pivot!
Once you learn about pivoting, try this one: pivot the `rider_type` column to get a table with one row per day of week, like this:
| rider_type  | Casual | Member |
|-------------|--------|--------|
| day_of_week |        |        |
| Mon | 669.0  | 3236.5 |
| Tue | 454.5  | 3522.0 |
| Wed | 344.0  | 3270.5 |
| Thu | 453.0  | 3131.0 |
| Fri | 569.0  | 3201.5 |
| Sat | 1154.0 | 2510.0 |
| Sun | 1217.5 | 2405.5 |
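A sketch using `pivot` on the Part 4B result (assuming it's named `median_table` as in the sketch above; adjust to your own names):

```{python}
# One row per day of week, one column per rider type
median_table.pivot(index='day_of_week', columns='rider_type', values='median_rides')
```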
Submitting your work
Final checklist:
- If you have a long line of wrangling code, surround it in parentheses (`(` and `)`) and put each method on its own line.
When you’re finished, Render your `qmd` and submit the HTML as usual.