Exercise 5: Bikeshare Wrangling

In this exercise, we’ll continue our study of the bike sharing data, but we’re starting to take off the training wheels: rather than using pre-wrangled data, you’ll be doing some of the data wrangling yourself!

In this exercise you’ll:

Getting Started

Create a folder for this exercise and create a new Quarto file in that folder named ex05-bikeshare-wrangling.qmd. Start with the usual setup code, given in the course notes.

Load Data

We downloaded the data from the Capital BikeShare program. Note the license agreement there. We removed some columns of the data that we aren’t using to reduce the size of the dataset you have to download.

If you’re really curious, here’s the code I used to make the Parquet files. For this exercise, though, just use the provided data files.

```python
import requests, zipfile, io, pathlib
import glob
import pandas as pd

for year in [2011, 2012]:
    parquet_filename = f"data/bikeshare_{year}.parquet"
    if pathlib.Path(parquet_filename).exists():
        continue

    filenames = sorted(glob.glob(f"data/{year}*tripdata.csv"))
    if not filenames:
        print("Downloading", year, "data")
        url = f"https://s3.amazonaws.com/capitalbikeshare-data/{year}-capitalbikeshare-tripdata.zip"
        r = requests.get(url)
        z = zipfile.ZipFile(io.BytesIO(r.content))
        z.extractall("data")
        filenames = sorted(glob.glob(f"data/{year}*tripdata.csv"))

    df = pd.concat(
        [
            pd.read_csv(
                filename,
                usecols=["Duration", "Start date", "Member type"],
                parse_dates=["Start date"],
            )
            for filename in filenames
        ],
        ignore_index=True,
    )
    df = df.rename(columns={
        "Start date": "start_time",
        "Duration": "duration",
        "Member type": "rider_type",
    })
    df.to_parquet(parquet_filename, index=False)
```

We’ll distribute the data for this exercise in a format called Parquet, an efficient columnar format that is an emerging standard for dataframe interchange. You can read Parquet files with pandas.

The data files are broken up by year. Here we read in the data for 2011, which is the first of the two years included in the dataset for previous exercises. Download the data file here and upload it to a data folder in your exercise folder.

```{python}
rides_2011 = pd.read_parquet("data/bikeshare_2011.parquet")

rides = rides_2011
rides.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1226767 entries, 0 to 1226766
Data columns (total 3 columns):
 #   Column      Non-Null Count    Dtype         
---  ------      --------------    -----         
 0   duration    1226767 non-null  int64         
 1   start_time  1226767 non-null  datetime64[ns]
 2   rider_type  1226767 non-null  object        
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 28.1+ MB
```

Once you’ve finished with this exercise, you can use the following instructions to load the 2012 data also.

  1. Download the data file here
  2. Upload it to the data folder.
  3. Use the following code to load it and concatenate it with the 2011 data (replacing the rides = rides_2011 line above):
```{python}
rides_2012 = pd.read_parquet("data/bikeshare_2012.parquet")
rides = pd.concat([rides_2011, rides_2012], ignore_index=True)
```

Each row is a ride. For each ride we’re given the start time of the trip, its duration, and rider type.

According to the website:

> Member Type – Indicates whether user was a “registered” member (Annual Member, 30-Day Member or Day Key Member) or a “casual” rider (Single Trip, 24-Hour Pass, 3-Day Pass or 5-Day Pass)

Part 1: Inspecting the data

Part 1A: Basic information

Report some basic size and content information about the rides data frame. For example, how many rows and columns does it have? What are the data types of the columns? What fact does the first row of the data tell us? You may wish to refer to the data dictionary given on the Capital BikeShare program website.

Checklist:

Part 1B: duration column

Note that the data dictionary does not specify the units of the duration column. It turns out that this is in seconds. Confirm this by computing summary statistics of the duration column.

If you wish to make a plot, use rides.sample(n=1000) instead of rides to get a random sample of 1000 rows, since the full dataset pushes the limits of what Plotly can plot efficiently.

Use your domain knowledge about bike riding and cities to check whether the summary statistics make sense. For example, what is the longest ride you would expect to see? What is the shortest ride you would expect to see? What is the typical ride length?

Finally, for practice, make a duration_minutes column that expresses the duration in minutes instead of seconds. Remember that there are 60 seconds in a minute.
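Sketched on a tiny stand-in dataframe (the real rides frame has over a million rows, and these toy durations are made up), the summary statistics and the unit conversion might look like:

```python
import pandas as pd

# Tiny stand-in for the rides dataframe; durations are in seconds
rides = pd.DataFrame({"duration": [346, 562, 3548]})

# Summary statistics: check that min/median/max are plausible ride lengths
print(rides["duration"].describe())

# 60 seconds per minute
rides["duration_minutes"] = rides["duration"] / 60
```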

Checklist:

Part 1C: start_time column

What are the earliest and latest start times in the dataset? Look at the first and last rows in the dataframe, and check your observations using min and max methods on the column.
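One way to do the check, sketched on a toy start_time column (the values here are made up; on the real data you’d use rides["start_time"]):

```python
import pandas as pd

# Toy start_time values standing in for rides["start_time"]
start_time = pd.to_datetime(pd.Series([
    "2011-01-01 00:01:29",
    "2011-06-15 12:30:00",
    "2011-12-31 23:55:00",
]))

print(start_time.min())  # earliest start
print(start_time.max())  # latest start
```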

Checklist:

Part 1D: rider_type column

Make a table that shows how many rides were taken by registered riders and how many were taken by non-registered riders. Example result:

```
  rider_type    size
0     Casual  246949
1     Member  979814
2    Unknown       4
```

What does this table tell you that the data dictionary did not tell you?
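One way to build such a table, sketched on toy data (value_counts would also work, but groupby + size produces the dataframe shape shown above):

```python
import pandas as pd

# Toy rider_type column standing in for the full dataset
rides = pd.DataFrame({
    "rider_type": ["Member", "Casual", "Member", "Unknown", "Member"],
})

# Count the rows in each rider_type group
counts = rides.groupby("rider_type").size().reset_index(name="size")
print(counts)
```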

Checklist:

Part 2: Counting Rides By Day

Our goal is to create a dataframe like the one used in Exercise 1 that includes the number of rides in each day. We’ll do this in two steps:

Part 2A: Date Wrangling

The start_time column gives the timestamp (date and time) of when the ride started. We want to count the number of rides per day, so we need to extract the date from the timestamp. We can use the dt accessor to access the date part of the timestamp:

```{python}
rides['date'] = rides['start_time'].dt.date
```

The original dataframe also had columns for year, month, and day broken out. We won’t need these (you can derive them all from the date), but if you want them, you can use the dt accessor (short for “datetime”). For example:

```{python}
rides['year'] = rides['start_time'].dt.year
```

Check that the first few rows look like:

```
   duration          start_time rider_type        date
0      3548 2011-01-01 00:01:29     Member  2011-01-01
1       346 2011-01-01 00:02:46     Casual  2011-01-01
2       562 2011-01-01 00:06:13     Member  2011-01-01
```

(All of the code for this part has been provided for you above; you just need to include it in your document.)

Part 2B: Counting

Count the number of rides for each date. Call the result daily_rides.

The first few rows of result should look like:

```
        date  total_rides
0 2011-01-01          959
1 2011-01-02          781
2 2011-01-03         1301
3 2011-01-04         1536
4 2011-01-05         1571
```
Tip

To get the column to be named total_rides instead of size, you can either use an explicit aggregation, or rename the result using .rename(columns={'size': 'total_rides'}). Remember that by default, DataFrame methods return a new DataFrame, so you can just include the rename in the method chain.
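For instance, the whole counting step might look like the following sketch on toy data (reset_index(name=...) is yet another way to get the total_rides column name):

```python
import pandas as pd

# Toy data: three rides across two dates
rides = pd.DataFrame({"date": ["2011-01-01", "2011-01-01", "2011-01-02"]})

daily_rides = (
    rides
    .groupby("date")
    .size()                            # count rows per date
    .reset_index(name="total_rides")   # turn the counts into a column
)
print(daily_rides)
```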

Data Types

The resulting date column holds datetime.date objects: plain Python objects that represent a date (but not a time). pandas and Plotly don’t handle this type of object well, so let’s convert the column to pandas datetimes using pd.to_datetime:

```{python}
daily_rides['date'] = pd.to_datetime(daily_rides['date'])
```

Checklist:

Part 2C: Validating the result

How many rows does daily_rides have? How many rows should it have, given your knowledge of how calendars work? If the number of rows is different, what might be the cause?
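As a sanity check, pandas can tell you how many days a year contains:

```python
import pandas as pd

# 2011 is not a leap year, so a full year of daily counts should have 365 rows
expected_days = pd.date_range("2011-01-01", "2011-12-31")
print(len(expected_days))  # 365

# Compare this against len(daily_rides)
```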

Checklist:

Part 2D: Recreating a plot from Exercise 1

Recreate the “total number of rides by day” plot we made in Exercise 1:

Note

When I first tried adding a trendline like we did, I hit a Plotly / pandas bug about date (not datetime) columns. If you also get a TypeError, see above for code for how to use pd.to_datetime to convert the column to a datetime column.

Checklist:

(anything beyond that is nice to have, but not required)

Part 3: Breaking down by rider type

Repeat Part 2, but now create a daily_rides_by_type data frame where you break down the counts by rider type, like we did in Exercise 2. For example:

```
         date rider_type  rides
0  2011-01-01     Casual    330
1  2011-01-01     Member    629
2  2011-01-02     Casual    130
3  2011-01-02     Member    651
4  2011-01-03     Casual    120
```

Hint: you can group by multiple columns by passing a list of column names to groupby.

Make sure you validate the shape of this data frame also; there is a difference between the number of rows you expect and the number of rows you get. Fix the issue by using a query to remove a small number of rows.
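Put together, the grouping and filtering steps might look like this sketch on toy data (the query string assumes the third category is spelled 'Unknown', as in the table from Part 1D):

```python
import pandas as pd

# Toy rides including one "Unknown" rider type to filter out
rides = pd.DataFrame({
    "date": ["2011-01-01", "2011-01-01", "2011-01-01", "2011-01-02"],
    "rider_type": ["Casual", "Member", "Unknown", "Member"],
})

daily_rides_by_type = (
    rides
    .query("rider_type != 'Unknown'")    # drop the handful of Unknown rides
    .groupby(["date", "rider_type"])     # group by both columns
    .size()
    .reset_index(name="rides")
)
print(daily_rides_by_type)
```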

Use the daily_rides_by_type data frame to recreate the plot from Exercise 2:

pandas does not give a warning when the length of a column you assign doesn’t match the dataframe: it aligns on the index and silently fills the gaps with missing values. I think this is a design bug, since you can silently get wrong outputs. For example:

```python
daily_rides_by_type['date_broken'] = pd.to_datetime(daily_rides['date'])
daily_rides_by_type.tail()
```

```
           date rider_type  rides date_broken
725  2011-12-29     Member   2149         NaT
726  2011-12-30     Casual    491         NaT
727  2011-12-30     Member   2476         NaT
728  2011-12-31     Casual    665         NaT
729  2011-12-31     Member   1792         NaT
```

For a smaller example:

```python
pd.DataFrame({"x": [1, 2, 3]}).assign(y=pd.Series([1, 2]))
```

```
   x    y
0  1  1.0
1  2  2.0
2  3  NaN
```

The root cause here is a copy-paste bug: you’re now working with daily_rides_by_type, but you copied code that used daily_rides and forgot to change both. Here’s an alternative style that avoids this problem:

```python
def fixed_date(df):
    return pd.to_datetime(df['date'])

(
    daily_rides
    .assign(date=fixed_date)
    .all_the_rest_of_the_chain()
)
```

Notice that the function takes the current dataframe as an argument, so you can use it in a method chain.

You’ll often see this written in a more compact way using lambda functions:

```python
(
    daily_rides
    .assign(date=lambda df: pd.to_datetime(df['date']))
    .all_the_rest_of_the_chain()
)
```
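Here’s a runnable toy version of that pattern; because assign hands the lambda the current dataframe in the chain, there is no separate variable name to get wrong:

```python
import pandas as pd

result = (
    pd.DataFrame({"date": ["2011-01-01", "2011-01-02"], "rides": [959, 781]})
    .assign(date=lambda df: pd.to_datetime(df["date"]))
)
print(result.dtypes)  # date is now datetime64[ns]
```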

Checklist:

Part 4: Breaking down by day of week

We’ll recreate the “How does ridership vary over a typical week?” plot from the previous exercise using this dataset.

Day of Week column

You’ll need to add a column for the day of the week. Use the strftime method to get the name of the day of the week. We also need to convert the resulting column to a Categorical column so that Plotly will plot the days in the correct order and pandas will order the rows correctly when we group by day of week. Here’s how I did it (admittedly with some help from GitHub Copilot!):

```{python}
# Extract day of week
daily_rides_by_type['day_of_week'] = daily_rides_by_type['date'].dt.strftime("%a")
# Convert to an ordered Categorical
daily_rides_by_type['day_of_week'] = pd.Categorical(
    daily_rides_by_type['day_of_week'],
    categories=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    ordered=True
)
```

We’re not going to study Categorical columns very much, but they’re useful when dealing with qualitative data, especially when it’s ordered. You can read more about them here.
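A quick illustration of why the ordered Categorical matters: sorting plain strings is alphabetical, but sorting an ordered Categorical follows the category order:

```python
import pandas as pd

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
sample = ["Fri", "Tue", "Sun"]

# Plain strings sort alphabetically
print(sorted(sample))                   # ['Fri', 'Sun', 'Tue']

# An ordered Categorical sorts in weekday order
s = pd.Series(pd.Categorical(sample, categories=days, ordered=True))
print(s.sort_values().tolist())         # ['Tue', 'Fri', 'Sun']
```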

Other notes
  • Omit the “Unknown” rider type (if you haven’t already)
  • If you’re only using the 2011 rides, the specific values will be a bit different than what you saw in Exercise 2 because that also included 2012. You can add the 2012 data too if you like; see the notes at the beginning of the exercise.

Part 4A: Plot

Make a box plot of the distribution of rides by day of week, broken down by rider type. Example:

Checklist:

Part 4B: Table

Make a table showing the median number of rides by day of week, broken down by rider type. Example:

```
   day_of_week rider_type  median_rides
0          Mon     Casual         669.0
1          Mon     Member        3236.5
2          Tue     Casual         454.5
3          Tue     Member        3522.0
4          Wed     Casual         344.0
5          Wed     Member        3270.5
6          Thu     Casual         453.0
7          Thu     Member        3131.0
8          Fri     Casual         569.0
9          Fri     Member        3201.5
10         Sat     Casual        1154.0
11         Sat     Member        2510.0
12         Sun     Casual        1217.5
13         Sun     Member        2405.5
```
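One way to compute such a table, sketched on toy counts (two values per group, so the medians fall between them):

```python
import pandas as pd

# Toy daily counts: a couple of Mondays for each rider type
daily = pd.DataFrame({
    "day_of_week": ["Mon", "Mon", "Mon", "Mon"],
    "rider_type": ["Casual", "Casual", "Member", "Member"],
    "rides": [600, 700, 3000, 3473],
})

median_rides = (
    daily
    .groupby(["day_of_week", "rider_type"])["rides"]
    .median()
    .reset_index(name="median_rides")
)
print(median_rides)
```

On the real data, where day_of_week is an ordered Categorical, the groups come out in weekday order; you can pass observed=True to groupby to keep only combinations that actually occur.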

Checklist:

Optional: pivot!

Once you learn about pivoting, try this one: pivot the rider_type column to get a table with one row per day of week, like this:

```
rider_type   Casual  Member
day_of_week
Mon           669.0  3236.5
Tue           454.5  3522.0
Wed           344.0  3270.5
Thu           453.0  3131.0
Fri           569.0  3201.5
Sat          1154.0  2510.0
Sun          1217.5  2405.5
```
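If you try it, the pivot step might look like this sketch (starting from a long-format median_rides frame like the one in Part 4B; only two days shown here):

```python
import pandas as pd

median_rides = pd.DataFrame({
    "day_of_week": ["Mon", "Mon", "Tue", "Tue"],
    "rider_type": ["Casual", "Member", "Casual", "Member"],
    "median_rides": [669.0, 3236.5, 454.5, 3522.0],
})

# One row per day of week, one column per rider type
wide = median_rides.pivot(
    index="day_of_week", columns="rider_type", values="median_rides"
)
print(wide)
```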

Submitting your work

Final checklist:

    • If you have a long line of wrangling code, surround it in parentheses (`(` and `)`) and put each method call on its own line.

When you’re finished, Render your qmd and submit the HTML as usual.