Exercise 2: Bikeshare
Please shut down your RStudio session when you’re not using it.
This will ensure that your work is saved and that we don’t waste server resources.
We will be continuing our work with the Capital Bikeshare dataset that we started in the previous exercise.
The Purpose
Our goal continues to be to understand ridership patterns to evaluate the current system and suggest potential improvements.
Towards that end, we will construct some more visualizations, this time using more fine-grained ridership data.
The Document
We’ll start by creating a Quarto document, exactly like what we did last week. Note that the template is slightly different this time.
- Open the data202-exercises project that you created for the previous exercise.
- Create a new folder called ex02 in your project.
- Create a new Quarto document called ex02.qmd in the folder you just created.
- Make sure the Source mode is selected in the top left corner of the editor pane. (The next step won’t work in the Visual mode, but you can switch back to Visual mode afterwards.)
- Expand the block below, click the Copy button in the top-right, then Select All in your ex02.qmd document and paste the copied text, replacing all existing text in the document.
---
title: "DATA 202 Exercise 2"
author: "Your name here"
format:
  html:
    embed-resources: true
    code-tools: true
    code-fold: true
---
# Load Packages
```{python}
import pandas as pd
import plotly.express as px
```
```{python}
#| echo: false
# DATA 202 hack for displaying plotly within RStudio:
if 'r' in globals() and r['.Platform$GUI'] == "RStudio" and r['suppressMessages(requireNamespace("htmltools"))']:
    r[".GlobalEnv$to_html <- function(x) { print(htmltools::HTML(x)) }"] and None
    def show_plot(p): r.to_html(p._repr_html_())
else:
    def show_plot(p): return p
# End hack
```
# Read Data
```{python}
daily_rides = pd.read_csv("data/day_by_type.csv", parse_dates=["date"])
```
Example row:
```{python}
daily_rides.head(1).T
```
# Exercise 1
```{python}
#| echo: false
#| output: asis
print("""
Your answer here.
""".format())
```
# Exercise 2: Label days of the week
# Exercise 3: Describe a row
Your answer here.
# Exercise 4: Rides by date, by rider type
# Exercise 5
# Exercise 6
The Data
We’ll use an updated dataset based on the Capital Bikeshare dataset we used previously.
Uploading the Dataset
Last time we gave you a code chunk that loaded data directly from the class website. But usually you’ll be working with data that you’ve downloaded to your computer. So let’s practice that.
- Download the following data file: day_by_type.csv. (Right-click and choose “Save link as…”.)
- Create a data folder within your ex02 folder.
- Upload the downloaded file to the data folder.
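If the data-loading chunk later fails with a "file not found" error, the file is usually in the wrong folder. Below is a minimal sketch of a check you could run first. The helper name `find_data_file` is hypothetical (it is not part of the template), and the demonstration uses a temporary directory as a stand-in for your ex02 folder.

```python
from pathlib import Path
import tempfile


def find_data_file(project_dir, relative_path="data/day_by_type.csv"):
    """Hypothetical helper: return the path to the data file or raise a clear error."""
    path = Path(project_dir) / relative_path
    if not path.is_file():
        raise FileNotFoundError(f"Expected the dataset at {path}; re-check the upload step.")
    return path


# Demonstration in a throwaway directory standing in for the ex02 folder.
with tempfile.TemporaryDirectory() as ex02:
    data_dir = Path(ex02) / "data"
    data_dir.mkdir()
    # An empty stand-in file with a made-up header, not the real dataset.
    (data_dir / "day_by_type.csv").write_text("date,rider_type,rides\n")
    found = find_data_file(ex02)
    found_name = found.name
    found_parent = found.parent.name

print(found_parent, found_name)
```

The relative path matches the one passed to `read_csv` in the template, so the check only passes when the file sits in a data folder next to your ex02.qmd.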
What are the columns?
Run the data-loading chunk in the Quarto document. (You might want to actually Run All, to make sure all the imports have run.) In the Environment pane, click the daily_rides dataframe to open a view of the data.
The chunk after the loading chunk displays an example row, transposed (.T) to make it easier to read.
Add .style.hide(axis='columns') after the .T to hide the meaningless column name. This only works when rendering, though.
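To see what the transpose does, here is a small sketch on a toy dataframe. The values are made up for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2011-01-01", "2011-01-01"]),
    "rider_type": ["casual", "registered"],
    "rides": [10, 20],
})

# head(1) keeps the first row; .T flips it so each original column becomes a row.
first_row = toy.head(1).T
print(first_row)
```

The result has a single column labelled 0, which is the row's index value. That is the "meaningless column name" mentioned above.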
Observe that there are some extra columns in the dataset now.
The id columns are the columns that uniquely identify an observation (sometimes called a “case” instead of “observation”). In the previous Exercise, we only had one id column, date, because we had one observation for each date. The dataset for this exercise has two id columns:
- date: as before
- rider_type: registered or casual (see below)
The additional id column means that we’ve now broken down the data by rider type (rider_type). Some riders have registered for a Capital Bikeshare membership to get better rates. Other riders just bought a single trip or short-term pass, so we call them casual riders. (N.B.: according to the source data, “casual” riders include Single Trip, 24-Hour Pass, 3-Day Pass, or 5-Day Pass.) So we’ll now have two rows for each date: one for registered riders and one for casual riders.
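One way to check the claim that the id columns uniquely identify each row is sketched below, again on a made-up toy dataframe. The same two checks should work on daily_rides, assuming its columns are named as described here.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2011-01-01", "2011-01-01", "2011-01-02", "2011-01-02"]),
    "rider_type": ["casual", "registered", "casual", "registered"],
    "rides": [10, 20, 30, 40],
})

id_columns = ["date", "rider_type"]

# True if any (date, rider_type) pair appears more than once.
has_duplicates = bool(toy.duplicated(subset=id_columns).any())

# Number of rows for each date; we expect 2 everywhere.
rows_per_date = toy.groupby("date").size()

print(has_duplicates)
print(rows_per_date)
```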
So: each row is the count of how many rides were completed on a given day by a given type of rider. For each row, we have the following observed variables:
- rides: the number of rides by that type of rider
- season: Winter, Spring, sUmmer, or Fall
- year: 2011 or 2012
- holiday: N for ordinary days, Y for holidays
- workingday: either “weekday” or “weekend” (where “weekend” includes holidays too)
- day_of_week: an integer between 0 and 6 inclusive. In this exercise, you will decode which number represents Monday, etc.
- temp: the average temperature that day, in degrees C
- feels_like: the “feels-like” temperature in degrees C
- humidity: relative humidity, scaled to range from 0 to 1
- weather_type: four coded weather types here (see the source data)
- windspeed: wind speed in mph
For a description of the original fields, perhaps with different names, see the source data.
Exercise 1: Describe the dataset
Write a sentence answering the following two questions: How many rows does the dataset contain? What does each row represent?
This time, try to use a code chunk to compute the row counts. (See the class notes on Quarto for how to do this.)
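As a rough sketch of the idea, the template's Exercise 1 chunk prints a string built with .format(), so a computed value can be dropped into the sentence. The toy dataframe and the variable names below are made up for illustration. In your document you would use daily_rides instead.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2011-01-01", "2011-01-01", "2011-01-02"]),
    "rider_type": ["casual", "registered", "casual"],
    "rides": [10, 20, 30],
})

n_rows = len(toy)  # number of rows in the dataframe

# The placeholder {n_rows} is filled in by .format().
sentence = "The toy dataset contains {n_rows} rows.".format(n_rows=n_rows)
print(sentence)
```

Because the number is computed rather than typed, the sentence stays correct if the data file changes.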
Exercise 2: Label days of week
The dataset uses the integers 0 through 6 to label days of the week. It does not document, however, what 0 means, or what 6 means. If we want to make understandable plots, we should label these days of the week. To do this, first figure out which day-of-week codes map to which days of the week (see the glimpse of the dataset given above for evidence; you may want to look at a calendar), and then do the following.
- Write a brief description of what the mapping is and your evidence that you got it correct.
- Fill in the blank in the given code block to give labels to the weekdays. Use abbreviations (“Wed”, “Fri”, …).
- Check your daily_rides dataframe to make sure the result is correct.
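```{python}
# Fill in the blank: one abbreviation per code, in order from code 0 to code 6.
weekdays = [list of weekday abbreviations]

def get_day_label(day_number):
    # Look up the label for a numeric day-of-week code.
    return weekdays[day_number]

daily_rides['day_of_week'] = daily_rides['day_of_week'].map(get_day_label)
```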
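If you don't have a calendar handy, Python's standard library can tell you which weekday a date fell on. The sketch below checks the first date in the dataset. Note that Python's own numbering (Monday is 0) is an assumption of the `datetime` module and need not match the codes in the dataset, so compare the day names, not the numbers.

```python
from datetime import date

first_day = date(2011, 1, 1)

# Python's convention: Monday is 0, ..., Sunday is 6.
# This is NOT necessarily the dataset's convention.
python_code = first_day.weekday()

python_names = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
first_day_name = python_names[python_code]

print(first_day, "fell on a", first_day_name)
```

You can then look at the day_of_week value for that date in the dataset to work out the dataset's own code for that day.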
The RStudio viewer does an unnecessary time zone conversion, causing it to incorrectly show 2010-12-31. The first date in the dataset should be 2011-01-01.
Right after the read_csv line, add the following code:
# Fix the time zone, to work around an RStudio Viewer bug.
daily_rides['date'] = daily_rides['date'].dt.tz_localize("US/Eastern")
This has not been adequately tested, so it hasn’t been included in the template.
The cell reassigns the column. So if you run it twice, you’ll get an error. If you do get an error, you can re-run the cell above to re-load the data.
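One possible way to make that cell safe to re-run is to localize only when the column has no time zone yet. This is an untested-in-RStudio sketch, not part of the template. The helper name `localize_once` is hypothetical and the dates are made up.

```python
import pandas as pd

# Made-up dates for illustration only.
toy = pd.DataFrame({"date": pd.to_datetime(["2011-01-01", "2011-01-02"])})


def localize_once(dates, tz="US/Eastern"):
    """Hypothetical helper: attach a time zone only if the column has none yet."""
    if dates.dt.tz is None:
        return dates.dt.tz_localize(tz)
    return dates  # already tz-aware; localizing again would raise an error


toy["date"] = localize_once(toy["date"])
toy["date"] = localize_once(toy["date"])  # second call leaves the column unchanged

print(toy["date"])
```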
Exercise 3: Describe a row
Describe, in one or two English sentences, the information conveyed by the first row in the data frame. Focus your description on only the following fields: date; rider_type; rides; workingday; day_of_week; temp; and feels_like.
Don’t use code for this one, just type it out.
Exercise 4: Rides by date, by rider type
Make a scatterplot of the number of rides by date, broken down by type of rider. Then, write a brief interpretation of this plot.
Tips:
- Refer to your Exercise 1 solution for a very similar (but not identical) plot.
- Make the points smaller (marker_size of 3) and partially transparent (marker_opacity of 0.5) to reduce overplotting. See below for an example of how to use update_traces for this; we will discuss this more fully in the next week or two.
- Fully label your plot to make the context clear.
Here is one possibility:
You can use the update_traces method to update the properties of all the traces in a plot. For example, to change the marker size, you can do:
```{python}
show_plot(
    px.scatter(
        ...
    )
    .update_traces(marker_size=...)
)
```
Exercise 5: How does ridership vary over a typical week?
We want to find out how ridership varies over a typical week. Before moving on, consider what question is being asked about the relationship between which variables. It can help to sketch a visualization on scrap paper.
Here’s one possible plot; fill in the blanks to make it.
```{python}
show_plot(
    px.box(
        daily_rides,
        x="__", y="__", color="___",
    )
)
```
Once you have that plot, try a few variations: faceting, using different plot types, etc.
Finally, write a one-or-two-sentence description of what the plot tells you about the data.
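When writing your description, it can help to cross-check what you read off the box plot against numbers. A box plot marks the median of each group, and the sketch below computes group medians directly. The column names day, kind, and value are made up, as are the values. This is one possible check, not a required step.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "day": ["Mon", "Mon", "Mon", "Mon", "Sat", "Sat", "Sat", "Sat"],
    "kind": ["a", "a", "b", "b", "a", "a", "b", "b"],
    "value": [1, 3, 10, 20, 5, 7, 2, 4],
})

# One median per (day, kind) pair, matching the center line of each box.
medians = toy.groupby(["day", "kind"])["value"].median()
print(medians)
```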
Exercise 6: Plot of your choice
Pick another variable or two from the list of variables above. Make a plot of their relationship.
Write a one-sentence description of what the plot suggests about ridership based on the data.
Click Render to check that you can view the rendered output of your document.
Reflection
At the end of your document, write a sentence or two of your overall reflections on this exercise. You may write whatever you want, but you might perhaps respond to one or two of these questions:
- Was anything unclear about this assignment?
- How hard was it for you? Where did you get “stuck”?
- How long did it take you?
- What questions or uncertainties remain?
- What skills do you think you’ll need more practice with?
- Did you try anything out of curiosity that you weren’t specifically asked to do?
We’ll respond to these reflections in class, but only in an overall sense and only once we’ve had a chance to review them. If you have a question that needs a response, please post it on Perusall.
Submitting
First, make sure that your qmd file is free of template text. (Did you change the title and author names? Did you remove the “your code here” lines? Did you delete any “replace this line” text?)
Make sure that your qmd file renders successfully by clicking Render. Spot-check that your most recent change is reflected in the rendered output. Then, submit your html file to Moodle. Refer to the first Exercise for how, but this time, submit your HTML file.