Exercise 2: Bikeshare

Important

Please shut down your RStudio session when you’re not using it.

This will ensure that your work is saved and that we don’t waste server resources.

We will be continuing our work with the Capital Bikeshare dataset that we started in the previous exercise.

The Purpose

Our goal continues to be to understand ridership patterns to evaluate the current system and suggest potential improvements.

Towards that end, we will construct some more visualizations, this time using more fine-grained ridership data.

The Document

We’ll start by creating a Quarto document, exactly like what we did last week. Note that the template is slightly different this time.

  1. Open the data202-exercises project that you created for the previous exercise.
  2. Create a new folder called ex02 in your project.
  3. Create a new Quarto document called ex02.qmd in the folder you just created.
  4. Make sure the Source mode is selected in the top left corner of the editor pane. (The next step won’t work in Visual mode, but you can switch back to Visual mode afterwards.)
  5. Expand the block below, click the Copy button in the top-right, then Select All in your ex02.qmd document and paste the copied text, replacing all existing text in the document.
---
title: "DATA 202 Exercise 2"
author: "Your name here"
format:
  html:
    embed-resources: true
    code-tools: true
    code-fold: true
---

# Load Packages

```{python}
import pandas as pd
import plotly.express as px
```

```{python}
#| echo: false
# DATA 202 hack for displaying plotly within RStudio:
if 'r' in globals() and r['.Platform$GUI'] == "RStudio" and r['suppressMessages(requireNamespace("htmltools"))']:
  r[".GlobalEnv$to_html <- function(x) { print(htmltools::HTML(x)) }"] and None
  def show_plot(p): r.to_html(p._repr_html_())
else:
  def show_plot(p): return p
# End hack
```


# Read Data

```{python}
daily_rides = pd.read_csv("data/day_by_type.csv", parse_dates=["date"])
```

Example row:

```{python}
daily_rides.head(1).T
```


# Exercise 1

```{python}
#| echo: false
#| output: asis
print("""
Your answer here.
""".format())
```

# Exercise 2: Label days of the week

# Exercise 3: Describe a row

Your answer here.

# Exercise 4: Rides by date, by rider type


# Exercise 5


# Exercise 6

The Data

We’ll use an updated dataset based on the Capital Bikeshare dataset we used previously.

Uploading the Dataset

Last time we gave you a code chunk that loaded data directly from the class website. But usually you’ll be working with data that you’ve downloaded to your computer. So let’s practice that.

  1. Download the following data file. (Right-click and choose “Save link as…”.) day_by_type.csv
  2. Create a data folder within your ex02 folder.
  3. Upload the downloaded file to the data folder.
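Before running the data-loading chunk, it can help to confirm that the file ended up where the template expects it. The following is a minimal sketch, assuming the relative path data/day_by_type.csv used by the template's read_csv call; it only reports whether the file is visible from the current working directory.

```python
from pathlib import Path

# Relative path used by the template's read_csv call.
data_path = Path("data") / "day_by_type.csv"

if data_path.exists():
    print(f"Found {data_path} ({data_path.stat().st_size} bytes)")
else:
    # Show where Python is looking, to help spot a misplaced folder.
    print(f"Missing {data_path}; current working directory is {Path.cwd()}")
```

If the file is reported missing, check that the data folder sits inside ex02 rather than at the top of the project.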

What are the columns?

Run the data-loading chunk in the Quarto document. (You might want to actually Run All, to make sure all the imports have run.) In the Environment pane, click the daily_rides dataframe to open a view of the data.

The chunk after the loading chunk displays an example row, transposed to make it easier to read.

Add .style.hide(axis='columns') after the .T to hide the meaningless column name. This only works when rendering, though.
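To see what the transpose does on its own, here is a small sketch on a made-up data frame (the name toy and its values are illustrative assumptions, not the real data). A single row with many columns becomes a single column with one line per variable.

```python
import pandas as pd

# Toy stand-in for daily_rides; the values are made up for illustration.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-01"]),
    "rider_type": ["casual", "registered"],
    "rides": [10, 50],
})

first_row = toy.head(1)    # one row, three columns
transposed = first_row.T   # three rows (one per variable), one column
print(transposed)
```

The lone column in the transposed output is labeled with the original row index (0), which is the "meaningless column name" mentioned above.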

Observe that there are some extra columns in the dataset now.

The id columns are the columns that uniquely identify an observation (sometimes called a “case” instead of “observation”). In the previous Exercise, we only had one id column, date, because we had one observation for each date. The dataset for this exercise has two id columns:

  • date: as before
  • rider_type: registered or casual (see below)

The additional id column means that we’ve now broken down the data by rider type (rider_type). Some riders have registered for a Capital Bikeshare membership to get better rates. Other riders just bought a single trip or short-term pass, so we call them casual riders. (N.B.: according to the source data, “casual” riders include Single Trip, 24-Hour Pass, 3-Day Pass, and 5-Day Pass.) So we’ll now have two rows for each date: one for registered riders and one for casual riders.
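One way to check a claim like "these columns uniquely identify each row" is to look for duplicated combinations. This is a sketch on made-up data (the toy frame and its values are assumptions for illustration); the same calls could be applied to daily_rides.

```python
import pandas as pd

# Toy data with the same two id columns; the ride counts are made up.
toy = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02"]
    ),
    "rider_type": ["casual", "registered", "casual", "registered"],
    "rides": [10, 50, 12, 55],
})

id_columns = ["date", "rider_type"]

# True if any (date, rider_type) combination appears more than once.
has_duplicate_ids = toy.duplicated(subset=id_columns).any()

# Number of rows for each date; expected to be 2 everywhere.
rows_per_date = toy.groupby("date").size()

print("Duplicate id combinations:", has_duplicate_ids)
print(rows_per_date)
```

Note that date alone is no longer enough to identify a row here, since each date appears twice.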

So: each row is the count of how many rides were completed on a given day by a given type of rider. For each row, we have the following observed variables:

  • rides: the number of rides by that type of rider
  • season: Winter, Spring, Summer, or Fall
  • year: 2011 or 2012.
  • holiday: N for ordinary days, Y for holidays
  • workingday: either “weekday” or “weekend” (where “weekend” includes holidays too).
  • day_of_week: an integer between 0 and 6 inclusive. In this exercise, you will decode which number represents Monday, etc.
  • temp: the average temperature that day, in degrees C
  • feels_like: the “feels-like” temperature in degrees C
  • humidity: relative humidity, scaled to range from 0 to 1
  • weather_type: one of four coded weather types (see the source data)
  • windspeed: wind speed in mph

For a description of the original fields, perhaps with different names, see the source data.

Exercise 1: Describe the dataset

Write a sentence answering the following two questions: How many rows does the dataset contain? What does each row represent?

This time, try to use a code chunk to compute the row counts. (See the class notes on Quarto for how to do this.)
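As a rough sketch of the idea, the snippet below counts the rows of a made-up data frame and drops the number into a sentence with str.format, the same method the template's Exercise 1 chunk already calls. The toy frame and the wording of the sentence are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for daily_rides; the values are made up.
toy = pd.DataFrame({
    "rider_type": ["casual", "registered", "casual"],
    "rides": [10, 50, 12],
})

n_rows = len(toy)  # equivalent to toy.shape[0]

# Named placeholders in the string are filled in by format().
sentence = "The toy dataset contains {n_rows} rows.".format(n_rows=n_rows)
print(sentence)
```

With the output: asis option set in the template chunk, the printed text is rendered as ordinary prose in the document.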

Exercise 2: Label days of week

The dataset uses the integers 0 through 6 to label days of the week. It does not document, however, what 0 means or what 6 means. If we want to make understandable plots, we should label these days of the week. To do this, first figure out which day-of-week code corresponds to which day of the week (see the view of the dataset given above for evidence; you may want to look at a calendar), and then do the following.

  1. Write a brief description of what the mapping is and your evidence that you got it correct.
  2. Fill in the blank in the given code block to give labels to the weekdays. Use abbreviations (“Wed”, “Fri”, …).
  3. Check your daily_rides dataframe to make sure the result is correct.

```{python}
weekdays = [list of weekday abbreviations]

def get_day_label(day_number):
  return weekdays[day_number]

daily_rides['day_of_week'] = daily_rides['day_of_week'].map(get_day_label)
```
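If you would rather not reach for a paper calendar, pandas can report the day name for any date. The sketch below uses a few standalone dates (not the dataset itself) as an illustration. Keep in mind that pandas's own numeric convention, shown in the last column, is an assumption about pandas only: the codes in the dataset need not follow it, so compare against the day names rather than the pandas codes.

```python
import pandas as pd

# A few standalone dates used as a programmatic calendar.
dates = pd.Series(pd.to_datetime(["2011-01-01", "2011-01-02", "2011-01-03"]))

day_names = dates.dt.day_name()
pandas_codes = dates.dt.dayofweek  # pandas convention: Monday=0 ... Sunday=6

print(pd.DataFrame({
    "date": dates,
    "day_name": day_names,
    "pandas_code": pandas_codes,
}))
```

Looking up the dataset's day_of_week value for the same dates then gives you evidence for the mapping.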
Warning

The RStudio viewer does an unnecessary time zone conversion, causing it to incorrectly show 2010-12-31. The first date in the dataset should be 2011-01-01.

Right after the read_csv line, add the following code:

# Fix the time zone, to work around an RStudio Viewer bug.
daily_rides['date'] = daily_rides['date'].dt.tz_localize("US/Eastern")

This has not been adequately tested, so it hasn’t been included in the template.
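To see what tz_localize does in isolation, here is a small sketch on standalone dates (not the dataset). It attaches a time zone to timestamps that had none, without shifting the wall-clock date or time.

```python
import pandas as pd

# Timestamps parsed without any time zone information ("naive").
naive = pd.Series(pd.to_datetime(["2011-01-01", "2011-01-02"]))

# Attach a time zone; the calendar date and clock time stay the same.
localized = naive.dt.tz_localize("US/Eastern")

print(naive.dtype)
print(localized.dtype)
print(localized.iloc[0])
```

This differs from tz_convert, which translates timestamps that already have a zone into another zone.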

Caution

The labeling cell overwrites the day_of_week column, so if you run it twice, you’ll get an error. If you do get an error, re-run the data-loading cell to reload the data, then run the labeling cell once.
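The sketch below illustrates why the second run fails and shows one possible alternative: writing the labels to a new column so the integer codes stay intact. The column name day_label and the placeholder labels are assumptions for illustration; the exercise itself asks you to overwrite day_of_week.

```python
import pandas as pd

toy = pd.DataFrame({"day_of_week": [0, 1, 2]})
labels = ["A", "B", "C"]  # placeholder labels, NOT real weekday abbreviations

def get_label(code):
    return labels[code]

# Writing to a new column leaves the integer codes intact,
# so this line can safely be run more than once.
toy["day_label"] = toy["day_of_week"].map(get_label)
toy["day_label"] = toy["day_of_week"].map(get_label)  # second run, same result

# Mapping a column that already holds labels fails:
# a string cannot be used as a list index.
already_labeled = toy["day_of_week"].map(get_label)
try:
    already_labeled.map(get_label)
    second_run_failed = False
except TypeError:
    second_run_failed = True

print(toy)
print("Mapping an already-labeled column failed:", second_run_failed)
```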

Exercise 3: Describe a row

Describe, in one or two English sentences, the information conveyed by the first row in the data frame. Focus your description on only the following fields: date; rider_type; rides; workingday; day_of_week; temp; and feels_like.

Tip

Don’t use code for this one, just type it out.

Exercise 4: Rides by date, by rider type

Make a scatterplot of the number of rides by date, broken down by type of rider. Then, write a brief interpretation of this plot.

Tips:

  • Refer to your Exercise 1 solution for a very similar (but not identical) plot.
  • Make the points smaller (marker_size of 3) and partially transparent (marker_opacity of 0.5) to reduce overplotting. See below for an example of how to use update_traces for this; we will discuss this more fully in the next week or two.
  • Fully label your plot to make the context clear.

Here is one possibility:

Tip

You can use the update_traces method to update the properties of all the traces in a plot. For example, to change the marker size, you can do:

```{python}
show_plot(
  px.scatter(
    ...
    )
    .update_traces(marker_size=...)
)
```

Exercise 5: How does ridership vary over a typical week?

We want to find out how ridership varies over a typical week. Before moving on, consider what question is being asked about the relationship between which variables. It can help to sketch a visualization on scrap paper.

Here’s one possible plot; fill in the blanks to make it.

```{python}
show_plot(
  px.box(
    daily_rides,
    x="__", y="__", color="___",
))
```

Once you have that plot, try a few variations: faceting, using different plot types, etc.
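A numeric summary can be a useful companion to a box plot, since the line in the middle of each box is the group median. The sketch below computes medians per group with groupby on made-up data; the column names day, kind, and value are placeholders, not columns from the bikeshare dataset.

```python
import pandas as pd

# Made-up data; column names are placeholders for illustration.
toy = pd.DataFrame({
    "day": ["d1", "d1", "d1", "d2", "d2", "d2"],
    "kind": ["a", "a", "b", "a", "b", "b"],
    "value": [1, 3, 10, 5, 20, 40],
})

# Median of value for each (day, kind) pair, with one column per kind.
medians = toy.groupby(["day", "kind"])["value"].median().unstack("kind")
print(medians)
```

Comparing a table like this against the plot is a quick way to confirm that you are reading the boxes correctly.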

Finally, write a one-or-two-sentence description of what the plot tells you about the data.

Exercise 6: Plot of your choice

Pick another variable or two from the list of variables above. Make a plot of their relationship.

Write a one-sentence description of what the plot suggests about ridership based on the data.

Click Render to check that you can view the rendered output of your document.

Reflection

At the end of your document, write a sentence or two of your overall reflections on this exercise. You may write whatever you want, but you might perhaps respond to one or two of these questions:

  • Was anything unclear about this assignment?
  • How hard was it for you? Where did you get “stuck”?
  • How long did it take you?
  • What questions or uncertainties remain?
  • What skills do you think you’ll need more practice with?
  • Did you try anything out of curiosity that you weren’t specifically asked to do?
Note

We’ll respond to these reflections in class, but only in an overall sense and only once we’ve had a chance to review them. If you have a question that needs a response, please post it on Perusall.

Submitting

First, make sure that your qmd file is free of template text. (Did you change the title and author names? Did you remove the “your code here” lines? Did you delete any “replace this line” text?)

Make sure that your qmd file renders successfully by clicking Render. Spot-check that your most recent change is reflected in the rendered output. Then submit your HTML file to Moodle. Refer to the first Exercise for how to submit, but note that this time you submit the rendered HTML file.