Midterm Project

Replicate and critique a visual.

For this project you will pick some existing data science work (a newspaper article, blog post, report, research paper, etc.) and replicate a visual from it using the data wrangling and plotting tools we’re studying.

You will then critique the original visualization and propose alternative designs.

The project will be done in teams of 1 to 3 people. Each team will submit a single report and make a joint presentation.

Depth Somewhere

This document details various requirements for the project. Not all requirements will make sense for every project, though.

The overall goal is that your project has depth somewhere. For example, you may choose to go deeper in:

understanding where the data came from
wrangling the data
analyzing the visualization and designing alternatives
thinking deeply about who the audience of the visualization is and making well-motivated design choices
making a high-quality plot

Going deeper in one area may mean that you don’t go as deep in another area. For example, if the data wrangling is very difficult, we will have lower expectations for other parts of the project.

If you’re unsure whether a specific requirement applies to your project, ask.

Milestones

Week 3: Identify possible plots

This was done as part of Discussion 2.

How to pick a plot?

You can pick one of these approaches:

Organization-centered: Pick an organization that you’re interested in (company, NGO, government agency, sports team, hospital, school, scientific community, etc.) and find some graphic that is helpful for them as they make decisions.
Issue-centered: Pick an issue that’s important to you. Find a news article or opinion piece about that issue and try to replicate a visual that they’d use to support their claims. The strongest such projects will pick a claim that you disagree with.

Week 4: Plot selection and initial analysis

Upload a screenshot of one or more potential plots to the Plot Gallery. Your post should include:

Screenshot: A screenshot of the plot (image)
Claim or Purpose: What the author is trying to say with this plot
Source: URL, or some other similarly clear description of how to get to the original plot
Each row is: A short phrase describing what each row of the plot dataframe represents (e.g., “The data has one row per day and rider type”).
Data dictionary: A table of (possible) variable names and what they (probably) represent
Data Sources: Potential sources for the data, in the form of URLs or similar. Try downloading and opening the data if you can; report on any anticipated difficulties.

You don’t have to fill in everything for the initial post; you can come back and fill it in later.
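Once you can open the data, a quick check like the following can help you confirm the “Each row is” phrase and start the data dictionary. This is only a sketch: the `rides` table and its column names (`date`, `rider_type`, `n_rides`) are made up to echo the example phrase above, and your data will look different.

```python
import pandas as pd

# Hypothetical data, invented for illustration: one row per day and rider type.
rides = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"]),
    "rider_type": ["member", "casual", "member", "casual"],
    "n_rides": [120, 45, 98, 60],
})

# If each row really is one (day, rider type) pair, no pair should repeat.
key = ["date", "rider_type"]
n_duplicates = int(rides.duplicated(subset=key).sum())
print(f"{len(rides)} rows, {n_duplicates} repeated {key} pairs")

# Starting point for a data dictionary: variable names and types.
# The "meaning" column is left blank to fill in by hand.
data_dictionary = pd.DataFrame({
    "variable": list(rides.columns),
    "dtype": [str(t) for t in rides.dtypes],
    "meaning": [""] * len(rides.columns),
})
print(data_dictionary)
```

If the duplicate count is not zero, the row description probably needs another variable in it (or the data needs aggregating before plotting).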

Important

The plot you choose to replicate should have at least 3 variables.

If you really want to make a plot that has fewer than 3 variables, you should be able to explain why it’s still interesting and worth doing.

Other Plot Selection Guidelines

The plot you choose to replicate:

Should make an interesting claim.
- Shouldn’t just be about something (e.g., “number of wins by each athlete”)
- Most interesting claims are about relationships (e.g., “the highest paid athletes don’t necessarily win more”)
- A claim finishes the sentence: “The article uses this graphic to back up the claim that BLANK”.
- If the article doesn’t make a claim, make one up. Imagine that you’ve gotten in an argument with someone and you show them this plot to back up your claim; what claim were you trying to make?
Should involve at least 3 variables.
- Ideally they have to come from several separate data sources that you have to bring together.
Should have some room for improvement, i.e., you’d be interested in trying a different way of presenting the data.
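Bringing separate data sources together usually means joining tables on a shared key. A minimal pandas sketch of that step follows. The two tables, their column names, and all of the numbers are invented for illustration (echoing the athlete example above); they are not from any real source.

```python
import pandas as pd

# Hypothetical source 1: wins per athlete.
wins = pd.DataFrame({"athlete": ["A", "B", "C"], "wins": [10, 4, 7]})
# Hypothetical source 2: salary per athlete (athlete C is missing here).
pay = pd.DataFrame({"athlete": ["A", "B", "D"], "salary_musd": [2.5, 8.0, 1.2]})

# A left join keeps every row of `wins`. indicator=True adds a `_merge` column
# recording whether each row found a match in `pay`, and validate= raises an
# error if an athlete unexpectedly appears more than once in either table.
combined = wins.merge(
    pay, on="athlete", how="left", validate="one_to_one", indicator=True
)

# Rows that found no partner are worth reporting in your Data section.
unmatched = combined.loc[combined["_merge"] == "left_only", "athlete"].tolist()
print(combined)
print("No salary found for:", unmatched)
```

Checking for unmatched rows is one place where a project can show depth: why a source is missing some entities is often part of the story.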

If you are really stuck, I can simply assign you a plot to replicate, but I’m hoping to avoid that.

Week 5: Data

Find and load the data; write a brief critique of the data.

For this milestone, your report should be complete through the “Data” section. See the Report Template below.

Data Selection Guidelines

The data you find:

Should come from a reputable source.
- If you use an aggregation site like OurWorldInData, try to track down where they got the data.
Should require at least a little bit of wrangling
- Beware of sites where you can download exactly the data for a specific plot
- “Download data for this chart” links are red flags that you may be getting already-wrangled data.

See Some questions to ask if you’re working with data that you didn’t collect yourself.

If the wrangling is especially straightforward, you can add depth by:

Finding a different source for the same data
Critiquing the data collection process
Describing alternative choices that could have been made in the data collection or wrangling process and what the consequences of those choices might be.

Week 6: Initial plot and ideas

Make an initial plot and a todo list of things to improve; sketch ideas for alternative ways to plot the same data.

Your report should be complete through the “Wrangling” section and have some initial work in the “Replication” and “Alternatives” sections.

Week 8: Presentation and Report

Present the original, replication, and alternative plots to the class, and submit them in a Quarto report.

Report and Presentation Details

Report Content

Your report should include the following sections:

Overview of the original plot (include a screenshot) and the claim it makes.
Design of the original plot. (plot type, effectiveness)
Data (where’d you get it, anything interesting about it or what you had to do with it)
Wrangling (what did you have to do to get it into a form that you could plot?)
Replication (show your replication of the original plot)
Alternative 1: what did you change? why?
Alternative 2: (same, but can be a sketch rather than a full plot)
Summary: one or two take-aways

Report Template

This template includes a suggested outline for your report. You may choose to organize your report differently if you have a good reason.


---
title: "A title"
author:
    - "Your name"
    - "Partner name"
format:
  html:
    embed-resources: true
    code-tools: true
    code-fold: true
---

```{python}
#| echo: true
import pandas as pd
import plotly.express as px
```

```{python}
#| echo: false
# Hack to make plotly work in RStudio
if 'r' in globals() and r['.Platform$GUI'] == "RStudio" and r['suppressMessages(requireNamespace("htmltools"))']:
  r[".GlobalEnv$to_html <- function(x) { print(htmltools::HTML(x)) }"] and None
  def show_plot(p): r.to_html(p._repr_html_())
else:
  def show_plot(p): return p
```

🚧
This template is intended to help you structure your report. Remove placeholders like this and make it your own. Not every question needs to be answered for every project, and some projects will have additional questions. **Your final report should not include any "under construction" or template text.**

## Overview

🚧
We are interested in TOPIC because STORY. So we chose to replicate a plot from [this article](INCLUDE THE COMPLETE URL TO THE ARTICLE).

🚧
Original visualization:

![](https://example.com/your-url-here)

🚧
Claim:

> You can put the claim in a "block quote" like this.
> A concise statement (ideally a quote) of the claim that the article uses the visualization to make (or the claim you invented if there wasn't a clear one)


## Design

🚧
What overall type of visualization was chosen? Why might the author have chosen it?
🚧
What variables are being shown?
🚧
What visual cues (aka retinal variables or aesthetics) were chosen to represent those data variables?
🚧
    For at least one of these variables, describe what makes that choice appropriate or inappropriate.
🚧
Overall, what about the visual makes it effective, or ineffective, for making its claim?


## Data

### Data Overview

🚧
Whether you were able to find the original data (if not, why not?)
🚧
Where the data came from
🚧
    Direct URL and/or specific instructions for how to obtain it.
🚧
    Under what terms is the source allowing you to use the data?
🚧
    Try to trace it upstream as close to the source as you can.
🚧
    Who worked with the data on its way to you? (Include names and roles, if applicable.)
🚧
    What processing may have happened to it: was it aggregated? Anonymized? etc.
🚧
What might we need to know about the data collection process in order to interpret the data correctly? (e.g., If it’s from a survey–who was surveyed?)

### Data Details

🚧
```{python}
#| label: load-data
# your code to load the data here
```

🚧
A low-level description of the size and structure of the data.
🚧
How many rows are there?
What does a single row represent? (Translate the first observation in the dataset into an English sentence.)
🚧
What might be interesting to know about what information the data does, and doesn’t, provide?


### Wrangling

🚧
Describe, at a broad level, what you need to do to the data to make it into the form you need for the plot. (e.g., what data types need fixing, whether you need to pivot, what filtering is needed, etc.)

🚧
Add code blocks, with appropriate names, for wrangling steps. **Explain the *why* for any choices you make (like filtering data).**

## Replication

🚧
Include your replication, along with all code needed.
🚧
Briefly describe any difficulties you encountered, both those you overcame and those you still have not. (It’s ok to not have a perfect graph here. If the essential structure is there, don’t worry if the details are a bit different. Focus your attention on making an interesting and polished alternative design.)


## Alternatives

🚧
Describe at least two alternative design choices that could be made in visualizing your data. For each design, include the following sections:

### Alternative 1: Design

🚧
What choice did the original visual make? (e.g., to use a particular aesthetic mapping or glyph)
🚧
What choice does your alternative design make instead? (It should be a reasonable choice, but it doesn’t have to be an improvement.)
🚧
How does that change affect how the visual supports the original claim? Can your redesign now support some different claim?

### Implementation

🚧
Make a solid attempt to implement your best alternative design.
If creating it using plotly is too challenging, you may include a high-fidelity sketch of what the plot would look like (using PowerPoint, a vector graphics tool, or a good-quality scan of a paper or whiteboard), along with a clear description of what you’d need to figure out in order to produce it with code.

## Summary

🚧
Now that you’ve gone through the whole process, how has your understanding of, and belief in, the original article’s claim changed?
🚧
How faithful was your replication?
🚧
Compare your original and alternative designs. Which is best for what purpose?
🚧
What follow-up questions and ideas do you have about the data or visualization you worked with?
🚧
How do you feel about this whole experience?

## Acknowledgments

🚧
Include the full names of any students outside your team who helped you and a brief description of how they helped.

### License

Sharing: Would you be okay with sharing your project, and if so, how?

Ideally we'd make a public gallery with all projects, screenshots, and code, but you could choose to:

- Go anonymous (choices: "anon" or "names")
- Don't share code? (choices: "code" or "screenshots" or "just title")
- Restrict to just future students (choices: "public" or "students").

so, e.g., you might say "anon, code, students" or "names, screenshots, public".

Report Style

Your report should be:

understandable by itself: a reader should not need to see your discussion posts or prior submissions.
reproducible: if a new version of the data becomes available, anyone should be able to re-run your code and get an updated plot. Things to avoid:
- paths that only work on your computer
- making modifications to your raw data (e.g., editing it in Excel)
- hard-coding row numbers or other things that are likely to change
understandable without the code: a reader should be able to skip over all of the code and understand all of the results.
At least one of your visuals (either the original or alternative design) should be high quality, with effort spent getting the details right.
Clean up any messy outputs from code (debugging, etc.)
Write succinctly. Bulleted lists are fine when they’re clear.
Use ordinary text formatting (not headings, blockquotes, etc.) for ordinary prose. Reserve headings and block quotes for headings and quotes.
Format your code cleanly. A quick way to do this: select the code and click “Reformat Code” on the Code menu. (Make sure you save first.)
Read through the report before submitting. Check that you don’t have placeholder text, the headings make sense, etc.
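The “things to avoid” under reproducibility can be illustrated with a small sketch. Everything named here is hypothetical: the `raw_counts.csv` file, its columns, the placeholder word `suppressed`, and the `clean` helper are stand-ins. The script writes its own tiny file into a temporary folder only so that it runs anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in project folder so the sketch runs anywhere. In a real report the
# project folder is simply the folder that contains your .qmd file.
project_dir = Path(tempfile.mkdtemp())

# Build the path relative to the project folder, never something like
# "C:/Users/me/Downloads/raw_counts.csv" that only exists on one computer.
data_path = project_dir / "data" / "raw_counts.csv"

# Stand-in for a downloaded raw file, created here only for illustration.
data_path.parent.mkdir(parents=True, exist_ok=True)
data_path.write_text(
    "year,region,count\n"
    "2020,North,12\n"
    "2020,Total,30\n"
    "2021,North,suppressed\n"
    "2021,Total,41\n"
)


def clean(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply every fix in code so the raw file on disk is never edited."""
    out = raw.copy()
    # Turn placeholder text into missing values instead of fixing it in Excel.
    out["count"] = pd.to_numeric(out["count"], errors="coerce")
    # Filter by value, not by row number: this still works if rows are added.
    return out[out["region"] != "Total"].reset_index(drop=True)


counts = clean(pd.read_csv(data_path))
print(counts)
```

Because the raw file is untouched and each fix is a line of code, dropping in a newer version of the file and re-rendering should update the plot without manual steps.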

Academic Integrity: If you take any code from elsewhere, you must state very clearly where it came from and how you changed it. This includes ChatGPT and other AI tools. If you find a plot that already has Plotly code, I advise against looking at that code, and you may even want to pick a different plot. Code using a different toolkit (e.g., R) is probably okay.

Presentation

To practice making data-driven arguments, this project includes a brief presentation, done on VoiceThread.

Keep it short and simple. The slides should be as follows:

Original Plot and Claim
Data (where’d you get it, anything interesting about it or what you had to do with it)
Replication
Alternative 1: what did you change? why?
Alternative 2: (same)
Summary: one or two take-aways

Use a presentation tool like Google Slides or PowerPoint.

Use voice comments to narrate. Each team member should make at least one comment.

Logistics

Report Submission

If your dataset is less than 10 MB: submit a .zip file with your entire project folder.
Otherwise: submit just the HTML (and make sure the instructions are very clear about how to get the data)
- Use the header from the template, which includes these important settings:
  - Keep code-tools: true in the header so we can see your source code.
  - Make sure you keep embed-resources: true so we can see any images.

Shared Projects

RStudio (Posit Workbench) has a feature called Shared Projects that many teams will find helpful.

To use it, one person should create a new project in the ~/rprojects folder. Then, they should go to the projects menu in the top-right corner and choose “Share Project”. Type the username of another team member (the part of their email before the @calvin.edu) in the textbox and click Add.

rprojects folder

For shared projects to work, you must create your project within the ~/rprojects folder, i.e., the folder named rprojects within your home (~) folder.

Purpose

The project will explore the intersection between three things:

The data
The visualization
The story that the visualization tells about the data (and, indirectly, about the world)

The three components of this project correspond to these three things:

Data: Obtain some real-world dataset. Trace where it came from and how it’s structured. Load it and process it using the data science toolkit we’re studying to a form that’s appropriate for making the visualization.
Visualization: Replicate (re-create) a visualization that someone else already made based on that data. Evaluate some of the choices and assumptions that were made in the visualization process: in what ways does the visualization faithfully represent the data (or not)? What sort of stories does the chosen visualization amplify?
Story: Consider the story that the original source told using the visualization. Is that story accurate? complete? clearly articulated? How did choices in the data collection, preparation, and visualization affect the storytelling? Are there other stories that the data might also be telling?

This project addresses our course-level learning objectives in this way:

Technical skills: manipulating data, constructing visualizations, and creating reports.
Communication: analyzing choices made in visualization and text with respect to how it tells a story about data. Proposing and implementing changes to improve the clarity of communication.
Ethics and Critical Thinking: identifying potential ethical questions (e.g., of transparency, diversity, etc.) that emerge in the process of obtaining, manipulating, and communicating with data.