Midterm Project
Replicate and critique a visual.
For this project you will pick some existing data science work (a newspaper article, blog post, report, research paper, etc.) and replicate a visual from it using the data wrangling and plotting tools we’re studying.
You will then critique the original visualization and propose alternative designs.
The project will be done in teams of 1 to 3 people. Each team will submit a single report and make a joint presentation.
This document details various requirements for the project. Not all requirements will make sense for every project, though.
The overall goal is that your project has depth somewhere. For example, you may choose to go deeper in:
- understanding where the data came from
- wrangling the data
- analyzing the visualization and designing alternatives
- thinking deeply about who the audience of the visualization is and making well-motivated design choices
- making a high-quality plot
Going deeper in one area may mean that you don’t go as deep in another area. For example, if the data wrangling is very difficult, we will have lower expectations for other parts of the project.
If you’re unsure whether a specific requirement applies to your project, ask.
Milestones
Week 3: Identify possible plots
This was done as part of Discussion 2.
You can pick one of these approaches:
- Organization-centered: Pick an organization that you’re interested in (company, NGO, government agency, sports team, hospital, school, scientific community, etc.) and find some graphic that is helpful for them as they make decisions.
- Issue-centered: Pick an issue that’s important to you. Find a news article or opinion piece about that issue and try to replicate a visual that they’d use to support their claims. The strongest such projects will pick a claim that you disagree with.
Week 4: Plot selection and initial analysis
Upload a screenshot of one or more potential plots to the Plot Gallery. Your post should include:
- Screenshot: A screenshot of the plot (image)
- Claim or Purpose: What the author is trying to say with this plot
- Source: URL, or some other similarly clear description of how to get to the original plot
- Each row is: A short phrase describing what each row of the plot dataframe represents (e.g., “The data has one row per day and rider type”).
- Data dictionary: A table of (possible) variable names and what they (probably) represent
- Data Sources: Potential sources for the data, in the form of URLs or similar. Try downloading and opening the data if you can; report on any anticipated difficulties.
You don’t have to fill in everything for the initial post; you can come back and fill it in later.
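One way to check an "Each row is" description is to test it against the data. The sketch below uses a small made-up table (the column names `date`, `rider_type`, and `rides` are hypothetical) and asks pandas whether any (day, rider type) pair appears more than once.

```python
import pandas as pd

# Hypothetical plot dataframe: column names and values are made up for illustration.
rides = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
    "rider_type": ["member", "casual", "member", "casual"],
    "rides": [120, 45, 98, 30],
})

# If the data really has one row per day and rider type,
# no (date, rider_type) pair should appear more than once.
key = ["date", "rider_type"]
n_duplicates = int(rides.duplicated(subset=key).sum())
one_row_per_key = n_duplicates == 0
print(f"{len(rides)} rows, {n_duplicates} duplicated keys")
```

If duplicates do show up, the description probably needs another variable in it, or the data needs aggregating before it can be plotted.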
The plot you choose to replicate should have at least 3 variables.
If you really want to make a plot that has fewer than 3 variables, you should be able to explain why it’s still interesting and worth doing.
The plot you choose to replicate:
- Should make an interesting claim.
- Shouldn’t just be about something (e.g., “number of wins by each athlete”)
- Most interesting claims are about relationships (e.g., “the highest paid athletes don’t necessarily win more”)
- A claim finishes the sentence: “The article uses this graphic to back up the claim that BLANK”.
- If the article doesn’t make a claim, make one up. Imagine that you’ve gotten in an argument with someone and you show them this plot to back up your claim; what claim were you trying to make?
- Should involve at least 3 variables.
- Ideally they have to come from several separate data sources that you have to bring together.
- Should have some room for improvement, i.e., you’d be interested in trying a different way of presenting the data.
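When the variables come from separate sources, bringing them together is usually a join. The following is a minimal sketch with two made-up tables (the names `salaries`, `wins`, `athlete`, and `salary_musd` are hypothetical), using pandas `merge` with `indicator=True` so that rows found in only one source are easy to spot.

```python
import pandas as pd

# Two hypothetical sources: salaries from one place, wins from another.
salaries = pd.DataFrame({
    "athlete": ["A", "B", "C"],
    "salary_musd": [30.0, 12.5, 8.0],
})
wins = pd.DataFrame({
    "athlete": ["A", "B", "D"],
    "wins": [4, 9, 7],
})

# An outer join keeps unmatched athletes; indicator=True adds a `_merge` column
# saying which source each row came from. validate="one_to_one" raises an error
# if an athlete appears more than once in either table.
combined = salaries.merge(
    wins, on="athlete", how="outer", validate="one_to_one", indicator=True
)
unmatched = combined[combined["_merge"] != "both"]
print(combined)
print(f"{len(unmatched)} athletes appear in only one source")
```

Unmatched rows are worth reporting in the Data section: they often point to naming differences between sources or to gaps in coverage.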
If you are really stuck, I can simply assign something to do, but I’m hoping to avoid that.
Week 5: Data
Find and load the data; write a brief critique of the data.
For this milestone, your report should be complete through the “Data” section. See the Report Template below.
The data you find:
- Should come from a reputable source.
- If you use an aggregation site like OurWorldInData, try to track down where they got the data.
- Should require at least a little bit of wrangling
- Beware of sites where you can download exactly the data for a specific plot
- “Download data for this chart” links are red flags that you may be getting already-wrangled data.
See “Some questions to ask if you’re working with data that you didn’t collect yourself.”
If the wrangling is especially straightforward, you can add depth by:
- Finding a different source for the same data
- Critiquing the data collection process
- Describing alternative choices that could have been made in the data collection or wrangling process and what the consequences of those choices might be.
Week 6: Initial plot and ideas
Make an initial plot and a todo list of things to improve; sketch ideas for alternative ways to plot the same data.
Your report should be complete through the “Wrangling” section and have some initial work in the “Replication” and “Alternatives” sections.
Week 8: Presentation and Report
Present the original, replication, and alternative plots to the class and in a Quarto report.
Report and Presentation Details
Report Content
Your report should include the following sections:
- Overview of the original plot (include a screenshot) and the claim it makes.
- Design of the original plot. (plot type, effectiveness)
- Data (where’d you get it, anything interesting about it or what you had to do with it)
- Wrangling (what did you have to do to get it into a form that you could plot?)
- Replication (show your replication of the original plot)
- Alternative 1: what did you change? why?
- Alternative 2: (same, but can be a sketch rather than a full plot)
- Summary: one or two take-aways
Report Template
This template includes a suggested outline for your report. You may choose to organize your report differently if you have a good reason.
---
title: "A title"
author:
  - "Your name"
  - "Partner name"
format:
  html:
    embed-resources: true
    code-tools: true
    code-fold: true
---
```{python}
#| echo: true
import pandas as pd
import plotly.express as px
```
```{python}
#| echo: false
# Hack to make plotly work in RStudio
if 'r' in globals() and r['.Platform$GUI'] == "RStudio" and r['suppressMessages(requireNamespace("htmltools"))']:
    r[".GlobalEnv$to_html <- function(x) { print(htmltools::HTML(x)) }"] and None
    def show_plot(p): r.to_html(p._repr_html_())
else:
    def show_plot(p): return p
```
🚧 This template is intended to help you structure your report. Remove placeholders like this and make it your own. Not every question needs to be answered for every project, and some projects will have additional questions. **Your final report should not include any "under construction" or template text.**
## Overview
🚧 We are interested in TOPIC because STORY. So we chose to replicate a plot from [this article](INCLUDE THE COMPLETE URL TO THE ARTICLE).
🚧 Original visualization:
![](https://example.com/your-url-here)
🚧 Claim:
> You can put the claim in a "block quote" like this.
> A concise statement (ideally a quote) of the claim that the article uses the visualization to make (or the claim you invented if there wasn't a clear one)
## Design
🚧 What overall type of visualization was chosen? Why might the author have chosen it?
🚧 What variables are being shown?
🚧 What visual cues (aka retinal variables or aesthetics) were chosen to represent those data variables?
🚧 For at least one of these variables, describe what makes that choice appropriate or inappropriate.
🚧 Overall, what about the visual makes it effective, or ineffective, for making its claim?
## Data
### Data Overview
🚧 Whether you were able to find the original data (if not, why not?)
🚧 Where the data came from
🚧 Direct URL and/or specific instructions for how to obtain it.
🚧 Under what terms is the source allowing you to use the data?
🚧 Try to trace it upstream as close to the source as you can.
🚧 Who worked with the data on its way to you? (Include names and roles, if applicable.)
🚧 What processing may have happened to it: was it aggregated? Anonymized? etc.
🚧 What might we need to know about the data collection process in order to interpret the data correctly? (e.g., If it’s from a survey–who was surveyed?)
### Data Details
🚧
```{r load-data}
# your code to load the data here
```
🚧 A low-level description of the size and structure of the data.
🚧 How many rows are there?
What does a single row represent? (Translate the first observation in the dataset into an English sentence.)
🚧 What might be interesting to know about what information the data does, and doesn’t, provide?
### Wrangling
🚧 Describe, at a broad level, what you need to do to the data to make it into the form you need for the plot. (e.g., what data types need fixing, whether you need to pivot, what filtering is needed, etc.)
🚧 Add code blocks, with appropriate names, for wrangling steps. **Explain the *why* for any choices you make (like filtering data).**
## Replication
🚧 Include your replication, along with all code needed.
🚧 Briefly describe any difficulties you encountered, both those you overcame and those you still have not. (It’s ok to not have a perfect graph here. If the essential structure is there, don’t worry if the details are a bit different. Focus your attention on making an interesting and polished alternative design.)
🚧
## Alternatives
🚧 Describe at least two alternative design choices that could be made in visualizing your data. For each design, include the following sections
### Alternative 1: Design
🚧 What choice did the original visual make? (e.g., to use a particular aesthetic mapping or glyph)
🚧 What choice does your alternative design make instead? (It should be a reasonable choice, but it doesn’t have to be an improvement.)
🚧 How does that change affect how the visual supports the original claim? Can your redesign now support some different claim?
### Implementation
🚧 Make a solid attempt to implement your best alternative design. If creating it using plotly is too challenging, you may include a high-fidelity sketch of what the plot would look like (using PowerPoint, a vector graphics tool, or a good-quality scan of a paper or whiteboard), along with a clear description of what you’d need to figure out in order to produce it with code.
## Summary
🚧 Now that you’ve gone through the whole process, how has your understanding of, and belief in, the original article’s claim changed?
🚧 How faithful was your replication?
🚧 Compare your original and alternative designs. Which is best for what purpose?
🚧 What follow-up questions and ideas do you have about the data or visualization you worked with?
🚧 How do you feel about this whole experience?
## Acknowledgments
🚧 Include the full names of any students outside your team who helped you and a brief description of how they helped.
### License
Sharing: Would you be okay with sharing your project, and if so, how?
Ideally we'd make a public gallery with all projects, screenshots, and code, but you could choose to:
- Go anonymous (choices: "anon" or "names")
- Don't share code? (choices: "code" or "screenshots" or "just title")
- Restrict to just future students (choices: "public" or "students").
So, e.g., you might say "anon, code, students" or "names, screenshots, public".
Report Style
Your report should be:
- understandable by itself: a reader should not need to see your discussion posts or prior submissions.
- reproducible – if a new version of the data becomes available, anyone should be able to re-run your code and get an updated plot. Things to avoid:
- paths that only work on your computer
- making modifications to your raw data (e.g., editing it in Excel)
- hard-coding row numbers or other things that are likely to change
- understandable without the code: a reader should be able to skip over all of the code and understand all of the results.
- At least one of your visuals (either the original or alternative design) should be high quality, with effort spent getting the details right.
- Clean up any messy outputs from code (debugging, etc.)
- Write succinctly. Bulleted lists are fine when they’re clear.
- Use ordinary text formatting (not headings, blockquotes, etc.) for ordinary prose. Reserve headings and block quotes for headings and quotes.
- Format your code cleanly. If lazy, select the code and click “Reformat Code” on the Code menu. (Make sure you save first.)
- Read through the report before submitting. Check that you don’t have placeholder text, the headings make sense, etc.
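The reproducibility guidelines above (no computer-specific paths, no hand edits to raw data, no hard-coded row numbers) can be sketched as follows. The file name, column names, and the meaning of `-1` are all hypothetical, and an in-memory string stands in for the raw file so the example runs on its own.

```python
from io import StringIO
from pathlib import Path

import pandas as pd

# Relative to the project folder, so it works on any computer.
# (Avoid absolute paths such as "C:/Users/me/Downloads/rides.csv".)
DATA_PATH = Path("data") / "rides.csv"  # hypothetical file name

# Stand-in for the raw file so this sketch is self-contained;
# in a real report this would be pd.read_csv(DATA_PATH, ...).
raw_csv = StringIO(
    "date,rider_type,rides\n"
    "2024-01-01,member,120\n"
    "2024-01-01,casual,-1\n"
    "2024-01-02,member,98\n"
)
rides = pd.read_csv(raw_csv, parse_dates=["date"])

# Fix problems in code rather than by editing the raw file in Excel.
# Assumption: -1 is a missing-value code in this made-up data.
# Filter by a condition, not by row number (e.g., rides.drop(1)),
# so the step still works when a new version of the data arrives.
clean = rides[rides["rides"] >= 0]
print(f"kept {len(clean)} of {len(rides)} rows")
```

Printing how many rows a filter keeps is a cheap way to notice when an updated dataset behaves differently from the one you developed against.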
Academic Integrity: If you take any code from elsewhere, you must state very clearly where it came from and how you changed it. This includes ChatGPT and other AI tools. If you find a plot and it has Plotly code already, I advise against looking at the code, and you may even want to find a different plot. Code using a different toolkit (e.g., R) is probably okay.
Presentation
To practice making data-driven arguments, this project includes a brief presentation, done on VoiceThread.
Keep it short and simple. The slides should be as follows:
- Original Plot and Claim
- Data (where’d you get it, anything interesting about it or what you had to do with it)
- Replication
- Alternative 1: what did you change? why?
- Alternative 2: (same)
- Summary: one or two take-aways
Use a presentation tool like Google Slides or PowerPoint.
Use voice comments to narrate. Each team member should make at least one comment.
Logistics
Report Submission
- If your dataset is less than 10 MB: submit a .zip file with your entire project folder.
- Otherwise: submit just the HTML (and make sure the instructions are very clear about how to get the data)
- Use the header from the template, which includes these important settings:
- Keep code-tools: true in the header so we can see your source code.
- Make sure you keep embed-resources: true so we can see any images.
Checklist
The following are the minimum requirements for the project. The project should also go into depth in some area.
- Overview
- Design
- Data
- Wrangling
- Replication
- Alternatives
- Summary
- Appendix
- Meta
Purpose
The project will explore the intersection between three things:
- The data
- The visualization
- The story that the visualization tells about the data (and, indirectly, about the world)
The three components of this project each involve one of these three things:
- Data: Obtain some real-world dataset. Trace where it came from and how it’s structured. Load it and process it using the data science toolkit we’re studying to a form that’s appropriate for making the visualization.
- Visualization: Replicate (re-create) a visualization that someone else already made based on that data. Evaluate some of the choices and assumptions that were made in the visualization process: in what ways does the visualization faithfully represent the data (or not)? What sort of stories does the chosen visualization amplify?
- Story: Consider the story that the original source told using the visualization. Is that story accurate? complete? clearly articulated? How did choices in the data collection, preparation, and visualization affect the storytelling? Are there other stories that the data might also be telling?
This project addresses our course-level learning objectives in this way:
- Technical skills: manipulating data, constructing visualizations, and creating reports.
- Communication: analyzing choices made in visualization and text with respect to how it tells a story about data. Proposing and implementing changes to improve the clarity of communication.
- Ethics and Critical Thinking: identifying potential ethical questions (e.g., of transparency, diversity, etc.) that emerge in the process of obtaining, manipulating, and communicating with data.