DATA 202 Week 1 Day 1: Welcome!

As you enter…

  • Make a name card (name on front and back)
  • Sit next to someone you don’t know well
  • Introduce yourself. Ideas:
    • What you’re studying (and how you got to that answer)
    • Something you’re passionate about outside of this class: a cause, a subject, a hobby, etc.
    • Something you’re excited about for this semester

Introduction - Ken Arnold

  • From the East Coast (North Carolina, Maryland, NY, Boston)
  • 5th year at Calvin
  • Partner with an architect (Susan)
  • Dad of twin kindergarteners (Naomi and Esther)
  • Member of New City Fellowship (AV, piano, choir)

Introduction - Fernando Santos

  • From São Carlos, Brazil
    • Portuguese, not Spanish
    • Cerrado (Brazilian savanna), not jungle
    • Be patient with my English, I promise it will get better!
  • Just arrived at Calvin
  • Major in computer engineering, PhD in complex systems modeling
  • Married to Jemima, dad of Suzana (5) and Natanael (3)

Opening Prayer

From the apostle Paul’s letter to the Philippians:

This is my prayer:
that your love may abound more and more
in knowledge and depth of insight,
so that you may be able to discern what is best and may be pure and blameless for the day of Christ, filled with the fruit of righteousness that comes through Jesus Christ—to the glory and praise of God.

What is this course?

  • visualization: communicating data to humans
  • modeling and validation: using data to perform tasks
  • but first, data wrangling: how to transform data into the structure you need for visualization and modeling
    • Not covering data collection
    • Often where important decisions are made

…using computing in Python and occasionally other tools

Where are we going?

  • Introduction (weeks 1 and 2)
  • Vis Design (week 3)
  • Vis Implementation (week 4)
  • Wrangling (weeks 5 and 6)
  • Midterm project: redesign and recreate a visualization (starts week 3, presentations and revision week 8)
  • Modeling: Design (week 9)
  • Modeling: Neighbors and trees (week 10)
  • Modeling: Other models (week 11)
  • Validation (week 12)
  • LLMs, other topics, and final project (weeks 13, 14 and 15)

Discussions on ethics and perspectives are woven in throughout the course.

Optional material

  • Clustering (Unsupervised Learning)
  • Databases and APIs
  • Text Data
  • Geospatial Data
  • Audio and Image Data

Feedback from Prior Years

“What aspects of this course most helped your learning?”

Lab. Homework. “forced our brains to work.”

“What additional or different things could you have done to enhance your learning?”

“gone to more office hours”, “gone over the exercises and homework more”

Weekly Rhythm

  • Monday and Wednesday in classroom (mix of lecture, discussion, and activities)
  • Friday in lab (mostly working on exercises or projects)

Technology We’ll Use

  • RStudio: A powerful environment for working with data (even in Python!)
    • Most students will use Calvin’s installation https://r.cs.calvin.edu/
    • You can also install it on your own computer
    • Quarto for reproducible reports
    • Plotly for making plots
  • Perusall
    • Textbook reading assignments: ask questions so we know what to focus on in class
    • Announcement and Q&A forums; you can post anonymously if you want
    • Perspectival readings: annotate to share your thoughts

Note: Perusall is new for this course, so expect some growing pains.

Example Projects

Example final projects

  • Predict how much a used car will sell for
  • Forecast how much electricity will be used
  • Predict how much a plane flight will cost

Goals

A reading

Seek good and not evil, that you may live,
and so the Lord, the God of hosts, will be with you, just as you have said.
Hate evil and love good, and establish justice in the gate;
it may be that the Lord, the God of hosts, will be gracious to the remnant of Joseph.

I hate, I despise your festivals, and I take no delight in your solemn assemblies. Even though you offer me your burnt offerings and grain offerings, I will not accept them, and the offerings of well-being of your fatted animals I will not look upon. …
But let justice roll down like water and righteousness like an ever-flowing stream.

Amos 5:14-15, 21-24 NRSV

What might this mean for us working with data?

Our Goals

  • Skill: how to work with the tools
  • Knowledge: understanding the underlying concepts
  • Dispositions (virtues): habits of using these skills wisely

Humility

Challenge: data feels powerful, and people listen to what you use it to say.

So we will practice:

  • Citing all sources (for both data and process)
  • Acknowledging limitations
  • Noticing and reporting our analysis decisions and possible alternatives
  • Validation of results

Integrity

It’s tempting to say something that isn’t entirely true, or to manipulate the collection/analysis/reporting process to yield the answer you want.


So we will practice:

  • Evaluating claims that others use data to make
  • Clearly articulating our analysis decisions and rationale
  • Reproducibility
  • Using exploratory analytics to validate data against assumptions
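
One way to practice that last item, sketched in Python with pandas: write down what you assume about a table, then check it before analyzing. The small table and the three checks below are hypothetical, made up for illustration; the right checks depend on the dataset.

```python
import pandas as pd

# Hypothetical data for illustration only; one row has a deliberate problem.
df = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2000, 2001, 2000, 2001],
    "percent_yes": [0.25, 0.75, 1.5, 0.5],  # 1.5 is not a valid proportion
})

# Assumption 1: proportions lie between 0 and 1.
out_of_range = df[~df["percent_yes"].between(0, 1)]

# Assumption 2: each (country, year) pair appears at most once.
n_duplicates = int(df.duplicated(subset=["country", "year"]).sum())

# Assumption 3: no values are missing.
n_missing = int(df.isna().sum().sum())

print(out_of_range)
print("duplicate rows:", n_duplicates, "missing values:", n_missing)
```

Rows that fail a check are a prompt to investigate and report, not necessarily to delete.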

Hospitality

We can choose to use our tools to elucidate and clarify, rather than obscure.

So we will practice:

  • Clear visual communication
  • Clarity of code and process
  • Writing explanations that are accessible and appropriate to the audience

Compassion and Justice

Data science can both cause harm and reveal it.

So we will:

  • Study examples of how data might cause harm
  • Study examples of how harm might be mitigated or revealed

A Worked Example

The data

      country  year                         issue  percent_yes
0      Turkey  1946                   Colonialism     0.800000
1      Turkey  1946          Economic development     0.600000
2      Turkey  1946                  Human rights     0.000000
3      Turkey  1947                   Colonialism     0.222222
4      Turkey  1947          Economic development     0.500000
...       ...   ...                           ...          ...
1207       US  2019  Arms control and disarmament     0.187500
1208       US  2019          Economic development     0.187500
1209       US  2019                  Human rights     0.357143
1210       US  2019          Palestinian conflict     0.000000
1211       US  2019  Nuclear weapons and material     0.058824

1212 rows × 4 columns
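
A summary table like this is usually the product of data wrangling. As a rough sketch, assuming the raw data has one row per country per vote (the `raw` table and its `vote` column are hypothetical and are not the course's actual dataset), the fraction of yes votes in each group could be computed with a pandas group-by:

```python
import pandas as pd

# Hypothetical raw records: one row per country per vote.
raw = pd.DataFrame({
    "country": ["Turkey", "Turkey", "Turkey", "US", "US"],
    "year": [1946, 1946, 1946, 1946, 1946],
    "issue": ["Colonialism"] * 5,
    "vote": ["yes", "yes", "no", "no", "yes"],
})

# The mean of a True/False column is the fraction of True values.
summary = (
    raw.assign(is_yes=raw["vote"] == "yes")
    .groupby(["country", "year", "issue"], as_index=False)
    .agg(percent_yes=("is_yes", "mean"))
)
print(summary)
```

Note that `percent_yes` here is a proportion between 0 and 1, matching the values in the table above.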

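import plotly.express as px
# `votes` is the table above; `show_plot` is not defined here and presumably displays the figure
show_plot(
  px.scatter(votes, x='year', y='percent_yes'))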

show_plot(
  px.scatter(votes, x='year', y='percent_yes',
    facet_col='issue', facet_col_wrap=3, facet_col_spacing=.1))

show_plot(
  px.scatter(votes, x='year', y='percent_yes', color='country',
    facet_col='issue', facet_col_wrap=3, facet_col_spacing=.1))

# trendline='lowess' draws a smoothed curve for each group (requires the statsmodels package)
show_plot(
  px.scatter(votes, x='year', y='percent_yes', color='country',
    facet_col='issue', facet_col_wrap=3, facet_col_spacing=.1,
    trendline='lowess'))

show_plot(
  px.scatter(votes, x='year', y='percent_yes', color='country',
    facet_col='issue', facet_col_wrap=3, facet_col_spacing=.1,
    trendline='lowess',
    labels={"percent_yes": "% Yes Votes", "year": "Year", "country": "Country"},
    title="Percentage of 'Yes' votes in the UN General Assembly"))

show_plot(
  px.scatter(votes, x='year', y='percent_yes', color='country',
    facet_col='issue', facet_col_wrap=3, facet_col_spacing=.1,
    trendline='lowess',
    labels={"percent_yes": "% Yes Votes", "year": "Year", "country": "Country"},
    title="Percentage of 'Yes' votes in the UN General Assembly")
  .update_traces(marker_size=2)) # Smaller markers

show_plot(
  px.scatter(votes, x='year', y='percent_yes', color='country',
    facet_col='issue', facet_col_wrap=3, facet_col_spacing=.1,
    trendline='lowess',
    labels={"percent_yes": "% Yes Votes", "year": "Year", "country": "Country"},
    title="Percentage of 'Yes' votes in the UN General Assembly")
  .update_traces(marker_size=2) # Smaller markers
  # Remove the "issue="
  .for_each_annotation(lambda a: a.update(text=a.text.split("=", 1)[-1])))

show_plot(
  px.scatter(votes, x='year', y='percent_yes', color='country',
    facet_col='issue', facet_col_wrap=3, facet_col_spacing=.1,
    trendline='lowess',
    labels={"percent_yes": "% Yes Votes", "year": "Year", "country": "Country"},
    title="Percentage of 'Yes' votes in the UN General Assembly")
  .update_traces(marker_size=2) # Smaller markers
  # Remove the "issue="
  .for_each_annotation(lambda a: a.update(text=a.text.split("=", 1)[-1]))
  .update_yaxes(tickformat=",.0%")) # Percent labels
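
The last two steps are small text-formatting tricks that can be tried on their own, without making a plot. The sketch below assumes facet titles have the form `issue=Colonialism`, which is what the `split` call above relies on. It uses Python's built-in `format` as a stand-in for the tick format string: Plotly uses d3-format, which is modeled on Python's format mini-language, so the two should agree for simple values like these.

```python
# Facet titles in the assumed "column=value" form.
titles = ["issue=Colonialism", "issue=Human rights"]

# Keep only the part after the first "=".
cleaned = [t.split("=", 1)[-1] for t in titles]
print(cleaned)

# Format proportions as whole-number percentages.
values = [0.8, 0.222222, 0.357143]
labels = [format(v, ",.0%") for v in values]
print(labels)
```

Using `split("=", 1)` splits only at the first equals sign, so a value that itself contains `=` would be kept whole.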