Wrap-Up

Today

Please fill out course evaluations
- It’s helpful data for me, my department, and the university.
- Each comment matters. Each rating matters.
- Be honest, be balanced. Include both things to keep and things to change.
Communication Tips
Missing Data, Time Series
Virtuous Data Science: wrap-up

Logistics

Sign up for final project presentation slot (see Moodle)
- Make the title a link to your slides
- Lab is busy, so: North Hall 253
All work should be in before Finals (or make specific arrangements with me)
Moodle grades aren’t accurate yet, but hopefully will be soon!

Communication

Key points

Consider the audience to get the level of detail right.
- Never assume your audience can rapidly process complex visuals. (Claus Wilke)
Consider the purpose to choose report vs dashboard vs presentation
Anchor claims in data.
Tell stories (e.g., “but-therefore”)

Make a point

Report A

MAIN POINT

Supporting chart 1
Supporting table 2
Supporting model 3

Discussion about how each supports main point

Report B

Chart 1
Table 2
Model 3
Chart 4
Table 5
Chart 6
Model 7
Chart 12
Table 25

Tell a Story

Chart 1
Therefore, chart 2
BUT, chart 3

but-therefore

See also: “Telling a story and making a point”

Anchor conclusions in data

The units are probably seconds, because the median, 600, would be 10 minutes.
The fit looks good, because the mean error of $15 is less than 0.1% of the price.
This was surprising, because I expected that people would leave higher ratings on products they enjoyed more.
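
The arithmetic behind the first two claims can be checked directly. This is a minimal sketch, assuming the raw values are in seconds and using a hypothetical typical price of $20,000 (the slide does not state the price):

```python
# Value from the slide; the unit (seconds) is the assumption being checked.
median_duration = 600
median_minutes = median_duration / 60  # a 10-minute median is plausible

# Mean error from the slide; typical_price is a hypothetical stand-in.
mean_error = 15.0
typical_price = 20_000.0
relative_error = mean_error / typical_price

print(f"median = {median_minutes:.0f} minutes")
print(f"mean error = {relative_error:.3%} of price")
```

Stating the number next to the claim lets the reader judge whether "probably" and "looks good" are justified.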

Use appropriate language

Plain language for the overview, conclusion, and visuals.

Labels in visuals: use real names, not code_names. (For all variables, not just x and y.)
Don’t assume the reader knows the structure of the data.

Technical language when describing methods (data acquisition, wrangling, modeling, etc.).

What data representation choices did you make? Why?
What modeling choices? Why? etc.

Brief Notes

Missing Data

Missingness is often informative.
- e.g., “I don’t want to answer that question.”
- e.g., “Only people who have a certain condition are asked this question.”
- So you might want to add a column: df['f1_missing'] = df['f1'].isna()
Never blindly drop rows with missing data
- At least count how many you’re dropping.
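
Both points can be sketched with a few lines of pandas. The data below is made up for illustration; `f1` is the column name from the bullet above, and `f2` is a hypothetical second column.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data with gaps in two columns.
df = pd.DataFrame({
    "f1": [3.0, np.nan, 5.0, np.nan, 4.0],
    "f2": [1.0, 2.0, np.nan, 4.0, 5.0],
})

# Record missingness as its own feature before imputing or dropping anything.
df["f1_missing"] = df["f1"].isna()

# Count what a drop would cost before committing to it.
n_before = len(df)
df_complete = df.dropna(subset=["f1", "f2"])
n_dropped = n_before - len(df_complete)
print(f"Dropping {n_dropped} of {n_before} rows ({n_dropped / n_before:.0%})")
```

If the dropped share is large, or concentrated in one group, that is worth reporting alongside the results.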

Time Series

“Forecasting” could be a whole course (e.g., “Forecasting: Principles and Practice”)
Some approaches
- Naive: Predict that the last sample continues
- Trend: Predict that a (linear) trend continues
- Seasonal: Predict the same value as this time last year
- Fancier methods: Moving average, exponential smoothing, ARIMA, …
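
The three simple baselines can be sketched in NumPy. The series here is hypothetical (three years of quarterly values), and `period` is an assumed name for the number of observations per seasonal cycle.

```python
import numpy as np

# Hypothetical quarterly series: 3 years, 4 observations per year.
y = np.array([10.0, 20.0, 30.0, 20.0,
              12.0, 22.0, 32.0, 22.0,
              14.0, 24.0, 34.0, 24.0])
period = 4

# Naive: the last observed value continues.
naive_forecast = y[-1]

# Trend: fit a straight line to the history and extend it one step.
t = np.arange(len(y))
slope, intercept = np.polyfit(t, y, deg=1)
trend_forecast = slope * len(y) + intercept

# Seasonal naive: the value from one full period before the forecast step.
seasonal_forecast = y[-period]

print(naive_forecast, round(trend_forecast, 2), seasonal_forecast)
```

Baselines like these are useful as a yardstick: a fancier model should be expected to beat them before it is trusted.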

Lagged (Shifted) Features

A simple approach to time series prediction is to shift the target variable by one time step and use it as a feature.

from sklearn.datasets import fetch_openml

bike_sharing = fetch_openml(
    "Bike_Sharing_Demand", version=2, as_frame=True, parser="pandas"
)
df = bike_sharing.frame
df.head()

	season	month	hour	holiday	weekday	workingday	weather	temp	feel_temp	humidity	count
0	spring	1	0	False	6	False	clear	9.84	14.395	0.81	16
1	spring	1	1	False	6	False	clear	9.02	13.635	0.80	40
2	spring	1	2	False	6	False	clear	9.02	13.635	0.80	32
3	spring	1	3	False	6	False	clear	9.84	14.395	0.75	13
4	spring	1	4	False	6	False	clear	9.84	14.395	0.75	1

target_var = "count"
df['prev_count'] = df[target_var].shift(1)
df[['prev_count', target_var]].head()

	prev_count	count
0	NaN	16
1	16.0	40
2	40.0	32
3	32.0	13
4	13.0	1
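
The same idea extends to several lags. This sketch uses a small synthetic frame rather than the bike-sharing data, and assumes the rows are already sorted by time and evenly spaced (one row per hour); the column names `count_lag_1h` and `count_lag_24h` are illustrative.

```python
import pandas as pd

# Hypothetical hourly counts standing in for the target (48 hours).
df = pd.DataFrame({"count": range(100, 148)})

# shift(k) moves values down k rows, so row t holds the value from time t - k.
df["count_lag_1h"] = df["count"].shift(1)
df["count_lag_24h"] = df["count"].shift(24)

# Early rows lack a full set of lags; count them, then drop them.
lag_cols = ["count_lag_1h", "count_lag_24h"]
n_incomplete = int(df[lag_cols].isna().any(axis=1).sum())
df_model = df.dropna(subset=lag_cols)
print(f"{n_incomplete} rows lack a full lag history")
```

If the data has gaps or covers several independent series, a plain `shift` would pair values that are not actually adjacent in time, so that assumption is worth checking first.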

Time-Series Validation

Don’t randomly split your data into train/test sets.
Do split your data into train/test sets by time.
- e.g., train on the first year, test on the second year
More general approach: time series cross-validation
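
One way to do time series cross-validation is scikit-learn's `TimeSeriesSplit`, which yields expanding training windows followed by later test windows. A minimal sketch on hypothetical data, assuming the rows are already in time order:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical data: 12 time-ordered observations.
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))
for fold, (train_idx, test_idx) in enumerate(splits):
    # Every training index precedes every test index,
    # so the model is never evaluated on data older than what it saw.
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Each fold trains on everything before its test window, which mimics how a forecast would be used in practice.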

Wrap-Up of Virtuous Data Science

What virtues have we practiced in this course? What virtues should data scientists have?

Virtues

Cardinal Virtues
- Prudence
- Justice
- Temperance
- Courage
Theological Virtues
- Faith
- Hope
- Love

Vices

Pride
Greed
Lust
Envy
Gluttony
Wrath
Sloth

Ethics

Not just moral dilemmas; we make ethical choices constantly.

Normative / deontological: “is this right?”
Situational: “is this promoting what is good?”
Virtuous: “How does a good person act? How do I become a good person?”

Objectives

Describe how Reformed concepts of justice and shalom apply to data collection, analysis, sharing, and use.
Give examples of specific concerns around privacy, bias, accountability, and transparency
Describe steps and dispositions that individual data scientists can take to act justly in their profession

A bullet-point summary of biblical justice

Community above individual (voluntarily)
Equity: equal treatment, dignity
Collective responsibility
Individual responsibility
Advocacy for poor and marginalized

What does biblical justice require, in the area of data science?

Community

The righteous are willing to disadvantage themselves to advantage the community; the wicked are willing to disadvantage the community to advantage themselves.

So:

privacy: what must we share? what must we not share?
integrity in data collection, analysis, reporting, communication

Thus:

Data analysis process: reproducible, transparent, documented
Reporting: transparency about limitations, choices, consideration of possible harms

Equity

Everyone must be treated equally and with dignity.

direct impact
- fair risk assessment (see Discussion and COMPAS)
- fair surveillance (don’t hyper-surveil the poor etc.)
- fair resource allocation
indirect impact
- don’t show ads for criminal background checks more often for Black names
- don’t tolerate higher speech recognition error rates for minorities
- show a representative diversity of age/gender/race/… in image searches
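
A first step toward noticing unequal error rates is to break an evaluation down by group. This is only a sketch on made-up results; the `group` and `correct` columns are hypothetical, and a real audit would need far more care about how groups are defined and sampled.

```python
import pandas as pd

# Hypothetical evaluation results: one row per test example.
results = pd.DataFrame({
    "group":   ["A", "A", "A", "A", "B", "B", "B", "B"],
    "correct": [True, True, True, False, True, False, False, True],
})

# Error rate per group, and the gap between the best- and worst-served groups.
error_rate_by_group = 1 - results.groupby("group")["correct"].mean()
gap = error_rate_by_group.max() - error_rate_by_group.min()

print(error_rate_by_group)
print(f"gap between groups: {gap:.2f}")
```

A single overall accuracy number would hide this gap entirely, which is why the per-group breakdown matters.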

Should we even be predicting people’s lives?

Risk assessment for criminality, loan approval, etc. requires predicting people’s future actions and situations.
These predictions might be terribly inaccurate. Should we be trying at all?

Despite using a rich dataset and applying machine-learning methods optimized for prediction, the best predictions were not very accurate and were only slightly better than those from a simple benchmark model.

Corporate responsibility: I am sometimes responsible for and involved in other people’s sins.

Even if I intend no prejudice, my algorithm could be prejudiced because of training data.
Even if my work is honest, I could be supporting a company that exploits other workers directly or relies on conflict minerals and child labor.
Environmental responsibility is both individual and collective

Individual responsibility: I am finally responsible for all my sins, but not for all my outcomes.

I must do what’s right, whether or not my company’s policies require it.
When something isn’t right, I need to say something even if it risks my job.

Advocacy: We must have special concern for the poor and the marginalized.

By exposing injustice through visualization and modeling
By listening to and amplifying their voices, not speaking for them.
e.g., beware of doing “parachute research” or de-contextualized “Data for Good”

Data science, and data scientists, are not saviors.

Incarnation

In your relationships with one another, have the
    same mindset as Christ Jesus:
Who, being in very nature God,
    did not consider equality with God something
      to be used to his own advantage;
rather, he made himself nothing
    by taking the very nature of a servant,
    being made in human likeness.
And being found in appearance as a man,
    he humbled himself
    by becoming obedient to death—
        even death on a cross!

Philippians 2:5-8, NIV

Learning More

Courses

DATA 304: Visualization
DATA 385: Topics in Data Science (varies; 24SP is “Cause and Design”)
CS 375 / 376: Machine Learning I/II
STAT 245: Applied Data Analysis
STAT 341: Computational Bayesian Statistics

Some further reading on data ethics

The Oxford Handbook of Ethics of AI
Coded Bias documentary
Fast.AI Data Ethics course
Ethics and Data Science by Mike Loukides, Hilary Mason, DJ Patil
Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy, by Cathy O’Neil
How Charts Lie: Getting Smarter about Visual Information, by Alberto Cairo
How Deceptive are Deceptive Visualizations? Pandey et al., CHI 2015

Other resources on Data Ethics

AlgorithmWatch
AI Now Institute
Data and Society
Harvard BKC
ACM Conference on Fairness, Accountability, and Transparency (FAccT)

Who/What I’m Reading / Following: Tech

“What can I do?”

Practice the “data dispositions”
- Humility (cite sources, acknowledge limitations, validate results)
- Integrity (check assumptions, reproduce analyses, evaluate others’ claims)
- Hospitality (clear visuals, clear reports, clear code)
- Compassion and justice

Listen a lot. To diverse opinions. (e.g., “The Flip Side”)
Keep in touch.