W02D1: Data, Questions, Tools

Humility

“God opposes the proud but gives grace to the humble”

Data Issues (from textbook)

  • Important to recognize potential data issues:
    • You never have all the data you’d ideally have.
    • You probably got a subset of data that isn’t representative.
    • The measurements in your data probably aren’t what you’d hope for.
    • You probably won’t be able to make causal claims.
  • But data can nevertheless be really useful!
  • Not important: memorizing terminology for these issues (the textbook’s terms aren’t universal).

Data Lifecycle

  • Sometimes: start with a question, seek out data
  • Sometimes: start with data, “I wonder…”

Example from textbook: CO2 measurement

  • Question (narrowly): what is the current level of CO2 in the atmosphere?
  • The Truth: count total number of CO2 molecules in the atmosphere, divide by volume of atmosphere.
  • As usual we can’t measure The Truth directly; we infer it from related measurements. (this will be a recurring theme)

Types of Questions

  • Descriptive: what is the current level of CO2 in the atmosphere?
  • Exploratory: what factors are associated with CO2 levels?
  • Inferential: can we conclude that CO2 levels are rising?
  • Predictive: can we predict future CO2 levels?

(Generated by GitHub Copilot)

Modeling helps us see something we can’t see:

  • Inference: something about the population in general
  • Prediction: something about an individual example

Retrieval Break

On a sheet of paper

  • Sketch one of the data frames we looked at last week. (bikeshare, gapminder, or UN votes)

    • What columns did it have?
    • What did each row represent?
    • Fill in (made up) values for one or two rows.
  • Sketch one of the plots that we made with that data. What went on the axes? (Do you happen to remember the syntax we used to make the plot? okay if not yet!)

  • In the tutorial we looked at a few ways of seeing the overall shape of a data frame. Do you remember:

    • What did one of those results look like?
    • What syntax did we use to get it? (okay if not yet)

Review of Plotting

Example: replicating the Health and Wealth plot

Tools for Reproducible Analyses

Data scientist Venn diagram

Reproducible Analyses

Science is based on experiments and analyses that are reproducible.

In data science, achieving reproducibility requires that we be clear about:

  • the nature and source of our data.
  • the process we used to analyze that data.
  • the results of the analysis.
  • the justifications for our conclusions.

Our documented analyses must be clear enough that others can:

  • access or reproduce the original data.
  • understand/rerun the data processing code.
  • rebuild the tables and the visualizations.
  • assess the reasoning behind the conclusions.

Ultimately, we’d like to extend the work to other related datasets and analyses.

Why not just use Excel?

  • Spreadsheets are useful
  • Often good for exploring data, trying out initial analysis
  • But:
    • formulas are hidden inside cells, which makes them hard to debug.
    • limited statistical and modeling tools.
    • analyses are not:
      • Reproducible (recreate the same analysis).
      • Reusable (apply the same analysis to different data).

Building Reproducible Analyses

To achieve the goal of reproducibility, data scientists commonly use the following toolkit:

  • Programming (e.g., Python or R)
  • Literate programming (e.g., Quarto, RMarkdown, Jupyter Notebooks)
  • Version control (e.g., Git and GitHub)

RStudio integrates support for all three.

We focus on the first two in this course.

Libraries

  • Python provides basic data structures (lists, dictionaries, etc.)
  • But many common operations are not built in.
  • Instead, we use libraries (also called packages).

Examples:

  • pandas provides DataFrame data structures for tabular data
  • plotly provides plotting functions
  • scikit-learn provides machine learning algorithms

Tabular data: The data frame

  • A rectangular table of data, like a spreadsheet
  • Each row is an observation
  • Each column is a variable
  • Each cell is a value
  • All rows have the same columns
  • Columns have names
  • Values in a column have the same type
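
The properties above can be illustrated by building a tiny data frame by hand (a made-up three-row sketch, not one of the course datasets):

```python
import pandas as pd

# Each key is a column name; each list holds that column's values.
rides = pd.DataFrame({
    "date": pd.to_datetime(["2011-01-01", "2011-01-02", "2011-01-03"]),
    "workingday": ["weekend", "weekend", "workday"],
    "total_rides": [654, 670, 1229],
})

print(rides.shape)   # → (3, 3): every row has the same 3 columns
print(rides.dtypes)  # each column has a single type (datetime, string, integer)
```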

Loading data from CSV

“Comma-Separated Values”: each line is a row, columns are separated by commas

date,day_of_week,workingday,total_rides
2011-01-01,6,weekend,654
2011-01-02,0,weekend,670
2011-01-03,1,workday,1229
2011-01-04,2,workday,1454
...

Load into a data frame:

import pandas as pd
daily_rides = pd.read_csv("https://calvin-data-science.github.io/data202/data/bikeshare/day_simple.csv", parse_dates=["date"])

Note: pandas has to guess data types, and sometimes it guesses wrong. For example, by default the date column would be read in as strings; parse_dates=["date"] tells pandas to parse it as actual dates instead.
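
To see the difference parse_dates makes, we can read the same CSV text both ways. (This sketch uses io.StringIO with a couple of rows pasted inline so it runs without network access; the course data comes from the URL above.)

```python
import io
import pandas as pd

csv_text = """date,day_of_week,workingday,total_rides
2011-01-01,6,weekend,654
2011-01-02,0,weekend,670
"""

# Without parse_dates, pandas guesses: the date column comes in as strings (object dtype).
guessed = pd.read_csv(io.StringIO(csv_text))
print(guessed["date"].dtype)   # → object

# With parse_dates, the column becomes a real datetime.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
print(parsed["date"].dtype)    # → datetime64[ns]
```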

Inspecting a data frame

daily_rides.head()
date day_of_week workingday total_rides
0 2011-01-01 6 weekend 654
1 2011-01-02 0 weekend 670
2 2011-01-03 1 workday 1229
3 2011-01-04 2 workday 1454
4 2011-01-05 3 workday 1518
daily_rides.shape
(731, 4)
daily_rides.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   date         731 non-null    datetime64[ns]
 1   day_of_week  731 non-null    int64         
 2   workingday   731 non-null    object        
 3   total_rides  731 non-null    int64         
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 23.0+ KB

Accessing a data frame

  • df["column_name"] accesses a column
  • df.loc[row_index, "column_name"] accesses a cell
  • df.iloc[row_index, column_index] accesses a cell by index

Example:

daily_rides["date"]
0     2011-01-01
1     2011-01-02
2     2011-01-03
3     2011-01-04
4     2011-01-05
         ...    
726   2012-12-27
727   2012-12-28
728   2012-12-29
729   2012-12-30
730   2012-12-31
Name: date, Length: 731, dtype: datetime64[ns]
daily_rides["date"][0]
Timestamp('2011-01-01 00:00:00')
daily_rides.loc[0, "date"]
Timestamp('2011-01-01 00:00:00')
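
The output above shows column access and .loc; here is a small self-contained sketch comparing all three styles on a made-up two-row frame (not the real bikeshare data):

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2011-01-01", "2011-01-02"]),
    "total_rides": [654, 670],
})

col = df["total_rides"]              # whole column (a Series)
by_label = df.loc[1, "total_rides"]  # row label 1, column name
by_position = df.iloc[1, 1]          # second row, second column, by integer position

# With the default RangeIndex, row labels and row positions happen to coincide.
print(by_label == by_position)  # → True
```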