W02D1: Data, Questions, Tools

Humility

“God opposes the proud but gives grace to the humble”

Data Issues (from textbook)

  • Important to recognize potential data issues:
    • You never have all the data you’d ideally have.
    • You probably got a subset of data that isn’t representative.
    • The measurements in your data probably aren’t what you’d hope for.
    • You probably won’t be able to make causal claims.
  • But data can nevertheless be really useful!
  • Not important: memorizing terminology for these issues (the textbook’s terms aren’t universal).

Data Lifecycle

  • Sometimes: start with a question, seek out data
  • Sometimes: start with data, “I wonder…”

Example from textbook: CO2 measurement

  • Question (narrowly): what is the current level of CO2 in the atmosphere?
  • The Truth: count total number of CO2 molecules in the atmosphere, divide by volume of atmosphere.
  • As usual we can’t measure The Truth directly; we infer it from related measurements. (this will be a recurring theme)

Types of Questions

  • Descriptive: what is the current level of CO2 in the atmosphere?
  • Exploratory: what factors are associated with CO2 levels?
  • Inferential: can we conclude that CO2 levels are rising?
  • Predictive: can we predict future CO2 levels?

(Generated by GitHub Copilot)

Modeling helps us see something we can’t see:

  • Inference: something about the population in general
  • Prediction: something about an individual example

Retrieval Break

On a sheet of paper

  • Sketch one of the data frames we looked at last week. (bikeshare, gapminder, or UN votes)

    • What columns did it have?
    • What did each row represent?
    • Fill in (made up) values for one or two rows.
  • Sketch one of the plots that we made with that data. What went on the axes? (Do you happen to remember the syntax we used to make the plot? okay if not yet!)

  • In the tutorial we looked at a few ways of seeing the overall shape of a data frame. Do you remember:

    • What did one of those results look like?
    • What syntax did we use to get it? (okay if not yet)

Review of Plotting

Example: replicating the Health and Wealth plot

Tools for Reproducible Analyses

Data scientist Venn diagram

Reproducible Analyses

Science is based on experiments and analyses that are reproducible.

In data science, achieving reproducibility requires that we be clear about:

  • the nature and source of our data.
  • the process we used to analyze that data.
  • the results of the analysis.
  • the justifications for our conclusions.

Our documented analyses must be clear enough that others can:

  • access or reproduce the original data.
  • understand/rerun the data processing code.
  • rebuild the tables and the visualizations.
  • assess the reasoning behind the conclusions.

Ultimately, we’d like to extend the work to other related datasets and analyses.

Why not just use Excel?

  • Spreadsheets are useful
  • Often good for exploring data, trying out initial analysis
  • But:
    • formulas are hidden inside cells, which makes them hard to debug.
    • limited statistical and modeling tools.
    • analyses are not:
      • Reproducible (recreate the same analysis).
      • Reusable (apply the same analysis to different data).

Building Reproducible Analyses

To achieve the goal of reproducibility, data scientists commonly use the following toolkit:

  • Programming (e.g., Python or R)
  • Literate programming (e.g., Quarto, RMarkdown, Jupyter Notebooks)
  • Version control (e.g., Git and GitHub)

RStudio integrates support for all three.

We focus on the first two in this course.

Libraries

  • Python provides basic data structures (lists, dictionaries, etc.)
  • But many common operations are not built in.
  • Instead, we use libraries (also called packages).

Examples:

  • pandas provides DataFrame data structures for tabular data
  • plotly provides plotting functions
  • scikit-learn provides machine learning algorithms

Tabular data: The data frame

  • A rectangular table of data, like a spreadsheet
  • Each row is an observation
  • Each column is a variable
  • Each cell is a value
  • All rows have the same columns
  • Columns have names
  • Values in a column have the same type
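
The properties above can be illustrated by building a tiny data frame by hand (a made-up three-row sketch, not one of the course datasets):

```python
import pandas as pd

# Each key is a column name; each list holds that column's values.
rides = pd.DataFrame({
    "date": pd.to_datetime(["2011-01-01", "2011-01-02", "2011-01-03"]),
    "workingday": ["weekend", "weekend", "workday"],
    "total_rides": [654, 670, 1229],
})

print(rides.shape)   # → (3, 3): every row has the same 3 columns
print(rides.dtypes)  # each column has a single type (datetime, string, integer)
```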

Loading data from CSV

“Comma-Separated Values”: each line is a row, columns are separated by commas

date,day_of_week,workingday,total_rides
2011-01-01,6,weekend,654
2011-01-02,0,weekend,670
2011-01-03,1,workday,1229
2011-01-04,2,workday,1454
...

Load into a data frame:

import pandas as pd
daily_rides = pd.read_csv("https://calvin-data-science.github.io/data202/data/bikeshare/day_simple.csv", parse_dates=["date"])

Note: pandas has to guess data types, and sometimes it guesses wrong. For example, by default the date column would be read in as strings; parse_dates=["date"] tells pandas to parse it as actual dates instead.
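
To see the difference parse_dates makes, we can read the same CSV text both ways. (This sketch uses io.StringIO with a couple of rows pasted inline so it runs without network access; the course data comes from the URL above.)

```python
import io
import pandas as pd

csv_text = """date,day_of_week,workingday,total_rides
2011-01-01,6,weekend,654
2011-01-02,0,weekend,670
"""

# Without parse_dates, pandas guesses: the date column comes in as strings (object dtype).
guessed = pd.read_csv(io.StringIO(csv_text))
print(guessed["date"].dtype)   # → object

# With parse_dates, the column becomes a real datetime.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
print(parsed["date"].dtype)    # → datetime64[ns]
```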

Inspecting a data frame

daily_rides.head()
date day_of_week workingday total_rides
0 2011-01-01 6 weekend 654
1 2011-01-02 0 weekend 670
2 2011-01-03 1 workday 1229
3 2011-01-04 2 workday 1454
4 2011-01-05 3 workday 1518
daily_rides.shape
(731, 4)
daily_rides.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   date         731 non-null    datetime64[ns]
 1   day_of_week  731 non-null    int64         
 2   workingday   731 non-null    object        
 3   total_rides  731 non-null    int64         
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 23.0+ KB

Accessing a data frame

  • df["column_name"] accesses a column
  • df.loc[row_index, "column_name"] accesses a cell
  • df.iloc[row_index, column_index] accesses a cell by index

Example:

daily_rides["date"]
0     2011-01-01
1     2011-01-02
2     2011-01-03
3     2011-01-04
4     2011-01-05
         ...    
726   2012-12-27
727   2012-12-28
728   2012-12-29
729   2012-12-30
730   2012-12-31
Name: date, Length: 731, dtype: datetime64[ns]
daily_rides["date"][0]
Timestamp('2011-01-01 00:00:00')
daily_rides.loc[0, "date"]
Timestamp('2011-01-01 00:00:00')
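
The output above shows column access and .loc; here is a small self-contained sketch comparing all three styles on a made-up two-row frame (not the real bikeshare data):

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2011-01-01", "2011-01-02"]),
    "total_rides": [654, 670],
})

col = df["total_rides"]              # whole column (a Series)
by_label = df.loc[1, "total_rides"]  # row label 1, column name
by_position = df.iloc[1, 1]          # second row, second column, by integer position

# With the default RangeIndex, row labels and row positions happen to coincide.
print(by_label == by_position)  # → True
```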