“God opposes the proud but gives grace to the humble”
- Descriptive: what is the current level of CO2 in the atmosphere?
- Exploratory: what factors are associated with CO2 levels?
- Inferential: can we conclude that CO2 levels are rising?
- Predictive: can we predict future CO2 levels?
(Generated by GitHub Copilot)
Modeling helps us see something we can’t see:
Sketch one of the data frames we looked at last week. (bikeshare, gapminder, or UN votes)
Sketch one of the plots that we made with that data. What went on the axes? (Do you happen to remember the syntax we used to make the plot? okay if not yet!)
In the tutorial we looked at a few ways of seeing the overall shape of a data frame. Do you remember:
Science is based on experiments and analyses that are reproducible.
In data science, achieving reproducibility requires that we be clear about:
Our documented analyses must be clear enough that others can:
Ultimately, we’d like to extend the work to other related datasets and analyses.
To achieve the goal of reproducibility, data scientists commonly use the following toolkit:
RStudio integrates support for all three.
We focus on the first two in this course.
Examples:
- pandas provides DataFrame data structures for tabular data
- plotly provides plotting functions
- scikit-learn provides machine learning algorithms
“Comma-Separated Values”: each line is a row, columns are separated by commas
date,day_of_week,workingday,total_rides
2011-01-01,6,weekend,654
2011-01-02,0,weekend,670
2011-01-03,1,workday,1229
2011-01-04,2,workday,1454
...
Load into a data frame:
Note: pandas has to guess each column's data type, and sometimes it guesses wrong. For example, the date column is read as a string, but we want it to be a date.
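A minimal sketch of the loading step, under some assumptions: the slides do not give the file name, so the four sample rows are inlined through `io.StringIO` in place of a real file path. Passing `parse_dates` is one way to get a datetime column; it is not necessarily how the original was loaded.

```python
import io

import pandas as pd

# Assumption: the four sample rows shown above stand in for the full file,
# whose name is not given in the slides.
csv_text = """date,day_of_week,workingday,total_rides
2011-01-01,6,weekend,654
2011-01-02,0,weekend,670
2011-01-03,1,workday,1229
2011-01-04,2,workday,1454
"""

# With no hints, pandas guesses the type of each column on its own,
# and the date column is not turned into datetimes.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.dtypes)

# parse_dates asks pandas to convert the named columns to datetimes.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# info() prints the shape, the column names, and the type pandas settled on.
df.info()
```

With a real file, the `io.StringIO(...)` argument would be replaced by the path to the CSV.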
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   date         731 non-null    datetime64[ns]
 1   day_of_week  731 non-null    int64
 2   workingday   731 non-null    object
 3   total_rides  731 non-null    int64
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 23.0+ KB
- df["column_name"] accesses a column
- df.loc[row_index, "column_name"] accesses a cell by row label and column name
- df.iloc[row_index, column_index] accesses a cell by integer position
Example:
0 2011-01-01
1 2011-01-02
2 2011-01-03
3 2011-01-04
4 2011-01-05
...
726 2012-12-27
727 2012-12-28
728 2012-12-29
729 2012-12-30
730 2012-12-31
Name: date, Length: 731, dtype: datetime64[ns]
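The three access patterns can be tried side by side in a short sketch. As an assumption for illustration, the data frame here holds only the four sample rows shown earlier, inlined so the code runs without a file, and the variable names are hypothetical.

```python
import io

import pandas as pd

# Assumption: four sample rows inlined in place of the full dataset.
csv_text = """date,day_of_week,workingday,total_rides
2011-01-01,6,weekend,654
2011-01-02,0,weekend,670
2011-01-03,1,workday,1229
2011-01-04,2,workday,1454
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# A whole column, returned as a Series.
dates = df["date"]

# One cell, by row label and column name.
rides_by_label = df.loc[2, "total_rides"]

# The same cell, by integer position: third row, fourth column.
rides_by_position = df.iloc[2, 3]

print(dates)
print(rides_by_label, rides_by_position)
```

Because this data frame uses the default 0, 1, 2, ... row index, the row label and the row position happen to coincide; after filtering or sorting they can differ.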