Missing Data

Marking Missingness

Data sources use various methods to mark missing data. You might see “NA”, “NULL”, “none”, etc.; pandas’ read_csv automatically recognizes many of these as missing values. But some datasets use different encodings.

For example, sometimes a dataset will use a large positive or negative sentinel value (such as -9999) to mark missing entries. You should expect to find this documented in the data dictionary, although you occasionally need to make an educated guess (document your reasoning!).

Two approaches to handling this:

  1. Tell read_csv which specific values indicate missingness in your dataset. For example, read_csv('data.csv', na_values=['-9999']) will treat -9999 as a missing value.
  2. Use replace to convert the sentinel values to np.nan after reading in the data. For example, df_with_missingness_marked = df.replace(-9999, np.nan) will replace -9999 with np.nan throughout the DataFrame df. Note that this applies to the entire DataFrame, not just a single column; usually it’s cleaner to handle each column separately, e.g., df['column_name'] = df['column_name'].replace(-9999, np.nan). Both approaches are sketched below.
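A minimal sketch of both approaches, assuming a file named data.csv whose income column uses -9999 as its missing-value sentinel (the file name, column name, and sentinel are placeholders; check your own data dictionary):

    import numpy as np
    import pandas as pd

    # Approach 1: tell read_csv up front that -9999 marks a missing value.
    df = pd.read_csv('data.csv', na_values=[-9999])

    # Approach 2: read the file as-is, then replace the sentinel in one column.
    df = pd.read_csv('data.csv')
    df['income'] = df['income'].replace(-9999, np.nan)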

Do I have any missing values?

Use df.info() to get an overview of the data types and the non-null count in each column; any column whose non-null count is less than the number of rows has missing values.
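For example, assuming your DataFrame is named df:

    # Data types and non-null counts for each column.
    df.info()

    # Number of missing values in each column (a common companion check).
    df.isnull().sum()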

Handling missingness

  1. Figure out how big of a problem you have. How many missing values are there? Are they concentrated in a few columns, or spread out? Are they missing at random, or is there a pattern? Report your findings.
  2. Drop or interpolate (see the sketch below).
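A rough sketch of both steps, assuming a DataFrame named df (the temperature column in the interpolation example is a placeholder):

    # Step 1: size up the problem.
    n_missing = df.isnull().sum()        # missing values per column
    print(n_missing[n_missing > 0])      # only the columns with any missingness
    print(df.isnull().mean().sort_values(ascending=False))  # fraction missing per column

    # Step 2, option A: drop rows that have any missing values.
    df_complete = df.dropna()

    # Step 2, option B: interpolate a numeric column (reasonable for ordered data
    # such as a time series; not appropriate for every column).
    df['temperature'] = df['temperature'].interpolate()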

Often, missing values get dropped implicitly, either by a numerical operation or by a plotting function. For example, df['column_name'].mean() ignores missing values, and Plotly will often silently drop rows that have missing values in any of the columns you’re plotting.
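A small demonstration with made-up values:

    import numpy as np
    import pandas as pd

    s = pd.Series([1.0, np.nan, 3.0])
    print(s.mean())              # 2.0, because the NaN is silently skipped
    print(s.mean(skipna=False))  # nan, only if you explicitly ask for it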

So, even if you’re not explicitly dropping missing values, you still need to analyze your data for missingness.

Approaches:

  1. Drop rows with missing values, and count how many rows were dropped.
  2. Make a new variable indicating whether another variable was missing, e.g., df['age_was_missing'] = df['age'].isnull(). Then you can do all your EDA on the new variable, and report how many missing values there were and whether there are any patterns (e.g., are there groups of people who are more likely to have missing values?). Both approaches are sketched below.
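A sketch of both approaches; df, the age column, and the group column used to check for patterns are placeholder names:

    # Approach 1: drop rows with missing values and report how many were lost.
    n_before = len(df)
    df_dropped = df.dropna()
    print(f"Dropped {n_before - len(df_dropped)} of {n_before} rows")

    # Approach 2: add an indicator column and look for patterns, e.g., does
    # the missingness rate differ across groups?
    df['age_was_missing'] = df['age'].isnull()
    print(df['age_was_missing'].sum())                    # how many values are missing
    print(df.groupby('group')['age_was_missing'].mean())  # missingness rate by group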