Slides 4: Visualization Implementation

Q&A

  • Bar vs histogram
  • Color schemes
  • Map vs set
  • discrete vs continuous scales

Terminology

Number of variables involved

  • Univariate data analysis - distribution of single variable
  • Bivariate data analysis - relationship between two variables
  • Multivariate data analysis - relationship between many variables at once
    • usually focusing on bivariate relationships assuming other variables are held constant

Types of variables

  • Numerical variables can be classified as continuous or discrete based on whether or not the variable can take on an infinite number of values or only non-negative whole numbers, respectively.
  • If the variable is categorical, we can determine if it is ordinal based on whether or not the levels have a natural ordering.

Data

Data: Lending Club

  • Thousands of loans made through the Lending Club, a platform that allowed individuals to lend to other individuals

  • Not all loans are created equal – ease of getting a loan depends on (apparent) ability to pay back the loan

  • Data includes loans made, these are not loan applications

Take a peek at the data

loans_full_schema = pd.read_excel('../../static/data/openintro_loans_full_schema.xlsx')
loans_full_schema.info()
loans_full_schema.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 55 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   emp_title                         9167 non-null   object 
 1   emp_length                        9183 non-null   float64
 2   state                             10000 non-null  object 
 3   homeownership                     10000 non-null  object 
 4   annual_income                     10000 non-null  float64
 5   verified_income                   10000 non-null  object 
 6   debt_to_income                    9976 non-null   float64
 7   annual_income_joint               1495 non-null   float64
 8   verification_income_joint         1455 non-null   object 
 9   debt_to_income_joint              1495 non-null   float64
 10  delinq_2y                         10000 non-null  int64  
 11  months_since_last_delinq          4342 non-null   float64
 12  earliest_credit_line              10000 non-null  int64  
 13  inquiries_last_12m                10000 non-null  int64  
 14  total_credit_lines                10000 non-null  int64  
 15  open_credit_lines                 10000 non-null  int64  
 16  total_credit_limit                10000 non-null  int64  
 17  total_credit_utilized             10000 non-null  int64  
 18  num_collections_last_12m          10000 non-null  int64  
 19  num_historical_failed_to_pay      10000 non-null  int64  
 20  months_since_90d_late             2285 non-null   float64
 21  current_accounts_delinq           10000 non-null  int64  
 22  total_collection_amount_ever      10000 non-null  int64  
 23  current_installment_accounts      10000 non-null  int64  
 24  accounts_opened_24m               10000 non-null  int64  
 25  months_since_last_credit_inquiry  8729 non-null   float64
 26  num_satisfactory_accounts         10000 non-null  int64  
 27  num_accounts_120d_past_due        9682 non-null   float64
 28  num_accounts_30d_past_due         10000 non-null  int64  
 29  num_active_debit_accounts         10000 non-null  int64  
 30  total_debit_limit                 10000 non-null  int64  
 31  num_total_cc_accounts             10000 non-null  int64  
 32  num_open_cc_accounts              10000 non-null  int64  
 33  num_cc_carrying_balance           10000 non-null  int64  
 34  num_mort_accounts                 10000 non-null  int64  
 35  account_never_delinq_percent      10000 non-null  float64
 36  tax_liens                         10000 non-null  int64  
 37  public_record_bankrupt            10000 non-null  int64  
 38  loan_purpose                      10000 non-null  object 
 39  application_type                  10000 non-null  object 
 40  loan_amount                       10000 non-null  int64  
 41  term                              10000 non-null  int64  
 42  interest_rate                     10000 non-null  float64
 43  installment                       10000 non-null  float64
 44  grade                             10000 non-null  object 
 45  sub_grade                         10000 non-null  object 
 46  issue_month                       10000 non-null  object 
 47  loan_status                       10000 non-null  object 
 48  initial_listing_status            10000 non-null  object 
 49  disbursement_method               10000 non-null  object 
 50  balance                           10000 non-null  float64
 51  paid_total                        10000 non-null  float64
 52  paid_principal                    10000 non-null  float64
 53  paid_interest                     10000 non-null  float64
 54  paid_late_fees                    10000 non-null  float64
dtypes: float64(17), int64(25), object(13)
memory usage: 4.2+ MB
emp_title emp_length state homeownership annual_income verified_income debt_to_income annual_income_joint verification_income_joint debt_to_income_joint ... sub_grade issue_month loan_status initial_listing_status disbursement_method balance paid_total paid_principal paid_interest paid_late_fees
0 global config engineer 3.0 NJ MORTGAGE 90000.0 Verified 18.01 NaN NaN NaN ... C3 Mar-2018 Current whole Cash 27015.86 1999.33 984.14 1015.19 0.0
1 warehouse office clerk 10.0 HI RENT 40000.0 Not Verified 5.04 NaN NaN NaN ... C1 Feb-2018 Current whole Cash 4651.37 499.12 348.63 150.49 0.0
2 assembly 3.0 WI RENT 40000.0 Source Verified 21.15 NaN NaN NaN ... D1 Feb-2018 Current fractional Cash 1824.63 281.80 175.37 106.43 0.0
3 customer service 1.0 PA RENT 30000.0 Not Verified 10.16 NaN NaN NaN ... A3 Jan-2018 Current whole Cash 18853.26 3312.89 2746.74 566.15 0.0
4 security supervisor 10.0 CA RENT 35000.0 Verified 57.96 57000.0 Verified 37.66 ... C3 Mar-2018 Current whole Cash 21430.15 2324.65 1569.85 754.80 0.0

5 rows × 55 columns

Select a few variables

loans = loans_full_schema[['loan_amount', 'loan_purpose', 'interest_rate', 'term', 'grade', 'state', 'annual_income', 'homeownership', 'debt_to_income']]
loans.head()
loan_amount loan_purpose interest_rate term grade state annual_income homeownership debt_to_income
0 28000 moving 14.07 60 C NJ 90000.0 MORTGAGE 18.01
1 5000 debt_consolidation 12.61 36 C HI 40000.0 RENT 5.04
2 2000 other 17.09 36 D WI 40000.0 RENT 21.15
3 21600 debt_consolidation 6.72 36 A PA 30000.0 RENT 10.16
4 23000 credit_card 14.07 36 C CA 35000.0 RENT 57.96

Selected variables

variable description
loan_amount Amount of the loan received, in US dollars
loan_purpose Purpose of the loan
interest_rate Interest rate on the loan, in an annual percentage
term The length of the loan, which is always set as a whole number of months
grade Loan grade, which takes a values A through G and represents the quality of the loan and its likelihood of being repaid
state US state where the borrower resides
annual_income Borrower’s annual income, including any second income, in US dollars
homeownership Indicates whether the person owns, owns but has a mortgage, or rents
debt_to_income Debt-to-income ratio

Variable types

variable type
loan_amount numerical, continuous
loan_purpose categorical, not ordinal
interest_rate numerical, continuous
term numerical, discrete
grade categorical, ordinal
state categorical, not ordinal
annual_income numerical, continuous
homeownership categorical, not ordinal
debt_to_income numerical, continuous

Visualizing numerical data

Describing shapes of numerical distributions

  • center: mean, median, mode (not always useful)
  • spread: range, standard deviation, inter-quartile range
  • shape:
    • skewness: right-skewed, left-skewed, symmetric (skew is to the side of the longer tail)
    • modality: unimodal, bimodal, multimodal, uniform
  • unusual observations

Histograms

Plotting a histogram

px.histogram(loans, x="loan_amount")

Effect of bin count

px.histogram(loans, x="loan_amount", nbins=10, range_x=[0, 45000])
px.histogram(loans, x="loan_amount", nbins=30, range_x=[0, 45000])
px.histogram(loans, x="loan_amount", nbins=100, range_x=[0, 45000])

Summary of bin count effect

  • Overall: histograms emphasize mode and skewness, not mean/median
  • Lots of bins: easier to see modes, harder to see skewness
  • Few bins: easier to see skewness, harder to see multiple modes

Aside: be careful with histograms on discrete data

uniform = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
px.histogram(uniform, x="x", nbins=10)
px.histogram(uniform, x="x", nbins=9)

Binning discrete data can be misleading.

Yes, you should label histograms

px.histogram(loans, x="loan_amount", nbins=30,
  labels={'loan_amount': 'Loan amount (USD)'}, title="Amounts of Lending Club loans")

But many plots in this presentation will be unlabeled for space.

Fill with a categorical variable

px.histogram(loans, x="loan_amount", nbins=30, color="homeownership", barmode="overlay")

Should we stack?

Overlay: bars start at axis. Stacked: start at top of previous bar.

px.histogram(loans, x="loan_amount",
  nbins=30, color="homeownership",
  barmode="overlay")
px.histogram(loans, x="loan_amount",
  nbins=30, color="homeownership",
  barmode="stack")

Should we facet instead?

(
  px.histogram(loans, x="loan_amount", nbins=30, facet_row="homeownership")
  .for_each_annotation(lambda a: a.update(text=a.text.split("=", 1)[-1]))
)

Normalize?

(
  px.histogram(loans, x="loan_amount", nbins=30, facet_row="homeownership",
    histnorm="percent")
  .for_each_annotation(lambda a: a.update(text=a.text.split("=", 1)[-1]))
)

Alternative plot types: Box

px.box(loans, x="loan_amount", y="homeownership")

Alternative plot types: Violin

px.violin(loans, x="loan_amount", y="homeownership")

Alternative plot types: Ridgeline

(
  px.violin(loans, x="loan_amount", y="homeownership")
  .update_traces(side = 'positive', width=3)
)

Retrieval Break

Mapping vs Setting

Which one of these maps color? Which one sets color?

(
  px.scatter(loans, x="loan_amount", y="interest_rate")
  .update_traces(marker_color="red", marker_size=3)
)
px.scatter(loans, x="loan_amount", y="interest_rate", color="grade")

What is the error in the following code?

px.scatter(gapminder,
  x="gdpPercap", y="lifeExp"
  animation_frame="year"
)

Categorical Data

Histogram on categorical data

Often called a bar chart.

px.histogram(loans, x="homeownership")

px.bar vs px.histogram

counts = loans["homeownership"].value_counts()
counts
MORTGAGE    4789
RENT        3858
OWN         1353
Name: homeownership, dtype: int64
px.bar(counts, height=300)

Bar chart with lots of categories

px.histogram(loans, x="loan_purpose")

Category labels should be legible

Tip: flip bar charts so the categorical variable is on the y-axis.

px.histogram(loans, y="loan_purpose")

Order categories meaningfully.

(
  px.histogram(loans, y="loan_purpose")
  .update_yaxes(categoryorder="total ascending")
)

Documentation on categoryorder

Segmented bar plots

px.histogram(loans, x="homeownership", color="grade")

oops, order those categories!

fig = (
  px.histogram(loans, x="homeownership", color="grade")
  .update_yaxes(categoryorder="category ascending")
)
fig

Why didn’t that work?

The figure has 7 traces, stored in fig.data.

print(len(fig.data))
7

The names of the traces are the categories.

print([d.name for d in fig.data])
['C', 'D', 'A', 'B', 'F', 'E', 'G']

They are sorted in order of first appearance.

print(list(loans['grade'].unique()))
['C', 'D', 'A', 'B', 'F', 'E', 'G']

Which isn’t what we want.

Fix the data order

px.histogram(
  loans.sort_values(by="grade"),
  x="homeownership", color="grade"
)

Choosing a color palette

px.colors.qualitative.swatches()
px.colors.sequential.swatches()

Implementing that color palette

px.histogram(
  loans.sort_values(by="grade"),
  x="homeownership", color="grade",
  color_discrete_sequence=px.colors.qualitative.Plotly
)
px.histogram(
  loans.sort_values(by="grade"),
  x="homeownership", color="grade",
  color_discrete_sequence=px.colors.sequential.Cividis_r
)

Should we normalize the bars?

px.histogram(
  loans.sort_values(by="grade"),
  x="homeownership", color="grade",
  color_discrete_sequence=px.colors.sequential.Cividis_r,
  barnorm=None
)
px.histogram(
  loans.sort_values(by="grade"),
  x="homeownership", color="grade",
  color_discrete_sequence=px.colors.sequential.Cividis_r,
  barnorm="percent"
)

Your turn

Start work on Exercise 4.

Wednesday

EDA

  • Exploratory data analysis (EDA) is the process of exploring data, usually with visualization, to understand the data and to generate hypotheses.
  • EDA is not a formal process, but it is a process that can be learned.
  • EDA is not a substitute for formal statistical inference, but it is a useful precursor.

Primary keys

  • Primary key: a column or set of columns that uniquely identifies each row in a table
  • When we say “each row is a ___“, the BLANK is the primary key

What’s the primary key for the gapminder_latest from Exercise 4?

What’s the primary key for gapminder?

Does loans have a primary key?

Relationships between continuous variables

px.scatter(loans, x="debt_to_income", y="interest_rate")

Overplotting

Easy fixes: transparency, smaller points

(
  px.scatter(loans, x="debt_to_income", y="interest_rate")
  .update_traces(marker_opacity=0.1, marker_size=3)
)

Avoid lots of blank space

threshold = 75
low_debt_to_income = loans.query("debt_to_income < @threshold")
num_rows_removed = len(loans) - len(low_debt_to_income)
print("Removed {} rows ({:.1%})".format(num_rows_removed, num_rows_removed / len(loans)))
Removed 75 rows (0.8%)
(
  px.scatter(low_debt_to_income, x="debt_to_income", y="interest_rate")
  .update_traces(marker_opacity=0.1, marker_size=3)
)

2D histograms

px.density_heatmap(low_debt_to_income, x="debt_to_income", y="interest_rate")

Density Contours

px.density_contour(low_debt_to_income, x="debt_to_income", y="interest_rate")

Other options: Trendlines

(
  px.scatter(low_debt_to_income, x="debt_to_income", y="interest_rate", trendline="lowess", trendline_color_override="red")
  .update_traces(marker_opacity=0.1, marker_size=3)
)

Other options: binning along one axis

Remember that grade is already a binning of interest_rate.

px.box(low_debt_to_income, x="debt_to_income", y="grade",
  category_orders={"grade": ["A", "B", "C", "D", "E", "F", "G"]})
px.violin(low_debt_to_income, x="debt_to_income", y="grade",
  category_orders={"grade": ["A", "B", "C", "D", "E", "F", "G"]})

Tweaking the ridgeline plot

(
  px.violin(low_debt_to_income, x="debt_to_income", y="grade",
    category_orders={"grade": ["A", "B", "C", "D", "E", "F", "G"]},
    labels={"debt_to_income": "Debt-to-income ratio", "grade": "Loan grade"},
    template="plotly_white"
  )
  .update_traces(side = 'positive', width=5, points=False)
  .update_yaxes(showgrid=True)
  .update_xaxes(showgrid=False, zeroline=False)
)