Slides 4: Visualization Implementation

Q&A

Bar vs histogram
Color schemes
Map vs set
discrete vs continuous scales

Terminology

Number of variables involved

Univariate data analysis - distribution of single variable
Bivariate data analysis - relationship between two variables
Multivariate data analysis - relationship between many variables at once
- usually focusing on bivariate relationships assuming other variables are held constant

Types of variables

Numerical variables can be classified as continuous or discrete based on whether or not the variable can take on an infinite number of values or only non-negative whole numbers, respectively.
If the variable is categorical, we can determine if it is ordinal based on whether or not the levels have a natural ordering.

Data

Data: Lending Club

Thousands of loans made through the Lending Club, a platform that allowed individuals to lend to other individuals
Not all loans are created equal – ease of getting a loan depends on (apparent) ability to pay back the loan
Data includes loans made, these are not loan applications

Take a peek at the data

loans_full_schema = pd.read_excel('../../static/data/openintro_loans_full_schema.xlsx')
loans_full_schema.info()
loans_full_schema.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 55 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   emp_title                         9167 non-null   object 
 1   emp_length                        9183 non-null   float64
 2   state                             10000 non-null  object 
 3   homeownership                     10000 non-null  object 
 4   annual_income                     10000 non-null  float64
 5   verified_income                   10000 non-null  object 
 6   debt_to_income                    9976 non-null   float64
 7   annual_income_joint               1495 non-null   float64
 8   verification_income_joint         1455 non-null   object 
 9   debt_to_income_joint              1495 non-null   float64
 10  delinq_2y                         10000 non-null  int64  
 11  months_since_last_delinq          4342 non-null   float64
 12  earliest_credit_line              10000 non-null  int64  
 13  inquiries_last_12m                10000 non-null  int64  
 14  total_credit_lines                10000 non-null  int64  
 15  open_credit_lines                 10000 non-null  int64  
 16  total_credit_limit                10000 non-null  int64  
 17  total_credit_utilized             10000 non-null  int64  
 18  num_collections_last_12m          10000 non-null  int64  
 19  num_historical_failed_to_pay      10000 non-null  int64  
 20  months_since_90d_late             2285 non-null   float64
 21  current_accounts_delinq           10000 non-null  int64  
 22  total_collection_amount_ever      10000 non-null  int64  
 23  current_installment_accounts      10000 non-null  int64  
 24  accounts_opened_24m               10000 non-null  int64  
 25  months_since_last_credit_inquiry  8729 non-null   float64
 26  num_satisfactory_accounts         10000 non-null  int64  
 27  num_accounts_120d_past_due        9682 non-null   float64
 28  num_accounts_30d_past_due         10000 non-null  int64  
 29  num_active_debit_accounts         10000 non-null  int64  
 30  total_debit_limit                 10000 non-null  int64  
 31  num_total_cc_accounts             10000 non-null  int64  
 32  num_open_cc_accounts              10000 non-null  int64  
 33  num_cc_carrying_balance           10000 non-null  int64  
 34  num_mort_accounts                 10000 non-null  int64  
 35  account_never_delinq_percent      10000 non-null  float64
 36  tax_liens                         10000 non-null  int64  
 37  public_record_bankrupt            10000 non-null  int64  
 38  loan_purpose                      10000 non-null  object 
 39  application_type                  10000 non-null  object 
 40  loan_amount                       10000 non-null  int64  
 41  term                              10000 non-null  int64  
 42  interest_rate                     10000 non-null  float64
 43  installment                       10000 non-null  float64
 44  grade                             10000 non-null  object 
 45  sub_grade                         10000 non-null  object 
 46  issue_month                       10000 non-null  object 
 47  loan_status                       10000 non-null  object 
 48  initial_listing_status            10000 non-null  object 
 49  disbursement_method               10000 non-null  object 
 50  balance                           10000 non-null  float64
 51  paid_total                        10000 non-null  float64
 52  paid_principal                    10000 non-null  float64
 53  paid_interest                     10000 non-null  float64
 54  paid_late_fees                    10000 non-null  float64
dtypes: float64(17), int64(25), object(13)
memory usage: 4.2+ MB

	emp_title	emp_length	state	homeownership	annual_income	verified_income	debt_to_income	annual_income_joint	verification_income_joint	debt_to_income_joint	...	sub_grade	issue_month	loan_status	initial_listing_status	disbursement_method	balance	paid_total	paid_principal	paid_interest
0	global config engineer	3.0	NJ	MORTGAGE	90000.0	Verified	18.01	NaN	NaN	NaN	...	C3	Mar-2018	Current	whole	Cash	27015.86	1999.33	984.14	1015.19
1	warehouse office clerk	10.0	HI	RENT	40000.0	Not Verified	5.04	NaN	NaN	NaN	...	C1	Feb-2018	Current	whole	Cash	4651.37	499.12	348.63	150.49
2	assembly	3.0	WI	RENT	40000.0	Source Verified	21.15	NaN	NaN	NaN	...	D1	Feb-2018	Current	fractional	Cash	1824.63	281.80	175.37	106.43
3	customer service	1.0	PA	RENT	30000.0	Not Verified	10.16	NaN	NaN	NaN	...	A3	Jan-2018	Current	whole	Cash	18853.26	3312.89	2746.74	566.15
4	security supervisor	10.0	CA	RENT	35000.0	Verified	57.96	57000.0	Verified	37.66	...	C3	Mar-2018	Current	whole	Cash	21430.15	2324.65	1569.85	754.80

5 rows × 55 columns

Select a few variables

loans = loans_full_schema[['loan_amount', 'loan_purpose', 'interest_rate', 'term', 'grade', 'state', 'annual_income', 'homeownership', 'debt_to_income']]
loans.head()

	loan_amount	loan_purpose	interest_rate	term	grade	state	annual_income	homeownership	debt_to_income
0	28000	moving	14.07	60	C	NJ	90000.0	MORTGAGE	18.01
1	5000	debt_consolidation	12.61	36	C	HI	40000.0	RENT	5.04
2	2000	other	17.09	36	D	WI	40000.0	RENT	21.15
3	21600	debt_consolidation	6.72	36	A	PA	30000.0	RENT	10.16
4	23000	credit_card	14.07	36	C	CA	35000.0	RENT	57.96

Selected variables

variable	description
`loan_amount`	Amount of the loan received, in US dollars
`loan_purpose`	Purpose of the loan
`interest_rate`	Interest rate on the loan, in an annual percentage
`term`	The length of the loan, which is always set as a whole number of months
`grade`	Loan grade, which takes a values A through G and represents the quality of the loan and its likelihood of being repaid
`state`	US state where the borrower resides
`annual_income`	Borrower’s annual income, including any second income, in US dollars
`homeownership`	Indicates whether the person owns, owns but has a mortgage, or rents
`debt_to_income`	Debt-to-income ratio

Variable types

variable	type
`loan_amount`	numerical, continuous
`loan_purpose`	categorical, not ordinal
`interest_rate`	numerical, continuous
`term`	numerical, discrete
`grade`	categorical, ordinal
`state`	categorical, not ordinal
`annual_income`	numerical, continuous
`homeownership`	categorical, not ordinal
`debt_to_income`	numerical, continuous

Visualizing numerical data

Describing shapes of numerical distributions

center: mean, median, mode (not always useful)
spread: range, standard deviation, inter-quartile range
shape:
- skewness: right-skewed, left-skewed, symmetric (skew is to the side of the longer tail)
- modality: unimodal, bimodal, multimodal, uniform
unusual observations

Histograms

Plotting a histogram

px.histogram(loans, x="loan_amount")

px.histogram(loans, x="loan_amount", nbins=10, range_x=[0, 45000])

px.histogram(loans, x="loan_amount", nbins=30, range_x=[0, 45000])

px.histogram(loans, x="loan_amount", nbins=100, range_x=[0, 45000])

Summary of bin count effect

Overall: histograms emphasize mode and skewness, not mean/median
Lots of bins: easier to see modes, harder to see skewness
Few bins: easier to see skewness, harder to see multiple modes

Aside: be careful with histograms on discrete data

uniform = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})

px.histogram(uniform, x="x", nbins=10)

px.histogram(uniform, x="x", nbins=9)

Binning discrete data can be misleading.

Yes, you should label histograms

px.histogram(loans, x="loan_amount", nbins=30,
  labels={'loan_amount': 'Loan amount (USD)'}, title="Amounts of Lending Club loans")

But many plots in this presentation will be unlabeled for space.

Fill with a categorical variable

px.histogram(loans, x="loan_amount", nbins=30, color="homeownership", barmode="overlay")

Should we stack?

Overlay: bars start at axis. Stacked: start at top of previous bar.

px.histogram(loans, x="loan_amount",
  nbins=30, color="homeownership",
  barmode="overlay")

px.histogram(loans, x="loan_amount",
  nbins=30, color="homeownership",
  barmode="stack")

Should we facet instead?

(
  px.histogram(loans, x="loan_amount", nbins=30, facet_row="homeownership")
  .for_each_annotation(lambda a: a.update(text=a.text.split("=", 1)[-1]))
)

Normalize?

(
  px.histogram(loans, x="loan_amount", nbins=30, facet_row="homeownership",
    histnorm="percent")
  .for_each_annotation(lambda a: a.update(text=a.text.split("=", 1)[-1]))
)

Alternative plot types: Box

px.box(loans, x="loan_amount", y="homeownership")

Alternative plot types: Violin

px.violin(loans, x="loan_amount", y="homeownership")

Alternative plot types: Ridgeline

(
  px.violin(loans, x="loan_amount", y="homeownership")
  .update_traces(side = 'positive', width=3)
)

Retrieval Break

Mapping vs Setting

Which one of these maps color? Which one sets color?

(
  px.scatter(loans, x="loan_amount", y="interest_rate")
  .update_traces(marker_color="red", marker_size=3)
)

px.scatter(loans, x="loan_amount", y="interest_rate", color="grade")

What is the error in the following code?

px.scatter(gapminder,
  x="gdpPercap", y="lifeExp"
  animation_frame="year"
)

Categorical Data

Histogram on categorical data

Often called a bar chart.

px.histogram(loans, x="homeownership")

px.bar vs px.histogram

counts = loans["homeownership"].value_counts()
counts

MORTGAGE    4789
RENT        3858
OWN         1353
Name: homeownership, dtype: int64

px.bar(counts, height=300)

Bar chart with lots of categories

px.histogram(loans, x="loan_purpose")

Category labels should be legible

Tip: flip bar charts so the categorical variable is on the y-axis.

px.histogram(loans, y="loan_purpose")

Order categories meaningfully.

(
  px.histogram(loans, y="loan_purpose")
  .update_yaxes(categoryorder="total ascending")
)

Documentation on categoryorder

Segmented bar plots

px.histogram(loans, x="homeownership", color="grade")

oops, order those categories!

fig = (
  px.histogram(loans, x="homeownership", color="grade")
  .update_yaxes(categoryorder="category ascending")
)
fig

Why didn’t that work?

The figure has 7 traces, stored in fig.data.

print(len(fig.data))

The names of the traces are the categories.

print([d.name for d in fig.data])

['C', 'D', 'A', 'B', 'F', 'E', 'G']

They are sorted in order of first appearance.

print(list(loans['grade'].unique()))

['C', 'D', 'A', 'B', 'F', 'E', 'G']

Which isn’t what we want.

Fix the data order

px.histogram(
  loans.sort_values(by="grade"),
  x="homeownership", color="grade"
)

Choosing a color palette

px.colors.qualitative.swatches()

px.colors.sequential.swatches()

Implementing that color palette

px.histogram(
  loans.sort_values(by="grade"),
  x="homeownership", color="grade",
  color_discrete_sequence=px.colors.qualitative.Plotly
)

px.histogram(
  loans.sort_values(by="grade"),
  x="homeownership", color="grade",
  color_discrete_sequence=px.colors.sequential.Cividis_r
)

Should we normalize the bars?

px.histogram(
  loans.sort_values(by="grade"),
  x="homeownership", color="grade",
  color_discrete_sequence=px.colors.sequential.Cividis_r,
  barnorm=None
)

px.histogram(
  loans.sort_values(by="grade"),
  x="homeownership", color="grade",
  color_discrete_sequence=px.colors.sequential.Cividis_r,
  barnorm="percent"
)

Your turn

Start work on Exercise 4.

Wednesday

EDA

Exploratory data analysis (EDA) is the process of exploring data, usually with visualization, to understand the data and to generate hypotheses.
EDA is not a formal process, but it is a process that can be learned.
EDA is not a substitute for formal statistical inference, but it is a useful precursor.

Primary keys

Primary key: a column or set of columns that uniquely identifies each row in a table
When we say “each row is a ___“, the BLANK is the primary key

What’s the primary key for the gapminder_latest from Exercise 4?

What’s the primary key for gapminder?

Does loans have a primary key?

Relationships between continuous variables

px.scatter(loans, x="debt_to_income", y="interest_rate")

Overplotting

Easy fixes: transparency, smaller points

(
  px.scatter(loans, x="debt_to_income", y="interest_rate")
  .update_traces(marker_opacity=0.1, marker_size=3)
)

Avoid lots of blank space

threshold = 75
low_debt_to_income = loans.query("debt_to_income < @threshold")
num_rows_removed = len(loans) - len(low_debt_to_income)
print("Removed {} rows ({:.1%})".format(num_rows_removed, num_rows_removed / len(loans)))

Removed 75 rows (0.8%)

(
  px.scatter(low_debt_to_income, x="debt_to_income", y="interest_rate")
  .update_traces(marker_opacity=0.1, marker_size=3)
)

2D histograms

px.density_heatmap(low_debt_to_income, x="debt_to_income", y="interest_rate")

Density Contours

px.density_contour(low_debt_to_income, x="debt_to_income", y="interest_rate")

Other options: Trendlines

(
  px.scatter(low_debt_to_income, x="debt_to_income", y="interest_rate", trendline="lowess", trendline_color_override="red")
  .update_traces(marker_opacity=0.1, marker_size=3)
)

Other options: binning along one axis

Remember that grade is already a binning of interest_rate.

px.box(low_debt_to_income, x="debt_to_income", y="grade",
  category_orders={"grade": ["A", "B", "C", "D", "E", "F", "G"]})

px.violin(low_debt_to_income, x="debt_to_income", y="grade",
  category_orders={"grade": ["A", "B", "C", "D", "E", "F", "G"]})

Tweaking the ridgeline plot

(
  px.violin(low_debt_to_income, x="debt_to_income", y="grade",
    category_orders={"grade": ["A", "B", "C", "D", "E", "F", "G"]},
    labels={"debt_to_income": "Debt-to-income ratio", "grade": "Loan grade"},
    template="plotly_white"
  )
  .update_traces(side = 'positive', width=5, points=False)
  .update_yaxes(showgrid=True)
  .update_xaxes(showgrid=False, zeroline=False)
)