gapminder = px.data.gapminder()
gapminder_latest = gapminder.query("year == 2007")
just_usa = gapminder.query("country == 'United States'")
countries = [
    "China", "India", "United States",
    "Indonesia", "Brazil", "Pakistan",
    "Bangladesh", "Nigeria", "Japan"]
gapminder_9_countries = gapminder.query("country in @countries")
Exercise 4: Plot Types
Setup
Start with the usual setup code, given in the course notes.
Please shut down your RStudio session when you’re not using it.
Data
You can use this data. It’s the same as for a previous exercise, but we’ve also added gapminder_latest, which contains the data for just 2007 (so it has just one row per country).
Part 1: Univariate analysis - Continuous
Pick one of the continuous variables. Make two different types of plots to show its distribution. (Use gapminder_latest, which has one row per country.)
Here are a few examples of different types of plots you could make.
Note: your plots might show different ticks than the examples below, since the ticks that plotly draws depend on the amount of vertical space available. Generally the defaults are fine; if you really want to tweak them, see the plotly course notes.
Checklist:
Part 2: Univariate Analysis - Categorical
Use px.histogram to make a bar plot of the number of countries by continent. (Use gapminder_latest, which has one row per country.)
Discuss the interpretation of this plot with your partner. (Why is this a px.histogram and not a px.bar?)
Part 2A: Adjustments
Try making the following adjustments to your plot:
Plotly gives two ways to order categories: one is to use category_orders in the px.histogram call, and the other is to use update_xaxes (or update_yaxes) to set categoryorder.
See the plotly course notes for more details.
Part 2B: Data Wrangling
The px.histogram function did some data wrangling for you: it counted the number of countries in each continent. Try doing this yourself by using gapminder_latest.groupby("continent").size().
country_count_by_continent = gapminder_latest.groupby("continent", as_index=False).size()
country_count_by_continent
| | continent | size |
|---|---|---|
| 0 | Africa | 52 |
| 1 | Americas | 25 |
| 2 | Asia | 33 |
| 3 | Europe | 30 |
| 4 | Oceania | 2 |
In the slides we showed an alternative approach using .value_counts. Grouping is more flexible, though, and we’ll be using it extensively next week.
Now make the plot again, using px.bar instead of px.histogram.
You can use update_xaxes or update_yaxes to set categoryorder, or you can sort the data using .sort_values before passing it to px.bar.
Checklist:
Part 3: Bivariate: Relationship between numerical and categorical
Pick one of the continuous variables and plot its relationship with continent.
Note:
- You may again want to use only 2007 data (gapminder_latest).
- You may want to try using log_x = True or log_y = True, although note that these don’t work for some plot types in the current version of Plotly.
Make two different visualizations. Compare and contrast them.
If there isn’t already a better ordering, boxplots should generally be ordered by median. But Plotly does not natively support this. You’d have to do it yourself. It might look something like (beware, ugly code ahead):
median_order = (
    gapminder_latest.groupby("continent", as_index=False)
    .agg(median_lifeExp=("lifeExp", "median"))
    .sort_values("median_lifeExp")
    ['continent'].tolist()
)
median_order
['Africa', 'Asia', 'Americas', 'Europe', 'Oceania']
Normally we’re not using pandas indexing in this class, but in this case the code would come out cleaner if we did:
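median_order = (
    # numeric_only=True skips text columns such as country,
    # which recent pandas versions refuse to take the median of
    gapminder_latest.groupby("continent").median(numeric_only=True)
    .sort_values("lifeExp").index.tolist()
)
median_order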
['Africa', 'Asia', 'Americas', 'Europe', 'Oceania']
Either way, we can use category_orders to set the order:
px.box(
    gapminder_latest,
    x="lifeExp", y="continent",
    category_orders={"continent": median_order}
)
Plotly actually does support categoryorder="median ascending" or categoryorder="median descending", but there’s a bug with how it interacts with boxplots. Notice that the following plot is not ordered by median:
(
    px.box(
        gapminder_latest,
        x="gdpPercap", y="continent",
    )
    .update_yaxes(categoryorder="median descending")
)
Checklist:
Part 4: Bivariate with a discrete variable
Plot how the distribution of one of the continuous variables changes by year. Try making two different distribution plots.
What do you notice about what has happened to the shape of the distribution of different variables over the years?
Checklist:
You might need to set orientation = "h" to tell plotly which variable is the discrete one (since both have a numeric storage type).
(This used update_traces to make the violins wider, and to make them all point in the same direction.)
Log scales for violin plots seem to be broken in Plotly. Boxplots seem to work, though.
Submitting
Final checklist:
Check your work against all the checklists above. Submit your rendered HTML as usual.