Exercise 4: Plot Types

Setup

Start with the usual setup code, given in the course notes.

Important

Please shut down your RStudio session when you’re not using it.

Data

You can use this data. It’s the same as for a previous exercise, but we’ve also added gapminder_latest, which contains the data for just 2007 (so it’s just one row per country).

gapminder = px.data.gapminder()
gapminder_latest = gapminder.query("year == 2007")
just_usa = gapminder.query("country == 'United States'")

countries = [
  "China", "India", "United States",
  "Indonesia", "Brazil", "Pakistan",
  "Bangladesh", "Nigeria", "Japan"]

gapminder_9_countries = gapminder.query("country in @countries")

Part 1: Univariate analysis - Continuous

Pick one of the continuous variables. Make two different types of plots to show its distribution. (Use gapminder_latest, which has one row per country.)

Here are a few examples of different types of plots you could make.

Note: your plots might show different ticks than the examples, since the ticks that plotly draws depend on the amount of vertical space it has. Generally the defaults are fine; if you really want to tweak them, see the plotly course notes.

Checklist:

Part 2: Univariate Analysis - Categorical

Use px.histogram to make a bar plot of the number of countries by continent. (Use gapminder_latest, which has one row per country.)

Discuss the interpretation of this plot with your partner. (Why is this a px.histogram and not a px.bar?)

Part 2A: Adjustments

Try making the following adjustments to your plot:

Note

Plotly gives two ways to order categories: one is to use category_orders in the px.histogram call, and the other is to use update_xaxes (or update_yaxes) to set categoryorder.

See the plotly course notes for more details.

Part 2B: Data Wrangling

The px.histogram function did some data wrangling for you: it counted the number of countries in each continent. Try doing this yourself by using gapminder_latest.groupby("continent").size().

country_count_by_continent = gapminder_latest.groupby("continent", as_index=False).size()
country_count_by_continent
  continent  size
0    Africa    52
1  Americas    25
2      Asia    33
3    Europe    30
4   Oceania     2

Note

In the slides we showed an alternative approach using .value_counts. Grouping is more flexible, though, and we’ll be using it extensively next week.

Now make the plot again, using px.bar instead of px.histogram.

Note

You can use update_xaxes or update_yaxes to set categoryorder, or you can sort the data using .sort_values before passing it to px.bar.

Checklist:

Part 3: Bivariate: Relationship between numerical and categorical

Pick one of the continuous variables and plot its relationship with continent.

Note:

  • You may again want to use only 2007 data (gapminder_latest)
  • You may want to try using log_x = True or log_y = True, although note that these don’t work for some plot types in the current version of Plotly.

Make two different visualizations. Compare and contrast them.

If there isn’t already a better ordering, boxplots should generally be ordered by median. But Plotly does not natively support this. You’d have to do it yourself. It might look something like (beware, ugly code ahead):

median_order = (
  gapminder_latest.groupby("continent", as_index=False)
  .agg(median_lifeExp=("lifeExp", "median"))
  .sort_values("median_lifeExp")
  ['continent'].tolist()
)
median_order
['Africa', 'Asia', 'Americas', 'Europe', 'Oceania']

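We don’t normally use pandas indexing in this class, but in this case it makes the code cleaner. (Select the lifeExp column before taking the median; recent versions of pandas raise an error if you ask for the median of text columns such as country.)

median_order = (
  gapminder_latest.groupby("continent")["lifeExp"].median()
  .sort_values().index.tolist()
)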
median_order
['Africa', 'Asia', 'Americas', 'Europe', 'Oceania']

Either way, we can use category_orders to set the order:

px.box(
  gapminder_latest,
  x="lifeExp", y="continent",
  category_orders={"continent": median_order}
)

Plotly does actually support categoryorder="median ascending" and categoryorder="median descending", but there’s a bug in how they interact with boxplots. Notice that the following plot is not ordered by median:

(
  px.box(
    gapminder_latest,
    x="gdpPercap", y="continent",
  )
  .update_yaxes(categoryorder="median descending")
)

Checklist:

Part 4: Bivariate with a discrete variable

Plot how the distribution of one of the continuous variables changes by year. Try making two different distribution plots.

What do you notice about what has happened to the shape of the distribution of different variables over the years?

Checklist:

Tip

You might need to set orientation = "h" to tell plotly which variable is the discrete one (since both have a numeric storage type).

(This used update_traces to make the violins wider, and to make them all point in the same direction.)

Note

Log scales for violin plots seem to be broken in Plotly. Boxplots seem to work, though.

Submitting

Final checklist:

Check your work against all the checklists above. Submit your rendered HTML as usual.