Many visualizations involve joining data from multiple sources and/or reshaping data from “wide” to “long” format. This exercise will give you practice with both of these tasks.
Getting Started
Create a folder and a Quarto file, as usual. You can use ex06-pivot-join.qmd. The usual setup code can be found in thou course notes.
Download the data from https://www.gapminder.org/data/ in either CSV or Excel format. Upload to RStudio and use either pd.read_csv or pd.read_excel to load it.
Step by step instructions, so we all get the same data:
Under “Choose individual indicators”, search for “GDP”. Click the one marked “Income”, it should say “GDP per capita (price and inflation adjusted, in PPP$2017)” when you click on it.
At the top of the table, click Download As: CSV or XLSX. I got a file named lex.xlsx
Click the header (“GDP per capita”) and select instead “Life Expectancy, at birth”. Download that as well. I got a file named gdp_pcap.xlsx
Notice that the data is in “wide” format, with each year as a column. To join and plot this data, we’ll need it in “long” format, with each year as a row:
country
year
gdp_pcap
0
Afghanistan
1800
583
1
Angola
1800
452
2
Albania
1800
569
3
Andorra
1800
1660
4
UAE
1800
1390
We’ll use pd.melt to get it in this format. We’ll need to tell it which columns form the primary key, specified by the id_vars argument. The other columns will be “melted” into a single column, specified by the var_name argument. The values for those columns will be put in a column specified by the value_name argument.
It has a lot of columns; we only want four_regions. Rename it to region. You might also want to rename the name column so it has the same name as the corresponding column in the gapminder dataframe. Then select the columns we want. Then join it with the gapminder data.
To rename, use data_frame.rename(columns={"old_name": "new_name"}). To select columns, use df[["col1", "col2"]]. Remember that both of these return a new dataframe, so you can chain the operations together, then assign the final result to a variable (I used gapminder_with_regions).
Plotting
Now you can make the plot. By now you should be able to do this without further guidance.