Many visualizations involve joining data from multiple sources and/or reshaping data from “wide” to “long” format. This exercise will give you practice with both of these tasks.
Getting Started
Create a folder and a Quarto file, as usual. You can use ex06-pivot-join.qmd. The usual setup code can be found in thou course notes.
Download the data from https://www.gapminder.org/data/ in either CSV or Excel format. Upload to RStudio and use either pd.read_csv or pd.read_excel to load it.
Step by step instructions, so we all get the same data:
Under “Choose individual indicators”, search for “GDP”. Click the one marked “Income”, it should say “GDP per capita (price and inflation adjusted, in PPP$2017)” when you click on it.
At the top of the table, click Download As: CSV or XLSX. I got a file named lex.xlsx
Click the header (“GDP per capita”) and select instead “Life Expectancy, at birth”. Download that as well. I got a file named gdp_pcap.xlsx
Notice that the data is in “wide” format, with each year as a column. To join and plot this data, we’ll need it in “long” format, with each year as a row:
country
year
gdp_pcap
0
Afghanistan
1800
583
1
Angola
1800
452
2
Albania
1800
569
3
Andorra
1800
1660
4
UAE
1800
1390
We’ll use pd.melt to get it in this format. We’ll need to tell it which columns form the primary key, specified by the id_vars argument. The other columns will be “melted” into a single column, specified by the var_name argument. The values for those columns will be put in a column specified by the value_name argument.
It has a lot of columns; we only want four_regions. Rename it to region. You might also want to rename the name column so it has the same name as the corresponding column in the gapminder dataframe. Then select the columns we want. Then join it with the gapminder data.
To rename, use data_frame.rename(columns={"old_name": "new_name"}). To select columns, use df[["col1", "col2"]]. Remember that both of these return a new dataframe, so you can chain the operations together, then assign the final result to a variable (I used gapminder_with_regions).
Debugging Your Join
When you first do a join, it might be a good idea to set indicator=True and use an outer join. This shows us a few interesting things:
gapminder_with_regions.head()
country
region
year
gdp_pcap
life
_merge
0
Australia
asia
1800.0
995.0
34.0
both
1
Australia
asia
1801.0
1000.0
34.0
both
2
Australia
asia
1802.0
1010.0
34.0
both
3
Australia
asia
1803.0
1010.0
34.0
both
4
Australia
asia
1804.0
1020.0
34.0
both
A value of both for _merge indicates that the row comes from a key that was in both tables. But there are also rows that come from only one table or the other. For example:
I had put the gapminder_with_regions dataframe on the left side of the join, and gapminder on the right. Think about what the table above is telling us.