Exercise 6: Pivoting and Joining - Added Notes

Many visualizations involve joining data from multiple sources and/or reshaping data from “wide” to “long” format. This exercise will give you practice with both of these tasks.

Getting Started

Create a folder and a Quarto file, as usual. You can use ex06-pivot-join.qmd. The usual setup code can be found in thou course notes.

Download the data from https://www.gapminder.org/data/ in either CSV or Excel format. Upload to RStudio and use either pd.read_csv or pd.read_excel to load it.

Step by step instructions, so we all get the same data:

  • Go to https://www.gapminder.org/data/
  • Under “Choose individual indicators”, search for “GDP”. Click the one marked “Income”, it should say “GDP per capita (price and inflation adjusted, in PPP$2017)” when you click on it.
  • At the top of the table, click Download As: CSV or XLSX. I got a file named lex.xlsx
  • Click the header (“GDP per capita”) and select instead “Life Expectancy, at birth”. Download that as well. I got a file named gdp_pcap.xlsx
gapminder_gdp = pd.read_excel("data/gdp_pcap.xlsx")
gapminder_gdp.head()
country 1800 1801 1802 1803 1804 1805 1806 1807 1808 ... 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100
0 Afghanistan 583 583 583 583 583 583 583 583 583 ... 3570 3700 3720 3830 3930 4030 4130 3900 4070 4030
1 Angola 452 454 456 458 460 462 464 466 468 ... 20.6k 21.1k 21.7k 21.9k 22.3k 22.5k 23k 23.5k 24k 24.6k
2 Albania 569 571 572 574 576 577 579 581 582 ... 37.6k 38.5k 39.4k 39.9k 40.9k 41.9k 41.4k 42.4k 43.7k 45.7k
3 Andorra 1660 1660 1670 1670 1670 1680 1680 1680 1690 ... 59.2k 60k 60.3k 61k 61.7k 62.7k 61.9k 62.5k 62.9k 62.1k
4 UAE 1390 1390 1390 1400 1410 1410 1420 1420 1430 ... 86.2k 85.4k 84.1k 84.3k 84.6k 83.9k 84.2k 84k 84.1k 84.3k

5 rows × 302 columns

gapminder_life = pd.read_excel("data/lex.xlsx")
gapminder_life.head()
country 1800 1801 1802 1803 1804 1805 1806 1807 1808 ... 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100
0 Afghanistan 28.2 28.2 28.2 28.2 28.2 28.2 28.1 28.1 28.1 ... 75.5 75.7 75.8 76.0 76.1 76.2 76.4 76.5 76.6 76.8
1 Angola 27.0 27.0 27.0 27.0 27.0 27.0 27.0 27.0 27.0 ... 78.8 79.0 79.1 79.2 79.3 79.5 79.6 79.7 79.9 80.0
2 Albania 35.4 35.4 35.4 35.4 35.4 35.4 35.4 35.4 35.4 ... 87.4 87.5 87.6 87.7 87.8 87.9 88.0 88.2 88.3 88.4
3 Andorra NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 UAE 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7 30.7 ... 82.4 82.5 82.6 82.7 82.8 82.9 83.0 83.1 83.2 83.3

5 rows × 302 columns

Wide to Long

Notice that the data is in “wide” format, with each year as a column. To join and plot this data, we’ll need it in “long” format, with each year as a row:

country year gdp_pcap
0 Afghanistan 1800 583
1 Angola 1800 452
2 Albania 1800 569
3 Andorra 1800 1660
4 UAE 1800 1390

We’ll use pd.melt to get it in this format. We’ll need to tell it which columns form the primary key, specified by the id_vars argument. The other columns will be “melted” into a single column, specified by the var_name argument. The values for those columns will be put in a column specified by the value_name argument.

pd.melt(
    gapminder_gdp,
    id_vars=[...],
    var_name="XXX", value_name="YYY"
)

Fill in the blanks above with the appropriate column names. Then assign the result to a variable; I used gdp_long.

Now repeat that for the life expectancy data:

country year life
0 Afghanistan 1800 28.2
1 Angola 1800 27.0
2 Albania 1800 35.4
3 Andorra 1800 NaN
4 UAE 1800 30.7

Data Types

Check the data types of the gdp_long table. There are two that are not what you’d expect.

gdp_long.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58695 entries, 0 to 58694
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   country   58695 non-null  object
 1   year      58695 non-null  object
 2   gdp_pcap  58695 non-null  object
dtypes: object(3)
memory usage: 1.3+ MB

year is easy to fix:

gdp_long['year'] = gdp_long['year'].astype(int)
life_long['year'] = life_long['year'].astype(int)

A more tricky issue is that gdp_pcap uses k to mean “thousands”. For example:

gdp_long.tail()
country year gdp_pcap
58690 Samoa 2100 16k
58691 Yemen 2100 5730
58692 South Africa 2100 42.7k
58693 Zambia 2100 12.8k
58694 Zimbabwe 2100 7380

We need to convert those k values to numbers. Here’s an approach:

def parse_number_with_units(num):
    if not isinstance(num, str):
        return num
    if num.endswith("k"):
        return float(num[:-1]) * 1000
    return float(num)
gdp_long['gdp_pcap'] = gdp_long['gdp_pcap'].map(parse_number_with_units)
gdp_long.tail()
country year gdp_pcap
58690 Samoa 2100 16000.0
58691 Yemen 2100 5730.0
58692 South Africa 2100 42700.0
58693 Zambia 2100 12800.0
58694 Zimbabwe 2100 7380.0

Joining

Now, join the datasets together. Think about what the primary key of this data is. Call the result gapminder:

country year gdp_pcap life
0 Afghanistan 1800 583.0 28.2
1 Angola 1800 452.0 27.0
2 Albania 1800 569.0 35.4
3 Andorra 1800 1660.0 NaN
4 UAE 1800 1390.0 30.7

Validate your join by checking the number of rows in the result.

195 countries, 301 years
n_unique_countries * n_years=58695 expected rows
58695 rows in result

Adding regions

The region data is available here. read_excel can pull the data in directly for you. I’ll save you some time by giving you that code:

region_data = pd.read_excel("https://docs.google.com/spreadsheets/d/1qHalit8sXC0R8oVXibc2wa2gY7bkwGzOybEMTWp-08o/export?format=xlsx", sheet_name="list-of-countries-etc")
region_data.head()
geo name four_regions eight_regions six_regions members_oecd_g77 Latitude Longitude UN member since World bank region World bank, 4 income groups 2017 World bank, 3 income groups 2017 UNHCR
0 aus Australia asia east_asia_pacific east_asia_pacific oecd -25.0 135.00000 1945-11-01 East Asia & Pacific High income High income Asia and the Pacific
1 brn Brunei asia east_asia_pacific east_asia_pacific g77 4.5 114.66667 1984-09-21 East Asia & Pacific High income High income Asia and the Pacific
2 khm Cambodia asia east_asia_pacific east_asia_pacific g77 13.0 105.00000 1955-12-14 East Asia & Pacific Lower middle income Middle income Asia and the Pacific
3 chn China asia east_asia_pacific east_asia_pacific g77 35.0 105.00000 1945-10-24 East Asia & Pacific Upper middle income Middle income Asia and the Pacific
4 fji Fiji asia east_asia_pacific east_asia_pacific g77 -18.0 178.00000 1970-10-13 East Asia & Pacific Upper middle income Middle income Asia and the Pacific

It has a lot of columns; we only want four_regions. Rename it to region. You might also want to rename the name column so it has the same name as the corresponding column in the gapminder dataframe. Then select the columns we want. Then join it with the gapminder data.

To rename, use data_frame.rename(columns={"old_name": "new_name"}). To select columns, use df[["col1", "col2"]]. Remember that both of these return a new dataframe, so you can chain the operations together, then assign the final result to a variable (I used gapminder_with_regions).

Debugging Your Join

When you first do a join, it might be a good idea to set indicator=True and use an outer join. This shows us a few interesting things:

gapminder_with_regions.head()
country region year gdp_pcap life _merge
0 Australia asia 1800.0 995.0 34.0 both
1 Australia asia 1801.0 1000.0 34.0 both
2 Australia asia 1802.0 1010.0 34.0 both
3 Australia asia 1803.0 1010.0 34.0 both
4 Australia asia 1804.0 1020.0 34.0 both

A value of both for _merge indicates that the row comes from a key that was in both tables. But there are also rows that come from only one table or the other. For example:

gapminder_with_regions.query('_merge != "both"').head()
country region year gdp_pcap life _merge
17458 Liechtenstein europe NaN NaN NaN left_only
18061 Macedonia, FYR europe NaN NaN NaN left_only
24082 United Kingdom europe NaN NaN NaN left_only
39735 United Arab Emirates asia NaN NaN NaN left_only
40639 United States americas NaN NaN NaN left_only

I had put the gapminder_with_regions dataframe on the left side of the join, and gapminder on the right. Think about what the table above is telling us.

Let’s look at that in aggregate.

(
    gapminder_with_regions
    .groupby(['country', '_merge'], as_index=False)
    .size()
    .query('_merge != "both" and size > 0')
)
country _merge size
169 Eswatini right_only 301
219 Holy See left_only 1
297 Liechtenstein left_only 1
306 Macedonia, FYR left_only 1
391 North Macedonia right_only 301
510 Swaziland left_only 1
559 UAE right_only 301
562 UK right_only 301
565 USA right_only 301
573 United Arab Emirates left_only 1
576 United Kingdom left_only 1
579 United States left_only 1

So: there are some countries that are in one dataset but not the other, and vice versa.

But notice that some countries are just written differently, e.g.,“United States” vs “USA”. We can use replace to fix that:

county_translations = {
    "USA": "United States",
    "UK": "United Kingdom",
    "UAE": "United Arab Emirates",
}
gapminder['country'] = gapminder['country'].replace(county_translations)
Note

Think: Can you do this with a merge instead?

Plotting

Now you can make the plot. By now you should be able to do this without further guidance.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 58702 entries, 0 to 58701
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   country   58702 non-null  object  
 1   region    57197 non-null  object  
 2   year      58695 non-null  float64 
 3   gdp_pcap  58695 non-null  float64 
 4   life      56616 non-null  float64 
 5   _merge    58702 non-null  category
dtypes: category(1), float64(3), object(2)
memory usage: 2.7+ MB