Exercise 6: Pivoting and Joining - Added Notes

Many visualizations involve joining data from multiple sources and/or reshaping data from “wide” to “long” format. This exercise will give you practice with both of these tasks.

Getting Started

Create a folder and a Quarto file, as usual. You can use ex06-pivot-join.qmd. The usual setup code can be found in thou course notes.

Download the data from https://www.gapminder.org/data/ in either CSV or Excel format. Upload to RStudio and use either pd.read_csv or pd.read_excel to load it.

Step by step instructions, so we all get the same data:

Go to https://www.gapminder.org/data/
Under “Choose individual indicators”, search for “GDP”. Click the one marked “Income”, it should say “GDP per capita (price and inflation adjusted, in PPP$2017)” when you click on it.
At the top of the table, click Download As: CSV or XLSX. I got a file named lex.xlsx
Click the header (“GDP per capita”) and select instead “Life Expectancy, at birth”. Download that as well. I got a file named gdp_pcap.xlsx

gapminder_gdp = pd.read_excel("data/gdp_pcap.xlsx")
gapminder_gdp.head()

	country	1800	1801	1802	1803	1804	1805	1806	1807	1808	...	2091	2092	2093	2094	2095	2096	2097	2098	2099	2100
0	Afghanistan	583	583	583	583	583	583	583	583	583	...	3570	3700	3720	3830	3930	4030	4130	3900	4070	4030
1	Angola	452	454	456	458	460	462	464	466	468	...	20.6k	21.1k	21.7k	21.9k	22.3k	22.5k	23k	23.5k	24k	24.6k
2	Albania	569	571	572	574	576	577	579	581	582	...	37.6k	38.5k	39.4k	39.9k	40.9k	41.9k	41.4k	42.4k	43.7k	45.7k
3	Andorra	1660	1660	1670	1670	1670	1680	1680	1680	1690	...	59.2k	60k	60.3k	61k	61.7k	62.7k	61.9k	62.5k	62.9k	62.1k
4	UAE	1390	1390	1390	1400	1410	1410	1420	1420	1430	...	86.2k	85.4k	84.1k	84.3k	84.6k	83.9k	84.2k	84k	84.1k	84.3k

5 rows × 302 columns

gapminder_life = pd.read_excel("data/lex.xlsx")
gapminder_life.head()

	country	1800	1801	1802	1803	1804	1805	1806	1807	1808	...	2091	2092	2093	2094	2095	2096	2097	2098	2099	2100
0	Afghanistan	28.2	28.2	28.2	28.2	28.2	28.2	28.1	28.1	28.1	...	75.5	75.7	75.8	76.0	76.1	76.2	76.4	76.5	76.6	76.8
1	Angola	27.0	27.0	27.0	27.0	27.0	27.0	27.0	27.0	27.0	...	78.8	79.0	79.1	79.2	79.3	79.5	79.6	79.7	79.9	80.0
2	Albania	35.4	35.4	35.4	35.4	35.4	35.4	35.4	35.4	35.4	...	87.4	87.5	87.6	87.7	87.8	87.9	88.0	88.2	88.3	88.4
3	Andorra	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	UAE	30.7	30.7	30.7	30.7	30.7	30.7	30.7	30.7	30.7	...	82.4	82.5	82.6	82.7	82.8	82.9	83.0	83.1	83.2	83.3

5 rows × 302 columns

Wide to Long

Notice that the data is in “wide” format, with each year as a column. To join and plot this data, we’ll need it in “long” format, with each year as a row:

	country	year	gdp_pcap
0	Afghanistan	1800	583
1	Angola	1800	452
2	Albania	1800	569
3	Andorra	1800	1660
4	UAE	1800	1390

We’ll use pd.melt to get it in this format. We’ll need to tell it which columns form the primary key, specified by the id_vars argument. The other columns will be “melted” into a single column, specified by the var_name argument. The values for those columns will be put in a column specified by the value_name argument.

pd.melt(
    gapminder_gdp,
    id_vars=[...],
    var_name="XXX", value_name="YYY"
)

Fill in the blanks above with the appropriate column names. Then assign the result to a variable; I used gdp_long.

Now repeat that for the life expectancy data:

	country	year	life
0	Afghanistan	1800	28.2
1	Angola	1800	27.0
2	Albania	1800	35.4
3	Andorra	1800	NaN
4	UAE	1800	30.7

Data Types

Check the data types of the gdp_long table. There are two that are not what you’d expect.

gdp_long.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58695 entries, 0 to 58694
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   country   58695 non-null  object
 1   year      58695 non-null  object
 2   gdp_pcap  58695 non-null  object
dtypes: object(3)
memory usage: 1.3+ MB

year is easy to fix:

gdp_long['year'] = gdp_long['year'].astype(int)
life_long['year'] = life_long['year'].astype(int)

A more tricky issue is that gdp_pcap uses k to mean “thousands”. For example:

gdp_long.tail()

	country	year	gdp_pcap
58690	Samoa	2100	16k
58691	Yemen	2100	5730
58692	South Africa	2100	42.7k
58693	Zambia	2100	12.8k
58694	Zimbabwe	2100	7380

We need to convert those k values to numbers. Here’s an approach:

def parse_number_with_units(num):
    if not isinstance(num, str):
        return num
    if num.endswith("k"):
        return float(num[:-1]) * 1000
    return float(num)
gdp_long['gdp_pcap'] = gdp_long['gdp_pcap'].map(parse_number_with_units)
gdp_long.tail()

	country	year	gdp_pcap
58690	Samoa	2100	16000.0
58691	Yemen	2100	5730.0
58692	South Africa	2100	42700.0
58693	Zambia	2100	12800.0
58694	Zimbabwe	2100	7380.0

Joining

Now, join the datasets together. Think about what the primary key of this data is. Call the result gapminder:

	country	year	gdp_pcap	life
0	Afghanistan	1800	583.0	28.2
1	Angola	1800	452.0	27.0
2	Albania	1800	569.0	35.4
3	Andorra	1800	1660.0	NaN
4	UAE	1800	1390.0	30.7

Validate your join by checking the number of rows in the result.

195 countries, 301 years
n_unique_countries * n_years=58695 expected rows
58695 rows in result

Adding regions

The region data is available here. read_excel can pull the data in directly for you. I’ll save you some time by giving you that code:

region_data = pd.read_excel("https://docs.google.com/spreadsheets/d/1qHalit8sXC0R8oVXibc2wa2gY7bkwGzOybEMTWp-08o/export?format=xlsx", sheet_name="list-of-countries-etc")
region_data.head()

	geo	name	four_regions	eight_regions	six_regions	members_oecd_g77	Latitude	Longitude	UN member since	World bank region	World bank, 4 income groups 2017	World bank, 3 income groups 2017	UNHCR
0	aus	Australia	asia	east_asia_pacific	east_asia_pacific	oecd	-25.0	135.00000	1945-11-01	East Asia & Pacific	High income	High income	Asia and the Pacific
1	brn	Brunei	asia	east_asia_pacific	east_asia_pacific	g77	4.5	114.66667	1984-09-21	East Asia & Pacific	High income	High income	Asia and the Pacific
2	khm	Cambodia	asia	east_asia_pacific	east_asia_pacific	g77	13.0	105.00000	1955-12-14	East Asia & Pacific	Lower middle income	Middle income	Asia and the Pacific
3	chn	China	asia	east_asia_pacific	east_asia_pacific	g77	35.0	105.00000	1945-10-24	East Asia & Pacific	Upper middle income	Middle income	Asia and the Pacific
4	fji	Fiji	asia	east_asia_pacific	east_asia_pacific	g77	-18.0	178.00000	1970-10-13	East Asia & Pacific	Upper middle income	Middle income	Asia and the Pacific

It has a lot of columns; we only want four_regions. Rename it to region. You might also want to rename the name column so it has the same name as the corresponding column in the gapminder dataframe. Then select the columns we want. Then join it with the gapminder data.

To rename, use data_frame.rename(columns={"old_name": "new_name"}). To select columns, use df[["col1", "col2"]]. Remember that both of these return a new dataframe, so you can chain the operations together, then assign the final result to a variable (I used gapminder_with_regions).

Debugging Your Join

When you first do a join, it might be a good idea to set indicator=True and use an outer join. This shows us a few interesting things:

gapminder_with_regions.head()

	country	region	year	gdp_pcap	life	_merge
0	Australia	asia	1800.0	995.0	34.0	both
1	Australia	asia	1801.0	1000.0	34.0	both
2	Australia	asia	1802.0	1010.0	34.0	both
3	Australia	asia	1803.0	1010.0	34.0	both
4	Australia	asia	1804.0	1020.0	34.0	both

A value of both for _merge indicates that the row comes from a key that was in both tables. But there are also rows that come from only one table or the other. For example:

gapminder_with_regions.query('_merge != "both"').head()

	country	region	year	gdp_pcap	life	_merge
17458	Liechtenstein	europe	NaN	NaN	NaN	left_only
18061	Macedonia, FYR	europe	NaN	NaN	NaN	left_only
24082	United Kingdom	europe	NaN	NaN	NaN	left_only
39735	United Arab Emirates	asia	NaN	NaN	NaN	left_only
40639	United States	americas	NaN	NaN	NaN	left_only

I had put the gapminder_with_regions dataframe on the left side of the join, and gapminder on the right. Think about what the table above is telling us.

Let’s look at that in aggregate.

(
    gapminder_with_regions
    .groupby(['country', '_merge'], as_index=False)
    .size()
    .query('_merge != "both" and size > 0')
)

	country	_merge	size
169	Eswatini	right_only	301
219	Holy See	left_only	1
297	Liechtenstein	left_only	1
306	Macedonia, FYR	left_only	1
391	North Macedonia	right_only	301
510	Swaziland	left_only	1
559	UAE	right_only	301
562	UK	right_only	301
565	USA	right_only	301
573	United Arab Emirates	left_only	1
576	United Kingdom	left_only	1
579	United States	left_only	1

So: there are some countries that are in one dataset but not the other, and vice versa.

But notice that some countries are just written differently, e.g.,“United States” vs “USA”. We can use replace to fix that:

county_translations = {
    "USA": "United States",
    "UK": "United Kingdom",
    "UAE": "United Arab Emirates",
}
gapminder['country'] = gapminder['country'].replace(county_translations)

Note

Think: Can you do this with a merge instead?

Plotting

Now you can make the plot. By now you should be able to do this without further guidance.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 58702 entries, 0 to 58701
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   country   58702 non-null  object  
 1   region    57197 non-null  object  
 2   year      58695 non-null  float64 
 3   gdp_pcap  58695 non-null  float64 
 4   life      56616 non-null  float64 
 5   _merge    58702 non-null  category
dtypes: category(1), float64(3), object(2)
memory usage: 2.7+ MB