Joining Data

import pandas as pd
import numpy as np

Data: Women in science

Information on 10 women in science who changed the world

professions[['name']]
name
0 Ada Lovelace
1 Marie Curie
2 Janaki Ammal
3 Chien-Shiung Wu
4 Katherine Johnson
5 Rosalind Franklin
6 Vera Rubin
7 Gladys West
8 Flossie Wong-Staal
9 Jennifer Doudna

Inputs

professions
name profession
0 Ada Lovelace Mathematician
1 Marie Curie Physicist and Chemist
2 Janaki Ammal Botanist
3 Chien-Shiung Wu Physicist
4 Katherine Johnson Mathematician
5 Rosalind Franklin Chemist
6 Vera Rubin Astronomer
7 Gladys West Mathematician
8 Flossie Wong-Staal Virologist and Molecular Biologist
9 Jennifer Doudna Biochemist
dates
name birth_year death_year
0 Janaki Ammal 1897 1984
1 Chien-Shiung Wu 1912 1997
2 Katherine Johnson 1918 2020
3 Rosalind Franklin 1920 1958
4 Vera Rubin 1928 2016
5 Gladys West 1930 <NA>
6 Flossie Wong-Staal 1947 2020
7 Jennifer Doudna 1964 <NA>
8 Barbara Liskov 1939 <NA>
notability
name known_for
0 Ada Lovelace first computer algorithm
1 Chien-Shiung Wu confim and refine theory of radioactive beta d...
2 Flossie Wong-Staal first scientist to clone HIV and create a map ...
3 Gladys West mathematical modeling of the shape of the Eart...
4 Janaki Ammal hybrid species, biodiversity protection
5 Jennifer Doudna one of the primary developers of CRISPR, a gro...
6 Katherine Johnson calculations of orbital mechanics critical to ...
7 Marie Curie theory of radioactivity, discovery of element...
8 Vera Rubin existence of dark matter

Desired Output

name profession birth_year death_year known_for
0 Ada Lovelace Mathematician <NA> <NA> first computer algorithm
1 Marie Curie Physicist and Chemist <NA> <NA> theory of radioactivity, discovery of element...
2 Janaki Ammal Botanist 1897 1984 hybrid species, biodiversity protection
3 Chien-Shiung Wu Physicist 1912 1997 confim and refine theory of radioactive beta d...
4 Katherine Johnson Mathematician 1918 2020 calculations of orbital mechanics critical to ...
5 Rosalind Franklin Chemist 1920 1958 NaN
6 Vera Rubin Astronomer 1928 2016 existence of dark matter
7 Gladys West Mathematician 1930 <NA> mathematical modeling of the shape of the Eart...
8 Flossie Wong-Staal Virologist and Molecular Biologist 1947 2020 first scientist to clone HIV and create a map ...
9 Jennifer Doudna Biochemist 1964 <NA> one of the primary developers of CRISPR, a gro...

First try: paste them together

name profession birth_year death_year known_for
0 Ada Lovelace Mathematician 1897 1984 first computer algorithm
1 Marie Curie Physicist and Chemist 1912 1997 confim and refine theory of radioactive beta d...
2 Janaki Ammal Botanist 1918 2020 first scientist to clone HIV and create a map ...
3 Chien-Shiung Wu Physicist 1920 1958 mathematical modeling of the shape of the Eart...
4 Katherine Johnson Mathematician 1928 2016 hybrid species, biodiversity protection
5 Rosalind Franklin Chemist 1930 <NA> one of the primary developers of CRISPR, a gro...
6 Vera Rubin Astronomer 1947 2020 calculations of orbital mechanics critical to ...
7 Gladys West Mathematician 1964 <NA> theory of radioactivity, discovery of element...
8 Flossie Wong-Staal Virologist and Molecular Biologist 1939 <NA> existence of dark matter
9 Jennifer Doudna Biochemist <NA> <NA> NaN

What was wrong?

How do we know which row in dates corresponds to which row in professions?

We need a key: some column(s) that uniquely identify each row

What could we use as a key for this data?

Joining using a key

professions
name profession
0 Ada Lovelace Mathematician
1 Marie Curie Physicist and Chemist
2 Janaki Ammal Botanist
3 Chien-Shiung Wu Physicist
4 Katherine Johnson Mathematician
5 Rosalind Franklin Chemist
6 Vera Rubin Astronomer
7 Gladys West Mathematician
8 Flossie Wong-Staal Virologist and Molecular Biologist
9 Jennifer Doudna Biochemist
dates
name birth_year death_year
0 Janaki Ammal 1897 1984
1 Chien-Shiung Wu 1912 1997
2 Katherine Johnson 1918 2020
3 Rosalind Franklin 1920 1958
4 Vera Rubin 1928 2016
5 Gladys West 1930 <NA>
6 Flossie Wong-Staal 1947 2020
7 Jennifer Doudna 1964 <NA>
8 Barbara Liskov 1939 <NA>
pd.merge(professions, dates, on='name', how='left')
name profession birth_year death_year
0 Ada Lovelace Mathematician <NA> <NA>
1 Marie Curie Physicist and Chemist <NA> <NA>
2 Janaki Ammal Botanist 1897 1984
3 Chien-Shiung Wu Physicist 1912 1997
4 Katherine Johnson Mathematician 1918 2020
5 Rosalind Franklin Chemist 1920 1958
6 Vera Rubin Astronomer 1928 2016
7 Gladys West Mathematician 1930 <NA>
8 Flossie Wong-Staal Virologist and Molecular Biologist 1947 2020
9 Jennifer Doudna Biochemist 1964 <NA>

When I first tried joining this data…

professions
name profession
0 Ada Lovelace Mathematician
1 Marie Curie Physicist and Chemist
2 Janaki Ammal Botanist
3 Chien-Shiung Wu Physicist
4 Katherine Johnson Mathematician
5 Rosalind Franklin Chemist
6 Vera Rubin Astronomer
7 Gladys West Mathematician
8 Flossie Wong-Staal Virologist and Molecular Biologist
9 Jennifer Doudna Biochemist
dates
name birth_year death_year
0 Janaki Ammal 1897 1984
1 Chien-Shiung Wu 1912 1997
2 Katherine Johnson 1918 2020
3 Rosalind Franklin 1920 1958
4 Vera Rubin 1928 2016
5 Gladys West 1930 <NA>
6 Flossie Wong-Staal 1947 2020
7 Jennifer Doudna 1964 <NA>
8 Barbara Liskov 1939 <NA>
pd.merge(professions, dates, on='name')
name profession birth_year death_year
0 Janaki Ammal Botanist 1897 1984
1 Chien-Shiung Wu Physicist 1912 1997
2 Katherine Johnson Mathematician 1918 2020
3 Rosalind Franklin Chemist 1920 1958
4 Vera Rubin Astronomer 1928 2016
5 Gladys West Mathematician 1930 <NA>
6 Flossie Wong-Staal Virologist and Molecular Biologist 1947 2020
7 Jennifer Doudna Biochemist 1964 <NA>

Something missing?

Kinds of joins

  • Left join: use keys from left table
  • Right join: use keys from right table
  • Outer join: use union of both sets of keys
  • Inner join: use intersection of both sets of keys

See documentation

Different join types on this data

pd.merge(professions, dates, on='name', how='left')
name profession birth_year death_year
0 Ada Lovelace Mathematician <NA> <NA>
1 Marie Curie Physicist and Chemist <NA> <NA>
2 Janaki Ammal Botanist 1897 1984
3 Chien-Shiung Wu Physicist 1912 1997
4 Katherine Johnson Mathematician 1918 2020
5 Rosalind Franklin Chemist 1920 1958
6 Vera Rubin Astronomer 1928 2016
7 Gladys West Mathematician 1930 <NA>
8 Flossie Wong-Staal Virologist and Molecular Biologist 1947 2020
9 Jennifer Doudna Biochemist 1964 <NA>
pd.merge(professions, dates, on='name', how='right')
name profession birth_year death_year
0 Janaki Ammal Botanist 1897 1984
1 Chien-Shiung Wu Physicist 1912 1997
2 Katherine Johnson Mathematician 1918 2020
3 Rosalind Franklin Chemist 1920 1958
4 Vera Rubin Astronomer 1928 2016
5 Gladys West Mathematician 1930 <NA>
6 Flossie Wong-Staal Virologist and Molecular Biologist 1947 2020
7 Jennifer Doudna Biochemist 1964 <NA>
8 Barbara Liskov NaN 1939 <NA>
pd.merge(professions, dates, on='name', how='outer')
name profession birth_year death_year
0 Ada Lovelace Mathematician <NA> <NA>
1 Marie Curie Physicist and Chemist <NA> <NA>
2 Janaki Ammal Botanist 1897 1984
3 Chien-Shiung Wu Physicist 1912 1997
4 Katherine Johnson Mathematician 1918 2020
5 Rosalind Franklin Chemist 1920 1958
6 Vera Rubin Astronomer 1928 2016
7 Gladys West Mathematician 1930 <NA>
8 Flossie Wong-Staal Virologist and Molecular Biologist 1947 2020
9 Jennifer Doudna Biochemist 1964 <NA>
10 Barbara Liskov NaN 1939 <NA>
pd.merge(professions, dates, on='name', how='inner')
name profession birth_year death_year
0 Janaki Ammal Botanist 1897 1984
1 Chien-Shiung Wu Physicist 1912 1997
2 Katherine Johnson Mathematician 1918 2020
3 Rosalind Franklin Chemist 1920 1958
4 Vera Rubin Astronomer 1928 2016
5 Gladys West Mathematician 1930 <NA>
6 Flossie Wong-Staal Virologist and Molecular Biologist 1947 2020
7 Jennifer Doudna Biochemist 1964 <NA>

Draw a Venn diagram of the keys of these two tables.

Think

Which join types can end up not including some data?

Including notability

professions
name profession
0 Ada Lovelace Mathematician
1 Marie Curie Physicist and Chemist
2 Janaki Ammal Botanist
3 Chien-Shiung Wu Physicist
4 Katherine Johnson Mathematician
5 Rosalind Franklin Chemist
6 Vera Rubin Astronomer
7 Gladys West Mathematician
8 Flossie Wong-Staal Virologist and Molecular Biologist
9 Jennifer Doudna Biochemist
(
    professions
    .merge(dates, on='name', how='left')
)
name profession birth_year death_year
0 Ada Lovelace Mathematician <NA> <NA>
1 Marie Curie Physicist and Chemist <NA> <NA>
2 Janaki Ammal Botanist 1897 1984
3 Chien-Shiung Wu Physicist 1912 1997
4 Katherine Johnson Mathematician 1918 2020
5 Rosalind Franklin Chemist 1920 1958
6 Vera Rubin Astronomer 1928 2016
7 Gladys West Mathematician 1930 <NA>
8 Flossie Wong-Staal Virologist and Molecular Biologist 1947 2020
9 Jennifer Doudna Biochemist 1964 <NA>
(
    professions
    .merge(dates, on='name', how='left')
    .merge(notability, on='name', how='left')
)
name profession birth_year death_year known_for
0 Ada Lovelace Mathematician <NA> <NA> first computer algorithm
1 Marie Curie Physicist and Chemist <NA> <NA> theory of radioactivity, discovery of element...
2 Janaki Ammal Botanist 1897 1984 hybrid species, biodiversity protection
3 Chien-Shiung Wu Physicist 1912 1997 confim and refine theory of radioactive beta d...
4 Katherine Johnson Mathematician 1918 2020 calculations of orbital mechanics critical to ...
5 Rosalind Franklin Chemist 1920 1958 NaN
6 Vera Rubin Astronomer 1928 2016 existence of dark matter
7 Gladys West Mathematician 1930 <NA> mathematical modeling of the shape of the Eart...
8 Flossie Wong-Staal Virologist and Molecular Biologist 1947 2020 first scientist to clone HIV and create a map ...
9 Jennifer Doudna Biochemist 1964 <NA> one of the primary developers of CRISPR, a gro...

What if there are multiple matches?

professions_multi
name profession
0 Ada Lovelace Mathematician
1 Marie Curie Physicist
2 Marie Curie Chemist
3 Janaki Ammal Botanist
4 Chien-Shiung Wu Physicist
5 Katherine Johnson Mathematician
6 Rosalind Franklin Chemist
7 Vera Rubin Astronomer
8 Gladys West Mathematician
9 Flossie Wong-Staal Virologist
10 Flossie Wong-Staal Molecular Biologist
11 Jennifer Doudna Biochemist
dates
name birth_year death_year
0 Janaki Ammal 1897 1984
1 Chien-Shiung Wu 1912 1997
2 Katherine Johnson 1918 2020
3 Rosalind Franklin 1920 1958
4 Vera Rubin 1928 2016
5 Gladys West 1930 <NA>
6 Flossie Wong-Staal 1947 2020
7 Jennifer Doudna 1964 <NA>
8 Barbara Liskov 1939 <NA>
pd.merge(professions_multi, dates, on='name', how='left')
name profession birth_year death_year
0 Ada Lovelace Mathematician <NA> <NA>
1 Marie Curie Physicist <NA> <NA>
2 Marie Curie Chemist <NA> <NA>
3 Janaki Ammal Botanist 1897 1984
4 Chien-Shiung Wu Physicist 1912 1997
5 Katherine Johnson Mathematician 1918 2020
6 Rosalind Franklin Chemist 1920 1958
7 Vera Rubin Astronomer 1928 2016
8 Gladys West Mathematician 1930 <NA>
9 Flossie Wong-Staal Virologist 1947 2020
10 Flossie Wong-Staal Molecular Biologist 1947 2020
11 Jennifer Doudna Biochemist 1964 <NA>

Concatenating

Here was the “first try” from earlier:

pd.concat([
    professions,
    dates
], axis=1)
name profession name birth_year death_year
0 Ada Lovelace Mathematician Janaki Ammal 1897 1984
1 Marie Curie Physicist and Chemist Chien-Shiung Wu 1912 1997
2 Janaki Ammal Botanist Katherine Johnson 1918 2020
3 Chien-Shiung Wu Physicist Rosalind Franklin 1920 1958
4 Katherine Johnson Mathematician Vera Rubin 1928 2016
5 Rosalind Franklin Chemist Gladys West 1930 <NA>
6 Vera Rubin Astronomer Flossie Wong-Staal 1947 2020
7 Gladys West Mathematician Jennifer Doudna 1964 <NA>
8 Flossie Wong-Staal Virologist and Molecular Biologist Barbara Liskov 1939 <NA>
9 Jennifer Doudna Biochemist NaN <NA> <NA>

That was incorrect; we should have used a key.

Concatenate when tables have same structure

Common example: one table per year.

All tables have the same structure (verify this!), so concatenating rows is ok.

dataframes = [
    pd.read_csv(f"../../static/data/airbnb/airbnb_{year}.csv")
    for year in range(2015, 2021)
]
pd.concat(dataframes)