import pandas as pd
import numpy as np
Data: Women in science
Information on 10 women in science who changed the world
0 |
Ada Lovelace |
1 |
Marie Curie |
2 |
Janaki Ammal |
3 |
Chien-Shiung Wu |
4 |
Katherine Johnson |
5 |
Rosalind Franklin |
6 |
Vera Rubin |
7 |
Gladys West |
8 |
Flossie Wong-Staal |
9 |
Jennifer Doudna |
Desired Output
0 |
Ada Lovelace |
Mathematician |
<NA> |
<NA> |
first computer algorithm |
1 |
Marie Curie |
Physicist and Chemist |
<NA> |
<NA> |
theory of radioactivity, discovery of element... |
2 |
Janaki Ammal |
Botanist |
1897 |
1984 |
hybrid species, biodiversity protection |
3 |
Chien-Shiung Wu |
Physicist |
1912 |
1997 |
confim and refine theory of radioactive beta d... |
4 |
Katherine Johnson |
Mathematician |
1918 |
2020 |
calculations of orbital mechanics critical to ... |
5 |
Rosalind Franklin |
Chemist |
1920 |
1958 |
NaN |
6 |
Vera Rubin |
Astronomer |
1928 |
2016 |
existence of dark matter |
7 |
Gladys West |
Mathematician |
1930 |
<NA> |
mathematical modeling of the shape of the Eart... |
8 |
Flossie Wong-Staal |
Virologist and Molecular Biologist |
1947 |
2020 |
first scientist to clone HIV and create a map ... |
9 |
Jennifer Doudna |
Biochemist |
1964 |
<NA> |
one of the primary developers of CRISPR, a gro... |
First try: paste them together
0 |
Ada Lovelace |
Mathematician |
1897 |
1984 |
first computer algorithm |
1 |
Marie Curie |
Physicist and Chemist |
1912 |
1997 |
confim and refine theory of radioactive beta d... |
2 |
Janaki Ammal |
Botanist |
1918 |
2020 |
first scientist to clone HIV and create a map ... |
3 |
Chien-Shiung Wu |
Physicist |
1920 |
1958 |
mathematical modeling of the shape of the Eart... |
4 |
Katherine Johnson |
Mathematician |
1928 |
2016 |
hybrid species, biodiversity protection |
5 |
Rosalind Franklin |
Chemist |
1930 |
<NA> |
one of the primary developers of CRISPR, a gro... |
6 |
Vera Rubin |
Astronomer |
1947 |
2020 |
calculations of orbital mechanics critical to ... |
7 |
Gladys West |
Mathematician |
1964 |
<NA> |
theory of radioactivity, discovery of element... |
8 |
Flossie Wong-Staal |
Virologist and Molecular Biologist |
1939 |
<NA> |
existence of dark matter |
9 |
Jennifer Doudna |
Biochemist |
<NA> |
<NA> |
NaN |
What was wrong?
How do we know which row in dates
corresponds to which row in professions
?
We need a key: some column(s) that uniquely identify each row
What could we use as a key for this data?
Joining using a key
0 |
Ada Lovelace |
Mathematician |
1 |
Marie Curie |
Physicist and Chemist |
2 |
Janaki Ammal |
Botanist |
3 |
Chien-Shiung Wu |
Physicist |
4 |
Katherine Johnson |
Mathematician |
5 |
Rosalind Franklin |
Chemist |
6 |
Vera Rubin |
Astronomer |
7 |
Gladys West |
Mathematician |
8 |
Flossie Wong-Staal |
Virologist and Molecular Biologist |
9 |
Jennifer Doudna |
Biochemist |
0 |
Janaki Ammal |
1897 |
1984 |
1 |
Chien-Shiung Wu |
1912 |
1997 |
2 |
Katherine Johnson |
1918 |
2020 |
3 |
Rosalind Franklin |
1920 |
1958 |
4 |
Vera Rubin |
1928 |
2016 |
5 |
Gladys West |
1930 |
<NA> |
6 |
Flossie Wong-Staal |
1947 |
2020 |
7 |
Jennifer Doudna |
1964 |
<NA> |
8 |
Barbara Liskov |
1939 |
<NA> |
pd.merge(professions, dates, on='name', how='left')
0 |
Ada Lovelace |
Mathematician |
<NA> |
<NA> |
1 |
Marie Curie |
Physicist and Chemist |
<NA> |
<NA> |
2 |
Janaki Ammal |
Botanist |
1897 |
1984 |
3 |
Chien-Shiung Wu |
Physicist |
1912 |
1997 |
4 |
Katherine Johnson |
Mathematician |
1918 |
2020 |
5 |
Rosalind Franklin |
Chemist |
1920 |
1958 |
6 |
Vera Rubin |
Astronomer |
1928 |
2016 |
7 |
Gladys West |
Mathematician |
1930 |
<NA> |
8 |
Flossie Wong-Staal |
Virologist and Molecular Biologist |
1947 |
2020 |
9 |
Jennifer Doudna |
Biochemist |
1964 |
<NA> |
When I first tried joining this data…
0 |
Ada Lovelace |
Mathematician |
1 |
Marie Curie |
Physicist and Chemist |
2 |
Janaki Ammal |
Botanist |
3 |
Chien-Shiung Wu |
Physicist |
4 |
Katherine Johnson |
Mathematician |
5 |
Rosalind Franklin |
Chemist |
6 |
Vera Rubin |
Astronomer |
7 |
Gladys West |
Mathematician |
8 |
Flossie Wong-Staal |
Virologist and Molecular Biologist |
9 |
Jennifer Doudna |
Biochemist |
0 |
Janaki Ammal |
1897 |
1984 |
1 |
Chien-Shiung Wu |
1912 |
1997 |
2 |
Katherine Johnson |
1918 |
2020 |
3 |
Rosalind Franklin |
1920 |
1958 |
4 |
Vera Rubin |
1928 |
2016 |
5 |
Gladys West |
1930 |
<NA> |
6 |
Flossie Wong-Staal |
1947 |
2020 |
7 |
Jennifer Doudna |
1964 |
<NA> |
8 |
Barbara Liskov |
1939 |
<NA> |
pd.merge(professions, dates, on='name')
0 |
Janaki Ammal |
Botanist |
1897 |
1984 |
1 |
Chien-Shiung Wu |
Physicist |
1912 |
1997 |
2 |
Katherine Johnson |
Mathematician |
1918 |
2020 |
3 |
Rosalind Franklin |
Chemist |
1920 |
1958 |
4 |
Vera Rubin |
Astronomer |
1928 |
2016 |
5 |
Gladys West |
Mathematician |
1930 |
<NA> |
6 |
Flossie Wong-Staal |
Virologist and Molecular Biologist |
1947 |
2020 |
7 |
Jennifer Doudna |
Biochemist |
1964 |
<NA> |
Something missing?
Kinds of joins
- Left join: use keys from left table
- Right join: use keys from right table
- Outer join: use union of both sets of keys
- Inner join: use intersection of both sets of keys
See documentation
Different join types on this data
pd.merge(professions, dates, on='name', how='left')
0 |
Ada Lovelace |
Mathematician |
<NA> |
<NA> |
1 |
Marie Curie |
Physicist and Chemist |
<NA> |
<NA> |
2 |
Janaki Ammal |
Botanist |
1897 |
1984 |
3 |
Chien-Shiung Wu |
Physicist |
1912 |
1997 |
4 |
Katherine Johnson |
Mathematician |
1918 |
2020 |
5 |
Rosalind Franklin |
Chemist |
1920 |
1958 |
6 |
Vera Rubin |
Astronomer |
1928 |
2016 |
7 |
Gladys West |
Mathematician |
1930 |
<NA> |
8 |
Flossie Wong-Staal |
Virologist and Molecular Biologist |
1947 |
2020 |
9 |
Jennifer Doudna |
Biochemist |
1964 |
<NA> |
pd.merge(professions, dates, on='name', how='right')
0 |
Janaki Ammal |
Botanist |
1897 |
1984 |
1 |
Chien-Shiung Wu |
Physicist |
1912 |
1997 |
2 |
Katherine Johnson |
Mathematician |
1918 |
2020 |
3 |
Rosalind Franklin |
Chemist |
1920 |
1958 |
4 |
Vera Rubin |
Astronomer |
1928 |
2016 |
5 |
Gladys West |
Mathematician |
1930 |
<NA> |
6 |
Flossie Wong-Staal |
Virologist and Molecular Biologist |
1947 |
2020 |
7 |
Jennifer Doudna |
Biochemist |
1964 |
<NA> |
8 |
Barbara Liskov |
NaN |
1939 |
<NA> |
pd.merge(professions, dates, on='name', how='outer')
0 |
Ada Lovelace |
Mathematician |
<NA> |
<NA> |
1 |
Marie Curie |
Physicist and Chemist |
<NA> |
<NA> |
2 |
Janaki Ammal |
Botanist |
1897 |
1984 |
3 |
Chien-Shiung Wu |
Physicist |
1912 |
1997 |
4 |
Katherine Johnson |
Mathematician |
1918 |
2020 |
5 |
Rosalind Franklin |
Chemist |
1920 |
1958 |
6 |
Vera Rubin |
Astronomer |
1928 |
2016 |
7 |
Gladys West |
Mathematician |
1930 |
<NA> |
8 |
Flossie Wong-Staal |
Virologist and Molecular Biologist |
1947 |
2020 |
9 |
Jennifer Doudna |
Biochemist |
1964 |
<NA> |
10 |
Barbara Liskov |
NaN |
1939 |
<NA> |
pd.merge(professions, dates, on='name', how='inner')
0 |
Janaki Ammal |
Botanist |
1897 |
1984 |
1 |
Chien-Shiung Wu |
Physicist |
1912 |
1997 |
2 |
Katherine Johnson |
Mathematician |
1918 |
2020 |
3 |
Rosalind Franklin |
Chemist |
1920 |
1958 |
4 |
Vera Rubin |
Astronomer |
1928 |
2016 |
5 |
Gladys West |
Mathematician |
1930 |
<NA> |
6 |
Flossie Wong-Staal |
Virologist and Molecular Biologist |
1947 |
2020 |
7 |
Jennifer Doudna |
Biochemist |
1964 |
<NA> |
Draw a Venn diagram of the keys of these two tables.
Think
Which join types can end up not including some data?
Including notability
0 |
Ada Lovelace |
Mathematician |
1 |
Marie Curie |
Physicist and Chemist |
2 |
Janaki Ammal |
Botanist |
3 |
Chien-Shiung Wu |
Physicist |
4 |
Katherine Johnson |
Mathematician |
5 |
Rosalind Franklin |
Chemist |
6 |
Vera Rubin |
Astronomer |
7 |
Gladys West |
Mathematician |
8 |
Flossie Wong-Staal |
Virologist and Molecular Biologist |
9 |
Jennifer Doudna |
Biochemist |
(
professions
.merge(dates, on='name', how='left')
)
0 |
Ada Lovelace |
Mathematician |
<NA> |
<NA> |
1 |
Marie Curie |
Physicist and Chemist |
<NA> |
<NA> |
2 |
Janaki Ammal |
Botanist |
1897 |
1984 |
3 |
Chien-Shiung Wu |
Physicist |
1912 |
1997 |
4 |
Katherine Johnson |
Mathematician |
1918 |
2020 |
5 |
Rosalind Franklin |
Chemist |
1920 |
1958 |
6 |
Vera Rubin |
Astronomer |
1928 |
2016 |
7 |
Gladys West |
Mathematician |
1930 |
<NA> |
8 |
Flossie Wong-Staal |
Virologist and Molecular Biologist |
1947 |
2020 |
9 |
Jennifer Doudna |
Biochemist |
1964 |
<NA> |
(
professions
.merge(dates, on='name', how='left')
.merge(notability, on='name', how='left')
)
0 |
Ada Lovelace |
Mathematician |
<NA> |
<NA> |
first computer algorithm |
1 |
Marie Curie |
Physicist and Chemist |
<NA> |
<NA> |
theory of radioactivity, discovery of element... |
2 |
Janaki Ammal |
Botanist |
1897 |
1984 |
hybrid species, biodiversity protection |
3 |
Chien-Shiung Wu |
Physicist |
1912 |
1997 |
confim and refine theory of radioactive beta d... |
4 |
Katherine Johnson |
Mathematician |
1918 |
2020 |
calculations of orbital mechanics critical to ... |
5 |
Rosalind Franklin |
Chemist |
1920 |
1958 |
NaN |
6 |
Vera Rubin |
Astronomer |
1928 |
2016 |
existence of dark matter |
7 |
Gladys West |
Mathematician |
1930 |
<NA> |
mathematical modeling of the shape of the Eart... |
8 |
Flossie Wong-Staal |
Virologist and Molecular Biologist |
1947 |
2020 |
first scientist to clone HIV and create a map ... |
9 |
Jennifer Doudna |
Biochemist |
1964 |
<NA> |
one of the primary developers of CRISPR, a gro... |
What if there are multiple matches?
0 |
Ada Lovelace |
Mathematician |
1 |
Marie Curie |
Physicist |
2 |
Marie Curie |
Chemist |
3 |
Janaki Ammal |
Botanist |
4 |
Chien-Shiung Wu |
Physicist |
5 |
Katherine Johnson |
Mathematician |
6 |
Rosalind Franklin |
Chemist |
7 |
Vera Rubin |
Astronomer |
8 |
Gladys West |
Mathematician |
9 |
Flossie Wong-Staal |
Virologist |
10 |
Flossie Wong-Staal |
Molecular Biologist |
11 |
Jennifer Doudna |
Biochemist |
0 |
Janaki Ammal |
1897 |
1984 |
1 |
Chien-Shiung Wu |
1912 |
1997 |
2 |
Katherine Johnson |
1918 |
2020 |
3 |
Rosalind Franklin |
1920 |
1958 |
4 |
Vera Rubin |
1928 |
2016 |
5 |
Gladys West |
1930 |
<NA> |
6 |
Flossie Wong-Staal |
1947 |
2020 |
7 |
Jennifer Doudna |
1964 |
<NA> |
8 |
Barbara Liskov |
1939 |
<NA> |
pd.merge(professions_multi, dates, on='name', how='left')
0 |
Ada Lovelace |
Mathematician |
<NA> |
<NA> |
1 |
Marie Curie |
Physicist |
<NA> |
<NA> |
2 |
Marie Curie |
Chemist |
<NA> |
<NA> |
3 |
Janaki Ammal |
Botanist |
1897 |
1984 |
4 |
Chien-Shiung Wu |
Physicist |
1912 |
1997 |
5 |
Katherine Johnson |
Mathematician |
1918 |
2020 |
6 |
Rosalind Franklin |
Chemist |
1920 |
1958 |
7 |
Vera Rubin |
Astronomer |
1928 |
2016 |
8 |
Gladys West |
Mathematician |
1930 |
<NA> |
9 |
Flossie Wong-Staal |
Virologist |
1947 |
2020 |
10 |
Flossie Wong-Staal |
Molecular Biologist |
1947 |
2020 |
11 |
Jennifer Doudna |
Biochemist |
1964 |
<NA> |
Concatenating
Here was the “first try” from earlier:
pd.concat([
professions,
dates
], axis=1)
0 |
Ada Lovelace |
Mathematician |
Janaki Ammal |
1897 |
1984 |
1 |
Marie Curie |
Physicist and Chemist |
Chien-Shiung Wu |
1912 |
1997 |
2 |
Janaki Ammal |
Botanist |
Katherine Johnson |
1918 |
2020 |
3 |
Chien-Shiung Wu |
Physicist |
Rosalind Franklin |
1920 |
1958 |
4 |
Katherine Johnson |
Mathematician |
Vera Rubin |
1928 |
2016 |
5 |
Rosalind Franklin |
Chemist |
Gladys West |
1930 |
<NA> |
6 |
Vera Rubin |
Astronomer |
Flossie Wong-Staal |
1947 |
2020 |
7 |
Gladys West |
Mathematician |
Jennifer Doudna |
1964 |
<NA> |
8 |
Flossie Wong-Staal |
Virologist and Molecular Biologist |
Barbara Liskov |
1939 |
<NA> |
9 |
Jennifer Doudna |
Biochemist |
NaN |
<NA> |
<NA> |
That was incorrect; we should have used a key.
Concatenate when tables have same structure
Common example: one table per year.
All tables have the same structure (verify this!), so concatenating rows is ok.
dataframes = [
pd.read_csv(f"../../static/data/airbnb/airbnb_{year}.csv")
for year in range(2015, 2021)
]
pd.concat(dataframes)