Tidying Data

Q&A

Is joining like xlookup or vlookup in Excel?

Yes (though joining is more powerful)

  • xlookup(*lookup_value*, *lookup_array*, *return_array*) searches for lookup_value in lookup_array and returns the corresponding value from return_array.
  • lookup_value is the key
  • Exercise for experienced Excel users: think about what kind of join would this be.

Q&A

Where do “left” and “right” come from?

pd.merge(a, b, how='left')`
#        ^  ^
#     left  right

same as

pd.merge(b, a, how='right')

Tidy data

Tidy data

Happy families are all alike; every unhappy family is unhappy in its own way.

Leo Tolstoy

Characteristics of tidy data:

  • Each variable forms a column.
  • Each observation forms a row.
  • Each type of observational unit forms a table.

Characteristics of untidy data:

  • Varies.

Data: Sales

Let’s construct two dataframes for fake sales data. (Here’s two ways to make a dataframe by hand.)

# columns format
customers = pd.DataFrame({
    'customer_id': [1, 2],
    'item_1': ['bread', 'milk'],
    'item_2': ['milk', 'toilet paper'],
    'item_3': ['banana', None]
})

# records format
prices = pd.DataFrame(
  [
    ['avocado', 5],
    ['banana', 0.57],
    ['bread', 3],
    ['milk', 2.5],
    ['toilet paper', 10],
  ], columns=['item', 'price']
)

Data: Sales

We have…

customer_id item_1 item_2 item_3
0 1 bread milk banana
1 2 milk toilet paper None

We want…

customer_id item_num item_name
0 1 item_1 bread
1 2 item_1 milk
2 1 item_2 milk
3 2 item_2 toilet paper
4 1 item_3 banana
5 2 item_3 None

melt the data

Wider vs. longer

wider: more columns

customer_id item_1 item_2 item_3
0 1 bread milk banana
1 2 milk toilet paper None

longer: more rows

customer_id item_num item_name
0 1 item_1 bread
1 2 item_1 milk
2 1 item_2 milk
3 2 item_2 toilet paper
4 1 item_3 banana
5 2 item_3 None

Wide to Long: melt

customers
customer_id item_1 item_2 item_3
0 1 bread milk banana
1 2 milk toilet paper None

Wide to Long: melt

customers.melt()
variable value
0 customer_id 1
1 customer_id 2
2 item_1 bread
3 item_1 milk
4 item_2 milk
5 item_2 toilet paper
6 item_3 banana
7 item_3 None

Wide to Long: melt

  • id_vars: primary key (columns that uniquely identify each row)
    • not transformed (but duplicated as needed)
    • a list, since a key can be multiple columns
customers.melt(
    id_vars=['customer_id'],
)
customer_id variable value
0 1 item_1 bread
1 2 item_1 milk
2 1 item_2 milk
3 2 item_2 toilet paper
4 1 item_3 banana
5 2 item_3 None

Wide to Long: melt

  • id_vars: primary key
  • var_name: name of the column where names of columns are stored
customers.melt(
    id_vars=['customer_id'],
    var_name='item_num'
)
customer_id item_num value
0 1 item_1 bread
1 2 item_1 milk
2 1 item_2 milk
3 2 item_2 toilet paper
4 1 item_3 banana
5 2 item_3 None

Wide to Long: melt

  • id_vars: primary key
  • var_name: new name for variables column
  • value_name: new name for values column
purchases = customers.melt(
    id_vars=['customer_id'],
    var_name='item_num',
    value_name='item'
)
purchases
customer_id item_num item
0 1 item_1 bread
1 2 item_1 milk
2 1 item_2 milk
3 2 item_2 toilet paper
4 1 item_3 banana
5 2 item_3 None

Why pivot?

Most likely, because the next step of your analysis needs it

prices
item price
0 avocado 5.00
1 banana 0.57
2 bread 3.00
3 milk 2.50
4 toilet paper 10.00
purchases.merge(prices, how='left')
customer_id item_num item price
0 1 item_1 bread 3.00
1 2 item_1 milk 2.50
2 1 item_2 milk 2.50
3 2 item_2 toilet paper 10.00
4 1 item_3 banana 0.57
5 2 item_3 None NaN

Long to Wide

Sometimes helpful for analysis. Mostly helpful for reports.

purchases.pivot(
    index='customer_id',
    columns='item_num',
    values='item'
)
item_num item_1 item_2 item_3
customer_id
1 bread milk banana
2 milk toilet paper None

Aside: what if there are multiple values that would go in each box? pivot doesn’t handle this cleanly. A related function, pivot_table, makes us explicitly specify what to do. For example, take the first one:

purchases.pivot_table(
    index=['customer_id'],
    columns='item_num',
    values='item',
    aggfunc='first'
)
item_num item_1 item_2 item_3
customer_id
1 bread milk banana
2 milk toilet paper NaN