Transform¶
Pivot by a single column¶
The Table.pivot() method is a general process for grouping data by row and, optionally, by column, and then calculating some aggregation for each group. Consider the following table:
| name | race | gender | age |
|---|---|---|---|
| Joe | white | female | 20 |
| Jane | asian | male | 20 |
| Jill | black | female | 20 |
| Jim | latino | male | 25 |
| Julia | black | female | 25 |
| Joan | asian | female | 25 |
In the very simplest case, this table can be pivoted to count the number occurences of values in a column:
transformed = table.pivot('race')
Result:
| race | pivot |
|---|---|
| white | 1 |
| asian | 2 |
| black | 2 |
| latino | 1 |
Pivot by multiple columns¶
You can pivot by multiple columns either as additional row-groups, or as intersecting columns. For example, given the table in the previous example:
transformed = table.pivot(['race', 'gender'])
Result:
| race | gender | pivot |
|---|---|---|
| white | female | 1 |
| asian | male | 1 |
| black | female | 2 |
| latino | male | 1 |
| asian | female | 1 |
For the column, version you would do:
transformed = table.pivot('race', 'gender')
Result:
| race | male | female |
|---|---|---|
| white | 0 | 1 |
| asian | 1 | 1 |
| black | 0 | 2 |
| latino | 1 | 0 |
Pivot to sum¶
The default pivot aggregation is Count but you can also supply other operations. For example, to aggregate each group by Sum of their ages:
transformed = table.pivot('race', 'gender', aggregation=agate.Sum('age'))
| race | male | female |
|---|---|---|
| white | 0 | 20 |
| asian | 20 | 25 |
| black | 0 | 45 |
| latino | 25 | 0 |
Pivot to percent of total¶
Pivot allows you to apply a Computation to each row of aggregated results prior to returning the table. Use the stringified name of the aggregation as the column argument to your computation:
transformed = table.pivot('race', 'gender', aggregation=agate.Sum('age'), computation=agate.Percent('sum'))
| race | male | female |
|---|---|---|
| white | 0 | 14.8 |
| asian | 14.8 | 18.4 |
| black | 0 | 33.3 |
| latino | 18.4 | 0 |
Note: actual computed percentages will be much more precise.
It’s helpful when constructing these cases to think of all the cells in the pivot table as a single sequence.
Denormalize key/value columns into separate columns¶
It’s common for very large datasets to be distributed in a “normalized” format, such as:
| name | property | value |
|---|---|---|
| Jane | gender | female |
| Jane | race | black |
| Jane | age | 24 |
| ... | ... | ... |
The Table.denormalize() method can be used to transform the table so that each unique property has its own column.
transformed = table.denormalize('name', 'property', 'value')
Result:
| name | gender | race | age |
|---|---|---|---|
| Jane | female | black | 24 |
| Jack | male | white | 35 |
| Joe | male | black | 28 |
Normalize separate columns into key/value columns¶
Sometimes you have a dataset where each property has its own column, but your analysis would be easier if all properties were stored together. Consider this table:
| name | gender | race | age |
|---|---|---|---|
| Jane | female | black | 24 |
| Jack | male | white | 35 |
| Joe | male | black | 28 |
The Table.normalize() method can be used to transform the table so that all the properties and their values share two columns.
transformed = table.normalize('name', ['gender', 'race', 'age'])
Result:
| name | property | value |
|---|---|---|
| Jane | gender | female |
| Jane | race | black |
| Jane | age | 24 |
| ... | ... | ... |