Skip to main content
  1. Posts/

Generating Tidy Summary Tables with groupby / pivot_table

·2 mins
Data Data Basics Data Pandas
Table of Contents

In data analysis, it’s common to generate summary tables grouped by categories. For example, given a table of student test scores, we might want to know the average score per class.

data = {
    'Class': ['A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'Name': ['Ming', 'Mei', 'Hua', 'Qiang', 'Mei', 'An', 'Jie'],
    'Gender': ['M', 'F', 'M', 'M', 'F', 'F', 'M'],
    'Score': [85, 90, 78, 82, 88, 95, 91],
    'Absences': [2, 0, 3, 1, 0, 0, 2]
}

df = pd.DataFrame(data)

groupby
#

Basic Syntax
#

df.groupby("column") splits the DataFrame into groups. For example, df.groupby("Class") will split the data into groups A, B, and C. To view statistics, you need to select specific columns and apply aggregation functions like mean, sum, etc.

df.groupby("Class")["Score"].mean()

Result:

Class
A    87.5
B    82.67
C    93.0

To aggregate multiple columns:

df.groupby("Class")[["Score", "Absences"]].mean()
Class Score Absences
A 87.5000 1.000000
B 82.6667 1.333333
C 93.0000 1.000000

Grouping by multiple columns (e.g., Class and Gender):

df.groupby(["Class", "Gender"])[["Score", "Absences"]].mean()
Class Gender Score Absences
A F 90.0 0.0
M 85.0 2.0
B F 88.0 0.0
M 80.0 2.0
C F 95.0 0.0
M 91.0 2.0

Multiple Aggregation Functions
#

Use agg to apply multiple aggregation functions at once:

df.groupby("Class")["Score"].agg(['mean', 'sum'])
Class mean sum
A 87.5 175
B 82.67 248
C 93.0 186

Custom Aggregation Function
#

Define your own function and use it with apply:

df.groupby("Class")["Score"].apply(custom_function)

pivot_table
#

While groupby is powerful, pivot_table makes it easier to generate tabular reports. It’s ideal for transforming data into a readable table format with rows and columns.

Basic Usage
#

pd.pivot_table(df, index="Class", values="Score", aggfunc="mean")
Class Score
A 87.5
B 82.67
C 93.0

Calculate multiple fields:

pd.pivot_table(df, index="Class", values=["Score", "Absences"], aggfunc="mean")

Add Column Grouping
#

To view the average score by class and gender:

pd.pivot_table(df, index="Class", values="Score", columns="Gender", aggfunc="mean")
Gender F M
Class
A 90.0 85.0
B 88.0 80.0
C 95.0 91.0

Multiple Values + Multiple Columns
#

To view average Score and Absences by Class and Gender:

pd.pivot_table(
    df,
    values=["Score", "Absences"],
    index="Class",
    columns="Gender",
    aggfunc="mean"
)
Score Absences
Gender F M F M
Class
A 90.0 85.0 0.0 2.0
B 88.0 80.0 0.0 2.0
C 95.0 91.0 0.0 2.0

Related

Converting Categorical Variables with `get_dummies`
·2 mins
Data Data Basics Data Pandas
Filtering Data with query()
·1 min
Data Data Basics Data Pandas
Grouping Data with `cut`
·2 mins
Data Data Basics Data Pandas