Skip to main content
  1. Posts/

Making Data Tidy — Structuring Datasets

·2 mins
Data Data Basics Data Pandas
Table of Contents

When performing data analysis, datasets are often structured for human readability rather than for machine-friendly analysis. As mentioned in previous articles, tidy data follows certain principles. This article explains how to use pandas to restructure datasets into tidy formats.

Splitting and Merging Columns
#

Horizontally Splitting a String Column into Multiple Columns
#

Suppose you have a column with values like "height/weight". You can split it using:

df["Height/Weight"].str.split("/")

This returns a list of strings per row. To expand them into separate columns:

df["Height/Weight"].str.split("/", expand=True)

To store the result back into the DataFrame:

df[["Height", "Weight"]] = df["Height/Weight"].str.split("/", expand=True)

Don’t forget to drop the original column:

df.drop("Height/Weight", axis=1, inplace=True)

Vertically Splitting a List Column into Multiple Rows
#

If a column contains lists like this:

    Name         Hobbies
0   Alice   [Basketball, Soccer, Swimming]
1   Bob     [Music, Painting]

You can explode it into multiple rows:

df_exploded = df.explode("Hobbies")

Merging Columns
#

If you have separate "First Name" and "Last Name" columns and want to merge them into a "Full Name" column:

df["Full Name"] = df["First Name"].str.cat(df["Last Name"], sep=" ")

Then drop the original columns if needed.

Convert Wide Data to Long Data
#

Consider this “wide” format that records patient measurements. Because not every patient undergoes the same tests, many columns may contain NaN. A better approach is to convert it into a “long” format.

Wide format:

Patient ID Blood Pressure Blood Sugar Heart Rate
001 120 90 80
002 130 NaN 85

To convert it into long format:

df_long = pd.melt(df, id_vars=["Patient ID"], var_name="Measurement", value_name="Value")

Result:

  Patient ID Measurement   Value
0        001   Blood Pressure  120.0
1        002   Blood Pressure  130.0
2        001   Blood Sugar      90.0
3        002   Blood Sugar       NaN
4        001   Heart Rate       80.0
5        002   Heart Rate       85.0

Explanation of Parameters:

Parameter Description
id_vars=["Patient ID"] Fixed column(s) to keep (e.g., identifiers)
var_name="Measurement" New column to store former column names (e.g., “Blood Pressure”)
value_name="Value" New column to store actual values

Related

Converting Categorical Variables with `get_dummies`
·2 mins
Data Data Basics Data Pandas
Filtering Data with query()
·1 min
Data Data Basics Data Pandas
Grouping Data with `cut`
·2 mins
Data Data Basics Data Pandas