Making Data Clean — Cleaning Dataset Content

Problems in a dataset’s content can significantly distort the results of a data analysis. For a description of what constitutes “dirty” data, see Evaluating the Tidiness and Cleanliness of Data. This article explains how to use pandas to clean problematic content: missing values, duplicate rows, and inconsistent values.

Handling Missing Data

Manually Filling Missing Values

Missing values are common in real-world datasets. Sometimes you may need to manually fill in the missing entries.

Direct Assignment

If an entire column is missing, assign a value to it directly. A scalar value is broadcast to every row:

df["col_name"] = value

To fill a specific cell, use .loc (label-based) or .iloc (position-based):

df.loc[index, "col_name"] = value
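
As a minimal sketch of both forms of direct assignment, assuming a small made-up DataFrame (the column names and values are hypothetical and only serve the illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: the "score" of row 1 is missing.
df = pd.DataFrame({"name": ["Ann", "Bob", "Cy"], "score": [90.0, np.nan, 75.0]})

# Add a column that did not exist yet; the scalar is broadcast to every row.
df["country"] = "unknown"

# Fill one cell by row label and column name.
df.loc[1, "score"] = 80.0

# Write one cell by integer position: row 2, column 2 (which is "country").
df.iloc[2, 2] = "FR"

print(df)
```

Unlike fillna or dropna, these assignments change df directly, so there is no result to assign back.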

Using `fillna`

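Instead of locating missing values manually, you can use fillna. It returns a new object rather than modifying the original, so assign the result back to keep it. For example, to fill missing values in a column with 0:

df["col_name"] = df["col_name"].fillna(0)

To fill multiple columns with different values, pass a dictionary that maps column names to fill values:

df = df.fillna({"col1": value1, "col2": value2})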
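
The following sketch shows both fillna forms on a made-up DataFrame. The column names, the choice of the column mean as a fill value, and the "unknown" placeholder are assumptions for illustration, not recommendations from the article:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value in each column.
df = pd.DataFrame({
    "price": [10.0, np.nan, 30.0],
    "city": ["Oslo", None, "Rome"],
})

# Series-level fill: returns a new Series, df itself is left unchanged.
price_filled = df["price"].fillna(0)

# DataFrame-level fill with a dictionary: one fill value per column.
# mean() skips missing values by default, so the mean here is 20.0.
filled = df.fillna({"price": df["price"].mean(), "city": "unknown"})

print(filled)
```

Because both calls return new objects, df still contains its missing values afterwards; use the returned objects, or assign them back to df, to keep the filled data.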

Deleting Rows with Missing Values

If you prefer to drop rows with missing data:

df.dropna()

This returns a new DataFrame without any row that contains at least one missing value (NaN, None, or NaT).

To drop rows only when specific columns contain missing values:

df.dropna(subset=["col1", "col2"])
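
To make the difference between the two calls concrete, here is a small sketch on a made-up DataFrame (all column names and values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: rows 1, 2 and 3 each have one missing value.
df = pd.DataFrame({
    "name": ["Ann", "Bob", None, "Dee"],
    "age": [31.0, np.nan, 45.0, 28.0],
    "email": ["a@x.io", "b@x.io", "c@x.io", None],
})

# Drop every row that has a missing value in any column.
complete = df.dropna()

# Drop a row only if "name" or "age" is missing; a missing "email" is tolerated.
required = df.dropna(subset=["name", "age"])

print(complete.index.tolist())
print(required.index.tolist())
```

Here only row 0 survives the plain dropna(), while the subset version also keeps row 3, whose only gap is in the email column.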

Handling Duplicates

You can remove duplicate rows with:

df.drop_duplicates(subset=None, keep='first')

subset: the columns to use when identifying duplicates. The default, None, compares all columns.
keep:
- 'first' (default): keep the first occurrence
- 'last': keep the last occurrence
- False: drop every row that has a duplicate, keeping none of them
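
The effect of subset and keep is easiest to see side by side. The sketch below uses a made-up DataFrame in which rows 0 and 2 are identical and the user "ann" appears three times (the data and column names are assumptions):

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "user": ["ann", "bob", "ann", "ann"],
    "plan": ["free", "pro", "free", "pro"],
})

# Compare all columns: row 2 repeats row 0 and is dropped.
full_dedup = df.drop_duplicates()

# Compare only "user": keep the first row seen for each user.
first_per_user = df.drop_duplicates(subset=["user"])

# Compare only "user": keep the last row seen for each user.
last_per_user = df.drop_duplicates(subset=["user"], keep="last")

# Compare only "user": discard every user that appears more than once.
no_repeats = df.drop_duplicates(subset=["user"], keep=False)

for result in (full_dedup, first_per_user, last_per_user, no_repeats):
    print(result.index.tolist())
```

The surviving rows keep their original index labels, so a reset_index(drop=True) may be useful afterwards if a contiguous index is needed.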

Handling Inconsistent Data

When values are inconsistent (e.g., "M", "male", "boy"), use replace to standardize them:

df["Gender"].replace(["M", "boy"], "male")

To map multiple values using a dictionary:

df["Gender"].replace({"M": "male", "F": "female"})

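Note: these operations return a new object and do not modify the original DataFrame. To keep the result, assign it back, for example df["Gender"] = df["Gender"].replace(...). DataFrame-level methods also accept inplace=True, but calling a method with inplace=True on a selected column such as df["Gender"] may not update df, so assigning back is the safer pattern.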
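
A short sketch of both replace forms, assuming a made-up Gender column with mixed spellings (the extra "boy" and "F" entries in the mapping are illustrative assumptions):

```python
import pandas as pd

# Hypothetical sample data with several spellings of the same categories.
df = pd.DataFrame({"Gender": ["M", "male", "boy", "F", "female"]})

# List form: every value in the list is replaced by the same target value.
# The result is a new Series; df is unchanged at this point.
step1 = df["Gender"].replace(["M", "boy"], "male")

# Dictionary form: each key is replaced by its own value.
# Assigning the result back stores the cleaned column in df.
df["Gender"] = df["Gender"].replace({"M": "male", "boy": "male", "F": "female"})

print(df["Gender"].tolist())
```

After the assignment, the column contains only the two standardized values, which can be checked with df["Gender"].unique().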
