Making Data Clean — Cleaning Dataset Content


When performing data analysis, problems in the dataset’s content can significantly affect your results. For a description of what constitutes “dirty” data, see: Evaluating the Tidiness and Cleanliness of Data. This article explains how to use pandas to clean three kinds of problematic content: missing values, duplicate rows, and inconsistent values.

Handling Missing Data

Manually Filling Missing Values

Missing values are common in real-world datasets. Sometimes you may need to manually fill in the missing entries.

Direct Assignment

If an entire column is missing (or every value in it is missing), assign a single value to the whole column:

df["col_name"] = value

To fill a specific cell, select it with .loc (by label) or .iloc (by position) and assign to it:

df.loc[index, "col_name"] = value
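As a minimal sketch of both forms of direct assignment, the example below uses a small made-up DataFrame; the column names, index labels, and values are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; names and values are illustrative only.
df = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cy"], "age": [31.0, np.nan, 45.0]},
    index=["a", "b", "c"],
)

# Assigning a scalar to a column sets every row
# (and creates the column if it does not exist yet).
df["country"] = "unknown"

# Label-based: fill the missing age in the row labelled "b".
df.loc["b", "age"] = 28.0

# Position-based: third row, first column.
df.iloc[2, 0] = "Cyrus"
```

.loc is usually the safer choice when the index carries meaningful labels, because positions shift whenever rows are dropped or reordered.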

Using fillna

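Instead of locating missing values by hand, you can use fillna, which returns a new object with the missing values replaced. For example, to fill missing values in a column with 0 and store the result back in the DataFrame:

df["col_name"] = df["col_name"].fillna(0)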

To fill multiple columns with different values, pass a dictionary:

df.fillna({"col1": value1, "col2": value2})
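The following sketch shows both fillna forms on a made-up DataFrame (the column names and fill values are assumptions). It also shows that fillna leaves the original untouched unless the result is assigned.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value per column.
df = pd.DataFrame(
    {"score": [1.0, np.nan, 3.0], "city": ["Oslo", None, "Lima"]}
)

# Single column: fillna returns a new Series; df itself is not changed here.
score_filled = df["score"].fillna(0)

# Several columns at once: keys are column names, values are the fill values.
filled = df.fillna({"score": 0, "city": "unknown"})

print(df["score"].isna().sum())  # the original still has its missing value
print(filled)
```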

Deleting Rows with Missing Values

If you prefer to drop rows with missing data:

df.dropna()

This drops any row that contains at least one NaN.

To only drop rows if specific columns are missing:

df.dropna(subset=["col1", "col2"])
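To make the difference between the two calls concrete, here is a small sketch on invented data (col3 is an extra, hypothetical column added so the two results differ).

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: each of the first three rows has one missing value.
df = pd.DataFrame(
    {
        "col1": [1.0, np.nan, 3.0, 4.0],
        "col2": ["x", "y", None, "w"],
        "col3": [np.nan, 2.0, 3.0, 4.0],
    }
)

# Drop a row if ANY column is missing: only the last row survives.
no_missing = df.dropna()

# Drop a row only if col1 or col2 is missing: the first row is kept
# even though its col3 value is missing.
subset_clean = df.dropna(subset=["col1", "col2"])

print(no_missing.index.tolist())
print(subset_clean.index.tolist())
```

Like fillna, dropna returns a new DataFrame, so assign the result to a name if you want to keep it.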

Handling Duplicates

You can remove duplicate rows with:

df.drop_duplicates(subset=None, keep='first')
  • subset: specify which columns to use when identifying duplicates.
  • keep:
    • 'first' (default): keep the first occurrence
    • 'last': keep the last
    • False: drop every row that has a duplicate, keeping none of them
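The options above can be compared side by side on a small invented DataFrame; the id and value columns are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data: rows 0 and 1 are identical,
# rows 3 and 4 share an id but differ in value.
df = pd.DataFrame(
    {"id": [1, 1, 2, 3, 3], "value": ["a", "a", "b", "c", "d"]}
)

# All columns must match for a row to count as a duplicate.
full = df.drop_duplicates()

# Only the id column is compared.
first = df.drop_duplicates(subset=["id"], keep="first")
last = df.drop_duplicates(subset=["id"], keep="last")
unique_only = df.drop_duplicates(subset=["id"], keep=False)

for label, result in [("full", full), ("first", first),
                      ("last", last), ("keep=False", unique_only)]:
    print(label, result.index.tolist())
```

The surviving rows keep their original index labels; chain .reset_index(drop=True) if you want a fresh 0-based index afterwards.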

Handling Inconsistent Data

When values are inconsistent (e.g., "M", "male", "boy"), use replace to standardize them:

df["Gender"].replace(["M", "boy"], "male")

To map multiple values using a dictionary:

df["Gender"].replace({"M": "male", "F": "female"})

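Note: fillna, dropna, drop_duplicates, and replace all return a new object and leave the original DataFrame unchanged. To keep the change, assign the result back, for example df["Gender"] = df["Gender"].replace(["M", "boy"], "male"). Calling a method with inplace=True on the DataFrame itself modifies it, but inplace=True on a selected column such as df["Gender"] is not reliable: recent pandas versions warn about it, and with copy-on-write enabled it does not update the DataFrame.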
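A minimal sketch of standardizing a column with replace, assigning the result back so the change persists. The sample values are made up; only the Gender column name comes from the examples above.

```python
import pandas as pd

# Hypothetical sample data with inconsistent labels.
df = pd.DataFrame({"Gender": ["M", "male", "boy", "F", "female"]})

# List form: every listed value is replaced by the same target.
df["Gender"] = df["Gender"].replace(["M", "boy"], "male")

# Dictionary form: each key is replaced by its own value.
# Whole cell values are matched, so "female" is left alone.
df["Gender"] = df["Gender"].replace({"F": "female"})

print(df["Gender"].value_counts())
```

Checking value_counts() (or unique()) after the replacement is a quick way to confirm that no unexpected variants remain.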
