When performing data analysis, problems in the dataset’s content can significantly affect your results. For a description of what constitutes “dirty” data, see: Evaluating the Tidiness and Cleanliness of Data. This article explains how to use pandas to clean problematic content.
Handling Missing Data #
Manually Filling Missing Values #
Missing values are common in real-world datasets. Sometimes you may need to manually fill in the missing entries.
Direct Assignment #
If every value in a column is missing, you can assign a single value to the whole column:
df["col_name"] = value
To fill a specific cell, use .loc or .iloc:
df.loc[index, "col_name"] = value
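As a minimal sketch of the two manual approaches above, the following uses a small made-up DataFrame (the column names and values are assumptions for illustration only). It fills an all-missing column by direct assignment, then fills single cells with .loc (by label) and .iloc (by position).

```python
import numpy as np
import pandas as pd

# Hypothetical data: "score" is entirely missing, "city" has one gap.
df = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cy"],
        "city": ["Oslo", None, "Rome"],
        "score": [np.nan, np.nan, np.nan],
    }
)

# Direct assignment overwrites every value in the column.
df["score"] = 0.0

# .loc addresses a cell by index label and column name.
df.loc[1, "city"] = "Paris"

# .iloc addresses a cell by integer position (third row, third column).
df.iloc[2, 2] = 9.5

print(df)
```

Note that direct assignment replaces the whole column, including any values that were not missing, so it is best reserved for columns that are empty throughout.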
Using fillna #
Instead of manually locating missing values, you can use fillna. For example, to fill missing values in a column with 0:
df["col_name"].fillna(0)
To fill multiple columns with different values, pass a dictionary:
df.fillna({"col1": value1, "col2": value2})
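The two fillna calls above might look like this in practice. The DataFrame, its column names, and the fill values are invented for the example; the results are assigned to names because fillna returns a new object.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in every column.
df = pd.DataFrame(
    {
        "age": [25.0, np.nan, 40.0],
        "city": ["Oslo", None, "Rome"],
        "visits": [np.nan, 3.0, np.nan],
    }
)

# Fill a single column; assign the returned Series back to keep the result.
df["visits"] = df["visits"].fillna(0)

# Fill several columns at once, each with its own replacement value.
filled = df.fillna({"age": 0, "city": "unknown"})

print(filled)
```

Columns that are not listed in the dictionary are left untouched, so this form is convenient when each column needs a different placeholder.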
Deleting Rows with Missing Values #
If you prefer to drop rows with missing data:
df.dropna()
This drops any row that contains at least one NaN.
To only drop rows if specific columns are missing:
df.dropna(subset=["col1", "col2"])
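To see the difference between the two dropna calls, here is a short sketch on a made-up DataFrame (the col3 column and all values are assumptions added for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical data: each of the first three rows is missing one value.
df = pd.DataFrame(
    {
        "col1": [1.0, np.nan, 3.0, 4.0],
        "col2": ["a", "b", None, "d"],
        "col3": [np.nan, 2.0, 3.0, 4.0],
    }
)

# Drop every row that has a missing value in any column.
all_complete = df.dropna()

# Only check col1 and col2; a missing col3 no longer disqualifies a row.
key_complete = df.dropna(subset=["col1", "col2"])

print(all_complete.index.tolist())
print(key_complete.index.tolist())
```

With subset, the first row survives because its only missing value is in col3, which is not one of the checked columns.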
Handling Duplicates #
You can remove duplicate rows with:
df.drop_duplicates(subset=None, keep='first')
subset: specify which columns to use when identifying duplicates.
keep:
  'first' (default): keep the first occurrence
  'last': keep the last occurrence
  False: drop all duplicates
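The effect of subset and keep can be compared side by side. This is a small sketch on invented data (the user and plan columns are assumptions), showing which row labels remain under each setting.

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 are identical; "ann" appears three times.
df = pd.DataFrame(
    {
        "user": ["ann", "bob", "ann", "ann"],
        "plan": ["free", "pro", "free", "pro"],
    }
)

# Default: compare whole rows and keep the first of each duplicate group.
keep_first = df.drop_duplicates()

# Compare only the "user" column and keep the last row for each user.
keep_last = df.drop_duplicates(subset=["user"], keep="last")

# keep=False removes every row whose "user" value occurs more than once.
keep_none = df.drop_duplicates(subset=["user"], keep=False)

print(keep_first.index.tolist())
print(keep_last.index.tolist())
print(keep_none.index.tolist())
```

The original index labels are preserved, which makes it easy to check which occurrences were kept.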
Handling Inconsistent Data #
When values are inconsistent (e.g., "M", "male", "boy"), use replace to standardize them:
df["Gender"].replace(["M", "boy"], "male")
To map multiple values using a dictionary:
df["Gender"].replace({"M": "male", "F": "female"})
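Both forms of replace can be tried on a small example. The Gender values below are made up for illustration, and the dictionary adds a "boy" entry that is an assumption beyond the snippet above.

```python
import pandas as pd

# Hypothetical column with several spellings of the same categories.
df = pd.DataFrame({"Gender": ["M", "male", "boy", "F", "female"]})

# List form: every listed variant is replaced by the same value.
males_fixed = df["Gender"].replace(["M", "boy"], "male")

# Dictionary form: each key is replaced by its own value.
# The result is assigned back so that df itself is updated.
df["Gender"] = df["Gender"].replace({"M": "male", "boy": "male", "F": "female"})

print(df["Gender"].unique())
```

The dictionary form is usually easier to maintain once more than one target value is involved, since each mapping is spelled out explicitly.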
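Note: these operations return a new object and do not modify the original DataFrame. Either assign the result back (for example, df = df.dropna()) or pass inplace=True when calling the method on the DataFrame itself. Calling a method with inplace=True on a selected column, such as df["col_name"].fillna(0, inplace=True), is unreliable and does not update df when pandas Copy-on-Write is enabled, so assign the result back to the column instead.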
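The behavior described in the note can be checked directly. In this sketch (the one-column DataFrame is an assumption for illustration), a discarded result leaves df unchanged, an assigned result is kept, and inplace=True on the DataFrame modifies it and returns None.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing value and one repeated value.
df = pd.DataFrame({"x": [1.0, np.nan, 1.0]})

# The result is discarded, so df still has all three rows.
df.dropna()
rows_after_discard = len(df)

# Keeping the returned object preserves the cleaned data under a new name.
cleaned = df.dropna()

# inplace=True on the DataFrame itself modifies df and returns None.
result = df.drop_duplicates(inplace=True)

print(rows_after_discard, len(cleaned), result, df.index.tolist())
```

Because an in-place call returns None, writing df = df.drop_duplicates(inplace=True) would replace df with None; use either assignment or inplace=True, not both.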