As mentioned in the previous article, a high-quality dataset should be strong in two respects:
- Tidiness: The structure of the dataset follows the three principles of tidy data (see the previous article).
- Cleanliness: The data does not contain missing values, invalid/erroneous data, inconsistencies, or duplicates.
So how do we evaluate whether a dataset is tidy and clean? This article introduces several practical methods to assess both.
The article is divided into two parts:
- Evaluating tidiness (structure-related issues)
- Evaluating cleanliness (content-related issues)
Evaluating Tidiness #
Structural issues are usually easier to spot than content issues. For example, it’s easy to tell whether books are organized on a shelf just by looking, but harder to spot typos inside the books. Here are some useful tools to help identify structural problems:
- `df.info()`: gives an overview of the dataset, including column names and data types. This helps assess whether each column represents a single variable.
- `df.head(n)`, `df.tail(n)`, `df.sample(n)`: display actual rows to help you judge whether the dataset follows the tidy data principles (a short example follows this list).
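A minimal sketch of such a first-pass inspection is shown below; the file name `students.csv` and the sample sizes are placeholders, not part of the original article:

```python
import pandas as pd

# Load a hypothetical dataset; replace "students.csv" with your own file.
df = pd.read_csv("students.csv")

# Overview: column names, non-null counts, and data types.
df.info()

# Look at actual rows from the top, the bottom, and a random sample.
print(df.head(5))
print(df.tail(5))
print(df.sample(5))
```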
Evaluating Cleanliness #
Missing Values #
- Use `df.info()` to see how many non-null entries each column has, helping to identify columns with missing values.
- Use `df["column_name"].isnull()` to check which rows in a column have missing values. This returns a Boolean Series. Combine it with `.sum()` to count missing entries: `df["column_name"].isnull().sum()`
- Or check all columns at once: `df.isnull().sum()`
- You can also filter rows with missing values in a specific column: `df[df["column_name"].isnull()]` (see the sketch after this list).
Duplicate Values #
- Use `df.duplicated(subset=["column1", "column2"])`. This checks for rows where the specified columns match a previous row. It returns a Boolean Series: `False` for the first occurrence and `True` for subsequent duplicates. Use it with filtering to find all duplicates: `df[df.duplicated(subset=["column1", "column2"])]` (see the sketch below).
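For example, assuming hypothetical `student_id` and `exam_date` columns, a duplicate check might look like this:

```python
# True for rows whose ("student_id", "exam_date") pair repeats an earlier row.
dupe_mask = df.duplicated(subset=["student_id", "exam_date"])
print(dupe_mask.sum())  # number of duplicate rows

# Show the duplicate rows themselves.
print(df[dupe_mask])

# keep=False flags every member of a duplicated group, not just the later occurrences.
print(df[df.duplicated(subset=["student_id", "exam_date"], keep=False)])
```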
Inconsistent Values #
- Use `df["column_name"].value_counts()`. This displays the frequency of each unique value in a column, helping identify inconsistencies such as `"Class 1"`, `"Class No 1"`, and `"The first Class"` all representing the same class (see the sketch below).
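A minimal sketch, assuming a hypothetical `class` column:

```python
# Frequency of each unique value in the (hypothetical) 'class' column.
print(df["class"].value_counts())

# Inconsistent labels for the same category would show up as separate rows, e.g.:
# Class 1            120
# Class No 1          14
# The first Class      3
```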
Invalid / Erroneous Data #
This type of issue is usually harder to detect, especially without domain knowledge. However, some basic checks include:
- `df.describe()`: gives min, max, mean, and other statistics to help spot outliers.
- `df["column_name"].sort_values()`: sorting values may help identify anomalies or outliers that deviate from expected ranges (see the sketch below).
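A minimal sketch of both checks, assuming a hypothetical numeric `age` column:

```python
# Summary statistics (count, mean, std, min, quartiles, max) for numeric columns.
print(df.describe())

# Sorting a single column surfaces values outside the expected range,
# e.g. negative ages or implausibly large ones at either end of the output.
print(df["age"].sort_values())
```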