Skip to main content
  1. Posts/

Converting Categorical Variables with `get_dummies`

·2 mins
Data Data Basics Data Pandas
Table of Contents

When performing linear regression, we often encounter categorical variables like gender (male/female), region (north/south/west), etc. However, linear regression only accepts numerical data. This means we must first convert categorical (text) variables into a numeric format before fitting the model.

In such cases, the pandas function get_dummies() becomes a very handy tool. It converts categorical variables into a set of binary (0 and 1) columns—a process known as One-Hot Encoding. This article introduces how to use get_dummies() and explains its importance before running linear regression.

Basic Syntax
#

Usage is simple:

pd.get_dummies(df, columns=["col_name"], dtype=int)

Just pass the column(s) you want to encode into get_dummies().

Why Drop One Column?
#

Generally, we don’t want the resulting dummy variables to be highly correlated. In fact, if the columns are linearly dependent, the linear regression equation becomes unsolvable.

Recall that in linear regression, we solve:

$$\mathbf{y} = \mathbf{X}\beta + \epsilon$$

Using least squares estimation, the formula is:

$$\beta = (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$

But if some columns in $\mathbf{X}$ are linearly dependent, then $\mathbf{X}^T\mathbf{X}$ becomes a singular matrix, and we cannot compute its inverse.


Since dummy variables generated from categorical values are mutually exclusive, they are inherently dependent. Therefore, we typically drop one column after one-hot encoding to avoid linear dependence.

Use the drop_first=True argument:

pd.get_dummies(df, columns=["col_name"], dtype=int, drop_first=True)

This will drop the first category and retain the rest, ensuring the design matrix stays full-rank.

Tip: Use df.corr() to inspect correlations between variables.

Related

Generating Tidy Summary Tables with groupby / pivot_table
·2 mins
Data Data Basics Data Pandas
Filtering Data with query()
·1 min
Data Data Basics Data Pandas
Grouping Data with `cut`
·2 mins
Data Data Basics Data Pandas