In pandas, categorical data is a specialized data type for representing categorical variables, which are commonly used in statistics. It is similar to the concept of factors in R.
A categorical variable can represent values like “male” or “female,” or ratings on a scale such as “poor,” “average,” and “excellent.” Unlike numerical data, you cannot perform mathematical operations like addition or division on categorical data.
In pandas, categorical data is stored as an array of the distinct category values plus an array of integer codes that refer to those categories. For large datasets with many repeated values, this typically saves memory and can speed up operations.
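The storage layout described above can be inspected through the .cat accessor. The following is a minimal sketch, assuming a made-up column of 30,000 repeated rating strings; the variable names are illustrative only, and the exact byte counts depend on the pandas version and platform.

```python
import pandas as pd

# Hypothetical data: three distinct strings repeated many times
values = ["poor", "average", "excellent"] * 10_000
as_object = pd.Series(values, dtype=object)
as_category = as_object.astype("category")

# The distinct category values are stored only once
print(as_category.cat.categories)
# Each row holds a small integer code that points at one category
print(as_category.cat.codes.head())

# Compare the memory footprint of the two representations
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print("object bytes:  ", object_bytes)
print("category bytes:", category_bytes)
```

With data this repetitive, the categorical version should report a much smaller footprint, because each row stores an integer code instead of a full string.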
A pandas Series or DataFrame can be created directly with categorical data by passing dtype="category" to the pandas.Series() or pandas.DataFrame() constructor.
The following is a basic example of creating a pandas Series with categorical data.
import pandas as pd
# Create Series object with categorical data
s = pd.Series(["a", "b", "c", "a"], dtype="category")
# Display the categorical Series
print('Series with Categorical Data:\n', s)
Series with Categorical Data:
0 a
1 b
2 c
3 a
dtype: category
Categories (3, object): ['a', 'b', 'c']
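The same dtype="category" argument can also be passed to the DataFrame constructor. The short sketch below uses made-up column names A and B and suggests that each column then infers its own set of categories from its own values.

```python
import pandas as pd

# Every column is converted to categorical at construction time
df = pd.DataFrame({"A": list("abca"), "B": list("bccd")}, dtype="category")

print(df.dtypes)
# Categories are inferred per column, not shared across the DataFrame
print(df["A"].cat.categories)
print(df["B"].cat.categories)
```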
This example demonstrates converting an existing pandas DataFrame column to the categorical data type using the astype() method.
import pandas as pd
# Create a DataFrame
df = pd.DataFrame({"Col_a": list("aeeioou"), "Col_b": range(7)})
# Display the Input DataFrame
print('Input DataFrame:\n',df)
print('\nVerify the Data type of each column:\n', df.dtypes)
# Convert the data type of Col_a to categorical
df['Col_a'] = df["Col_a"].astype("category")
# Display the converted DataFrame
print('\nConverted DataFrame:\n',df)
print('\nVerify the Data type of each column:\n', df.dtypes)
Input DataFrame:
Col_a Col_b
0 a 0
1 e 1
2 e 2
3 i 3
4 o 4
5 o 5
6 u 6
Verify the Data type of each column:
Col_a object
Col_b int64
dtype: object
Converted DataFrame:
Col_a Col_b
0 a 0
1 e 1
2 e 2
3 i 3
4 o 4
5 o 5
6 u 6
Verify the Data type of each column:
Col_a category
Col_b int64
dtype: object
By default, pandas infers categories from the data and treats them as unordered. To control this behavior, you can use the CategoricalDtype class from the pandas.api.types module, which lets you set the categories and their ordering explicitly.
This example demonstrates how to apply a CategoricalDtype to every column of a DataFrame.
import pandas as pd
from pandas.api.types import CategoricalDtype
# Create a DataFrame
df = pd.DataFrame({"A": list("abca"), "B": list("bccd")})
# Display the Input DataFrame
print('Input DataFrame:\n',df)
print('\nVerify the Data type of each column:\n', df.dtypes)
# Apply the same ordered CategoricalDtype to every column
cat_type = CategoricalDtype(categories=list("abcd"), ordered=True)
df_cat = df.astype(cat_type)
# Display the converted DataFrame
print('\nConverted DataFrame:\n', df_cat)
print('\nVerify the Data type of each column:\n', df_cat.dtypes)
Input DataFrame:
A B
0 a b
1 b c
2 c c
3 a d
Verify the Data type of each column:
A object
B object
dtype: object
Converted DataFrame:
A B
0 a b
1 b c
2 c c
3 a d
Verify the Data type of each column:
A category
B category
dtype: object
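The effect of ordered=True is easier to see with values that have a natural ranking. The sketch below is illustrative only: the rating labels and variable names are assumptions, not part of the example above. It shows that an ordered categorical supports min(), max(), comparisons, and sorting by the declared category order rather than alphabetical order.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical rating scale, listed from lowest to highest
rating_type = CategoricalDtype(
    categories=["poor", "average", "excellent"], ordered=True
)
ratings = pd.Series(["average", "poor", "excellent", "average"], dtype=rating_type)

# min() and max() follow the declared category order
print(ratings.min(), ratings.max())
# Comparisons against a category value are element-wise
print(ratings > "poor")
# Sorting uses the category order, not alphabetical order
print(ratings.sort_values().tolist())
```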
After converting a Series to categorical data, you can convert it back to its original form using Series.astype() or np.asarray().
This example converts a categorical Series back to the object data type using the astype() method.
import pandas as pd
# Create Series object with categorical data
s = pd.Series(["a", "b", "c", "a"], dtype="category")
# Display the categorical Series
print('Series with Categorical Data:\n', s)
# Display the converted Series
print('Converted Series back to original:\n', s.astype(str))
Series with Categorical Data:
0 a
1 b
2 c
3 a
dtype: category
Categories (3, object): ['a', 'b', 'c']
Converted Series back to original:
0 a
1 b
2 c
3 a
dtype: object
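The np.asarray() route mentioned above is not shown in the example, so here is a minimal sketch of it. It assumes the same four-element Series; the result is a plain NumPy array of the original values rather than a pandas Series.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# np.asarray() materializes the category values as a plain NumPy array
arr = np.asarray(s)
print(type(arr))
print(arr)
```

Use astype() when you want to keep working with a Series and its index, and np.asarray() when you only need the raw values.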
Calling describe() on categorical data produces output similar to that of a string Series or DataFrame: count, unique, top, and freq.
The following example demonstrates how to summarize a DataFrame that contains a categorical column using the describe() method.
import pandas as pd
import numpy as np
cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])
df = pd.DataFrame({"cat":cat, "s":["a", "c", "c", np.nan]})
print("Description for whole DataFrame:")
print(df.describe())
print("\nDescription only for a DataFrame column:")
print(df["cat"].describe())
Description for whole DataFrame:
cat s
count 3 3
unique 2 2
top c c
freq 2 2
Description only for a DataFrame column:
count 3
unique 2
top c
freq 2
Name: cat, dtype: object
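Note that unique is 2 in the output above even though three categories were declared. The sketch below, which reuses the same categorical values in a standalone Series, suggests why: describe() counts only the categories that actually occur, while the .cat accessor and value_counts() also know about the unused category "b".

```python
import numpy as np
import pandas as pd

cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])
s = pd.Series(cat, name="cat")

summary = s.describe()
# Number of categories that actually appear in the data
print(summary["unique"])
# Number of categories declared for the dtype
print(len(s.cat.categories))

# value_counts() lists every declared category, including unused ones
counts = s.value_counts()
print(counts)
```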
