How to Create Age Group Categories in Pandas and Visualize Them with Matplotlib
Data visualization is a key part of data analysis, helping to communicate insights clearly. In this blog post, we’ll learn how to categorize age data into specified groups using Pandas and then visualize those groups using a pie chart in Matplotlib. We’ll also show how to customize the chart to show the actual count of patients in each age group.
Step-by-Step Guide
We will walk through the process of creating a DataFrame, categorizing the ages into specified groups, counting the number of patients in each group, and visualizing these counts using a pie chart. Below, you will find the complete code snippet that accomplishes all these tasks.
Complete Code Snippet
import pandas as pd
import matplotlib.pyplot as plt
# Sample data: ages of patients
data = {'Age': [25, 32, 45, 51, 62, 75, 23, 38, 49, 54, 67, 72, 29, 34, 58]}
df = pd.DataFrame(data)
# Define age group labels and bins
labels = ['21-30', '31-40', '41-50', '51-60', '+61']
bins = [21, 31, 41, 51, 61, float('inf')] # One more edge than labels; float('inf') leaves the last group open-ended
# Categorize ages into groups
df['AgeGroup'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)
# Count the number of patients in each age group
counting = df['AgeGroup'].value_counts().reindex(labels)
# Plotting the pie chart
plt.figure(figsize=(8, 8))
plt.pie(counting, labels=counting.index, autopct=lambda x: "{:.0f}".format(x * (counting.sum() / 100)), startangle=90)
plt.title('Patients by Age Group')
# Adding a legend with the correct order
plt.legend(labels=counting.index, title='Age Range')
plt.show()
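The autopct lambda in the snippet above receives each wedge's share as a percentage and converts it back into a count. As a minimal sketch of the same idea, the conversion can be pulled into a named helper, which makes it easier to check on its own. The name count_formatter is hypothetical and not part of Matplotlib, and the explicit rounding step is an added assumption to guard against floating-point error.

```python
def count_formatter(total):
    """Return an autopct callback that prints counts instead of percentages.

    `count_formatter` is a hypothetical helper name used for illustration.
    """
    def _format(pct):
        # Matplotlib passes the wedge's share as a percentage (0-100),
        # so scale it back to a count and round to the nearest integer.
        count = int(round(float(pct) * total / 100.0))
        return str(count)
    return _format


# Example: 15 patients in total, one wedge holding 20% of them.
fmt = count_formatter(15)
print(fmt(20.0))  # prints 3
```

In the snippet above, this could be passed as autopct=count_formatter(counting.sum()) in place of the lambda.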
Explanation of the Code:
- Importing Libraries: We start by importing the necessary libraries. pandas is used for data manipulation, and matplotlib.pyplot is used for creating visualizations.
- Creating the DataFrame: We create a simple DataFrame containing a single column, Age, with sample patient age data.
- Defining Age Groups: We specify labels for the age groups and bins for the corresponding age ranges. The float('inf') value leaves the last bin open-ended, so every age of 61 and over falls into the '+61' group.
- Categorizing the Ages: The pd.cut function categorizes the ages into the defined bins and assigns the appropriate labels. The right=False parameter makes each bin inclusive on its left edge and exclusive on its right edge.
- Counting Patients in Each Group: Using value_counts(), we count the occurrences of each age group. The reindex(labels) method ensures that the counts are displayed in the order specified by labels.
- Plotting the Pie Chart: We use plt.pie to create a pie chart with the counts of patients. The autopct parameter is customized to display the actual number of patients instead of percentages.
- Adding a Legend: Finally, we add a legend to the pie chart using plt.legend, which displays the age groups in the correct order as defined in labels.
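Because the bin edges and the right flag decide where boundary ages land, it can help to test a few edge values directly. The following is a minimal sketch, assuming left-closed bins whose edges start at 21, 31, 41, 51, and 61 so that they line up with the labels. The boundary_ages values are made up for illustration.

```python
import pandas as pd

labels = ['21-30', '31-40', '41-50', '51-60', '+61']
# Assumed edges: one more edge than labels, each bin closed on the left.
bins = [21, 31, 41, 51, 61, float('inf')]

# Hypothetical boundary ages chosen to probe the edges of each bin.
boundary_ages = pd.Series([21, 30, 31, 60, 61, 95])
groups = pd.cut(boundary_ages, bins=bins, labels=labels, right=False)

# Pair each age with the group it was assigned to.
result = dict(zip(boundary_ages, groups.astype(str)))
print(result)
```

Under these assumptions, 30 stays in '21-30' while 31 moves to '31-40', and 61 is the first age in '+61'. An age below 21 would fall outside every bin and come back as NaN.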
Conclusion
In this post, we’ve seen how to categorize continuous numerical data into specified groups using Pandas. We then used Matplotlib to create a pie chart, visually representing the distribution of data across these groups. This approach is particularly useful when you want to summarize demographic data or any other continuous numerical data into more meaningful, readable segments.