Geek Logbook

Tech sea log book

How to Create Age Group Categories in Pandas and Visualize Them with Matplotlib

Data visualization is a key part of data analysis, helping to communicate insights clearly. In this blog post, we’ll learn how to categorize age data into specified groups using Pandas and then visualize those groups using a pie chart in Matplotlib. We’ll also show how to customize the chart so that each slice is labeled with the number of patients in that age group rather than a percentage.

Step-by-Step Guide

We will walk through the process of creating a DataFrame, categorizing the ages into specified groups, counting the number of patients in each group, and visualizing these counts using a pie chart. Below, you will find the complete code snippet that accomplishes all these tasks.

Complete Code Snippet

import pandas as pd
import matplotlib.pyplot as plt

# Sample data: ages of patients
data = {'Age': [25, 32, 45, 51, 62, 75, 23, 38, 49, 54, 67, 72, 29, 34, 58]}
df = pd.DataFrame(data)

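# Define age group labels and bin edges
# pd.cut needs exactly one more edge than labels: 5 labels -> 6 edges
labels = ['21-30', '31-40', '41-50', '51-60', '+61']
bins = [21, 31, 41, 51, 61, float('inf')]  # float('inf') leaves the last bin open-ended

# Categorize ages into groups
# right=False makes each bin [left, right), so [21, 31) holds ages 21 through 30
df['AgeGroup'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)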

# Count the number of patients in each age group
counting = df['AgeGroup'].value_counts().reindex(labels)

# Plotting the pie chart
# autopct receives each wedge's percentage; scale it back to a patient count
plt.figure(figsize=(8, 8))
plt.pie(
    counting,
    labels=counting.index,
    autopct=lambda pct: "{:.0f}".format(pct * counting.sum() / 100),
    startangle=90,
)
plt.title('Patients by Age Group')

# Adding a legend with the correct order
plt.legend(labels=counting.index, title='Age Range')

plt.show()

Explanation of the Code:

  1. Importing Libraries: We start by importing the necessary libraries. pandas is used for data manipulation, and matplotlib.pyplot is used for creating visualizations.
  2. Creating the DataFrame: We create a simple DataFrame containing a single column Age with sample patient age data.
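  3. Defining Age Groups: We specify labels for the age groups and bin edges for the corresponding age ranges. pd.cut requires exactly one more edge than labels, so the five labels need six edges. Using float('inf') as the last edge leaves the final bin open-ended, so every age of 61 or more falls in the '+61' group.
  4. Categorizing the Ages: The pd.cut function assigns each age to a bin and gives it the matching label. The right=False parameter makes each bin inclusive on the left edge and exclusive on the right edge, so a bin running from 21 to 31 covers ages 21 through 30. Ages below the first edge match no bin and become NaN.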
  5. Counting Patients in Each Group: Using value_counts(), we count the occurrences of each age group. By default value_counts() sorts the result by frequency, so the reindex(labels) method is used to put the counts back in the order specified by labels.
  6. Plotting the Pie Chart: We use plt.pie to create a pie chart with the counts of patients. When autopct is a function, Matplotlib calls it with each slice's percentage; multiplying that percentage by the total and dividing by 100 recovers the actual number of patients, which is shown instead of the percentage.
  7. Adding a Legend: Finally, we add a legend to the pie chart using plt.legend, which displays the age groups in the correct order as defined in labels.
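
The count labels from step 6 can also be produced by a small named function instead of an inline lambda, which makes the conversion easier to read and test. The sketch below is illustrative only: make_count_label is a hypothetical helper name, not part of Matplotlib, and the counts list simply restates the group sizes of the sample data.

```python
def make_count_label(total):
    """Return an autopct-style callback that shows counts, not percentages."""
    def label(pct):
        # pct is a wedge's share as a percentage (0-100).
        # Scale it back to a count and round away floating-point noise.
        return str(int(round(pct * total / 100)))
    return label


# Patients per age group in the sample data (3 + 3 + 2 + 3 + 4 = 15)
counts = [3, 3, 2, 3, 4]
label = make_count_label(sum(counts))

# Simulate the percentages a pie chart would hand to the callback
labels_shown = [label(100 * c / sum(counts)) for c in counts]
print(labels_shown)
```

Passing autopct=make_count_label(counting.sum()) to plt.pie should then give the same slice labels as the lambda in the snippet.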

Conclusion

In this post, we’ve seen how to categorize continuous numerical data into specified groups using Pandas. We then used Matplotlib to create a pie chart, visually representing the distribution of data across these groups. This approach is particularly useful when you want to summarize demographic data or any other continuous numerical data into more meaningful, readable segments.
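
For reuse on other numeric columns, the binning and counting steps could be wrapped in one function. This is a minimal sketch rather than a definitive implementation: count_by_bins and its parameter names are hypothetical, and it assumes left-closed bins (right=False) with edges chosen to match the labels.

```python
import pandas as pd


def count_by_bins(values, edges, labels):
    """Bin numeric values into left-closed intervals and count each bin.

    Hypothetical helper: the name and signature are illustrative only.
    """
    groups = pd.cut(pd.Series(values), bins=edges, labels=labels, right=False)
    # value_counts() sorts by frequency; reindex restores the label order.
    return groups.value_counts().reindex(labels)


ages = [25, 32, 45, 51, 62, 75, 23, 38, 49, 54, 67, 72, 29, 34, 58]
edges = [21, 31, 41, 51, 61, float('inf')]  # one more edge than labels
labels = ['21-30', '31-40', '41-50', '51-60', '+61']

counts = count_by_bins(ages, edges, labels)
print(counts)
```

The returned Series can be passed straight to plt.pie, with counts.index supplying the slice labels.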
