Combining Multiple CSV Files into One with Python
If you work with data, chances are you’ve encountered situations where you need to merge multiple CSV files into a single file for analysis. Manually combining these files can be time-consuming and error-prone. In this post, I’ll show you how to automate this task using Python and the powerful pandas library.
Problem Statement
Let’s assume you have a folder structure like this:
root_folder/
├── subfolder1/
│ ├── file1.csv
│ ├── file2.csv
├── subfolder2/
│ ├── file3.csv
│ ├── file4.csv
├── subfolder3/
│ ├── file5.csv
│ ├── file6.csv
└── subfolder4/
├── file7.csv
├── file8.csv
You need to combine all the CSV files from these subfolders into a single CSV file for easier analysis.
Solution
We can achieve this by using the os and pandas libraries in Python. The following function will iterate through all the CSV files in the specified folder and its subfolders, and then concatenate them into a single DataFrame:
import os
import pandas as pd
def combine_csv_files(root_folder):
all_data = pd.DataFrame()
for dirpath, dirnames, filenames in os.walk(root_folder):
for filename in filenames:
if filename.endswith('.csv'):
file_path = os.path.join(dirpath, filename)
data = pd.read_csv(file_path)
all_data = pd.concat([all_data, data], ignore_index=True)
return all_data
# Specify the path to your root folder
root_folder = "path/to/your/root_folder"
combined_data = combine_csv_files(root_folder)
# Save the combined data to a new CSV file
combined_data.to_csv("path/to/save/combined_data.csv", index=False)
How It Works
- Import Libraries: First, we import the necessary libraries,
osfor navigating the file system andpandasfor handling the data. - Define the Function:
combine_csv_filesis the function that takes theroot_folderas input. - Initialize DataFrame:
all_datais initialized as an empty DataFrame. - Iterate Through Files:
os.walk(root_folder)iterates through all the files and folders inroot_folder. - Read and Concatenate CSV Files: For each file found, it checks if it has a
.csvextension, reads the data usingpd.read_csv, and concatenates it toall_data. - Return Combined DataFrame: Finally, the function returns the combined DataFrame.
- Save Combined Data: We save the combined DataFrame to a new CSV file.