Geek Logbook

Tech sea log book

Combining Multiple CSV Files into One with Python

If you work with data, chances are you’ve encountered situations where you need to merge multiple CSV files into a single file for analysis. Manually combining these files can be time-consuming and error-prone. In this post, I’ll show you how to automate this task using Python and the powerful pandas library.

Problem Statement

Let’s assume you have a folder structure like this:

root_folder/
├── subfolder1/
│   ├── file1.csv
│   ├── file2.csv
├── subfolder2/
│   ├── file3.csv
│   ├── file4.csv
├── subfolder3/
│   ├── file5.csv
│   ├── file6.csv
└── subfolder4/
    ├── file7.csv
    ├── file8.csv

You need to combine all the CSV files from these subfolders into a single CSV file for easier analysis.

Solution

We can achieve this by using the os and pandas libraries in Python. The following function will iterate through all the CSV files in the specified folder and its subfolders, and then concatenate them into a single DataFrame:

import os
import pandas as pd

def combine_csv_files(root_folder):
    all_data = pd.DataFrame()

    for dirpath, dirnames, filenames in os.walk(root_folder):
        for filename in filenames:
            if filename.endswith('.csv'):
                file_path = os.path.join(dirpath, filename)
                data = pd.read_csv(file_path)
                all_data = pd.concat([all_data, data], ignore_index=True)

    return all_data

# Specify the path to your root folder
root_folder = "path/to/your/root_folder"
combined_data = combine_csv_files(root_folder)

# Save the combined data to a new CSV file
combined_data.to_csv("path/to/save/combined_data.csv", index=False)

How It Works

  1. Import Libraries: First, we import the necessary libraries, os for navigating the file system and pandas for handling the data.
  2. Define the Function: combine_csv_files is the function that takes the root_folder as input.
  3. Initialize DataFrame: all_data is initialized as an empty DataFrame.
  4. Iterate Through Files: os.walk(root_folder) iterates through all the files and folders in root_folder.
  5. Read and Concatenate CSV Files: For each file found, it checks if it has a .csv extension, reads the data using pd.read_csv, and concatenates it to all_data.
  6. Return Combined DataFrame: Finally, the function returns the combined DataFrame.
  7. Save Combined Data: We save the combined DataFrame to a new CSV file.
Tags: