Extracting Data from Fixed-Width Text Files into Pandas DataFrame

By Geek Logbook
Posted on 2024-07-24 (updated 2024-11-22)
Posted in Programming

Working with fixed-width text files can be challenging, especially when you need to extract specific fields and transform them into a structured format like a Pandas DataFrame. In this blog post, I’ll walk you through a Python solution to achieve this efficiently.

The Challenge

Fixed-width text files store data in a format where each field has a fixed number of characters. Extracting data from such files involves identifying the starting and ending positions of each field. Here’s an example of a fixed-width text file format:

John Doe  29 New York
Jane Smith34 Los Angeles

In the above example, we need to extract the name, age, and location fields from each line. This post will guide you through creating a function to read these lines and transform them into a DataFrame.
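To make the idea of positional extraction concrete, here is a minimal sketch that slices a single line by hand. It assumes a layout of 10 characters for the name, 3 for the age, and 11 for the location; the sample line is built with padding so that it matches those assumed widths exactly.

```python
# Build a sample line padded to the assumed widths: 10 + 3 + 11 characters.
line = "John Doe".ljust(10) + "29".ljust(3) + "New York".ljust(11)

# Python slices are 0-based and exclude the end index.
name = line[0:10].strip()
age = line[10:13].strip()
location = line[13:24].strip()

print(name, age, location, sep=" | ")
```

Each slice returns the raw padded field, and strip() removes the filler spaces.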

The Solution

We’ll use Python’s built-in file handling and the Pandas library to achieve this. Here’s a step-by-step guide:

Step 1: Define the Field Positions

First, we need to define the positions of each field in the text file. For example, if the name occupies the first 10 characters, age the next 3 characters, and location the next 11 characters, we define these positions as a list of (name, start, end) tuples, where start and end are 1-based and inclusive.

fields = [("Name", 1, 10), ("Age", 11, 13), ("Location", 14, 24)]
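Because the positions are 1-based and inclusive while Python slices are 0-based and exclude the end index, it can help to see the conversion spelled out. The sketch below uses a hypothetical helper, to_slices, which is not part of the original solution. It also rejects invalid or overlapping ranges, on the assumption that fields are listed from left to right.

```python
fields = [("Name", 1, 10), ("Age", 11, 13), ("Location", 14, 24)]


def to_slices(field_list):
    """Map (name, start, end) with 1-based inclusive positions to Python slices."""
    slices = {}
    previous_end = 0
    for name, start, end in field_list:
        if start < 1 or end < start:
            raise ValueError(f"Invalid range for {name!r}: {start}-{end}")
        # Assumes fields are listed left to right.
        if start <= previous_end:
            raise ValueError(f"Field {name!r} overlaps the previous field")
        # 1-based inclusive (start, end) becomes 0-based half-open [start - 1, end).
        slices[name] = slice(start - 1, end)
        previous_end = end
    return slices


slices = to_slices(fields)
print(slices["Age"])  # slice(10, 13, None)
```

A check like this catches typos in the layout before any file is read.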

Step 2: Create the Data Extraction Function

Next, we create a function to read the file, extract the data according to the field positions, and return a DataFrame. We’ll encapsulate the functionality in a class called DataProcessor.

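import pandas as pd


class DataProcessor:
    def create_dataframe(self, file_path, field_list):
        """Read a fixed-width file and return one DataFrame row per line.

        field_list holds (name, start, end) tuples with 1-based, inclusive positions.
        """
        data_list = []

        def parse_line(line, field_list):
            data = {}
            for field_name, char_from, char_to in field_list:
                # Convert 1-based inclusive positions to a 0-based Python slice.
                data[field_name] = line[char_from - 1 : char_to].strip()
            return data

        # Set the encoding explicitly; adjust it if the file is not UTF-8.
        with open(file_path, 'r', encoding='utf-8') as file:
            for line in file:
                # Skip blank lines so they do not become empty rows.
                if not line.strip():
                    continue
                data_list.append(parse_line(line, field_list))

        return pd.DataFrame(data_list)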

Step 3: Usage Example

Now, let’s see how to use the DataProcessor class to read a fixed-width text file and create a DataFrame.

if __name__ == "__main__":
    processor = DataProcessor()
    fields = [("Name", 1, 10), ("Age", 11, 13), ("Location", 14, 24)]
    file_path = "data.txt"  # Replace with your actual file path
    df = processor.create_dataframe(file_path, fields)
    print(df)
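Since every value is sliced out of a line of text, all columns in the resulting DataFrame hold strings, including Age. A minimal sketch of converting a column to numbers follows; the DataFrame here is a hand-built stand-in for the result of create_dataframe, so the example runs without a data file.

```python
import pandas as pd

# Stand-in for the DataFrame returned by create_dataframe: every value is a string.
df = pd.DataFrame(
    [
        {"Name": "John Doe", "Age": "29", "Location": "New York"},
        {"Name": "Jane Smith", "Age": "34", "Location": "Los Angeles"},
    ]
)

# errors="coerce" turns unparseable values into NaN instead of raising.
df["Age"] = pd.to_numeric(df["Age"], errors="coerce")

print(df.dtypes)
```

Using errors="coerce" is a design choice: it keeps the load from failing on a malformed row, at the cost of having to check for missing values afterwards.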

Explanation

Defining the Field Positions: The fields list contains tuples, each representing a field with its name and its 1-based, inclusive start and end positions.
DataProcessor Class: The create_dataframe method reads the file line by line, extracts the fields from each line based on those positions, and appends the resulting dictionary to a list.
parse_line Function: This nested function slices each field out of a line and strips the padding whitespace around the value. Every value is kept as a string.
Creating the DataFrame: After processing all lines, the method converts the list of dictionaries into a Pandas DataFrame, with one row per line and one column per field, and returns it.
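For comparison, Pandas also ships a built-in reader for this kind of file, pandas.read_fwf. It is not what this post uses, but the same field list can be adapted to it. The sketch below assumes the same 10/3/11 layout and reads from an in-memory sample instead of a file on disk.

```python
import io

import pandas as pd

fields = [("Name", 1, 10), ("Age", 11, 13), ("Location", 14, 24)]

# read_fwf expects 0-based, half-open (start, end) pairs,
# so a 1-based inclusive (start, end) becomes (start - 1, end).
colspecs = [(start - 1, end) for _, start, end in fields]
names = [name for name, _, _ in fields]

# In-memory sample padded to the widths above; a file path works the same way.
sample = io.StringIO(
    "John Doe  29 New York   \n"
    "Jane Smith34 Los Angeles\n"
)

# header=None tells pandas the first line is data, not column names.
df = pd.read_fwf(sample, colspecs=colspecs, names=names, header=None)
print(df)
```

One difference worth noting: read_fwf infers column types, so Age comes back as a number here, whereas the hand-written parser keeps every value as a string.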

Conclusion

This approach simplifies the extraction of data from fixed-width text files into a structured format, making it easier to analyze and manipulate using Pandas. The DataProcessor class can be reused for a different fixed-width layout by updating only the fields list.

Tags: Python
