How to Optimize a Function for Counting Matching Products in a DataFrame

By - Geek Logbook
Posted on 2024-08-072024-11-22
Posted in Programming

How to Optimize a Function for Counting Matching Products in a DataFrame

In this post, we’ll walk through how to optimize a function that counts the number of rows in a DataFrame where two specified products are present. The original function, though functional, can be made more efficient and readable. Let’s dive into the problem and see how we can improve it.

Original Function

Here’s the initial function we were working with:

def cant_boleta(p1, p2, boletas):
    if p1 == p2:
        return 0
    else:
        uso = boletas[((boletas['Producto 1']==p1)|(boletas['Producto 2']==p1)|(boletas['Producto 3']==p1)|
        (boletas['Producto 4']==p1)|(boletas['Producto 5']==p1)|(boletas['Producto 6']==p1)|
        (boletas['Producto 7']==p1)|(boletas['Producto 8']==p1)|(boletas['Producto 9']==p1)|(boletas['Producto 10']==p1))&
        ((boletas['Producto 1']==p2)|(boletas['Producto 2']==p2)|(boletas['Producto 3']==p2)|
        (boletas['Producto 4']==p2)|(boletas['Producto 5']==p2)|(boletas['Producto 6']==p2)|(boletas['Producto 7']==p2)|
        (boletas['Producto 8']==p2)|(boletas['Producto 9']==p2)|(boletas['Producto 10']==p2))]
        return len(uso)

This function checks each row to see if both products p1 and p2 are present in any of the product columns. If p1 is equal to p2, it returns 0. Otherwise, it filters the DataFrame to find matching rows and returns the count.

Problems with the Original Function

Redundant Comparisons: The function performs multiple comparisons for each product, making the code lengthy and harder to maintain.
Efficiency: The function repeatedly accesses columns, which can be optimized.
Readability: The extensive use of logical OR conditions makes the code difficult to read and understand.

Improved Function

We can improve the function by using the apply method from pandas, which allows us to apply a function to each row of the DataFrame. This method will help us make the code more efficient and readable.

Here is the optimized version of the function:

import pandas as pd

def cant_boleta(p1, p2, boletas):
    if p1 == p2:
        return 0
    
    # Define a function to check if both products are in the row
    def contains_products(row):
        products = row[['Producto 1', 'Producto 2', 'Producto 3', 'Producto 4', 'Producto 5', 
                        'Producto 6', 'Producto 7', 'Producto 8', 'Producto 9', 'Producto 10']]
        return p1 in products.values and p2 in products.values
    
    # Apply the function to each row of the DataFrame and count how many rows meet the condition
    boletas_with_products = boletas.apply(contains_products, axis=1)
    
    return boletas_with_products.sum()

# Example usage
data = {
    'Producto 1': ['A', 'B', 'C'],
    'Producto 2': ['D', 'A', 'E'],
    'Producto 3': ['F', 'G', 'H'],
    'Producto 4': ['I', 'J', 'K'],
    'Producto 5': ['L', 'M', 'N'],
    'Producto 6': ['O', 'P', 'Q'],
    'Producto 7': ['R', 'S', 'T'],
    'Producto 8': ['U', 'V', 'W'],
    'Producto 9': ['X', 'Y', 'Z'],
    'Producto 10': ['AA', 'BB', 'CC']
}

boletas = pd.DataFrame(data)

# Call the function with example
result = cant_boleta('A', 'D', boletas)
print(f"Number of tickets containing products 'A' and 'D': {result}")

Explanation of the Improvements

Defining a Helper Function: We define a helper function contains_products that checks if both p1 and p2 are in the row. This function is applied to each row of the DataFrame.
Using apply: The apply method applies the helper function to each row, simplifying the condition checks and making the code cleaner.
Counting Rows: Instead of filtering the DataFrame multiple times, we use sum() to count how many rows contain both products, improving efficiency.

Conclusion

By using the apply method and defining a helper function, we have optimized the cant_boleta function to be more readable and efficient. This approach not only simplifies the logic but also improves performance, especially when working with larger DataFrames.

Tags:Python

Geek Logbook

Recent Posts

Categories

Archives

How to Optimize a Function for Counting Matching Products in a DataFrame

Original Function

Problems with the Original Function

Improved Function

Explanation of the Improvements

Conclusion

Previous Article

Next Article