How to Optimize a Function for Counting Matching Products in a DataFrame
In this post, we’ll walk through how to optimize a function that counts the number of rows in a DataFrame where two specified products are present. The original function, though functional, can be made more efficient and readable. Let’s dive into the problem and see how we can improve it.
Original Function
Here’s the initial function we were working with:
def cant_boleta(p1, p2, boletas):
if p1 == p2:
return 0
else:
uso = boletas[((boletas['Producto 1']==p1)|(boletas['Producto 2']==p1)|(boletas['Producto 3']==p1)|
(boletas['Producto 4']==p1)|(boletas['Producto 5']==p1)|(boletas['Producto 6']==p1)|
(boletas['Producto 7']==p1)|(boletas['Producto 8']==p1)|(boletas['Producto 9']==p1)|(boletas['Producto 10']==p1))&
((boletas['Producto 1']==p2)|(boletas['Producto 2']==p2)|(boletas['Producto 3']==p2)|
(boletas['Producto 4']==p2)|(boletas['Producto 5']==p2)|(boletas['Producto 6']==p2)|(boletas['Producto 7']==p2)|
(boletas['Producto 8']==p2)|(boletas['Producto 9']==p2)|(boletas['Producto 10']==p2))]
return len(uso)
This function checks each row to see if both products p1 and p2 are present in any of the product columns. If p1 is equal to p2, it returns 0. Otherwise, it filters the DataFrame to find matching rows and returns the count.
Problems with the Original Function
- Redundant Comparisons: The function performs multiple comparisons for each product, making the code lengthy and harder to maintain.
- Efficiency: The function repeatedly accesses columns, which can be optimized.
- Readability: The extensive use of logical OR conditions makes the code difficult to read and understand.
Improved Function
We can improve the function by using the apply method from pandas, which allows us to apply a function to each row of the DataFrame. This method will help us make the code more efficient and readable.
Here is the optimized version of the function:
import pandas as pd
def cant_boleta(p1, p2, boletas):
if p1 == p2:
return 0
# Define a function to check if both products are in the row
def contains_products(row):
products = row[['Producto 1', 'Producto 2', 'Producto 3', 'Producto 4', 'Producto 5',
'Producto 6', 'Producto 7', 'Producto 8', 'Producto 9', 'Producto 10']]
return p1 in products.values and p2 in products.values
# Apply the function to each row of the DataFrame and count how many rows meet the condition
boletas_with_products = boletas.apply(contains_products, axis=1)
return boletas_with_products.sum()
# Example usage
data = {
'Producto 1': ['A', 'B', 'C'],
'Producto 2': ['D', 'A', 'E'],
'Producto 3': ['F', 'G', 'H'],
'Producto 4': ['I', 'J', 'K'],
'Producto 5': ['L', 'M', 'N'],
'Producto 6': ['O', 'P', 'Q'],
'Producto 7': ['R', 'S', 'T'],
'Producto 8': ['U', 'V', 'W'],
'Producto 9': ['X', 'Y', 'Z'],
'Producto 10': ['AA', 'BB', 'CC']
}
boletas = pd.DataFrame(data)
# Call the function with example
result = cant_boleta('A', 'D', boletas)
print(f"Number of tickets containing products 'A' and 'D': {result}")
Explanation of the Improvements
- Defining a Helper Function: We define a helper function
contains_productsthat checks if bothp1andp2are in the row. This function is applied to each row of the DataFrame. - Using
apply: Theapplymethod applies the helper function to each row, simplifying the condition checks and making the code cleaner. - Counting Rows: Instead of filtering the DataFrame multiple times, we use
sum()to count how many rows contain both products, improving efficiency.
Conclusion
By using the apply method and defining a helper function, we have optimized the cant_boleta function to be more readable and efficient. This approach not only simplifies the logic but also improves performance, especially when working with larger DataFrames.