Geek Logbook

Tech sea log book

Parsing Complex Data from HTML Tables with Python

When scraping web pages, you often find data that is not in the visible markup but encoded inside HTML attributes, such as a JavaScript call in an onclick handler. This post walks through parsing player statistics from such a table, using Python and the BeautifulSoup library to extract the JSON-like data embedded in JavaScript function calls.

Project Overview

We have an HTML table filled with player statistics, where each row contains the player’s stats embedded in the onclick attribute. This attribute uses a JavaScript function call that includes JSON-like data, so our goal is to parse this data and structure it into a readable and usable format.
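To make the input concrete, here is a minimal sketch of what one such row might look like. The row content and the field names (nombre, goles) are assumptions for illustration, not taken from the real page. In raw HTML source, the quotes inside the attribute are usually escaped as &quot; entities; html.unescape reproduces the decoding that an HTML parser applies to attribute values.

```python
import html

# Hypothetical raw HTML for one player row; the field names are illustrative only.
raw_row = (
    '<tr onclick="EstadisticasComponente('
    '{&quot;nombre&quot;: &quot;Jugador A&quot;, &quot;goles&quot;: 3})">'
    '<td>Jugador A</td><td>3</td></tr>'
)

# An HTML parser decodes entities inside attribute values; html.unescape
# applies the same decoding to the raw text so the embedded payload is visible.
decoded_row = html.unescape(raw_row)
print(decoded_row)
```

After decoding, the argument to the function call reads as an ordinary JSON-like object, which is the part we want to extract.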

Tools Used

  • Python: General scripting and data manipulation.
  • BeautifulSoup: A Python library to parse HTML documents.
  • re, ast, and json modules: re isolates the embedded payload, ast converts the JavaScript-like JSON data into Python dictionaries, and json pretty-prints the result.

Step-by-Step Guide to Extracting the Data

  1. Read the HTML Content: First, we load the HTML content that we want to parse. This is typically read from a file or downloaded with an HTTP library like requests.
  2. Set Up the Parsing Function: We’ll use BeautifulSoup to parse the HTML and locate the stats table, then each row that contains player data. Each such row has an onclick attribute that carries the player’s statistics as a JSON-like object.
  3. Extract JSON-Like Data: Using a regular expression, we isolate the JSON-like part within the onclick function call. We then use ast.literal_eval from Python’s ast module to evaluate this structure into a Python dictionary without executing any code. This works when the payload contains only strings, numbers, and nested lists or dictionaries; the JSON literals true, false, and null are not valid Python literals and will raise an error.
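Step 3 can be sketched in isolation on a single attribute value. The onclick string below is a made-up example (the field names and the trailing this argument are assumptions); only the function name and the regular expression match the ones used in this post.

```python
import ast
import re

# Hypothetical onclick value, as it looks after the HTML parser has decoded it.
onclick_value = 'EstadisticasComponente({"nombre": "Jugador A", "goles": 3}, this)'

# Capture the first {...} group that follows the opening parenthesis.
match = re.search(r'EstadisticasComponente\((\{.*?\})', onclick_value)

# literal_eval accepts only Python literals, so nothing in the payload is executed.
player = ast.literal_eval(match.group(1)) if match else None
print(player)
```

The result is a plain dictionary, so individual values can be read with ordinary key access such as player["goles"].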

The Code

Below is the Python code that accomplishes our goal:

from bs4 import BeautifulSoup
import json
import re
import ast

def parse_jugadores(html_content):
    """
    Parses player statistics from an HTML table and returns them as structured dictionaries.
    
    Args:
        html_content (str): HTML content containing the player table.
    
    Returns:
        list: List of dictionaries with each player's detailed statistics.
    """
    soup = BeautifulSoup(html_content, 'html.parser')
    jugadores = []

    # Locate the stats table
    table = soup.find('table', class_='table table-striped tabla-estadisticas ordenTabla')
    if table is None:
        return jugadores  # No stats table found in this document

    # Iterate over each player row
    for row in table.find_all('tr')[2:]:  # Skip the first two (header) rows
        onclick_data = row.get('onclick')
        if onclick_data:
            # Use regex to capture JSON-like data within the function call
            match = re.search(r'EstadisticasComponente\((\{.*?\})', onclick_data)
            if match:
                jugador_info_json = match.group(1)
                try:
                    # Restore any HTML-escaped quotes, then parse the literal into a dictionary
                    jugador_info = ast.literal_eval(jugador_info_json.replace('&quot;', '"'))
                    jugadores.append(jugador_info)
                except (ValueError, SyntaxError) as e:
                    print(f"Error parsing player data: {e}")

    return jugadores

# Example usage
with open('output.txt', 'r', encoding='utf-8') as file:
    html_content = file.read()

jugadores = parse_jugadores(html_content)
print(json.dumps(jugadores, indent=4, ensure_ascii=False))

Explanation of Key Parts

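  1. Regex for Extracting JSON: The re.search call captures the first {...} group that follows EstadisticasComponente(. Because the pattern is non-greedy (.*?), it stops at the first closing brace, so it only works for flat objects with no nested braces and no } inside string values.
  2. Evaluating JSON with ast.literal_eval: This parses the JSON-like structure into a dictionary by accepting only Python literals, so no code in the payload is ever executed. It raises ValueError or SyntaxError on anything else, including the JSON literals true, false, and null; for strictly valid JSON, json.loads is the more direct choice.
  3. Output: Each player’s statistics are returned as a list of dictionaries, one per player, which json.dumps then prints with ensure_ascii=False so that non-ASCII characters in names stay readable.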
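If the payload is strictly valid JSON, one alternative worth considering is to skip the regex and let the json module find the end of the object. The sketch below is not the method used in this post: it swaps in json.JSONDecoder.raw_decode, which parses one JSON value starting at a given index and ignores whatever follows. The helper name extract_first_object and the sample fields are hypothetical.

```python
import json

def extract_first_object(onclick_value, func_name="EstadisticasComponente"):
    """Return the first JSON object passed to func_name, or None if absent or invalid."""
    start = onclick_value.find(func_name + "(")
    if start == -1:
        return None
    brace = start + len(func_name) + 1  # index just after the opening parenthesis
    # This sketch assumes the object starts immediately after "(" with no whitespace.
    if not onclick_value.startswith("{", brace):
        return None
    try:
        # raw_decode parses one JSON value at `brace` and ignores trailing text.
        obj, _end = json.JSONDecoder().raw_decode(onclick_value, brace)
    except json.JSONDecodeError:
        return None
    return obj

# Hypothetical payload with a nested object and a JSON boolean.
sample = 'EstadisticasComponente({"nombre": "Jugador A", "activo": true, "totales": {"goles": 3}}, 7)'
result = extract_first_object(sample)
print(result)
```

This handles nested objects and the literals true, false, and null, at the cost of rejecting payloads that are only JSON-like, such as objects with single-quoted strings.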

Final Thoughts

Parsing HTML with embedded data takes some care, especially when the data is JSON inside a JavaScript function call. Combining BeautifulSoup, a regular expression, and ast.literal_eval gives a short pipeline for pulling structured data out of such markup. The same approach carries over to other scraping projects where data lives in event-handler attributes, provided the regex and the parsing step are adjusted to the actual shape of the payload.
