Parsing Complex Data from HTML Tables with Python
When scraping the web, you often encounter HTML where the data you need is nested or embedded in JavaScript event-handler attributes. This post walks through parsing player statistics from a complex HTML table, using Python and the BeautifulSoup library to extract JSON-like data hidden in JavaScript function calls.
Project Overview
We have an HTML table filled with player statistics, where each row contains the player’s stats embedded in the onclick attribute. This attribute holds a JavaScript function call that includes JSON-like data, so our goal is to parse this data and structure it into a readable and usable format.
Tools Used
- Python: General scripting and data manipulation.
- BeautifulSoup: A Python library to parse HTML documents.
- json and ast modules: Converting JavaScript-like JSON data into Python dictionaries.
Step-by-Step Guide to Extracting the Data
- Read the HTML Content: First, we load the HTML content that we want to parse. This is typically read from a file or obtained through a web scraping tool like requests.
- Set Up the Parsing Function: We’ll use BeautifulSoup to parse the HTML table and locate each row that contains player data. Each row has an onclick attribute that includes the player’s statistics in JSON format.
- Extract JSON-Like Data: Using regular expressions, we isolate the JSON part within the onclick function call. Here’s where we leverage Python’s ast module to safely evaluate this JSON-like structure into a Python dictionary.
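Before looking at the full function, the extraction step can be tried on a single attribute value. This is a minimal sketch: the onclick string and its field names (nombre, goles) are made up for illustration, and only the EstadisticasComponente function name comes from the page being parsed.

```python
import ast
import re

# Hypothetical onclick value; the field names are illustrative assumptions.
onclick = 'EstadisticasComponente({"nombre": "Ana", "goles": 3}, "detalle")'

# Capture the first {...} group that follows the opening parenthesis.
match = re.search(r'EstadisticasComponente\((\{.*?\})', onclick)

# literal_eval only accepts Python literals, so no code in the string is executed.
player = ast.literal_eval(match.group(1)) if match else None
print(player)
```

The result is an ordinary Python dictionary, so fields can be read with player["goles"] and similar lookups.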
The Code
Below is the Python code that accomplishes our goal:
from bs4 import BeautifulSoup
import json
import re
import ast


def parse_jugadores(html_content):
    """
    Parses player statistics from an HTML table and returns them as structured dictionaries.

    Args:
        html_content (str): HTML content containing the player table.

    Returns:
        list: List of dictionaries with each player's detailed statistics.
    """
    soup = BeautifulSoup(html_content, 'html.parser')
    jugadores = []

    # Locate the stats table
    table = soup.find('table', class_='table table-striped tabla-estadisticas ordenTabla')
    if table is None:
        # No matching table in this document: nothing to parse
        return jugadores

    # Iterate over each player row
    for row in table.find_all('tr')[2:]:  # Skip the first two rows (headers)
        onclick_data = row.get('onclick')
        if onclick_data:
            # Use regex to capture JSON-like data within the function call
            match = re.search(r'EstadisticasComponente\((\{.*?\})', onclick_data)
            if match:
                jugador_info_json = match.group(1)
                try:
                    # Restore any leftover &quot; entities, then convert the string to a dictionary
                    jugador_info = ast.literal_eval(jugador_info_json.replace('&quot;', '"'))
                    jugadores.append(jugador_info)
                except (ValueError, SyntaxError) as e:
                    print(f"Error parsing player data: {e}")
    return jugadores


# Example usage
with open('output.txt', 'r', encoding='utf-8') as file:
    html_content = file.read()

jugadores = parse_jugadores(html_content)
print(json.dumps(jugadores, indent=4, ensure_ascii=False))
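One fragile spot in the function above is the table lookup: passing a multi-class string to class_ only matches when the class attribute is exactly that string, in that order. The sketch below uses made-up markup to show a CSS selector as a possible alternative. The choice of tabla-estadisticas as the identifying class is an assumption about the real page. It also shows that BeautifulSoup returns attribute values with entities such as &quot; already decoded.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: same classes as in the post, but in a different order,
# with the onclick payload written using &quot; entities.
html = """
<table class="ordenTabla table tabla-estadisticas table-striped">
  <tr><th>Jugador</th></tr>
  <tr onclick="EstadisticasComponente({&quot;nombre&quot;: &quot;Ana&quot;})"><td>Ana</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Exact-string match on the class attribute: fails when the order differs.
exact = soup.find("table", class_="table table-striped tabla-estadisticas ordenTabla")

# CSS selector on a single class: independent of class order.
by_selector = soup.select_one("table.tabla-estadisticas")

# Select only rows that carry an onclick attribute instead of slicing off headers.
onclick_values = [tr["onclick"] for tr in by_selector.select("tr[onclick]")]

print(exact is None)
print(onclick_values[0])
```

Selecting tr[onclick] also removes the need to know how many header rows to skip, at the cost of assuming that only player rows carry the attribute.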
Explanation of Key Parts
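- Regex for Extracting JSON: The re.search function captures the JSON part within the onclick function call. Because the pattern is non-greedy, it stops at the first closing brace, so it assumes the object contains no nested braces.
- Evaluating JSON with ast.literal_eval: This parses the JSON-like structure into a dictionary without executing any code, because it accepts only Python literals. JSON-only tokens such as true, false, and null are not Python literals and raise a ValueError, which the function reports and skips.
- Output: Each player’s statistics are stored in a list of dictionaries, easily accessible for further analysis.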
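If the embedded payload turns out to be strict JSON (double-quoted keys, possibly with true, false, null, or nested objects), the standard library's json.JSONDecoder.raw_decode is a possible alternative to the regex plus ast.literal_eval combination. It parses one JSON value starting at a given index and ignores whatever follows. The sketch below is not the approach used in this post; the helper name extract_first_json_object and the sample payload are hypothetical.

```python
import json


def extract_first_json_object(call_text, func_name="EstadisticasComponente"):
    """Return the first JSON object passed to func_name, or None if absent."""
    start = call_text.find(func_name + "(")
    if start == -1:
        return None
    # Assumes the first "{" after the function name opens the argument object.
    brace = call_text.find("{", start)
    if brace == -1:
        return None
    try:
        # raw_decode returns (parsed_value, index_where_parsing_stopped).
        obj, _end = json.JSONDecoder().raw_decode(call_text, brace)
    except json.JSONDecodeError:
        return None
    return obj


# Hypothetical payload with a nested object and a JSON boolean.
sample = 'EstadisticasComponente({"nombre": "Ana", "activo": true, "totales": {"goles": 3}}, 1)'
print(extract_first_json_object(sample))
```

The trade-off is that raw_decode rejects single-quoted strings, which ast.literal_eval accepts, so which one fits depends on how the page actually writes the payload.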
Final Thoughts
Parsing HTML content with embedded data requires thoughtful handling, especially with JSON embedded in JavaScript functions. The use of BeautifulSoup, regex, and ast.literal_eval creates an efficient pipeline for extracting structured data from complex HTML. This approach can be extended to similar scenarios across web scraping projects, providing flexibility and robustness when dealing with intricate HTML structures.