
Handling Boolean vs IntegerType Mismatches Between MySQL and Spark (Glue JDBC)

When integrating MySQL data into Apache Spark (for example, through AWS Glue), you might encounter schema mismatches caused by how MySQL represents TINYINT(1) fields. This issue often surfaces when data pipelines read from different sources or connection types, such as JDBC over VPC Peering versus SSH tunnels.

The Problem

In MySQL, the data type TINYINT(1) is commonly used to store boolean values (0 or 1). By default, however, MySQL Connector/J (the JDBC driver) maps it to a boolean (True / False) rather than a numeric value.

Spark’s IntegerType() cannot accept Python bool objects, leading to an error similar to:

[CANNOT_ACCEPT_OBJECT_IN_TYPE] IntegerType() can not accept object False in type bool.

This typically occurs when:

  • The JDBC driver automatically converts TINYINT(1) into BOOLEAN.
  • Your schema in Spark defines the field as IntegerType() expecting 0/1.

As a result, the same column (is_active, is_member, etc.) may behave differently depending on the connection route (SSH vs Peering).
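The mismatch can be reproduced without JDBC at all, which makes the error easier to see in isolation. A minimal sketch (the column name is illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

spark = SparkSession.builder.getOrCreate()

# The schema expects 0/1 integers...
schema = StructType([StructField("is_active", IntegerType())])

# ...but the driver hands Spark Python booleans, so schema verification
# fails with an error like the one shown above.
df = spark.createDataFrame([(False,)], schema)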


Why It Happens

This inconsistency comes from MySQL Connector/J, the JDBC driver, whose connection properties control how TINYINT(1) columns are interpreted. Two properties determine this behavior:

  • tinyInt1isBit
  • transformedBitIsBoolean

With tinyInt1isBit left at its default (true), the driver reports TINYINT(1) columns as a bit/boolean type rather than an integer, and that is what Spark then infers.
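A quick way to confirm which mapping you are getting is to read the table without forcing a schema and inspect what Spark inferred. A hedged sketch (table and column names are illustrative):

df = (spark.read
        .format("jdbc")
        .option("url", "jdbc:mysql://<hostname>:3306/<database>")
        .option("dbtable", "sale_details")
        .option("user", USER)
        .option("password", PASSWORD)
        .load())

df.printSchema()
# With the driver defaults, a TINYINT(1) column typically shows up as:
#  |-- is_active: boolean (nullable = true)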


The Fix: Force Numeric Behavior in the JDBC URL

The simplest and most robust solution is to modify the JDBC connection URL to explicitly treat TINYINT(1) as numeric:

jdbc_url = (
    "jdbc:mysql://<hostname>:3306/<database>"
    "?tinyInt1isBit=false&transformedBitIsBoolean=false"
)

Then read your table as usual:

df = (spark.read
        .format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", "sale_details")
        .option("user", USER)
        .option("password", PASSWORD)
        .load())

With these properties set, the driver exposes TINYINT(1) values as integers, so a Spark schema that declares the column as IntegerType() loads them correctly.
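If you would rather not assemble the URL by hand, recent Spark versions generally forward unrecognized JDBC reader options to the driver as connection properties, so the same flags can usually be passed as options (worth verifying against your Spark and driver versions):

df = (spark.read
        .format("jdbc")
        .option("url", "jdbc:mysql://<hostname>:3306/<database>")
        .option("dbtable", "sale_details")
        .option("tinyInt1isBit", "false")
        .option("transformedBitIsBoolean", "false")
        .option("user", USER)
        .option("password", PASSWORD)
        .load())

df.printSchema()
# The flag column should now be inferred as an integer type rather than boolean.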


Alternative Approaches

If you can’t modify the JDBC URL (for example, using a managed Glue connection), here are two workarounds:

1. Cast After Reading

Let Spark infer the column as BooleanType() (or declare it that way), then cast it to an integer after reading:

from pyspark.sql import functions as F

df = df.withColumn("is_membership", F.col("is_membership").cast("int"))
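If several flag columns are affected, the same cast can be applied in a loop (the column list here is illustrative):

from pyspark.sql import functions as F

flag_columns = ["is_active", "is_member", "is_membership"]

for col_name in flag_columns:
    df = df.withColumn(col_name, F.col(col_name).cast("int"))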

2. Cast in the SQL Query (Pushdown)

Use a SQL subquery to cast the column in MySQL before Spark reads it:

query = "(SELECT CAST(is_membership AS UNSIGNED) AS is_membership, other_col FROM sale_details) AS tmp"

df = (spark.read
        .format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", query)
        .option("user", USER)
        .option("password", PASSWORD)
        .load())

This ensures the driver only sees numeric data, eliminating ambiguity.
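On Spark 2.4 and later, the same pushdown can also be expressed through the query option instead of aliasing a subquery in dbtable (the two options are mutually exclusive):

query = "SELECT CAST(is_membership AS UNSIGNED) AS is_membership, other_col FROM sale_details"

df = (spark.read
        .format("jdbc")
        .option("url", jdbc_url)
        .option("query", query)
        .option("user", USER)
        .option("password", PASSWORD)
        .load())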


Best Practice

  • Always ensure consistency between different ingestion paths (e.g., JDBC over VPC Peering vs an SSH tunnel).
  • Prefer explicit typing in your database schema (in MySQL, BOOLEAN is an alias for TINYINT(1), so pick one spelling and use it consistently).
  • If multiple systems depend on the same column, force a deterministic type mapping through the JDBC properties.