Choosing Between saveAsTable and Iceberg’s writeTo in AWS Glue and Athena
When working with Spark on AWS Glue, there are multiple ways to persist DataFrames as tables and make them queryable in Amazon Athena. Two common approaches are:
- Using Spark’s Hive-style saveAsTable
- Using Apache Iceberg’s writeTo API
At first glance they may look similar, but they solve different problems and have distinct implications for scalability, schema evolution, and data management.
1. Writing with saveAsTable
A simple Spark DataFrame write might look like this:
df.write.mode("overwrite").saveAsTable("staging.budget")
What happens
- Spark writes the dataset to storage (e.g., S3) in Parquet format by default.
- The table is registered in the Glue Data Catalog, making it visible in Athena.
- Using overwrite drops the old data and replaces it with the new DataFrame.
Characteristics
- Simple and widely supported.
- Schema evolution is limited.
- Partitioning requires manual setup (.partitionBy() before .saveAsTable()).
- No transactional guarantees. If a job fails mid-write, the table can be left in an inconsistent state.
This approach works best when datasets are small-to-medium sized, updates are full replacements, and advanced transactional features are not required.
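The manual partitioning step mentioned above can be sketched as a small helper. This is a minimal illustration rather than a prescribed pattern: the function name save_partitioned_table and the example column fiscal_year are hypothetical, and df is assumed to be an existing Spark DataFrame.

```python
from typing import Sequence


def save_partitioned_table(df, table_name: str, partition_cols: Sequence[str]) -> None:
    """Overwrite a Hive-style table, partitioned by the given columns.

    `df` is assumed to be a pyspark.sql.DataFrame. The helper name is
    hypothetical; it only chains standard DataFrameWriter calls.
    """
    if not partition_cols:
        raise ValueError("at least one partition column is required")
    (
        df.write
        .mode("overwrite")             # full replacement of the existing data
        .partitionBy(*partition_cols)  # must be set before saveAsTable()
        .saveAsTable(table_name)       # writes the files and registers the table
    )
```

A call such as save_partitioned_table(df, "staging.budget", ["fiscal_year"]) would then replace the whole table on every run, which is the behavior described in this section.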
2. Writing with Iceberg’s writeTo
A more modern way is to use Apache Iceberg through the writeTo API:
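# Assumes spark, df, qualified_table, table_path and partition_column are defined earlier in the job
try:
    # _jcatalog is a private handle; PySpark 3.3+ also exposes the public spark.catalog.tableExists()
    table_exists = spark.catalog._jcatalog.tableExists(qualified_table)
    if table_exists:
        # Replace only the partitions that appear in df
        df.writeTo(qualified_table).overwritePartitions()
        print(f"Data written/overwritten to {qualified_table}")
    else:
        # First run: create the Iceberg table at the given location
        df.writeTo(qualified_table) \
            .using("iceberg") \
            .tableProperty("location", table_path) \
            .partitionedBy(partition_column) \
            .create()
        print(f"Table created at {table_path}")
except Exception as exc:
    # Log the failure and re-raise so the job run is marked as failed
    print(f"Write to {qualified_table} failed: {exc}")
    raise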
What happens
- If the table does not exist, it is created as an Iceberg table.
- If it exists, only the partitions present in the DataFrame are overwritten (overwritePartitions()), avoiding a full table rewrite.
- Metadata and schema are managed by Iceberg, and the table remains registered in the Glue Data Catalog.
Characteristics
- Transactional guarantees (ACID): Safe concurrent writes, consistent snapshots.
- Efficient partition handling: No need to manage partitions manually.
- Schema evolution: Adding, dropping, or renaming columns is supported natively.
- Querying in Athena: Iceberg tables are natively supported, enabling advanced features such as time travel and incremental queries.
This approach is ideal for large datasets, incremental updates, and scenarios where data reliability and long-term governance matter.
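The schema evolution point can be illustrated with plain DDL issued through spark.sql. This is a minimal sketch, assuming spark is a session already configured for Iceberg; the helper names, the table name glue_catalog.staging.budget, and the column names are hypothetical.

```python
def add_column(spark, table_name: str, column_name: str, column_type: str) -> str:
    """Add a column to an Iceberg table and return the statement that was run.

    Iceberg records this as a metadata change, so existing data files
    are not rewritten.
    """
    statement = f"ALTER TABLE {table_name} ADD COLUMN {column_name} {column_type}"
    spark.sql(statement)
    return statement


def rename_column(spark, table_name: str, old_name: str, new_name: str) -> str:
    """Rename a column in an Iceberg table and return the statement that was run."""
    statement = f"ALTER TABLE {table_name} RENAME COLUMN {old_name} TO {new_name}"
    spark.sql(statement)
    return statement
```

For example, add_column(spark, "glue_catalog.staging.budget", "cost_center", "string") adds a nullable column that reads as NULL for rows written before the change. The inputs are interpolated directly into SQL, so they should come from trusted job configuration rather than user input.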