Optimizing Amazon Athena Queries with Partitions: A Practical Example
When working with Amazon Athena, one of the most effective strategies to improve query performance and reduce costs is partitioning your data. Partitions allow Athena to scan only the relevant portions of your dataset instead of reading everything, which is critical when you deal with large files stored in Amazon S3.
In this post, we will explore how to choose a good partitioning strategy using an anonymized dataset.
The Dataset
Suppose we are tracking user activity in an online learning platform. The dataset looks like this:
Level,Section,Lesson_URL,Completions,Period
Level 1,1.1 Section A,https://platform.org/course-1/lesson-1/,140,2025-01
Level 1,1.1 Section A,https://platform.org/course-1/lesson-2/,120,2025-01
Level 1,1.2 Section B,https://platform.org/course-2/lesson-1/,55,2025-01
Level 1,1.2 Section B,https://platform.org/course-2/lesson-2/,61,2025-01
Level 2,2.1 Section C,https://platform.org/course-3/lesson-1/,33,2025-01
Level 2,2.2 Section D,https://platform.org/course-4/lesson-1/,32,2025-01
...
Each row represents a lesson, the number of completions (Completions
), and the reporting Period
(month).
Why Partitioning Matters
If we query this dataset directly without partitioning, Athena will scan all records every time. That becomes inefficient when historical data grows over months or years.
The key is to select a column with low cardinality, frequent filtering, and natural grouping. In this case, the column Period
is an excellent candidate:
- Low cardinality: one value per month (
2025-01
,2025-02
, etc.). - Natural filtering: most queries will ask for specific months or time ranges.
- Efficient organization: new data can be written into monthly partitions.
Creating the Partitioned Table in Athena
CREATE EXTERNAL TABLE IF NOT EXISTS analytics.course_activity (
Level STRING,
Section STRING,
Lesson_URL STRING,
Completions INT
)
PARTITIONED BY (
Period STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
)
LOCATION 's3://your-bucket/activity-data/'
TBLPROPERTIES ('has_encrypted_data'='false');
Loading Partitions
If your S3 structure is organized by Period
:
s3://your-bucket/activity-data/Period=2025-01/
s3://your-bucket/activity-data/Period=2025-02/
You can register partitions automatically:
MSCK REPAIR TABLE analytics.course_activity;
Or add them manually:
ALTER TABLE analytics.course_activity
ADD PARTITION (Period='2025-01')
LOCATION 's3://your-bucket/activity-data/Period=2025-01/';
Querying the Data
SELECT Section, SUM(Completions) AS total_completions
FROM analytics.course_activity
WHERE Period = '2025-01'
GROUP BY Section;
This query will scan only the partition for January 2025, instead of the entire dataset.
Key Takeaways
- Partitioning by
Period
(monthly) balances efficiency and simplicity. - Avoid partitioning on high-cardinality fields like
Lesson_URL
. - Structure your S3 paths with partition keys (
Period=YYYY-MM
). - Always include partition filters in your queries to leverage partition pruning.
By applying these practices, you will reduce query costs and improve performance in Athena.
Reference:
Amazon Athena – Partitions