Efficiently Listing and Filtering S3 Objects by Date
S3 buckets often hold a very large number of objects, and you frequently need only the ones from a particular date. This post shows how to list and filter S3 objects by date efficiently using Python and the boto3 library.
Why Filtering by Date is Important
When S3 buckets store logs, backups, or other timestamped data, retrieving only the files for a given year, month, and day saves both time and resources. Instead of listing every key and filtering the results on the client, we can embed the date in the key path and pass the matching prefix to S3, so only keys for that date come back.
S3 Folder Structure Example
Assuming the structure of your S3 paths contains a date as part of the key, an example structure could look like this:
bucket-name/activity/2024/08/28/filename1.parquet
bucket-name/client/2024/08/28/filename2.parquet
bucket-name/transactiondetails/2024/08/29/filename3.parquet
Here, the folder structure represents year, month, and day, allowing us to filter objects easily based on a given date.
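The date-based layout above lends itself to building prefixes from real date objects rather than hand-written strings. The following is a minimal sketch; the helper names build_date_prefix and prefixes_for_range are hypothetical and not part of boto3, and it assumes keys use zero-padded month and day values as in the examples above.

```python
from datetime import date, timedelta


def build_date_prefix(folder: str, day: date) -> str:
    """Return a key prefix such as 'activity/2024/08/28/' for one day."""
    # %m and %d are zero-padded, matching keys like .../2024/08/28/...
    return f"{folder}/{day:%Y/%m/%d}/"


def prefixes_for_range(folder: str, start: date, end: date) -> list[str]:
    """Return one prefix per day from start to end, inclusive."""
    days = (end - start).days
    # An end date earlier than the start date yields an empty list.
    return [build_date_prefix(folder, start + timedelta(days=i)) for i in range(days + 1)]


print(build_date_prefix("activity", date(2024, 8, 28)))
print(prefixes_for_range("client", date(2024, 8, 28), date(2024, 8, 29)))
```

Each prefix can then be listed separately, which keeps every request scoped to a single day.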
Efficient S3 Object Listing: Complete Code
Below is the complete code snippet for listing and filtering S3 objects by date. This code will retrieve objects from the S3 bucket based on a folder structure that includes year, month, and day, while handling pagination for large result sets.
import boto3

# Create a session using Boto3 (make sure you have your AWS credentials configured)
session = boto3.Session()
s3_client = session.client('s3')


def list_s3_objects_by_date(bucket_name, folder, year, month, day):
    """
    Lists objects in an S3 bucket filtered by year, month, and day.

    :param bucket_name: Name of the S3 bucket.
    :param folder: Folder prefix (e.g., 'activity', 'client').
    :param year: The year to filter by (e.g., '2024').
    :param month: The month to filter by (e.g., '08').
    :param day: The day to filter by (e.g., '28').
    :return: A list of objects with their names and last modified dates.
    """
    # Prefix representing the directory structure we want to filter by
    prefix = f"{folder}/{year}/{month}/{day}/"
    continuation_token = None
    all_objects = []

    while True:
        if continuation_token:
            response = s3_client.list_objects_v2(
                Bucket=bucket_name, Prefix=prefix, ContinuationToken=continuation_token
            )
        else:
            response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix=prefix)

        if 'Contents' in response:
            objects = response['Contents']
            # Extract name and last modified date for each object
            all_objects.extend([(obj['Key'], obj['LastModified']) for obj in objects])

        if response.get('IsTruncated'):  # More objects to fetch
            continuation_token = response['NextContinuationToken']
        else:
            break

    return all_objects


# Example usage
bucket_name = 'your-bucket-name'
folder = 'activity'
year = '2024'
month = '08'
day = '28'

objects = list_s3_objects_by_date(bucket_name, folder, year, month, day)
for obj in objects:
    print(f"Object: {obj[0]}, Last Modified: {obj[1]}")
Step-by-Step Explanation
- Session Setup: boto3.Session() creates a session with the AWS credentials that are already configured in your environment.
- list_s3_objects_by_date Function: This function filters the objects in the S3 bucket by building a prefix from the folder, year, month, and day. It handles pagination with the ContinuationToken so that all matching objects are retrieved.
- Handling Pagination: Each list_objects_v2 call returns at most 1,000 objects. If more match the prefix, the function keeps requesting further batches until the response is no longer truncated.
- Object Details: For each object, the function retrieves the object’s key (name) and last modified date.
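The pagination step described above can also be delegated to boto3's built-in paginator for list_objects_v2, which tracks the continuation token internally. This is a sketch of that alternative, not the approach used in the code above; the function name list_keys_with_paginator is hypothetical, and the client is passed in as an argument so the function stays easy to test.

```python
def list_keys_with_paginator(s3_client, bucket_name, prefix):
    """List (key, last_modified) pairs under a prefix using a boto3 paginator."""
    paginator = s3_client.get_paginator("list_objects_v2")
    results = []
    # Each page is one list_objects_v2 response; the paginator passes the
    # continuation token between calls on our behalf.
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        # 'Contents' is absent when a page has no matching keys.
        for obj in page.get("Contents", []):
            results.append((obj["Key"], obj["LastModified"]))
    return results
```

Assuming credentials are configured, it could be called as list_keys_with_paginator(boto3.client('s3'), 'your-bucket-name', 'activity/2024/08/28/') after importing boto3.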
Example Output
If objects are found under the specified prefix, the example prints each object's key and last modified timestamp. boto3 returns LastModified as a timezone-aware datetime in UTC, so the printed value includes the offset:
Object: activity/2024/08/28/filename1.parquet, Last Modified: 2024-08-28 12:34:56+00:00
Object: activity/2024/08/28/filename2.parquet, Last Modified: 2024-08-28 13:22:30+00:00
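Because each result carries a last modified timestamp, the returned list can be narrowed further on the client when the prefix alone is not specific enough. The sketch below assumes the (key, last_modified) tuple shape returned by the function above; the helper name modified_since and the sample data are hypothetical. The cutoff must be timezone-aware, since Python raises a TypeError when comparing naive and aware datetimes.

```python
from datetime import datetime, timezone


def modified_since(objects, cutoff):
    """Keep (key, last_modified) tuples modified at or after cutoff."""
    return [(key, modified) for key, modified in objects if modified >= cutoff]


# Hypothetical sample data shaped like the function's return value.
sample = [
    ("activity/2024/08/28/filename1.parquet",
     datetime(2024, 8, 28, 12, 34, 56, tzinfo=timezone.utc)),
    ("activity/2024/08/28/filename2.parquet",
     datetime(2024, 8, 28, 13, 22, 30, tzinfo=timezone.utc)),
]
cutoff = datetime(2024, 8, 28, 13, 0, tzinfo=timezone.utc)

for key, modified in modified_since(sample, cutoff):
    print(f"Object: {key}, Last Modified: {modified.isoformat()}")
```

This filtering happens after listing, so it does not reduce the number of keys S3 returns; the prefix remains the main way to limit the request.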
Conclusion
Filtering S3 objects by date is efficient when your key paths include the date (year/month/day), because S3 can then return only the keys under a matching prefix. With the boto3 library in Python, listing objects by this structure takes only a few lines. For large result sets, handling pagination ensures that you retrieve every matching object.