Geek Logbook

Tech sea log book

Efficiently Listing and Filtering S3 Objects by Date

When an AWS S3 bucket holds a large number of objects, you often need only the ones that belong to a particular date. This blog post shows how to list and filter S3 objects by date efficiently using Python and the boto3 library.

Why Filtering by Date is Important

When S3 buckets store logs, backups, or other timestamped data, retrieving only the files for a given year, month, and day saves both time and resources. Instead of listing every object and filtering afterwards, we can ask S3 for only the keys under a date-based prefix, provided the object keys contain the date.

S3 Folder Structure Example

Assume your object keys contain the date as part of the path, for example:

bucket-name/activity/2024/08/28/filename1.parquet
bucket-name/client/2024/08/28/filename2.parquet
bucket-name/transactiondetails/2024/08/29/filename3.parquet

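Here, the three path segments after the top-level folder represent year, month, and day. S3 has no real folders; these are simply key prefixes, which is what lets us select the objects for a given date by prefix. Note that the month and day are zero-padded (08, not 8), so the prefix must be padded the same way.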
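The prefix for a given day can be derived from a date object instead of hand-typed strings, which keeps the zero-padding consistent. The following is a minimal sketch, assuming the key layout shown above; the helper names build_date_prefix and prefixes_for_range are hypothetical and not part of boto3:

```python
from datetime import date, timedelta


def build_date_prefix(folder, day):
    """Return the key prefix for one day, e.g. 'activity/2024/08/28/'."""
    # strftime zero-pads month and day, matching keys such as 2024/08/28
    return f"{folder}/{day.strftime('%Y/%m/%d')}/"


def prefixes_for_range(folder, start, end):
    """Return one prefix per day from start to end, both inclusive."""
    days = (end - start).days
    return [build_date_prefix(folder, start + timedelta(days=i)) for i in range(days + 1)]


print(build_date_prefix("activity", date(2024, 8, 28)))
for prefix in prefixes_for_range("activity", date(2024, 8, 28), date(2024, 8, 30)):
    print(prefix)
```

Each prefix produced this way can be listed separately, which is one way to cover a range of days with a year/month/day layout.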

Efficient S3 Object Listing: Complete Code

Below is the complete code snippet for listing and filtering S3 objects by date. This code will retrieve objects from the S3 bucket based on a folder structure that includes year, month, and day, while handling pagination for large result sets.

import boto3

# Create a session using Boto3 (make sure you have your AWS credentials configured)
session = boto3.Session()
s3_client = session.client('s3')

def list_s3_objects_by_date(bucket_name, folder, year, month, day):
    """
    Lists objects in an S3 bucket filtered by year, month, and day.
    
    :param bucket_name: Name of the S3 bucket.
    :param folder: Folder prefix (e.g., 'activity', 'client').
    :param year: The year to filter by (e.g., '2024').
    :param month: The month to filter by (e.g., '08').
    :param day: The day to filter by (e.g., '28').
    :return: A list of (key, last_modified) tuples, one per object.
    """
    # Prefix representing the directory structure we want to filter by
    prefix = f"{folder}/{year}/{month}/{day}/"
    all_objects = []
    # Arguments shared by every request; the continuation token is added for follow-up pages
    kwargs = {'Bucket': bucket_name, 'Prefix': prefix}

    while True:
        response = s3_client.list_objects_v2(**kwargs)

        # 'Contents' is absent when no keys match, so default to an empty list
        # Extract name and last modified date for each object
        all_objects.extend(
            (obj['Key'], obj['LastModified']) for obj in response.get('Contents', [])
        )

        if not response.get('IsTruncated'):  # No more objects to fetch
            break
        kwargs['ContinuationToken'] = response['NextContinuationToken']

    return all_objects

# Example usage
bucket_name = 'your-bucket-name'
folder = 'activity'
year = '2024'
month = '08'
day = '28'

objects = list_s3_objects_by_date(bucket_name, folder, year, month, day)

for obj in objects:
    print(f"Object: {obj[0]}, Last Modified: {obj[1]}")

Step-by-Step Explanation

  • Session Setup: boto3.Session() creates a session that picks up the AWS credentials already configured in your environment (for example, environment variables or the shared credentials file), and session.client('s3') creates the S3 client from it.
  • list_s3_objects_by_date Function: This function builds a key prefix from the folder, year, month, and day and asks S3 to list only the keys that start with it, so the date filtering happens on the S3 side rather than in your code.
  • Handling Pagination: list_objects_v2 returns at most 1,000 keys per call. While the response is marked IsTruncated, the function passes NextContinuationToken back as ContinuationToken to fetch the next batch, until all objects are retrieved.
  • Object Details: For each object, the function keeps the object’s key (name) and last modified date.
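boto3 also provides paginators that manage the continuation token internally. The sketch below is an alternative to the manual loop, not the code used in this post; it takes the client as a parameter (an assumption made here so the function can be exercised without AWS access), and the function name list_keys_with_paginator is hypothetical:

```python
def list_keys_with_paginator(s3_client, bucket_name, prefix):
    """List (key, last_modified) tuples under a prefix using a boto3 paginator."""
    # get_paginator returns an object whose paginate() yields one response per page
    paginator = s3_client.get_paginator("list_objects_v2")
    results = []
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        # 'Contents' is missing from a page when no keys match
        for obj in page.get("Contents", []):
            results.append((obj["Key"], obj["LastModified"]))
    return results


# Usage (requires boto3 and configured AWS credentials):
#   import boto3
#   client = boto3.client("s3")
#   objects = list_keys_with_paginator(client, "your-bucket-name", "activity/2024/08/28/")
```

The result has the same shape as the list returned by list_s3_objects_by_date, so the two approaches are interchangeable for the caller.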

Example Output

If objects are found under the specified prefix, the example prints each object key and its last modified date. boto3 returns LastModified as a timezone-aware UTC datetime, so the printed value includes the +00:00 offset:

Object: activity/2024/08/28/filename1.parquet, Last Modified: 2024-08-28 12:34:56+00:00
Object: activity/2024/08/28/filename2.parquet, Last Modified: 2024-08-28 13:22:30+00:00
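A prefix only narrows results by what is encoded in the key. To narrow further by the LastModified timestamp, the returned tuples can be filtered on the client side. A minimal sketch, assuming the (key, last_modified) tuples returned by list_s3_objects_by_date; the helper name filter_modified_between and the sample data are illustrative only:

```python
from datetime import datetime, timezone


def filter_modified_between(objects, start, end):
    """Keep (key, last_modified) tuples with start <= last_modified < end."""
    return [(key, modified) for key, modified in objects if start <= modified < end]


# Illustrative data shaped like the function's return value
sample = [
    ("activity/2024/08/28/filename1.parquet", datetime(2024, 8, 28, 12, 34, 56, tzinfo=timezone.utc)),
    ("activity/2024/08/28/filename2.parquet", datetime(2024, 8, 28, 13, 22, 30, tzinfo=timezone.utc)),
]

# Bounds must be timezone-aware; comparing naive and aware datetimes raises TypeError
start = datetime(2024, 8, 28, 13, 0, tzinfo=timezone.utc)
end = datetime(2024, 8, 29, 0, 0, tzinfo=timezone.utc)

for key, modified in filter_modified_between(sample, start, end):
    print(f"Object: {key}, Last Modified: {modified}")
```

The half-open interval (start included, end excluded) avoids counting an object twice when consecutive time windows are processed.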

Conclusion

Filtering objects in S3 by date is efficient when the key paths include date information (year/month/day). With the boto3 library in Python, listing by that prefix returns only the objects for the requested date, and handling pagination ensures that large result sets are retrieved completely.