Geek Logbook

Tracking File Changes in S3 Using ETags

When working with AWS S3, tracking changes to files can be essential, especially when versioning is not enabled on the bucket. The ETag associated with each file in S3 can provide a simple way to detect changes. In this post, we’ll explore how to use ETags to monitor file modifications in an S3 bucket.

What is an ETag?

An ETag (entity tag) is an identifier that S3 assigns to each object and that reflects the object's content. It's commonly used to detect changes to objects. However, it's important to understand that ETags are generated differently depending on whether the object was uploaded in a single operation or in multiple parts (multipart upload), and on how the object is encrypted.

  • For single-part uploads, the ETag is typically the MD5 hash of the object content (objects encrypted with SSE-KMS or SSE-C are an exception).
  • For multipart uploads, the ETag is derived from the MD5 hashes of the individual parts, followed by a hyphen and the number of parts (e.g., 33a01f6c513ec334bbdfbc606ad2cbe1-1).
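
To make the two formats concrete, here is a minimal sketch of how both ETag styles can be reproduced locally. It assumes the object is not encrypted with SSE-KMS or SSE-C, and that the multipart ETag follows the widely observed scheme (MD5 of the concatenated binary part digests, then a hyphen and the part count), which is an observed behavior rather than a documented guarantee. The function names and the part_size parameter are illustrative and not part of Boto3.

```python
import hashlib


def single_part_etag(data: bytes) -> str:
    """ETag expected for a single-part upload: the hex MD5 of the content."""
    return hashlib.md5(data).hexdigest()


def multipart_etag(data: bytes, part_size: int) -> str:
    """ETag expected for a multipart upload, assuming fixed-size parts.

    Assumption: S3 hashes each part with MD5, concatenates the binary
    digests, hashes that again, and appends "-<number of parts>".
    """
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    # Split the content into parts; empty content still counts as one part.
    digests = [
        hashlib.md5(data[i:i + part_size]).digest()
        for i in range(0, max(len(data), 1), part_size)
    ]
    combined = hashlib.md5(b"".join(digests)).hexdigest()
    return f"{combined}-{len(digests)}"


print(single_part_etag(b"hello"))
print(multipart_etag(b"a" * 10, part_size=4))
```

The multipart result only matches the ETag in S3 if part_size equals the part size the uploader actually used, which is usually not recorded on the object.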

Why ETags Matter

While S3 doesn’t provide direct file version history without versioning enabled, you can still track changes by comparing the ETag values of files. If the ETag for a file changes, the file content has likely been modified.
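
The comparison itself can be sketched in a few lines. The example below assumes each snapshot is a plain dictionary mapping object keys to ETags; the function name diff_etags and the sample keys are hypothetical and only serve the illustration.

```python
def diff_etags(previous, current):
    """Compare two {object key: ETag} snapshots taken at different times."""
    added = sorted(k for k in current if k not in previous)
    removed = sorted(k for k in previous if k not in current)
    changed = sorted(
        k for k in current
        if k in previous and previous[k] != current[k]
    )
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical snapshots from two consecutive runs
previous = {"a.csv": "111", "b.csv": "222"}
current = {"a.csv": "111", "b.csv": "999", "c.csv": "333"}

result = diff_etags(previous, current)
print(result)
```

Keys reported under "changed" are candidates for modified content; keys under "added" and "removed" cover objects that appeared or disappeared between the two runs.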

How to Track File Changes with ETags in S3

In this section, we’ll walk through the code to extract the file name and ETag of objects in an S3 bucket, which can help you monitor changes over time.

Python Code to Retrieve ETag and Track Changes

Here are two Python functions using the Boto3 library. The first lists the objects under a prefix in an S3 bucket and retrieves their ETags; the second consolidates the object names and ETags into a CSV file for easy tracking.

import boto3
import csv

def list_s3_objects_with_etag(bucket_name, folder, year, month, day):
    # Create an S3 client using Boto3
    s3_client = boto3.client('s3')
    
    # Construct the prefix (folder path)
    prefix = f"{folder}/{year}/{month}/{day}/"
    
    # List objects under the prefix; a paginator is used because
    # list_objects_v2 returns at most 1,000 keys per call
    paginator = s3_client.get_paginator('list_objects_v2')
    
    # Extract object names and ETags from every page of results
    object_details = []
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        for obj in page.get('Contents', []):
            object_name = obj['Key']
            etag = obj['ETag'].strip('"')  # Remove surrounding quotes
            object_details.append({'Object Name': object_name, 'ETag': etag})
    
    return object_details

def create_consolidated_etag_csv(bucket_name, folders, year, month, day, output_file):
    all_object_details = []
    
    # Iterate over each folder to gather object details
    for folder in folders:
        object_details = list_s3_objects_with_etag(bucket_name, folder, year, month, day)
        if object_details:  # Check if the object details list is not empty
            all_object_details.extend(object_details)
    
    # Write the consolidated data into a CSV file
    with open(output_file, mode='w', newline='') as file:
        writer = csv.DictWriter(file, fieldnames=['Object Name', 'ETag'])
        writer.writeheader()
        writer.writerows(all_object_details)
    
    print(f"Consolidated ETag data saved to {output_file}")

# Example usage
bucket_name = "your-bucket-name"
folders = ["folder1", "folder2", "folder3"]
year = "2024"
month = "08"
day = "28"
output_file = "etag_report.csv"

create_consolidated_etag_csv(bucket_name, folders, year, month, day, output_file)

How This Works

  1. list_s3_objects_with_etag: This function fetches the objects under the specified folder and date prefix (folder/year/month/day/) within an S3 bucket. It returns a list of objects with their names and ETags.
  2. create_consolidated_etag_csv: This function iterates through a list of folders, consolidates all object details, and saves them to a CSV file.
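
A saved report is only useful for tracking if it can be read back on a later run. The sketch below assumes the CSV layout written above, with the columns 'Object Name' and 'ETag'; the helper name load_etag_report is hypothetical.

```python
import csv


def load_etag_report(path):
    """Read a report CSV back into a {object name: ETag} dictionary."""
    with open(path, newline='') as file:
        reader = csv.DictReader(file)
        return {row['Object Name']: row['ETag'] for row in reader}
```

Calling load_etag_report("etag_report.csv") after a run returns a dictionary that can be compared key by key with the report from a later run, provided each run writes to its own file (for example, a file name that includes the date).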

Benefits of Using ETags for Tracking

  • Simplicity: ETags offer a quick way to check for changes in file content.
  • No Versioning Required: Even if versioning is disabled on the bucket, ETags allow for basic change detection.

Limitations

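  • Multipart Uploads: For files uploaded in multiple parts, the ETag is derived from the MD5 hashes of the parts, not from the entire file. It therefore can't be compared directly with a locally computed MD5 of the file, and the same content uploaded with a different part size produces a different ETag, so an ETag change doesn't always mean the content changed.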
  • No Historical Change Count: ETags only reflect the current state of the file, so without additional tracking, you can’t determine how many times a file has changed.
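
The "additional tracking" mentioned in the last point could look like the following minimal sketch. It assumes you take a snapshot ({object key: ETag}) on every run and keep a history structure between runs; the names record_snapshot and change_count, and the in-memory dictionary, are illustrative choices rather than anything S3 or Boto3 provides.

```python
def record_snapshot(history, snapshot):
    """Append each object's ETag to its history when it differs from the last one seen."""
    for key, etag in snapshot.items():
        seen = history.setdefault(key, [])
        if not seen or seen[-1] != etag:
            seen.append(etag)
    return history


def change_count(history, key):
    """Number of observed ETag changes for a key (0 if unknown or never changed)."""
    return max(len(history.get(key, [])) - 1, 0)


# Hypothetical snapshots from three consecutive runs
history = {}
record_snapshot(history, {"a.csv": "111", "b.csv": "222"})
record_snapshot(history, {"a.csv": "111", "b.csv": "333"})
record_snapshot(history, {"a.csv": "444", "b.csv": "555"})

print(change_count(history, "a.csv"), change_count(history, "b.csv"))
```

This only counts changes observed between runs: if an object is modified several times between two snapshots, those modifications are seen as a single change. To survive restarts, the history would need to be persisted, for instance as a JSON file.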

Conclusion

Tracking file changes in S3 using ETags can be a lightweight solution when versioning is disabled. While it's not perfect for multipart uploads or tracking historical changes, it provides a simple method to check if a file has been modified. By generating a CSV report with object names and ETags on each run and comparing it with the previous one, you can monitor your S3 bucket for changes.