Avoiding Duplicate File Copies Based on Content in Python
When working with large file systems, copying files can often lead to unintentional duplication, especially if files with the same content are repeatedly copied into different directories. While filenames can vary, the underlying content might remain the same, leading to redundant data and wasted storage space. In this post, we’ll explore how to avoid copying duplicate files based on their content, rather than relying on filenames.
The Problem: Identifying Duplicates by Content
Most file systems rely on filenames to detect duplicates, but filenames can be misleading. To accurately identify duplicate files, we need to check the actual content. A simple and efficient way to do this in Python is by leveraging hash functions like MD5 to create a unique “fingerprint” for each file.
The Solution: Using MD5 Hashing
MD5 is a widely used hash function that takes a file's content and produces a fixed-size hash value. This value can be used to compare files: if two files have the same MD5 hash, they are considered duplicates. (MD5 is no longer considered secure for cryptographic purposes, but for non-adversarial duplicate detection it is fast and practical; hashlib.sha256 is a drop-in alternative if you want stronger collision resistance.)
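As a quick illustration (a self-contained sketch; the file names and temporary directory are just for demonstration), two files with different names but byte-for-byte identical content produce the same MD5 fingerprint:

```python
import hashlib
import os
import tempfile

# Two files with different names but identical content.
content = b"the same bytes in both files"
tmp_dir = tempfile.mkdtemp()
path_a = os.path.join(tmp_dir, "report_final.txt")
path_b = os.path.join(tmp_dir, "copy_of_report.txt")
for path in (path_a, path_b):
    with open(path, "wb") as f:
        f.write(content)

def md5_of(path):
    # Hash the full file content (fine for small files; chunk for large ones).
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

print(md5_of(path_a) == md5_of(path_b))  # True: same content, same fingerprint
```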
Here’s how we can implement a solution in Python:
import os
import hashlib
import shutil


def calculate_md5(file_path, chunk_size=8192):
    """
    Calculates the MD5 hash of a file's content.

    :param file_path: Path to the file to be hashed.
    :param chunk_size: Size of the chunks to read from the file for hashing.
    :return: MD5 hash of the file's content.
    """
    md5 = hashlib.md5()
    with open(file_path, 'rb') as f:
        # Read in chunks so large files never have to fit in memory.
        while chunk := f.read(chunk_size):
            md5.update(chunk)
    return md5.hexdigest()


def avoid_duplicate_copies(src_directory, dest_directory):
    """
    Copies files from src_directory to dest_directory, avoiding duplicates by content.

    :param src_directory: The source directory containing files to copy.
    :param dest_directory: The destination directory.
    """
    seen_hashes = {}

    # Create destination directory if it doesn't exist
    os.makedirs(dest_directory, exist_ok=True)

    for root, dirs, files in os.walk(src_directory):
        for file in files:
            src_file_path = os.path.join(root, file)
            file_hash = calculate_md5(src_file_path)
            if file_hash not in seen_hashes:
                # First occurrence of this file's content, copy to destination
                dest_file_path = os.path.join(dest_directory, file)
                shutil.copy2(src_file_path, dest_file_path)
                seen_hashes[file_hash] = dest_file_path
                print(f"Copied: {src_file_path} to {dest_file_path}")
            else:
                print(f"Duplicate file {src_file_path} skipped (same as {seen_hashes[file_hash]})")


# Example usage
src_directory = '/path/to/source'
dest_directory = '/path/to/destination'
avoid_duplicate_copies(src_directory, dest_directory)
Explanation of the Code:
- calculate_md5(file_path): This function reads the file in chunks (to handle large files efficiently) and computes the MD5 hash of its contents.
- avoid_duplicate_copies(src_directory, dest_directory): This function walks through the source directory and computes the MD5 hash for each file. If the hash has not been seen before, it copies the file to the destination. Otherwise, the file is skipped as a duplicate.
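On Python 3.11 or later, the hand-rolled chunked-reading loop in calculate_md5 can be delegated to hashlib.file_digest, which streams the file internally. A functionally equivalent sketch:

```python
import hashlib

def calculate_md5_streaming(file_path):
    """Same result as calculate_md5, but lets hashlib do the chunked reads.

    hashlib.file_digest was added in Python 3.11; on older versions,
    keep the explicit read-in-chunks loop.
    """
    with open(file_path, "rb") as f:
        return hashlib.file_digest(f, "md5").hexdigest()
```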
Benefits of This Approach:
- Content-Based Duplicate Detection: The solution compares files based on their content, not filenames. This ensures that files with different names but the same content are recognized as duplicates.
- Efficiency: By reading files in chunks and calculating their hashes, we minimize memory usage and ensure that large files are handled efficiently.
- Flexibility: You can easily modify this script to handle different file formats, directories, or other specific needs.
Conclusion
Managing large file systems requires more than just checking filenames to avoid duplication. By using MD5 hashing, we can efficiently detect and avoid copying duplicate files based on their content. This simple Python solution can save time, space, and effort when dealing with large data transfers or backups.