Avoiding Duplicate File Copies Based on Content in Python
Introduction
When dealing with large datasets, it is common to encounter duplicate files, especially when copying files based on specific criteria. Simply comparing file names or paths isn’t sufficient to avoid duplicates because the same file might exist in different locations. In this blog post, we will explore how to use Python to avoid copying duplicate files by comparing their content using hash values.
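The whole approach rests on one property of hashing: the digest depends only on the bytes, not on the file name or location. Here is a minimal sketch of that idea (the byte strings below are just stand-ins for file contents):

import hashlib

# Two byte strings with identical content -- imagine them read from files in different folders.
content_a = b"the quick brown fox\n"
content_b = b"the quick brown fox\n"

# Identical bytes always yield the identical SHA-256 digest.
print(hashlib.sha256(content_a).hexdigest() == hashlib.sha256(content_b).hexdigest())  # True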
The Problem
Imagine you have a directory with numerous text files and you want to copy those containing a specific word to a new location. You want to ensure that duplicate files are not copied multiple times, regardless of their file names or locations. Manual comparison is impractical, especially for large datasets, so we need an automated solution.
The Initial Approach
In previous posts, we discussed functions to copy files containing a specific word and to preserve directory structures. Here’s a quick recap of the code for copying files while preserving the directory structure:
import shutil
import os

def searchWord(paths, word, destination_folder):
    res = []
    # The common prefix lets us rebuild each file's relative path under the destination.
    common_prefix = os.path.commonpath(paths)
    for path in paths:
        with open(path, "r") as f:
            content = f.read()
        if word in content:
            relative_path = os.path.relpath(path, start=common_prefix)
            destination_path = os.path.join(destination_folder, relative_path)
            os.makedirs(os.path.dirname(destination_path), exist_ok=True)
            shutil.copy(path, destination_path)
            res.append(path)
    return res

# Usage
source_paths = ["example/dir1/file1.txt", "example/dir2/file2.txt"]  # Add your source file paths here
searched_word = "your_search_word"
destination_folder = r"C:\destination_folder"

result = searchWord(source_paths, searched_word, destination_folder)
print("Files containing the word were copied:", result)
Limitation
While this approach preserves the directory structure, it does not detect duplicates: files with identical content that sit in different directories are still copied, once for each path.
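To see the gap concretely, here is a small hypothetical check (the two paths and their identical contents are assumptions for illustration; the standard-library filecmp module is used only to compare the bytes). The searchWord() function above treats the two paths as unrelated files even though their content is the same:

import filecmp

# Hypothetical example: assume both files exist and contain exactly the same text.
path_a = "example/dir1/report.txt"
path_b = "example/dir2/report.txt"

# shallow=False forces a byte-by-byte comparison instead of comparing os.stat() metadata.
if filecmp.cmp(path_a, path_b, shallow=False):
    print("Identical content under two paths -- searchWord() would copy both.")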
The Improved Approach
To avoid copying duplicate files, we can use hash values to compare the content of the files. Here’s the updated version of the script that calculates and compares hash values:
import shutil
import os
import hashlib

def calculate_file_hash(file_path):
    # Read the file in 64 KB chunks so large files don't need to fit in memory.
    hash_object = hashlib.sha256()
    with open(file_path, "rb") as f:
        while chunk := f.read(65536):
            hash_object.update(chunk)
    return hash_object.hexdigest()

def searchWord(paths, word, destination_folder):
    res = []
    copied_hashes = set()  # SHA-256 digests of files that have already been copied
    common_prefix = os.path.commonpath(paths)
    for path in paths:
        with open(path, "r") as f:
            content = f.read()
        if word in content:
            file_hash = calculate_file_hash(path)
            if file_hash not in copied_hashes:
                relative_path = os.path.relpath(path, start=common_prefix)
                destination_path = os.path.join(destination_folder, relative_path)
                os.makedirs(os.path.dirname(destination_path), exist_ok=True)
                shutil.copy(path, destination_path)
                copied_hashes.add(file_hash)
                res.append(path)
    return res

# Usage
source_paths = ["example/dir1/file1.txt", "example/dir2/file2.txt"]  # Add your source file paths here
searched_word = "your_search_word"
destination_folder = r"C:\destination_folder"

result = searchWord(source_paths, searched_word, destination_folder)
print("Files containing the word were copied:", result)
Explanation
- Calculating File Hash: The calculate_file_hash(file_path) function reads the file in binary chunks and computes its SHA-256 hash. For practical purposes, this digest is unique to the file's content, so identical files produce identical digests (demonstrated in the short check after this list).
- Tracking Copied Hashes: The copied_hashes set keeps track of the hash values of the files that have already been copied, ensuring that duplicates are not copied again.
- Checking for Duplicates: Before copying a file, the script checks whether its hash value is already in the copied_hashes set. If it is, the file is skipped; otherwise, it is copied and its hash is added to the set.
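As a quick sanity check of the explanation above (illustrative only, and assuming the calculate_file_hash() definition from the script is in scope), writing the same content to two separate files yields the same digest, so only the first of the pair would be copied:

import os
import tempfile

# Write identical content to two different files in a temporary directory.
tmp_dir = tempfile.mkdtemp()
path_a = os.path.join(tmp_dir, "a.txt")
path_b = os.path.join(tmp_dir, "b.txt")
for p in (path_a, path_b):
    with open(p, "w") as f:
        f.write("your_search_word appears here")

# Both files hash to the same value, so the second would be skipped by searchWord().
print(calculate_file_hash(path_a) == calculate_file_hash(path_b))  # True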
Conclusion
By using hash values to compare file content, we can effectively avoid copying duplicate files. This approach ensures that only unique files are copied, regardless of their file names or locations, making the process efficient and reliable.