Avoiding Duplicate File Copies Based on Content in Python
Introduction
When dealing with large datasets, it is common to encounter duplicate files, especially when copying files based on specific criteria. Simply comparing file names or paths isn’t sufficient to avoid duplicates because the same file might exist in different locations. In this blog post, we will explore how to use Python to avoid copying duplicate files by comparing their content using hash values.
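The whole approach rests on one property of hashing: the digest depends only on the bytes, not on the file name or location. Here is a minimal sketch of that idea (the byte strings below are just stand-ins for file contents):

import hashlib

# Two byte strings with identical content -- imagine them read from files in different folders.
content_a = b"the quick brown fox\n"
content_b = b"the quick brown fox\n"

# Identical bytes always yield the identical SHA-256 digest.
print(hashlib.sha256(content_a).hexdigest() == hashlib.sha256(content_b).hexdigest())  # True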
The Problem
Imagine you have a directory with numerous text files and you want to copy those containing a specific word to a new location. You want to ensure that duplicate files are not copied multiple times, regardless of their file names or locations. Manual comparison is impractical, especially for large datasets, so we need an automated solution.
The Initial Approach
In previous posts, we discussed functions to copy files containing a specific word and to preserve directory structures. Here’s a quick recap of the code for copying files while preserving the directory structure:
import shutil
import os

def searchWord(paths, word, destination_folder):
    res = []
    # The common prefix lets us rebuild each file's relative path under the destination.
    common_prefix = os.path.commonpath(paths)
    for path in paths:
        with open(path, "r") as f:
            content = f.read()
        if word in content:
            relative_path = os.path.relpath(path, start=common_prefix)
            destination_path = os.path.join(destination_folder, relative_path)
            os.makedirs(os.path.dirname(destination_path), exist_ok=True)
            shutil.copy(path, destination_path)
            res.append(path)
    return res

# Usage
source_paths = ["example/dir1/file1.txt", "example/dir2/file2.txt"]  # Add your source file paths here
searched_word = "your_search_word"
destination_folder = r"C:\destination_folder"

result = searchWord(source_paths, searched_word, destination_folder)
print("Files containing the word were copied:", result)
Limitation
While this approach preserves the directory structure, it does not detect duplicates: files with identical content that sit in different directories are still copied, once for each path.
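To see the gap concretely, here is a small hypothetical check (the two paths and their identical contents are assumptions for illustration; the standard-library filecmp module is used only to compare the bytes). The searchWord() function above treats the two paths as unrelated files even though their content is the same:

import filecmp

# Hypothetical example: assume both files exist and contain exactly the same text.
path_a = "example/dir1/report.txt"
path_b = "example/dir2/report.txt"

# shallow=False forces a byte-by-byte comparison instead of comparing os.stat() metadata.
if filecmp.cmp(path_a, path_b, shallow=False):
    print("Identical content under two paths -- searchWord() would copy both.")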
The Improved Approach
To avoid copying duplicate files, we can use hash values to compare the content of the files. Here’s the updated version of the script that calculates and compares hash values:
import shutil
import os
import hashlib

def calculate_file_hash(file_path):
    # Read the file in 64 KB chunks so large files don't need to fit in memory.
    hash_object = hashlib.sha256()
    with open(file_path, "rb") as f:
        while chunk := f.read(65536):
            hash_object.update(chunk)
    return hash_object.hexdigest()

def searchWord(paths, word, destination_folder):
    res = []
    copied_hashes = set()  # SHA-256 digests of files that have already been copied
    common_prefix = os.path.commonpath(paths)
    for path in paths:
        with open(path, "r") as f:
            content = f.read()
        if word in content:
            file_hash = calculate_file_hash(path)
            if file_hash not in copied_hashes:
                relative_path = os.path.relpath(path, start=common_prefix)
                destination_path = os.path.join(destination_folder, relative_path)
                os.makedirs(os.path.dirname(destination_path), exist_ok=True)
                shutil.copy(path, destination_path)
                copied_hashes.add(file_hash)
                res.append(path)
    return res

# Usage
source_paths = ["example/dir1/file1.txt", "example/dir2/file2.txt"]  # Add your source file paths here
searched_word = "your_search_word"
destination_folder = r"C:\destination_folder"

result = searchWord(source_paths, searched_word, destination_folder)
print("Files containing the word were copied:", result)
Explanation
- Calculating File Hash: The calculate_file_hash(file_path) function reads the file in binary chunks and computes its SHA-256 hash. For practical purposes, this digest is unique to the file's content, so identical files produce identical digests (demonstrated in the short check after this list).
- Tracking Copied Hashes: The copied_hashes set keeps track of the hash values of the files that have already been copied, ensuring that duplicates are not copied again.
- Checking for Duplicates: Before copying a file, the script checks whether its hash value is already in the copied_hashes set. If it is, the file is skipped; otherwise, it is copied and its hash is added to the set.
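As a quick sanity check of the explanation above (illustrative only, and assuming the calculate_file_hash() definition from the script is in scope), writing the same content to two separate files yields the same digest, so only the first of the pair would be copied:

import os
import tempfile

# Write identical content to two different files in a temporary directory.
tmp_dir = tempfile.mkdtemp()
path_a = os.path.join(tmp_dir, "a.txt")
path_b = os.path.join(tmp_dir, "b.txt")
for p in (path_a, path_b):
    with open(p, "w") as f:
        f.write("your_search_word appears here")

# Both files hash to the same value, so the second would be skipped by searchWord().
print(calculate_file_hash(path_a) == calculate_file_hash(path_b))  # True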
Conclusion
By using hash values to compare file content, we can effectively avoid copying duplicate files. This approach ensures that only unique files are copied, regardless of their file names or locations, making the process efficient and reliable.