Hiding Personal Information in AWS Glue with Spark
Protecting personal data before analytics consumption is a core requirement in modern data platforms. In AWS-based lake architectures, this is typically achieved through data de-identification during ingestion or transformation. This post outlines a practical and production-ready approach to hiding personal information using Spark jobs in AWS Glue.
What “Hide Personal Information” Means in Data Engineering
Hiding personal information usually refers to de-identification, an umbrella term that includes:
- Masking: replacing characters with placeholders.
- Redaction: removing sensitive substrings entirely.
- Hashing: irreversible transformation for joinability.
- Tokenization: reversible replacement using secure mappings.
In regulated environments (GDPR, HIPAA), this step must occur before data is exposed to analytics, BI tools, or data science workloads.
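To make the four techniques concrete, here is a minimal plain-Python sketch of each. The helper names and the in-memory token map are illustrative only; a real tokenization vault would be a secured, persistent store.

```python
import hashlib

def mask(email: str) -> str:
    """Masking: keep a hint of structure, replace the rest with placeholders."""
    local, _, domain = email.partition("@")
    return local[0] + "***@" + domain if domain else "***"

def redact(text: str, sensitive: str) -> str:
    """Redaction: remove the sensitive substring entirely."""
    return text.replace(sensitive, "[REDACTED]")

def hash_value(value: str, salt: str) -> str:
    """Hashing: irreversible, but deterministic, so the result stays joinable."""
    return hashlib.sha256(f"{value}:{salt}".encode("utf-8")).hexdigest()

token_vault: dict = {}  # illustrative; production would use a secured store

def tokenize(value: str) -> str:
    """Tokenization: reversible replacement via a secure mapping."""
    token = token_vault.get(value) or f"tok_{len(token_vault):06d}"
    token_vault[value] = token
    return token
```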
Why Do This in AWS Glue?
AWS Glue provides a managed Spark environment that is well suited for:
- Large-scale batch processing.
- Schema-aware transformations.
- Integration with AWS-native security services.
By de-identifying data inside Glue jobs, you ensure that only compliant datasets reach curated layers (silver/gold).
Reference Architecture
Flow:
- Raw data lands in Amazon S3 (raw/).
- A Glue Spark job reads the data.
- Personal data is detected and hidden.
- Clean data is written to clean/ or curated/.
Two common implementation patterns are described below.
Pattern 1: PII Redaction with Amazon Comprehend (Recommended)
Amazon Comprehend provides managed PII detection and redaction.
How it works
- Use Comprehend’s PII APIs to detect entities such as emails, phone numbers, or names.
- Apply masking or replacement.
- Persist the sanitized output.
When to use
- Text-heavy columns (comments, logs, descriptions).
- When you want managed accuracy without maintaining NLP models.
Official documentation:
https://docs.aws.amazon.com/comprehend/latest/dg/redact-api-pii.html
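A minimal sketch of this pattern: detect PII offsets with Comprehend's detect_pii_entities API, then mask the reported spans. The redact_spans helper is ours, and batching, error handling, and rate limiting are omitted for brevity; in a Glue job you would typically apply this per partition rather than per row.

```python
def redact_spans(text: str, entities: list, placeholder: str = "[PII]") -> str:
    """Replace each detected (BeginOffset, EndOffset) span with a placeholder."""
    # Work right-to-left so earlier offsets stay valid after each replacement.
    for e in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        text = text[: e["BeginOffset"]] + placeholder + text[e["EndOffset"] :]
    return text

def redact_pii(text: str, client=None) -> str:
    """Call Comprehend's managed PII detection, then mask the reported spans."""
    if client is None:
        import boto3  # deferred so the pure helper above has no AWS dependency
        client = boto3.client("comprehend")
    resp = client.detect_pii_entities(Text=text, LanguageCode="en")
    return redact_spans(text, resp["Entities"])
```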
Pattern 2: Column-Level Hashing in Spark
For structured datasets with known sensitive fields:
from pyspark.sql.functions import sha2, concat_ws, lit, col

# Load the salt from a secure store (e.g., AWS Secrets Manager); never hard-code it.
SALT = "<secure-salt>"

df_clean = (
    df.withColumn(
        "email_hash",
        # SHA-256 over "email:salt" yields a deterministic, one-way join key.
        sha2(concat_ws(":", col("email"), lit(SALT)), 256),
    )
    .drop("email")  # remove the raw PII column after hashing
)
When to use
- Deterministic joins are required.
- Re-identification is not allowed.
- The transformation must run entirely inside Spark, with no external service calls.
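The joinability property is easy to verify locally. Spark's sha2 over concat_ws(":", col("email"), lit(SALT)) is equivalent to SHA-256 of the string "value:salt", so two datasets hashed with the same salt produce matching keys (the values below are illustrative):

```python
import hashlib

def email_hash(email: str, salt: str) -> str:
    """Same digest that sha2(concat_ws(":", col("email"), lit(SALT)), 256) produces."""
    return hashlib.sha256(f"{email}:{salt}".encode("utf-8")).hexdigest()

# Two independently hashed datasets still join on the derived key...
orders_key = email_hash("alice@example.com", "salt-123")
profile_key = email_hash("alice@example.com", "salt-123")
assert orders_key == profile_key

# ...while a different salt yields an unrelated key.
assert email_hash("alice@example.com", "other-salt") != orders_key
```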
Operational Considerations
- Salts and keys must be stored in AWS Secrets Manager or encrypted with KMS.
- Avoid synchronous external API calls at high Spark parallelism unless rate-limited.
- Prefer pre-ingestion de-identification over post-analytics fixes.
- Log transformations for auditability, but never log raw PII.
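On the first point, a hedged sketch of fetching a salt from AWS Secrets Manager at job start. The secret name is illustrative, and the client is injectable so the helper can be exercised without AWS; fetch once on the driver so executors never call Secrets Manager themselves.

```python
import json

def load_salt(secret_id: str, client=None) -> str:
    """Fetch the hashing salt from AWS Secrets Manager once, at driver startup."""
    if client is None:
        import boto3  # deferred import keeps the helper testable offline
        client = boto3.client("secretsmanager")
    resp = client.get_secret_value(SecretId=secret_id)
    # Secrets are commonly stored as a JSON object; fall back to a plain string.
    try:
        return json.loads(resp["SecretString"])["salt"]
    except (ValueError, KeyError, TypeError):
        return resp["SecretString"]

# SALT = load_salt("prod/glue/pii-salt")  # secret name is illustrative
```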
Conclusion
Hiding personal information in AWS Glue is not a single feature, but a design choice embedded in your data pipeline. Combining Spark transformations with AWS-native services allows you to meet privacy requirements without sacrificing scalability or performance.
De-identification should be treated as infrastructure, not as an afterthought.