Geek Logbook

Tech sea log book

Hiding Personal Information in AWS Glue with Spark

Protecting personal data before analytics consumption is a core requirement in modern data platforms. In AWS-based lake architectures, this is typically achieved through data de-identification during ingestion or transformation. This post outlines a practical, production-ready approach to hiding personal information using Spark jobs in AWS Glue.

What “Hide Personal Information” Means in Data Engineering
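
As a rough illustration of what de-identification during transformation can look like, the sketch below pseudonymizes an identifier with a salted SHA-256 hash, masks the local part of an email address, and drops a name field. It is plain Python rather than a Glue job, and everything in it is an assumption made for the example: the field names (customer_id, full_name, email), the salt value, and the choice of hashing and masking as the techniques. None of it is taken from the post itself.

```python
import hashlib

# Hypothetical salt for illustration only; a real job would load it from a secret store.
SALT = "example-salt"


def pseudonymize(value, salt=SALT):
    """Return a salted SHA-256 hex digest of the value, or None if it is missing."""
    if value is None:
        return None
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def mask_email(email):
    """Replace the local part of an email with asterisks and keep the domain."""
    if email is None or "@" not in email:
        return None
    local, domain = email.rsplit("@", 1)
    return "*" * len(local) + "@" + domain


def deidentify(record):
    """Return a copy of the record with the hypothetical PII fields hidden."""
    cleaned = dict(record)
    cleaned["customer_id"] = pseudonymize(record.get("customer_id"))
    cleaned["email"] = mask_email(record.get("email"))
    cleaned.pop("full_name", None)  # drop a field that analytics does not need
    return cleaned


row = {
    "customer_id": "C-1001",
    "full_name": "Ada Example",
    "email": "ada@example.com",
    "country": "AR",
}
print(deidentify(row))
```

Because the same input and salt always produce the same digest, the hashed identifier can still be used as a join key. In a Spark job the same idea could be expressed with built-in column functions such as sha2 and concat from pyspark.sql.functions instead of a Python function.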

Debugging Spark DataFrame .show() Timeouts in PyCharm and VSCode

When working with PySpark, one of the first commands developers use to quickly inspect data is the DataFrame .show() call. However, in certain environments (especially when running inside PyCharm or VSCode with a debugger attached), you may encounter a timeout warning. At first glance, this message looks like an error in Spark itself, but in reality it is