Experimenting with Automatic PII Detection on the Hub using Presidio
At Hugging Face, we've noticed a concerning trend in machine learning (ML) datasets hosted on our Hub: undocumented private information about individuals. This poses some unique challenges for ML practitioners. In this blog post, we'll explore different types of datasets containing a type of private information known as Personally Identifiable Information (PII), the issues they present, and a new feature we're experimenting with on the Dataset Hub to help address these challenges.

Types of Datasets with PII
We noticed two types of datasets that contain PII:

Annotated PII datasets: Datasets like PII-Masking-300k by Ai4Privacy are specifically designed to train PII detection models, which detect and mask PII. For example, these models can help with online content moderation or with producing anonymized databases.
Pre-training datasets: These are large-scale datasets, often terabytes in size, that are typically obtained through web crawls. While these datasets are generally filtered to remove certain types of PII, small amounts of sensitive information can still slip through the cracks due to the sheer volume of data and the imperfections of PII detection models.
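To make the masking step concrete, here is a minimal sketch of what PII masking looks like in practice. This is not Presidio itself (Presidio uses NER models and a richer recognizer registry); it is a deliberately simplified regex-based example covering just two entity types, to illustrate the detect-and-replace pattern that such filters apply at scale:

```python
import re

# Simplified PII masking sketch (NOT Presidio): replace email addresses
# and US-style phone numbers with placeholder tags. Real PII detection
# models cover many more entity types and rely on NER, not just regexes,
# which is also why some PII slips through large web crawls.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def mask_pii(text: str) -> str:
    """Replace each detected PII span with a [LABEL] placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(mask_pii("Contact Jane at jane.doe@example.com or 555-123-4567."))
# Contact Jane at [EMAIL] or [PHONE].
```

Note that the bare name "Jane" is untouched: person names are exactly the kind of entity that needs a trained model rather than a pattern, which motivates using a dedicated tool like Presidio.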