Content Moderation by Detecting and Redacting PII Using Amazon Comprehend



The daily volume of third-party and user-generated content (UGC) is growing rapidly across industries. Social media, gaming, and other companies must protect their users while keeping operating costs down, and they need to rate content of many types and formats efficiently to meet industry and consumer standards. Financial and healthcare services firms, meanwhile, find it difficult to secure personally identifiable information (PII) and protected health information (PHI) across internal and external environments and processes.

In this post, we will look at how artificial intelligence (AI) and machine learning (ML) may be used to handle content moderation and compliance in order to safeguard online communities, their users, and organizations.


Amazon Comprehend is a natural language processing (NLP) service that uses ML to discover insights and relationships in unstructured text, such as people, places, and topics. You can also use Amazon Comprehend to detect and redact personally identifiable information (PII) in customer emails, support requests, product reviews, social media, and other sources. No prior ML experience is required. For example, you can scan support requests and knowledge articles for PII entities and redact them before indexing the documents in a search solution, so that the search index contains no PII. Redacting PII entities helps you protect privacy and comply with local laws and regulations.
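For short texts, detection is also available as a real-time API call. The following is a minimal sketch, assuming the boto3 SDK, configured AWS credentials, and English text. The helper names summarize_pii and detect_pii and the 0.8 score threshold are hypothetical choices for illustration and are not part of the service.

```python
from typing import Any, Dict, List


def summarize_pii(
    text: str, entities: List[Dict[str, Any]], min_score: float = 0.8
) -> List[Dict[str, Any]]:
    """Turn DetectPiiEntities-style results into type/text/score records.

    Each entity is expected to carry Type, Score, BeginOffset, and EndOffset,
    the shape returned by the Comprehend DetectPiiEntities operation.
    """
    found = []
    for entity in entities:
        if entity["Score"] < min_score:
            continue  # skip low-confidence detections (threshold is an assumption)
        found.append(
            {
                "type": entity["Type"],
                "text": text[entity["BeginOffset"]:entity["EndOffset"]],
                "score": entity["Score"],
            }
        )
    return found


def detect_pii(text: str, language_code: str = "en") -> List[Dict[str, Any]]:
    """Call Amazon Comprehend; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so summarize_pii works without the SDK

    client = boto3.client("comprehend")
    response = client.detect_pii_entities(Text=text, LanguageCode=language_code)
    return summarize_pii(text, response["Entities"])


if __name__ == "__main__":
    # Hand-written sample in the response shape; no AWS call is made here.
    sample = "Contact Jane Doe at jane@example.com."
    sample_entities = [
        {"Type": "NAME", "Score": 0.99, "BeginOffset": 8, "EndOffset": 16},
        {"Type": "EMAIL", "Score": 0.99, "BeginOffset": 20, "EndOffset": 36},
    ]
    for item in summarize_pii(sample, sample_entities):
        print(item["type"], "->", item["text"])
```

Keeping the offset handling separate from the API call makes it possible to check the logic against recorded responses before pointing it at live data.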

PII redaction batch processing on the Amazon Comprehend console

Amazon Comprehend operations can be used to redact documents. With the Replace with character redaction mode, you mask PII entities by replacing each of their characters with a character of your choosing (!, #, $, %, &, *, or @).
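To make the Replace with character mode concrete, here is a local sketch of the same idea, assuming PII spans are already known as (begin, end) character offsets. The helper mask_spans is hypothetical and is not a Comprehend API; the service performs this step itself when it redacts a document.

```python
from typing import Iterable, Tuple

# Mask characters listed above as valid choices.
ALLOWED_MASK_CHARS = set("!#$%&*@")


def mask_spans(
    text: str, spans: Iterable[Tuple[int, int]], mask_char: str = "*"
) -> str:
    """Replace every character inside each (begin, end) span with mask_char."""
    if mask_char not in ALLOWED_MASK_CHARS:
        raise ValueError(f"mask_char must be one of {sorted(ALLOWED_MASK_CHARS)}")
    chars = list(text)
    for begin, end in spans:
        # Clamp to the text bounds so an oversized span cannot raise IndexError.
        for i in range(max(begin, 0), min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)


print(mask_spans("Call 555-0100 today", [(5, 13)], "#"))  # Call ######## today
```

Because each character is replaced one for one, the redacted text keeps its original length, so offsets into the document remain valid after masking.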
To analyze and redact large documents or collections of documents, place them in an Amazon Simple Storage Service (Amazon S3) bucket and start an asynchronous job to detect and redact PII in them. The results are written to an S3 location that you specify.
  1. On the Amazon Comprehend console, choose Analysis jobs.
  2. Choose Create job.
  3. On the Create analysis job page, enter a Name.
  4. For Analysis type, choose Personally identifiable information (PII).
  5. In the PII detection settings section, for Output mode, select Redactions.
  6. Expand PII entity types and select the entity types to redact.
  7. For Redaction mode, select Replace with character.
  8. In the Input data section, for Data source, select My documents.
  9. For S3 location, enter the S3 path for the input data.
  10. In the Output data section, for S3 location, enter the path to the output folder in Amazon S3.

Make sure you choose the correct input and output paths based on how you organize your documents.

  11. In the Access permissions section, for IAM role, select Create an IAM role.
  12. For Permissions to access, choose Input and Output S3 buckets.
  13. For Name suffix, enter a suffix for your role.
  14. Choose Create job.
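The console steps above roughly correspond to a single StartPiiEntitiesDetectionJob request. The sketch below builds that request with boto3, assuming English input, one document per file, and an IAM role that already exists (the console creates one for you, whereas the API expects a role ARN). The bucket paths, role ARN, and helper names are placeholders, not values from this walkthrough.

```python
from typing import Any, Dict, List

ALLOWED_MASK_CHARS = set("!#$%&*@")


def build_redaction_job_request(
    job_name: str,
    input_s3_uri: str,
    output_s3_uri: str,
    role_arn: str,
    entity_types: List[str],
    mask_char: str = "*",
) -> Dict[str, Any]:
    """Build keyword arguments for start_pii_entities_detection_job."""
    if mask_char not in ALLOWED_MASK_CHARS:
        raise ValueError("unsupported mask character")
    return {
        "JobName": job_name,
        "LanguageCode": "en",  # assumption: English documents
        "Mode": "ONLY_REDACTION",  # console: Output mode = Redactions
        "InputDataConfig": {
            "S3Uri": input_s3_uri,
            "InputFormat": "ONE_DOC_PER_FILE",  # assumption about input layout
        },
        "OutputDataConfig": {"S3Uri": output_s3_uri},
        "DataAccessRoleArn": role_arn,
        "RedactionConfig": {
            "PiiEntityTypes": entity_types,
            "MaskMode": "MASK",  # console: Replace with character
            "MaskCharacter": mask_char,
        },
    }


def start_redaction_job(request: Dict[str, Any]) -> str:
    """Submit the job; requires boto3 and AWS credentials. Returns the job ID."""
    import boto3  # imported lazily so the builder works without the SDK

    client = boto3.client("comprehend")
    response = client.start_pii_entities_detection_job(**request)
    return response["JobId"]


if __name__ == "__main__":
    # Placeholder values for illustration only; nothing is submitted here.
    request = build_redaction_job_request(
        job_name="redact-content",
        input_s3_uri="s3://example-bucket/input/",
        output_s3_uri="s3://example-bucket/output/",
        role_arn="arn:aws:iam::123456789012:role/example-comprehend-role",
        entity_types=["NAME", "EMAIL", "PHONE"],
    )
    print(request["RedactionConfig"])
```

Building the request as a plain dictionary first lets you review or log the exact settings before any job is submitted.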

The redact-content job initially shows a status of In progress. When the status changes to Completed, you can examine the output file in the output location. The output file contains the same sample content as the input, with each character of the detected PII replaced by the mask character *.
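Waiting for the job to finish can also be scripted. The sketch below polls DescribePiiEntitiesDetectionJob until the job reaches a terminal status. It assumes a boto3 Comprehend client is passed in; wait_for_job, the 30-second interval, and the poll limit are hypothetical choices for illustration.

```python
import time
from typing import Any, Callable, Dict

# Statuses after which the job will not change again.
TERMINAL_STATES = {"COMPLETED", "FAILED", "STOPPED"}


def wait_for_job(
    client: Any,
    job_id: str,
    poll_seconds: float = 30.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll the job until it reaches a terminal status and return its properties."""
    for _ in range(max_polls):
        response = client.describe_pii_entities_detection_job(JobId=job_id)
        props = response["PiiEntitiesDetectionJobProperties"]
        if props["JobStatus"] in TERMINAL_STATES:
            return props
        sleep(poll_seconds)  # still SUBMITTED or IN_PROGRESS
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")


# Example usage (requires boto3 and AWS credentials; job ID is a placeholder):
#   import boto3
#   props = wait_for_job(boto3.client("comprehend"), "your-job-id")
#   print(props["JobStatus"], props["OutputDataConfig"]["S3Uri"])
```

Once the status is COMPLETED, the returned properties include the S3 location of the redacted output.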