Content Moderation by Detecting and Redacting PII Using Amazon Comprehend

Olawale Adepoju

2023.01.25

The daily volume of third-party and user-generated content (UGC) is growing rapidly across industries. Social media, gaming, and other companies must protect their users while keeping operating costs down, and they need to rate content of many types and formats efficiently in order to comply with industry and consumer standards. Financial and healthcare services firms, meanwhile, find it difficult to secure personally identifiable information (PII) and protected health information (PHI) across internal and external environments and processes.

In this post, we look at how artificial intelligence (AI) and machine learning (ML) can be used to handle content moderation and compliance in order to safeguard online communities, their users, and organizations.

Introduction

Amazon Comprehend is a natural language processing (NLP) service that uses machine learning (ML) to discover insights and relationships in unstructured text, such as people, places, and topics. Amazon Comprehend can also detect and redact personally identifiable information (PII) in customer emails, support requests, product reviews, social media, and other sources. No prior machine learning knowledge is required. For example, before indexing documents in a search solution, you can scan support requests and knowledge articles for PII entities and redact them, so that the indexed documents are free of PII. Redacting PII entities helps you protect privacy and comply with local laws and regulations.

PII Redaction Batch Processing with Amazon Comprehend

Amazon Comprehend operations can be used to redact documents. With the Replace with character redaction mode, each character of a PII entity is replaced with a mask character of your choosing (!, #, $, %, &, *, or @).
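To make the effect of Replace with character concrete, here is a minimal local sketch in Python. It does not call Amazon Comprehend. It assumes you already have entity spans shaped like the service's detection output (BeginOffset and EndOffset), and the function name mask_pii, the sample text, and the offsets are made up for illustration. How the service itself treats whitespace or punctuation inside an entity is not modelled here.

```python
# Mask characters that the Replace with character mode accepts.
ALLOWED_MASK_CHARS = set("!#$%&*@")


def mask_pii(text, entities, mask_char="*"):
    """Replace every character inside each entity span with mask_char.

    `entities` is a list of dicts with BeginOffset/EndOffset keys
    (hypothetical sample data, shaped like a PII detection result).
    """
    if mask_char not in ALLOWED_MASK_CHARS:
        raise ValueError(f"unsupported mask character: {mask_char!r}")
    chars = list(text)
    for entity in entities:
        begin, end = entity["BeginOffset"], entity["EndOffset"]
        # Same-length replacement, so later offsets stay valid.
        chars[begin:end] = mask_char * (end - begin)
    return "".join(chars)


# Hypothetical sample input and hand-written entity offsets.
sample = "Contact Jane Doe at jane@example.com."
entities = [
    {"Type": "NAME", "BeginOffset": 8, "EndOffset": 16},
    {"Type": "EMAIL", "BeginOffset": 20, "EndOffset": 36},
]

print(mask_pii(sample, entities))
# prints: Contact ******** at ****************.
```

Because the replacement keeps the text the same length, the redacted output stays aligned with the original, which can help when comparing the two side by side.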

To analyze and redact large documents or collections of documents, place them in an Amazon Simple Storage Service (Amazon S3) bucket and start an asynchronous job to detect and redact PII in them. The results are written to an S3 bucket.

On the Amazon Comprehend console, choose Analysis jobs.
Choose Create job.

On the Create analysis job page, enter a Name.
For Analysis type, choose Personally identifiable information (PII).

In the PII detection settings section, for Output mode, select Redactions.
Expand PII entity types and select the entity types to redact.
For Redaction mode, select Replace with character.

In the Input data section, for Data source, select My documents.
For the S3 location, enter the S3 path for the input data.
In the Output data section, for the S3 location, enter the path to the output folder in Amazon S3.

Make sure you choose the correct input and output paths based on how your documents are organized.

In the Access permissions section, for the IAM role, select Create an IAM role.
For Permission to access, choose Input and Output S3 buckets.
For the Name suffix, enter a suffix for your role.
Select Create job.
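
The console steps above roughly correspond to one call to the StartPiiEntitiesDetectionJob API. The following is a sketch using boto3, not the exact request the console sends. The bucket paths, account ID, and role ARN are placeholders, the helper names are made up for this example, and the input format (one document per file) and the English language code are assumptions you should adjust to your data.

```python
def build_redaction_job_request(job_name, input_s3_uri, output_s3_uri,
                                role_arn, entity_types=("ALL",),
                                mask_char="*"):
    """Build keyword arguments for start_pii_entities_detection_job."""
    return {
        "JobName": job_name,
        "LanguageCode": "en",           # assumption: English documents
        "Mode": "ONLY_REDACTION",       # Output mode: Redactions
        "InputDataConfig": {
            "S3Uri": input_s3_uri,
            # Assumption: each S3 object is one document.
            "InputFormat": "ONE_DOC_PER_FILE",
        },
        "OutputDataConfig": {"S3Uri": output_s3_uri},
        "DataAccessRoleArn": role_arn,  # role that can read input, write output
        "RedactionConfig": {
            "PiiEntityTypes": list(entity_types),
            "MaskMode": "MASK",         # Redaction mode: Replace with character
            "MaskCharacter": mask_char,
        },
    }


def start_redaction_job(request):
    """Submit the job; needs boto3 and AWS credentials at call time."""
    import boto3  # imported lazily so building a request needs no AWS setup

    client = boto3.client("comprehend")
    response = client.start_pii_entities_detection_job(**request)
    return response["JobId"]


# Placeholder values for illustration only.
request = build_redaction_job_request(
    job_name="redact-content",
    input_s3_uri="s3://example-input-bucket/documents/",
    output_s3_uri="s3://example-output-bucket/redacted/",
    role_arn="arn:aws:iam::123456789012:role/example-comprehend-role",
)
print(request["RedactionConfig"])

# job_id = start_redaction_job(request)  # uncomment to submit the job
```

Keeping the request builder separate from the submit call lets you inspect or log the parameters before anything is sent to AWS.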

While the redact-content job is running, its status is shown as In progress. When the job status changes to Completed, you can examine the output file. It contains the same sample text as the input, with each PII entity replaced by the character *.
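
If you start the job through the API instead of watching the console, the status can be polled with DescribePiiEntitiesDetectionJob. This is a minimal sketch: the function name wait_for_job, the polling interval, and the retry limit are arbitrary choices for illustration, and the client is passed in so that any boto3 Comprehend client can be used.

```python
import time

# Job states after which the status no longer changes.
TERMINAL_STATES = {"COMPLETED", "FAILED", "STOPPED"}


def wait_for_job(client, job_id, poll_seconds=30, max_polls=120):
    """Poll a PII detection job until it reaches a terminal state.

    Returns the job properties dict from the last describe call.
    """
    for _ in range(max_polls):
        response = client.describe_pii_entities_detection_job(JobId=job_id)
        props = response["PiiEntitiesDetectionJobProperties"]
        if props["JobStatus"] in TERMINAL_STATES:
            return props
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")


# Usage (requires boto3 and AWS credentials; job_id comes from the start call):
#   import boto3
#   props = wait_for_job(boto3.client("comprehend"), job_id)
#   if props["JobStatus"] == "COMPLETED":
#       print("Output written under:", props["OutputDataConfig"]["S3Uri"])
```

A status of FAILED or STOPPED is returned rather than raised, so the caller can decide how to handle it.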