How to use Amazon Comprehend operations using the AWS SDK for Python (Boto3)

2018.06.10

Note: This article was published more than a year ago. The information it contains may be out of date.

This post demonstrates sample code for the main Amazon Comprehend operations, as well as Topic Modeling, using the AWS SDK for Python (Boto3).

The main functions

  • DetectDominantLanguage
  • DetectEntities
  • DetectKeyPhrases
  • DetectSentiment

Each of these operations also has a batch counterpart (for example, BatchDetectSentiment) that processes up to 25 documents in a single request.
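
For example, here is a minimal sketch of calling BatchDetectSentiment; the texts are made up for illustration:

import boto3

comprehend = boto3.client('comprehend', region_name='us-west-2')

# Up to 25 documents per request
texts = [
    "I love this product.",
    "This was a terrible experience.",
]

response = comprehend.batch_detect_sentiment(TextList=texts, LanguageCode='en')

# Successful documents appear in ResultList, failed ones in ErrorList
for result in response["ResultList"]:
    print(result["Index"], result["Sentiment"])
for error in response["ErrorList"]:
    print(error["Index"], error["ErrorCode"])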

Functions used in Topic Modeling

  • StartTopicsDetectionJob
  • DescribeTopicsDetectionJob
  • ListTopicsDetectionJobs

Sample Code

Let's start by looking at the four main functions.

import boto3
import json

# Comprehend constant
REGION = 'us-west-2'


# Function for detecting the dominant language
def detect_dominant_language(text):
    comprehend = boto3.client('comprehend', region_name=REGION)
    response = comprehend.detect_dominant_language(Text=text)
    return response


# Function for detecting named entities
def detect_entities(text, language_code):
    comprehend = boto3.client('comprehend', region_name=REGION)
    response = comprehend.detect_entities(Text=text, LanguageCode=language_code)
    return response


# Function for detecting key phrases
def detect_key_phrases(text, language_code):
    comprehend = boto3.client('comprehend', region_name=REGION)
    response = comprehend.detect_key_phrases(Text=text, LanguageCode=language_code)
    return response


# Function for detecting sentiment
def detect_sentiment(text, language_code):
    comprehend = boto3.client('comprehend', region_name=REGION)
    response = comprehend.detect_sentiment(Text=text, LanguageCode=language_code)
    return response


def main():
    # text
    text = "Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to find insights and relationships in text."

    # language code
    language_code = 'en'

    # detecting the dominant language
    result = detect_dominant_language(text)
    print("Starting detecting the dominant language")
    print(json.dumps(result, sort_keys=True, indent=4))
    print("End of detecting the dominant language\n")

    # detecting named entities
    result = detect_entities(text, language_code)
    print("Starting detecting named entities")
    print(json.dumps(result, sort_keys=True, indent=4))
    print("End of detecting named entities\n")

    # detecting key phrases
    result = detect_key_phrases(text, language_code)
    print("Starting detecting key phrases")
    print(json.dumps(result, sort_keys=True, indent=4))
    print("End of detecting key phrases\n")

    # detecting sentiment
    result = detect_sentiment(text, language_code)
    print("Starting detecting sentiment")
    print(json.dumps(result, sort_keys=True, indent=4))
    print("End of detecting sentiment\n")


if __name__ == '__main__':
    main()

DetectDominantLanguage operation

It detects the dominant language of the document. Amazon Comprehend can detect 101 different languages.

Execution result

{
    "Languages": [
        {
            "LanguageCode": "en",
            "Score": 0.9940536618232727
        }
    ],
    "ResponseMetadata": {
        "HTTPHeaders": {
            "connection": "keep-alive",
            "content-length": "64",
            "content-type": "application/x-amz-json-1.1",
            "date": "Thu, 22 Mar 2018 04:15:20 GMT",
            "x-amzn-requestid": "a29fda00-2d87-11e8-ad56-************"
        },
        "HTTPStatusCode": 200,
        "RequestId": "a29fda00-2d87-11e8-ad56-************",
        "RetryAttempts": 0
    }
}
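
To pick out just the top language from this response, using the detect_dominant_language function from the sample above:

# The language with the highest confidence score
result = detect_dominant_language(text)
top = max(result["Languages"], key=lambda lang: lang["Score"])
print(top["LanguageCode"])  # => "en"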

DetectEntities operation

It detects entities, such as people or places, in the document.

Execution result

{
    "Entities": [
        {
            "BeginOffset": 0,
            "EndOffset": 6,
            "Score": 0.8670787215232849,
            "Text": "Amazon",
            "Type": "ORGANIZATION"
        },
        {
            "BeginOffset": 7,
            "EndOffset": 17,
            "Score": 1.0,
            "Text": "Comprehend",
            "Type": "COMMERCIAL_ITEM"
        }
    ],
    "ResponseMetadata": {
        "HTTPHeaders": {
            "connection": "keep-alive",
            "content-length": "201",
            "content-type": "application/x-amz-json-1.1",
            "date": "Thu, 22 Mar 2018 04:15:20 GMT",
            "x-amzn-requestid": "a2b84450-2d87-11e8-b3f9-************"
        },
        "HTTPStatusCode": 200,
        "RequestId": "a2b84450-2d87-11e8-b3f9-************",
        "RetryAttempts": 0
    }
}

DetectKeyPhrases operation

It detects key phrases in the document contents.

Execution result

{
    "KeyPhrases": [
        {
            "BeginOffset": 0,
            "EndOffset": 17,
            "Score": 0.9958747029304504,
            "Text": "Amazon Comprehend"
        },
        {
            "BeginOffset": 21,
            "EndOffset": 50,
            "Score": 0.9654422998428345,
            "Text": "a natural language processing"
        },
        {
            "BeginOffset": 52,
            "EndOffset": 55,
            "Score": 0.941932201385498,
            "Text": "NLP"
        },
        {
            "BeginOffset": 57,
            "EndOffset": 64,
            "Score": 0.9076098203659058,
            "Text": "service"
        },
        {
            "BeginOffset": 75,
            "EndOffset": 91,
            "Score": 0.872683584690094,
            "Text": "machine learning"
        },
        {
            "BeginOffset": 100,
            "EndOffset": 126,
            "Score": 0.9918361902236938,
            "Text": "insights and relationships"
        },
        {
            "BeginOffset": 130,
            "EndOffset": 134,
            "Score": 0.998969554901123,
            "Text": "text"
        }
    ],
    "ResponseMetadata": {
        "HTTPHeaders": {
            "connection": "keep-alive",
            "content-length": "615",
            "content-type": "application/x-amz-json-1.1",
            "date": "Thu, 22 Mar 2018 04:15:21 GMT",
            "x-amzn-requestid": "a2d409a7-2d87-11e8-a9a6-************"
        },
        "HTTPStatusCode": 200,
        "RequestId": "a2d409a7-2d87-11e8-a9a6-************",
        "RetryAttempts": 0
    }
}

DetectSentiment operation

It detects the sentiment of the document (POSITIVE, NEGATIVE, MIXED, or NEUTRAL).

Execution result

{
    "ResponseMetadata": {
        "HTTPHeaders": {
            "connection": "keep-alive",
            "content-length": "161",
            "content-type": "application/x-amz-json-1.1",
            "date": "Thu, 22 Mar 2018 04:15:21 GMT",
            "x-amzn-requestid": "a2ebb00b-2d87-11e8-9c58-************"
        },
        "HTTPStatusCode": 200,
        "RequestId": "a2ebb00b-2d87-11e8-9c58-************",
        "RetryAttempts": 0
    },
    "Sentiment": "NEUTRAL",
    "SentimentScore": {
        "Mixed": 0.003294283989816904,
        "Negative": 0.01219215989112854,
        "Neutral": 0.7587229609489441,
        "Positive": 0.2257905900478363
    }
}

Topic Modeling

Let's try running a topic detection job.

Sample Code

import boto3
import json
import time
# json_util serializes the datetime values in the response (provided by the pymongo package)
from bson import json_util

# Comprehend constant
REGION = 'us-west-2'

# A low-level client representing Amazon Comprehend
comprehend = boto3.client('comprehend', region_name=REGION)

# Start topics detection job setting
input_s3_url = "s3://your_input"
input_doc_format = "ONE_DOC_PER_FILE"  # or "ONE_DOC_PER_LINE"
output_s3_url = "s3://your_output"
data_access_role_arn = "arn:aws:iam::aws_account_id:role/role_name"
number_of_topics = 10
job_name = "Job_name"

input_data_config = {"S3Uri": input_s3_url, "InputFormat": input_doc_format}
output_data_config = {"S3Uri": output_s3_url}

# Starts an asynchronous topic detection job.
response = comprehend.start_topics_detection_job(NumberOfTopics=number_of_topics,
                                                 InputDataConfig=input_data_config,
                                                 OutputDataConfig=output_data_config,
                                                 DataAccessRoleArn=data_access_role_arn,
                                                 JobName=job_name)

# Gets job_id
job_id = response["JobId"]
print('job_id: ' + job_id)

# It loops until JobStatus becomes 'COMPLETED' or 'FAILED'.
while True:
    result = comprehend.describe_topics_detection_job(JobId=job_id)
    job_status = result["TopicsDetectionJobProperties"]["JobStatus"]

    if job_status in ['COMPLETED', 'FAILED']:
        print("job_status: " + job_status)
        break
    else:
        print("job_status: " + job_status)
        time.sleep(60)

# You can get a list of the topic detection jobs that you have submitted.
filter_job_name = {"JobName": job_name}

topics_detection_job_list = comprehend.list_topics_detection_jobs(Filter=filter_job_name)
print('topics_detection_job_list: ' + json.dumps(topics_detection_job_list,
                                                 sort_keys=True,
                                                 indent=4,
                                                 default=json_util.default))

StartTopicsDetectionJob

Starts a topic detection job as an asynchronous operation. Once you have the JobId, you can check the job status with DescribeTopicsDetectionJob.

There are two kinds of InputFormat:

  • ONE_DOC_PER_FILE - Each file is treated as a single document.
  • ONE_DOC_PER_LINE - Each line of a file is treated as a separate document.
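
For reference, the corresponding InputDataConfig for each format would look like this (the S3 URI is a placeholder):

# One document per file
input_data_config = {"S3Uri": "s3://your_input", "InputFormat": "ONE_DOC_PER_FILE"}

# One document per line
input_data_config = {"S3Uri": "s3://your_input", "InputFormat": "ONE_DOC_PER_LINE"}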

DescribeTopicsDetectionJob

Gets the status of a topic detection job. JobStatus takes one of four values:

  • JobStatus
    • SUBMITTED
    • IN_PROGRESS
    • COMPLETED
    • FAILED

In this sample code, we exit the while loop when JobStatus is COMPLETED or FAILED.

ListTopicsDetectionJobs

Gets a list of the topic detection jobs that you have submitted.
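
As an aside, the filter can match on JobStatus as well as JobName; a small sketch, separate from the execution result below:

# List only the jobs that have completed successfully
completed = comprehend.list_topics_detection_jobs(Filter={"JobStatus": "COMPLETED"})
for job in completed["TopicsDetectionJobPropertiesList"]:
    print(job["JobId"], job["JobName"])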

Execution result

job_id: 2733262c2747153ab8cb0b01********
job_status: SUBMITTED
job_status: IN_PROGRESS
[...]
job_status: COMPLETED
topics_detection_job_list: {
    "ResponseMetadata": {
        "HTTPHeaders": {
            "connection": "keep-alive",
            "content-length": "415",
            "content-type": "application/x-amz-json-1.1",
            "date": "Thu, 22 Mar 2018 04:27:59 GMT",
            "x-amzn-requestid": "669ffb28-2d89-11e8-82a0-************"
        },
        "HTTPStatusCode": 200,
        "RequestId": "669ffb28-2d89-11e8-82a0-************",
        "RetryAttempts": 0
    },
    "TopicsDetectionJobPropertiesList": [
        {
            "EndTime": {
                "$date": 1521692818930
            },
            "InputDataConfig": {
                "InputFormat": "ONE_DOC_PER_FILE",
                "S3Uri": "s3://your_input"
            },
            "JobId": "2733262c2747153ab8cb0b01********",
            "JobName": "Job4",
            "JobStatus": "COMPLETED",
            "NumberOfTopics": 10,
            "OutputDataConfig": {
                "S3Uri": "s3://your_output/**********-2733262c2747153ab8cb0b01********-1521692274392/output/output.tar.gz"
            },
            "SubmitTime": {
                "$date": 1521692274392
            }
        }
    ]
}

Check the output file

Check that a file was created in the S3 bucket at the output destination. You can find the output location in the OutputDataConfig returned by ListTopicsDetectionJobs.

  • OutputDataConfig
"OutputDataConfig": {
    "S3Uri": "s3://your_output_bucket/************-700e040bd7ae56714b65f56049f574d1-1521592942171/output/output.tar.gz"
},
$ aws s3 cp s3://your_output_bucket/************-700e040bd7ae56714b65f56049f574d1-1521592942171/output/output.tar.gz .
$ tar -zxvf output.tar.gz
x topic-terms.csv
x doc-topics.csv
  • topic-terms.csv
    • Lists the topics in the collection. By default, each topic contains the top terms according to their weight in the topic.
  • doc-topics.csv
    • Lists the documents associated with each topic and the proportion of each document that is concerned with the topic.

Note: To get the best results, use at least 1,000 documents with each topic modeling job.
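
If you would rather inspect these files from Python than from the CLI, here is a minimal sketch; the bucket name and key are placeholders that you would take from the S3Uri in OutputDataConfig:

import csv
import tarfile

import boto3

s3 = boto3.client('s3', region_name='us-west-2')

# Placeholder bucket/key: taken from the S3Uri in OutputDataConfig
s3.download_file('your_output_bucket',
                 'path/to/output/output.tar.gz',
                 'output.tar.gz')

# The archive contains topic-terms.csv and doc-topics.csv
with tarfile.open('output.tar.gz', 'r:gz') as tar:
    tar.extractall()

# Each data row of topic-terms.csv is: topic, term, weight
with open('topic-terms.csv') as f:
    for row in csv.reader(f):
        print(row)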

Conclusion

Since the Amazon Comprehend API should prove useful, I have introduced sample code for each operation using the AWS SDK for Python (Boto3).
