How to use Amazon Comprehend operations with the AWS SDK for Python (Boto3)

2018.06.09 18:58

Note: This article was published more than a year ago. The information it contains may be out of date.

We will demonstrate sample code for the main Amazon Comprehend operations and for Topic Modeling, using the AWS SDK for Python (Boto3).

The main functions

  • DetectDominantLanguage
  • DetectEntities
  • DetectKeyPhrases
  • DetectSentiment

Each of these operations also has a batch variant (BatchDetectDominantLanguage, BatchDetectEntities, BatchDetectKeyPhrases, BatchDetectSentiment) that processes up to 25 documents per request.
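
As an illustration, here is a minimal sketch of BatchDetectSentiment (the example texts are made up):

import boto3

comprehend = boto3.client('comprehend', region_name='us-west-2')

# A batch call processes up to 25 documents in one request
texts = [
    "I love this service.",
    "The setup was confusing and slow.",
]

response = comprehend.batch_detect_sentiment(TextList=texts, LanguageCode='en')

# ResultList holds one entry per successfully processed document;
# ErrorList holds per-document failures. Both reference documents by Index.
for item in response["ResultList"]:
    print(texts[item["Index"]], "->", item["Sentiment"])
for error in response["ErrorList"]:
    print("document", error["Index"], "failed:", error["ErrorMessage"])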

Functions used in Topic Modeling

  • StartTopicsDetectionJob
  • DescribeTopicsDetectionJob
  • ListTopicsDetectionJobs

Sample Code

Let's start by looking at the four main functions.

import boto3
import json

# Comprehend constant
REGION = 'us-west-2'


# Function for detecting the dominant language
def detect_dominant_language(text):
    comprehend = boto3.client('comprehend', region_name=REGION)
    response = comprehend.detect_dominant_language(Text=text)
    return response


# Function for detecting named entities
def detect_entities(text, language_code):
    comprehend = boto3.client('comprehend', region_name=REGION)
    response = comprehend.detect_entities(Text=text, LanguageCode=language_code)
    return response


# Function for detecting key phrases
def detect_key_phrases(text, language_code):
    comprehend = boto3.client('comprehend', region_name=REGION)
    response = comprehend.detect_key_phrases(Text=text, LanguageCode=language_code)
    return response


# Function for detecting sentiment
def detect_sentiment(text, language_code):
    comprehend = boto3.client('comprehend', region_name=REGION)
    response = comprehend.detect_sentiment(Text=text, LanguageCode=language_code)
    return response


def main():
    # text
    text = "Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to find insights and relationships in text."

    # language code
    language_code = 'en'

    # detecting the dominant language
    result = detect_dominant_language(text)
    print("Starting detecting the dominant language")
    print(json.dumps(result, sort_keys=True, indent=4))
    print("End of detecting the dominant language\n")

    # detecting named entities
    result = detect_entities(text, language_code)
    print("Starting detecting named entities")
    print(json.dumps(result, sort_keys=True, indent=4))
    print("End of detecting named entities\n")

    # detecting key phrases
    result = detect_key_phrases(text, language_code)
    print("Starting detecting key phrases")
    print(json.dumps(result, sort_keys=True, indent=4))
    print("End of detecting key phrases\n")

    # detecting sentiment
    result = detect_sentiment(text, language_code)
    print("Starting detecting sentiment")
    print(json.dumps(result, sort_keys=True, indent=4))
    print("End of detecting sentiment\n")


if __name__ == '__main__':
    main()

DetectDominantLanguage operation

It detects the dominant language of a document. Amazon Comprehend can detect 101 different languages.

Execution result

{
    "Languages": [
        {
            "LanguageCode": "en",
            "Score": 0.9940536618232727
        }
    ],
    "ResponseMetadata": {
        "HTTPHeaders": {
            "connection": "keep-alive",
            "content-length": "64",
            "content-type": "application/x-amz-json-1.1",
            "date": "Thu, 22 Mar 2018 04:15:20 GMT",
            "x-amzn-requestid": "a29fda00-2d87-11e8-ad56-************"
        },
        "HTTPStatusCode": 200,
        "RequestId": "a29fda00-2d87-11e8-ad56-************",
        "RetryAttempts": 0
    }
}
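
If you only need the single most likely language, you can reduce the response to the highest-scoring entry. A small sketch, reusing the result returned by detect_dominant_language above:

# Pick the entry with the highest confidence score
top = max(result["Languages"], key=lambda lang: lang["Score"])
print(top["LanguageCode"], top["Score"])  # -> en 0.994...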

DetectEntities operation

It detects entities, such as persons or places, in the document.

Execution result

{
    "Entities": [
        {
            "BeginOffset": 0,
            "EndOffset": 6,
            "Score": 0.8670787215232849,
            "Text": "Amazon",
            "Type": "ORGANIZATION"
        },
        {
            "BeginOffset": 7,
            "EndOffset": 17,
            "Score": 1.0,
            "Text": "Comprehend",
            "Type": "COMMERCIAL_ITEM"
        }
    ],
    "ResponseMetadata": {
        "HTTPHeaders": {
            "connection": "keep-alive",
            "content-length": "201",
            "content-type": "application/x-amz-json-1.1",
            "date": "Thu, 22 Mar 2018 04:15:20 GMT",
            "x-amzn-requestid": "a2b84450-2d87-11e8-b3f9-************"
        },
        "HTTPStatusCode": 200,
        "RequestId": "a2b84450-2d87-11e8-b3f9-************",
        "RetryAttempts": 0
    }
}
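
To work with the detected entities rather than the raw JSON, you can iterate over the Entities list. A small sketch, reusing the result returned by detect_entities above:

# Print each entity with its type, character offsets, and confidence
for entity in result["Entities"]:
    print("{Type}: '{Text}' at {BeginOffset}-{EndOffset} (score {Score:.2f})".format(**entity))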

DetectKeyPhrases operation

It detects key phrases in the document.

Execution result

{
    "KeyPhrases": [
        {
            "BeginOffset": 0,
            "EndOffset": 17,
            "Score": 0.9958747029304504,
            "Text": "Amazon Comprehend"
        },
        {
            "BeginOffset": 21,
            "EndOffset": 50,
            "Score": 0.9654422998428345,
            "Text": "a natural language processing"
        },
        {
            "BeginOffset": 52,
            "EndOffset": 55,
            "Score": 0.941932201385498,
            "Text": "NLP"
        },
        {
            "BeginOffset": 57,
            "EndOffset": 64,
            "Score": 0.9076098203659058,
            "Text": "service"
        },
        {
            "BeginOffset": 75,
            "EndOffset": 91,
            "Score": 0.872683584690094,
            "Text": "machine learning"
        },
        {
            "BeginOffset": 100,
            "EndOffset": 126,
            "Score": 0.9918361902236938,
            "Text": "insights and relationships"
        },
        {
            "BeginOffset": 130,
            "EndOffset": 134,
            "Score": 0.998969554901123,
            "Text": "text"
        }
    ],
    "ResponseMetadata": {
        "HTTPHeaders": {
            "connection": "keep-alive",
            "content-length": "615",
            "content-type": "application/x-amz-json-1.1",
            "date": "Thu, 22 Mar 2018 04:15:21 GMT",
            "x-amzn-requestid": "a2d409a7-2d87-11e8-a9a6-************"
        },
        "HTTPStatusCode": 200,
        "RequestId": "a2d409a7-2d87-11e8-a9a6-************",
        "RetryAttempts": 0
    }
}
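
Often only the phrase strings matter. A small sketch, reusing the result returned by detect_key_phrases above:

# Collect just the phrase text, ignoring offsets and scores
phrases = [phrase["Text"] for phrase in result["KeyPhrases"]]
print(phrases)  # ['Amazon Comprehend', 'a natural language processing', ...]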

DetectSentiment operation

It detects the overall sentiment of the document (POSITIVE, NEGATIVE, MIXED, or NEUTRAL).

Execution result

{
    "ResponseMetadata": {
        "HTTPHeaders": {
            "connection": "keep-alive",
            "content-length": "161",
            "content-type": "application/x-amz-json-1.1",
            "date": "Thu, 22 Mar 2018 04:15:21 GMT",
            "x-amzn-requestid": "a2ebb00b-2d87-11e8-9c58-************"
        },
        "HTTPStatusCode": 200,
        "RequestId": "a2ebb00b-2d87-11e8-9c58-************",
        "RetryAttempts": 0
    },
    "Sentiment": "NEUTRAL",
    "SentimentScore": {
        "Mixed": 0.003294283989816904,
        "Negative": 0.01219215989112854,
        "Neutral": 0.7587229609489441,
        "Positive": 0.2257905900478363
    }
}
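
The overall label and its score can be read out directly. A small sketch, reusing the result returned by detect_sentiment above:

# The winning label, plus its score from the SentimentScore map
sentiment = result["Sentiment"]  # e.g. 'NEUTRAL'
print(sentiment, result["SentimentScore"][sentiment.capitalize()])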

Topic Modeling

Let's try executing a topic detection job.

Sample Code

import boto3
import json
import time
from bson import json_util  # serializes datetime values in the response (requires pymongo)

# Comprehend constant
REGION = 'us-west-2'

# A low-level client representing Amazon Comprehend
comprehend = boto3.client('comprehend', region_name=REGION)

# Start topics detection job setting
input_s3_url = "s3://your_input"
input_doc_format = "ONE_DOC_PER_FILE"  # or "ONE_DOC_PER_LINE"
output_s3_url = "s3://your_output"
data_access_role_arn = "arn:aws:iam::aws_account_id:role/role_name"
number_of_topics = 10
job_name = "Job_name"

input_data_config = {"S3Uri": input_s3_url, "InputFormat": input_doc_format}
output_data_config = {"S3Uri": output_s3_url}

# Starts an asynchronous topic detection job.
response = comprehend.start_topics_detection_job(NumberOfTopics=number_of_topics,
                                                 InputDataConfig=input_data_config,
                                                 OutputDataConfig=output_data_config,
                                                 DataAccessRoleArn=data_access_role_arn,
                                                 JobName=job_name)

# Gets job_id
job_id = response["JobId"]
print('job_id: ' + job_id)

# Loop until JobStatus becomes 'COMPLETED' or 'FAILED'
while True:
    result = comprehend.describe_topics_detection_job(JobId=job_id)
    job_status = result["TopicsDetectionJobProperties"]["JobStatus"]
    print("job_status: " + job_status)

    if job_status in ['COMPLETED', 'FAILED']:
        break
    time.sleep(60)

# You can get a list of the topic detection jobs that you have submitted.
filter_job_name = {"JobName": job_name}

topics_detection_job_list = comprehend.list_topics_detection_jobs(Filter=filter_job_name)
print('topics_detection_job_list: ' + json.dumps(topics_detection_job_list,
                                                 sort_keys=True,
                                                 indent=4,
                                                 default=json_util.default))

StartTopicsDetectionJob

Starts a topic detection job as an asynchronous operation. After you get the JobId, you can check the job status using DescribeTopicsDetectionJob.

There are two kinds of InputFormat:

  • ONE_DOC_PER_FILE - Each file contains one document.
  • ONE_DOC_PER_LINE - Each line of a file is treated as a separate document.
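
For example, to use ONE_DOC_PER_LINE you would upload a single file in which each line is one document. A sketch of preparing such an input file (the bucket and key are placeholders):

import boto3

s3 = boto3.client('s3', region_name='us-west-2')

documents = [
    "First document, all on one line.",
    "Second document, on the next line.",
]

# ONE_DOC_PER_LINE: each line of the uploaded file is one document
s3.put_object(Bucket='your_input_bucket',
              Key='input/documents.txt',
              Body="\n".join(documents).encode('utf-8'))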

DescribeTopicsDetectionJob

Gets the status of a topic detection job. JobStatus takes one of the following four values:

  • SUBMITTED
  • IN_PROGRESS
  • COMPLETED
  • FAILED

In this sample code, we exit the while loop when JobStatus becomes COMPLETED or FAILED.
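
If the job ends as FAILED, the describe response should also carry a Message field explaining the reason, which is worth surfacing. A small addition to the loop above:

# Surface the failure reason when the job did not complete
if job_status == 'FAILED':
    print("reason: " + str(result["TopicsDetectionJobProperties"].get("Message")))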

ListTopicsDetectionJobs

Gets a list of the topic detection jobs that you have submitted.
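
When you have submitted many jobs, the list is paginated via NextToken. A sketch of walking all pages (MaxResults here is an arbitrary page size):

# Collect every job across all pages of the list response
jobs = []
kwargs = {"MaxResults": 10}
while True:
    page = comprehend.list_topics_detection_jobs(**kwargs)
    jobs.extend(page["TopicsDetectionJobPropertiesList"])
    if "NextToken" not in page:
        break
    kwargs["NextToken"] = page["NextToken"]
print("total jobs: " + str(len(jobs)))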

Execution result

job_id: 2733262c2747153ab8cb0b01********
job_status: SUBMITTED
job_status: IN_PROGRESS
[...]
job_status: COMPLETED
topics_detection_job_list: {
    "ResponseMetadata": {
        "HTTPHeaders": {
            "connection": "keep-alive",
            "content-length": "415",
            "content-type": "application/x-amz-json-1.1",
            "date": "Thu, 22 Mar 2018 04:27:59 GMT",
            "x-amzn-requestid": "669ffb28-2d89-11e8-82a0-************"
        },
        "HTTPStatusCode": 200,
        "RequestId": "669ffb28-2d89-11e8-82a0-************",
        "RetryAttempts": 0
    },
    "TopicsDetectionJobPropertiesList": [
        {
            "EndTime": {
                "$date": 1521692818930
            },
            "InputDataConfig": {
                "InputFormat": "ONE_DOC_PER_FILE",
                "S3Uri": "s3://your_input"
            },
            "JobId": "2733262c2747153ab8cb0b01********",
            "JobName": "Job4",
            "JobStatus": "COMPLETED",
            "NumberOfTopics": 10,
            "OutputDataConfig": {
                "S3Uri": "s3://your_output/**********-2733262c2747153ab8cb0b01********-1521692274392/output/output.tar.gz"
            },
            "SubmitTime": {
                "$date": 1521692274392
            }
        }
    ]
}

Check the output file

Check that a file has been created in the S3 bucket at the output destination. You can find its location in the OutputDataConfig returned by ListTopicsDetectionJobs.

  • OutputDataConfig
"OutputDataConfig": {
    "S3Uri": "s3://your_output_bucket/************-700e040bd7ae56714b65f56049f574d1-1521592942171/output/output.tar.gz"
},
$ aws s3 cp s3://your_output_bucket/************-700e040bd7ae56714b65f56049f574d1-1521592942171/output/output.tar.gz .
$ tar -zxvf output.tar.gz
x topic-terms.csv
x doc-topics.csv
  • topic-terms.csv
    • Lists the topics in the collection. By default, each topic contains the top terms according to topic weight.
  • doc-topics.csv
    • Lists the documents associated with each topic and the proportion of each document that is concerned with the topic.

Note: For best results, submit at least 1,000 documents with each topic modeling job. A sketch for reading these output files follows below.
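
A sketch for downloading and reading these files with Python (the bucket and key are placeholders; take the real ones from the S3Uri in OutputDataConfig):

import boto3
import csv
import tarfile

s3 = boto3.client('s3', region_name='us-west-2')

# Placeholders: use the bucket/key from OutputDataConfig's S3Uri
s3.download_file('your_output_bucket',
                 'path/to/output/output.tar.gz',
                 'output.tar.gz')

with tarfile.open('output.tar.gz', 'r:gz') as tar:
    tar.extractall()

# Each row associates a topic number with a term and its weight
with open('topic-terms.csv') as f:
    for row in csv.DictReader(f):
        print(row)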

Conclusion

I think the Amazon Comprehend API will prove useful, so I have introduced sample code for each operation using the AWS SDK for Python (Boto3).
