How to use Amazon Comprehend operations with the AWS SDK for Python (Boto3)

2018.06.10

This post demonstrates sample code for the main Amazon Comprehend operations and for Topic Modeling, using the AWS SDK for Python (Boto3).

The main functions

  • DetectDominantLanguage
  • DetectEntities
  • DetectKeyPhrases
  • DetectSentiment

Each of these operations also has a batch counterpart (BatchDetectDominantLanguage, BatchDetectEntities, BatchDetectKeyPhrases, BatchDetectSentiment) that processes up to 25 documents per request.
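For instance, one BatchDetectSentiment call can analyze up to 25 documents at once. A minimal sketch; the region, helper names, and batching helper are illustrative, not part of the original post:

```python
def chunk_documents(texts, size=25):
    """Split a document list into batches of at most `size` (the API limit)."""
    return [texts[i:i + size] for i in range(0, len(texts), size)]

def batch_detect_sentiment(texts, language_code="en", region_name="us-east-1"):
    """Analyze up to 25 documents with one BatchDetectSentiment request."""
    import boto3  # imported lazily; actually calling AWS requires credentials
    comprehend = boto3.client("comprehend", region_name=region_name)
    return comprehend.batch_detect_sentiment(TextList=texts, LanguageCode=language_code)
```

Chunking the input first lets you feed an arbitrarily long document list through the 25-document limit.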

Functions used in Topic Modeling

  • StartTopicsDetectionJob
  • DescribeTopicsDetectionJob
  • ListTopicsDetectionJobs

Sample Code

Let's start by looking at the four main functions.

DetectDominantLanguage operation

DetectDominantLanguage identifies the dominant language of a document. Amazon Comprehend can detect 101 different languages.
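A call might look like the following; the region and helper names are illustrative:

```python
def detect_dominant_language(text, region_name="us-east-1"):
    """Call the DetectDominantLanguage API (requires AWS credentials)."""
    import boto3  # imported lazily so the parsing helper works without boto3
    comprehend = boto3.client("comprehend", region_name=region_name)
    return comprehend.detect_dominant_language(Text=text)

def dominant_language(response):
    """Return the highest-scoring language code from the response."""
    return max(response["Languages"], key=lambda lang: lang["Score"])["LanguageCode"]
```

For the execution result shown below, `dominant_language(response)` returns `"en"`.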

Execution result

{
    "Languages": [
        {
            "LanguageCode": "en",
            "Score": 0.9940536618232727
        }
    ],
    "ResponseMetadata": {
        "HTTPHeaders": {
            "connection": "keep-alive",
            "content-length": "64",
            "content-type": "application/x-amz-json-1.1",
            "date": "Thu, 22 Mar 2018 04:15:20 GMT",
            "x-amzn-requestid": "a29fda00-2d87-11e8-ad56-************"
        },
        "HTTPStatusCode": 200,
        "RequestId": "a29fda00-2d87-11e8-ad56-************",
        "RetryAttempts": 0
    }
}

DetectEntities operation

DetectEntities finds the entities in a document, such as people, places, and organizations.
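A sketch of the call, plus a small helper for grouping the results; the region and helper names are illustrative:

```python
def detect_entities(text, language_code="en", region_name="us-east-1"):
    """Call the DetectEntities API (requires AWS credentials)."""
    import boto3  # imported lazily so the grouping helper works without boto3
    comprehend = boto3.client("comprehend", region_name=region_name)
    return comprehend.detect_entities(Text=text, LanguageCode=language_code)

def entities_by_type(response):
    """Group detected entity texts by their entity Type."""
    grouped = {}
    for entity in response["Entities"]:
        grouped.setdefault(entity["Type"], []).append(entity["Text"])
    return grouped
```

For the execution result below, `entities_by_type(response)` yields `{"ORGANIZATION": ["Amazon"], "COMMERCIAL_ITEM": ["Comprehend"]}`.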

Execution result

{
    "Entities": [
        {
            "BeginOffset": 0,
            "EndOffset": 6,
            "Score": 0.8670787215232849,
            "Text": "Amazon",
            "Type": "ORGANIZATION"
        },
        {
            "BeginOffset": 7,
            "EndOffset": 17,
            "Score": 1.0,
            "Text": "Comprehend",
            "Type": "COMMERCIAL_ITEM"
        }
    ],
    "ResponseMetadata": {
        "HTTPHeaders": {
            "connection": "keep-alive",
            "content-length": "201",
            "content-type": "application/x-amz-json-1.1",
            "date": "Thu, 22 Mar 2018 04:15:20 GMT",
            "x-amzn-requestid": "a2b84450-2d87-11e8-b3f9-************"
        },
        "HTTPStatusCode": 200,
        "RequestId": "a2b84450-2d87-11e8-b3f9-************",
        "RetryAttempts": 0
    }
}

DetectKeyPhrases operation

DetectKeyPhrases extracts the key phrases from the document text.
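A sketch of the call, with a helper that filters phrases by confidence; the region, helper names, and 0.9 threshold are illustrative:

```python
def detect_key_phrases(text, language_code="en", region_name="us-east-1"):
    """Call the DetectKeyPhrases API (requires AWS credentials)."""
    import boto3  # imported lazily so the filter helper works without boto3
    comprehend = boto3.client("comprehend", region_name=region_name)
    return comprehend.detect_key_phrases(Text=text, LanguageCode=language_code)

def phrases_above(response, threshold=0.9):
    """Keep only key phrases whose confidence Score meets the threshold."""
    return [p["Text"] for p in response["KeyPhrases"] if p["Score"] >= threshold]
```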

Execution result

{
    "KeyPhrases": [
        {
            "BeginOffset": 0,
            "EndOffset": 17,
            "Score": 0.9958747029304504,
            "Text": "Amazon Comprehend"
        },
        {
            "BeginOffset": 21,
            "EndOffset": 50,
            "Score": 0.9654422998428345,
            "Text": "a natural language processing"
        },
        {
            "BeginOffset": 52,
            "EndOffset": 55,
            "Score": 0.941932201385498,
            "Text": "NLP"
        },
        {
            "BeginOffset": 57,
            "EndOffset": 64,
            "Score": 0.9076098203659058,
            "Text": "service"
        },
        {
            "BeginOffset": 75,
            "EndOffset": 91,
            "Score": 0.872683584690094,
            "Text": "machine learning"
        },
        {
            "BeginOffset": 100,
            "EndOffset": 126,
            "Score": 0.9918361902236938,
            "Text": "insights and relationships"
        },
        {
            "BeginOffset": 130,
            "EndOffset": 134,
            "Score": 0.998969554901123,
            "Text": "text"
        }
    ],
    "ResponseMetadata": {
        "HTTPHeaders": {
            "connection": "keep-alive",
            "content-length": "615",
            "content-type": "application/x-amz-json-1.1",
            "date": "Thu, 22 Mar 2018 04:15:21 GMT",
            "x-amzn-requestid": "a2d409a7-2d87-11e8-a9a6-************"
        },
        "HTTPStatusCode": 200,
        "RequestId": "a2d409a7-2d87-11e8-a9a6-************",
        "RetryAttempts": 0
    }
}

DetectSentiment operation

DetectSentiment determines the overall sentiment of the document contents: POSITIVE, NEGATIVE, MIXED, or NEUTRAL.
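A sketch of the call, plus a helper that pairs the predicted label with its confidence score; the region and helper names are illustrative:

```python
def detect_sentiment(text, language_code="en", region_name="us-east-1"):
    """Call the DetectSentiment API (requires AWS credentials)."""
    import boto3  # imported lazily so the summary helper works without boto3
    comprehend = boto3.client("comprehend", region_name=region_name)
    return comprehend.detect_sentiment(Text=text, LanguageCode=language_code)

def sentiment_summary(response):
    """Return the predicted label and its confidence score."""
    label = response["Sentiment"]  # e.g. "NEUTRAL"
    return label, response["SentimentScore"][label.capitalize()]
```

For the execution result below, `sentiment_summary(response)` returns `("NEUTRAL", 0.7587229609489441)`.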

Execution result

{
    "ResponseMetadata": {
        "HTTPHeaders": {
            "connection": "keep-alive",
            "content-length": "161",
            "content-type": "application/x-amz-json-1.1",
            "date": "Thu, 22 Mar 2018 04:15:21 GMT",
            "x-amzn-requestid": "a2ebb00b-2d87-11e8-9c58-************"
        },
        "HTTPStatusCode": 200,
        "RequestId": "a2ebb00b-2d87-11e8-9c58-************",
        "RetryAttempts": 0
    },
    "Sentiment": "NEUTRAL",
    "SentimentScore": {
        "Mixed": 0.003294283989816904,
        "Negative": 0.01219215989112854,
        "Neutral": 0.7587229609489441,
        "Positive": 0.2257905900478363
    }
}

Topic Modeling

Let's try running a topic detection job.

Sample Code

StartTopicsDetectionJob

StartTopicsDetectionJob starts a topic detection job as an asynchronous operation. After you get the JobId, you can check the job status with DescribeTopicsDetectionJob.

There are two kinds of InputFormat:

  • ONE_DOC_PER_FILE - each file is treated as a single document.
  • ONE_DOC_PER_LINE - each line of a single file is treated as a separate document.
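A sketch of starting a job; the S3 URIs, role ARN, region, and topic count are placeholders you must replace:

```python
def start_topics_detection_job(input_s3_uri, output_s3_uri, role_arn,
                               number_of_topics=10, region_name="us-east-1"):
    """Start an asynchronous topic detection job and return its JobId."""
    import boto3  # actually calling AWS requires credentials and an IAM role
    comprehend = boto3.client("comprehend", region_name=region_name)
    response = comprehend.start_topics_detection_job(
        InputDataConfig={"S3Uri": input_s3_uri, "InputFormat": "ONE_DOC_PER_FILE"},
        OutputDataConfig={"S3Uri": output_s3_uri},
        DataAccessRoleArn=role_arn,
        NumberOfTopics=number_of_topics,
    )
    return response["JobId"]
```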

DescribeTopicsDetectionJob

DescribeTopicsDetectionJob gets the status of a topic detection job. There are four statuses, as follows:

  • JobStatus
    • SUBMITTED
    • IN_PROGRESS
    • COMPLETED
    • FAILED

In this sample code, we exit the while loop when JobStatus becomes COMPLETED or FAILED.
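The polling loop can be sketched like this; the 60-second interval, region, and function name are illustrative:

```python
import time

def wait_for_job(job_id, region_name="us-east-1", poll_seconds=60):
    """Poll DescribeTopicsDetectionJob until the job completes or fails."""
    import boto3  # actually calling AWS requires credentials
    comprehend = boto3.client("comprehend", region_name=region_name)
    while True:
        response = comprehend.describe_topics_detection_job(JobId=job_id)
        status = response["TopicsDetectionJobProperties"]["JobStatus"]
        print("job_status:", status)
        if status in ("COMPLETED", "FAILED"):
            return status
        time.sleep(poll_seconds)
```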

ListTopicsDetectionJobs

ListTopicsDetectionJobs gets the list of topic detection jobs.
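A sketch of the call, plus a helper that pulls out the output locations of finished jobs; the region and helper names are illustrative:

```python
def list_topics_detection_jobs(region_name="us-east-1"):
    """Call the ListTopicsDetectionJobs API (requires AWS credentials)."""
    import boto3  # imported lazily so the filter helper works without boto3
    comprehend = boto3.client("comprehend", region_name=region_name)
    return comprehend.list_topics_detection_jobs()

def completed_output_uris(response):
    """Collect the output S3 URIs of jobs that finished successfully."""
    return [
        props["OutputDataConfig"]["S3Uri"]
        for props in response["TopicsDetectionJobPropertiesList"]
        if props["JobStatus"] == "COMPLETED"
    ]
```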

Execution result

job_id: 2733262c2747153ab8cb0b01********
job_status: SUBMITTED
job_status: IN_PROGRESS
[...]
job_status: COMPLETED
topics_detection_job_list: {
    "ResponseMetadata": {
        "HTTPHeaders": {
            "connection": "keep-alive",
            "content-length": "415",
            "content-type": "application/x-amz-json-1.1",
            "date": "Thu, 22 Mar 2018 04:27:59 GMT",
            "x-amzn-requestid": "669ffb28-2d89-11e8-82a0-************"
        },
        "HTTPStatusCode": 200,
        "RequestId": "669ffb28-2d89-11e8-82a0-************",
        "RetryAttempts": 0
    },
    "TopicsDetectionJobPropertiesList": [
        {
            "EndTime": {
                "$date": 1521692818930
            },
            "InputDataConfig": {
                "InputFormat": "ONE_DOC_PER_FILE",
                "S3Uri": "s3://your_input"
            },
            "JobId": "2733262c2747153ab8cb0b01********",
            "JobName": "Job4",
            "JobStatus": "COMPLETED",
            "NumberOfTopics": 10,
            "OutputDataConfig": {
                "S3Uri": "s3://your_output/**********-2733262c2747153ab8cb0b01********-1521692274392/output/output.tar.gz"
            },
            "SubmitTime": {
                "$date": 1521692274392
            }
        }
    ]
}

Check the output file

Check that a file has been created in the S3 bucket at the output destination. You can find the output location with ListTopicsDetectionJobs.

  • OutputDataConfig
"OutputDataConfig": {
    "S3Uri": "s3://your_output_bucket/************-700e040bd7ae56714b65f56049f574d1-1521592942171/output/output.tar.gz"
},
$ aws s3 cp s3://your_output_bucket/************-700e040bd7ae56714b65f56049f574d1-1521592942171/output/output.tar.gz .
$ tar -zxvf output.tar.gz
x topic-terms.csv
x doc-topics.csv
  • topic-terms.csv
    • Lists the topics in the collection. By default, each topic contains its top terms according to topic weighting.
  • doc-topics.csv
    • Lists the documents associated with each topic and the proportion of each document that is concerned with the topic.

Note: For best results, use at least 1,000 documents with each topic modeling job.
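To work with the extracted files, a small parser can be sketched as follows; the column names (topic, term, weight) are assumptions about the CSV layout, so check them against your own output:

```python
import csv
import io

def parse_topic_terms(csv_text):
    """Group (term, weight) pairs by topic from topic-terms.csv content."""
    topics = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        topics.setdefault(row["topic"], []).append((row["term"], float(row["weight"])))
    return topics
```

For example, feeding it the text of topic-terms.csv gives a dict mapping each topic ID to its weighted top terms.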

Conclusion

Since the Amazon Comprehend API should prove useful, I have introduced sample code for each operation using the AWS SDK for Python (Boto3).
