
How to use Amazon Comprehend operations with the AWS SDK for Python (Boto3)
This post demonstrates sample code for the main Amazon Comprehend operations and for Topic Modeling, using the AWS SDK for Python (Boto3).
The main functions
- DetectDominantLanguage
- DetectEntities
- DetectKeyPhrases
- DetectSentiment
Each of these operations also has a batch API that processes up to 25 documents per request.
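As a quick illustration of a batch call (this snippet is not part of the main sample below, and the example texts are made up):
import boto3

comprehend = boto3.client('comprehend', region_name='us-west-2')

# Up to 25 documents per batch request
texts = ["I love this service.", "This is terrible."]
response = comprehend.batch_detect_sentiment(TextList=texts, LanguageCode='en')
for item in response["ResultList"]:
    print(item["Index"], item["Sentiment"])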
Functions used in Topic Modeling
- StartTopicsDetectionJob
- DescribeTopicsDetectionJob
- ListTopicsDetectionJobs
Sample Code
Let's start by looking at the four main functions.
import boto3
import json

# Comprehend constant
REGION = 'us-west-2'

# Function for detecting the dominant language
def detect_dominant_language(text):
    comprehend = boto3.client('comprehend', region_name=REGION)
    response = comprehend.detect_dominant_language(Text=text)
    return response

# Function for detecting named entities
def detect_entities(text, language_code):
    comprehend = boto3.client('comprehend', region_name=REGION)
    response = comprehend.detect_entities(Text=text, LanguageCode=language_code)
    return response

# Function for detecting key phrases
def detect_key_phrases(text, language_code):
    comprehend = boto3.client('comprehend', region_name=REGION)
    response = comprehend.detect_key_phrases(Text=text, LanguageCode=language_code)
    return response

# Function for detecting sentiment
def detect_sentiment(text, language_code):
    comprehend = boto3.client('comprehend', region_name=REGION)
    response = comprehend.detect_sentiment(Text=text, LanguageCode=language_code)
    return response

def main():
    # text
    text = "Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to find insights and relationships in text."
    # language code
    language_code = 'en'

    # detecting the dominant language
    result = detect_dominant_language(text)
    print("Starting detecting the dominant language")
    print(json.dumps(result, sort_keys=True, indent=4))
    print("End of detecting the dominant language\n")

    # detecting named entities
    result = detect_entities(text, language_code)
    print("Starting detecting named entities")
    print(json.dumps(result, sort_keys=True, indent=4))
    print("End of detecting named entities\n")

    # detecting key phrases
    result = detect_key_phrases(text, language_code)
    print("Starting detecting key phrases")
    print(json.dumps(result, sort_keys=True, indent=4))
    print("End of detecting key phrases\n")

    # detecting sentiment
    result = detect_sentiment(text, language_code)
    print("Starting detecting sentiment")
    print(json.dumps(result, sort_keys=True, indent=4))
    print("End of detecting sentiment\n")

if __name__ == '__main__':
    main()
DetectDominantLanguage operation
It detects the dominant language of the document. Amazon Comprehend can detect 101 different languages.
Execution result
{
"Languages": [
{
"LanguageCode": "en",
"Score": 0.9940536618232727
}
],
"ResponseMetadata": {
"HTTPHeaders": {
"connection": "keep-alive",
"content-length": "64",
"content-type": "application/x-amz-json-1.1",
"date": "Thu, 22 Mar 2018 04:15:20 GMT",
"x-amzn-requestid": "a29fda00-2d87-11e8-ad56-************"
},
"HTTPStatusCode": 200,
"RequestId": "a29fda00-2d87-11e8-ad56-************",
"RetryAttempts": 0
}
}
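For example, to pull the top-scoring language out of this response (a minimal sketch; result here is the detect_dominant_language return value from the sample above):
# 'result' is the detect_dominant_language response shown above
top = max(result["Languages"], key=lambda lang: lang["Score"])
print(top["LanguageCode"])  # => 'en'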
DetectEntities operation
It detects entities, such as people or places, in the document.
Execution result
{
"Entities": [
{
"BeginOffset": 0,
"EndOffset": 6,
"Score": 0.8670787215232849,
"Text": "Amazon",
"Type": "ORGANIZATION"
},
{
"BeginOffset": 7,
"EndOffset": 17,
"Score": 1.0,
"Text": "Comprehend",
"Type": "COMMERCIAL_ITEM"
}
],
"ResponseMetadata": {
"HTTPHeaders": {
"connection": "keep-alive",
"content-length": "201",
"content-type": "application/x-amz-json-1.1",
"date": "Thu, 22 Mar 2018 04:15:20 GMT",
"x-amzn-requestid": "a2b84450-2d87-11e8-b3f9-************"
},
"HTTPStatusCode": 200,
"RequestId": "a2b84450-2d87-11e8-b3f9-************",
"RetryAttempts": 0
}
}
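Each entity carries character offsets into the input text along with a type and confidence score; a minimal sketch of iterating over them (result being the detect_entities response above):
# 'result' is the detect_entities response shown above
for entity in result["Entities"]:
    print('{}: "{}" (score {:.2f})'.format(entity["Type"], entity["Text"], entity["Score"]))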
DetectKeyPhrases operation
It detects key phrases in the document contents.
Execution result
{
"KeyPhrases": [
{
"BeginOffset": 0,
"EndOffset": 17,
"Score": 0.9958747029304504,
"Text": "Amazon Comprehend"
},
{
"BeginOffset": 21,
"EndOffset": 50,
"Score": 0.9654422998428345,
"Text": "a natural language processing"
},
{
"BeginOffset": 52,
"EndOffset": 55,
"Score": 0.941932201385498,
"Text": "NLP"
},
{
"BeginOffset": 57,
"EndOffset": 64,
"Score": 0.9076098203659058,
"Text": "service"
},
{
"BeginOffset": 75,
"EndOffset": 91,
"Score": 0.872683584690094,
"Text": "machine learning"
},
{
"BeginOffset": 100,
"EndOffset": 126,
"Score": 0.9918361902236938,
"Text": "insights and relationships"
},
{
"BeginOffset": 130,
"EndOffset": 134,
"Score": 0.998969554901123,
"Text": "text"
}
],
"ResponseMetadata": {
"HTTPHeaders": {
"connection": "keep-alive",
"content-length": "615",
"content-type": "application/x-amz-json-1.1",
"date": "Thu, 22 Mar 2018 04:15:21 GMT",
"x-amzn-requestid": "a2d409a7-2d87-11e8-a9a6-************"
},
"HTTPStatusCode": 200,
"RequestId": "a2d409a7-2d87-11e8-a9a6-************",
"RetryAttempts": 0
}
}
DetectSentiment operation
It detects the overall sentiment (positive, negative, mixed, or neutral) of the document contents.
Execution result
{
"ResponseMetadata": {
"HTTPHeaders": {
"connection": "keep-alive",
"content-length": "161",
"content-type": "application/x-amz-json-1.1",
"date": "Thu, 22 Mar 2018 04:15:21 GMT",
"x-amzn-requestid": "a2ebb00b-2d87-11e8-9c58-************"
},
"HTTPStatusCode": 200,
"RequestId": "a2ebb00b-2d87-11e8-9c58-************",
"RetryAttempts": 0
},
"Sentiment": "NEUTRAL",
"SentimentScore": {
"Mixed": 0.003294283989816904,
"Negative": 0.01219215989112854,
"Neutral": 0.7587229609489441,
"Positive": 0.2257905900478363
}
}
Topic Modeling
Let's try executing a topic detection job.
Sample Code
import boto3
import json
import time
from bson import json_util

# Comprehend constant
REGION = 'us-west-2'

# A low-level client representing Amazon Comprehend
comprehend = boto3.client('comprehend', region_name=REGION)

# Start topics detection job settings
input_s3_url = "s3://your_input"
input_doc_format = "ONE_DOC_PER_FILE"  # or "ONE_DOC_PER_LINE"
output_s3_url = "s3://your_output"
data_access_role_arn = "arn:aws:iam::aws_account_id:role/role_name"
number_of_topics = 10
job_name = "Job_name"

input_data_config = {"S3Uri": input_s3_url, "InputFormat": input_doc_format}
output_data_config = {"S3Uri": output_s3_url}

# Starts an asynchronous topic detection job.
response = comprehend.start_topics_detection_job(NumberOfTopics=number_of_topics,
                                                 InputDataConfig=input_data_config,
                                                 OutputDataConfig=output_data_config,
                                                 DataAccessRoleArn=data_access_role_arn,
                                                 JobName=job_name)

# Gets job_id
job_id = response["JobId"]
print('job_id: ' + job_id)

# Loops until JobStatus becomes 'COMPLETED' or 'FAILED'.
while True:
    result = comprehend.describe_topics_detection_job(JobId=job_id)
    job_status = result["TopicsDetectionJobProperties"]["JobStatus"]
    print("job_status: " + job_status)
    if job_status in ['COMPLETED', 'FAILED']:
        break
    time.sleep(60)

# You can get a list of the topic detection jobs that you have submitted.
filter_job_name = {"JobName": job_name}
topics_detection_job_list = comprehend.list_topics_detection_jobs(Filter=filter_job_name)
print('topics_detection_job_list: ' + json.dumps(topics_detection_job_list,
                                                 sort_keys=True,
                                                 indent=4,
                                                 default=json_util.default))
StartTopicsDetectionJob
Starts a topic detection job as an asynchronous operation. After you get the JobId, you can check the job status using DescribeTopicsDetectionJob.
There are two kinds of InputFormat (see the sketch below):
- ONE_DOC_PER_FILE - Each file is considered a single document.
- ONE_DOC_PER_LINE - Each line of a single file is considered a separate document.
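As a rough sketch of the difference when preparing input (the bucket, key, and file names here are hypothetical, not from the original post):
import boto3

s3 = boto3.client('s3', region_name='us-west-2')

# ONE_DOC_PER_LINE: a single file where each line is one document
with open('docs.txt', 'w') as f:
    f.write("First document, all on one line.\n")
    f.write("Second document on the next line.\n")
s3.upload_file('docs.txt', 'your_input_bucket', 'input/docs.txt')

# ONE_DOC_PER_FILE: each file holds exactly one document
with open('doc1.txt', 'w') as f:
    f.write("A whole file that is a single document.")
s3.upload_file('doc1.txt', 'your_input_bucket', 'input/doc1.txt')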
DescribeTopicsDetectionJob
Gets the status of a topic detection job. JobStatus takes one of four values:
- SUBMITTED
- IN_PROGRESS
- COMPLETED
- FAILED
In this sample code, we exit the while loop when JobStatus becomes COMPLETED or FAILED.
ListTopicsDetectionJobs
Gets a list of the topic detection jobs that you have submitted.
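Besides JobName, the documented filter also accepts JobStatus and submit-time bounds; a minimal sketch of listing only completed jobs (using the comprehend client from the sample above):
# List only the topic detection jobs that have completed
response = comprehend.list_topics_detection_jobs(Filter={"JobStatus": "COMPLETED"})
for job in response["TopicsDetectionJobPropertiesList"]:
    print(job["JobId"], job["JobName"])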
Execution result
job_id: 2733262c2747153ab8cb0b01********
job_status: SUBMITTED
job_status: IN_PROGRESS
[...]
job_status: COMPLETED
topics_detection_job_list: {
"ResponseMetadata": {
"HTTPHeaders": {
"connection": "keep-alive",
"content-length": "415",
"content-type": "application/x-amz-json-1.1",
"date": "Thu, 22 Mar 2018 04:27:59 GMT",
"x-amzn-requestid": "669ffb28-2d89-11e8-82a0-************"
},
"HTTPStatusCode": 200,
"RequestId": "669ffb28-2d89-11e8-82a0-************",
"RetryAttempts": 0
},
"TopicsDetectionJobPropertiesList": [
{
"EndTime": {
"$date": 1521692818930
},
"InputDataConfig": {
"InputFormat": "ONE_DOC_PER_FILE",
"S3Uri": "s3://your_input"
},
"JobId": "2733262c2747153ab8cb0b01********",
"JobName": "Job4",
"JobStatus": "COMPLETED",
"NumberOfTopics": 10,
"OutputDataConfig": {
"S3Uri": "s3://your_output/**********-2733262c2747153ab8cb0b01********-1521692274392/output/output.tar.gz"
},
"SubmitTime": {
"$date": 1521692274392
}
}
]
}
Check the output file
Check that a file has been created in the S3 bucket at the output destination. You can find the output location using ListTopicsDetectionJobs.
- OutputDataConfig
"OutputDataConfig": {
"S3Uri": "s3://your_output_bucket/************-700e040bd7ae56714b65f56049f574d1-1521592942171/output/output.tar.gz"
},
$ aws s3 cp s3://your_output_bucket/************-700e040bd7ae56714b65f56049f574d1-1521592942171/output/output.tar.gz .
$ tar -zxvf output.tar.gz
x topic-terms.csv
x doc-topics.csv
- topic-terms.csv - A list of topics in the collection. By default, each topic contains the top terms according to topic weight.
- doc-topics.csv - Lists the documents associated with each topic and the proportion of each document that is concerned with the topic. A reading sketch follows below.
Note: To get the best results, use at least 1,000 documents with each topic modeling job.
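As a quick sketch of reading these files (the column names follow the Comprehend documentation; the paths assume output.tar.gz has already been extracted into the working directory):
import csv

# Top terms per topic
with open('topic-terms.csv') as f:
    for row in csv.DictReader(f):
        print(row['topic'], row['term'], row['weight'])

# Which documents belong to which topic, and how strongly
with open('doc-topics.csv') as f:
    for row in csv.DictReader(f):
        print(row['docname'], row['topic'], row['proportion'])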
Conclusion
Since the Amazon Comprehend API should come in handy, I have introduced sample code for each operation using the AWS SDK for Python (Boto3).