My Journey In Data Analytics Department as a Trainee, Part 2

2021.03.30


This blog is a continuation of the My Journey In Data Analytics as a Trainee, Part 1 blog. It mostly deals with the week 2 training, giving an insight into several packages, how Serverless Framework works, and some of the functionality of Lambda.

Week 2

First, fetch the RSS feed (https://dev.classmethod.jp/feed/), save the content to an XML file, and upload it to an S3 bucket in a dedicated folder for XML files (it is recommended to create the S3 bucket in advance). Next, parse a few tags from the XML file and convert them to JSON format, then save the converted JSON file in the same S3 bucket in its own dedicated folder, similar to the XML folder. Configure the Lambda function to run every 30 minutes. Finally, code review is done through a pull request in Backlog Git. The above process is explained in detail below.

For the setup and a walkthrough of Serverless Framework, see References 1 and 2.

Initial changes to be made in the serverless.yml file:

service: # give your desired name

provider:
  name: aws # required field; this exercise deploys to AWS (added for a valid config)
  runtime: python3.8 # assumed Python runtime for the handler
  stage: dev
  region: ap-south-1 # desired region to deploy
  deploymentBucket:
    name: da-new-graduate-serverless-deploy-bucket

package:
  exclude:
    - .git/**
    - .gitignore
    - .serverless
    - .serverless/**
    - README.md
    - shell/**
    - node_modules
    - node_modules/**

plugins:
  - serverless-python-requirements

$ sls plugin install -n serverless-python-requirements # installs the serverless-python-requirements plugin

(This command is run in the terminal so that the plugin is installed and the Python package requirements are bundled in order to deploy with Serverless Framework.)

Task 1

So first, fetch the RSS feed https://dev.classmethod.jp/feed/ and save it as an XML file in the xml folder of the S3 bucket (this S3 bucket was created beforehand). The following Python code is written in the handler.py file. Here we make use of boto3 to upload files to S3.

import boto3
import requests
from datetime import datetime

def hello(event, context):
    # Fetch the RSS feed and upload the raw XML to S3 under a timestamped key
    response = requests.get('https://dev.classmethod.jp/feed/')
    s3 = boto3.resource('s3')
    o = s3.Object('da-exercise-202103', 'hemanth_kumar_r/raw-data/%s.xml' % datetime.now().strftime("%Y%m%d-%H%M%S"))
    o.put(Body=response.content)
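Before deploying, the handler can be exercised once from your own machine. A minimal sketch, assuming local AWS credentials with write access to the bucket; local_test.py is a hypothetical helper file, not part of the original exercise:

# local_test.py - run the handler once locally (hypothetical helper)
from handler import hello

if __name__ == '__main__':
    hello({}, None)  # the handler ignores event and context, so placeholders are fine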

Add another file, requirements.txt. This is done so that the packages required in handler.py are downloaded and bundled into the Lambda deployment. The file contains the following:

requests
feedparser

For more information on boto3, see Reference 3.

Tasks 2 & 3

Next, use feedparser to get "title", "link", "pubDate", "dc:creator", and "category" from the saved XML file and convert them to JSON format. The following code is written in the handler.py file.

import json
from datetime import datetime

import boto3
import feedparser
import requests

# Inside the handler definition, add the code below
URL = "https://dev.classmethod.jp/feed"
response = requests.get(URL)

s3 = boto3.resource('s3')
key = 'tags'
feed = feedparser.parse(response.content)

# Collect the required fields from each entry; feedparser exposes
# pubDate as 'published', dc:creator as 'author', and category as 'tags'
data = []
for f in feed.entries:
    if key in f.keys():
        case = {'title': f.title, 'link': f.link, 'published': f.published, 'author': f.author, 'category': f.tags}
    else:
        case = {'title': f.title, 'link': f.link, 'published': f.published, 'author': f.author}
    data.append(case)

# Serialize to JSON (ensure_ascii=False keeps Japanese text readable) and upload to S3
jsondata = json.dumps(data, ensure_ascii=False)
o = s3.Object('da-exercise-202103', 'hemanth_kumar_r/json-data/%s.json' % datetime.now().strftime("%Y%m%d-%H%M%S"))
o.put(Body=jsondata)
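To confirm what was written, the JSON file can be read back from S3 and inspected. A minimal sketch, assuming read access to the bucket; the object key below is only an illustrative timestamped name, so replace it with an actual object from your json-data folder:

import json
import boto3

s3 = boto3.resource('s3')
# '20210330-120000.json' is an illustrative key; use a real object name from the bucket
obj = s3.Object('da-exercise-202103', 'hemanth_kumar_r/json-data/20210330-120000.json')
entries = json.loads(obj.get()['Body'].read())
print(len(entries), 'entries; first title:', entries[0]['title'])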

For more information on feedparser, see Reference 4.

Task 4

Configure the Lambda function to execute every 30 minutes, so that every 30 minutes an XML file and a JSON file are created in the S3 bucket as discussed above. Along with the changes above, add the following to the serverless.yml file.

functions:
  hello:
    handler: handler.hello
    events:
      - schedule:
          rate: cron(*/30 * * * ? *)
          enabled: true

A cron expression is used here to schedule the Lambda to run every 30 minutes; the equivalent rate expression rate(30 minutes) would also work. For more information on cron, see Reference 5.

Task 5

Before creating the pull request, run "sls deploy" and check in the AWS console whether the Lambda is running as expected.
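One way to verify this without the console is to list the newest objects written to the bucket. A minimal sketch, assuming list permission on the bucket; the bucket name and prefix match the ones used in handler.py:

import boto3

s3 = boto3.client('s3')
# List objects under the raw-data prefix and print the three most recent ones
resp = s3.list_objects_v2(Bucket='da-exercise-202103', Prefix='hemanth_kumar_r/raw-data/')
for obj in sorted(resp.get('Contents', []), key=lambda o: o['LastModified'])[-3:]:
    print(obj['Key'], obj['LastModified'])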

Refer to the My Journey in Data Analytics Department as a Trainee, Part 1 article and check Task 2.5 (the Git flow) there. Follow that procedure to add the files to Backlog Git. Now create a pull request in Backlog Git: click on your Backlog repository, click on Pull requests, then click the + option, which leads you to the pull request creation screen.

Change the base branch from master to develop and, on the other side, select the branch you created. Add a description, assign your mentor, then click Add pull request and wait for it to be reviewed by your mentor.

Summary

The blog above depicts my journey through week 2 in the Data Analytics Department. After this walkthrough, we know how to make an XML file from an RSS feed, how to extract certain tags from the XML file and turn them into a JSON file, how to deploy this to Lambda using Serverless Framework, and how to schedule it to run every 30 minutes. The upcoming blog will show how the JSON file is queried.

References