I Tried Creating an Amazon EMR Cluster

2022.09.28

This article was published more than a year ago. Please note that the information may be out of date.

Introduction:

Amazon EMR is a managed service from Amazon that data analysts widely use to run massively distributed workloads in the cloud with open-source projects such as Apache Hadoop, Apache Spark, Apache Hive, Presto, Apache Pig, and a few more.

CREATING AN EMR CLUSTER

How to build an EMR cluster?

  1. Go to Services ➢ Analytics ➢ EMR.
  2. Select Create Cluster.
  3. To construct a cluster, you have two choices: Quick Create or Advanced Options (which provides more control and choice). I am picking Quick Create.
  4. Give your cluster a name, e.g., Developersio-try.
  5. You can choose to enable logging. All cluster logs will be stored in Amazon S3 as a result.
  6. Choose the launch mode.

    There are two launch modes for a cluster:

    • Cluster – You create a long-running cluster with a set of applications chosen from the list of apps in the next step.
    • Step Execution – With this option, EMR will create a cluster, execute the added steps, and terminate when the steps have completed.
  7. For the purpose of this exercise, we will choose the Cluster mode of operation.
  8. Choose a software configuration. Select EMR Release. At the time of this writing, the latest release was emr-6.7.0.

    You can select from a list of applications to be configured on the cluster that is being spun up:

    • Core Hadoop: Hadoop 2.8.5 with Ganglia 3.7.2, Hive 2.3.6, Hue 4.4.0, Mahout 0.13.0, Pig 0.17.0, and Tez 0.9.2
    • HBase: HBase 1.4.10 with Ganglia 3.7.2, Hadoop 2.8.5, Hive 2.3.6, Hue 4.4.0, Phoenix 4.14.3, and ZooKeeper 3.4.14
    • Presto: Presto 0.227 with Hadoop 2.8.5 HDFS and Hive 2.3.6 Metastore
    • Spark: Spark 2.4.4 on Hadoop 2.8.5 YARN with Ganglia 3.7.2 and Zeppelin 0.8.2
    • Trino: Trino 378 with Hadoop 3.2.1 HDFS and Hive 3.1.3 Metastore (new in EMR 6.x)
  9. For the sake of this exercise, we'll choose Core Hadoop.

    You also have the option to choose AWS Glue Data Catalog for table metadata. This lets these applications use the Glue Data Catalog as an external Hive metastore.

  10. Hardware configuration:

    You have to choose the hardware configuration for your cluster, which includes the following settings:

    • Instance type
    • Number of instances (One of the instances will be a master node, and the remaining will act as core nodes.)
    • Cluster scaling: scale cluster nodes based on workload
    • Auto-termination: terminate the cluster after it has been idle for a set number of minutes
  11. Security & Access configuration:
    • You can choose an EC2 key pair. If you don't select an EC2 key pair, you won't be able to SSH into your master node.
    • You can choose from two levels of permissions:
      • Default Permission – This will use default IAM roles. If roles are not present, they will be automatically created for you with managed policies for automatic policy updates.
        1. EMR role – EMR_DefaultRole
        2. EC2 instance profile – EMR_EC2_DefaultRole
      • Custom Permission – You can select custom roles to tailor permissions for your cluster.
        1. Select an EMR role.
        2. Select an EC2 instance profile.
  12. Select the Create Cluster option. It will take around 5–7 minutes to spin up a cluster.
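The console choices above map roughly onto flags of `aws emr create-cluster`. The following is a minimal sketch, not the exact request the console sends: the S3 bucket, key pair name, instance type, and idle timeout are hypothetical placeholders, and the application list is only an assumed subset of the Core Hadoop bundle. The script prints the command by default so it can be reviewed before anything is launched.

```shell
#!/bin/sh
# Sketch only. The bucket and key pair names are hypothetical placeholders.
CLUSTER_NAME="Developersio-try"               # name from step 4
RELEASE_LABEL="emr-6.7.0"                     # release from step 8
LOG_URI="s3://my-example-bucket/emr-logs/"    # hypothetical S3 log location (step 5)
KEY_NAME="my-example-keypair"                 # hypothetical EC2 key pair (step 11)
INSTANCE_TYPE="m5.xlarge"                     # assumed instance type (step 10)
INSTANCE_COUNT=3                              # 1 master node + 2 core nodes
IDLE_TIMEOUT=3600                             # idle seconds before auto-termination

# Print the command when DRY_RUN=1 (the default); run it when DRY_RUN=0.
aws_cmd() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "aws $*"
  else
    aws "$@"
  fi
}

create_cluster() {
  aws_cmd emr create-cluster \
    --name "$CLUSTER_NAME" \
    --release-label "$RELEASE_LABEL" \
    --applications Name=Hadoop Name=Hive Name=Pig \
    --log-uri "$LOG_URI" \
    --ec2-attributes KeyName="$KEY_NAME" \
    --use-default-roles \
    --instance-type "$INSTANCE_TYPE" \
    --instance-count "$INSTANCE_COUNT" \
    --auto-termination-policy IdleTimeout="$IDLE_TIMEOUT"
}

create_cluster
```

Here `--use-default-roles` corresponds to the Default Permission option (EMR_DefaultRole and EMR_EC2_DefaultRole). If those roles do not exist yet, `aws emr create-default-roles` can create them. Setting `DRY_RUN=0` runs the command for real and starts a billable cluster.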

Deleting EMR Cluster:

Select the cluster in the EMR console and choose Terminate.
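Termination can also be done from the AWS CLI. The sketch below assumes a placeholder cluster ID (`j-XXXXXXXXXXXXX`); the real ID is shown in the console or returned when the cluster is created. It prints the commands by default instead of running them.

```shell
#!/bin/sh
CLUSTER_ID="j-XXXXXXXXXXXXX"   # placeholder; replace with your own cluster ID

# Print the command when DRY_RUN=1 (the default); run it when DRY_RUN=0.
aws_cmd() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "aws $*"
  else
    aws "$@"
  fi
}

# Request termination of one cluster.
terminate_cluster() {
  aws_cmd emr terminate-clusters --cluster-ids "$1"
}

# Show the current cluster state, e.g. to confirm that termination has started.
cluster_state() {
  aws_cmd emr describe-cluster --cluster-id "$1" \
    --query Cluster.Status.State --output text
}

terminate_cluster "$CLUSTER_ID"
cluster_state "$CLUSTER_ID"
```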

Conclusion:

This was the first time I tried Amazon EMR, and there are many features within Amazon EMR that I want to try in the future. You can also create the same cluster with other tools; links and an example are shared below.

Create using Terraform:

https://dev.classmethod.jp/articles/create-amazon-emr-cluster-with-terraform/

Create using CLI:

aws emr create-cluster \
  --name "developersIO-cluster" \
  --release-label emr-5.28.0 \
  --applications Name=Hive Name=Spark \
  --use-default-roles \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m5.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m5.xlarge
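The `create-cluster` call returns a `ClusterId` in its output. As a possible follow-up, the sketch below waits for the cluster to come up and then looks up the master node's public DNS name, which is what you would SSH into. The cluster ID is a placeholder, and the commands are printed by default rather than executed.

```shell
#!/bin/sh
CLUSTER_ID="j-XXXXXXXXXXXXX"   # placeholder; use the ClusterId returned by create-cluster

# Print the command when DRY_RUN=1 (the default); run it when DRY_RUN=0.
aws_cmd() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "aws $*"
  else
    aws "$@"
  fi
}

# Block until the cluster has finished starting.
wait_until_running() {
  aws_cmd emr wait cluster-running --cluster-id "$1"
}

# Look up the public DNS name of the master node.
master_dns() {
  aws_cmd emr describe-cluster --cluster-id "$1" \
    --query Cluster.MasterPublicDnsName --output text
}

wait_until_running "$CLUSTER_ID"
master_dns "$CLUSTER_ID"
```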

 

Create Using SDK:

https://docs.aws.amazon.com/emr/latest/ManagementGuide/calling-emr-with-java-sdk.html