AWS Big Data in Action: Environment Preparation (1)

Experiment introduction

This hands-on series walks you through the complete big data workflow of collection, storage, processing, analysis, and visualization using AWS big data and data lake services. The following major AWS big data services will be introduced:

  • Lab1: Real-time streaming data processing, based on the Kinesis product family
  • Lab2: Batch data processing, implemented with EMR (Spark)
  • Lab3: Data visualization, based on QuickSight + Athena
  • Lab4: Real-time data retrieval, based on Elasticsearch
  • Lab5: Data warehouse construction and data visualization, implemented with Redshift + QuickSight

To better simulate real business requirements, we build a database (simulating historical data, or the ODS databases that some customers already have), a real-time data stream (simulating clickstreams from e-commerce, web, and similar sources), platforms for streaming real-time analysis and batch analysis, and corresponding platforms for visualization and real-time data retrieval. The overall architecture of the labs is shown below:

image-20210319094037967

To give a clearer picture of the data, here is the abstracted table structure in RDS (the relational database) for reference:

image-20210319094114952

Experiment preparation

In order to successfully complete all the hands-on experiments, you need to make the following preparations. All resources are created in the AWS us-east-1 region:

Step  Preparation             Description
01    Account configuration   Get familiar with the AWS account and login method provided, and configure the corresponding security options
02    Deploy EC2               Deploy an EC2 (Linux) client for the hands-on work and learn to log in remotely
03    Configure KDS            Configure Kinesis Data Streams to generate real-time streaming data
04    Deploy RDS               Deploy the database (treated in this lab environment as historical data or an ODS environment)
05    Deploy EMR               Deploy the big data platform EMR
06    Deploy ES                Deploy Elasticsearch, the real-time analysis platform

Account configuration

IAM (Identity and Access Management) is the AWS service for security-related concerns such as users, permissions, and authentication. Here we configure two roles: one used by EC2 to access resources in the cloud (ec2-role), and one used by Glue to access resources in the cloud (glue-role).

Configure role permissions for EC2

Open the IAM console as follows

image-20210319133935285

Click the "Roles" menu on the left and select "Create role"

image-20210319134025762

Select EC2 in AWS service

image-20210319134137937

On the permissions setup page, click "Attach existing policies directly" and add the AdministratorAccess and IAMFullAccess policies

image-20210319134349507

The tags page in the next step does not need to be configured; on the review page that follows, set the role name (here ec2-role) and confirm that the policies have been attached correctly

image-20210319134432221
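
For reference, a roughly equivalent AWS CLI sketch is shown below (the role name ec2-role matches the console steps above; note that EC2 attaches a role through an instance profile, created here with the same name):

# Create the role with a trust policy that lets EC2 assume it
aws iam create-role --role-name ec2-role \
  --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"ec2.amazonaws.com"},"Action":"sts:AssumeRole"}]}'

# Attach the two managed policies used in this lab
aws iam attach-role-policy --role-name ec2-role --policy-arn arn:aws:iam::aws:policy/AdministratorAccess
aws iam attach-role-policy --role-name ec2-role --policy-arn arn:aws:iam::aws:policy/IAMFullAccess

# EC2 consumes the role via an instance profile
aws iam create-instance-profile --instance-profile-name ec2-role
aws iam add-role-to-instance-profile --instance-profile-name ec2-role --role-name ec2-role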

Configure role permissions for Glue

Select Glue in AWS service

image-20210319134554659

On the policy filtering page, check AdministratorAccess and IAMFullAccess and click Next. The tags configuration can be skipped; go straight to the review page, set the role name (here glue-role), confirm the policies, and then create the role

image-20210319134655993
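
Likewise, a CLI sketch for glue-role (the trust policy lets the Glue service assume the role):

aws iam create-role --role-name glue-role \
  --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"glue.amazonaws.com"},"Action":"sts:AssumeRole"}]}'
aws iam attach-role-policy --role-name glue-role --policy-arn arn:aws:iam::aws:policy/AdministratorAccess
aws iam attach-role-policy --role-name glue-role --policy-arn arn:aws:iam::aws:policy/IAMFullAccess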

Deploy EC2

EC2 can be simply understood as a virtual machine on the AWS cloud. Open the EC2 console, select the AMI (Amazon Machine Image) type as "Amazon Linux 2", and confirm that the architecture is 64-bit x86

image-20210319135119789

You can choose t2.large or t2.xlarge as the instance type (the corresponding t3 series also works; the load is very light, so in the spirit of frugality any instance with more than 2 GB of memory will meet the demand)

image-20210319135331577

On the instance configuration page, set the IAM role to the ec2-role we just created

image-20210319135429620

On the next tags page, add a tag with key "Name" (note the case) and value "LinuxClient" for easy identification (the tag is optional)

image-20210319135552380

When configuring the security group in the next step, use a security group that allows all network traffic; if you don't have one, create it yourself

image-20210319135653135

On the review page, confirm the configuration and click "Launch". You will then be prompted for the key pair to use for this EC2 instance. Here we create a new one and save it locally (note: the key can only be downloaded at this point and cannot be retrieved later; replacing it afterwards is quite troublesome)
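
If you prefer the command line, a roughly equivalent launch command looks like this (a sketch; the AMI ID, key pair name, and security group ID are placeholders you must replace with your own values):

# Placeholders: the Amazon Linux 2 (64-bit x86) AMI ID, your key pair name,
# and a security group that allows the needed inbound traffic (e.g. SSH).
aws ec2 run-instances \
  --region us-east-1 \
  --image-id ami-xxxxxxxxxxxxxxxxx \
  --instance-type t2.large \
  --key-name my-keypair \
  --security-group-ids sg-xxxxxxxxxxxxxxxxx \
  --iam-instance-profile Name=ec2-role \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=LinuxClient}]'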

Log in to EC2

After the instance is created, log in with an SSH client and verify that the role is attached

wangzan:~ $ ssh -i ~/.ssh/bmc-aws.pem ec2-user@44.192.79.152
The authenticity of host '44.192.79.152 (44.192.79.152)' can't be established.
ECDSA key fingerprint is SHA256:/BosjrkiuZkSIsuSlUHRt2CPITqx8hh8IMfSv9mJVzo.
ECDSA key fingerprint is MD5:88:bc:5c:f0:c8:87:76:da:48:2b:24:06:6b:63:54:92.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '44.192.79.152' (ECDSA) to the list of known hosts.

       __|  __|_  )
       _|  (     /   Amazon Linux 2 AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-2/
[ec2-user@ip-172-31-77-126 ~]$ aws sts get-caller-identity
{
    "Account": "921283538843", 
    "UserId": "AROA5NAGHF6NUFMB62LS2:i-03c9b3efdb246a085", 
    "Arn": "arn:aws:sts::921283538843:assumed-role/ec2-role/i-03c9b3efdb246a085"
}

Configure KDS

This chapter mainly configures two Kinesis Data Streams streams (often abbreviated as KDS), which serve as the sources of continuously generated data for Lab1/Lab4.

Create KDS stream

Kinesis is the family of stream computing services on the AWS cloud, including Kinesis Data Streams, the stream ingestion and buffering platform (analogous to Kafka); Kinesis Data Firehose, the stream delivery pipeline; Kinesis Data Analytics, the stream analytics platform; and Kinesis Video Streams, the video stream processing platform. Here we mainly configure two Kinesis Data Streams streams, used for Lab1/Lab4.

Open the Kinesis console, select "Data streams" on the left menu, and on the page that opens, select "Create data stream"

image-20210319140513848

Create the data stream needed for Lab1 (set the stream name and the number of shards; here the shard count is set to 1)

image-20210319140605236

Use the same method to create the streams needed for lab2 (not used in our labs for now, created only for demonstration) and lab4. After creation, they look like this:

image-20210319140749643
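
The same streams can also be created from the CLI (a sketch; the stream names lab1/lab2/lab4 are assumed to match the ones shown in the screenshots):

aws kinesis create-stream --stream-name lab1 --shard-count 1 --region us-east-1
aws kinesis create-stream --stream-name lab2 --shard-count 1 --region us-east-1
aws kinesis create-stream --stream-name lab4 --shard-count 1 --region us-east-1

# Confirm that the streams exist
aws kinesis list-streams --region us-east-1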

Now that the KDS streams have been created, we can perform the Lab1 streaming data processing experiment.

Deploy RDS

This chapter mainly deploys a relational database (RDS, MySQL 5.7) and imports the corresponding data.

Log in to the RDS console

RDS is the managed database service platform on the AWS cloud, including the AWS-developed Aurora as well as relational databases based on engines such as MySQL/MariaDB/Oracle/SQL Server/PostgreSQL. The RDS instance (MySQL 5.7) we deploy here is mainly used for Lab2.

Configure parameter group and option group

Select "Parameter Group" in the left menu, then click "Create Parameter Group" on the right, select the parameter group series as "mysql5.7", then enter the corresponding name and description, and click "Create".

image-20210319155709358

After creation, click the parameter group name to open the group you just created, type character_ into the parameter filter, and select "Edit parameters". Because we need to work with records containing Chinese text on the command line, we need to change the database character encoding, otherwise garbled characters can easily appear. Change all the modifiable "values" to utf8mb4 and click "Save changes".

image-20210319155926794

Select "Option groups" in the left menu, and then click "Create group" on the right

image-20210319160156374
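
For reference, a CLI sketch of the same parameter group and option group steps (the group names below are assumptions; only a couple of the character_* parameters are shown, edit the remaining ones the same way):

aws rds create-db-parameter-group \
  --db-parameter-group-name bdd-mysql57 \
  --db-parameter-group-family mysql5.7 \
  --description "utf8mb4 parameter group for the lab"

aws rds modify-db-parameter-group \
  --db-parameter-group-name bdd-mysql57 \
  --parameters "ParameterName=character_set_server,ParameterValue=utf8mb4,ApplyMethod=immediate" \
               "ParameterName=character_set_database,ParameterValue=utf8mb4,ApplyMethod=immediate"

aws rds create-option-group \
  --option-group-name bdd-mysql57-options \
  --engine-name mysql \
  --major-engine-version 5.7 \
  --option-group-description "option group for the lab"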

Deploy RDS database

Select "Database" in the left menu, then click "Create Database" on the right, then select the engine, version and template of the database, as shown in the screenshot

image-20210319160402967

Next, set the database instance name, the administrator name, and the password (here wzlinux2021; it can be customized, just make sure not to forget it), and select the database instance class

image-20210319160552318

Keep the default values for storage, availability, and connectivity, expand "Additional configuration", enter the database name (here bdd), select the parameter group and option group you just created, leave everything else at its default, and click "Create database" at the bottom

image-20210319160930852

After the database is ready, click the database name to bring up the connectivity and security page, and copy the corresponding endpoint. Here it is:

bdd.ccganutjnmfy.us-east-1.rds.amazonaws.com
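
A CLI sketch of the same database deployment (the instance class, storage size, and engine version below are illustrative assumptions; the identifier, database name, username, and password follow the console steps above):

# Engine version and instance class are illustrative; any MySQL 5.7.x and a small class will do.
aws rds create-db-instance \
  --db-instance-identifier bdd \
  --engine mysql \
  --engine-version 5.7.33 \
  --db-instance-class db.t3.medium \
  --allocated-storage 20 \
  --master-username admin \
  --master-user-password wzlinux2021 \
  --db-name bdd \
  --db-parameter-group-name bdd-mysql57 \
  --option-group-name bdd-mysql57-options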

Import Data

Download the following SQL file to the deployed EC2 client:

https://imgs.wzlinux.com/aws/bdd.sql

Then log in to the EC2 client and install the mysql client

sudo yum install mysql -y

Then use the mysql client to log in to the database with the following command (you will be prompted for the password)

mysql -h bdd.ccganutjnmfy.us-east-1.rds.amazonaws.com -u admin -p bdd

image-20210319161927168

Then check that the database content matches the output below

MySQL [bdd]> show tables;
+-----------------+
| Tables_in_bdd   |
+-----------------+
| tbl_address     |
| tbl_customer    |
| tbl_product     |
| tbl_transaction |
+-----------------+
4 rows in set (0.00 sec)

MySQL [bdd]> 

Here we have 4 tables, the contents are as follows

Table name        Content                               Rows
tbl_address       Customer address information table    1084
tbl_customer      Customer table                        1084
tbl_product       Product information table             100
tbl_transaction   Historical transaction record table   49874
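
To double-check the row counts against the table above, you can run a quick count query from the EC2 client (a sketch using the endpoint copied earlier):

mysql -h bdd.ccganutjnmfy.us-east-1.rds.amazonaws.com -u admin -p bdd -e "
  SELECT 'tbl_address' AS tbl, COUNT(*) AS cnt FROM tbl_address
  UNION ALL SELECT 'tbl_customer', COUNT(*) FROM tbl_customer
  UNION ALL SELECT 'tbl_product', COUNT(*) FROM tbl_product
  UNION ALL SELECT 'tbl_transaction', COUNT(*) FROM tbl_transaction;"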

Now that we have deployed RDS, we will deploy EMR.

Deploy EMR

EMR (Elastic MapReduce) is a hosted cluster platform on AWS that simplifies the operation of running big data frameworks (such as Apache Hadoop and Apache Spark) on AWS to process and analyze massive amounts of data.

Open the EMR console and select "Create cluster". When the creation page appears, choose "Go to advanced options", select the corresponding release (the latest by default), select Spark (required for Lab2), check the option to use the AWS Glue Data Catalog for table metadata, and then go to the next step

image-20210319163017947

On the hardware configuration page in step two, keep all the defaults. For the general cluster settings in step three, refer to the screenshot below

image-20210319163228232

Then select the corresponding key pair and click "Create Cluster"

image-20210319163304064

It takes about 5 minutes for the cluster to be fully deployed. After the creation is complete, we can perform the Lab2 batch data processing experiment.
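
For reference, a roughly equivalent CLI sketch (the cluster name, release label, instance type and count, and key pair name are assumptions; the spark-hive-site configuration is what switches Spark to the Glue Data Catalog):

aws emr create-cluster \
  --region us-east-1 \
  --name "lab-emr" \
  --release-label emr-6.2.0 \
  --applications Name=Hadoop Name=Spark \
  --use-default-roles \
  --ec2-attributes KeyName=my-keypair \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --configurations '[{"Classification":"spark-hive-site","Properties":{"hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"}}]'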

Deploy Elasticsearch

In this chapter, only one Elasticsearch cluster needs to be deployed, and no other configuration is required for the time being.

Elasticsearch Service (Amazon ES) is a managed service that allows you to easily deploy, operate, and scale Elasticsearch clusters in the AWS cloud.

Open the ES console as follows

image-20210319213813485

If this is your first time using the service, the following page appears; just select "Create a new domain"

image-20210319213839837

We choose "Development and testing" as the deployment type and the latest version

image-20210319213906157

On the configuration page, set the domain name (here lab-es) and set the EBS disk size of the data nodes to 100 GB

image-20210319214022079

image-20210319214008501

On the security configuration page, we choose Public access to make access easier. Under fine-grained access control, select "Create master user"; the user created here has the highest privileges in ES. Then choose the open-access domain access policy, as shown in the following figures:

image-20210319221118385

image-20210319221150561

image-20210319221209502

Leave the other settings at their defaults, then review, confirm, and create the domain. It takes about 10 minutes for the domain to be fully deployed. After the creation is complete, we can start the Lab4 real-time data retrieval experiment.
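
A CLI sketch of the same domain creation (the version number and data-node instance type are assumptions; the fine-grained access control and open access policy from the screenshots can also be supplied via --advanced-security-options and --access-policies):

aws es create-elasticsearch-domain \
  --region us-east-1 \
  --domain-name lab-es \
  --elasticsearch-version 7.9 \
  --elasticsearch-cluster-config InstanceType=r5.large.elasticsearch,InstanceCount=1 \
  --ebs-options EBSEnabled=true,VolumeType=gp2,VolumeSize=100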

Environmental cleanup

The main resources that need to be cleaned up are as follows (a CLI sketch follows the list):
1. Delete the deployed EC2 instance (it is the data source); deleting EC2 first prevents further data from being generated.
2. Delete the Kinesis streams, delivery pipelines, and analytics applications.
3. Delete the RDS database.
4. Delete the EMR cluster.
5. Delete the Elasticsearch cluster.
6. Delete the Redshift cluster.
7. Delete the related Glue crawlers and jobs.
8. Delete the S3 buckets created during the experiments.
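
A CLI sketch of the cleanup (all identifiers are placeholders; repeat the delete-stream command for every stream you created):

aws ec2 terminate-instances --instance-ids i-xxxxxxxxxxxxxxxxx      # the LinuxClient instance
aws kinesis delete-stream --stream-name lab1                        # repeat for lab2 / lab4
aws rds delete-db-instance --db-instance-identifier bdd --skip-final-snapshot
aws emr terminate-clusters --cluster-ids j-XXXXXXXXXXXXX
aws es delete-elasticsearch-domain --domain-name lab-es
aws redshift delete-cluster --cluster-identifier <your-cluster> --skip-final-cluster-snapshot
aws s3 rb s3://<your-lab-bucket> --force                            # deletes the bucket and its contents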

[Note] If the QuickSight service is not used anywhere else, it is recommended to unsubscribe from it as well; please refer to the official documentation:

https://docs.aws.amazon.com/zh_cn/quicksight/latest/user/closing-account.html
