Experiment introduction
This guide will teach you how to use AWS big data and data lake services and components to complete the full process of big data collection, storage, processing, analysis, and visualization. The following major AWS big data services will be introduced:
- Lab1: Real-time streaming data processing, based on the Kinesis product family
- Lab2: Batch data processing, implemented with EMR (Spark)
- Lab3: Data visualization, based on QuickSight + Athena
- Lab4: Real-time data retrieval, based on Elasticsearch
- Lab5: Data warehouse construction and data visualization, implemented with Redshift + QuickSight
To better simulate actual business needs, we have built a database (simulating historical data, or an ODS database that some customers may already have), a real-time data stream (simulating click streams from e-commerce sites, web pages, etc.), platforms for streaming real-time analysis and batch analysis, and corresponding platforms for visualization and real-time data retrieval. The following is the overall architecture diagram of this experiment:
To give everyone a clearer understanding of the data structure, we abstract the table structure in RDS (the relational database) below for reference:
Experiment preparation
To successfully complete all the hands-on experiments, you need to make the following preparations. All resources are created in the AWS us-east-1 region:
Step | Environment | Description |
---|---|---|
01 | Account configuration | Get familiar with the account and login method provided by AWS, and configure the corresponding security options |
02 | Deploy EC2 | Deploy an EC2 (Linux) client for operations and learn to log in remotely |
03 | Configure KDS | Configure Kinesis Data Streams to generate real-time data |
04 | Deploy RDS | Configure the database (in this lab environment, treated as historical data or an ODS environment) |
05 | Deploy EMR | Deploy the big data platform EMR |
06 | Deploy ES | Deploy Elasticsearch, the real-time analysis platform |
Account configuration
IAM (Identity and Access Management) is the AWS service for security-related matters such as users, permissions, and authentication. Here we configure two roles: one used by EC2 to access resources in the cloud (ec2-role), and one used by Glue to access resources in the cloud (glue-role).
Configure role permissions for EC2
Open the IAM console as follows
Click the "Roles" menu on the left and select "Create role"
Select EC2 in AWS service
On the permissions setup page, click "Attach existing policies directly" and add the AdministratorAccess and IAMFullAccess policies.
The tags page needs no configuration; on the review page that follows, confirm that the policies have been attached correctly, set the role name (here ec2-role), and create the role.
Configure role permissions for Glue
Select Glue in AWS service
On the policy filtering page, select AdministratorAccess and IAMFullAccess, click "Next", skip the tag configuration and go straight to the role review, set the name (here glue-role), confirm the policies, and then confirm creation.
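The same role can also be created from the AWS CLI. A minimal sketch, mirroring the console steps above (the trust-policy file name is our own choice; for ec2-role the same flow applies with the ec2.amazonaws.com principal):

```shell
# Trust policy letting the Glue service assume the role.
cat > glue-trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "glue.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}
EOF

# Create the role and attach the two managed policies used in this lab.
aws iam create-role --role-name glue-role \
    --assume-role-policy-document file://glue-trust.json
aws iam attach-role-policy --role-name glue-role \
    --policy-arn arn:aws:iam::aws:policy/AdministratorAccess
aws iam attach-role-policy --role-name glue-role \
    --policy-arn arn:aws:iam::aws:policy/IAMFullAccess
```

These commands require AWS credentials with IAM permissions, so run them from an already-configured environment.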
Deploy EC2
EC2 can be simply understood as a virtual machine on the AWS cloud. Open the EC2 console, select "Amazon Linux 2" as the AMI (Amazon Machine Image) type, and confirm that the architecture is 64-bit x86
You can choose t2.large or t2.xlarge as the instance type (the corresponding t3 series also works; since there is basically no load, in the spirit of frugality any type with more than 2 GB of memory will meet the demand)
On the instance configuration page, set the IAM role to the ec2-role we just created
On the next tags page, we add a tag whose "key" is "Name" (note the capitalization) and whose "value" is "LinuxClient" for easy identification (the tag is optional)
When configuring the security group in the next step, directly use a security group that allows all traffic; if there is none, create one yourself
On the next review page, confirm the configuration and click "Launch". You will then be prompted for the key pair to use for this EC2 instance. Here we choose to create a new one and save it locally (note: it can only be saved at this moment and cannot be downloaded later; replacing a key is very troublesome)
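The console launch above can also be done from the CLI. A sketch, assuming the key pair already exists (`my-keypair` and the SSM-resolved AMI lookup are our assumptions; the role and tag match the console steps):

```shell
# Resolve the latest Amazon Linux 2 x86_64 AMI ID for the current region.
AMI_ID=$(aws ssm get-parameter \
    --name /aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-x86_64-gp2 \
    --query 'Parameter.Value' --output text)

# Launch the client with the ec2-role instance profile and the Name tag.
aws ec2 run-instances \
    --image-id "$AMI_ID" \
    --instance-type t2.large \
    --iam-instance-profile Name=ec2-role \
    --key-name my-keypair \
    --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=LinuxClient}]'
```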
Log in to EC2
After creation, we use an SSH client to log in to the machine and then verify:
wangzan:~ $ ssh -i ~/.ssh/bmc-aws.pem [email protected]
The authenticity of host '44.192.79.152 (44.192.79.152)' can't be established.
ECDSA key fingerprint is SHA256:/BosjrkiuZkSIsuSlUHRt2CPITqx8hh8IMfSv9mJVzo.
ECDSA key fingerprint is MD5:88:bc:5c:f0:c8:87:76:da:48:2b:24:06:6b:63:54:92.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '44.192.79.152' (ECDSA) to the list of known hosts.
__| __|_ )
_| ( / Amazon Linux 2 AMI
___|\___|___|
https://aws.amazon.com/amazon-linux-2/
[ec2-user@ip-172-31-77-126 ~]$ aws sts get-caller-identity
{
"Account": "921283538843",
"UserId": "AROA5NAGHF6NUFMB62LS2:i-03c9b3efdb246a085",
"Arn": "arn:aws:sts::921283538843:assumed-role/ec2-role/i-03c9b3efdb246a085"
}
Configure KDS
This chapter mainly configures two Kinesis Data Streams (sometimes abbreviated as KDS), which serve as the sources of continuously generated data for Lab1/4.
Create KDS stream
Kinesis is the family of stream-computing services on the AWS cloud, including Kinesis Data Streams, a platform for receiving and carrying streaming data (comparable to Kafka); Kinesis Data Firehose, a stream delivery pipeline; Kinesis Data Analytics, a stream analysis platform; and Kinesis Video Streams, a platform for processing streaming video. Here we mainly configure two Kinesis Data Streams, used for Lab1/4.
Open the Kinesis console, select "Data streams" on the left menu, and on the page that opens, select "Create data stream"
Create the data stream needed for Lab1 (set the stream name and the number of shards; here it is set to 1)
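One shard is enough for this lab, but for sizing real workloads it helps to know that each shard accepts up to 1 MiB/s or 1,000 records/s of writes. A quick back-of-the-envelope calculation (the traffic figures below are hypothetical):

```shell
# Hypothetical workload: 1,500 records/s at roughly 1 KiB per record.
records_per_sec=1500
record_kib=1

# Shards needed by record rate (1,000 records/s per shard), rounded up.
by_records=$(( (records_per_sec + 999) / 1000 ))
# Shards needed by throughput (1 MiB/s = 1,024 KiB/s per shard), rounded up.
by_bytes=$(( (records_per_sec * record_kib + 1023) / 1024 ))
# The stream needs the larger of the two constraints.
shards=$(( by_records > by_bytes ? by_records : by_bytes ))
echo "shards needed: $shards"
```

For this lab's trickle of demo events, both constraints round up to 1, hence a single shard.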
Use the same method to create the data streams needed for Lab2 (not used in our experiment for now, shown only for demonstration) and Lab4. After creation, the list looks as follows
Now that we have created the KDS streams, we can perform the Lab1 streaming-data-processing experiment.
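The console steps above can also be scripted. A sketch for the Lab1 stream (the stream name `lab1` is an assumption; match it to whatever you entered in the console):

```shell
# Create the stream with a single shard.
aws kinesis create-stream --stream-name lab1 --shard-count 1

# Creation is asynchronous; wait until the stream is ACTIVE, then inspect it.
aws kinesis wait stream-exists --stream-name lab1
aws kinesis describe-stream-summary --stream-name lab1 \
    --query 'StreamDescriptionSummary.{Name:StreamName,Status:StreamStatus,Shards:OpenShardCount}'
```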
Deploy RDS
This chapter mainly deploys a relational database (RDS, MySQL 5.7) and imports the corresponding data.
Log in to the RDS console
RDS is the managed database platform service on the AWS cloud, including the AWS-developed Aurora as well as relational databases based on engines such as MySQL, MariaDB, Oracle, SQL Server, and PostgreSQL. The RDS instance (MySQL 5.7) deployed here is mainly used for Lab2.
Configuration parameter group and option group
Select "Parameter groups" in the left menu, then click "Create parameter group" on the right, select "mysql5.7" as the parameter group family, enter the corresponding name and description, and click "Create".
After creation, click the parameter group's name to open it, type character_ into the parameter search box, and select "Edit parameters". Because we need to work with Chinese content on the command line, we must change the database encoding, otherwise garbled text can easily appear. Change every "value" that can be modified to utf8mb4, then click "Save changes".
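Why utf8mb4 rather than MySQL's legacy utf8: the legacy charset stores at most 3 bytes per character, while emoji and some supplementary-plane CJK characters encode to 4 bytes in UTF-8. A quick local check of the byte length:

```shell
# One emoji is a 4-byte UTF-8 sequence -- too wide for MySQL's 3-byte
# legacy "utf8", which is why the parameter group switches to utf8mb4.
emoji_bytes=$(printf '😀' | wc -c | tr -d ' ')
echo "UTF-8 bytes in one emoji: $emoji_bytes"
```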
Select "Option groups" in the left menu, and then click "Create group" on the right
Deploy RDS database
Select "Database" in the left menu, then click "Create Database" on the right, then select the engine, version and template of the database, as shown in the screenshot
Next, set the database instance name, the administrator name, and the password (here admin / wzlinux2021; they can be customized, the main thing is not to forget them), and select the instance class
Keep the default values for storage, availability, and network connectivity. Click "Additional configuration", enter the database name (here bdd), select the "Parameter group" and "Option group" just created, leave everything else at its default, and click "Create database" at the bottom
After the database is ready, click the database name to open the connectivity and security page, and copy the corresponding endpoint. Here it is:
bdd.ccganutjnmfy.us-east-1.rds.amazonaws.com
Import Data
Download the following SQL file to the deployed EC2 client:
https://imgs.wzlinux.com/aws/bdd.sql
Then log in to the EC2 client and install the MySQL client:
sudo yum install mysql -y
Then use the mysql client to log in to the database with the following command (you will be prompted for the password):
mysql -h bdd.ccganutjnmfy.us-east-1.rds.amazonaws.com -u admin -p bdd
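The dump then needs to be loaded into the bdd database before checking. One way to do it in a single step (assuming bdd.sql was saved to the current directory and your endpoint replaces the one shown):

```shell
# Fetch the dump and feed it to the remote database (prompts for the password).
wget https://imgs.wzlinux.com/aws/bdd.sql
mysql -h bdd.ccganutjnmfy.us-east-1.rds.amazonaws.com -u admin -p bdd < bdd.sql
```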
Then check that the database content matches the following:
MySQL [bdd]> show tables;
+-----------------+
| Tables_in_bdd |
+-----------------+
| tbl_address |
| tbl_customer |
| tbl_product |
| tbl_transaction |
+-----------------+
4 rows in set (0.00 sec)
MySQL [bdd]>
Here we have 4 tables, with contents as follows
Table name | Content | Rows |
---|---|---|
tbl_address | Customer address information table | 1084 |
tbl_customer | Customer table | 1084 |
tbl_product | Product information table | 100 |
tbl_transaction | Historical transaction records table | 49874 |
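The row counts in the table above can be verified from the EC2 client in one query (endpoint as above; substitute your own):

```shell
# Count the rows of all four tables in a single round trip.
mysql -h bdd.ccganutjnmfy.us-east-1.rds.amazonaws.com -u admin -p bdd -e "
  SELECT 'tbl_address'     AS tbl, COUNT(*) AS row_count FROM tbl_address
  UNION ALL SELECT 'tbl_customer',    COUNT(*) FROM tbl_customer
  UNION ALL SELECT 'tbl_product',     COUNT(*) FROM tbl_product
  UNION ALL SELECT 'tbl_transaction', COUNT(*) FROM tbl_transaction;"
```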
Now that RDS is deployed, we will next deploy EMR.
Deploy EMR
EMR (Elastic MapReduce) is a hosted cluster platform on AWS that simplifies the operation of running big data frameworks (such as Apache Hadoop and Apache Spark) on AWS to process and analyze massive amounts of data.
Open the EMR console and directly select "Create cluster". When the creation page appears, select "Go to advanced options", choose the release version (the latest by default), select Spark (required for Lab2), check the option to use the Glue Data Catalog, and go to the next step
For the hardware configuration page in step two, keep all the defaults. For the cluster settings in step three, refer to the screenshot below
Then select the corresponding key pair and click "Create Cluster"
It takes about 5 minutes for the whole cluster to deploy. After creation completes, we can perform the Lab2 batch-data-processing operations.
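The console flow above corresponds roughly to one CLI call. A sketch (the release label, instance type/count, and key name are assumptions; the spark-hive-site classification is what wires Spark to the Glue Data Catalog):

```shell
aws emr create-cluster \
    --name "lab-emr" \
    --release-label emr-6.2.0 \
    --applications Name=Spark \
    --configurations '[{"Classification":"spark-hive-site","Properties":{"hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"}}]' \
    --use-default-roles \
    --ec2-attributes KeyName=my-keypair \
    --instance-type m5.xlarge --instance-count 3
```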
Deploy Elasticsearch
In this chapter, only an Elasticsearch cluster needs to be deployed; no other configuration is required for now.
Elasticsearch Service (Amazon ES) is a managed service that allows you to easily deploy, operate, and scale Elasticsearch clusters in the AWS cloud.
Open the ES console as follows
If this is your first time using the service, the following page appears; just select "Create a new domain"
We choose "Development and testing" and the latest version
On the configuration page, we set the cluster name (here lab-es) and set the EBS disk size of the data nodes to 100 GiB
On the security configuration page, we choose public access to make the cluster easy to reach. Under fine-grained access control, select "Create master user" (the user created here has the highest ES privileges), and choose an open resource access policy, as shown in the figure below:
Keep the other settings at their defaults, review, confirm, and create the cluster. It takes about 10 minutes for the domain to fully deploy. After creation completes, we can start the Lab4 real-time data retrieval experiment.
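A CLI sketch of the same domain (the version and instance type are assumptions; the master-user values are placeholders — fine-grained access control also requires node-to-node encryption, encryption at rest, and enforced HTTPS, hence those flags):

```shell
aws es create-elasticsearch-domain \
    --domain-name lab-es \
    --elasticsearch-version 7.9 \
    --elasticsearch-cluster-config InstanceType=r5.large.elasticsearch,InstanceCount=1 \
    --ebs-options EBSEnabled=true,VolumeType=gp2,VolumeSize=100 \
    --node-to-node-encryption-options Enabled=true \
    --encryption-at-rest-options Enabled=true \
    --domain-endpoint-options EnforceHTTPS=true \
    --advanced-security-options 'Enabled=true,InternalUserDatabaseEnabled=true,MasterUserOptions={MasterUserName=admin,MasterUserPassword=ChangeMe123!}'
```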
Environmental cleanup
The main contents that need to be cleared are as follows:
1. Delete the deployed EC2 instance (this is the data source; clearing EC2 first avoids generating more data);
2. Delete each Kinesis stream, delivery pipeline, and analytics application;
3. Delete the RDS database;
4. Delete the EMR cluster;
5. Delete the Elasticsearch cluster;
6. Delete the Redshift cluster;
7. Delete the related Glue crawlers and jobs;
8. Delete the S3 buckets created during the experiment.
[Note] If the QuickSight service is not used anywhere else, it is also recommended to unsubscribe from it; please refer to the official documentation:
https://docs.aws.amazon.com/zh_cn/quicksight/latest/user/closing-account.html
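Several of the cleanup steps can be scripted. A sketch using the names from earlier in this guide (the EMR cluster ID placeholder must come from your own environment; verify each deletion in the console afterwards):

```shell
# Stop the data source first, then tear down the rest.
aws ec2 terminate-instances --instance-ids i-03c9b3efdb246a085     # your LinuxClient ID
aws kinesis delete-stream --stream-name lab1                       # repeat for each stream
aws rds delete-db-instance --db-instance-identifier bdd --skip-final-snapshot
aws emr terminate-clusters --cluster-ids j-XXXXXXXXXXXX            # your cluster ID
aws es delete-elasticsearch-domain --domain-name lab-es
```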