In this exercise, you will learn how to use Amazon Redshift and Amazon QuickSight to build a data visualization application. You will load data from the data lake into Amazon's data warehouse and display it with a fully managed data visualization tool.
The objectives of this experiment are:
1. Create a Redshift cluster
2. Load the S3 data files into the Redshift database in batches
3. Use QuickSight to visualize the data table
The architecture diagram of this experiment is as follows
Build a data warehouse
1. View the data
Check whether the Parquet files generated in the EMR experiment exist in the S3 bucket (here, s3://lab-921283538843-wzlinux-com/spark/output).
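The same check can be scripted. The following is a minimal sketch, assuming boto3 is installed and AWS credentials are configured; the helper names split_s3_uri and parquet_keys are hypothetical and not part of the original lab.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/prefix' into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def parquet_keys(s3_client, uri):
    """Return the object keys under the URI that end in '.parquet'."""
    bucket, prefix = split_s3_uri(uri)
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages for an empty prefix carry no "Contents" entry.
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".parquet"):
                keys.append(obj["Key"])
    return keys


# Usage (assumption: boto3 installed, credentials configured):
#   import boto3
#   uri = "s3://lab-921283538843-wzlinux-com/spark/output"
#   print(parquet_keys(boto3.client("s3"), uri))
```

An empty list means the EMR experiment output is missing and the later COPY step would have nothing to load.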
2. Create IAM Role
Open the IAM service, click Roles -> Create role, and select Redshift as the service.
Select the "Redshift - Customizable" use case and click Next to go to the permissions page.
Select the AmazonS3ReadOnlyAccess permission policy.
Enter the role name myRedshiftRole and confirm to create the role.
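For reference, the console steps above correspond roughly to two IAM API calls. The sketch below assumes boto3 and sufficient IAM permissions; the function names are hypothetical, while the role name and managed policy are the ones used in this lab.

```python
import json

# AWS managed policy selected in the console steps above.
S3_READ_ONLY_ARN = "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"


def redshift_trust_policy():
    """Trust policy that lets the Redshift service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "redshift.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_redshift_role(iam_client, role_name="myRedshiftRole"):
    """Create the role, attach AmazonS3ReadOnlyAccess, and return the role ARN."""
    response = iam_client.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(redshift_trust_policy()),
    )
    iam_client.attach_role_policy(RoleName=role_name, PolicyArn=S3_READ_ONLY_ARN)
    return response["Role"]["Arn"]


# Usage (assumption: boto3 installed, credentials configured):
#   import boto3
#   print(create_redshift_role(boto3.client("iam")))
```

The returned ARN is the value that later appears in the COPY command.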
3. Create a subnet group
Before creating a Redshift cluster, create a subnet group. Open the Redshift service and select "CONFIG" -> "Manage Subnet Group" in the left menu bar.
Then select "Create a cluster subnet group". You can accept the default subnet group name "cluster-subnet-group-1" and enter any descriptive text in the description box. Select "Default VPC", select "Add all subnets for this VPC", and then click "Create cluster subnet group" to finish creating the subnet group.
4. Create a Redshift cluster
Select "Cluster" in the left menu, click "Create Cluster", and set the name of the cluster (it must start with an English letter and may contain letters, digits, and hyphens; do not use Chinese or other special characters). Select the node type dc2.large.
For the database configuration, accept the default values and enter the master user password (remember the password you entered).
In the cluster permissions, select the myRedshiftRole role created earlier and click "Associate IAM role".
In the other configuration options, select the default VPC, the default security group, and the cluster subnet group created earlier, then click "Create Cluster" to confirm. After about 5 minutes, the cluster status becomes "Available".
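The naming rule in this step can be checked before you submit the form. The sketch below encodes only the rule as stated in this lab (an English letter first, then letters, digits, or hyphens); Redshift itself may enforce further constraints, such as a length limit, so treat it as a quick pre-check. The function name is hypothetical.

```python
import re

# Assumption: only the rule described in this lab is encoded here.
_CLUSTER_NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9-]*")


def is_valid_cluster_name(name):
    """Return True if the name starts with an ASCII letter and uses only letters, digits, hyphens."""
    return _CLUSTER_NAME_RE.fullmatch(name) is not None
```

For example, a name beginning with a digit or containing an underscore is rejected by this check.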
5. Access the Redshift database
There are two ways to access the Redshift database: through the query editor on the Redshift Console, or through a SQL client (such as SQL Workbench/J).
In this experiment, for convenience, use the query editor on the Redshift Console. Select "Editor" in the left menu, enter the connection parameters in the "Connect to Database" window, and then click "Connect to Database".
6. Create a table
In the query editor, select "public" under Select Schema on the left, and then enter the following SQL statement in the SQL query window to create the table:
create table table1(
tno varchar(20),
tdate varchar(15),
uno varchar(10),
pno varchar(10),
tnum int,
uname varchar(20),
umobile varchar(20),
ano varchar(20),
acity varchar(50),
aname varchar(50),
pclass varchar(10),
pname varchar(50),
price decimal(10, 2)
);
Click "Run"; the result should show "Completed".
7. Import S3 data
Open a new SQL query window (here, Query 2) and enter the SQL command below to load the S3 data. Replace the account ID in the role ARN with your actual account ID, and use the S3 bucket address you confirmed in step 1.
copy table1 from 's3://lab-921283538843-wzlinux-com/spark/output/'
credentials 'aws_iam_role=arn:aws:iam::921283538843:role/myRedshiftRole'
format as parquet;
Click "Run"; the result should show "Completed". Enter "select * from table1;" in Query 3 to query the data in the table, and enter "select count(*) from table1;" in Query 4 to count the rows. If both queries return data, the data in S3 has been copied to the Redshift data warehouse.
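Because the COPY command in this step embeds both the bucket address and the account ID, it is easy to leave one of them unreplaced. The sketch below builds the statement from its parts; the function name is hypothetical, and the values are interpolated as-is, so it should only be used with trusted input.

```python
def build_copy_statement(table, s3_uri, account_id, role_name):
    """Return a COPY statement that loads Parquet files using an IAM role."""
    # AWS account IDs are 12-digit numbers.
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("account_id must be a 12-digit AWS account ID")
    role_arn = f"arn:aws:iam::{account_id}:role/{role_name}"
    return (
        f"copy {table} from '{s3_uri}'\n"
        f"credentials 'aws_iam_role={role_arn}'\n"
        "format as parquet;"
    )


# Reproduces the command used in this lab.
statement = build_copy_statement(
    "table1",
    "s3://lab-921283538843-wzlinux-com/spark/output/",
    "921283538843",
    "myRedshiftRole",
)
print(statement)
```

Paste the printed statement into the query editor, or pass it to whichever SQL client you use.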
8. Allow Internet access
In the next step, we will use Amazon QuickSight to visualize the data in Redshift. Before that, QuickSight must be able to reach Redshift over the Internet. To do this, first allocate an Elastic IP address in the EC2 console (the process is omitted here), then modify the Redshift cluster properties to allow public access.
Change "Publicly accessible" to "Yes" and select the Elastic IP address you just allocated.
The change takes a few minutes to apply; wait until it completes.
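The two console actions in this step (allocating an Elastic IP and making the cluster publicly accessible) can also be sketched with boto3. This is an illustration under the assumption that boto3 is installed and the caller has EC2 and Redshift permissions; the function name and the cluster identifier passed in are hypothetical.

```python
def make_publicly_accessible(ec2_client, redshift_client, cluster_id):
    """Allocate an Elastic IP and assign it to the cluster as its public address."""
    # Allocate a new Elastic IP for use inside a VPC.
    address = ec2_client.allocate_address(Domain="vpc")
    # Turn on public access and bind the cluster to that address.
    redshift_client.modify_cluster(
        ClusterIdentifier=cluster_id,
        PubliclyAccessible=True,
        ElasticIp=address["PublicIp"],
    )
    return address["PublicIp"]


# Usage (assumption: boto3 installed, credentials configured):
#   import boto3
#   ip = make_publicly_accessible(
#       boto3.client("ec2"), boto3.client("redshift"), "my-cluster-id"
#   )
```

As in the console, the modification is applied asynchronously, so the cluster is not reachable at the new address immediately after the call returns.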
Data visualization
1. Enable Quicksight
Enabling QuickSight is not covered here; see Lab3 for the steps.
2. Create a data set
Open the QuickSight console, click the data sets entry on the left, and choose to create a "new data set".
Select the Redshift (automatic discovery) data source. Redshift can also be connected manually, but that is not demonstrated here.
Enter the connection parameters and select "Create data source". Select the corresponding Redshift database and make sure the address, port, database name, user name, and password are configured correctly.
Select table1, click "Select", and finally click "Visualize" to finish creating the data set (here we choose to import the data from Redshift into QuickSight, so analysis will be much faster).
3. Data visualization
Open the visualization window and select "vertical bar chart" as the display mode.
Drag the tdate field to the X axis field and the tnum field to the Value field (the system automatically selects the aggregation).
The chart now shows the date on the X axis and the total sales volume of each day on the Y axis, sorted from high to low.
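To make clear what the bar chart computes, the sketch below performs the same aggregation in plain Python: it sums tnum per tdate and sorts the days from the highest total to the lowest. The function name and the sample rows are hypothetical and are not data from the lab.

```python
from collections import defaultdict


def daily_totals(rows):
    """Sum tnum per tdate and sort the days from highest to lowest total."""
    totals = defaultdict(int)
    for tdate, tnum in rows:
        totals[tdate] += tnum
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample rows in (tdate, tnum) form.
sample = [("2018-3-1", 2), ("2018-3-2", 5), ("2018-3-1", 1)]
print(daily_totals(sample))
```

Each resulting pair corresponds to one bar: the date is the X axis label and the summed tnum is the bar height.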