Build a code-free visual data analysis and modeling platform on Amazon Web Services

Modern enterprises often need data analysis and machine learning to help solve business pain points. In manufacturing, for example, data collected from equipment is used for predictive maintenance and quality control; in retail, data collected from clients is used for channel conversion analysis and personalized recommendations.

The Amazon cloud technology developer community provides developers with global development technology resources: technical documents, development cases, technical columns, training videos, events, competitions, and more. It helps Chinese developers connect with the world's most cutting-edge technologies, ideas, and projects, and recommends outstanding Chinese developers and technologies to the global cloud community. If you haven't followed or bookmarked it yet, click here to make it your technical treasure trove!

Usually, the business department submits requirements to the technical team, which translates them into technical requirements and mobilizes data engineers, data scientists, machine learning engineers, and others to do the data processing, analysis, and modeling. The whole process is long, carries high cross-team communication costs, and places demands on the enterprise's talent pool and skills. Business departments generally want to lower the threshold and learning cost of using machine learning to solve business problems, so that business analysts can quickly derive data insights in their own domains with less help from data scientists and data engineers.

This article takes fault analysis in the automotive industry as an example to demonstrate how to build a code-free data analysis platform on Amazon Web Services. Business users need no programming skills, SQL, or prior knowledge of machine learning; according to their business scenario and specific needs, they can upload and import data for analysis on their own, putting data to work in the shortest time and most convenient way.

Scenarios and Pain Points

The abnormal failure rate of vehicles is usually affected by multiple factors, such as production batch problems, service life, dealer maintenance, etc.

In the past, the quality assurance department passively accepted problem reports and maintenance requests from scattered customers. Only when problems with a certain model or batch accumulated to a certain extent, or even broke out, could the affected vehicles be identified and recalled. Sudden outbreaks of failures made it impossible for the relevant departments to budget maintenance funds, prepare spare parts, or take control measures in advance, which also hurt the owner experience.

Therefore, analyzing vehicle warranty data, performing time-series or cluster analysis on fault occurrences, and predicting fault trends from existing data can help business departments achieve quality early warning and quality improvement, letting the enterprise and its departments budget in advance and take corresponding measures early to reduce overall maintenance costs.

Technical Goals

Based on real vehicle sales data and vehicle maintenance data (the two schemas are shown in the tables below), build a model to draw curves that describe failures, classify the curves, group similar failure curves together, and screen out abnormal failure curves, in order to predict abnormal faults and provide early warning.

  • Model failure curves: for example, the growth curve of vehicle failure counts over time, mileage, or vehicle age. This article uses vehicle age as the example dimension; the other dimensions are handled similarly, only the aggregation conditions change.
  • Group similar curve shapes and screen out abnormal fault curves
  • Predict trends from existing data
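To make the "failure curve" concrete, here is a minimal pure-Python sketch of the aggregation behind it: counting failures per vehicle-age bucket for one model series. The sample records and column meanings are hypothetical stand-ins for the repair data described below.

```python
from collections import Counter

# Hypothetical repair records as (series, car_age_month) pairs; real data
# would come from the repair dataset described in this article.
repairs = [
    ("SERIES1", 12), ("SERIES1", 12), ("SERIES1", 24),
    ("SERIES2", 12), ("SERIES1", 36), ("SERIES2", 24),
]

def failure_curve(records, series):
    """Count failures per vehicle-age bucket for one model series."""
    counts = Counter(age for s, age in records if s == series)
    # Sorting by age makes the result read as a curve over vehicle age.
    return sorted(counts.items())

print(failure_curve(repairs, "SERIES1"))  # [(12, 2), (24, 1), (36, 1)]
```

The same aggregation over time or mileage only changes the second element of each pair, which is why the article says only the aggregation conditions differ.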
Sample Schemas
  1. Repair data

image.png

  2. Sales data

image.png

Architecture

  • Choose Glue DataBrew as the main data processing tool. Amazon Glue DataBrew is a visual tool that helps data analysts and data scientists clean and transform data, so the data can then be applied to analytics and machine learning scenarios. Glue DataBrew provides more than 250 pre-built transformations that can be executed through the UI without writing any code: filtering abnormal data, converting data to a standard format, correcting invalid values, and so on.
  • Use a Glue crawler to catalog the data schema, use Athena as the connector, and finally use QuickSight as the BI tool for dashboard display.
  • Use the data processed by DataBrew as model input, and use SageMaker Canvas to generate a predictive model. Amazon SageMaker Canvas expands access to machine learning (ML) by providing business analysts with a visual, point-and-click interface that leverages AutoML to automatically create ML models for their use cases, without requiring any machine learning experience or writing any code. SageMaker Canvas is also integrated with Amazon SageMaker Studio, making it easier for business analysts to share models and datasets with data scientists, who can validate and further optimize the ML models.

image.png

Prerequisites

  1. Glue DataBrew is supported in most regions (including Beijing and Ningxia); you can switch to the target region in the console. SageMaker Canvas is currently available in some regions; please refer to this FAQ document for the specific list. This article uses Ohio (us-east-2) as an example.
  2. Store the sample data in an S3 bucket in the target region first. If you are not sure how to upload, please refer to this S3 document.

Solution Steps

1. Use Glue DataBrew for data transformation

This section accomplishes the following:

  • Clean invalid data
  • Transform the vehicle maintenance data to match the sales data
  • Join the vehicle maintenance data with the sales data
  • Aggregate the data by vehicle model and vehicle age
  • Calculate the failure rate

Detailed Steps

1. Open the DataBrew console, click "Datasets" in the left column, then click "Connect new dataset" to create a data source.

2. Name the dataset (for example, "repair-data"), select its S3 location, set the data format to CSV with the default "," delimiter, and click "Create dataset". Similarly, create a dataset named "sales-data" in CSV format with the default "," delimiter.

image.png

3. Select the newly created dataset repair-data and choose "Create project with this dataset". Enter a project name and recipe name, select an existing IAM role or create a new one, and make sure this IAM role has permission to access the selected data. Click Create Project and wait for the DataBrew interface to load.

4. The first step is to clean invalid data.

(1) Select the CarAgeDay column (the age of the car in days), click "Filter", and keep only values greater than 0 (or customize this threshold as needed). As shown in the figure, after setting the "Greater than 0" condition, click "Add to recipe"; a preview is generated on the right. Click "Apply" to take effect.

image.png

(2) Filter the mileage column, keeping only valid values greater than 0. Again, click "Add to recipe" and then "Apply".

image.png
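The two filter steps above are equivalent to roughly this pure-Python sketch; the sample CSV content is a hypothetical stand-in, and only the column names CarAgeDay and Mileage come from the article's schema.

```python
import csv
import io

# Minimal stand-in for the repair dataset; column names follow this article.
raw = """CarAgeDay,Mileage,Series
365,12000,SERIES1
-10,500,SERIES2
730,-1,SERIES1
"""

rows = list(csv.DictReader(io.StringIO(raw)))
# Keep only rows where CarAgeDay > 0 and Mileage > 0,
# mirroring the two DataBrew filter steps.
valid = [r for r in rows if int(r["CarAgeDay"]) > 0 and int(r["Mileage"]) > 0]
print(len(valid))  # 1
```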

5. The second step merges the sales year and sales quarter in the vehicle maintenance data into one field to match the sales data: click the recipe on the right, add a step, and select the Concatenate Columns operation to merge RegiDate_Year and RegiDate_Quarter, with "Q" as the separator and "year_quarter" as the target column name.

image.png

image.png
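The Concatenate Columns step above boils down to a simple string join; here is a one-function sketch (the column values shown are hypothetical examples).

```python
def year_quarter(year, quarter):
    # Mirrors the Concatenate Columns step:
    # RegiDate_Year + "Q" + RegiDate_Quarter -> year_quarter
    return f"{year}Q{quarter}"

print(year_quarter("2021", "3"))  # 2021Q3
```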

6. The third step joins the vehicle maintenance data with the sales data: click the JOIN operation, select sales-data as the dataset to join, choose LEFT JOIN with year_quarter and Sales_Quarter as the join keys, preview the result, and click Finish.

image.png

image.png
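The LEFT JOIN above can be sketched in plain Python: every maintenance row is kept, and sales columns are filled in where the quarter matches. The rows and the total_sales_number field are hypothetical illustrations.

```python
# Hypothetical rows after the previous steps.
repair_rows = [
    {"Series": "SERIES1", "year_quarter": "2021Q1"},
    {"Series": "SERIES2", "year_quarter": "2021Q2"},
]
sales_rows = [
    {"Sales_Quarter": "2021Q1", "total_sales_number": 500},
]

# LEFT JOIN on year_quarter == Sales_Quarter: every repair row is kept;
# sales columns are merged in only where a match exists.
sales_by_quarter = {s["Sales_Quarter"]: s for s in sales_rows}
joined = [{**r, **sales_by_quarter.get(r["year_quarter"], {})} for r in repair_rows]
print(joined[0]["total_sales_number"])  # 500
```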

7. The fourth step aggregates the data by model and vehicle age: as shown in the figure below, group by Series_new and CarAgeMonth to generate the failure count and sales count.

image.png
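A minimal sketch of this Group By step, assuming one joined row per repair record and that the sales figure is constant within a group (the sample rows are hypothetical):

```python
from collections import defaultdict

# Hypothetical joined rows; one row per repair record.
rows = [
    {"Series_new": "SERIES1", "CarAgeMonth": 12, "total_sales_number": 500},
    {"Series_new": "SERIES1", "CarAgeMonth": 12, "total_sales_number": 500},
    {"Series_new": "SERIES1", "CarAgeMonth": 24, "total_sales_number": 500},
]

# Group by (model, vehicle age): count failures, carry the sales figure through.
groups = defaultdict(lambda: {"defect_number": 0, "total_sales_number": 0})
for r in rows:
    g = groups[(r["Series_new"], r["CarAgeMonth"])]
    g["defect_number"] += 1
    g["total_sales_number"] = r["total_sales_number"]

print(groups[("SERIES1", 12)]["defect_number"])  # 2
```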

8. The fifth step calculates the failure rate.

(1) To perform the division, first change defect_number to the INT data type.

image.png

image.png

(2) To calculate the failure rate, choose the DIVIDE operation: the first column is defect_number, the second column is total_sales_number, and the target column is named defect_rate.

image.png

9. The sixth step filters defect_rate to remove invalid values less than 0.

image.png
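The DIVIDE step and the filter on defect_rate together amount to the following sketch. The aggregated rows are hypothetical; the sentinel value of -1 for rows without valid sales is an assumption made here to illustrate why the final filter removes negative rates.

```python
# Hypothetical aggregated rows from the Group By step.
aggregated = [
    {"series": "SERIES1", "caragemonth": 12, "defect_number": 2, "total_sales_number": 500},
    {"series": "SERIES1", "caragemonth": 24, "defect_number": 1, "total_sales_number": 0},
]

result = []
for row in aggregated:
    sales = row["total_sales_number"]
    # DIVIDE step: defect_rate = defect_number / total_sales_number;
    # rows without a valid sales figure get a sentinel the filter removes.
    rate = row["defect_number"] / sales if sales > 0 else -1.0
    if rate >= 0:  # filter step: drop invalid defect_rate values
        result.append({**row, "defect_rate": rate})

print(result[0]["defect_rate"])  # 0.004
```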

10. The final data and formula are shown in the figure. With the per-model aggregation, we can track the failure rate across different vehicle ages. Of course, you can do further processing and transformation of the data as needed.

image.png

11. Click "Create job" in the upper right to apply this recipe to the entire dataset. Define the job name and the file output location (an S3 path), select an IAM role (it needs read and write permissions on the corresponding S3 location), and click "Create and run job" at the bottom. Data processing completes in about 2 minutes.

image.png

12. Finally, you can publish the recipe to save it, so that next time it can be applied directly to other sample sets.

image.png

2. Use QuickSight for data display

This section accomplishes the following:

  • Use a Glue crawler to catalog the schema of the DataBrew output
  • Use Athena as the connector to the BI tool QuickSight
  • Customize widgets and dashboards in QuickSight

Detailed Steps

1. Go to Glue Crawler, click "Add crawler", and define the S3 data location, which is the DataBrew output location from the previous section. After the definition is complete, remember to click "Run crawler" to start the task. When the job completes, check the data schema under "Tables" on the left.

image.png

image.png

2. Because Athena uses Glue's data catalog, switch to Athena and you can see the defect-rate table we just crawled. This article also uses Glue to crawl the other two original tables, as shown below.

image.png

3. Go to the QuickSight console, add a dataset, and select Athena as the data source. Following the prompts, select the target table and load the data into QuickSight.

image.png

image.png

4. After the import succeeds, add a New analysis, where you can explore and display the data along different dimensions and with different chart types. This article takes the defect rate and sales volume of different models as an example. We can see that SERIES4 is a relatively popular model with a relatively high failure rate, and the failures mostly occur when the car is about three years old; based on this pattern, we can do customer care and vehicle inspection in advance. Since the focus of this article is not QuickSight itself, we will not expand on it here. If you are not familiar with QuickSight, you can click this tutorial for reference.

image.png

image.png

5. After all the data is integrated, click "Share" in the upper right to publish it as a dashboard.

3. Use SageMaker Canvas as the machine learning tool

This section accomplishes the following:

  • Additional processing of the data
  • Merge multiple data files
  • Build a model with SageMaker Canvas
  • Generate predictions

Detailed Steps

1. We can use the existing data for vehicle sales forecasts, maintenance forecasts, and so on. This article takes the failure rate as the prediction target. First, remove the other redundant columns in DataBrew, keeping only three columns: series, caragemonth, and defect_rate. The goal is to infer the failure rate of different model series from series and car age. (You can also remove the extra columns manually after downloading the CSV in the next step.)

2. Because SageMaker Canvas currently only supports a single file as a model dataset, we first use Athena to merge the multiple DataBrew output files into a single CSV: open Athena, run SELECT *, and download the result as shown in the lower right corner of the screenshot, which yields a single CSV file.

image.png
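If you prefer not to go through Athena, the merge can also be done locally. Here is a small sketch, assuming all part files share the same header; the function name and file layout are illustrative, not part of any AWS tooling.

```python
import csv
import glob

def merge_csv_parts(paths):
    """Merge several CSV part files (same header) into one list of rows,
    keeping the header only once."""
    merged, header = [], None
    for path in paths:
        with open(path, newline="") as f:
            reader = csv.reader(f)
            h = next(reader)  # each part file repeats the header
            if header is None:
                header = h
                merged.append(header)
            merged.extend(reader)
    return merged

# Example: merge_csv_parts(sorted(glob.glob("databrew-output/*.csv")))
```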

3. Remove the unneeded columns as described in step 1.

4. Upload the data to a SageMaker Canvas dataset.

5. Create a model, select this dataset, and select defect_rate as the target. If you want to generate a model quickly, choose Quick build (2-15 min); if you need higher accuracy, choose Standard build (2-4 hours). This article chooses Quick build.

image.png

6. After the model is generated, the failure rate can be predicted from the car age. You can upload a CSV file for batch prediction, or enter single values for a single prediction. This article takes the former as an example: upload the series and car ages (120 months) you want to predict, and you get the following results.

image.png

7. If you chose Standard build, after the model is created, SageMaker Canvas can also share the model to Amazon SageMaker Studio with one click, so that business analysts can invite data scientists to validate and further optimize the ML model on the shared model and datasets, up to the level of production deployment.

image.png

Conclusion

This article presents a graphical data processing and AutoML solution, using services such as Glue DataBrew and SageMaker Canvas to build a code-free data analysis and machine learning platform. On the one hand, it helps business analysts reduce the learning curve for data processing and ML expertise, lowers cross-departmental communication costs, keeps AutoML results interpretable, and facilitates sharing and continuous optimization with data scientists at the model and dataset levels. On the other hand, the platform is serverless: customers do not need to manage servers and pay as they go.

About the Authors

image.png

Li Tiange is an Amazon solutions architect responsible for consulting on and designing Amazon-based cloud computing solution architectures. He specializes in development, serverless, and other fields, and has rich experience in solving customers' practical problems.

image.png

Liang Rui is an Amazon solutions architect mainly responsible for cloud adoption by enterprise customers, serving customers in automotive, traditional manufacturing, finance, hotels, aviation, tourism, and more, and specializes in DevOps. He has 11 years of experience in IT professional services, having served as a program developer, software architect, and solution architect.

Article source: https://dev.amazoncloud.cn/column/article/630b3a84269604139cb5e9ea?sc_medium=regulartraffic&sc_campaign=crossplatform&sc_channel=CSDN 


Origin: blog.csdn.net/u012365585/article/details/132612168