Huawei Big Data HCIE Big Data Essay Questions - Data Mining Process

Preface

The data mining process is a frequently tested topic that often appears in written exam questions and essay questions. Here is a brief summary.

Please explain the CRISP-DM model of the data mining process

CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It is one of the most widely used process standards in the data mining industry. It emphasizes applying data mining technology to business problems, and it serves as a standard for managing and guiding data miners so that they can carry out data mining work effectively and accurately and obtain the best mining results.

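The basic steps of the CRISP-DM model are:
	Business understanding
	Data understanding
	Data preparation
	Modeling
	Model evaluation
	Model implementation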

Business understanding

This initial phase focuses on understanding the project's goals and requirements from a business perspective, then translating that understanding into a data mining problem definition and a preliminary plan for achieving the goals. Specifically, it includes:

Determine business goals: analyze the background of the project, analyze its goals and needs from a business perspective, and define success criteria from a business perspective;
Project feasibility analysis: analyze the available resources, the conditions and constraints, the estimated risks, and the estimated costs and benefits;
Determine data mining goals: clarify the goals and success criteria of the data mining work. Data mining goals differ from business goals: they are technical, such as generating a decision tree;
Propose a project plan: make a plan for the entire project and an initial estimate of the tools and techniques to be used.

Data understanding

The data understanding phase begins with collecting the raw data. It continues with activities to become familiar with the data, identify data quality problems, gain a preliminary understanding of the data, and discover interesting subsets that suggest hypotheses about hidden information. Specifically, it includes:
Collect the original data: collect the data involved in the project, load it into data processing tools if necessary, do some preliminary data integration, and produce the corresponding report;
Describe the data: give a rough description of the data, such as the number of records and attributes, and produce the corresponding report;
Explore the data: perform simple statistical analysis on the data, such as the distribution of key attributes;
Check the data quality: check whether the data is complete, whether it contains errors, and whether there are missing values.
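
The describe, explore, and quality-check tasks above can be sketched roughly as follows. This is only an illustration using the Python standard library; the records, attribute names, and helper functions are all hypothetical, and a real project would typically use a dedicated data processing tool.

```python
from collections import Counter
from statistics import mean

# Hypothetical raw records, invented for illustration; None marks a missing value.
records = [
    {"customer_id": 1, "age": 34, "plan": "basic"},
    {"customer_id": 2, "age": None, "plan": "premium"},
    {"customer_id": 3, "age": 51, "plan": "basic"},
    {"customer_id": 4, "age": 29, "plan": None},
]

def describe(rows):
    """Describe the data: number of records and the attribute names."""
    attributes = sorted({key for row in rows for key in row})
    return {"n_records": len(rows), "attributes": attributes}

def missing_counts(rows):
    """Check data quality: count missing (None) values per attribute."""
    counts = Counter()
    for row in rows:
        for key, value in row.items():
            if value is None:
                counts[key] += 1
    return dict(counts)

def distribution(rows, attribute):
    """Explore the data: frequency of each non-missing value of one attribute."""
    return Counter(row[attribute] for row in rows if row[attribute] is not None)

summary = describe(records)
missing = missing_counts(records)
plan_distribution = distribution(records, "plan")
mean_age = mean(row["age"] for row in records if row["age"] is not None)

print(summary)
print(missing)
print(plan_distribution)
print(mean_age)
```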

Data preparation

The data preparation phase covers all activities that construct the final data set (the data that will be fed into the modeling tool) from the raw data. Data preparation tasks may be performed multiple times and in no prescribed order. They include selecting tables, records, and attributes, as well as transforming and cleaning the data as the modeling tool requires. Specifically, it includes:
Data selection: select appropriate data according to the data mining goals and the data quality, including table selection, record selection, and attribute selection;
Data cleaning: improve the quality of the selected data, for example by removing noise and filling in missing values;
Data creation: generate new attributes or records from the original data;
Data merging: use table joins to combine several data sets;
Data formatting: convert the data into a format suitable for data mining processing.
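
As a minimal sketch of the five preparation tasks listed above, the following example walks a tiny data set through each one in turn. The tables, attribute names, and the choice of mean imputation are assumptions made for illustration, not part of CRISP-DM itself.

```python
from statistics import mean

# Hypothetical source tables, invented for illustration; None marks a missing value.
customers = [
    {"customer_id": 1, "age": 34, "notes": "vip"},
    {"customer_id": 2, "age": None, "notes": ""},
    {"customer_id": 3, "age": 51, "notes": ""},
]
orders = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 1, "amount": 80.0},
    {"customer_id": 3, "amount": 40.0},
]

# Data selection: keep only the attributes relevant to the mining goal.
selected = [{"customer_id": c["customer_id"], "age": c["age"]} for c in customers]

# Data cleaning: fill missing ages with the mean of the known ages (an assumed strategy).
known_ages = [c["age"] for c in selected if c["age"] is not None]
fill_value = mean(known_ages)
cleaned = [{**c, "age": fill_value if c["age"] is None else c["age"]} for c in selected]

# Data creation: derive a new attribute from an existing one.
created = [{**c, "is_senior": c["age"] >= 50} for c in cleaned]

# Data merging: join per-customer order totals onto the customer rows.
totals = {}
for order in orders:
    totals[order["customer_id"]] = totals.get(order["customer_id"], 0.0) + order["amount"]
merged = [{**c, "total_spent": totals.get(c["customer_id"], 0.0)} for c in created]

# Data formatting: convert each record into a numeric feature row.
feature_rows = [[float(c["age"]), int(c["is_senior"]), c["total_spent"]] for c in merged]

print(feature_rows)
```

Each step produces a new list rather than modifying the previous one, so any single task can be rerun on its own, which matches the point that preparation tasks are often repeated.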

Modeling

In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. There are usually several modeling techniques for the same type of data mining problem. Some techniques have specific requirements for the data format, so it is often necessary to return to the data preparation phase. Specifically, it includes:
Select the modeling technique: determine the data mining algorithms and parameters, possibly using several algorithms;
Design the test plan: design a mechanism for testing the quality and effectiveness of the model;
Train the model: run the data mining algorithm on the prepared data set to produce one or more models;
Test and evaluate the model: test according to the test plan to determine, from a data mining perspective, whether the data mining goals have been met.
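
The test plan, training, and testing tasks above can be illustrated with a deliberately small example. It assumes a hypothetical one-feature data set, a simple holdout split as the test plan, and a decision stump (a one-level decision tree) as the modeling technique; none of these choices is prescribed by CRISP-DM.

```python
# Hypothetical prepared data: (feature value, class label) pairs.
data = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 1), (5.0, 1), (6.0, 1),
        (1.5, 0), (5.5, 1)]

# Test plan: hold out the last two examples to measure accuracy on unseen data.
train, test = data[:6], data[6:]

def accuracy(threshold, rows):
    """Fraction of rows where the rule 'x > threshold' predicts the label correctly."""
    correct = sum(1 for x, label in rows if int(x > threshold) == label)
    return correct / len(rows)

def train_stump(rows):
    """Model training: pick the threshold with the best training accuracy.

    Candidate thresholds are the midpoints between consecutive sorted feature values.
    """
    values = sorted({x for x, _ in rows})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return max(candidates, key=lambda t: accuracy(t, rows))

# Model training on the training split only.
threshold = train_stump(train)

# Model testing: apply the test plan to the held-out examples.
test_accuracy = accuracy(threshold, test)

print(threshold, test_accuracy)
```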

Model evaluation

By the time you reach this stage of the project, you have built one or more models that appear to be of high quality from a data analysis perspective. Before a model is finally released, it is important to evaluate it more thoroughly and to review the steps taken to build it, to ensure that it truly meets the business objectives. A key purpose of this stage is to determine whether any important business issue has not been adequately considered. A decision about how to use the data mining results should be made at the end of this phase. Specifically, it includes:
Result evaluation: evaluate the resulting model from a business perspective, possibly even trying it out in practice to test its effect;
Process review: review all the processes of the project to make sure no mistakes were made at any stage;
Determine the next step: based on the conclusions of the result evaluation and the process review, decide whether to deploy the mining model or to start over from a certain stage.
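
The "determine the next step" decision can be sketched as a small function. Everything here is hypothetical: the function name, the metric, and the threshold are invented for illustration, and sending a model that misses the business criterion back to business understanding is an assumption about which stage to restart from.

```python
# Phases that precede evaluation, in order.
PHASES = ["business understanding", "data understanding",
          "data preparation", "modeling"]

def decide_next_step(business_metric, success_threshold, process_issues):
    """Return the next step after the evaluation phase.

    business_metric: result measured from a business perspective (hypothetical,
        for example the response rate observed in a trial run).
    success_threshold: success criterion fixed during business understanding.
    process_issues: names of the phases in which the process review found mistakes.
    """
    if process_issues:
        # Start over from the earliest phase in which a mistake was found.
        earliest = min(process_issues, key=PHASES.index)
        return f"return to {earliest}"
    if business_metric >= success_threshold:
        return "deploy"
    # Assumption: a model that misses the business criterion sends the
    # project back to business understanding.
    return "return to business understanding"

print(decide_next_step(0.12, 0.10, []))
print(decide_next_step(0.12, 0.10, ["modeling", "data preparation"]))
```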

Model implementation

Creating the model is usually not the end of the project. Even if the purpose of modeling is only to increase understanding of the data, the knowledge gained needs to be organized and presented in a way that the client can use. Specifically, it includes:
Implementation plan: plan how to deploy the model in business operations;
Monitoring and maintenance plan: decide how to monitor the use of the model in actual business and how to maintain it;
Final report: summarize the project, the project experience, and the project results;
Project review: review how the project was carried out, summarize the experience and lessons learned, and predict the operational effect of the data mining results.

Origin blog.csdn.net/qq_37633855/article/details/123186619