Enterprise big data mining service platform (TipDM Big Data Mining Modeling Platform): quickly build data mining projects

"TipDM big data mining modeling platform" (hereinafter referred to as the platform) is a data mining modeling platform based on Python engine independently developed by Guangdong Teddy Intelligent Technology Co., Ltd. Using the out-of-the-box algorithm components configured on the platform, users can operate by dragging and dropping without programming foundation, and process data input and output, data preprocessing, mining modeling and other links in a streamlined manner Connect to help users quickly build data mining projects and improve the efficiency of data processing. At present, it has been widely used in many enterprises and institutions such as China Southern Power Grid, China Electric Power Research Institute, Zhujiang Digital, Beijing Smart Petition, China Petroleum Exploration Research Institute, Light Industry Environmental Protection Research Institute, and Highway Science Research Institute of the Ministry of Transport. The interface of the platform is shown in Figure 1.

 Figure 1 Platform interface diagram

Platform introduction

The TipDM Big Data Mining Modeling Platform has the following main characteristics.

(1) The platform's algorithms run on a Python engine for data mining modeling. Python is currently one of the most popular languages for data mining and modeling, making it a natural fit for this purpose.

(2) Users can build data mining pipelines by dragging and dropping in an intuitive visual graphical interface, without any Python programming background.

(3) Provides public example data mining projects that can be created with one click and run quickly. Supports online preview of the results of each node in the mining pipeline, and provides real-time log viewing to quickly locate problems.

(4) Provides dozens of algorithm components in eight categories, covering common data mining tasks such as data preprocessing, statistical analysis, classification, clustering, and text analysis. A Python script component is also provided: paste your code into it and run, as in the sketch below.
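For illustration, here is a minimal sketch of the kind of script a user might paste into the Python script component. The component's actual input/output wiring is platform-specific, so this sketch simply assumes a local CSV file (the file name is hypothetical):

```python
# Minimal, self-contained example of a script that could be pasted into
# the Python script component. "data.csv" is a hypothetical input file;
# the platform's real input/output conventions are component-specific.
import pandas as pd

df = pd.read_csv("data.csv")   # load the dataset
df = df.dropna()               # drop rows with missing values
print(df.describe())           # basic full-table statistics
```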

The platform is divided into three main modules: data space, my project, and algorithm components.

Data space

[Data space] is used for importing and managing datasets. Users can import data of any type from their local machine into the platform, as shown in Figure 2. They can also choose whether to upload the data as a public dataset shared with other users, as shown in Figure 3.

Figure 2 New dataset

Figure 3 Upload public dataset

My project

[My project] is used for creating and managing data mining pipelines. Through the [My project] module, you can create a blank project and configure its data mining pipeline, as shown in Figure 4. A well-built project can be saved as a template, as shown in Figure 5; other users can then use the template to create a data mining project with the algorithms already configured and run it with one click.

Figure 4 Project

Figure 5 Template

Algorithm components

In the platform, each data mining algorithm is packaged as a component. [Algorithm components] are divided into two parts: system algorithm components and personal algorithm components. System algorithm components are the default algorithms provided by the platform; users can use them in a project directly, without any editing. Personal algorithm components are components that users write themselves in Python when the system algorithm components cannot meet their requirements; a hypothetical sketch of one follows.
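The platform's component interface is not documented in this article, so the following is a purely hypothetical sketch of what the core of a personal algorithm component might look like: a Python function that receives a pandas DataFrame and returns a transformed one (the function name and signature are invented for illustration):

```python
# Hypothetical core of a personal algorithm component: a function that
# takes a pandas DataFrame and returns a transformed DataFrame. The real
# component signature is defined by the platform and may differ.
import pandas as pd

def min_max_scale(df: pd.DataFrame) -> pd.DataFrame:
    """Scale every numeric column into the [0, 1] range."""
    out = df.copy()
    cols = out.select_dtypes(include="number").columns
    span = out[cols].max() - out[cols].min()
    # Note: a constant column (span == 0) would need special handling.
    out[cols] = (out[cols] - out[cols].min()) / span
    return out
```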

System algorithm components fall into eight categories: input/output, statistical analysis, preprocessing, script, clustering, classification, regression, and text analysis, as shown in Figure 6. A sketch of rough open-source equivalents follows the list below.

(1) [Input/Output] Provides components for configuring the data input of a mining project, including: input source.

(2) [Statistical analysis] Provides common components for statistically analyzing the overall state of the data, including: correlation analysis, normality test, principal component analysis, full-table statistics, stationarity test, factor analysis, and chi-square test.

(3) [Preprocessing] Provides components for cleaning data, including: primary key merging, table stacking, record deduplication, new sequence, data standardization, data splitting, frequency statistics, derived variables, missing value handling, data sorting, and group aggregation.

(4) [Script component] Provides a code editing box; users can paste prepared program code into it and run it directly, without any extra configuration to turn it into a component. Includes: Python script.

(5) [Classification] Provides commonly used classification algorithm components, including: CART classification tree, K-nearest neighbors, naive Bayes, support vector machine, logistic regression, AdaBoost, and random forest.

(6) [Clustering] Provides commonly used clustering algorithm components, including: hierarchical clustering, DBSCAN density clustering, K-Means clustering, K-medoids clustering, and fuzzy clustering.

(7) [Regression] Provides commonly used regression algorithm components, including: CART regression tree, linear regression, support vector regression, and K-nearest neighbors regression.

(8) [Text analysis] Provides commonly used text analysis algorithm components, including: HanLP word segmentation and part-of-speech tagging, long short-term memory (LSTM) network training, stop-word filtering, word2vec, jieba keyword extraction, regular-expression matching, word/document vectors, TextRank, and more.
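The components above wrap algorithms that also exist in the open Python ecosystem. As a rough illustration only (not the platform's internal code), the normality test, data standardization, and CART classification tree components correspond to calls like these, run on toy data:

```python
# Rough open-source equivalents of a few of the components listed above,
# on toy data (illustrative only; not the platform's implementation).
import numpy as np
from scipy import stats
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                  # toy feature matrix
y = (X[:, 0] > 0).astype(int)                  # toy binary label

stat, p = stats.shapiro(X[:, 0])               # normality test
X_std = StandardScaler().fit_transform(X)      # data standardization
clf = DecisionTreeClassifier().fit(X_std, y)   # CART classification tree
print(f"Shapiro p-value: {p:.3f}, training accuracy: {clf.score(X_std, y):.3f}")
```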

Figure 6 System algorithm components

Figure 7 Personal algorithm components

Next, let's walk through the platform by using it to build an iris clustering project.

Figure 8 Upload data

Figure 9 Create a project

Figure 10 Configure the input source component

Figure 11 Configure the KMeans component
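Figures 8 to 11 configure the whole flow in the GUI. As a rough script equivalent of the same flow (load the iris data as the input source, then feed it to a K-Means component with three clusters), a scikit-learn version might look like this:

```python
# Script equivalent of the GUI flow in Figures 8-11: load the iris
# dataset (the input source), then cluster it with K-Means.
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X = load_iris().data                       # 150 samples, 4 features
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(km.labels_[:10])                     # cluster label of first 10 samples
print(km.cluster_centers_)                 # the three cluster centers
```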
