Practical industrial big data analysis with Python: what is correlation analysis good for, and what conclusions can it support?

0. Preface

Recently, while developing big data products for refined oil depots, I used Pearson correlation analysis [1] on the various data items of the oil distribution system and produced a "Pearson correlation heat map".

At the design review meeting, the leader asked: What is the use of correlation analysis? What conclusions can be drawn from it? The correlation between two individual data items means little on its own. Can we see the correlation with the system as a whole?

In recent years the term "big data" has become familiar to the public, yet it has always come across as cold and remote. Faced with big data, I believe many people are at a loss.

In the era of big data, "we are no longer keen on finding causality, but should look for correlations between things" [3]. For the ideas behind big data, please refer to [3]; this article will not repeat them.

1. Background

Industrial big data is the product of combining the Internet, big data, and industry. It is the foothold of national strategies such as Made in China 2025, the Industrial Internet, and Industry 4.0.

German industry has completed the process of industrial automation. Building on automation and industrial data, it introduces cloud computing and artificial intelligence to raise the level of intelligence in industry and meet society's demand for mass customization. The United States has strong cloud computing, Internet, and data processing capabilities. On this basis, it has proposed an Industrial Internet strategy that uses big data to connect the data of individual devices, production lines, and factories, and to exploit the value of industrial services such as diagnosis, prediction, and after-sales service. [4]

The application of big data in industrial enterprises is mainly reflected in three aspects:

  • The first is data-based product value mining. Secondary mining of products and related data can create new value.
  • The second is raising the level of service-oriented production, that is, increasing the share of service value in production (the product).
  • The third is business model innovation.

Professional consulting organizations give "eight application scenarios of industrial big data", which are:

  • 1. Accelerate product innovation
  • 2. Product fault diagnosis and prediction
  • 3. Big data application of industrial IoT production line
  • 4. Analysis and optimization of industrial supply chain
  • 5. Product sales forecast and demand management
  • 6. Production planning and scheduling
  • 7. Product quality management and analysis
  • 8. Industrial pollution and environmental protection testing

2. What is correlation

"All things are connected" is one of the core ideas of big data.

The connection referred to here is the mutual influence, mutual restriction, and mutual confirmation between things. This interdependence between things is called a correlative relationship, or correlation for short.

Correlation between mathematical variables [6]

Correlation: when one or several interrelated variables take a given value, the value of the corresponding other variable is not uniquely determined, but still varies within a certain range according to some rule. This kind of non-deterministic relationship between variables is called correlation.

Classified by degree

(1) Complete correlation: the change in one variable is uniquely determined by the change in the other variable, that is, a functional relationship.

(2) Incomplete correlation: the relationship between the two variables lies between no correlation and complete correlation.

(3) No correlation: the two variables change independently of each other, and there is no relationship between them.

Classified by direction

(1) Positive correlation: the two variables change in the same direction. In a scatter plot the points lie in a band running from the lower left corner to the upper right corner; that is, as one variable goes from small to large, the other also goes from small to large.

(2) Negative correlation: the two variables change in opposite directions. In a scatter plot the points lie in a band running from the upper left corner to the lower right corner; that is, as one variable goes from small to large, the other goes from large to small.

Classified by form

(1) Linear correlation: when one variable changes, the other changes by a roughly proportional amount.

(2) Non-linear correlation (curvilinear correlation): when one variable changes, the other also changes, but not in proportion.
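The difference between the two forms can be illustrated numerically. The following is a minimal sketch on synthetic data (the arrays are made up for the example and have nothing to do with the oil depot data set). It suggests why a Pearson coefficient close to zero does not by itself rule out a non-linear relationship:

```python
import numpy as np

# Synthetic x values, symmetric around 0 (an assumption made for the example)
x = np.linspace(-1.0, 1.0, 201)

y_linear = 3.0 * x + 1.0   # linear relationship
y_curve = x ** 2           # non-linear (curvilinear) relationship

# np.corrcoef returns the 2x2 correlation matrix; [0, 1] is the coefficient
r_linear = np.corrcoef(x, y_linear)[0, 1]
r_curve = np.corrcoef(x, y_curve)[0, 1]

print(f"linear: r = {r_linear:.3f}")
print(f"curve:  r = {r_curve:.3f}")
```

In the second case y is completely determined by x, yet the linear coefficient is practically zero because the rising and falling halves of the curve cancel out.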

Classified by number of variables

(1) Simple correlation: the correlation between one independent variable and one dependent variable.

(2) Multiple correlation (complex correlation): the correlation between two or more independent variables and the same dependent variable.

(3) Partial correlation: when the dependent variable is related to two or more independent variables, the correlation between the dependent variable and one of the independent variables, studied while the remaining independent variables are held constant.

Big data makes possible many things that used to sound incredible, because in the past we insisted on certainty in everything, whereas now we talk about possibility.

In an industrial control system, how is the correlation between the individual systems reflected? This article focuses on correlation versus causation, correlation and influencing factors, and the implementation of the algorithms.

2.1. Relations

Correlation analysis and regression analysis are closely related in practice. Regression analysis, however, is concerned with the functional form in which one random variable Y depends on another random variable (or group of random variables) X. In correlation analysis the variables have equal status, and the analysis focuses on the correlation characteristics between the random variables. For example, if X and Y record the mathematics and Chinese scores of primary school students, the interest lies in the relationship between the two, not in predicting Y from X.

For example, the Pearson correlation coefficient is used to measure the correlation (linear correlation) between two variables X and Y, and its value is between -1 and 1.

$$\rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y}$$

$$\rho_{X,Y} = \frac{\sum_{i=1}^{n} (X_i - \overline{X})(Y_i - \overline{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \overline{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \overline{Y})^2}}$$
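The sample form of the Pearson coefficient can be written out directly in a few lines. This is a minimal sketch; the function name `pearson` and the five sample values are hypothetical and only serve as an illustration:

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation computed from the definition (sample form)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()   # deviations from the mean of X
    dy = y - y.mean()   # deviations from the mean of Y
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Hypothetical sample: set amount vs. delivered amount for five loadings
set_amount = [10.0, 20.0, 30.0, 40.0, 50.0]
delivered = [9.8, 20.1, 29.7, 40.3, 49.9]

r = pearson(set_amount, delivered)
print(round(r, 4))
```

The result should agree with `np.corrcoef(set_amount, delivered)[0, 1]` up to floating-point error, which makes for an easy sanity check.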

2.2. Multiple correlation

Multiple correlation studies the degree of correlation between one variable $x_0$ and a set of variables $(x_1, x_2, \ldots, x_n)$. For example, occupational prestige is affected simultaneously by a series of factors (income, education, power, ...); the relationship between these factors taken together and occupational prestige is a multiple correlation. To determine the multiple correlation coefficient $R_{0.12\ldots n}$, first fit the regression of $x_0$ on $x_1, x_2, \ldots, x_n$, then compute the simple linear correlation between $x_0$ and the value estimated by that regression.

The multiple correlation coefficient $R_{0.12\ldots n}$ takes values in the range $0 \le R_{0.12\ldots n} \le 1$. The larger its value, the closer the relationship between the variables.

The correlation between several variables and one variable cannot be measured directly; it can only be measured indirectly. The multiple correlation coefficient is calculated as follows.

Let the dependent variable be $y$ and the independent variables be $x_1, x_2, \ldots, x_n$. Construct the linear model:

$$y = b_0 + b_1 x_1 + \cdots + b_n x_n + \varepsilon$$

$$\hat{y} = b_0 + b_1 x_1 + \cdots + b_n x_n$$

The correlation analysis between $y$ and $x_1, x_2, \ldots, x_n$ then reduces to a simple correlation analysis between $y$ and $\hat{y}$.

Notation:

  • $r_{y.x_1 \ldots x_n}$ is the multiple correlation coefficient of $y$ and $x_1, x_2, \ldots, x_n$;
  • $r_{y.\hat{y}}$ is the simple correlation coefficient between $y$ and $\hat{y}$.

The formula for $r_{y.x_1 \ldots x_n}$ is:

$$R = \mathrm{corr}(y, x_1, \ldots, x_n) = \mathrm{corr}(y, \hat{y}) = \frac{\mathrm{cov}(y, \hat{y})}{\sqrt{\mathrm{var}(y)\,\mathrm{var}(\hat{y})}}$$

The multiple correlation coefficient is often used in multiple linear regression analysis. We want to know the degree of correlation between the dependent variable and a set of independent variables, that is, multiple correlation. The multiple correlation coefficient reflects the closeness of one variable to another set of variables.
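As a rough illustration of this definition, the sketch below fits an ordinary least squares model with NumPy on synthetic data and takes the simple correlation between y and the fitted values. The coefficients 2.0 and -1.0, the noise level, and the sample size are arbitrary assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y depends linearly on x1 and x2, plus noise
n = 500
X = rng.normal(size=(n, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Fit y = b0 + b1*x1 + b2*x2 by ordinary least squares
A = np.column_stack([np.ones(n), X])   # first column is the intercept
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ coef

# Multiple correlation coefficient: simple correlation between y and y_hat
R = np.corrcoef(y, y_hat)[0, 1]

# For an OLS fit with an intercept, evaluated on the fitted data,
# R**2 equals the coefficient of determination
ss_res = ((y - y_hat) ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
r_squared = 1.0 - ss_res / ss_tot

print(f"R = {R:.4f}, R^2 = {R ** 2:.4f}, 1 - SS_res/SS_tot = {r_squared:.4f}")
```

The identity between R squared and the coefficient of determination only holds when the correlation is computed on the same rows that were used for fitting. On held-out data the two quantities can differ.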

2.3. Partial correlation

When there are several variables, partial correlation studies the degree of linear correlation between two of them while the influence of the other variables is controlled. It is also called net correlation. For example, the partial correlation coefficient $r_{13.2}$ represents the linear correlation between $x_1$ and $x_3$ after controlling for the influence of $x_2$. The partial correlation coefficient reflects the relationship between two variables more faithfully than the simple linear correlation coefficient.
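The article does not give a formula for the partial coefficient. A common textbook form for the first-order case is $r_{13.2} = (r_{13} - r_{12} r_{23}) / \sqrt{(1 - r_{12}^2)(1 - r_{23}^2)}$, and the sketch below applies it to synthetic data in which x1 and x3 are related only through a common driver x2. The function name `partial_corr` and all data are assumptions made for the illustration:

```python
import numpy as np

def partial_corr(x1, x3, x2):
    """First-order partial correlation r_{13.2} from the pairwise coefficients."""
    r13 = np.corrcoef(x1, x3)[0, 1]
    r12 = np.corrcoef(x1, x2)[0, 1]
    r23 = np.corrcoef(x2, x3)[0, 1]
    return (r13 - r12 * r23) / np.sqrt((1 - r12 ** 2) * (1 - r23 ** 2))

rng = np.random.default_rng(1)
n = 2000
x2 = rng.normal(size=n)                # common driver
x1 = x2 + 0.5 * rng.normal(size=n)     # depends only on x2
x3 = x2 + 0.5 * rng.normal(size=n)     # depends only on x2

r_simple = np.corrcoef(x1, x3)[0, 1]
r_partial = partial_corr(x1, x3, x2)
print(f"simple r13 = {r_simple:.3f}, partial r13.2 = {r_partial:.3f}")
```

The simple coefficient is high because both variables follow x2, while the partial coefficient is close to zero once x2 is controlled for.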

3. Case study

3.1. Simplified description of industrial application scenarios

During oil delivery at the depot, our industrial control system records the following data. The goal of the analysis is to improve oil delivery efficiency and, by identifying the key influencing factors, to optimize the delivery process and the related operating procedures.

Correlation analysis has the following results:

3.1.1. Pairwise correlation between data items

As shown in the figure below, the pairwise Pearson correlations are listed for analysis (see [1] for a detailed description of the program code):

(1) The amount of oil delivered is strongly positively correlated with the set amount;

(2) Loss and overflow are negatively correlated with the month.

On further analysis, we find that the oil delivery rate and the oil delivery duration are weakly negatively correlated, which means that loading large-tonnage tankers takes longer. Why?

In addition, contradictions mostly exist in pairs, and the weights of, and relationships between, the different aspects of a contradiction can also be measured with correlation coefficients.
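A pairwise correlation matrix of the kind behind such a heat map can be sketched with pandas. The data frame here is synthetic and its three columns are only illustrative stand-ins for the loading records; the actual program is described in [1]. Plotting is left out, and only the matrix and a ranking of the pairs are produced:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 300

# Synthetic stand-in for the loading records (assumed values, not real data)
set_amount = rng.uniform(5, 30, size=n)
df = pd.DataFrame({
    "Set amount": set_amount,
    "Oil volume": set_amount + rng.normal(scale=0.05, size=n),
    "Temperature": rng.normal(20, 5, size=n),
})

# Pairwise Pearson matrix: the values a heat map would color
corr = df.corr(method="pearson")
print(corr.round(3))

# Rank each distinct pair of columns by the absolute value of r
cols = list(corr.columns)
pairs = [(a, b, corr.loc[a, b]) for i, a in enumerate(cols) for b in cols[i + 1:]]
pairs.sort(key=lambda t: abs(t[2]), reverse=True)
for a, b, r in pairs:
    print(f"{a} vs {b}: r = {r:.3f}")
```

Sorting by absolute value keeps strong negative correlations near the top of the list, where a plain descending sort would put them at the bottom.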

3.1.2. Multiple correlation of each data item with the whole

Take the oil delivery control system as an example. It consists of data items such as 'duration', 'crane position', 'set amount', 'oil volume', 'oil rate', 'time', 'month', 'date', 'loss and overflow', 'temperature', and 'density'. To analyze the relationship between duration and the other features of the oil delivery system, the procedure is as follows:

(1) Take "Duration" as the $y$ value; the remaining items ['Crane position', 'Set amount', 'Oil volume', 'Oil rate', 'Time', 'Month', 'Date', 'Loss and overflow', 'Temperature', 'Density'] form the multivariate input $(x_1, x_2, \ldots, x_n)$;

(2) Build a multiple linear regression model (LinearRegression) and train it on the $X$, $Y$ above;

(3) Use the model to predict the $y$ value, here the "duration";

(4) Compute the Pearson correlation coefficient between the predicted and actual values of "Duration" to obtain the correlation;

(5) Repeat the above steps, taking each feature in turn as $y$ and the others as $x$.

This yields the correlation list shown in the figure below.

3.1.3. The use of correlation as a basis for big data analysis

Because the example above is confined to a single system, it does not show the full advantage of correlation analysis. Nevertheless, the results are consistent with the causality we would expect in theory:

(1) The set amount, the amount of oil delivered, and the loss and overflow are strongly positively correlated with the oil delivery system as a whole;

(2) Likewise, the duration, temperature, and month, as well as the set amount, the amount of oil delivered, and the loss and overflow, show a strong positive correlation with the oil delivery system;

(3) For the oil delivery system, the date and the time of delivery show little correlation and can be ignored as influencing factors.

Based on the business characteristics of the oil distribution system and the results above, we can put forward the following proposals:

(1) Perform cluster analysis on the oil delivery duration to find its patterns, as a basis for setting weights or evaluation indicators;

(2) To analyze oil delivery efficiency, more data items (including those of other control systems) need to be associated before carrying out correlation and trend analysis and evaluating efficiency. (The oil delivery rate is a calculated value, the actual amount delivered divided by the duration, yet it shows only a weak negative correlation, and its correlation with the oil delivery system is not very high.)

It is recommended to evaluate the oil delivery efficiency of each tanker and the company it belongs to.

(3) The business of an oil depot is the transfer of oil products between upstream refineries and downstream terminals such as gas stations; its operating targets are storage capacity and turnover throughput. Big data analysis should take the storage limit as a constraint, correlate market (planning) factors, treat safety, fire protection, and environmental protection as red lines, and evaluate the operation of the depot item by item.
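The cluster analysis of delivery duration suggested in item (1) could look roughly like the sketch below. It uses KMeans from scikit-learn, which is one possible choice and not necessarily the algorithm the author has in mind. The durations are synthetic, and the number of clusters (2) is an assumption made for the example:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)

# Synthetic loading durations in minutes: a "short" and a "long" group
durations = np.concatenate([
    rng.normal(10.0, 1.0, size=200),
    rng.normal(30.0, 2.0, size=200),
])

# KMeans expects a 2-D array of shape (n_samples, n_features)
X = durations.reshape(-1, 1)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Sort the centers so that the short-duration group comes first
centers = np.sort(km.cluster_centers_.ravel())
print("cluster centers (minutes):", centers.round(1))
```

On real records the number of clusters would have to be chosen from the data, for example by comparing the within-cluster variance for several candidate values.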

3.1.4. Business Innovation

In an industrial control environment it is hard to talk about innovation on the basis of industrial big data. Borrowing from Internet thinking, we should strengthen correlation in data analysis, in particular the correlation between inside and outside and between different systems, and avoid rigid causal thinking confined to separate (small) systems.

For example, in the case discussed in this article, the local view of the oil distribution system is extended to other related systems, so that the part is analyzed within the whole and the analysis is extended beyond the system as a whole.

For example, we can benchmark against the purchase, sales, and inventory model of a self-operated Internet retailer: the person or vehicle collecting oil is the customer, the oil depot is the platform, the oil product is the goods, and the oil delivery process corresponds to online shopping. For online shopping, big data analysis yields the customer's preferences (when they collect oil and which products), the customer's rating (tank truck loading efficiency, operating skill, safety level, and so on), and delivery/receiving efficiency and safety (how well crane positions match tanks, the safety of delivery, and so on). The details are as follows:

(1) Internal-external correlation analysis: tank truck profile, loading island/crane position evaluation

This concerns the oil delivery system and the tank truck, where the tank truck is an external system relative to the depot. By correlating the loading data items of the two, a tank truck profile is built, covering efficiency, safety rating, crane position matching, and so on (it can be extended to the corresponding transport company or category); cluster analysis can also be applied.

Use: to supplement the analysis of operating efficiency, assign each tanker its optimal crane position, and achieve high loading efficiency, low loss and overflow, and safety.

(2) Correlation analysis between internal control systems: matching of crane positions and oil tanks

The crane positions of the oil delivery system and the tank storage of the oil unloading system are linked during oil transfer. By correlating the crane position and tank data items, an oil transfer efficiency model and an oil transfer safety model are built. In the deterministic analysis, unsupervised learning algorithms such as clustering and the restricted Boltzmann machine (RBM) can be used to analyze the categories of oil delivery efficiency/capability and their influencing factors.

Use: to supplement the analysis of operating efficiency, optimize the matching of crane positions and tanks, and achieve efficient, low-loss, and safe sending and receiving of oil.

(3) Correlation between management dimensions: importance of each management dimension

An enterprise evaluates seven indicators for ensuring the efficient operation of refined oil depots (depot planning and process, equipment and facilities, automation and informatization, on-site management, organization and management, health, safety and environmental protection, etc.). The correlation of each indicator's score with the overall evaluation can then be analyzed.

Drive decision-making with data: let the data quantify each management dimension, its importance, and its interdependence with the others, and give managers a basis for decisions that reflects development trends.

3.2. Demo code for multiple correlation

3.2.1. Code description

The source code was developed with Python 3.6.7 on Windows 10. The main custom functions are:

(1)feature_label_split

Initializes the data set, optionally standardizes it (normalization=True), and splits it into X and Y.

(2)train_model

Trains the multiple linear model, using the LinearRegression model from sklearn.linear_model.

(3)multi_corr

Calculates the Pearson correlation coefficient between Y and the predicted value, using the method built into pandas:

data.corr(method='pearson', min_periods=1)
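To make the pandas call more concrete, here is a small sketch on a hypothetical two-column frame of actual and predicted values (the numbers are invented). It shows which entry of the returned matrix is the coefficient of interest, how a missing value is handled, and that standardizing the columns does not change the Pearson coefficient:

```python
import numpy as np
import pandas as pd

# Hypothetical actual vs. predicted values; one prediction is missing
data = pd.DataFrame({
    "Duration": [120.0, 150.0, 90.0, 200.0, 170.0],
    "Predicted value": [118.0, 155.0, np.nan, 190.0, 175.0],
})

corr = data.corr(method="pearson", min_periods=1)
r = corr.iat[0, 1]   # off-diagonal entry: correlation of the two columns

# pandas excludes rows with NaN pairwise; dropping them by hand gives the same r
complete = data.dropna()
r_check = np.corrcoef(complete["Duration"], complete["Predicted value"])[0, 1]

# Standardizing the columns (as StandardScaler would) leaves r unchanged
z = (complete - complete.mean()) / complete.std(ddof=0)
r_scaled = z.corr().iat[0, 1]

print(r, r_check, r_scaled)
```

Since the coefficient is invariant under this kind of rescaling, running the correlation step with or without normalization should give the same value for a given pair of columns.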

3.2.2. Core source code

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler  # normalization
from sklearn.metrics import mean_squared_error

# Split the data set into x and y
def feature_label_split(data, label_name, normalization=True):
    # Rename the raw columns to readable feature names
    data.rename(columns={
        'OutletTotalTime': 'Duration', 'CraneP': 'Crane position',
        'SpecifiedL': 'Set amount', 'LCActualL': 'Oil volume',
        'Effi': 'Oil rate', 'Time': 'Time', 'Month': 'Month', 'Day': 'Date',
        'LossL': 'Loss and overflow', 'OilTemperature': 'Temperature',
        'OilDensity': 'Density'}, inplace=True)

    del_name = [label_name]

    # Normalization (standardize every column to zero mean and unit variance)
    if normalization:
        scaler = StandardScaler()
        columns = data.columns
        indexs_train = data.index
        data = pd.DataFrame(scaler.fit_transform(data), index=indexs_train, columns=columns)

    # Split features and label
    y = data[label_name]
    x = data.drop(del_name, axis=1)

    return x, y

# Train the multiple linear regression model
def train_model(train_x, train_y):
    test_percent = 0.7  # fraction of the rows held out for testing
    x_train, x_test, y_train, y_test = train_test_split(train_x, train_y, test_size=test_percent)

    model = LinearRegression()
    model.fit(x_train, y_train)

    score = model.score(x_train, y_train)  # R^2 on the training split
    print("Training score: ", score)

    ypred = model.predict(x_test)
    mse = mean_squared_error(y_test, ypred)
    print("MSE: %.2f" % mse)
    print("RMSE: %.2f" % (mse ** (1 / 2.0)))

    return model

# Calculate the multiple correlation of every feature
def multi_corr():
    df = pd.read_excel('e:/data1.xlsx')
    key_name = ['Duration', 'Crane position', 'Set amount', 'Oil volume', 'Oil rate',
                'Time', 'Month', 'Date', 'Loss and overflow', 'Temperature', 'Density']

    key_corr = []

    for key in key_name:
        train_x, train_y = feature_label_split(df, label_name=key, normalization=False)
        model = train_model(train_x, train_y)
        ypred = model.predict(train_x)
        # Reuse the index of train_y so that concat aligns the rows
        fit_y = pd.DataFrame({'Predicted value': ypred}, index=train_y.index)

        data = pd.concat([train_y, fit_y], axis=1)
        print(data)
        corr = data.corr(method='pearson', min_periods=1)
        key_corr.append(corr.iat[0, 1])

    print(key_corr)

if __name__ == '__main__':
    multi_corr()

3.2.3. Output results

Output of the actual and predicted values for one feature (density):

Training score: 0.5983131306423572
MSE: 0.00
RMSE: 0.02
       Density  Predicted value
0       0.7397         0.836339
1       0.7397         0.771213
2       0.7397         0.787371
3       0.7397         0.781056
4       0.7348         0.822460
...        ...              ...
67615   0.7479         0.760884
67616   0.7479         0.745060
67617   0.7526         0.742016
67618   0.8180         0.855769
67619   0.7526         0.778705

The final correlation results are as follows:

4. Summary

In May of this year, the Ministry of Industry and Information Technology issued the "Guiding Opinions of the Ministry of Industry and Information Technology on the Development of Industrial Big Data" (Xinfa [2020] No. 67 of the Ministry of Industry and Information Technology), which proposes to promote the convergence and sharing of industrial data, deepen integrated innovation with data, improve data governance capabilities, strengthen data security management, and build an industrial big data ecosystem with rich resources, thriving applications, a progressing industry, and orderly governance.

This article starts from the application of big data in the oil depot business domain, lets applications drive data governance, and takes the analysis of data models and algorithm models as its entry point. Building on traditional statistical analysis, it provides quantitative indicators and a basis for weighting, explores deeper and broader uses of data, and brings out the value of industrial big data. Examples are the tanker profiles, the matching of crane positions and oil tanks, and the data-driven decision-making mentioned in this article.

Since the author's level is limited, discussions are welcome.

 

 


Origin blog.csdn.net/weixin_43881394/article/details/109053226