SQL Server Machine Learning Services - Overview and Hands-On Practice (Repost)

Original post: https://d-bi.gitee.io/sqlserver-ml-services/

A New Year's article to ring out the old and ring in the new, and this article is itself both old and new. The old part: SQL Server Machine Learning Services was introduced in SQL Server 2016. At that time it supported only the R language and was known as "R Services"; SQL Server 2017 and later versions added support for Python, and the feature is now called "Machine Learning Services". It has therefore been available for more than two years and is not strictly new. The new part: since its release there has been very little material about it in Chinese. On the one hand, many users are still on older versions of SQL Server; on the other hand, many people simply do not know the feature exists. Yet machine learning algorithms can mine data in much greater depth, which matters a great deal for enterprise BI, and integrating AI into BI is the general direction business intelligence is heading.


This article first explains what Machine Learning Services is (using Python as the example), what it means in practice, and how it works; then covers installation and deployment; and finally walks through a tutorial showing how to use it.

What is Machine Learning Services?

Machine Learning Services is a feature of SQL Server that lets users call R or Python scripts from stored procedures to process data and to train and deploy machine learning models. In plain terms, data in a SQL Server database can be handed directly to Python for modeling, and the results produced by Python can be written straight back to SQL Server (more on this below). This integrates Python's strengths in data processing and machine learning into SQL Server itself, and every operation is executed inside SQL Server (which also distinguishes it from the standalone Machine Learning Server product).
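The core entry point is the system stored procedure sp_execute_external_script: the query passed in @input_data_1 is exposed to the script as a data frame (named InputDataSet by default), and whatever the script assigns to OutputDataSet is returned to SQL Server as a result set. A minimal round-trip sketch (assuming Machine Learning Services is installed and external scripts are enabled):

EXEC sp_execute_external_script
      @language = N'Python'
    , @script = N'
# InputDataSet holds the rows from @input_data_1; echo them back unchanged
OutputDataSet = InputDataSet'
    , @input_data_1 = N'SELECT 1 AS x UNION ALL SELECT 2'
WITH RESULT SETS (("x" INT));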

(Note: SQL Server 2019 also adds support for Java, but that is not part of Machine Learning Services and is not covered here.)

Why use Machine Learning Services?

At this point some readers may ask: even without Machine Learning Services, Python can already read from and write back to a database using libraries such as sqlalchemy (as in this earlier article), so you could still use Python's powerful algorithm libraries to make predictions and write the results back. Why use Machine Learning Services at all? The two approaches are in fact quite different, because running machine learning inside SQL Server brings the following advantages:

  • Reusability. A trained model produced by the Python script in a stored procedure can be serialized and stored as a single value in a table, and reused whenever needed later. For example, it can be handed to other applications (such as SSIS) or called by other stored procedures, deployed to production, and used for scheduled data mining jobs.
  • Extensibility. Machine Learning Services ships with the Anaconda distribution and its rich set of built-in Python libraries, and you can also install third-party frameworks such as TensorFlow and scikit-learn.
  • Security. The scripts run where the data lives, so no data has to travel across the network to another server. Because the scripting language executes inside SQL Server's security framework, a database administrator can give data mining engineers access to enterprise data while keeping it secure.
  • Monitorability. You can monitor Python execution with reports provided by Microsoft or ones you design yourself (see this article on installing these reports in SSMS).
  • Integration with BI tools. Stored procedure parameter values can be passed through to Python, which means that in reports built with SSRS (or Power BI Report Builder), users can adjust the parameters of a machine learning algorithm and have the tuned model and its results displayed automatically. For example, when forecasting sales with multiple linear regression, the model can be optimized by varying its parameters (if you use Power BI instead, this article provides a modeling and tuning use case in DAX). A parameter-passing sketch follows this list.

(In addition, R itself does not support multithreading the way Python does, so the service's ability to execute R scripts in a multi-threaded manner is an extra advantage on the R side.)
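As an illustration of the BI-integration point above, here is a minimal sketch, not from the original article, of passing a tuning parameter from T-SQL into the Python script via @params so that a report could drive it; the procedure name and parameter are hypothetical:

-- Hypothetical procedure: the caller (e.g. an SSRS report) chooses n_estimators
CREATE PROCEDURE train_with_parameter (@n_estimators INT)
AS
BEGIN
    EXEC sp_execute_external_script
          @language = N'Python'
        , @script = N'
# n_estimators arrives from T-SQL through @params
print("training a forest with", n_estimators, "trees")'
        , @params = N'@n_estimators INT'
        , @n_estimators = @n_estimators;
END;
GO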

How Machine Learning Services works

After Machine Learning Services is installed, a service named "SQL Server Launchpad" appears in SQL Server Configuration Manager; it is the service SQL Server uses to execute external scripts (Python or R). The official documentation provides the execution flow shown below, which is not described in detail here; for specifics, refer to this document:

[Figure: official diagram of the external script execution flow]

(Note on revoscalepy: this is a Python machine learning library developed by Microsoft. It provides commonly used algorithms such as decision trees, linear regression, and logistic regression, and supports remote and distributed computing environments; see the documentation.)

About Installation and Deployment

Training and deploying Python models in SQL Server requires SQL Server 2017 or later, and Machine Learning Services can be installed on both Windows and Linux. For installation and deployment, the links below should help you get Machine Learning Services installed in SQL Server successfully.

(If you need to download a newer version of SQL Server, click here to go to the download page.)

To install this feature, open the SQL Server Installation Center and go through the installation wizard, adding the two features "Machine Learning Services" and "Python"; see the official documentation for the detailed installation steps. If a problem causes the installation to fail, refer to this document, which details the issues you may run into and suggested fixes. For offline installation of Machine Learning Services, refer to this document; it mentions that you need to download two CAB files into the root directory of your installation:

[Figure: the two CAB files placed in the installation root directory]

If your SQL Server is not the initial (RTM) release, then the two CAB files provided during installation need to match your specific SQL Server build (you can check the build with: SELECT @@VERSION). This page provides download links for the CAB files corresponding to each SQL Server cumulative update.

(Note: for the Simplified Chinese edition of SQL Server, the 1033 suffix in the CAB file names needs to be changed to 2052, otherwise the installer will not find the CAB files and will report an error.)

After a successful installation, try running a simple Python script as a test. This document provides a good introductory tutorial that will gradually familiarize you with how stored procedures invoke Python.
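For a quick smoke test, a minimal sketch along these lines can be used (the full tutorial steps are in the document linked above); remember that external scripts must be enabled first:

-- Allow SQL Server to run external Python/R scripts
-- (depending on the version, restarting the SQL Server service may also be required)
EXEC sp_configure 'external scripts enabled', 1;
RECONFIGURE WITH OVERRIDE;
GO
-- Simplest possible check that the Python runtime is reachable
EXEC sp_execute_external_script
      @language = N'Python'
    , @script = N'print("Hello from Python inside SQL Server")';
GO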

How to train and use a predictive model (Python) with Machine Learning Services

A typical workflow is: open SSMS and create a stored procedure containing the Python modeling code; execute it so that the trained model is stored as a record in a table; then create a second stored procedure that fetches that record, calls the model to generate predictions (classification or regression) on new data, and returns the results to a new table.

(Hint: it is recommended to finish writing the Python script in a Python IDE first. A convenient option is to open jupyter-notebook.exe from the root directory of the Machine Learning Services installation and test the Python script in a Jupyter notebook.)
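If you test the script this way, you can pull the same table into a pandas DataFrame and iterate on the modeling code before wrapping it in a stored procedure. A rough sketch, assuming a local instance and a database named WineDB (both names are placeholders, adjust to your environment):

import pyodbc
import pandas as pd

# Placeholder connection details - replace server/database with your own
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=localhost;Database=WineDB;Trusted_Connection=yes;"
)
# Same query that will later be passed to @input_data_1
dt = pd.read_sql('SELECT * FROM dbo.RedWineQuality', conn)
print(dt.head())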

The following case uses a wine quality data set as the example: 70% of the data serves as training data, 30% as test data, and a random forest algorithm is used to predict the quality level of the wine. The steps are as follows:

1. Create a stored procedure - to train a predictive model on the training data

(Here, the @input_data_1 parameter is used to pass in the data set.)

DROP PROCEDURE IF EXISTS generate_model_rfc;
go
CREATE PROCEDURE generate_model_rfc (@trained_model varbinary(max) OUTPUT)
AS
BEGIN
    EXECUTE sp_execute_external_script
      @language = N'Python'
    , @script = N'
import numpy as np
import pandas as pd
import pickle
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
# train_data is the DataFrame produced by @input_data_1
dt = train_data
dt["Level"] = dt["Level"].astype(''category'')
X = dt[["Fixed Acidity","Volatile Acidity","Citric acid","Residual Sugar","Chlorides","Free Sulfur Dioxide","Total Sulfur Dioxide","Density","PH","Sulphates","Alcohol"]]
Y = dt[["Level"]]
# 70/30 split; the same random_state is reused later in the scoring procedure
X_Train, X_Test, Y_Train, Y_Test = train_test_split(X, Y, test_size=0.3, random_state=42)
RFC=RandomForestClassifier(n_estimators=100,criterion=''gini'',min_samples_split=2,max_depth=2)
RFC.fit(X_Train,Y_Train.values.ravel())
# serialize the fitted model so it can be returned as varbinary(max)
trained_model = pickle.dumps(RFC)
'
, @input_data_1 = N'select "Fixed Acidity", "Volatile Acidity", "Citric acid", "Residual Sugar", "Chlorides", "Free Sulfur Dioxide", "Total Sulfur Dioxide", "Density", "PH", "Sulphates", "Alcohol", "Level" from dbo.RedWineQuality'
, @input_data_1_name = N'train_data'
, @params = N'@trained_model varbinary(max) OUTPUT'
, @trained_model = @trained_model OUTPUT;
END;
GO

2. Create a table - for storing the predictive model

CREATE TABLE dbo.my_py_models (
    model_name VARCHAR(50) NOT NULL DEFAULT('default model') PRIMARY KEY,
    model_object VARBINARY(MAX) NOT NULL
);
GO

3. Execute the stored procedure - generate the predictive model and store it in the table created above (my_py_models)

DECLARE @model_object VARBINARY(MAX);
EXEC generate_model_rfc @model_object OUTPUT;
INSERT INTO my_py_models (model_name, model_object) VALUES('RandomForestClassification(RFC)', @model_object);
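To confirm that the serialized model actually landed in the table, a quick check like this can be run (not part of the original steps):

-- The model should appear as one row; DATALENGTH shows the size of the pickled object in bytes
SELECT model_name, DATALENGTH(model_object) AS model_size_bytes
FROM dbo.my_py_models;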

4. Create a second stored procedure - to call the existing predictive model on new data and generate predictions

In this case the training and test sets are split inside the Python script with train_test_split, which differs from the official documentation, where the training and test sets are split with a WHERE clause and passed as parameters to the Python scripts of the two stored procedures. As a result, in this case the @input_data_1 parameter is identical in both generate_model_rfc (the first stored procedure) and py_predict_rfc, which is somewhat inconvenient; readers can use the WHERE-clause split instead if they prefer. Note that the two splits here must use the same random seed (in this article: random_state = 42) so that the test set seen by the second procedure matches the one held out during training.
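As a rough sketch of the WHERE-clause alternative mentioned above (the Id column and the 1120-row boundary are made up for illustration and depend on how your table is keyed):

-- Training rows, used as @input_data_1 in the training procedure
SELECT * FROM dbo.RedWineQuality WHERE Id <= 1120;
-- Test rows, used as @input_data_1 in the scoring procedure
SELECT * FROM dbo.RedWineQuality WHERE Id > 1120;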

DROP PROCEDURE IF EXISTS py_predict_rfc;
GO
CREATE PROCEDURE py_predict_rfc (@model varchar(100))
AS
BEGIN
	DECLARE @py_model varbinary(max) = (select model_object from dbo.my_py_models where model_name = @model);
	EXEC sp_execute_external_script
				@language = N'Python',
				@script = N'
import pickle
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
# deserialize the model retrieved from dbo.my_py_models
py_model_rfc = pickle.loads(py_model)
dt = data
dt["Level"] = dt["Level"].astype(''category'')
X = dt[["Fixed Acidity","Volatile Acidity","Citric acid","Residual Sugar","Chlorides","Free Sulfur Dioxide","Total Sulfur Dioxide","Density","PH","Sulphates","Alcohol"]]
Y = dt[["Level"]]
# reproduce the same 70/30 split (same random_state) to recover the held-out test set
X_Train, X_Test, Y_Train, Y_Test = train_test_split(X, Y, test_size=0.3, random_state=42)
Predictions = py_model_rfc.predict(X_Test)
# return predictions alongside the actual labels as the result set
OutputDataSet=pd.DataFrame(data=Predictions,columns=[''Prediction''])
OutputDataSet[''Actual''] = pd.Series(Y_Test[''Level''].values,index=np.arange(0,len(Y_Test)))
'
, @input_data_1 = N'select "Fixed Acidity", "Volatile Acidity", "Citric acid", "Residual Sugar", "Chlorides", "Free Sulfur Dioxide", "Total Sulfur Dioxide", "Density", "PH", "Sulphates", "Alcohol", "Level" from dbo.RedWineQuality'
, @input_data_1_name = N'data'
, @params = N'@py_model varbinary(max)'
, @py_model = @py_model
with result sets (("Prediction" int, "Actual" int));
END;
GO

5. Create a new table - for storing the model test results

CREATE TABLE [dbo].[py_rfc_predictions](
 [Prediction] int,
 [Actual] int
)

6. Execute the second stored procedure - generate predictions and output the results to the new table (py_rfc_predictions)

As the code above shows, the returned result set (and hence the new table) has just two fields: one stores the predicted values - [Prediction] - and the other the actual values - [Actual].

INSERT INTO py_rfc_predictions
EXEC py_predict_rfc 'RandomForestClassification(RFC)';
SELECT * FROM py_rfc_predictions;
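With predictions and actual labels side by side, a simple accuracy figure can be computed directly in T-SQL as a sanity check (this query is an addition, not part of the original walkthrough):

-- Share of test rows where the predicted level matches the actual level
SELECT AVG(CASE WHEN Prediction = Actual THEN 1.0 ELSE 0.0 END) AS accuracy
FROM dbo.py_rfc_predictions;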

At this point, the entire process of training and using the machine learning model is complete.

Summary

The combination of SQL Server and Python lets developers mine data more deeply with machine learning algorithms and feed the results into reports - results that are often hard to produce with front-end BI tools alone. In Power BI, for example, DAX can only express fairly simple algorithms (such as this DAX implementation of the KNN classification algorithm); Power BI Premium and Power BI Embedded do offer machine learning, but only in the cloud versions, and compared with writing your own Python code to train exactly the model you need, the gap - whether in the choice of algorithms or in the quality of the results - is still very noticeable. In an enterprise environment, using Machine Learning Services to mine the data stored in the database and then having the front-end BI tool read the result set directly is clearly the more standard and more practical approach.

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


Reposted from: www.cnblogs.com/joyanli/p/12529459.html