ML.NET machine learning framework study notes [3]: text feature analysis

First, the problem to be solved

Problem: Organizations often require meeting records to be entered when a meeting is held. We want to use machine learning to judge automatically whether the text entered by the user passes or fails. (Similar problems include spam message detection and work log quality analysis.)

Approach: we take existing meeting records that have been manually marked as pass or fail and train a model on them. As in the previous article, the learning algorithm is the fast decision tree binary classifier (FastTree). The difference is that the input feature is no longer a float value but Chinese text, which brings us to text feature extraction.

Why is text feature extraction needed? Text is human language, and a sequence of word symbols cannot be passed directly to an algorithm. A learning algorithm only accepts numeric feature vectors of fixed length (a float or a float array); it cannot work with text documents of variable length.
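To make the fixed-length requirement concrete, here is a minimal bag-of-words sketch in plain C#. It only illustrates the general idea of turning text into numbers and is not the algorithm that ML.NET's FeaturizeText uses internally. The class and method names (BagOfWordsSketch, BuildVocabulary, Vectorize) and the sample sentences are made up for this example.

```csharp
using System;
using System.Collections.Generic;

public static class BagOfWordsSketch
{
    // Assigns each distinct word an index, in order of first appearance.
    // Documents are assumed to be already separated by spaces.
    public static Dictionary<string, int> BuildVocabulary(IEnumerable<string> documents)
    {
        var vocabulary = new Dictionary<string, int>();
        foreach (var document in documents)
        {
            foreach (var word in document.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
            {
                if (!vocabulary.ContainsKey(word))
                {
                    vocabulary.Add(word, vocabulary.Count);
                }
            }
        }
        return vocabulary;
    }

    // Counts word occurrences into a vector whose length is the vocabulary size,
    // so every document yields a vector of the same length.
    public static float[] Vectorize(string document, Dictionary<string, int> vocabulary)
    {
        var vector = new float[vocabulary.Count];
        foreach (var word in document.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            if (vocabulary.TryGetValue(word, out int index))
            {
                vector[index] += 1f;
            }
        }
        return vector;
    }

    public static void Main()
    {
        // Hypothetical sample sentences, not taken from the article's data set.
        var documents = new[] { "the meeting was held", "the meeting was the best" };
        var vocabulary = BuildVocabulary(documents);

        foreach (var document in documents)
        {
            // Prints 1,1,1,1,0 and then 2,1,1,0,1
            Console.WriteLine(string.Join(",", Vectorize(document, vocabulary)));
        }
    }
}
```

The two sentences contain a different number of words, yet each maps to a vector of length 5, the size of the vocabulary. That fixed length is what a trainer such as FastTree needs.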

Several text feature extraction methods are in common use. We only need to understand the general idea behind them; there is no need to implement a text feature extraction algorithm ourselves, because we can simply use the methods that the platform provides.

The platform's built-in text featurization method takes a string as input and requires the words in a sentence to be separated by spaces. English sentences are naturally divided by spaces, but Chinese sentences are not, so Chinese text must first go through a word segmentation step that produces space-separated words.
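The difference between the two languages can be seen with a small plain C# sketch that splits on spaces. The class name WhitespaceTokenSketch and the sample strings are invented for illustration, and the segmented Chinese string was split by hand; it is not the output of any segmentation library.

```csharp
using System;

public static class WhitespaceTokenSketch
{
    // Splits a sentence into tokens on spaces, the convention the text above describes.
    public static string[] SplitOnSpaces(string sentence)
    {
        return sentence.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        string english = "the committee held a meeting";
        string chineseRaw = "开展党员答题活动";
        string chineseSegmented = "开展 党员 答题 活动"; // hand-segmented for illustration only

        Console.WriteLine(SplitOnSpaces(english).Length);          // 5 tokens
        Console.WriteLine(SplitOnSpaces(chineseRaw).Length);       // 1 token: the whole sentence
        Console.WriteLine(SplitOnSpaces(chineseSegmented).Length); // 4 tokens
    }
}
```

Without segmentation the whole Chinese sentence is treated as a single "word", so nearly every record would produce its own unique token and the model would have little to learn from.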

 

Second, the code

The overall flow of the code is basically the same as described in the previous article. For simplicity, saving and loading the model are omitted.

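First, a look at the data set: it is a comma-separated file without a header row, with the pass/fail label in the first column and the meeting text in the second.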

 

The code is as follows:

using System;
using System.IO;
using Microsoft.ML;
using Microsoft.ML.Data;

namespace BinaryClassification_TextFeaturize
{
    class Program
    {
        static readonly string DataPath = Path.Combine(Environment.CurrentDirectory, "Data", "meeting_data_full.csv");

        static void Main(string[] args)
        {
            MLContext mlContext = new MLContext();
            var fulldata = mlContext.Data.LoadFromTextFile<MeetingInfo>(DataPath, separatorChar: ',', hasHeader: false);
            var trainTestData = mlContext.Data.TrainTestSplit(fulldata, testFraction: 0.15);
            var trainData = trainTestData.TrainSet;
            var testData = trainTestData.TestSet;

            var trainingPipeline = mlContext.Transforms.CustomMapping<JiebaLambdaInput, JiebaLambdaOutput>(mapAction: JiebaLambda.MyAction, contractName: "JiebaLambda")
                .Append(mlContext.Transforms.Text.FeaturizeText(outputColumnName: "Features", inputColumnName: "JiebaText"))
                .Append(mlContext.BinaryClassification.Trainers.FastTree(labelColumnName: "Label", featureColumnName: "Features"));
            ITransformer trainedModel = trainingPipeline.Fit(trainData);

            
            // Evaluate the model on the held-out test set
            var predictions = trainedModel.Transform(testData);
            var metrics = mlContext.BinaryClassification.Evaluate(data: predictions, labelColumnName: "Label");
            Console.WriteLine($"Evaluation Accuracy: {metrics.Accuracy:P2}");

            // Create the prediction engine
            var predEngine = mlContext.Model.CreatePredictionEngine<MeetingInfo, PredictionResult>(trainedModel);

            // Prediction 1
            MeetingInfo sampleStatement1 = new MeetingInfo { Text = "支委会。" };
            var predictionresult1 = predEngine.Predict(sampleStatement1);
            Console.WriteLine($"{sampleStatement1.Text}:{predictionresult1.PredictedLabel}");         

            // Prediction 2
            MeetingInfo sampleStatement2 = new MeetingInfo { Text = "开展新时代中国特色社会主义思想三十讲党员答题活动。" };
            var predictionresult2 = predEngine.Predict(sampleStatement2);
            Console.WriteLine($"{sampleStatement2.Text}:{predictionresult2.PredictedLabel}");        

            Console.WriteLine("Press any key to exit!");
            Console.ReadKey();
        }
        
    }

    public class MeetingInfo
    {
        [LoadColumn(0)]
        public bool Label { get; set; }
        [LoadColumn(1)]
        public string Text { get; set; }
    }

    public class PredictionResult : MeetingInfo
    {
        public string JiebaText { get; set; }
        public float[] Features { get; set; }
        public bool PredictedLabel;
        public float Score;
        public float Probability;        
    }
}

  

Third, code analysis

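I will not repeat the explanation of the parts that are similar to the previous article. The focus here is on how the learning pipeline is built.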

var trainingPipeline = mlContext.Transforms.CustomMapping<JiebaLambdaInput, JiebaLambdaOutput>(mapAction: JiebaLambda.MyAction, contractName: "JiebaLambda")
    .Append(mlContext.Transforms.Text.FeaturizeText(outputColumnName: "Features", inputColumnName: "JiebaText"))
    .Append(mlContext.BinaryClassification.Trainers.FastTree(labelColumnName: "Label", featureColumnName: "Features"));   

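First, before the text feature transform can run, the text has to be segmented into words. One option is to preprocess the sample data and train on the segmented result. We did not take that approach. Instead, we defined a custom data processing step that performs the segmentation inside the pipeline. It is defined as follows: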

namespace BinaryClassification_TextFeaturize
{
    public class JiebaLambdaInput
    {
        public string Text { get; set; }
    }

    public class JiebaLambdaOutput
    {
        public string JiebaText { get; set; }
    }

    public class JiebaLambda
    {       
        public static void MyAction(JiebaLambdaInput input, JiebaLambdaOutput output)
        {
            JiebaNet.Segmenter.JiebaSegmenter jiebaSegmenter = new JiebaNet.Segmenter.JiebaSegmenter();
            output.JiebaText = string.Join(" ", jiebaSegmenter.Cut(input.Text));          
        }        
    }
}
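For comparison, the preprocessing alternative mentioned above could be sketched roughly as follows in plain C#. This is an assumption about how such a step might look, not code from the article's project: the names PreSegmentSketch, SegmentLines and SplitIntoCharacters are hypothetical, the sample rows are invented, and the stand-in segmenter simply emits one token per character so that the example runs without any external library. In practice a real segmenter, such as the Cut call used in MyAction above, would be passed in instead.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PreSegmentSketch
{
    // Rewrites "label,text" lines so that the text column holds space-separated words.
    // The 'segment' delegate is whatever word segmenter the caller plugs in.
    public static IEnumerable<string> SegmentLines(
        IEnumerable<string> lines, Func<string, IEnumerable<string>> segment)
    {
        foreach (var line in lines)
        {
            int comma = line.IndexOf(',');
            if (comma < 0)
            {
                continue; // skip lines that do not have a label column
            }

            string label = line.Substring(0, comma);
            string text = line.Substring(comma + 1);
            yield return label + "," + string.Join(" ", segment(text));
        }
    }

    // Stand-in segmenter: one token per character. This is NOT real word segmentation.
    public static IEnumerable<string> SplitIntoCharacters(string text)
    {
        return text.Select(c => c.ToString());
    }

    public static void Main()
    {
        // Hypothetical rows; the label format of the real data set is not shown in the article.
        var lines = new[] { "1,支委会", "0,开会" };

        foreach (var line in SegmentLines(lines, SplitIntoCharacters))
        {
            Console.WriteLine(line); // prints "1,支 委 会" and then "0,开 会"
        }
    }
}
```

The drawback of preprocessing is that every text passed to the model at prediction time has to be segmented in exactly the same way by the caller. Putting the segmentation inside the pipeline, as the article does, keeps training and prediction consistent automatically.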

Finally, we create two objects to make actual predictions:

            // Prediction 1
            MeetingInfo sampleStatement1 = new MeetingInfo { Text = "支委会。" };
            var predictionresult1 = predEngine.Predict(sampleStatement1);
            Console.WriteLine($"{sampleStatement1.Text}:{predictionresult1.PredictedLabel}");         

            // Prediction 2
            MeetingInfo sampleStatement2 = new MeetingInfo { Text = "开展新时代中国特色社会主义思想三十讲党员答题活动。" };
            var predictionresult2 = predEngine.Predict(sampleStatement2);
            Console.WriteLine($"{sampleStatement2.Text}:{predictionresult2.PredictedLabel}");

Running the program prints each input text followed by its predicted label.

 

Fourth, debugging

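The previous article mentioned that when the Transform method runs, it transforms every record. What does the transformed data set look like? We can write a small debugging routine to find out.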

        var predictions = trainedModel.Transform(testData);
        DebugData(mlContext, predictions);

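        // Requires: using System.Collections.Generic;
        private static void DebugData(MLContext mlContext, IDataView predictions)
        {
            // Materialize the transformed rows. ignoreMissingColumns: true tolerates
            // PredictionResult members that have no matching column in the data.
            var trainDataShow = new List<PredictionResult>(mlContext.Data.CreateEnumerable<PredictionResult>(predictions, reuseRowObject: false, ignoreMissingColumns: true));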

            foreach (var dataline in trainDataShow)
            {
                dataline.PrintToConsole();
            }
        }

    public class PredictionResult 
    {
        public string JiebaText { get; set; }
        public float[] Features { get; set; }
        public bool PredictedLabel;
        public float Score;
        public float Probability;
        public void PrintToConsole()
        {
            Console.WriteLine($"JiebaText={JiebaText}");
            Console.WriteLine($"PredictedLabel:{PredictedLabel},Score:{Score},Probability:{Probability}");
            // Check for null before touching Features, including its Length
            if (Features != null)
            {
                Console.WriteLine($"TextFeatures Length:{Features.Length}");
                foreach (var f in Features)
                {
                    Console.Write($"{f},");
                }
                Console.WriteLine();
            }
            Console.WriteLine();
        }
    }

Analyzing the debug output shows how the whole data processing pipeline works.

 

Fifth, resources

Source code download: https://github.com/seabluescn/Study_ML.NET

Project name: BinaryClassification_TextFeaturize


Origin www.cnblogs.com/seabluescn/p/10914829.html