CRFs - template generation

Formatting the training and test data:

(figure: training data in column format)

The sentence "He reckons the current account deficit will narrow to only # 1.8 billion in September." is a typical training sentence x. CRF requires it to be split so that each word occupies one line with a fixed number of columns. Besides the original input token, a line may carry additional information: in the example above, the second column holds the POS tag, and the last column is the label, i.e. the gold answer y. Different training sequences are separated from each other by a blank line.
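The figure's column layout looks roughly like the following (POS and chunk tags shown here are illustrative of the CoNLL-style format, reconstructed from the example sentence; token, POS tag, then the label y):

```
He        PRP  B-NP
reckons   VBZ  B-VP
the       DT   B-NP
current   JJ   I-NP
account   NN   I-NP
deficit   NN   I-NP
will      MD   B-VP
narrow    VB   I-VP
...
```

A blank line after the final token would mark the end of this sequence and the start of the next one.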
At training time, CRF requires us to supply feature templates. What is a feature template? First look at the following figure:

(figure: feature template example over the column-formatted data)
 
"%x[row,col]" means: from the current position, move up or down |row| rows and take the value in column col. In the figure above, suppose the current position is the line "the DT B-NP". Then "%x[0,0]" refers to column 0 of the current line (row offset 0), i.e. "the"; "%x[0,1]" refers to column 1 of the current line, i.e. "DT"; "%x[-2,1]" refers to column 1 of the line two rows above the current one, i.e. "PRP"; and so on.
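The %x[row,col] lookup described above can be sketched as a small helper. This is not CRF++'s actual code: the table contents, the class, and the boundary marker are illustrative (CRF++ itself emits special markers such as _B-1/_B+1 when the offset runs past the sequence border).

```java
import java.util.Arrays;
import java.util.List;

// Sketch of how a "%x[row,col]" macro is resolved against the column-formatted data.
public class MacroExpand {
    // A fragment of the example sentence: token, POS tag, chunk label per row.
    static final List<String[]> ROWS = Arrays.asList(
        new String[]{"He",      "PRP", "B-NP"},
        new String[]{"reckons", "VBZ", "B-VP"},
        new String[]{"the",     "DT",  "B-NP"},
        new String[]{"current", "JJ",  "I-NP"},
        new String[]{"account", "NN",  "I-NP"}
    );

    // Resolve %x[rowOffset, col] relative to the current position.
    public static String expand(int curPos, int rowOffset, int col) {
        int row = curPos + rowOffset;
        if (row < 0 || row >= ROWS.size()) {
            return "_B" + rowOffset; // simplified stand-in for CRF++'s boundary markers
        }
        return ROWS.get(row)[col];
    }

    public static void main(String[] args) {
        // Current position: "the DT B-NP" (index 2), as in the figure.
        System.out.println(expand(2,  0, 0)); // %x[0,0]  -> the
        System.out.println(expand(2,  0, 1)); // %x[0,1]  -> DT
        System.out.println(expand(2, -2, 1)); // %x[-2,1] -> PRP
    }
}
```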
CRF templates come in two main kinds, Unigram templates and Bigram templates. Note that "Unigram" and "Bigram" refer to the output sequence, not the input sequence. For a template such as "U01:%x[0,1]", the input data above produces feature functions like the following:
(figure: feature functions generated by a Unigram template)
If the output label set has size L, each line of training data yields L feature functions from this template; if the input sequence has length N, one Unigram template generates N * L feature functions. Similarly, for a Bigram template such as "B01:%x[0,1]", the current output label is considered jointly with the previous output label, producing feature functions like the following:
(figure: feature functions generated by a Bigram template)
Such combinations produce N * L * L feature functions.
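The N * L and N * L * L counts can be made concrete by enumerating, for one extracted value (here "DT" from %x[0,1]), the feature functions a single template position generates. This is an illustrative sketch, not CRF++ code; the label set and the textual form of the functions are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Enumerate the feature functions one template position generates:
// a Unigram template pairs the extracted value with each output label (L functions),
// a Bigram template pairs it with each (previous label, current label) pair (L*L functions).
public class FeatureEnum {
    static List<String> unigramFuncs(String extracted, List<String> labels) {
        List<String> out = new ArrayList<>();
        for (String y : labels) {
            out.add("func: return (x == \"" + extracted + "\" && y == \"" + y + "\") ? 1 : 0");
        }
        return out;
    }

    static List<String> bigramFuncs(String extracted, List<String> labels) {
        List<String> out = new ArrayList<>();
        for (String yPrev : labels) {
            for (String y : labels) {
                out.add("func: return (x == \"" + extracted + "\" && yPrev == \"" + yPrev
                        + "\" && y == \"" + y + "\") ? 1 : 0");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> labels = List.of("B-NP", "I-NP", "O"); // L = 3 (illustrative label set)
        System.out.println(unigramFuncs("DT", labels).size()); // 3 per position -> N * L overall
        System.out.println(bigramFuncs("DT", labels).size());  // 9 per position -> N * L * L overall
    }
}
```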

U templates encode state features, i.e. the nodes of the undirected graph; the cost of a node is computed from the weights corresponding to its feature vector.

B templates encode transition features, i.e. the edges of the undirected graph (the "transition" refers to transitions of the output y, not of the observations x).

Every node and every edge in the probabilistic graph corresponds to a feature vector: nodes carry the U features, edges carry the B features.
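How a stored feature-id vector turns into a node or edge cost can be sketched as follows, using the CRF++ weight-indexing convention (each unigram feature id reserves L consecutive weight slots, one per label; each bigram id reserves L*L slots, one per label pair). The class and variable names here are illustrative.

```java
// Sketch: compute the cost of a node or an edge from its feature-id vector
// and the weight array alpha, following CRF++-style weight indexing.
public class CostSketch {
    // Node (state) cost: for each unigram feature id, add the weight slot for label y.
    public static double nodeCost(int[] fvector, double[] alpha, int y) {
        double c = 0.0;
        for (int id : fvector) {
            if (id == -1) break;  // -1 terminates the vector
            c += alpha[id + y];   // one weight per output label
        }
        return c;
    }

    // Edge (transition) cost: the weight slot is indexed by the (previous, current) label pair.
    public static double edgeCost(int[] fvector, double[] alpha, int yPrev, int y, int numLabels) {
        double c = 0.0;
        for (int id : fvector) {
            if (id == -1) break;
            c += alpha[id + yPrev * numLabels + y]; // one weight per label pair
        }
        return c;
    }

    public static void main(String[] args) {
        double[] alpha = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6}; // toy weights, L = 2
        int[] fvec = {0, -1};                            // one feature id, then terminator
        System.out.println(nodeCost(fvec, alpha, 1));        // alpha[0 + 1]     = 0.2
        System.out.println(edgeCost(fvec, alpha, 1, 0, 2));  // alpha[0 + 1*2+0] = 0.3
    }
}
```

The -1 terminator matches the `feature.add(-1)` calls in the feature-building code below.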

 

    // For each template, build the feature string, look up its id (creating a new
    // id if the feature is unseen), and append that id to `feature`.
    private boolean buildFeatureFromTempl(List<Integer> feature, List<String> templs, int curPos, TaggerImpl tagger)
    {
        for (String tmpl : templs)
        {
            String featureID = applyRule(tmpl, curPos, tagger);
            if (featureID == null || featureID.length() == 0)
            {
                System.err.println("format error");
                return false;
            }

            // Convert the extracted feature string into a numeric id and store it in
            // `feature`; `feature` is the feature vector of one node (or path/edge).
            int id = getID(featureID); // look up the feature's id; if unseen, a new id is created
            if (id != -1)
            {
                feature.add(id);
            }
        }
        return true;
    }

    public boolean buildFeatures(TaggerImpl tagger)
    {
        List<Integer> feature = new ArrayList<Integer>();
        List<List<Integer>> featureCache = tagger.getFeatureCache_();//holds the feature vector of each node and edge; a node is node[i][j], edges are introduced later
        tagger.setFeature_id_(featureCache.size());//remember this offset so the sentence's features can later be fetched starting from this id

        for (int cur = 0; cur < tagger.size(); cur++)//iterate over every word and compute its features
        {
            if (!buildFeatureFromTempl(feature, unigramTempls_, cur, tagger))//build the feature vector for the current word (cur) from the templates (e.g. %x[-2,0])
            {
                return false;
            }
            feature.add(-1);
            featureCache.add(feature);//add this word's feature vector to featureCache; the -1 appended above marks the end of the vector for later reads
            feature = new ArrayList<Integer>();
        }
        for (int cur = 1; cur < tagger.size(); cur++)//iterate over every edge and compute its features
        {
            if (!buildFeatureFromTempl(feature, bigramTempls_, cur, tagger))
            {
                return false;
            }
            feature.add(-1);
            featureCache.add(feature);
            feature = new ArrayList<Integer>();
        }
        return true;
    }

 


Origin blog.csdn.net/asdfsadfasdfsa/article/details/90577377