Calculate the KL distance between two paragraphs of text

Java implementation of relative entropy (Kullback-Leibler divergence, KL distance) (1)
Information theory methods can be used for some simple natural language processing tasks, such as classifying with relative entropy or using relative entropy to measure the difference between two random distributions. When the two distributions are identical, the relative entropy is 0; as the difference between them grows, the relative entropy grows as well. The following experiment measures the difference between probability distributions.

Experimental method and requirements:
    1. Take an extract of any text and count the relative frequency of every character in it. Treat these relative frequencies as the probabilities of the characters (that is, use relative frequencies in place of probabilities);

    2. Take another piece of text and compute its character probability distribution in the same way;

    3. Calculate the KL distance between the character distributions of the two texts;

    4. Give an example (choose any two distributions p and q) showing that the KL distance is asymmetric, that is, D(p||q) != D(q||p);

Method:
D(p||q) = sum over x of p(x) * log(p(x) / q(x)), where p(x) and q(x) are the two probability distributions.

Conventions: 0 * log(0 / q(x)) = 0; p(x) * log(p(x) / 0) = infinity;
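For requirement 4, a small worked example may help. The sketch below is only an illustration, not the code used in the experiment: the class name `KlAsymmetryExample`, the method `kl`, and the two toy distributions are assumptions made here. It applies the formula and both conventions directly, using the natural logarithm, and needs Java 9 or later for `Map.of`.

```java
import java.util.Map;

class KlAsymmetryExample {
    // D(p||q) = sum over x of p(x) * ln(p(x) / q(x)), with the two conventions from the text.
    static double kl(Map<String, Double> p, Map<String, Double> q) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : p.entrySet()) {
            double px = e.getValue();
            if (px == 0.0) {
                continue; // convention: 0 * log(0 / q(x)) = 0
            }
            double qx = q.getOrDefault(e.getKey(), 0.0);
            if (qx == 0.0) {
                return Double.POSITIVE_INFINITY; // convention: p(x) * log(p(x) / 0) = infinity
            }
            sum += px * Math.log(px / qx);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Hypothetical toy distributions over two symbols.
        Map<String, Double> p = Map.of("a", 0.5, "b", 0.5);
        Map<String, Double> q = Map.of("a", 0.9, "b", 0.1);
        System.out.println("D(p||q) = " + kl(p, q));
        System.out.println("D(q||p) = " + kl(q, p));
    }
}
```

With these values D(p||q) is about 0.511 and D(q||p) is about 0.368, so the two directions differ.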

Experimental materials:
   Three news articles were taken from Phoenix News. Their titles are:

   "What 'Secret' of Zhang Ailing Did 'Little Reunion' Reveal?"

   "'Little Reunion': A Dream of Zhang Ailing"

   "The Secret Intelligence War between Mao Zedong and Chiang Kai-shek before the Chongqing Negotiations in 1945"

All three articles are encoded in UTF-8, are about 11 KB in size, and are multi-page news articles.

The contents of the three news articles are as follows: [image]

  In the experiment, probabilities were calculated in two ways. One: in units of single characters. Two: in units of Chinese words. In the second case, the Jeasy word segmentation component was used. It segments by forward maximum matching, and its output is correct in most cases.
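To make "forward maximum matching" concrete, here is a minimal sketch of the general technique. It is not the Jeasy implementation, whose internals are not shown in this article; the class name, the tiny dictionary, and the `maxWordLength` parameter are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class ForwardMaxMatchSketch {
    // Scans left to right and, at each position, takes the longest dictionary word
    // that starts there. Falls back to a single character when nothing matches.
    static List<String> segment(String text, Set<String> dictionary, int maxWordLength) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLength);
            // Shrink the candidate from the right until it is a known word
            // or only one character is left.
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical three-word dictionary; a real segmenter ships a large one.
        Set<String> dictionary = Set.of("小团圆", "张爱玲", "秘密");
        System.out.println(String.join("|", segment("小团圆泄了张爱玲的秘密", dictionary, 3)));
    }
}
```

For this input the sketch prints 小团圆|泄|了|张爱玲|的|秘密. Because the longest match is taken greedily, an ambiguous string can be split incorrectly, which fits the remark that the result is correct in most cases rather than all.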

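/**
 * @author liuyu
 * This entity is the unit that holds one token (a character or a word).
 */
public class Entity
{
    String word;   // the token itself
    float pValue;  // the probability associated with the token

    public Entity() // constructor
    {
        pValue = 0;
        word = "";
    }
}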

 

Read the file

// Needs: java.io.BufferedReader, java.io.FileInputStream, java.io.IOException,
//        java.io.InputStreamReader, java.nio.charset.StandardCharsets
public static String GetFileText(String path) throws IOException
{
    StringBuilder sb = new StringBuilder();
    // try-with-resources closes the reader and the underlying stream, even on error
    try (BufferedReader bufReader = new BufferedReader(
            new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8)))
    {
        String line;
        while ((line = bufReader.readLine()) != null)
        {
            sb.append(line).append(' '); // join the lines with a single space
        }
    }
    return sb.toString();
}


3. Splitting the text into tokens

(1) Word segmentation

// MMAnalyzer is the Jeasy segmenter (assumed import: jeasy.analysis.MMAnalyzer)
public static String CutText(String path) throws IOException
{
    String fileText = GetFileText(path);

    MMAnalyzer analyzer = new MMAnalyzer();
    String result = null;
    String spliter = "|"; // delimiter placed between the segmented words
    try
    {
        result = analyzer.segment(fileText, spliter);
    }
    catch (IOException e)
    {
        e.printStackTrace();
    }
    return result;
}


(2) Splitting into single characters

// Needs: java.util.regex.Matcher, java.util.regex.Pattern
public static String CutTextSingleCharacter(String path) throws IOException
{
    String text = GetFileText(path);
    // Matches one Chinese character at a time; everything else is skipped
    Pattern pattern = Pattern.compile("[\\u4E00-\\u9FA5\\uF900-\\uFA2D]");
    Matcher m = pattern.matcher(text);
    StringBuilder sb = new StringBuilder();
    while (m.find())
    {
        sb.append(m.group()).append('|'); // same "|" delimiter as CutText
    }
    return sb.toString();
}
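To see what the character filter keeps and drops, the following stand-alone sketch applies the same character class to an in-memory string instead of a file. The class name, the method `splitHan`, and the sample string are assumptions made for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HanCharacterSplitDemo {
    // Same character class as CutTextSingleCharacter: two ranges of Chinese characters.
    private static final Pattern HAN = Pattern.compile("[\\u4E00-\\u9FA5\\uF900-\\uFA2D]");

    // Returns every matching character followed by "|"; all other characters are dropped.
    static String splitHan(String text) {
        StringBuilder sb = new StringBuilder();
        Matcher m = HAN.matcher(text);
        while (m.find()) {
            sb.append(m.group()).append('|');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(splitHan("《小团圆》: A Dream of 张爱玲"));
    }
}
```

This prints 小|团|圆|张|爱|玲|. The book-title brackets, Latin letters, spaces, and punctuation fall outside both ranges, so they never enter the character distribution.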




4. Calculate the probability of characters

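// Needs: java.util.ArrayList
public static ArrayList<Entity> CalcuP(String path) throws IOException
{
    // To work in units of words, use the segmenter instead:
    // String result = CutText(path);
    // Work in units of single characters:
    String result = CutTextSingleCharacter(path);
    String[] words = result.split("\\|"); // "|" is a regex metacharacter and must be escaped

    ArrayList<Entity> enList = new ArrayList<>();
    for (String w : words)
    {
        Entity en = new Entity();
        en.word = w.trim();
        en.pValue = 1; // every occurrence starts with a count of 1
        enList.add(en);
    }

    float total = enList.size(); // total number of tokens
    // Merge duplicates: add each later occurrence to the first one, then blank it out
    for (int i = 0; i < enList.size() - 1; i++)
    {
        if (!enList.get(i).word.isEmpty())
        {
            for (int j = i + 1; j < enList.size(); j++)
            {
                if (enList.get(i).word.equals(enList.get(j).word))
                {
                    enList.get(i).pValue++;
                    enList.get(j).pValue = 0;
                    enList.get(j).word = "";
                }
            }
        }
    }
    // Drop the blanked-out entries; iterate backwards so the indices stay valid
    for (int i = enList.size() - 1; i >= 0; i--)
    {
        if (enList.get(i).pValue < 1.0)
            enList.remove(i);
    }
    // Turn the counts into relative frequencies
    for (int i = 0; i < enList.size(); i++)
    {
        enList.get(i).pValue = enList.get(i).pValue / total;
    }

    return enList;
}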
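CalcuP compares every token with every later token, so its running time grows quadratically with the text length. As an alternative sketch (not the author's method), the same relative frequencies can be collected in one pass with a HashMap. The class and method names below are assumptions, and unlike CalcuP this version skips empty tokens instead of counting them in the total.

```java
import java.util.HashMap;
import java.util.Map;

class FrequencySketch {
    // Takes a "|"-delimited token string, as produced by the splitting step,
    // and returns token -> relative frequency.
    static Map<String, Double> relativeFrequencies(String delimited) {
        Map<String, Integer> counts = new HashMap<>();
        int total = 0;
        for (String token : delimited.split("\\|")) {
            token = token.trim();
            if (token.isEmpty()) {
                continue; // ignore empty tokens
            }
            counts.merge(token, 1, Integer::sum); // insert 1 or add 1 to the existing count
            total++;
        }
        Map<String, Double> probabilities = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            probabilities.put(e.getKey(), e.getValue() / (double) total);
        }
        return probabilities;
    }

    public static void main(String[] args) {
        // Hypothetical input: four tokens, one of them repeated.
        System.out.println(relativeFrequencies("张|爱|玲|张|"));
    }
}
```

For the sample input, 张 gets 0.5 and 爱 and 玲 get 0.25 each. A HashMap does not guarantee any print order.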


5. Calculate relative entropy

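/* Computes the relative entropy D(p||q) of two texts */
public static float CalKL(ArrayList<Entity> p, ArrayList<Entity> q)
{
    float kl = 0;

    float infinity = 10000000;   // large finite value standing in for infinity
    double accretion = infinity; // contribution of the current token; stays "infinite" until a match is found
    // For each token in p, look up the probability of the same token in q.
    // If it is found, accretion becomes p(x) * log(p(x) / q(x)) and is added to the relative entropy;
    // if it is not found, "infinity" is added instead.
    for (int i = 0; i < p.size(); i++)
    {
        if (q.size() != 0)
        {
            for (int j = q.size() - 1; j >= 0; j--)
            {
                if (p.get(i).word.equals(q.get(j).word))
                {
                    accretion = p.get(i).pValue * Math.log(p.get(i).pValue / q.get(j).pValue);
                    // q.remove(j); // left disabled: it would modify the caller's list
                    break;
                }
            }

            kl += accretion;
            accretion = infinity; // reset for the next token
        }
    }

    return kl;
}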

}
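CalKL searches q linearly for every token of p. The sketch below shows the same rule, including the finite penalty that stands in for infinity, with hash lookups and a double accumulator. It is an illustration under assumptions: the class name, the method `penalizedKl`, and the use of maps instead of `ArrayList<Entity>` are not part of the original code.

```java
import java.util.Map;

class PenalizedKlSketch {
    // p and q map each token to a strictly positive probability.
    // A token of p that is missing from q contributes `penalty` instead of infinity.
    static double penalizedKl(Map<String, Double> p, Map<String, Double> q, double penalty) {
        double kl = 0.0;
        for (Map.Entry<String, Double> e : p.entrySet()) {
            Double qx = q.get(e.getKey());
            if (qx == null) {
                kl += penalty; // token unseen in q
            } else {
                double px = e.getValue();
                kl += px * Math.log(px / qx);
            }
        }
        return kl;
    }

    public static void main(String[] args) {
        // Hypothetical distributions: the token 爱 occurs in p but not in q.
        Map<String, Double> p = Map.of("张", 0.5, "爱", 0.5);
        Map<String, Double> q = Map.of("张", 1.0);
        System.out.println(penalizedKl(p, q, 1.0e7));
    }
}
```

Here one token is missing from q, so the result is 10000000 plus the small term 0.5 * ln(0.5) contributed by the shared token. The penalty decides the order of magnitude.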


Results analysis
Code of the main function:

public static void main(String[] args) throws IOException
{
    ArrayList<Entity> enList1 = CalcuP("C:/Users/liuyu/workspace/KL/KL/zhangailing.txt");
    ArrayList<Entity> enList2 = CalcuP("C:/Users/liuyu/workspace/KL/KL/zhangailing2.txt");
    ArrayList<Entity> enList3 = CalcuP("C:/Users/liuyu/workspace/KL/KL/maozedong.txt");
    double f1 = CalKL(enList1, enList2);
    double f2 = CalKL(enList2, enList1);
    double f3 = CalKL(enList1, enList3);
    double f4 = CalKL(enList3, enList1);
    double f5 = CalKL(enList2, enList3);
    double f6 = CalKL(enList3, enList2);
    // Article titles (translated) used in the printed labels
    String secret = "\"What 'Secret' of Zhang Ailing Did 'Little Reunion' Reveal?\"";
    String dream = "\"'Little Reunion': A Dream of Zhang Ailing\"";
    String mao = "\"The Secret Intelligence War between Mao and Chiang Kai-shek before the Chongqing Negotiations in 1945\"";
    System.out.println("KL distance between " + secret + " and " + dream + ": " + f1);
    System.out.println("KL distance between " + dream + " and " + secret + ": " + f2);
    System.out.println("KL distance between " + secret + " and " + mao + ": " + f3);
    System.out.println("KL distance between " + mao + " and " + secret + ": " + f4);
    System.out.println("KL distance between " + dream + " and " + mao + ": " + f5);
    System.out.println("KL distance between " + mao + " and " + dream + ": " + f6);
}


In the results below, "Zhang Secret" stands for "What 'Secret' of Zhang Ailing Did 'Little Reunion' Reveal?", "Zhang Dream" for "'Little Reunion': A Dream of Zhang Ailing", and "Mao" for "The Secret Intelligence War between Mao and Chiang Kai-shek before the Chongqing Negotiations in 1945".

a. The results calculated in units of characters are as follows:

KL distance between "Zhang Secret" and "Zhang Dream": 2.269998592E9
KL distance between "Zhang Dream" and "Zhang Secret": 4.099975168E9
KL distance between "Zhang Secret" and "Mao": 3.029988864E9
KL distance between "Mao" and "Zhang Secret": 4.289972736E9
KL distance between "Zhang Dream" and "Mao": 4.10997504E9
KL distance between "Mao" and "Zhang Dream": 3.539982336E9

b. The results calculated in units of words are as follows:

KL distance between "Zhang Secret" and "Zhang Dream": 5.629955584E9
KL distance between "Zhang Dream" and "Zhang Secret": 8.62991872E9
KL distance between "Zhang Secret" and "Mao": 6.50994432E9
KL distance between "Mao" and "Zhang Secret": 8.029924864E9
KL distance between "Zhang Dream" and "Mao": 9.219941376E9
KL distance between "Mao" and "Zhang Dream": 7.739928576E9

From the above results: the distance between "Zhang Secret" and "Zhang Dream" is the smallest, and the distance between "Mao" and "Zhang Dream" is close to the distance between "Mao" and "Zhang Secret". In every pair the two directions give different values, which shows the asymmetry D(p||q) != D(q||p). Because CalKL adds 10000000 for every token of p that does not occur in q, the magnitudes mainly reflect how many such tokens there are.
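One detail about the size of these numbers is worth illustrating. CalKL accumulates into a float and adds 10000000 for each token missing from q, and the first result, 2.269998592E9, is close to 227 times that penalty. At this magnitude a float cannot represent small increments, so the genuine p(x) * log(p(x) / q(x)) terms are largely lost. The sketch below demonstrates the effect; the class name and the helper `isAbsorbed` are assumptions made for this illustration.

```java
class FloatAbsorptionDemo {
    // Returns true when adding `term` to the float `total` leaves it unchanged.
    static boolean isAbsorbed(float total, double term) {
        float sum = total;
        sum += term; // compound assignment narrows the double result back to float
        return sum == total;
    }

    public static void main(String[] args) {
        float penalties = 227 * 10_000_000f; // roughly the size of the first reported result
        System.out.println(Math.ulp(penalties));        // gap between neighbouring floats here
        System.out.println(isAbsorbed(penalties, 0.5)); // is a term of 0.5 lost?
    }
}
```

This prints 256.0 and true: near 2.27E9 neighbouring float values are 256 apart, so a contribution of 0.5 disappears entirely. Accumulating in a double would keep such terms, although the penalty would still dominate the total.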

Another point is how Java passes parameters. Primitive types are passed by value. For objects, the reference itself is passed by value, so the method works on the same object as the caller (handle objects in MATLAB behave in a similar way). So, in

double f1=CalKL(enList1,enList2);
  double f2=CalKL(enList2,enList1);
  double f3=CalKL(enList1,enList3);

if the CalKL function changed the contents of enList1 or enList2, for example by removing elements, the later calls would give incorrect results.
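The effect can be reproduced with a few lines. In this sketch the class and method names are hypothetical: one method mutates the list it receives, the other only reassigns its own parameter.

```java
import java.util.ArrayList;
import java.util.List;

class ReferenceSemanticsDemo {
    // Mutates the object the caller's reference points to.
    static void removeFirst(List<String> list) {
        list.remove(0);
    }

    // Only rebinds the local parameter; the caller's list is untouched.
    static void reassign(List<String> list) {
        list = new ArrayList<>();
    }

    public static void main(String[] args) {
        List<String> tokens = new ArrayList<>(List.of("张", "爱", "玲"));
        removeFirst(tokens);
        System.out.println(tokens.size()); // the removal is visible to the caller
        reassign(tokens);
        System.out.println(tokens.size()); // the reassignment is not
    }
}
```

Both lines print 2: the removal inside `removeFirst` changes the caller's list, while the assignment inside `reassign` does not. This is why a call such as q.remove(j) inside CalKL would corrupt the distributions used by the following calls.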

 

http://www.javaeye.com/topic/609462


Origin blog.csdn.net/jrckkyy/article/details/5956905