Python implementation of discovering character relationships based on co-occurrence

Reference:
"Extracting the character relationships in Train to Busan and drawing a beautiful network diagram with Python's networkx"

1. Co-occurrence relationship

In bibliometrics, keyword co-word analysis is often used to determine the relationships between topics in the field covered by a body of literature. Here, we want to analyze the relationships between the characters of a novel or script by analyzing its text. The two problems have much in common.

Generally, we assume that two characters appearing in the same paragraph have some kind of relationship, which fixes the overall flow of the program: first segment the text into words, extract the character names in each paragraph, then count how many times each pair of characters appears in the same paragraph, and store the results in a two-dimensional matrix. This matrix is also the adjacency matrix of the relationship graph, and its elements (the co-occurrence counts) are the edge weights.

For example, if the segmentation results of three paragraphs are a/b/c, b/a/f, and a/d/c, then the pair ab occurs 2 times, ac occurs 2 times, and so on.
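The counting step described above can be sketched with the standard library alone (the toy paragraphs reproduce the a/b/c example; the function name is an illustration, not from the original code):

```python
from collections import Counter
from itertools import combinations

# Toy paragraphs from the example above, already segmented into name lists
paragraphs = [["a", "b", "c"], ["b", "a", "f"], ["a", "d", "c"]]

def cooccurrence(paragraphs):
    """Count how often each unordered pair of names shares a paragraph."""
    counts = Counter()
    for names in paragraphs:
        # sorted(set(...)) makes each pair order-independent and unique
        # within a paragraph, so (a, b) and (b, a) count as the same edge
        for pair in combinations(sorted(set(names)), 2):
            counts[pair] += 1
    return counts

counts = cooccurrence(paragraphs)
print(counts[("a", "b")])  # 2
print(counts[("a", "c")])  # 2
```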

At the same time, for convenience, we also record the character relationships in a file. The relationships analyzed here come from the novel In the Name of the People.

2. jieba word segmentation

For the principles and usage of jieba word segmentation, refer to the article "Basic Principles of Chinese Word Segmentation and Usage of Jieba Word Segmentation".

Even with jieba available to analyze the text, the segmentation is still not very accurate. For example, In the Name of the People has a character called "Yi Xue": "Yi" reads as an adverb and "Xue" as a verb, so it is hard for the segmenter to keep this name as a single token. Fortunately, jieba supports a custom dictionary, and we can refine it bit by bit based on earlier segmentation results. When building the custom dictionary, the easiest approach is to copy the cast list of In the Name of the People directly and tag every entry as nr (person name).
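A custom dictionary file might look like the sketch below (the names here are generic placeholders, not an actual cast list). Each line holds a word, an optional frequency, and an optional POS tag, and the file is loaded with jieba.load_userdict('userdict.txt') before segmentation:

```
# userdict.txt — one entry per line: "word [frequency] [POS tag]"
张三 10 nr
李四 10 nr
```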

In this way, we can segment the text first and then filter by part of speech to keep only the names. The filtered names of each paragraph are recorded in a list for building the matrix later.

This process runs paragraph by paragraph, so we can keep a global dictionary that records the weight of each character (i.e., word-frequency statistics). The code is as follows:

import jieba.posseg as pseg

Names = {}  # global frequency dict: name -> number of occurrences
Lines = []  # per-paragraph lists of extracted names

# Segment the script, keep only words tagged as person names,
# and drop stopwords and punctuation.
# Each extracted name is also counted in the Names dict, whose keys
# will become the rows and columns of the matrix.
def cut_word(text):
    words = pseg.cut(text)
    L_name = []
    for x in words:
        if x.flag != 'nr' or len(x.word) < 2:
            continue
        if not Names.get(x.word):
            Names[x.word] = 1
        else:
            Names[x.word] = Names[x.word] + 1
        L_name.append(x.word)
    return L_name

# Build the frequency dict and the list of names in each paragraph
def namedict_built():
    global Names
    with open('e:/PY/relationship_find/test.txt', 'r') as f:
        for l in f.readlines():
            n = cut_word(l)
            if len(n) >= 2:  # relationships need at least two names, so empty and single-name lists are useless
                Lines.append(n)
    # keep only the 36 most frequent names
    Names = dict(sorted(Names.items(), key=lambda x: x[1], reverse=True)[:36])
    # print(Lines)

3. Build the matrix

Although we speak of a matrix, the code actually uses a two-dimensional dictionary (a dict of dicts), because access by name is faster. The statistics themselves are simple and crude: just traverse the per-paragraph name lists obtained above.

Since the segmentation results always contain some strange words, when constructing the matrix we use the names in the Names dict from the code above as the benchmark and filter out any word not in Names; otherwise other things sneak in. The code is as follows:

relationships = {}  # dict-of-dicts co-occurrence matrix: name1 -> {name2: weight}

# Build the co-occurrence matrix by traversing Lines
def relation_built():
    for key in Names:
        relationships[key] = {}
    for line in Lines:
        for name1 in line:
            if not Names.get(name1):
                continue
            for name2 in line:
                if name1 == name2 or (not Names.get(name2)):
                    continue
                if not relationships[name1].get(name2):
                    relationships[name1][name2] = 1
                else:
                    relationships[name1][name2] = relationships[name1][name2] + 1
    # print(relationships)
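The drawing step in the next section reads its edges from edge.txt, but the original post does not show how that file is produced. A minimal sketch, assuming the dict-of-dicts built above and the space-separated "name1 name2 weight" format that Graph_show() parses (the toy relationships dict is a stand-in):

```python
# Toy stand-in for the relationships dict built by relation_built()
relationships = {"A": {"B": 3, "C": 1}, "B": {"A": 3}, "C": {"A": 1}}

# The dict stores both directions of each relationship; keep each
# undirected edge once and emit "name1 name2 weight" lines
edge_lines = []
seen = set()
for name1, neighbours in relationships.items():
    for name2, weight in neighbours.items():
        if (name2, name1) in seen:
            continue
        seen.add((name1, name2))
        edge_lines.append(f"{name1} {name2} {weight}")

with open("edge.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(edge_lines))
```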

4. Drawing with networkx + matplotlib

With the relationships matrix from the previous step, we can draw a network graph with weighted edges. There are countless online tutorials on this kind of plotting, so the details are omitted. The code looks roughly like this:

import matplotlib as mpl
import matplotlib.pyplot as plt
import networkx as nx

def Graph_show():
    mpl.rcParams['font.sans-serif'] = ['FangSong']  # set the default font so Chinese labels render
    mpl.rcParams['axes.unicode_minus'] = False  # keep the minus sign from rendering as a box when saving figures
    G = nx.Graph()
    # In NetworkX, a node can be any hashable object: a string, an image,
    # an XML object, even another graph or a custom node object
    with open('e:/PY/relationship_find/edge.txt', 'r') as f:
        for i in f.readlines():
            line = str(i).split()
            if line == []:
                continue
            if int(line[2]) <= 50:  # drop weak relationships
                continue
            G.add_weighted_edges_from([(line[0], line[1], int(line[2]))])
    nx.draw(G, pos=nx.shell_layout(G), node_size=1000, node_color='#A0CBE2', edge_color='#A0CBE1', with_labels=True, font_size=12)
    plt.show()

The figure is done. It's ugly, to be honest, but at least it's a usable picture.
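One common way to make the plot less ugly, which is an extra tweak not in the original post, is to vary the line width of each edge with its weight. A sketch of computing per-edge widths (the toy edge list is an assumption, in the same (name1, name2, weight) form used above):

```python
# Toy weighted edges standing in for the character network
edges = [("A", "B", 100), ("B", "C", 60), ("A", "C", 20)]

# Scale widths into (0, 4] so the strongest relationship gets the thickest line
max_weight = max(w for _, _, w in edges)
widths = [4 * w / max_weight for _, _, w in edges]

# These widths would then be passed to networkx as, e.g.:
# nx.draw(G, pos=nx.shell_layout(G), width=widths, with_labels=True)
```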
