Extracting keywords from posts with openpyxl and jieba in Python 3: processing non-democratic posts

20180421 Work Notes

1. Work

Last time I noted that the 500-plus newly added non-democratic sub-topic posts contain many words related to "corruption". This time those posts are filtered out.

1. Screening

First, the filter condition: if a word such as "corruption" appears in a post, the post is not selected as a "non-democratic sub-topic post". For this I again prepared a spreadsheet of more than 1000 posts related to the legal system as the filter input, "fz.xlsx".

Filter words:

L2=['腐败','反腐','受贿','贿赂','贪污','监督','反腐倡廉','行贿','反腐倡廉','廉洁','廉政','落马','自由','民主制度','言论自由']

The screening approach is roughly the same as last time: extract each post's keywords and compare them against the filter words.

from openpyxl import load_workbook
from openpyxl import Workbook
import jieba.analyse

# Input: posts about the legal system, one post per cell in column A
wr=load_workbook('fz.xlsx')
osheet=wr.active

# Output workbook for the posts that pass the filter
ww=Workbook()
asheet=ww.active
asheet.title="fa"

print(osheet.max_row)
L1=[]  # posts that pass the filter
L2=['腐败','反腐','受贿','贿赂','贪污','监督','反腐倡廉','行贿','反腐倡廉','廉洁','廉政','落马','自由','民主制度','言论自由']
L3=[]  # keywords of the posts that pass the filter
for i in osheet["A"]:
    content=str(i.value)
    keywords=jieba.analyse.extract_tags(content,topK=1000)
    # Keep the post only if none of its keywords is a filter word
    f1=True
    for j in keywords:
        if j in L2:
            f1=False
            break
    if f1:
        L1.append(content)
        L3.append(keywords)
print(len(L1))

# Write the kept posts to column A, one post per row
n=1
for i in L1:
    asheet["A%d" % n].value=str(i)
    n=n+1

ww.save('fzresult.xlsx')
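
The keyword comparison above could also be written as a set test. This is only a minimal sketch, not the script actually used: the helper name `passes_filter` and the sample keyword lists are made up for illustration, and in the real script the keywords would come from `jieba.analyse.extract_tags`.

```python
# Filter words (a subset of the L2 list above)
BLOCKED = {'腐败', '反腐', '受贿', '贿赂', '贪污'}

def passes_filter(keywords, blocked=BLOCKED):
    """Return True if none of the keywords is a filter word."""
    return blocked.isdisjoint(keywords)

# Hypothetical keyword lists standing in for extract_tags() output
posts = {
    'post 1': ['法律', '制度', '改革'],
    'post 2': ['反腐', '法律'],
}
kept = [name for name, kw in posts.items() if passes_filter(kw)]
print(kept)  # ['post 1']
```

The set test replaces the flag variable and the inner loops, and it stops as soon as one filter word is found.
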
2. Some modifications

At the graduation project meeting, my advisor said that selecting more keywords would make the results more accurate, so I changed the loop condition that was previously hard-coded to 100 keywords.

My column-name generation covers the 26 single-letter columns plus 26^2 = 676 two-letter columns, far more than the roughly 200 needed, so it requires no modification.
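
As a quick cross-check of that capacity, the same naming scheme can be enumerated with the standard library. This is a sketch for illustration only; the variable names are not from the original script.

```python
from itertools import product
from string import ascii_uppercase

# Single-letter names A..Z, then two-letter names AA..ZZ in spreadsheet order
single = list(ascii_uppercase)
double = [a + b for a, b in product(ascii_uppercase, repeat=2)]
names = single + double

print(len(names))   # 702
print(names[202])   # GU, the 203rd column (200 keywords + 3 labels)
```

openpyxl also provides `openpyxl.utils.get_column_letter` for converting a 1-based column index to its letters.
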

The main change is that the Label values are now appended as well.

O1=alphabet[e]+alphabet[m-3]
O2=alphabet[e]+alphabet[m-2]
O3=alphabet[e]+alphabet[m-1]

O1, O2 and O3 hold the names of the last three columns, computed dynamically because the number of keywords is not fixed.

L5=[]
L5.append(O1)
L5.append(O2)
L5.append(O3)
L6=['B','C','D']
k=0
for j in L5:
    n=2
    c=1
    print(j)
    for i in osheet2[L6[k]]:
        if c>1:
            asheet["%s%d" % (j,n)].value=i.value
            n=n+1
        c=c+1
    k=k+1

Finally, the values of columns B, C and D of the input sheet are read and copied into those three columns.
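
One caveat with `alphabet[m-3]`: when `m` is smaller than 3, the negative index wraps around to the end of the list instead of raising an error. With 203 header cells this does not happen (the header loop ends with e=6 and m=21, giving GS, GT and GU), but deriving the names from the total column count avoids the special case. The following is a sketch under that assumption; `column_letter` and `last_columns` are hypothetical helper names.

```python
import string

def column_letter(index):
    """Convert a 1-based column index to letters (1 -> 'A', 27 -> 'AA')."""
    if index < 1:
        raise ValueError("column index starts at 1")
    letters = ""
    while index > 0:
        index, rem = divmod(index - 1, 26)
        letters = chr(ord('A') + rem) + letters
    return letters

def last_columns(total, count=3):
    """Letters of the last `count` columns when `total` columns are in use."""
    return [column_letter(i) for i in range(total - count + 1, total + 1)]

alphabet = list(string.ascii_uppercase)
# With m=1 the original expression silently picks 'Y'
print(alphabet[1 - 3])     # Y
print(last_columns(203))   # ['GS', 'GT', 'GU']
print(last_columns(28))    # ['Z', 'AA', 'AB']
```
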

from openpyxl import load_workbook
from openpyxl import Workbook
import jieba.analyse

wr=load_workbook('sta.xlsx')
osheet=wr.active
orow=osheet.max_row

ww=Workbook()
asheet=ww.active
asheet.title="ml"

# Build the list ['A', ..., 'Z'] used to compose column names
alphabet=[chr(ord('A')+i) for i in range(26)]

k=0
n=1

e=0
m=0

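num=200    # number of keywords to read from column A of sta.xlsx
tempL=[]   # header row: the keywords plus the three extra column names
testL=[]   # keywords only, used for matching against each post
tempc=0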
for i in osheet["A"]:
    if tempc<num:
        tempL.append(i.value)
        testL.append(i.value)
    else:
        break
    tempc=tempc+1

tempL.append("民主制度")
tempL.append("言论自由")
tempL.append("民主监督")

num2=num+3
for i in tempL:
    if k<=num2:
        if k<26:
            a=alphabet[k]
            asheet["%s%d" % (a,n)].value=str(i)
        else:
            if m==26:
                m=0
                e=e+1
            b=alphabet[e]
            c=alphabet[m]
            d=b+c
            asheet["%s%d" % (d,n)].value=str(i)
            
            m=m+1
    else:
        break
    k=k+1

O1=alphabet[e]+alphabet[m-3]
O2=alphabet[e]+alphabet[m-2]
O3=alphabet[e]+alphabet[m-1]
ww.save('t2.xlsx')

wr2=load_workbook('FMLin.xlsx')
osheet2=wr2.active
print(osheet2.max_row)
L1=[]

for i in osheet2["A"]:
    k=0
    content=str(i.value)
    keywords=jieba.analyse.extract_tags(content,topK=1000)
    L1.append(keywords)
L1.pop(0)  # the first entry is empty

count=0
L3=[]
L2=[]
flag=False

for i in L1:
    L2=[]
    for g in testL:
        flag=False
        for j in i:
            if g==j:
                flag=True
        if flag:
            L2.append(1)
        else:
            L2.append(0)

    L3.append(L2)
k=0
n=2

for j in L3:
    e=0
    m=0
    for i in j:
        if k<=num2:
            if k<26:
                a=alphabet[k]
                asheet["%s%d" % (a,n)].value=i
            else:
                if m==26:
                    m=0
                    e=e+1
                b=alphabet[e]
                c=alphabet[m]
                d=b+c
                asheet["%s%d" % (d,n)].value=i
                m=m+1
            
        k=k+1
    n=n+1
    k=0
print(n)

L5=[]
L5.append(O1)
L5.append(O2)
L5.append(O3)
L6=['B','C','D']
k=0
for j in L5:
    n=2
    c=1
    print(j)
    for i in osheet2[L6[k]]:
        if c>1:
            asheet["%s%d" %(j,n)].value=i.value
            n=n+1
        c=c+1
    k=k+1
ww.save('t2.xlsx')
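
The 0/1 rows built in the nested loops above amount to a binary bag-of-words encoding: one column per vocabulary keyword, set to 1 if the post's extracted keywords contain it. A compact sketch of that step, with a made-up vocabulary and made-up keyword lists in place of `testL` and the `extract_tags` output (the function name `encode` is also hypothetical):

```python
def encode(post_keywords, vocabulary):
    """Return a 0/1 list marking which vocabulary words occur in the post."""
    present = set(post_keywords)
    return [1 if word in present else 0 for word in vocabulary]

vocabulary = ['民主', '法律', '选举']           # stands in for testL
posts = [['法律', '改革'], ['民主', '选举']]    # stands in for L1
rows = [encode(kw, vocabulary) for kw in posts]
print(rows)  # [[0, 1, 0], [1, 0, 1]]
```
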

The FMLin table consists of 500 democracy-related posts, 500 unrelated posts, and their Label levels:

FMLin.xlsx:

t2.xlsx:

2. Summary and reflection

This took me the whole afternoon and evening, which was a waste of time... From now on, when I come to the lab I should avoid distractions, read more articles and code, learn more, and not be lazy or waste time.

3. The next task

Continue learning how to use scikit-learn. The teacher wants to see a demonstration on Monday, and I am panicking... To be honest, I don't know how to present this "algorithm application" of mine at the defense... The teacher also won't let me demonstrate it in a console window (the "black frame")... Either way, the priority is to get results first.
