Using openpyxl and jieba in Python 3 to extract keywords from posts: processing non-democratic posts
20180421 Work Notes
1. Work
Last time I noted that the 500+ newly added non-democratic sub-topic posts contain a large number of words related to "corruption". This time those posts are filtered out.
1. Screening
First, the filter condition: if a word such as "corruption" appears in a post, that post is not selected as a "non-democratic sub-topic post". For this I prepared a spreadsheet of more than 1000 posts related to the legal system as the filter input: "fz.xlsx".
The filter word list:
L2=['腐败','反腐','受贿','贿赂','贪污','监督','反腐倡廉','行贿','反腐倡廉','廉洁','廉政','落马','自由','民主制度','言论自由']
The screening approach is roughly the same as last time: extract each post's keywords and compare them against the filter words.
from openpyxl import load_workbook
from openpyxl import Workbook
import jieba.analyse
# Input workbook (posts in column A) and output workbook
wr=load_workbook('fz.xlsx')
osheet=wr.active
orow=osheet.max_row
ww=Workbook()
asheet=ww.active
asheet.title="fa"
print(osheet.max_row)
L1=[]
L2=['腐败','反腐','受贿','贿赂','贪污','监督','反腐倡廉','行贿','反腐倡廉','廉洁','廉政','落马','自由','民主制度','言论自由']
L3=[]
f1=True
for i in osheet["A"]:
    content=str(i.value)
    keywords=jieba.analyse.extract_tags(content,topK=1000)
    # Reject the post if any extracted keyword is a filter word
    for j in keywords:
        for k in L2:
            if j==k:
                f1=False
    if f1:
        L1.append(content)
        L3.append(keywords)
    f1=True
print(len(L1))
a="A"
b="B"
n=1
# Write the surviving posts to column A of the output sheet
for i in L1:
    asheet["%s%d" % (a,n)].value=str(i)
    n=n+1
ww.save('fzresult.xlsx')
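The nested comparison loops above amount to asking whether a post's keyword list shares any word with the filter list. A minimal sketch of that check using a set follows. `BLOCKED` and `keep_post` are hypothetical names, and the keyword lists are hard-coded stand-ins for what `jieba.analyse.extract_tags` would return.

```python
# Hypothetical helper: set-based version of the keyword comparison.
BLOCKED = {'腐败', '反腐', '受贿', '贿赂', '贪污', '监督', '反腐倡廉',
           '行贿', '廉洁', '廉政', '落马', '自由', '民主制度', '言论自由'}

def keep_post(keywords, blocked=BLOCKED):
    """Return True when none of the extracted keywords is a filter word."""
    return blocked.isdisjoint(keywords)

# Stand-in keyword lists (assumed data); in the script they would come
# from jieba.analyse.extract_tags(content, topK=1000).
posts = {
    "post 1": ['法律', '法院', '判决'],
    "post 2": ['法律', '腐败', '官员'],
}
kept = [name for name, kws in posts.items() if keep_post(kws)]
print(kept)
```

Using a set avoids resetting a flag variable after every post and makes the duplicate entry in the word list harmless.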
2. Some modifications
At the project-completion meeting, my instructor said that selecting more keywords would make the results more accurate, so I changed the loop condition, which was previously fixed at 100 keywords.
The column-letter scheme covers the 26 single-letter columns plus 26^2 two-letter columns, far more than 200, so it needed no modification.
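As an illustration of the column-letter scheme, here is a minimal sketch that converts a zero-based column index into a spreadsheet column name. `column_letter` is a hypothetical helper written for this note, not the code used in the script. openpyxl also ships `openpyxl.utils.get_column_letter`, which does the same job for a 1-based index.

```python
def column_letter(index):
    """Convert a zero-based column index to a spreadsheet column name."""
    if index < 0:
        raise ValueError("index must be non-negative")
    n = index + 1  # spreadsheets count columns from 1
    letters = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters = chr(ord('A') + rem) + letters
    return letters

total = 200 + 3  # 200 keywords plus 3 extra columns, as in these notes
last_three = [column_letter(total - 3 + i) for i in range(3)]
print(last_three)  # ['GS', 'GT', 'GU']
```

Computing the name from the index means the last three columns can be found from the column count alone, without tracking two separate letter counters.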
The main change is that the Label values are now written into the output as well:
O1=alphabet[e]+alphabet[m-3]
O2=alphabet[e]+alphabet[m-2]
O3=alphabet[e]+alphabet[m-1]
O1, O2, and O3 give the names of the last three columns, computed dynamically because the number of keywords is not fixed.
L5=[]
L5.append(O1)
L5.append(O2)
L5.append(O3)
L6=['B','C','D']
k=0
for j in L5:
    n=2
    c=1
    print(j)
    # Copy column L6[k] of FMLin.xlsx into output column j, skipping the header row
    for i in osheet2[L6[k]]:
        if c>1:
            asheet["%s%d" % (j,n)].value=i.value
            n=n+1
        c=c+1
    k=k+1
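An alternative to addressing cells by column letter is to build each output row as a list, with the label values appended at the end. The sketch below only illustrates that idea: `build_rows` is a hypothetical name, and the header, feature, and label values are made up. Rows shaped like this could then be written one at a time with openpyxl's `Worksheet.append`.

```python
def build_rows(header, features, labels):
    """Join each feature row with its label values, under one header row."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same number of rows")
    rows = [list(header)]
    for feature_row, label_row in zip(features, labels):
        rows.append(list(feature_row) + list(label_row))
    return rows

# Made-up example data: 2 keyword columns followed by 3 label columns.
header = ['kw1', 'kw2', '民主制度', '言论自由', '民主监督']
features = [[1, 0], [0, 1]]
labels = [[2, 0, 1], [0, 0, 0]]
rows = build_rows(header, features, labels)
for row in rows:
    print(row)
```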
This reads the values of columns B, C, and D from FMLin.xlsx. The complete script:
from openpyxl import load_workbook
from openpyxl import Workbook
import jieba.analyse
wr=load_workbook('sta.xlsx')
osheet=wr.active
orow=osheet.max_row
ww=Workbook()
asheet=ww.active
asheet.title="ml"
# Build the list 'A'..'Z'
alphabet=[]
o='A'
for i in range(26):
    alphabet.append(o)
    p=ord(o)+1
    o=chr(p)
k=0
n=1
e=0
m=0
num=200
tempL=[]
testL=[]
tempc=0
# Take the first num keywords from column A of sta.xlsx
for i in osheet["A"]:
    if tempc<num:
        tempL.append(i.value)
        testL.append(i.value)
    else:
        break
    tempc=tempc+1
tempL.append("民主制度")
tempL.append("言论自由")
tempL.append("民主监督")
num2=num+3
# Write the header row: single-letter columns first, then two-letter columns
for i in tempL:
    if k<=num2:
        if k<26:
            a=alphabet[k]
            asheet["%s%d" % (a,n)].value=str(i)
        else:
            if m==26:
                m=0
                e=e+1
            b=alphabet[e]
            c=alphabet[m]
            d=b+c
            asheet["%s%d" % (d,n)].value=str(i)
            m=m+1
    else:
        break
    k=k+1
# Names of the last three columns written
O1=alphabet[e]+alphabet[m-3]
O2=alphabet[e]+alphabet[m-2]
O3=alphabet[e]+alphabet[m-1]
ww.save('t2.xlsx')
wr2=load_workbook('FMLin.xlsx')
osheet2=wr2.active
print(osheet2.max_row)
L1=[]
# Extract keywords from every post in column A of FMLin.xlsx
for i in osheet2["A"]:
    k=0
    content=str(i.value)
    keywords=jieba.analyse.extract_tags(content,topK=1000)
    L1.append(keywords)
L1.pop(0)  # the first entry is empty
count=0
L3=[]
L2=[]
flag=False
# For each post, record 1 if a selected keyword occurs in it, else 0
for i in L1:
    L2=[]
    for g in testL:
        flag=False
        for j in i:
            if g==j:
                flag=True
        if flag:
            L2.append(1)
        else:
            L2.append(0)
    L3.append(L2)
k=0
n=2
# Write one 0/1 row per post, starting at row 2
for j in L3:
    e=0
    m=0
    for i in j:
        if k<=num2:
            if k<26:
                a=alphabet[k]
                asheet["%s%d" % (a,n)].value=i
            else:
                if m==26:
                    m=0
                    e=e+1
                b=alphabet[e]
                c=alphabet[m]
                d=b+c
                asheet["%s%d" % (d,n)].value=i
                m=m+1
        k=k+1
    n=n+1
    k=0
print(n)
L5=[]
L5.append(O1)
L5.append(O2)
L5.append(O3)
L6=['B','C','D']
k=0
# Copy columns B, C, D of FMLin.xlsx into the last three output columns
for j in L5:
    n=2
    c=1
    print(j)
    for i in osheet2[L6[k]]:
        if c>1:
            asheet["%s%d" %(j,n)].value=i.value
            n=n+1
        c=c+1
    k=k+1
ww.save('t2.xlsx')
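The core of the script is turning each post's keyword list into a row of 0/1 values, one per selected keyword. A minimal sketch of that step in isolation follows. `presence_vector` is a hypothetical name, and the vocabulary and keyword lists are made-up stand-ins for the contents of sta.xlsx and the output of `jieba.analyse.extract_tags`.

```python
def presence_vector(vocabulary, keywords):
    """Return a 0/1 list marking which vocabulary words occur in keywords."""
    found = set(keywords)
    return [1 if word in found else 0 for word in vocabulary]

# Made-up example data.
vocabulary = ['民主', '选举', '法律']
post_keywords = [['选举', '投票'], ['法律', '民主', '制度']]
matrix = [presence_vector(vocabulary, kws) for kws in post_keywords]
print(matrix)  # [[0, 1, 0], [1, 0, 1]]
```

Converting the keyword list to a set once per post replaces the innermost comparison loop with a membership test.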
The FMLin table consists of 500 democratic posts, 500 irrelevant posts, and their Label levels:
FMLin.xlsx:
t2.xlsx:
2. Summary and reflection
I worked on this all afternoon and evening, and it was a waste of time... From now on, when I come to the laboratory I should avoid distractions, read more articles and code, learn more, and not be lazy or waste time.
3. The next task
Continue learning how to use scikit-learn. On Monday, when the teacher watched the demonstration, I panicked... To be honest, I don't know how to present this "algorithm application" of mine at the defense... The teacher also won't let me demonstrate in a console window (the "black frame")... No matter what, the results come first.