Text-Analysis: Chinese text analysis
Text-Analysis includes analysis-word (word-level analysis) and analysis-classify (text-classification data analysis). It supports reading and writing Word docx files from Python 3 (including font/color/highlight handling), reading PDFs, and more.
analysis-word: word analysis
Introduction
analysis_word performs unsupervised analysis of a multi-file corpus (HTML/PDF/DOCX/DOC/TXT/MD). It supports docx highlight extraction and writing, new word discovery, Chinese word segmentation, TF-IDF, word vectors, word clustering, sentence clustering, and more.
Repository
github address: https://github.com/yongzhuo/Text-Analysis
Details
Description of each file (directory):
- keywords/keywords_common.txt common words; add your own, used to highlight matches in a distinct color
- keywords/keywords_field.txt existing domain-specific words; add your own, used to highlight matches in a distinct color
- keywords/keywords_newword.txt discovered new words, generated automatically, used to highlight matches in a distinct color
- w00_transfer_file_to_txt.py converts the files in a directory (HTML/PDF/DOCX/DOC/TXT/MD supported) into TXT files
- w01_find_newword_txt.py new word discovery; generates an xlsx file for each input file plus a summary xlsx, stored in the "new word discovery" directory
- w02_cut_word_to_txt.py Chinese word segmentation; generates a ".cut word.txt" file for each input file plus a summary "summary large text.md" file, stored in the "Chinese word segmentation" directory
- w03_transfer_to_doc.py segmentation to docx; generates a ".cut word.docx" file for each input file for easy highlight review, stored in the "Chinese word segmentation docx" directory
- w04_tfidf_xlsx.py TF-IDF; generates a "summary TFIDF.xlsx" file with tf/idf/tf-idf columns, stored in the "TFIDF" directory
- w05_train_w2v.py word vectors; trains w2v word vectors and generates a "w2v.vec" file, stored in the "word vector" directory
- w06_cluster_w2v.py word clustering; uses w2v word vectors as the distance measure, supports the K-MEANS/DBSCAN algorithms, generates an xlsx file, stored in the "word clustering" directory
- w07_cluster_sen.py sentence clustering; uses w2v word vectors/TF-IDF as the distance measure, supports the K-MEANS/DBSCAN algorithms, generates an xlsx file, stored in the "sentence clustering" directory
- w08_extract_highligt_doc.py highlight extraction; extracts the highlighted words/sentences marked in the original docx files, generates a "summary highlight text.md" file, stored in the "highlight corpus" directory
- w100_main.py main entry point
- word_discovery.py new word discovery code
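The new word discovery step (w01_find_newword_txt.py / word_discovery.py) scores candidate strings by internal solidity and boundary freedom. Below is a minimal stdlib sketch of the two classic statistics behind that idea, pointwise mutual information and left/right neighbour entropy, restricted to two-character candidates on an illustrative corpus; the repository's actual implementation is more general.

```python
import math
from collections import Counter

def entropy(counter):
    """Shannon entropy of a neighbour-character distribution."""
    total = sum(counter.values())
    return -sum(c / total * math.log(c / total) for c in counter.values())

def score_bigram(text, w):
    """PMI (solidity) and left/right neighbour entropy for a 2-char candidate."""
    chars = Counter(text)
    pairs = Counter(text[i:i + 2] for i in range(len(text) - 1))
    n = len(text)
    # high PMI: the pair co-occurs far more often than chance
    pmi = math.log((pairs[w] / (n - 1)) / ((chars[w[0]] / n) * (chars[w[1]] / n)))
    # high boundary entropy: the candidate appears in varied contexts
    left = Counter(text[i - 1] for i in range(1, len(text) - 1) if text[i:i + 2] == w)
    right = Counter(text[i + 2] for i in range(len(text) - 2) if text[i:i + 2] == w)
    return pmi, entropy(left), entropy(right)

corpus = "斗鱼是一种观赏鱼, 叉尾斗鱼好看, 叉尾斗鱼好养"
print(score_bigram(corpus, "斗鱼"))
```

A real run would aggregate these scores over the whole corpus and threshold them to produce the keywords/keywords_newword.txt list.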
Quick start
python w100_main.py
Usage
Simple example
1. Required: configure the directories
- 1.1 Configure the paths in text_analysis/word_analysis/w100_main.py, i.e. the raw-corpus directory (which must contain files) and the analysis-output directory
- 1.2 Alternatively, place the files (HTML/PDF/DOCX/DOC/TXT/MD supported) under text_analysis/data/corpus/原始语料
- Either 1.1 or 1.2 is sufficient
2. Optional: configure common/domain keywords
- keywords/keywords_common.txt common words; add your own, used to highlight matches in a distinct color
- keywords/keywords_field.txt existing domain-specific words; add your own, used to highlight matches in a distinct color
3. Required: run the main entry point, e.g. python w100_main.py
Code excerpts
Reading a Word document (docx) with Python 3 (segmented; includes text and table data)
def docx_read(path):
    """read paragraph data (text/table) from a Word docx document
    Args:
        path: String, input path, eg. "叉尾斗鱼介绍.docx"
    Returns:
        passages: List<str>, document paragraphs
    """
    def iter_tables(block_item_container):
        """Recursively generate all tables in `block_item_container`."""
        for t in block_item_container.tables:
            yield t
            for row in t.rows:
                for cell in row.cells:
                    yield from iter_tables(cell)
    passages = []
    try:
        import docx
        docx_temp = docx.Document(path)
        # text paragraphs
        for p in docx_temp.paragraphs:
            if p.text.strip():
                passages.append(p.text.strip() + "\n")
        # tables, including tables nested inside cells
        for table in iter_tables(docx_temp):
            df = [["" for _ in range(len(table.columns))] for _ in range(len(table.rows))]
            for i, row in enumerate(table.rows):
                for j, cell in enumerate(row.cells):
                    if cell.text:
                        df[i][j] = cell.text.replace("\n", "")
            df = [" ".join(dfi).strip() + "\n" for dfi in df]
            passages += df
    except Exception as e:
        logger.info(str(e))  # assumes a module-level logger is configured
    return passages
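The recursion in iter_tables works because python-docx exposes a .tables attribute on both the document body and on table cells, so tables nested inside cells are found too. The traversal can be checked with a minimal stand-in structure; the Node and Row classes below are illustrative, not python-docx types.

```python
# Minimal stand-in for the nested-table traversal in docx_read: both a
# "document" and a "cell" expose `.tables`, so the generator can recurse.
class Node:
    def __init__(self, name, tables=()):
        self.name = name
        self.tables = list(tables)
        self.rows = []          # a table's rows; empty for leaf nodes

class Row:
    def __init__(self, cells):
        self.cells = cells

def iter_tables(container):
    for t in container.tables:
        yield t
        for row in t.rows:
            for cell in row.cells:
                yield from iter_tables(cell)

inner = Node("inner")                      # table nested inside a cell
cell = Node("cell", tables=[inner])
outer = Node("outer")
outer.rows = [Row([cell])]
doc = Node("doc", tables=[outer])
print([t.name for t in iter_tables(doc)])  # outer first, then inner
```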
Reading a PDF with Python 3 (segmented)
def pdf_read(path):
    """read text from a pdf document
    Args:
        path: String, input path, eg. "叉尾斗鱼介绍.pdf"
    Returns:
        passages: String, the extracted text (empty on failure)
    """
    passages = ""
    try:
        from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
        from pdfminer.converter import TextConverter
        from pdfminer.pdfdocument import PDFDocument
        from pdfminer.pdfparser import PDFParser
        from pdfminer.pdfpage import PDFPage
        from pdfminer.layout import LAParams
        from io import StringIO
        output_string = StringIO()
        with open(path, "rb") as prb:
            parser = PDFParser(prb)
            doc = PDFDocument(parser)
            rsrcmgr = PDFResourceManager()
            device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
            interpreter = PDFPageInterpreter(rsrcmgr, device)
            for page in PDFPage.create_pages(doc):
                interpreter.process_page(page)
        passages = output_string.getvalue()
    except Exception as e:
        logger.info(str(e))
    return passages
Reading highlighted text from a docx file with Python 3
def docx_extract_color(path, highligt_colors=["YELLOW"]):
    """extract the highlighted text in a docx document (tables are not handled)
    Args:
        path: String, input path, eg. "叉尾斗鱼介绍.docx"
        highligt_colors: List, highlight color names, eg. ["YELLOW", "RED"]
    Returns:
        passages: List<str>, highlighted runs
    """
    passages = []
    try:
        from docx.enum.text import WD_COLOR_INDEX
        import docx
        """
        AUTO = 'default'
        BLACK = 'black'
        BLUE = 'blue'
        BRIGHT_GREEN = 'green'
        DARK_BLUE = 'darkBlue'
        DARK_RED = 'darkRed'
        DARK_YELLOW = 'darkYellow'
        GRAY_25 = 'lightGray'
        GRAY_50 = 'darkGray'
        GREEN = 'darkGreen'
        PINK = 'magenta'
        RED = 'red'
        TEAL = 'darkCyan'
        TURQUOISE = 'cyan'
        VIOLET = 'darkMagenta'
        WHITE = 'white'
        YELLOW = 'yellow'
        """
        # resolve color names to enum members case-insensitively, dropping
        # unknown names; getattr keeps this working across python-docx versions
        wd_colors_select = [getattr(WD_COLOR_INDEX, hc.upper(), None) for hc in highligt_colors]
        wd_colors_select = [wcs for wcs in wd_colors_select if wcs is not None]
        document = docx.Document(path)
        for paragraph in document.paragraphs:
            for run in paragraph.runs:
                if run.font.highlight_color and run.font.highlight_color in wd_colors_select:
                    passages.append(run.text.strip() + "\n")
                elif not highligt_colors and not run.font.highlight_color:
                    # no colors requested: collect the un-highlighted runs
                    passages.append(run.text.strip() + "\n")
    except Exception as e:
        logger.info(str(e))
    return passages
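docx_extract_color resolves the caller's colour names case-insensitively and silently drops unknown names. The same filtering logic, shown with a plain stand-in class (the numeric values mirror WD_COLOR_INDEX's, but Colors itself is illustrative, not python-docx):

```python
# Stand-in for the colour-name filtering in docx_extract_color: names are
# upper-cased, looked up on the enum, and unknown names are dropped.
class Colors:
    YELLOW = 7
    RED = 6
    BRIGHT_GREEN = 4

highligt_colors = ["yellow", "Red", "nope"]
selected = [getattr(Colors, hc.upper(), None) for hc in highligt_colors]
selected = [s for s in selected if s is not None]
print(selected)  # [7, 6] -- "nope" is silently ignored
```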
Writing highlighted docx output
def write_txt_to_docx(path_out, texts, keywords_field=None, keywords_common=None, keywords_newword=None):
    """write segmented text to a docx document, highlighting keywords
    Args:
        path_out: String, output path, eg. "切词.docx"
        texts: List, segmented text, eg. ["你/是/谁/呀"]
        keywords_field: List, domain words, eg. ["金融社保卡"]
        keywords_common: List, common words, eg. ["北京"]
        keywords_newword: List, discovered new words
    Returns:
        None
    """
    from docx.enum.text import WD_COLOR_INDEX
    from docx.shared import Pt
    from docx import Document
    from tqdm import tqdm
    import re
    def extract_string(text, regx="###"):
        """extract the substrings enclosed between marker pairs
        Args:
            text: String, input text
            regx: String, marker string, eg. "###"
        Returns:
            text_ext: List<str>
        """
        pattern = r"{}(.*?){}".format(regx, regx)
        text_ext = re.findall(pattern, text)
        return text_ext
    # guard against None so the membership tests below cannot raise
    keywords_field = keywords_field or []
    keywords_common = keywords_common or []
    keywords_newword = keywords_newword or []
    document = Document()
    count = 0
    for passage in tqdm(texts):
        if not passage.strip():
            continue
        count += 1
        if count % 2 == 0:  # blank paragraph between every two passages
            document.add_paragraph("\n")
        document.add_paragraph(passage.replace("/", ""))
        document.add_paragraph("【")
        words = passage.split("/")
        par = document.paragraphs[-1]
        for w in words:
            par.add_run(text="/")
            par.add_run(text=w)
            run = par.runs[-1]
            # run.font.name = "宋体"  # font family
            run.font.size = Pt(10)  # font size
            if len(w) == 1:
                run.font.highlight_color = WD_COLOR_INDEX.DARK_YELLOW
            elif w in keywords_newword:
                run.font.highlight_color = WD_COLOR_INDEX.BRIGHT_GREEN
            elif w in keywords_field:
                run.font.highlight_color = WD_COLOR_INDEX.YELLOW
            elif w in keywords_common:
                run.font.highlight_color = WD_COLOR_INDEX.GREEN
            else:
                run.font.highlight_color = None  # WD_COLOR_INDEX.RED
    document.save(path_out)
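The extract_string helper defined inside write_txt_to_docx pulls out every substring enclosed by a marker pair; a standalone check of the non-greedy regex:

```python
import re

def extract_string(text, regx="###"):
    """Return every substring enclosed between a pair of `regx` markers."""
    pattern = r"{}(.*?){}".format(regx, regx)
    return re.findall(pattern, text)

print(extract_string("a###foo###b###bar###"))  # ['foo', 'bar']
```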
Acknowledgements
This library is inspired by and references the following frameworks:
- python-docx: https://github.com/python-openxml/python-docx
- pdfminer.six: https://github.com/pdfminer/pdfminer.six
- scikit-learn: https://github.com/scikit-learn/scikit-learn
- tqdm: https://github.com/tqdm/tqdm
Citation
To cite this work, you can refer to this GitHub project, for example with BibTeX:
@software{Text-Analysis,
  url = {https://github.com/yongzhuo/Text-Analysis},
  author = {Yongzhuo Mo},
  title = {Text-Analysis},
  year = {2021}
}
Available data exploration
1. Pure NLP task scenarios
1.0 General components (works with labeled or unlabeled corpora)
- Text length distribution: longest, shortest, median, mean, 90th/95th percentile; draw curves/box plots
- Label distribution: sample counts per label, frequency statistics of labels (and their synonyms) in the text
- New word discovery: compute tf/idf/tf-idf, left/right entropy, solidity, etc., and generate an xlsx file
- Keywords/key sentences: extract keywords with e.g. TextRank, sort by word frequency
- Co-occurrence analysis/frequent itemset mining (very time-consuming): sentence co-occurrence/association rules
- Clustering: k-means distance clustering, DBSCAN density clustering, etc.
- Sentiment/negativity analysis: positive/negative or multi-class prediction of given data, based on a sentiment dictionary or a trained model
- Dependency analysis: dependency parsing to count sentence patterns; also the newer AMR task
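The text-length-distribution item above can be sketched with the standard library alone; the sample sentences and the nearest-rank percentile helper are illustrative:

```python
# Quick profile of sentence lengths: min/max, mean, median, high percentiles.
import statistics

sentences = ["叉尾斗鱼", "叉尾斗鱼是一种观赏鱼", "好看", "叉尾斗鱼好养, 适合新手"]
lengths = sorted(len(s) for s in sentences)

def percentile(sorted_vals, p):
    """Nearest-rank percentile, sufficient for a quick profile."""
    k = max(0, min(len(sorted_vals) - 1, round(p * (len(sorted_vals) - 1))))
    return sorted_vals[k]

profile = {
    "min": lengths[0],
    "max": lengths[-1],
    "mean": statistics.mean(lengths),
    "median": statistics.median(lengths),
    "p90": percentile(lengths, 0.90),
    "p95": percentile(lengths, 0.95),
}
print(profile)
```

On a real corpus the same numbers would feed the curve/box plot step.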
1.1 Text classification tasks
- run a tfidf + lr baseline to gauge the difficulty of the task
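A tfidf + lr baseline of this kind is a few lines with scikit-learn (which this repository already references); the toy corpus and labels are illustrative, and real text would first go through Chinese word segmentation so tokens are space-separated:

```python
# tf-idf features + logistic regression: the classic first baseline for a
# text-classification task, to gauge how hard the task is.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["好看 的 观赏鱼", "观赏鱼 好养", "股票 大涨", "股票 下跌"]
labels = ["fish", "fish", "stock", "stock"]

baseline = make_pipeline(TfidfVectorizer(), LogisticRegression())
baseline.fit(texts, labels)
print(baseline.predict(["观赏鱼 好看"]))  # ['fish']
```

If this baseline already scores well, a heavier model may not be worth the cost; if it fails badly, the labels or the preprocessing deserve a look first.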
1.2 Entity tasks
- 1.2.1 Entity recognition
- Distribution statistics for each entity type
- Entity length distribution statistics
- 1.2.2 Entity linking
- Key-value frequency and length statistics
- 1.2.3 Entity disambiguation
- Statistics as in 1.2.1 entity recognition
1.3 Relation tasks
- 1.3.1 Joint entity-relation extraction
- Distribution statistics for each entity type
- Entity length distribution statistics
- Relation label length distribution
- Relation label category distribution
- 1.3.2 Relation prediction
- Similar to text classification tasks, plus entity length/distribution statistics
1.4 Text similarity tasks
- 1.4.1 Sentence similarity
- Lengths of the two sentences, number of categories
- 1.4.2 Word similarity
- Word-level statistics, similar to 1.4.1
1.5 Text summarization
- 1.5.1 Extractive
- Whether each sentence belongs to the summary (0/1), or a ranking of sentences
- 1.5.2 Abstractive
- Automatic generation
Hope this helps!