Chinese text analysis, Text-Analysis

Text-Analysis includes analysis-word (word analysis) and analysis-classify (text-classification data analysis), among other modules. It supports Python 3 reading and writing of Word docx files (including font/color/highlight), reading PDFs, and more.

analysis-word: word analysis

introduction

analysis_word performs unsupervised analysis of a multi-file corpus (HTML/PDF/DOCX/DOC/TXT/MD). It supports extraction, reading and writing of docx highlights, new-word discovery, Chinese word segmentation, TF-IDF, word vectors, word clustering, sentence clustering, and other functions.

address

github address: https://github.com/yongzhuo/Text-Analysis

details

Details of each file (folder)

  • keywords/keywords_common.txt: common words; add your own, used to highlight them in a distinct color
  • keywords/keywords_field.txt: existing domain words; add your own, used to highlight them in a distinct color
  • keywords/keywords_newword.txt: newly discovered words, generated automatically, used to highlight them in a distinct color
  • w00_transfer_file_to_txt.py: extracts the files in a directory (HTML/PDF/DOCX/DOC/TXT/MD supported) into TXT files
  • w01_find_newword_txt.py: new-word discovery; generates an xlsx file per input file plus a summary xlsx file, stored in the "New Word Discovery" directory
  • w02_cut_word_to_txt.py: Chinese word segmentation; generates a ".cut word.txt" file per input file plus a "summary large text.md" file, stored in the "Chinese word segmentation" directory
  • w03_transfer_to_doc.py: segmented docx; generates a ".cut word.docx" file per input file for easy highlighted viewing, stored in the "Chinese word segmentation docx" directory
  • w04_tfidf_xlsx.py: TF-IDF; generates a "summary TFIDF.xlsx" file with tf/idf/tf-idf columns, stored in the "TFIDF" directory
  • w05_train_w2v.py: word vectors; trains w2v word vectors and generates a "w2v.vec" file, stored in the "word vector" directory
  • w06_cluster_w2v.py: word clustering, with w2v word vectors as the distance; supports the K-MEANS/DBSCAN algorithms and generates an xlsx file, stored in the "word clustering" directory
  • w07_cluster_sen.py: sentence clustering, with w2v word vectors/TF-IDF as the distance; supports the K-MEANS/DBSCAN algorithms and generates an xlsx file, stored in the "sentence clustering" directory
  • w08_extract_highligt_doc.py: highlight extraction; extracts the highlighted words/sentences marked in the original docx files and generates a "Summary Highlight Text.md" file, stored in the "Highlight Corpus" directory
  • w100_main.py: main entry point
  • word_discovery.py: new-word discovery code

quick start

python w100_main.py

How to use

simple example

1. Required: configure the directories.
 - 1.1 You can set the paths in the text_analysis/word_analysis/w100_main.py file, including the raw-corpus directory (it must contain files) and the analysis-result directory.
 - 1.2 Alternatively, place the files (HTML/PDF/DOCX/DOC/TXT/MD supported) under the directory text_analysis/data/corpus/原始语料 (raw corpus).
 - Either 1.1 or 1.2 is enough.

2. Optional: configure common/domain keywords.
 - keywords/keywords_common.txt        common words; add your own to highlight them in a distinct color
 - keywords/keywords_field.txt         existing domain words; add your own to highlight them in a distinct color

3. Required: run the main function, e.g.  python w100_main.py
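A minimal sketch of the path configuration in step 1.1; the variable names here are hypothetical, so check the actual names in w100_main.py:

# hypothetical configuration inside text_analysis/word_analysis/w100_main.py;
# the real variable names in the repository may differ
path_corpus_origin = "text_analysis/data/corpus/原始语料"  # raw-corpus directory, must contain files
path_analysis_result = "text_analysis/data/corpus/analysis_result"  # analysis-result directory (assumed name)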

selected code

Python 3 reads a Word document in docx format (by paragraph, including text and table data)


import logging

logger = logging.getLogger(__name__)  # the snippets below report errors through this module-level logger


def docx_read(path):
    """Read the paragraph and table text of a Word document, docx
    read corpus from docx
    Args:
        path: String, path/origin text,  eg. "叉尾斗鱼介绍.docx"
    Returns:
        passages: List[str], the document passages
    """

    def iter_tables(block_item_container):
        """Recursively generate all tables in `block_item_container`."""
        for t in block_item_container.tables:
            yield t
            for row in t.rows:
                for cell in row.cells:
                    yield from iter_tables(cell)
    passages = []
    try:
        import docx
        docx_temp = docx.Document(path)
        # text paragraphs
        for p in docx_temp.paragraphs:
            if p.text.strip():
                passages.append(p.text.strip()+"\n")
        # tables
        for t in iter_tables(docx_temp):
            table = t
            df = [["" for i in range(len(table.columns))] for j in range(len(table.rows))]
            for i, row in enumerate(table.rows):
                for j, cell in enumerate(row.cells):
                    if cell.text:
                        df[i][j] = cell.text.replace("\n", "")
            df = [" ".join(dfi).strip() + "\n" for dfi in df]
            passages += df
    except Exception as e:
        logger.info(str(e))
    return passages
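A usage sketch (the sample file name comes from the docstring above):

# paragraphs are collected first, then one line per table row
passages = docx_read("叉尾斗鱼介绍.docx")
print("".join(passages))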

Python 3 reads a PDF (text extraction)

def pdf_read(path):
    """Read the text of a pdf document
    read corpus from pdf
    Args:
        path: String, path/origin text,  eg. "叉尾斗鱼介绍.pdf"
    Returns:
        passages: String, the full extracted text (an empty list on failure)
    """
    passages = []
    try:
        from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
        from pdfminer.converter import TextConverter
        from pdfminer.pdfdocument import PDFDocument
        from pdfminer.pdfparser import PDFParser
        from pdfminer.pdfpage import PDFPage
        from pdfminer.layout import LAParams
        from io import StringIO

        output_string = StringIO()
        with open(path, "rb") as prb:
            parser = PDFParser(prb)
            doc = PDFDocument(parser)
            rsrcmgr = PDFResourceManager()
            device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
            interpreter = PDFPageInterpreter(rsrcmgr, device)
            for page in PDFPage.create_pages(doc):
                interpreter.process_page(page)

        passages = output_string.getvalue()
    except Exception as e:
        logger.info(str(e))
    return passages
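Note that pdfminer's TextConverter accumulates the whole document into a single string, so pdf_read returns a str rather than a list. A usage sketch that splits it into rough paragraphs:

text = pdf_read("叉尾斗鱼介绍.pdf")
# split on blank lines to approximate paragraphs
paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]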

Python 3 reads highlighted text from a docx

def docx_extract_color(path, highligt_colors=["YELLOW"]):
    """Extract the highlighted text from a docx document (tables are not handled)
    docx extract color
    Args:
        path: String, path/origin text,  eg. "叉尾斗鱼介绍.docx"
        highligt_colors: List, enum/highlight colors,  eg. ["YELLOW", "RED"]
    Returns:
        passages: List[str], the highlighted passages
    """
    passages = []
    try:
        from docx.enum.text import WD_COLOR_INDEX
        import docx
        wd_colors_dict = {wcim.name: wcim.value
                          for wcim in WD_COLOR_INDEX.__members__[2:]}
        wd_colors_select = [wd_colors_dict.get(hc.upper(), "") for hc in highligt_colors
                            if wd_colors_dict.get(hc.upper(), "")]
        """
        # AUTO = 'default'
        BLACK = 'black'
        BLUE = 'blue'
        BRIGHTGREEN = 'green'
        DARKBLUE = 'darkBlue'
        DARKRED = 'darkRed'
        DARKYELLOW = 'darkYellow'
        GRAY25 = 'lightGray'
        GRAY50 = 'darkGray'
        GREEN = 'darkGreen'
        PINK = 'magenta'
        RED = 'red'
        TEAL = 'darkCyan'
        TURQUOISE = 'cyan'
        VOILET = 'darkMagenta'
        WHITE = 'white'
        YELLOW = 'yellow'
        """
        document = docx.Document(path)
        for paragraph in document.paragraphs:
            for run in paragraph.runs:
                if run.font.highlight_color and run.font.highlight_color in wd_colors_select:
                    passages.append(run.text.strip()+"\n")
                elif not highligt_colors and not run.font.highlight_color:
                    passages.append(run.text.strip()+"\n")
    except Exception as e:
        logger.info(str(e))
    return passages
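A usage sketch; the color names follow the WD_COLOR_INDEX member names listed above, e.g. YELLOW or BRIGHT_GREEN:

# collect the runs highlighted in yellow or bright green
highlights = docx_extract_color("叉尾斗鱼介绍.docx",
                                highligt_colors=["YELLOW", "BRIGHT_GREEN"])
print("".join(highlights))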

Python 3 writes a highlighted docx

def write_txt_to_docx(path_out, texts, keywords_field=None, keywords_common=None, keywords_newword=None):
    """将txt写入docx文档
        write txt to docx
        Args:
            path_out: String, path_out/origin text,  eg. "切词.docx"
            texts: List, text,  eg. ["你/是/谁/呀"]
            keywords_field: List, text,  eg. ["金融社保卡"]
            keywords_common: List, text,  eg. ["北京"]
        Returns:
            None
    """
    from docx.shared import Inches, RGBColor, Pt
    from docx.enum.text import WD_COLOR_INDEX
    from docx import Document
    from tqdm import tqdm
    import re

    # guard against None defaults so the keyword-membership checks below cannot raise TypeError
    keywords_field = keywords_field or []
    keywords_common = keywords_common or []
    keywords_newword = keywords_newword or []

    def extract_string(text, regx="###"):
        """抽取string中间的特定字符串
        extract string
        Args:
            text: String, path_in_dicr/origin text,  eg. "dir_txt"
            regx: String, path_save_dir/save text,  eg. "dir_txt_cut"
        Returns:
            text_ext: List<str>
        """
        pattern = r"{}(.*?){}".format(regx, regx)
        text_ext = re.findall(pattern, text)
        return text_ext

    document = Document()
    count = 0
    for passage in tqdm(texts):
        if not passage.strip():
            continue
        count += 1
        if count % 2 == 0:
            # for every second passage, also write the raw (unsegmented) text
            document.add_paragraph("\n")
            document.add_paragraph(passage.replace("/", ""))

        document.add_paragraph("【")
        words = passage.split("/")

        """
        AUTO = 'default'
        BLACK = 'black'
        BLUE = 'blue'
        BRIGHTGREEN = 'green'
        DARKBLUE = 'darkBlue'
        DARKRED = 'darkRed'
        DARKYELLOW = 'darkYellow'
        GRAY25 = 'lightGray'
        GRAY50 = 'darkGray'
        GREEN = 'darkGreen'
        PINK = 'magenta'
        RED = 'red'
        TEAL = 'darkCyan'
        TURQUOISE = 'cyan'
        VOILET = 'darkMagenta'
        WHITE = 'white'
        YELLOW = 'yellow'
        """

        par = document.paragraphs[-1]
        for w in words:
            par.add_run(text="/")
            par.add_run(text=w)
            run = par.runs[-1]
            # run.font.name = "宋体"  # font name
            run.font.size = Pt(10)  # font size
            # highlight: single characters, new words, field words and common words each get a color
            if len(w) == 1:
                run.font.highlight_color = WD_COLOR_INDEX.DARK_YELLOW
            elif w in keywords_newword:
                run.font.highlight_color = WD_COLOR_INDEX.BRIGHT_GREEN
            elif w in keywords_field:
                run.font.highlight_color = WD_COLOR_INDEX.YELLOW
            elif w in keywords_common:
                run.font.highlight_color = WD_COLOR_INDEX.GREEN
            else:
                run.font.highlight_color = None  # WD_COLOR_INDEX.RED
    document.save(path_out)
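A usage sketch with the example values from the docstring:

write_txt_to_docx(
    "切词.docx",
    texts=["你/是/谁/呀", "金融社保卡/如何/办理"],
    keywords_field=["金融社保卡"],
    keywords_common=["北京"],
    keywords_newword=[],
)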

Reference

For citing this work, you can refer to the present GitHub project. For example, with BibTeX:

@software{Text-Analysis,
    url = {https://github.com/yongzhuo/Text-Analysis},
    author = {Yongzhuo Mo},
    title = {Text-Analysis},
    year = {2021}
}

Available data exploration features

1. Pure NLP task scenarios

  • 1.0 General components (usable with labeled or unlabeled corpora)

    • Text length distribution: longest, shortest, median, mean, 90th/95th percentile; draw curves/box plots (see the length-statistics sketch after this list);
    • Label distribution: sample-count distribution; frequency statistics of labels (and their synonyms) in the text;
    • New-word discovery: compute tf/idf/tf-idf, left/right entropy, cohesion, etc., and generate an xlsx file;
    • Keywords/key sentences: extract keywords with e.g. TextRank, sort by word frequency;
    • Co-occurrence analysis/frequent-itemset mining (very time-consuming): sentence co-occurrence/association rules;
    • Clustering: k-means distance clustering, DBSCAN density clustering, etc.;
    • Sentiment/negativity analysis (dictionary/model, etc.): positive, negative, or multi-class; given prediction data, can be based on a sentiment dictionary or a trained model;
    • Dependency analysis: dependency syntactic parsing, which can count sentence patterns; also the newer AMR task;
  • 1.1 Text classification tasks

    • Run a tf-idf + LR baseline to gauge the difficulty of the task (see the baseline sketch after this list);
  • 1.2 Entity tasks

    • 1.2.1 Entity recognition
      • Distribution statistics for each entity
      • Entity length distribution statistics
    • 1.2.2 Entity linking
      • Key-value frequency and length statistics
    • 1.2.3 Entity disambiguation
  • 1.3 Relation tasks

    • 1.3.1 Joint entity-relation extraction
      • Distribution statistics for each entity
      • Entity length distribution statistics
      • Relation-label length distribution
      • Relation-label category distribution
    • 1.3.2 Relation prediction
      • Similar to text classification, with entity length/distribution statistics added
  • 1.4 Text similarity tasks

    • 1.4.1 Sentence similarity
      • Lengths of the two sentences, number of categories
    • 1.4.2 Word similarity
      • balls
  • 1.5 Text summarization

    • 1.5.1 Extractive
      • Whether a sentence belongs to the summary (0/1), or ranking
    • 1.5.2 Abstractive (generative)
      • auto mode
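As promised above, a minimal sketch of the text-length statistics from 1.0, assuming the corpus is already loaded as a list of strings:

import numpy as np

texts = ["第一句话", "这是第二句话", "这是最长的第三句话"]  # assumed, already-loaded corpus
lengths = np.array([len(t) for t in texts])
print({"min": lengths.min(), "max": lengths.max(),
       "median": np.median(lengths), "mean": lengths.mean(),
       "p90": np.percentile(lengths, 90), "p95": np.percentile(lengths, 95)})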
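And a scikit-learn sketch of the tf-idf + LR baseline from 1.1; the toy texts and labels here are assumptions, and real input should be whitespace-segmented Chinese:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# assumed toy data: whitespace-segmented texts with their labels
train_texts = ["金融 社保卡 如何 办理", "北京 今天 天气 怎么样"]
train_labels = ["card", "weather"]

baseline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
baseline.fit(train_texts, train_labels)
print(baseline.predict(["社保卡 办理 流程"]))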

Hope it helps you!

Origin: blog.csdn.net/rensihui/article/details/121091552