Introduction
Extracting Chinese and English content from sentences is a common task in natural language processing, with applications in text processing, machine translation, and linguistics research. This article introduces in detail how to use Python to extract Chinese and English content from sentences, covering preparation, choosing appropriate libraries, and working code examples.
Preparation
We can implement this with Python's built-in re module alone, or with the help of the jieba and nltk libraries. jieba and nltk are third-party libraries, so we need to install them first with the following command:
pip install jieba nltk
The following modules are used:
re: used for regular expression operations; we will use it to match Chinese and English content
jieba: used for Chinese word segmentation, splitting Chinese sentences into words
nltk: a natural language toolkit, used here for English text processing
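Before running the examples below, it can help to confirm that the two third-party libraries actually import (re ships with Python, so it cannot be missing). A minimal availability check, purely for convenience:

```python
# Check that the third-party libraries used in this article are installed.
# re is part of the standard library, so only jieba and nltk can be missing.
import importlib.util

for name in ('jieba', 'nltk'):
    if importlib.util.find_spec(name) is None:
        print(f'{name} is missing - install it with: pip install {name}')
    else:
        print(f'{name} is available')
```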
Use regular expressions to extract Chinese and English
Regular expressions are a powerful text-matching tool and can be used to extract the Chinese and English content of a sentence. The following sample code uses regular expressions to do so:
import re

def extract_chinese_and_english(sentence):
    # Match runs of Chinese characters and runs of English letters separately
    chinese_pattern = re.compile('[\u4e00-\u9fa5]+')
    english_pattern = re.compile('[a-zA-Z]+')
    result = {
        'chinese': chinese_pattern.findall(sentence),
        'english': english_pattern.findall(sentence)
    }
    return result

sentence = '这是一个示例句子,包含了一些中文和英文。This is an example sentence with both Chinese and English.'
result = extract_chinese_and_english(sentence)
print(result['chinese'])
print(result['english'])
------------------------
Running the script produces the following output:
['这是一个示例句子', '包含了一些中文和英文']
['This', 'is', 'an', 'example', 'sentence', 'with', 'both', 'Chinese', 'and', 'English']
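One caveat: the range \u4e00-\u9fa5 covers only the CJK Unified Ideographs as originally defined; characters added to the block in later Unicode versions (up to \u9fff) fall outside it, and the extension blocks beyond the basic range are missed entirely. A slightly wider pattern, still a sketch rather than an exhaustive one:

```python
import re

# \u4e00-\u9fff spans the full CJK Unified Ideographs block; extension
# blocks (e.g. \u3400-\u4dbf) would still need to be added for full coverage.
cjk_pattern = re.compile(r'[\u4e00-\u9fff]+')

text = '这是一个示例句子 with some English'
print(cjk_pattern.findall(text))  # ['这是一个示例句子']
```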
Use third-party libraries for Chinese and English extraction
In addition to regular expressions, you can also use third-party libraries to extract the Chinese and English content of sentences. The following sample code uses the jieba and nltk libraries:
import re
import jieba
import nltk
from nltk.tokenize import word_tokenize

# Initialize nltk (download the tokenizer data on first run)
nltk.download("punkt")

# Sample sentence
sentence = "这是一个示例句子,包含了一些中文和英文。This is an example sentence with both Chinese and English."

# Extract the Chinese content
def extract_chinese(text):
    chinese_pattern = re.compile(r'[\u4e00-\u9fa5]+')
    chinese_matches = chinese_pattern.findall(text)
    return " ".join(chinese_matches)

# Extract the English content
def extract_english(text):
    english_pattern = re.compile(r'[a-zA-Z]+')
    english_matches = english_pattern.findall(text)
    return " ".join(english_matches)

# Segment the Chinese content into words
chinese_text = extract_chinese(sentence)
chinese_words = jieba.cut(chinese_text)

# Segment the English content into words
english_text = extract_english(sentence)
english_words = word_tokenize(english_text)

# Print the results
print("Original sentence:", sentence)
print("Chinese content:", chinese_text)
print("Chinese segmentation:", " ".join(chinese_words))
print("English content:", english_text)
print("English segmentation:", " ".join(english_words))
-----------------------------
The output is as follows:
Original sentence: 这是一个示例句子,包含了一些中文和英文。This is an example sentence with both Chinese and English.
Chinese content: 这是一个示例句子 包含了一些中文和英文
Chinese segmentation: 这 是 一个 示例 句子 包含 了 一些 中文 和 英文
English content: This is an example sentence with both Chinese and English
English segmentation: This is an example sentence with both Chinese and English
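Both approaches collect the Chinese and English content into separate containers, which loses the original ordering of the segments. If the order matters, one option (a sketch using only the standard library) is to scan the sentence once with a pattern containing named alternatives:

```python
import re

# One pattern with two named alternatives; match.lastgroup tells us which
# alternative matched, so segments come back in their original order.
mixed_pattern = re.compile(r'(?P<chinese>[\u4e00-\u9fa5]+)|(?P<english>[a-zA-Z]+)')

def extract_ordered(sentence):
    # Return (kind, text) pairs in the order they appear in the sentence.
    return [(m.lastgroup, m.group()) for m in mixed_pattern.finditer(sentence)]

print(extract_ordered('你好 world 再见 goodbye'))
# [('chinese', '你好'), ('english', 'world'), ('chinese', '再见'), ('english', 'goodbye')]
```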
- First, we use regular expressions to extract the Chinese and English content: the pattern [\u4e00-\u9fa5]+ matches Chinese characters, and [a-zA-Z]+ matches English letters.
- Then we use jieba to segment the Chinese content, splitting the Chinese sentence into words.
- We use nltk's word_tokenize function to segment the English content, splitting the English sentence into words.
- Finally, we print the original sentence, the Chinese content, the Chinese segmentation, the English content, and the English segmentation.
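Note also that both character classes ignore anything that is neither a Chinese character nor an ASCII letter: digits, punctuation, and full-width symbols are silently dropped. A quick sketch showing the gap, with a separate pattern for digit runs:

```python
import re

sentence = '订单123包含5 items'

print(re.findall(r'[\u4e00-\u9fa5]+', sentence))  # ['订单', '包含']
print(re.findall(r'[a-zA-Z]+', sentence))         # ['items']
print(re.findall(r'[0-9]+', sentence))            # ['123', '5'] - digits need their own pattern
```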
Summary
This article introduced how to use Python to extract Chinese and English content from text. The examples here are deliberately simple; processing more complex text calls for more advanced tools and more carefully crafted regular expressions.