NLP toolkit installation and configuration
One-step installation
pip install -r requirements.txt
pip mirror address (append this option to a pip install command to download from the Tsinghua mirror)
-i https://pypi.tuna.tsinghua.edu.cn/simple
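The two pieces above combine into a single command. A minimal sketch, assuming the requirements file is named requirements.txt as above; build_pip_command is a hypothetical helper name, and the command is only printed here, not executed:

```python
import sys

# Tsinghua mirror address from the section above.
TUNA_MIRROR = "https://pypi.tuna.tsinghua.edu.cn/simple"


def build_pip_command(requirements="requirements.txt", index_url=TUNA_MIRROR):
    """Return the pip command (as an argument list) for a one-step install.

    build_pip_command is a hypothetical helper used only for illustration.
    """
    # "python -m pip" runs pip for the interpreter executing this script.
    return [sys.executable, "-m", "pip", "install",
            "-r", requirements, "-i", index_url]


command = build_pip_command()
print(" ".join(command))
# To actually install, pass the list to subprocess.run(command, check=True).
```

Building the command as a list avoids shell quoting problems if the requirements path contains spaces.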
numpy
numpy: a library for matrix operations
pip install numpy
NLTK
NLTK: a natural language processing toolkit
pip install nltk
Gensim
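Gensim: a library for automatically extracting semantic topics from documents. It can be installed in either of two ways:
- pip install gensim
- Download the whl file from http://www.lfd.uci.edu/~gohlke/pythonlibs/ , then install the downloaded whl file with pip install.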
Tensorflow
TensorFlow: an open source software library for numerical computation using data flow graphs.
pip install tensorflow
pip install tf-nightly-gpu    # nightly build, GPU variant
pip install tf-nightly-cpu    # nightly build, CPU variant
jieba
jieba: a Chinese word segmentation library. It offers three segmentation modes and supports custom dictionaries.
pip install jieba
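The three segmentation modes and the custom dictionary support can be tried as follows. This is a minimal sketch: the sample sentence and the helper name segment_three_modes are assumptions chosen for illustration, and the import is guarded so the script still runs (with a hint) before jieba is installed:

```python
try:
    import jieba
except ImportError:  # jieba has not been installed yet
    jieba = None

SENTENCE = "我来到北京清华大学"  # sample sentence; any Chinese text works


def segment_three_modes(text):
    """Segment text with jieba's three modes and return the token lists."""
    return {
        # Precise mode (the default): one best segmentation of the sentence.
        "precise": list(jieba.cut(text, cut_all=False)),
        # Full mode: every word the dictionary can find in the sentence.
        "full": list(jieba.cut(text, cut_all=True)),
        # Search-engine mode: precise mode plus extra cuts of long words.
        "search": list(jieba.cut_for_search(text)),
    }


if jieba is None:
    print("jieba is not installed; run: pip install jieba")
else:
    # add_word registers one custom word at runtime;
    # jieba.load_userdict(path) loads a whole custom dictionary file.
    jieba.add_word("清华大学")
    for mode, tokens in segment_three_modes(SENTENCE).items():
        print(mode, "/".join(tokens))
```

Precise mode is usually the one to use for text analysis; search-engine mode is meant for building search indexes.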
Stanford NLP
Stanford NLP:
- Install the stanfordcorenlp Python package:
pip install stanfordcorenlp
- Download the Stanford CoreNLP package:
https://stanfordnlp.github.io/CoreNLP/download.html
- Download the Chinese model jar:
http://nlp.stanford.edu/software/stanford-chinese-corenlp-2018-02-27-models.jar
- Unzip the Stanford CoreNLP package and place the downloaded stanford-chinese-corenlp-2018-02-27-models.jar in the same folder.
- Load the model in Python:
from stanfordcorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP(r'path', lang='zh')  # 'path' is the unzipped Stanford CoreNLP folder
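Before loading the model, it can help to confirm that the Chinese models jar really sits inside the CoreNLP folder, as the download steps require. A minimal sketch using only the standard library; find_chinese_models is a hypothetical helper, and the demo runs on a throwaway folder rather than a real CoreNLP download:

```python
import glob
import os
import tempfile


def find_chinese_models(corenlp_dir):
    """Return the Chinese model jars found directly inside corenlp_dir."""
    # glob.escape protects folder names that contain characters like [ or *.
    pattern = os.path.join(glob.escape(corenlp_dir),
                           "stanford-chinese-corenlp-*-models.jar")
    return sorted(glob.glob(pattern))


# Demo on a throwaway folder that imitates the expected layout.
with tempfile.TemporaryDirectory() as demo_dir:
    jar_name = "stanford-chinese-corenlp-2018-02-27-models.jar"
    open(os.path.join(demo_dir, jar_name), "w").close()  # empty stand-in file
    found = [os.path.basename(p) for p in find_chinese_models(demo_dir)]
    print(found)
```

In practice, call find_chinese_models with the same folder you pass to StanfordCoreNLP; an empty list means the jar is missing or in the wrong place.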
HanLP
Brief introduction
HanLP: a toolkit for Chinese word segmentation, POS tagging, and named entity recognition (based on C++ or Java).
HanLP consists of three parts: the library jar (hanlp.jar), the model data package, and the configuration file hanlp.properties. Once the JVM test below succeeds, HanLP itself needs to be configured.
GitHub URL: https://github.com/hankcs/HanLP (see section 3, on the configuration file)
JVM environment installation
- First, install Java. The version I use is JDK 1.8.
- Then install JPype, a toolkit that lets Python code call Java code (published on PyPI as JPype1):
pip install JPype1
- Test whether the JVM starts normally from the Python environment:
from jpype import *
startJVM(getDefaultJVMPath(), "-ea")  # -ea enables Java assertions
java.lang.System.out.println("Hello World")  # print from Java to confirm the JVM works
shutdownJVM()
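If startJVM fails, a common cause is that no Java installation can be located. The sketch below reports two usual hints without starting a JVM. It uses only the standard library; describe_java_setup is a hypothetical helper, and JPype's own lookup may follow other rules, so treat the output as a hint rather than a verdict:

```python
import os
import shutil


def describe_java_setup():
    """Collect hints about where Java is installed, without starting a JVM."""
    return {
        # JAVA_HOME is the conventional variable pointing at the JDK folder.
        "JAVA_HOME": os.environ.get("JAVA_HOME"),
        # shutil.which returns the path of the java launcher, or None.
        "java_on_path": shutil.which("java"),
    }


setup = describe_java_setup()
for name, value in setup.items():
    print(name, "=", value if value else "(not found)")
```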
HanLP installation
HanLP consists of three parts: the library jar (hanlp.jar), the model data package, and the configuration file hanlp.properties.
- Download the hanlp.jar package:
https://github.com/hankcs/HanLP
- Download data.zip:
https://github.com/hankcs/HanLP/releases
http://hanlp.linrunsoft.com/release/data-for-1.7.0.zip
- Configuration file
The HanLP configuration file is hanlp.properties. Its job is to tell HanLP where the data package is located, so you only need to modify the first line:
root=usr/home/HanLP/
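Editing that line by hand is enough, but the change can also be scripted. A minimal sketch using only the standard library; the helper name set_root, the demo file contents, and the target folder E:/NLP/hanlp/ are assumptions for illustration (forward slashes are used because backslashes act as escape characters in Java properties files):

```python
import os
import tempfile


def set_root(properties_path, new_root):
    """Rewrite the root= entry of a properties file, keeping other lines."""
    with open(properties_path, encoding="utf-8") as f:
        lines = f.readlines()
    for i, line in enumerate(lines):
        if line.strip().startswith("root="):
            lines[i] = "root=" + new_root + "\n"
            break
    else:
        # No root= entry yet: add one at the top of the file.
        lines.insert(0, "root=" + new_root + "\n")
    with open(properties_path, "w", encoding="utf-8") as f:
        f.writelines(lines)


# Demo on a throwaway file; the second line is a made-up placeholder entry.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "hanlp.properties")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("root=usr/home/HanLP/\nexample.key=example value\n")
    set_root(demo_path, "E:/NLP/hanlp/")
    with open(demo_path, encoding="utf-8") as f:
        updated = f.read()
    print(updated)
```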
Once configured, hanlp.properties must be placed on the classpath. When the JVM starts, it looks for classes and properties files on the classpath, and the following option specifies the classpath:
"-Djava.class.path=E:\NLP\hanlp\hanlp-1.5.0.jar;E:\NLP\hanlp"
Common errors
Class com.hankcs.hanlp.HanLP not found
The cause lies in the startJVM settings. Check two things: first, is the path correct? Second, does the version number in the jar file name match the file you downloaded? (Note that it is best to use a path containing only English characters.)
'unicodeescape' codec can't decode bytes
This error is caused by backslash escapes in the path string. Prefix the string with r to make it a raw string, which disables the escaping.
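The difference between an ordinary string and a raw string can be reproduced without HanLP. In this minimal sketch, the offending assignment is compiled from source text so that the error can be caught and printed instead of stopping the script:

```python
# Source text of an assignment that uses an ordinary (non-raw) string.
# Python reads \N as the start of a named-character escape, which is
# malformed here, so the literal cannot be decoded.
bad_source = 'path = "E:\\NLP\\hanlp"'

try:
    compile(bad_source, "<demo>", "exec")
    error_message = None
except SyntaxError as exc:
    error_message = str(exc)
print("ordinary string:", error_message)

# With the r prefix, backslashes are kept literally and nothing is escaped.
raw_path = r"E:\NLP\hanlp\hanlp-1.5.0.jar"
print("raw string:", raw_path)
```

Doubling every backslash or using forward slashes in the path are alternatives to the r prefix.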
Test code:
import jpype
from jpype import JClass
jvm_path = jpype.getDefaultJVMPath()
# The classpath lists the HanLP jar and the folder that contains hanlp.properties.
jpype.startJVM(jvm_path,
               r"-Djava.class.path=E:\NLP\hanlp\hanlp-1.5.0.jar;E:\NLP\hanlp",
               "-Xms1g",
               "-Xmx1g")
jpype.java.lang.System.out.println("hello world!")
HanLP = JClass('com.hankcs.hanlp.HanLP')
# Segment a sample sentence and print the result.
jpype.java.lang.System.out.println(HanLP.segment("你好,欢迎使用HanLP汉语处理包!"))
jpype.shutdownJVM()
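The classpath option in the test code is written for Windows, where entries are separated by ";". On Linux and macOS the JVM expects ":" instead. A minimal sketch that assembles the option from its two parts; build_classpath_option is a hypothetical helper, and the paths are the ones used above:

```python
import os


def build_classpath_option(jar_path, config_dir, sep=os.pathsep):
    """Return the -Djava.class.path option for the HanLP jar and config folder."""
    # os.pathsep is ";" on Windows and ":" elsewhere, matching what the
    # JVM expects between classpath entries on each platform.
    return "-Djava.class.path=" + sep.join([jar_path, config_dir])


# sep=";" reproduces the Windows form used in the test code.
option = build_classpath_option(r"E:\NLP\hanlp\hanlp-1.5.0.jar",
                                r"E:\NLP\hanlp", sep=";")
print(option)
```

Leaving sep at its default picks the separator for the current platform, and the resulting string can be passed to startJVM in place of the hard-coded option.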