1 Introduction
The source dataset consists of four files.
The first one is a compressed archive that contains a TSV file once extracted.
labeledTrainData - The labeled training set. The file is tab-delimited and has a header row followed by 25,000 rows containing an id, sentiment, and text for each review.
testData - The test set. The tab-delimited file has a header row followed by 25,000 rows containing an id and text for each review. Your task is to predict the sentiment for each one.
unlabeledTrainData - An extra training set with no labels. The tab-delimited file has a header row followed by 50,000 rows containing an id and text for each review.
sampleSubmission - A comma-delimited sample submission file in the correct format.
Data import: OSError: Initializing from file failed
Cause: I copied the file path straight from the Windows 10 Explorer address bar, which gives only the folder path and not the file name, so read_csv was pointed at a folder instead of a file.
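A quick check before calling read_csv can catch this mistake early. The following is only a minimal sketch; the helper name check_data_file is hypothetical and not part of pandas.

```python
import os

def check_data_file(path):
    """Return path if it points to an existing file, otherwise raise with a hint.

    Hypothetical helper for illustration only.
    """
    if os.path.isdir(path):
        # The path copied from the address bar stops at the folder.
        raise IsADirectoryError(f"{path!r} is a folder; append the file name")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"no file found at {path!r}")
    return path
```

Calling check_data_file on the path first turns the vague "Initializing from file failed" into a message that says what is actually wrong with the path.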
Error tokenizing data. C error: Expected 11 fields in line 4, saw 23
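Cause: if pd.read_csv is called with only the file path,
import pandas as pd
train=pd.read_csv('D:\Kaggle/word2vec-nlp-tutorial\labeledTrainData\labeledTrainData.tsv')
it fails, because read_csv splits on commas by default and the commas inside the review text give the lines different numbers of fields. Adding the following arguments fixes it:
train=pd.read_csv('D:\Kaggle/word2vec-nlp-tutorial\labeledTrainData\labeledTrainData.tsv',header=0, \
delimiter="\t", quoting=3)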
Here, header=0 indicates that the first line of the file contains the column names,
delimiter="\t" indicates that the fields are separated by tabs,
and quoting=3 (the value of csv.QUOTE_NONE) tells the parser to treat double quotes as ordinary characters; otherwise you may encounter errors when reading the file.
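The effect of these options can be seen with the standard csv module, whose QUOTE_* constants are the values pandas accepts for quoting. This is only a sketch: the sample row below is made up to mimic the layout of labeledTrainData.tsv and is not taken from the real file.

```python
import csv
import io

# Made-up sample in the same three-column layout as labeledTrainData.tsv.
sample = 'id\tsentiment\treview\n"1_8"\t1\t"A "great" film, truly"\n'

# quoting=3 is the same constant as csv.QUOTE_NONE.
assert csv.QUOTE_NONE == 3

# Split on tabs only and keep quote characters as plain text.
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
header, first_row = list(reader)

print(header)
print(first_row)
```

Because quotes are not interpreted, the stray double quotes and the comma inside the review stay within a single field.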
Regular expression
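re.sub('a', 'b', context)
Replaces every match of the pattern a in context with b.
r'a' is a raw string: backslashes in it are not treated as escape characters.
[] defines a character class: it matches any one of the characters listed inside.
^ as the first character inside [] means "not": it matches any character that is not listed.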
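Putting these pieces together, here is a small sketch that replaces everything except letters with a space. The function name letters_only and the sample sentence are my own, chosen only for illustration.

```python
import re

def letters_only(text):
    # [^a-zA-Z] matches any single character that is NOT an ASCII letter;
    # each such character is replaced by one space.
    return re.sub(r"[^a-zA-Z]", " ", text)

cleaned = letters_only("It's 10/10, a must-see!")
print(cleaned.split())
```

Since every non-letter becomes exactly one space, the string keeps its original length, and split() then gives the remaining words.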
nltk library and its import
I had already installed the nltk package on my computer,
but when I ran
from nltk.corpus import stopwords # Import the stop word list
print(stopwords.words("english"))
I got an error. At first I thought it was a path problem, but adding the data path made no difference (I probably added the wrong one). Then I downloaded the stopwords corpus and it finally worked. I won't list the failed attempts in detail: try adding the path first, and download the corpus if that doesn't work.
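import nltk
# Option 1: point nltk at a folder that already holds the downloaded data
nltk.data.path.append(r"D:\NLP\nlp_data")
# Option 2: if that does not help, download the stopwords corpus
nltk.download("stopwords")
from nltk.corpus import stopwords # Import the stop word list
print(stopwords.words("english"))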
A magical path problem
train=pd.read_csv('D:\K\word2vec-nlp-tutorial\labeledTrainData\labeledTrainData.tsv',header=0, \
delimiter="\t", quoting=3)
test = pd.read_csv("D:/K/word2vec-nlp-tutorial/testData/testData.tsv", header=0, delimiter="\t", \
quoting=3 )
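The first time, I used the default Windows 10 path separator (\) and it worked.
The second time, I used the same default separator and it failed; after changing it to / it worked. Baffling.
The likely reason is that "\testData" begins with \t, which Python reads as a tab character in an ordinary string, whereas \K, \w and \l in the first path are not escape sequences and stay as they are.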
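This behaviour can be reproduced without touching any files. A minimal sketch, using a shortened hypothetical path in place of the full one:

```python
# Shortened, hypothetical path used only for illustration.
plain = "data\testData.tsv"     # ordinary string: "\t" becomes a tab character
raw = r"data\testData.tsv"      # raw string: the backslash is kept literally
forward = "data/testData.tsv"   # forward slashes need no escaping

print(repr(plain))
print(repr(raw))
print("\t" in plain, "\t" in raw)
```

Either the raw string or the forward-slash form avoids the problem, so it is safer to write every Windows path in one of those two ways.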