Summary of data cleaning exercises

1 Introduction

The source dataset consists of four files.
The first archive, once unzipped, contains a TSV file:
labeledTrainData - The labeled training set. The file is tab-delimited and has a header row followed by 25,000 rows containing an id, sentiment, and text for each review.
testData - The test set. The tab-delimited file has a header row followed by 25,000 rows containing an id and text for each review. Your task is to predict the sentiment for each one.
unlabeledTrainData - An extra training set with no labels. The tab-delimited file has a header row followed by 50,000 rows containing an id and text for each review.
sampleSubmission - A comma-delimited sample submission file in the correct format.

Data import: OSError: Initializing from file failed

Cause: on Windows 10 I copied the path straight from the Explorer address bar, which gives only the folder path and leaves out the file name.
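A quick way to catch this before calling pd.read_csv is to check whether the path points at a file or only at a folder. The following is a minimal sketch; the helper name describe_path and the temporary folder are made up for illustration and are not part of the original exercise.

```python
import os
import tempfile


def describe_path(path):
    """Say whether `path` is a file, a folder, or missing (hypothetical helper)."""
    if os.path.isfile(path):
        return "file"
    if os.path.isdir(path):
        return "folder"
    return "missing"


# Build a throwaway folder with one TSV file in it to try the helper on.
with tempfile.TemporaryDirectory() as folder:
    file_path = os.path.join(folder, "labeledTrainData.tsv")
    with open(file_path, "w", encoding="utf-8") as handle:
        handle.write("id\tsentiment\treview\n")
    folder_kind = describe_path(folder)   # path copied from the address bar
    file_kind = describe_path(file_path)  # path including the file name

print(folder_kind)
print(file_kind)
```

If the check reports a folder, the file name still has to be appended to the copied path.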

Error tokenizing data. C error: Expected 11 fields in line 4, saw 23

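Cause: with only the path given, pd.read_csv assumes comma-separated data, so the commas inside the reviews are treated as field separators. If you only write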

train=pd.read_csv('D:\Kaggle/word2vec-nlp-tutorial\labeledTrainData\labeledTrainData.tsv',)

this fails, but if you add:

train=pd.read_csv('D:\Kaggle/word2vec-nlp-tutorial\labeledTrainData\labeledTrainData.tsv',header=0, \
                    delimiter="\t", quoting=3)

it works.
Here, header=0 indicates that the first line of the file contains the column names,
delimiter="\t" indicates that the fields are separated by tabs,
and quoting=3 (the value of csv.QUOTE_NONE) tells pandas to treat quote characters as ordinary text; otherwise the quotes inside the reviews may cause errors when reading the file.
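To see the three arguments at work without the real file, here is a small sketch that reads a tab-delimited string held in memory. The two sample rows are invented for illustration; only the column layout (id, sentiment, review) follows the dataset description above.

```python
import csv
import io

import pandas as pd

# Made-up rows: tab-separated, with quotes and commas inside the review text.
sample = (
    'id\tsentiment\treview\n'
    '"1_8"\t1\t"A "great" film, truly."\n'
    '"2_3"\t0\t"Dull, with a "twist" nobody wanted."\n'
)

# csv.QUOTE_NONE is the named constant behind quoting=3.
print(csv.QUOTE_NONE)

df = pd.read_csv(
    io.StringIO(sample),
    header=0,                # first line holds the column names
    delimiter="\t",          # fields are separated by tabs
    quoting=csv.QUOTE_NONE,  # keep quote characters as plain text
)

print(df.shape)
print(df.columns.tolist())
print(df["review"][0])
```

Because the quotes are kept as plain text, they remain part of the id and review values and can be stripped in a later cleaning step.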

Regular expression

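re.sub('a', 'b', context)

Replaces every match of the pattern a in context with b.
r'a' is a raw string: backslashes in it are not treated as escape characters.
[] defines a character class, i.e. a set of characters to match.
^ as the first character inside [] negates the class, so it matches anything not listed.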
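Putting these pieces together, a common cleaning step is to replace everything that is not a letter with a space. This is a small sketch; the review text is invented for illustration.

```python
import re

# Made-up review text with digits, punctuation and an HTML line break.
review = "It's a 10/10 movie!<br />Loved it."

# [^a-zA-Z] matches any single character that is NOT a letter;
# the r prefix makes the pattern a raw string.
letters_only = re.sub(r"[^a-zA-Z]", " ", review)

# Lower-case the result and split on whitespace to get a word list.
words = letters_only.lower().split()
print(words)
```

Each removed character becomes one space, so the string keeps its length and the words stay separated.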

nltk library and its import

I had already downloaded the nltk package on my computer, but when I tried:

from nltk.corpus import stopwords # Import the stop word list
print(stopwords.words("english"))

I got an error. At first I thought it was a path problem, but adding the path did not help (I may have added it incorrectly). Then I downloaded the stopwords corpus and it finally worked. I won't list the failed attempts; try adding the path first, and download the corpus if that doesn't work.

import nltk
nltk.data.path.append(r"D:\NLP\nlp_data")  # add the local nltk data folder to the search path
nltk.download("stopwords")  # download the stop word corpus if it is still not found

from nltk.corpus import stopwords # Import the stop word list
print(stopwords.words("english"))
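The "add the path first, download if that fails" order can be wrapped in one function. This is only a sketch: the function name load_english_stopwords and its extra_data_dir parameter are hypothetical, and it relies on nltk raising LookupError when a corpus cannot be found on its search path.

```python
def load_english_stopwords(extra_data_dir=None):
    """Return the English stop word list, downloading the corpus if needed."""
    # Imported inside the function so the sketch can be defined
    # even on a machine where nltk is not installed yet.
    import nltk
    from nltk.corpus import stopwords

    if extra_data_dir is not None:
        # Step 1: tell nltk about a local data folder (location is up to you).
        nltk.data.path.append(extra_data_dir)
    try:
        return stopwords.words("english")
    except LookupError:
        # Step 2: the corpus was not found on any search path, so fetch it.
        nltk.download("stopwords")
        return stopwords.words("english")
```

Calling load_english_stopwords(r"D:\NLP\nlp_data") would then mirror the steps above with the folder used in this post.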

A magical path problem

train=pd.read_csv('D:\K\word2vec-nlp-tutorial\labeledTrainData\labeledTrainData.tsv',header=0, \
                    delimiter="\t", quoting=3)
test = pd.read_csv("D:/K/word2vec-nlp-tutorial/testData/testData.tsv", header=0, delimiter="\t", \
                   quoting=3 )

The first time, I used the Windows 10 default path separator (\) and it worked.
The second time, I used the default separator again and it failed; after changing it to / it worked. Baffling.
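One likely explanation, assuming the path that failed was the testData one written with backslashes: in an ordinary Python string, \t is the escape sequence for a tab, so "\testData" no longer contains a backslash followed by t. In "\labeledTrainData" the \l is not an escape sequence, so the backslash is left alone and the path still works. The short sketch below uses a shortened, made-up path to show the difference.

```python
# Ordinary string: "\t" is turned into a single TAB character.
plain = "testData\testData.tsv"

# Raw string: the backslash is kept exactly as typed.
raw = r"testData\testData.tsv"

# Forward slashes need no escaping and are accepted on Windows too.
forward = "testData/testData.tsv"

print("\t" in plain)   # True: the path now contains a tab
print("\t" in raw)     # False
print(len(plain), len(raw), len(forward))  # 20 21 21
```

Writing Windows paths as raw strings or with forward slashes avoids the problem regardless of the folder names involved.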

Origin blog.csdn.net/kunAUGUST/article/details/105340719