Using the IMDB review dataset for text classification

Introduction to the IMDB review dataset
Related: a detailed guide to reading and writing JSON files
imdb_review.json

[
    {
        "rating": 5, 
        "title": "The dark is rising!", 
        "movie": "tt0484562", 
        "review": "It is adapted from the book. I did not read the book and maybe that is why I still enjoyed the movie. There are recent famous books adapted into movies like Eragon which is an unsuccessful movie compared to the rest but I like it better than The Seeker adaptation, another one is The Chronicles of Narnia: The lion, The witch and The wardrobe which is successful and has a sequel under it. The Seeker is this year adaptation. It did a fair job. It is not bad and it is not good. It depends on the viewer. If fans hate the unfaithful adaptation because it does not really follow the line of the story, then be it. Those who have not read the book like me would want to go and watch this movie for entertainment. It did make me a little interested but not enough.It does have its good and bad points. The director failed to bring the spark of the movie. The cast are okay, not too bad. The special effects are considered good for a fantasy movie. What I don't like it is that it is quite short, it just bring straight to the point and that is it. By the time, you will realise it is going to end like that with some short fantasy action. The story is like any fantasy movies. Fast and straight-forward plot. The talking seems long and boring followed by some short action. That is about it. Nothing else. Nothing so interesting to catch your eyes.Overall, it makes a harmless movie to watch in free time or the boring weekends. It is considered dark for children but they still can handle it. It seems long but it is short. Overall, I still think Eragon is better than this. Either you don't like it or like it, it does not matter. It is your view. In this case, I can't say anything. It is just okay.", 
        "link": "http://www.imdb.com/title/tt0484562/reviews-73", 
        "user": "ur12930537"
    }, 
    {
        "rating": 5, 
        "title": "Bad attempt by the people that borough us Eragon.", 
        "movie": "tt0484562", 
        "review": "Ever since Lord of the Rings became a hit and was internationally acclaimed all other studios are trying to do the same thing and I can tell you now we are not getting many successes out of these half hearted attempts. The decent ones are Chronicles of Narnia which Disney snapped up and Harry Potter from Warner Brothers. Even the Golden Compass was pretty good by the same people who did Lord of the Rings but then we get to the bad ones. Fox studios gave us Eragon which I still believe is the worst movie I have ever seen. Now Fox studios tries again with the Seeker: The Dark is Rising and I can tell you it is a lot better than Eragon. However, it still is not very good. The director filmed the movie and then realised that his movie was too short so he had a great idea of just making characters appear for no reason and just look scary. I have not read the books but from what I have heard it isn't even faithful their. Overall, it was a decent try but still not worth seeing.", 
        "link": "http://www.imdb.com/title/tt0484562/reviews-108", 
        "user": "ur15303216"
    }, 
    {
        "rating": 3, 
        "title": "fantasy movie lacks magic", 
        "movie": "tt0484562", 
        "review": "I've not read the novel this movie was based on, but do enjoy fantasy movies, and thought it looked interesting. But after seeing it...... oh dear.An American boy, Will living with his family in a small village somewhere in England, discovers on his 14th birthday that he's The Seeker for a group of old ones, who fight for the Light. He's got days to find them, before the Rider who fights for the Dark comes to full strength....As I said, I've not read the novel, but seeing the movie several things spring to mind. There are echoes of Harry Potter, the Russian movies Night Watch and Day Watch amongst other fantasy movies tossed into the mix. The script is all over the place, though perhaps this is due to some brutal editing as the movie seems disjointed in parts and the director can't resist having his camera moving all the time and with some quick editing it's almost as if he's trying to be Micheal Bay!! You also get the feeling that despite the production team's efforts, the movie didn't have the budget it really needed. There are a couple of so-called twists in the mix, but they are too obvious to work effectively.The acting isn't too bad, with special mention going to Ian McShane, as one of the elder ones but try as they might, they can't save the movie.As the first of a trio of fantasy movies coming out, the others being Stardust and The Golden Compass, I hope this is not a sign of things to come.", 
        "link": "http://www.imdb.com/title/tt0484562/reviews-60", 
        "user": "ur0680065"
    }
]
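
As a minimal sketch of how records of this shape can be read, the snippet below parses a short inline JSON string and pulls out the review text and rating of each record. The inline sample is a shortened, hypothetical stand-in for imdb_review.json, and the function name load_reviews is an assumption made for illustration.

```python
import json


def load_reviews(json_text):
    """Parse a JSON array of review records into (texts, ratings)."""
    records = json.loads(json_text)  # a list of dicts, one per review
    texts = [record["review"] for record in records]
    ratings = [record["rating"] for record in records]
    return texts, ratings


# Shortened, hypothetical stand-in for the contents of imdb_review.json.
sample = '''[
    {"rating": 5, "title": "The dark is rising!", "movie": "tt0484562",
     "review": "It is just okay.", "user": "ur12930537"},
    {"rating": 3, "title": "fantasy movie lacks magic", "movie": "tt0484562",
     "review": "Oh dear.", "user": "ur0680065"}
]'''

texts, ratings = load_reviews(sample)
print(texts)
print(ratings)
```

Only the review and rating fields are used for classification; the other fields (title, movie, link, user) are ignored in this sketch.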

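Reading and saving
Overall data processing steps (json -> pickle):
1. Read the review and rating fields of every record in the JSON file to get text and label. Count the number of word tokens in each text, sort the counts, and take max_length_word = sorted_word_length[int(0.8*len(sorted_word_length))], i.e. the 80th-percentile length, as the standard number of words per text.
2. Build the word dictionary (vocabulary) from the GloVe word vectors file.
3. Tokenize each text into words and encode every word as its index in the dictionary.
4. So that texts can later be batched (batch_size) for model training, truncate texts with more than max_length_word words and pad shorter ones with -1, the same index used for unknown words. All indices are then shifted by +1, so unknown words and padding both become 0.
5. Save the resulting Python objects to a .pickle file, which makes later loading easy and avoids the cost of tokenizing again.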
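
Steps 1, 3 and 4 can be illustrated on a toy example. This is only a sketch under stated assumptions: the regular-expression tokenizer stands in for nltk's WordPunctTokenizer so the snippet has no dependencies, and the tiny vocab dictionary and the helper names (tokenize, percentile_length, encode) are hypothetical.

```python
import re

UNKNOWN = -1  # index used for both unknown words and padding


def tokenize(text):
    # Stand-in tokenizer (assumption): runs of word characters or of punctuation.
    return re.findall(r"\w+|[^\w\s]+", text.lower())


def percentile_length(token_lists, fraction=0.8):
    # Step 1: sort the token counts and take the value at the 80% position.
    lengths = sorted(len(tokens) for tokens in token_lists)
    return lengths[int(fraction * len(lengths))]


def encode(tokens, vocab, max_len):
    # Steps 3 and 4: map words to indices, truncate, then pad with -1.
    ids = [vocab.get(token, UNKNOWN) for token in tokens][:max_len]
    return ids + [UNKNOWN] * (max_len - len(ids))


texts = ["It is okay.", "Not good, not bad.", "Fine.",
         "A harmless movie to watch.", "Short"]
vocab = {"it": 0, "is": 1, "okay": 2, ".": 3, "not": 4, "good": 5}  # toy vocabulary

token_lists = [tokenize(text) for text in texts]
max_len = percentile_length(token_lists)
encoded = [encode(tokens, vocab, max_len) for tokens in token_lists]
print(max_len)
print(encoded[0])
```

Every encoded text ends up with exactly max_len entries, which is what allows the texts to be stacked into one array.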

# Tokenize with nltk's WordPunctTokenizer
import csv
import json
import os
import pickle

import numpy as np
import pandas as pd
from nltk.tokenize import WordPunctTokenizer

word_tokenizer = WordPunctTokenizer()

class JsonDataset:
    def __init__(self,
                 data_path, 
                 dict_path,
                 save_dir):
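        super(JsonDataset, self).__init__()
        # Maximum number of words per text for this dataset
        self.max_length_word = self.get_max_lengths(data_path)

        # Build the index dictionary: a list of GloVe words, where a word's
        # position in the list is its index. na_filter=False keeps tokens such
        # as "nan" or "null" as strings instead of converting them to NaN.
        self.dict = pd.read_csv(filepath_or_buffer=dict_path, header=None, sep=" ", quoting=csv.QUOTE_NONE,
                                usecols=[0], na_filter=False).values
        self.dict = [word[0] for word in self.dict]

        # Encode the data and save it in pickle format
        self.save_to_pickle(data_path, self.dict, self.max_length_word, save_dir)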

    def save_to_pickle(self, data_path, vocab, max_length_word, save_dir):
        save_path = self.get_save_path(data_path, save_dir)
        texts, labels = [], []

        # Read the JSON file
        with open(data_path, 'r', encoding='utf-8') as f:
            reviews = json.load(f)  # a list of review dicts
            for review in reviews:
                text = review['review'].lower()
                label = review['rating'] - 1  # make labels zero-based
                text_encode = self.convert_index(vocab, text, max_length_word)
                texts.append(text_encode)
                labels.append(label)

        # Shift every index by +1 so that unknown words and padding (-1) become 0
        self.texts = np.stack(arrays=texts, axis=0).astype(np.int64) + 1  # [num_texts, max_length_word]
        self.labels = labels
        # Save
        with open(save_path, 'wb') as g:
            pickle.dump((self.texts, self.labels), g)
        print("The num of texts:", self.texts.shape[0])

    def get_save_path(self, data_path, save_dir):
        # Mirror the dataset's parent folder under save_dir,
        # e.g. ./Dataset/imdb/imdb_review.json -> <save_dir>/imdb/imdb_review.pickle
        dataset_name = os.path.basename(os.path.dirname(data_path))
        data_save_dir = os.path.join(save_dir, dataset_name)
        os.makedirs(data_save_dir, exist_ok=True)

        file_stem = os.path.splitext(os.path.basename(data_path))[0]
        save_path = os.path.join(data_save_dir, file_stem + '.pickle')

        return save_path

    def convert_index(self, vocab, text, max_length_word):
        # Tokenize the text and map each word to its index in the vocabulary;
        # words that are not in the vocabulary get -1
        text_encode = [vocab.index(word) if word in vocab else -1
                       for word in word_tokenizer.tokenize(text)]

        # Pad texts that have too few words with -1
        if len(text_encode) < max_length_word:
            extended_words = [-1 for _ in range(max_length_word - len(text_encode))]
            text_encode.extend(extended_words)

        # Truncate texts that have too many words
        text_encode = text_encode[:max_length_word]

        return text_encode

    # Get the maximum token length to use for texts (80th percentile of the word counts)
    def get_max_lengths(self, data_path):
        word_length_list = []

        with open(data_path, 'r', encoding='utf-8') as f:
            reviews = json.load(f)
            for review in reviews:
                text = review['review'].lower()
                word_list = word_tokenizer.tokenize(text)
                word_length_list.append(len(word_list))

        sorted_word_length = sorted(word_length_list)
        return sorted_word_length[int(0.8*len(sorted_word_length))]

if __name__ == '__main__':
    imdb_path = "./Dataset/imdb/imdb_review.json"
    glove_path = "./Glove/glove.6B.50d.txt"
    pickle_path = "./Dataset/Pickle"

    data = JsonDataset(data_path=imdb_path, dict_path=glove_path, save_dir=pickle_path)
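    # Load the saved pickle back to check it
    # (get_save_path writes it to <save_dir>/imdb/imdb_review.pickle)
    # with open('./Dataset/Pickle/imdb/imdb_review.pickle', 'rb') as f:
    #   texts, labels = pickle.load(f)
    #   print("The shape of texts:", texts.shape)
    #   print(labels)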
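
One practical note on the encoding step: convert_index looks each token up with list.index, which scans the word list from the start every time, so encoding can become slow with a large GloVe vocabulary. A possible alternative, sketched below, is to build a word-to-index dictionary once and use constant-time lookups. The helper names build_word_to_index and convert_index_fast are hypothetical and not part of the code above; the sketch is meant to return the same indices as the list-based version.

```python
def build_word_to_index(words):
    """Map each word to its first position in the vocabulary list."""
    word_to_index = {}
    for index, word in enumerate(words):
        # Keep the first occurrence, which matches what list.index returns.
        word_to_index.setdefault(word, index)
    return word_to_index


def convert_index_fast(word_to_index, tokens, max_length_word, unknown=-1):
    """Encode tokens as indices, truncating or padding to max_length_word."""
    ids = [word_to_index.get(token, unknown) for token in tokens[:max_length_word]]
    ids.extend([unknown] * (max_length_word - len(ids)))
    return ids


# Toy vocabulary standing in for the first column of the GloVe file.
vocab_words = ["the", ",", ".", "of", "movie"]
lookup = build_word_to_index(vocab_words)

tokens = ["the", "movie", "is", "okay", "."]
print(convert_index_fast(lookup, tokens, 7))
```

The dictionary costs one pass over the vocabulary to build, after which each token lookup no longer depends on the vocabulary size.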
