Count the characters and words in the “3_人民日报语料” (People's Daily corpus) text, and save the file in ANSI, UTF-8, UTF-16, and Unicode formats.

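First, count the characters in the file. There are two ways to do this. The first is to paste the file into Word, which reports the count automatically. The second is to read the text into a Python string, remove the line breaks and spaces, and take the length of the string as the number of characters. Next, count the words. The file has already been word-segmented, so it is enough to read the text into a string and call Python's split() to turn it into a list; the length of that list is the number of words. Finally, convert the file to the different encodings, which can be done in Notepad, Notepad++, Sublime Text, or a similar editor.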

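1. Counting the characters in the “3_人民日报语料” text

Method 1: paste the text into a Word document; Word counts the characters automatically.

Method 2: process the file with Python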

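#coding=utf-8
# Runs on both Python 2.6+ and Python 3
try:
    # Open the People's Daily corpus as raw bytes; "with" closes the file even on error
    with open("3.txt", "rb") as file_read:
        # Read the file into s and decode it to Unicode; utf-8-sig also drops a leading BOM
        s = file_read.read().decode("UTF-8-SIG")
    s = s.replace('\n', '')  # remove line feeds
    s = s.replace('\r', '')  # remove carriage returns
    s = s.replace(" ", '')   # remove spaces
    # The length of s is the total number of characters
    print("The total number of characters is " + str(len(s)))
except Exception as e:
    print(e)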
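
The choice of the `UTF-8-SIG` codec affects the count. Editors such as Notepad put a byte order mark (BOM) at the start of UTF-8 files, and decoding with plain `utf-8` keeps that BOM as a U+FEFF character, which would inflate the total by one. The following Python 3 sketch shows the difference on an in-memory byte string; the sample text and the helper name `count_chars` are made up for illustration.

```python
import codecs

# Hypothetical sample: a UTF-8 BOM followed by two segmented words.
raw = codecs.BOM_UTF8 + "人民 日报\n".encode("utf-8")

def count_chars(text):
    """Count characters after dropping line feeds, carriage returns and spaces."""
    for ch in ("\n", "\r", " "):
        text = text.replace(ch, "")
    return len(text)

with_bom = count_chars(raw.decode("utf-8"))         # BOM survives as U+FEFF
without_bom = count_chars(raw.decode("utf-8-sig"))  # BOM is stripped

print(with_bom, without_bom)  # 5 4
```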

2. Counting the words in the “3_人民日报语料” text

Process the file with Python

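#coding=utf-8
# Runs on both Python 2.6+ and Python 3
try:
    # Open the file as raw bytes; "with" closes it even on error
    with open('3.txt', 'rb') as file_read:
        s = file_read.read().decode("UTF-8-SIG")  # read the file and decode it to Unicode
    s = s.split()  # the corpus is already segmented, so split() is all that is needed
    # The length of the list s is the number of words
    print("Total number of words is " + str(len(s)))
except Exception as e:
    print(e)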
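
Calling `split()` with no arguments works here because it splits on any run of whitespace (spaces, tabs, line breaks) and never returns empty strings, so doubled spaces or line breaks between words do not distort the count. A small Python 3 sketch on a made-up segmented sentence (the sample and the helper name `count_words` are hypothetical):

```python
def count_words(text):
    """Count whitespace-separated tokens in pre-segmented text."""
    return len(text.split())

# Hypothetical sample with a doubled space, line breaks and a trailing space.
sample = "人民  日报 是\n一 份 报纸 \n"

print(count_words(sample))     # 6
print(len(sample.split(" ")))  # 7: splitting on a single space miscounts
```

Splitting on a literal `" "` produces an empty string for the doubled space, fuses the two words around the line break into one token, and leaves a trailing `"\n"` token, which is why the argument-free form is the safer choice.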

3. Saving the file in ANSI, UTF-8, UTF-16, and Unicode formats

(1) UTF-16 or UTF-8

Sublime Text

(2) ANSI or UTF-8

Notepad++

(3) Unicode, ANSI, or UTF-8

Notepad

(4) UTF-16

Python

#coding=utf-8
# Runs on both Python 2.6+ and Python 3
import codecs
import chardet  # third-party package (pip install chardet)
file_name = '3.txt'
file_utf_16_name = '3_utf_16.txt'
try:
    # Read the source file as raw bytes
    with open(file_name, 'rb') as file_read:
        text = file_read.read()
    # Create the UTF-16 output file through the codecs module
    with codecs.open(file_utf_16_name, mode='w', encoding='utf-16') as file_utf_16:
        # Decode to Unicode (dropping a leading BOM) and write it out as UTF-16
        file_utf_16.write(text.decode("UTF-8-SIG"))

    # Verify the result: chardet needs raw bytes, so reopen the file in binary mode
    with open(file_utf_16_name, 'rb') as fs:
        check = chardet.detect(fs.read())

    # str() guards against chardet returning None for the encoding
    print('the encoding of ' + file_utf_16_name + ' is ' + str(check.get('encoding')))
except Exception as e:
    print(e)
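
The other target encodings can be produced from Python in much the same way. The Python 3 sketch below is one possible approach, not the method used above. It assumes that "ANSI" means the Simplified Chinese Windows code page (Python codec `gbk`) and that "Unicode" means what Notepad calls Unicode, namely UTF-16 little-endian preceded by a BOM. The helper names `encode_variants` and `save_as_encodings` and the output file names are hypothetical, and the demo runs on a temporary file rather than the real corpus.

```python
import codecs
import io
import os
import tempfile

def encode_variants(text):
    """Return the text encoded under each label (the mapping is an assumption)."""
    return {
        # "ANSI" is the system code page; gbk assumes Simplified Chinese Windows.
        "ansi": text.encode("gbk"),
        "utf8": text.encode("utf-8"),
        # Notepad's "Unicode" is UTF-16 little-endian preceded by a BOM.
        "utf16": codecs.BOM_UTF16_LE + text.encode("utf-16-le"),
    }

def save_as_encodings(src_path, out_dir):
    """Read a UTF-8 file (BOM optional) and write one copy per encoding."""
    with io.open(src_path, "r", encoding="utf-8-sig") as f:
        text = f.read()
    paths = {}
    for label, data in encode_variants(text).items():
        paths[label] = os.path.join(out_dir, "3_" + label + ".txt")
        with io.open(paths[label], "wb") as out:
            out.write(data)
    return paths

# Demo on a temporary file so the sketch runs without the real corpus.
demo_dir = tempfile.mkdtemp()
demo_src = os.path.join(demo_dir, "3.txt")
with io.open(demo_src, "wb") as f:
    f.write(codecs.BOM_UTF8 + "人民 日报\n".encode("utf-8"))

for label, path in sorted(save_as_encodings(demo_src, demo_dir).items()):
    print(label, os.path.getsize(path), "bytes")
```

A character that GBK cannot represent makes `encode("gbk")` raise `UnicodeEncodeError`; passing `errors="replace"` is one way to relax that, at the cost of losing those characters.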
