Counting the 10 most frequent words in an English article

https://blog.csdn.net/u010512607/article/details/40005641

Approach:

1. Read the file line by line and concatenate the text into a string
2. Strip punctuation from the string with a regular expression, then split it into a String[]
3. Count each word's occurrences with a HashMap (optionally filtering modal verbs and other function words via a stop list)
4. Sort the HashMap entries by value (Collections.sort with an overridden Comparator.compare)
5. Print the top 10 words from the HashMap
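The five steps above can also be condensed into a Java 8 streams sketch. This is a standalone alternative, not part of the original post; the sample text and the tiny stop-word set are purely illustrative:

```java
import java.util.*;
import java.util.stream.Collectors;

public class TopWords {
    // Returns the `limit` most frequent words in `text`, excluding `stopWords`.
    static List<String> topWords(String text, Set<String> stopWords, int limit) {
        return Arrays.stream(text.replaceAll("\\pP|\\pS", "")  // step 2: strip punctuation/symbols
                        .toLowerCase().split("\\s+"))          // step 2: split into words
                .filter(w -> !w.isEmpty() && !stopWords.contains(w)) // step 3: stop-list filter
                .collect(Collectors.groupingBy(w -> w, Collectors.counting())) // step 3: count
                .entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()) // step 4: sort by value
                .limit(limit)                                  // step 5: top N
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String sample = "The cat sat. The cat ran! A dog barked.";
        Set<String> stop = new HashSet<>(Arrays.asList("the", "a"));
        System.out.println(topWords(sample, stop, 2)); // "cat" comes first; ties follow in map order
    }
}
```

Reading the file (step 1) would replace `sample` with, e.g., `String.join(" ", Files.readAllLines(path))`.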

Code:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

public class Main {
    public static void main(String[] args) throws IOException {
        readPaper();
    }

    // Print the 10 most frequent words in an English article
    public static void readPaper() throws IOException{

        HashMap<String, Integer> wordMap = new HashMap<String, Integer>();

        File file = new File("e:/info.log");
        BufferedReader br=new BufferedReader(new FileReader(file));

        StringBuilder sb = new StringBuilder();
        String line = null;
        while ((line = br.readLine()) != null) {
            sb.append(line).append(' '); // keep a separator so words at line breaks don't merge
        }
        br.close();


        String words = sb.toString(); // the full text as one string
        String target = words.replaceAll("\\pP|\\pS", ""); // strip punctuation and symbols
        // Lowercase p stands for "property": the prefix for Unicode property classes in regex
        // \pP matches the Unicode "punctuation" character category
        // \pS matches the Unicode "symbol" category (math symbols, currency signs, etc.)
        String[] single = target.split("\\s+"); // \s+ avoids empty tokens from repeated spaces


        String[] keys = { "you", "i", "he", "she", "me", "him", "her", "it",
                "they", "them", "we", "us", "your", "yours", "our", "his",
                "its", "my", "in", "into", "on", "for", "out", "up",
                "down", "at", "to", "too", "with", "by", "about", "among",
                "between", "over", "from", "be", "been", "am", "is", "are",
                "was", "were", "without", "the", "of", "and", "a", "an",
                "that", "this", "or", "as", "will", "would", "can",
                "could", "may", "might", "shall", "should", "must", "has",
                "have", "had", "than" };

        // Replace common function words with '#' so they can be skipped when printing.
        // (The original post left this loop commented out and compared str to itself;
        // the comparison below is the corrected version.)
        for (int i = 0; i < single.length; i++) {
            for (String key : keys) {
                if (key.equalsIgnoreCase(single[i])) {
                    single[i] = "#";
                }
            }
        }

        // Associate each word with its occurrence count
        for(int i=0;i<single.length;i++){
            if(wordMap.get(single[i])==null){
                wordMap.put(single[i],1);       
            }else{
                wordMap.put(single[i], wordMap.get(single[i])+1);
            }
        }

        // Comparator: sort the entries by value in descending order
        List<Entry<String, Integer>> list =
                new ArrayList<Entry<String, Integer>>(wordMap.entrySet());
        Collections.sort(list, new Comparator<Entry<String, Integer>>() {

            @Override
            public int compare(Entry<String, Integer> o1,
                    Entry<String, Integer> o2) {
                // Integer.compare avoids the overflow that plain subtraction can cause
                return Integer.compare(o2.getValue(), o1.getValue());
            }

        });


        // Print the most frequent words, skipping the '#' placeholders
        int count=1;
        for(Map.Entry<String, Integer> entry:list){
            if(entry.getKey().equals("#")){
                continue;
            }
            System.out.println(entry.getKey()+":"+entry.getValue());
            count++;
            if(count==11){
                break;
            }
        }
    }
}
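As a quick sanity check of the `\pP|\pS` comment above, the following standalone snippet (not from the original post) shows what the regex removes:

```java
public class StripPunct {
    public static void main(String[] args) {
        // \pP = Unicode punctuation property; \pS = Unicode symbols (math, currency, etc.)
        String cleaned = "Hello, world! Price: $5.".replaceAll("\\pP|\\pS", "");
        System.out.println(cleaned); // prints "Hello world Price 5"
    }
}
```

Note that removing a symbol surrounded by spaces (e.g. `"5 + tax"`) leaves a double space behind, which is why splitting on `\s+` rather than a single space is safer.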

Reprinted from blog.csdn.net/junjunba2689/article/details/82563722