Counting word frequency with a shell script

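#!/bin/bash
# Print the n most frequent words in a text file
help() {
	echo "This shell script reports the n most frequent words in a text file"
	echo "usage: sh $0 filename n"
	echo "filename is the text file to analyze; n is the number of words to report"
	echo "sh $0 englist_statment.txt 10"
}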
	    
# Sample text, kept as a no-op here-document; quoting EOF prevents any expansion inside it
: <<'EOF'

First Flight
　　Mr. Johnson had never been up in an aerophane before and he had read a lot about air accidents, so one day when a friend offered to take him for a ride in his own small phane, Mr. Johnson was very worried about accepting. Finally, however, his friend persuaded him that it was very safe, and Mr. Johnson boarded the plane.
　　His friend started the engine and began to taxi onto the runway of the airport. Mr. Johnson had heard that the most dangerous part of a flight were the take-off and the landing, so he was extremely frightened and closed his eyes.
　　After a minute or two he opened them again, looked out of the window of the plane, and said to his friend, Look at those people down there. They look as small as ants, dont they?
　　Those are ants, answered his friend. Were still on the ground.
EOF

# Both arguments are required; exit with a non-zero status on incorrect usage
if [[ -z "$1" || -z "$2" ]]; then
	help
	exit 1
fi

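if [[ -f "$1" ]]; then
	# One word per line, lowercase, count duplicates,
	# then sort by count (descending) and by word, and keep the first n lines
	statis=$(tr -cs 'A-Za-z' '\n' < "$1" | tr 'A-Z' 'a-z' | sort | uniq -c | sort -k1,1nr -k2 | head -n "$2")
	echo "$statis"
else
	help
	exit 1
fi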
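
The script hands $2 directly to head, so a count that is not a number only fails at the end of the pipeline. Below is a minimal sketch of the same pipeline wrapped in a function with an up-front check, written in POSIX shell. The name top_words and the error messages are illustrative and not part of the original script.

```shell
# top_words FILE N: print the N most frequent words in FILE.
# (Hypothetical helper; same pipeline as the script, plus argument checks.)
top_words() {
    # Accept only a positive integer without a leading zero.
    case $2 in
        ''|*[!0-9]*|0*) echo "n must be a positive integer" >&2; return 2 ;;
    esac
    # The file must exist and be readable.
    [ -r "$1" ] || { echo "cannot read file: $1" >&2; return 1; }
    tr -cs 'A-Za-z' '\n' < "$1" | tr 'A-Z' 'a-z' |
        sort | uniq -c | sort -k1,1nr -k2 | head -n "$2"
}
```

It would be called as: top_words englist_statment.txt 5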



[root@oracle shellscript]# sh statis_word.sh englist_statment.txt 5
     10 the
      6 and
      6 his
      5 a
      5 friend

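# If the script is called incorrectly, it prints the help text
[root@oracle shellscript]# sh statis_word.sh englist_statment.txt 
This shell script reports the n most frequent words in a text file
usage: sh statis_word.sh filename n
filename is the text file to analyze; n is the number of words to report
sh statis_word.sh englist_statment.txt 10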




[root@oracle shellscript]# tr --help
Usage: tr [OPTION]... SET1 [SET2]
Translate, squeeze, and/or delete characters from standard input,
writing to standard output.

  -c, -C, --complement    first complement SET1
  -d, --delete            delete characters in SET1, do not translate
  -s, --squeeze-repeats   replace each input sequence of a repeated character
                            that is listed in SET1 with a single occurrence
                            of that character
  -t, --truncate-set1     first truncate SET1 to length of SET2
      --help     display this help and exit
      --version  output version information and exit

SETs are specified as strings of characters.  Most represent themselves.
Interpreted sequences are:

  \NNN            character with octal value NNN (1 to 3 octal digits)
  \\              backslash
  \a              audible BEL
  \b              backspace
  \f              form feed
  \n              new line
  \r              return
  \t              horizontal tab
  \v              vertical tab
  CHAR1-CHAR2     all characters from CHAR1 to CHAR2 in ascending order
  [CHAR*]         in SET2, copies of CHAR until length of SET1
  [CHAR*REPEAT]   REPEAT copies of CHAR, REPEAT octal if starting with 0
  [:alnum:]       all letters and digits
  [:alpha:]       all letters
  [:blank:]       all horizontal whitespace
  [:cntrl:]       all control characters
  [:digit:]       all digits
  [:graph:]       all printable characters, not including space
  [:lower:]       all lower case letters
  [:print:]       all printable characters, including space
  [:punct:]       all punctuation characters
  [:space:]       all horizontal or vertical whitespace
  [:upper:]       all upper case letters
  [:xdigit:]      all hexadecimal digits
  [=CHAR=]        all characters which are equivalent to CHAR

Translation occurs if -d is not given and both SET1 and SET2 appear.
-t may be used only when translating.  SET2 is extended to length of
SET1 by repeating its last character as necessary.  Excess characters
of SET2 are ignored.  Only [:lower:] and [:upper:] are guaranteed to
expand in ascending order; used in SET2 while translating, they may
only be used in pairs to specify case conversion.  -s uses SET1 if not
translating nor deleting; else squeezing uses SET2 and occurs after
translation or deletion.

Report bugs to <bug-coreutils@gnu.org>.
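
Two of the options in the help text are not used by the word-count script: the paired [:upper:]/[:lower:] classes for case conversion, and -d for deletion. A short sketch of both follows. The function names to_lower and delete_digits are made up for this illustration.

```shell
# Case conversion with the class pair that the help text describes.
to_lower() {
    tr '[:upper:]' '[:lower:]'
}

# -d deletes every character listed in SET1; no SET2 is given.
delete_digits() {
    tr -d '[:digit:]'
}

printf 'Hello World\n' | to_lower        # prints: hello world
printf 'a1b22c333\n' | delete_digits     # prints: abc
```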



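# The word-splitting step, here with SET2 in the [CHAR*] form, which repeats \n up to the length of SET1
tr -cs "[A-Z][a-z]" "[\n*]"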

# Test what -c means. The file test0.sh contains uppercase letters, lowercase letters, and digits.
[root@oracle shellscript]# more test0.sh


M C  a b 8 6

[root@oracle shellscript]# more test0.sh |tr -c "[A-Z]" "$"
$$M$C$$$$$$$$$$$

[root@oracle shellscript]# more test0.sh |tr -c "[a-z]" "$"
$$$$$$$a$b$$$$$$

[root@oracle shellscript]# more test0.sh |tr -c "[:digit:]" "$"
$$$$$$$$$$$8$6$$

This shows that -c means complement: every character that is not in SET1 is replaced with SET2.
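
One detail is easy to miss in these tests. In tr the square brackets are special only in forms such as [:class:], [=c=], and [c*n]. In a set written as "[A-Z]" they are ordinary characters and become members of the set, so -c leaves them alone. The sketch below uses printf input instead of test0.sh so that it is self-contained.

```shell
# Brackets are part of SET1, so they survive the complement.
printf '[M]\n' | tr -c '[A-Z]' '$'; echo     # prints: [M]$
# Without brackets in SET1, only the letter survives.
printf '[M]\n' | tr -c 'A-Z' '$'; echo       # prints: $M$$
```

The trailing echo only adds a newline, because tr has replaced the one in the input.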

-s squeezes each run of the same repeated character into a single occurrence.
[root@oracle shellscript]# more test0.sh |tr -cs "[:digit:]" "$"
$8$6$[root@oracle shellscript]#
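
The prompt appears on the same line as the output because the final newline is not a digit either, so tr replaces it as well. A small sketch with printf input (not the original test0.sh) compares -c alone with -cs:

```shell
# Without -s every non-digit becomes its own "$".
printf 'a  8 6\n' | tr -c '0-9' '$'; echo     # prints: $$$8$6$
# With -s each run of "$" is squeezed to one.
printf 'a  8 6\n' | tr -cs '0-9' '$'; echo    # prints: $8$6$
```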


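tr -cs "[a-z][A-Z]" "\n" therefore replaces every character that is not a letter with a newline and squeezes each run of newlines into one, leaving one word per line.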
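
One edge case follows from this. If the input begins with a non-letter, the first line of the tr output is empty, and uniq -c would count it as a word. The sketch below filters it out with grep. The function name split_words is hypothetical, and the filter is an addition, not something the original script does.

```shell
# split_words: read text on stdin, print one word per line,
# dropping the empty line produced by leading non-letters.
split_words() {
    tr -cs 'A-Za-z' '\n' | grep -v '^$'
}

# The input starts with spaces and a quote, so plain tr yields 4 lines (the first is empty).
printf '  "Hello," said Bob.\n' | tr -cs 'A-Za-z' '\n' | wc -l
# After filtering, only the 3 words remain.
printf '  "Hello," said Bob.\n' | split_words | wc -l
```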