Implementing an Inverted Index with Hadoop MapReduce

Copyright notice: this is an original post by the author and may not be reproduced without permission. https://blog.csdn.net/u011066470/article/details/86530783

1. Background and requirements

Build an inverted index (term -> document list): count the occurrences of each word in each input file, and print the output in the format "word  file_path->count;file_path->count;...", for example:

Aili hdfs://192.168.59.128:9000/inverseindex/b.txt->1;hdfs://192.168.59.128:9000/inverseindex/a.txt->1;

baidu hdfs://192.168.59.128:9000/inverseindex/a.txt->1;hdfs://192.168.59.128:9000/inverseindex/c.txt->1;

Corpus format:

a.txt:

baidu top1

aili top2

tengxun top3

xiaomi top4

ultrapower top5

java top6

python top7

b.txt:

 c top2

java top1

python top5

c++ top4

aili top0

tengxun top1

c++ top5

c.txt:

java top1

baidu top2

c top3

java top0

2. Implementation

2.1  Driver (main) code
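The driver code, which configures the job and registers the Mapper, Combiner, and Reducer classes, is not reproduced in the text of this post (it appears to have been embedded as an image). As a hedged stand-in, the whole map -> combine -> shuffle -> reduce flow can be simulated locally in Python on the sample corpus; all names here are illustrative rather than taken from the original Java, and paths are shortened to bare file names for readability:

```python
from collections import defaultdict

# Local simulation of the whole job on the sample corpus:
# map (tokenize each line) -> combine (total counts per word per file)
# -> shuffle (group by word) -> reduce (join "path->count" values with ";").
corpus = {
    "a.txt": ["baidu top1", "aili top2", "tengxun top3", "xiaomi top4",
              "ultrapower top5", "java top6", "python top7"],
    "b.txt": ["c top2", "java top1", "python top5", "c++ top4",
              "aili top0", "tengxun top1", "c++ top5"],
    "c.txt": ["java top1", "baidu top2", "c top3", "java top0"],
}

groups = defaultdict(list)           # shuffle target: word -> ["path->count", ...]
for path, lines in corpus.items():
    counts = defaultdict(int)        # combiner stage: per-file word totals
    for line in lines:
        for word in line.split():
            counts[word] += 1
    for word, total in counts.items():
        groups[word].append(f"{path}->{total}")

# Reduce: one output line per word, in the "word path->count;..." format.
output_lines = [word + " " + ";".join(groups[word]) + ";" for word in sorted(groups)]
print("\n".join(output_lines))
```

On the real cluster, the shuffle between map and reduce is performed by Hadoop itself; this script only mimics the data flow so the output format can be checked locally.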

2.2  Mapper code
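The mapper code itself is also missing from the post's text. As a sketch of its logic, assuming the usual composite-key design for this problem, the mapper tokenizes each line and emits `word->file_path` as the key with a count of 1 (each sample line has two tokens, which matches the job counters below: Map input records=18, Map output records=36). The function name and the use of Python are illustrative only, not the original Java:

```python
# Sketch of the Mapper logic: for each input line, emit ("word->file_path", 1).
# The composite key lets the combiner total the counts for one file locally.
def inverted_index_map(file_path, line):
    for word in line.split():
        yield (f"{word}->{file_path}", 1)

print(list(inverted_index_map("/inverseindex/a.txt", "baidu top1")))
# [('baidu->/inverseindex/a.txt', 1), ('top1->/inverseindex/a.txt', 1)]
```

In the real Java Mapper, the file path would be obtained from the task's input split rather than passed as an argument.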

2.3  Combiner code
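The combiner code is not shown either. A hedged sketch of the logic: the combiner sums the 1s for each `word->path` key produced by a single map task, then re-keys its output to just `word` with value `path->count`, so the reducer only has to concatenate. In the job log below, this stage is what shrinks Combine input records=36 down to Combine output records=32. Again, names are illustrative:

```python
from collections import defaultdict

# Sketch of the Combiner logic: sum counts per "word->path" key from one map
# task, then re-key the output as (word, "path->count") for the reducer.
def inverted_index_combine(pairs):
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    for key, total in totals.items():
        word, path = key.split("->", 1)
        yield (word, f"{path}->{total}")

map_out = [("java->/inverseindex/c.txt", 1), ("java->/inverseindex/c.txt", 1)]
print(list(inverted_index_combine(map_out)))
# [('java', '/inverseindex/c.txt->2')]
```

Note that re-keying in a combiner is safe here because the shuffle after it partitions on the new key (the word), which is exactly what the reducer needs to group on.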

2.4  Reducer code
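The reducer code is missing from the text as well. Its logic is the simplest of the three: for each word it receives all the `path->count` values and joins them with ";" to form the final output line shown in the requirements. A sketch in Python, with an illustrative function name:

```python
# Sketch of the Reducer logic: for each word, concatenate all "path->count"
# values with ";" to produce the final "word path->count;path->count;..." line.
def inverted_index_reduce(word, values):
    return word + " " + ";".join(values) + ";"

print(inverted_index_reduce("baidu", [
    "hdfs://192.168.59.128:9000/inverseindex/a.txt->1",
    "hdfs://192.168.59.128:9000/inverseindex/c.txt->1",
]))
# baidu hdfs://192.168.59.128:9000/inverseindex/a.txt->1;hdfs://192.168.59.128:9000/inverseindex/c.txt->1;
```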

3. Package and upload the files and code

Package the program into a jar, and upload it together with the corpus files a.txt, b.txt, and c.txt to a Linux directory:

4. Upload the files to HDFS and run the jar

Create a directory:

[root@naidong sbin]# hadoop fs -mkdir  /inverseindex

Upload the files:

[root@naidong jurf_temp_data]# hadoop fs -put a.txt b.txt c.txt /inverseindex

Run the job:

[root@naidong jurf_temp_data]# hadoop jar hadoop-demo-inverseindex.jar  /inverseindex  /inverseindexout

2019-01-16 21:12:40,752 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032

2019-01-16 21:12:43,883 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.

2019-01-16 21:12:44,027 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1547639631603_0002

2019-01-16 21:12:48,398 INFO input.FileInputFormat: Total input files to process : 3

2019-01-16 21:12:50,021 INFO mapreduce.JobSubmitter: number of splits:3

2019-01-16 21:12:50,520 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled

2019-01-16 21:12:51,580 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1547639631603_0002

2019-01-16 21:12:51,584 INFO mapreduce.JobSubmitter: Executing with tokens: []

2019-01-16 21:12:52,803 INFO conf.Configuration: resource-types.xml not found

2019-01-16 21:12:52,804 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.

2019-01-16 21:12:53,408 INFO impl.YarnClientImpl: Submitted application application_1547639631603_0002

2019-01-16 21:12:53,844 INFO mapreduce.Job: The url to track the job: http://naidong:8088/proxy/application_1547639631603_0002/

2019-01-16 21:12:53,845 INFO mapreduce.Job: Running job: job_1547639631603_0002

2019-01-16 21:13:28,404 INFO mapreduce.Job: Job job_1547639631603_0002 running in uber mode : false

2019-01-16 21:13:28,431 INFO mapreduce.Job:  map 0% reduce 0%

2019-01-16 21:14:46,181 INFO mapreduce.Job:  map 67% reduce 0%

2019-01-16 21:14:47,873 INFO mapreduce.Job:  map 100% reduce 0%

2019-01-16 21:15:42,011 INFO mapreduce.Job:  map 100% reduce 100%

2019-01-16 21:15:45,096 INFO mapreduce.Job: Job job_1547639631603_0002 completed successfully

2019-01-16 21:15:45,734 INFO mapreduce.Job: Counters: 53

File System Counters

FILE: Number of bytes read=1811

FILE: Number of bytes written=856475

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

HDFS: Number of bytes read=534

HDFS: Number of bytes written=1683

HDFS: Number of read operations=14

HDFS: Number of large read operations=0

HDFS: Number of write operations=2

Job Counters

Launched map tasks=3

Launched reduce tasks=1

Data-local map tasks=3

Total time spent by all maps in occupied slots (ms)=451496

Total time spent by all reduces in occupied slots (ms)=76626

Total time spent by all map tasks (ms)=225748

Total time spent by all reduce tasks (ms)=25542

Total vcore-milliseconds taken by all map tasks=225748

Total vcore-milliseconds taken by all reduce tasks=25542

Total megabyte-milliseconds taken by all map tasks=462331904

Total megabyte-milliseconds taken by all reduce tasks=78465024

Map-Reduce Framework

Map input records=18

Map output records=36

Map output bytes=1956

Map output materialized bytes=1823

Input split bytes=330

Combine input records=36

Combine output records=32

Reduce input groups=18

Reduce shuffle bytes=1823

Reduce input records=32

Reduce output records=18

Spilled Records=64

Shuffled Maps =3

Failed Shuffles=0

Merged Map outputs=3

GC time elapsed (ms)=2263

CPU time spent (ms)=10430

Physical memory (bytes) snapshot=745897984

Virtual memory (bytes) snapshot=12588302336

Total committed heap usage (bytes)=436482048

Peak Map Physical memory (bytes)=207892480

Peak Map Virtual memory (bytes)=2748080128

Peak Reduce Physical memory (bytes)=134983680

Peak Reduce Virtual memory (bytes)=4361895936

Shuffle Errors

BAD_ID=0

CONNECTION=0

IO_ERROR=0

WRONG_LENGTH=0

WRONG_MAP=0

WRONG_REDUCE=0

File Input Format Counters

Bytes Read=204

File Output Format Counters

Bytes Written=1683

2019-01-16 21:15:55,761 WARN util.ShutdownHookManager: ShutdownHook '' timeout, java.util.concurrent.TimeoutException

java.util.concurrent.TimeoutException

at java.util.concurrent.FutureTask.get(FutureTask.java:205)

at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:68)

5. View the results

Extracted results:

For the accompanying documents, see Baidu Netdisk: 大数据资料/2019大数据资料
