Using MapReduce to find the number of sales and the total sales amount in each year

Experimental environment: Linux, Python 3

Experiment requirements:

1. Create the map.py file, paying attention to the following points:

   1. Skip blank lines, e.g. by checking that the stripped line is non-empty (line.strip() != "")
   2. The time field contains year, month, and day; only the year needs to be extracted
   3. Each line of order data counts as 1 transaction
   4. Target output format: year,amount

2. Write the reduce.py code yourself, paying attention to the following points:

   1. The number of input lines for a year is its number of transactions
   2. The output should be sorted by year
   3. Target output format: year, number of transactions, total amount

   (The total amount must be converted to thousands (k) with two decimal places, e.g. 13572468.98 -> 13572.46k)
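The amount conversion in the example truncates rather than rounds the third decimal (a plain '%.2f' would turn 13572468.98 / 1000 into 13572.47). A minimal sketch reproducing the example exactly; the helper name to_k is ours, not part of the assignment:

```python
def to_k(amount):
    """Convert an amount to thousands with two decimal places, truncating
    (not rounding) the third decimal, e.g. 13572468.98 -> '13572.46k'."""
    # Dividing by 10 and taking int() drops every digit beyond the second
    # decimal of amount/1000; dividing by 100.0 restores the scale.
    truncated = int(amount / 10) / 100.0
    return '%.2fk' % truncated

print(to_k(13572468.98))   # 13572.46k
```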

Experimental ideas:

1. Map phase: mapping, i.e. selecting the fields to be processed. From the requirements of this task, the key-value pair produced by the map is: (k1, v1) = (year, amount)
2. Reduce phase: reduction, i.e. aggregating the records that share the same key. The key-value pair finally produced by the reduce is: (k2, v2) = (year, (number of transactions, total amount))

Implementation

Implementation of the map function

# Implementation of the map function
import sys

for line in sys.stdin:
    line = line.strip()
    if line != "":                     # skip blank lines
        col = line.split(',')
        year = col[2].split('-')[0]    # "YYYY-MM-DD" -> keep only the year
        print(year + ',' + col[6])     # output format: year,amount
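As a quick sanity check, the map logic can be exercised on a single sample line. The line below is made up; the assumption (as in the code) is that field 3 holds the date and field 7 the amount:

```python
# Hypothetical order line; fields 3 and 7 are assumed to be the date and amount
line = "1001,cust42,2019-05-17,north,online,card,199.99"
col = line.strip().split(',')
year = col[2].split('-')[0]        # "2019-05-17" -> "2019"
print(year + ',' + col[6])         # 2019,199.99
```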

Implementation of the reduce function

#!/bin/env python
# encoding: utf-8
# Implementation of the reduce function
from operator import itemgetter
import itertools
import sys


def read_mapper_output(file, separator=','):
    for line in file:
        yield line.rstrip().split(separator, 1)


stdin_generator = read_mapper_output(sys.stdin, ',')
# groupby merges consecutive records with the same key, so the input
# must already be sorted by year (the sort step in the pipeline does this)
for year, sals in itertools.groupby(stdin_generator, itemgetter(0)):
    count = 0
    total_sal = 0.0
    for year, cur_sal in sals:
        count += 1
        total_sal += float(cur_sal)
    # total amount in thousands (k) with two decimal places
    print(year, '\t', count, '\t', '%.2fk' % (total_sal / 1000))
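The same aggregation can be checked without stdin on a small, already-sorted sample. The values below are made up, in the year,amount shape the mapper emits:

```python
from operator import itemgetter
import itertools

# Hypothetical, pre-sorted (year, amount) pairs as produced by the mapper
pairs = [('2018', '100.00'), ('2018', '250.50'), ('2019', '75.25')]
for year, group in itertools.groupby(pairs, itemgetter(0)):
    amounts = [float(a) for _, a in group]
    print(year, '\t', len(amounts), '\t', '%.2fk' % (sum(amounts) / 1000))
```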

Running results

Map running results
Run map.py alone and display the first 10 lines of its output.
Local pipeline test of the map code:

cat sales.csv | python map.py 

MapReduce running results
Overall running results
MapReduce local pipeline test:

cat sales.csv | python map.py | sort -k 1 | python reduce.py
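The `sort -k 1` step is essential here: itertools.groupby only merges *adjacent* records with equal keys, so unsorted mapper output would split one year into several groups. A small illustration (data made up):

```python
import itertools

pairs = [('2019', '1.00'), ('2018', '2.00'), ('2019', '3.00')]

# Without sorting, 2019 is split into two separate groups
keys = [k for k, _ in itertools.groupby(pairs, key=lambda p: p[0])]
print(keys)          # ['2019', '2018', '2019']

# After sorting, each year forms exactly one group
sorted_keys = [k for k, _ in itertools.groupby(sorted(pairs), key=lambda p: p[0])]
print(sorted_keys)   # ['2018', '2019']
```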

Implementing the MapReduce job on a Hadoop cluster

1. Open a command line window and start Hadoop

cd /opt/hadoop/sbin
hadoop namenode -format # format the NameNode (only needed on first setup; it erases HDFS metadata)
start-all.sh # start the Hadoop daemons
jps # check the running Java processes

2. Create the directory /001/input on HDFS and upload the data file to HDFS

hdfs dfs -mkdir -p /001/input # create the directory

hdfs dfs -put sales.csv /001/input # upload the file to the cluster directory
hdfs dfs -ls /001/input # check that the upload succeeded

3. Create a new XX.sh file

# Note: a comment cannot follow the line-continuation backslash, so the
# option comments are placed on their own lines instead:
#   stream.non.zero.exit.is.failure=false : do not treat a non-zero exit code as failure
#   mapred.job.name="streaming_count"     : job name, visible at localhost:8088
#   mapred.job.priority=HIGH              : jobs with higher priority run first
$HADOOP_HOME/bin/hadoop jar \
$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
-D stream.non.zero.exit.is.failure=false \
-D mapred.job.name="streaming_count" \
-D mapred.job.priority=HIGH \
-files "/home/ubuntu/PycharmProjects/untitled/map.py,/home/ubuntu/PycharmProjects/untitled/reduce.py" \
-input /001/input/sales.csv \
-output /001/input/out001 \
-mapper "python3 map.py" \
-reducer "python3 reduce.py"

Then open a terminal in the directory containing XX.sh and run:

sh XX.sh

The result of running the above command is shown below.


Origin blog.csdn.net/qq_45578279/article/details/109450305