Tsinghua big data assignment: MapReduce processing of hundreds of gigabytes of JSON data

Link: https://github.com/datamaning/MapReduce

 

MapReduce

Tsinghua University Big Data Assignment: MapReduce Processing of User Weibo JSON Data

Hadoop Experiment

  1. Input data

The input data files have been saved in the /input-user directory of Hadoop's HDFS (Hadoop Distributed File System); see Section 5 for how to use Hadoop. The input data is a compressed JSON file. Hadoop automatically handles input in compressed formats, so no additional decompression is needed.

The input data is stored in JSON format, and each line is either a Weibo (microblog) record or a user-information record (your program must distinguish between the two and handle them separately). The format of each Weibo record is as follows:

{ "_id" : XX, "user_id" : XXXXXXXX, "text" : "XXXXXXXXXXXXXX", "created_at" : "XXXXXX" }

In this experiment, we only need the user_id item, which is the id of the user who posted the Weibo.

The format of each user information is as follows:

{ "_id" : XXXXXX, "name" : "XXXX" }

In this experiment, both items are used: _id is the user's id, and name is the user's username.

For an example of input data, see "Sample Input Data Format" at the end of this page.

  2. Output data

This experiment requires outputting each username together with the number of Weibo posts that user made, sorted by count from largest to smallest. For example:

86 The poor boy Wang Benben

82 Sun Ziyue is CC, not CC

79 Kidd Scales

75 Sweet and Sour Fishbone

On each line, separate the count and the username with a tab character (\t). One possible way to obtain the descending order is sketched below.
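
The provided template may already handle the ordering; purely as an illustration of one possible approach (an assumption, not the template's method), a second MapReduce job can re-sort the unsorted output of the counting job (lines of the form count \t name) by emitting the count as the key and replacing Hadoop's default ascending comparator. All class names and HDFS paths below (SortDriver, SwapMapper, /tmp-usercount, /output-user) are hypothetical.

// Hedged sketch only: a second job that sorts "count \t name" lines in descending order of count.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SortDriver {

    // Hypothetical mapper: parses a "count \t name" line and emits (count, name).
    public static class SwapMapper extends Mapper<Object, Text, LongWritable, Text> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t");
            if (parts.length < 2) return;                      // skip malformed lines
            context.write(new LongWritable(Long.parseLong(parts[0])), new Text(parts[1]));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sort by weibo count");
        job.setJarByClass(SortDriver.class);
        job.setMapperClass(SwapMapper.class);
        // Replace the default ascending key sort with a descending one.
        job.setSortComparatorClass(LongWritable.DecreasingComparator.class);
        job.setNumReduceTasks(1);                              // one reducer => one globally ordered file
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("/tmp-usercount"));   // hypothetical output dir of the counting job
        FileOutputFormat.setOutputPath(job, new Path("/output-user"));   // hypothetical final result dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}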

  3. Program template

See the template UserCount.java for details. Please read the template carefully; it contains many hints. You can also refer to the WordCount sample program on the Hadoop official website.
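
To make the overall data flow concrete, here is a minimal sketch of one possible map/reduce structure (a reduce-side join of Weibo records with user-information records). It is an illustration under assumptions, not the provided UserCount.java template, and the class names are invented.

// Hedged sketch, not the provided template: join Weibo records with user-info records by user id.
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.json.JSONException;
import org.json.JSONObject;

public class UserCountSketch {

    // Map: a Weibo record contributes (user_id, "1"); a user-info record contributes (_id, "name\t<name>").
    public static class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            try {
                JSONObject js = new JSONObject(value.toString());
                if (js.has("user_id")) {                                   // Weibo record
                    context.write(new Text(js.optString("user_id")), new Text("1"));
                } else if (js.has("name")) {                               // user-info record
                    // note: "_id" may be a nested { "$numberLong" : ... } object; see the samples later on this page
                    context.write(new Text(js.optString("_id")), new Text("name\t" + js.optString("name")));
                }
            } catch (JSONException e) {
                // skip malformed lines instead of failing the task
            }
        }
    }

    // Reduce: sum the "1"s for a user id and pick up the username seen for the same id.
    public static class JoinReducer extends Reducer<Text, Text, LongWritable, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            long count = 0;
            String name = null;
            for (Text v : values) {
                String s = v.toString();
                if (s.startsWith("name\t")) name = s.substring(5);
                else count += Long.parseLong(s);
            }
            // The required descending order is not produced here; see the sort sketch above.
            if (name != null && count > 0) context.write(new LongWritable(count), new Text(name));
        }
    }
}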

  4. Compile

To compile, enter the /home/ubuntu/project directory in the virtual machine and run:

sh compile.sh

This compiles the .java file and automatically packages the compiled Java program into a jar. Afterwards, a file named UserCount.jar will appear in this directory; this is the file used when running the Hadoop program.

Even if there is a compilation error, the script will still generate a jar package, but such a jar cannot be submitted to Hadoop for execution. Please make sure there are no compilation errors before proceeding to the next steps.

  5. Run the program

Go to the /home/ubuntu/project directory of the virtual machine and run the command:

sh start-cluster.sh

Hadoop's DFS and YARN services will be started automatically. At this point, the entire Hadoop cluster should be running. You can view the running Java processes with the jps command; normally the list should include at least the following four: Jps, NameNode, SecondaryNameNode, and ResourceManager.

To run the jar package you just compiled, run the command in the /home/ubuntu/project directory:

sh run-job.sh

If you are familiar with Hadoop, you can also submit tasks with the Hadoop commands directly; Hadoop is installed in the /home/ubuntu/hadoop-2.7.1 directory. But be sure to start the Hadoop cluster with start-cluster.sh, because some initialization of the cluster needs to be done before it starts.

Note that when the task is run repeatedly, if the result files or temporary intermediate files generated during a previous run are still stored in HDFS, a file-conflict error will be reported at run time, and the conflicting directory or file must be deleted in advance. The specific deletion commands are described in the Hadoop usage reference below.
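
For example, assuming the job writes its result to a hypothetical /output-user directory (the actual path depends on your program), the old output could be removed before re-running with:

/home/ubuntu/hadoop-2.7.1/bin/hadoop fs -rm -r /output-user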

If you have more questions about the operation of Hadoop, you can refer to the WordCount sample program or the MapReduce tutorial.

  6. Calculation results

Which HDFS directory the calculation result is stored in depends on your Java program. To copy the result from HDFS to a local path in the virtual machine, use:

/home/ubuntu/hadoop-2.7.1/bin/hadoop fs -copyToLocal /path/on/hdfs /path/on/master

Where /path/on/hdfs is the path of the result file on HDFS, and /path/on/master is the local path where you want to store the calculation result.

After that, there will be a file named part-r-00000 (or a similar part-XXXXX file) in the /path/on/master directory; this is the calculation result.

If the result looks correct, rename the result file to result.txt and place it in the /home/ubuntu/project folder. Running sh submit.sh will automatically submit your code and result.txt to the scoring server for staging (you can also save unfinished code this way). Click the submit button on the assignment submission page, and the server will check your results for correctness and give you a grade.

Hadoop usage reference

Hadoop is installed in the /home/ubuntu/hadoop-2.7.1 directory of the virtual machine, and we can run Hadoop commands in the /home/ubuntu/hadoop-2.7.1/bin directory.

./hadoop fs -ls /path/on/hdfs : list the contents of a directory on HDFS, similar to the local ls command.
./hadoop fs -rm /path/on/hdfs : delete a file on HDFS.
./hadoop fs -rm -r /path/on/hdfs : delete a directory on HDFS.
./hadoop fs -cat /path/on/hdfs : view the content of a file on HDFS. Files on HDFS are generally large, so it is recommended to use ./hadoop fs -cat /path/on/hdfs | head -n 10 to view only the first 10 lines.
./hadoop fs -copyToLocal /path/on/hdfs /path/on/local : copy a file from HDFS to the local file system of the virtual machine.
./yarn application -list : list the currently executing tasks; you can use this command to check task progress.
./yarn application -kill <application-id> : terminate an executing task.

Processing JSON format files

For the JSON-format input, you do not need to write your own parser to read these items; you can use the JSONObject class from the org.json package. Typical usage:

String s = "{ \"_id\" : XX, \"user_id\" : XXXXXXXX, \"text\" : \"XXXXXXXXXXXXXX\", \"created_at\" : \"XXXXXX\" }";

JSONObject js = new JSONObject(s);

Then you can extract the items you need from this JSONObject, for example:

String id = null;

if (js.has("user_id"))

    id = js.optString("user_id");

Remember to use has() to check whether an item exists before reading it; otherwise the program may not behave as expected or may fail with an error.

In addition, the statement JSONObject js = new JSONObject(s); (and similar JSON accessor calls) can throw a JSONException, so wrap these calls in a try-catch block or declare the exception with throws, as in the sketch below.
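
A minimal sketch of the parsing step with the exception handled (the class and method names here are illustrative, not part of the template):

import org.json.JSONException;
import org.json.JSONObject;

public class ParseExample {
    // Returns the user_id of a Weibo record, or null for user-info records and malformed lines.
    public static String extractUserId(String line) {
        try {
            JSONObject js = new JSONObject(line);
            if (js.has("user_id")) {
                return js.optString("user_id");
            }
        } catch (JSONException e) {
            // malformed JSON: ignore the line (or log it) instead of crashing the task
        }
        return null;
    }
}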

Example of input data format

{ "_id" : 376049, "user_id" : 1643097417, "text" : "Apple cooperates with China Unicom, iphone is here; Google platform is married with China Mobile, ophone is here; what's up with telecommunications, is BlackBerry coming??" , "created_at" : "Tue Sep 01 08:38:17 +0800 2009" }

{ "_id" : 376134, "user_id" : 1223277140, "text" : "I resigned on the first day of September!", "created_at" : "Tue Sep 01 08:41:57 +0800 2009" }

{ "_id" : 376244, "user_id" : 1639065174, "text" : "Autumn is finally here and the weather is about to turn cold, which is the last thing I want to see", "created_at" : "Tue Sep 01 08: 46:52 +0800 2009" }

{ "_id" : 376336, "user_id" : 1640712172, "text" : "08:47 North 2nd Ring Road Deshengmen Bridge to Xiaojie Bridge west to east, Zhonglou North Bridge to Xizhimen east to west, traffic is slow.  http:/ /sinaurl.cn/h5daT ", "created_at" : "Tue Sep 01 08:50:33 +0800 2009" }

{ "_id" : 1886796931, "name" : "叙Carens" }

{ "_id" : 1374588365, "name" : "畅畅bella" }

{ "_id" : 1784427554, "name" : "I want to take two exams" }

{ "_id" : { "$numberLong" : "2372210700" }, "name" : "Wang Lixin Runaway Girl" }

{ "_id" : { "$numberLong" : "2253254920" }, "name" : "Granville丶" }

{ "_id" : 1915163215, "name" : "Xiaoya rushes forward" }

Spark experiment

Input data

The input data has been saved in HDFS at hdfs://192.168.70.141:8020/Assignment1. Each line represents an article, in the format:

doc_id, doc_content

Sample article:

317,newsgroups rec motorcyclespath cantaloupe srv cs cmu rochester cornell batcomputer caen uwm linac uchinews quads geafrom gea quads uchicago gerardo enrique arnaez subject bike message apr midway uchicago sender news uchinews uchicago news system reply gea midway uchicago eduorganization university chicagodate mon apr gmtlines honda shadow intruder heard bikes plan time riding twin cruisers bikes dont massive goldw

Output data

This experiment requires counting the document frequency (DF) of every word, i.e. the number of articles that contain the word. For example, DF=10 for a word means that 10 articles contain it.

The experiment requires outputting, in JSON format, every word whose DF equals 10 together with its inverted list. Each line corresponds to one word. The format is defined as follows:

{w1: [ { w1_d1: [ w1_d1_p1, w1_d1_p2 ] }, { w1_d2: [ w1_d2_p1 ] }, { w1_d3: [ w1_d3_p1 ] } ]}

{w2: [ { w2_d1: [ w2_d1_p1, w2_d1_p2, w2_d1_p3 ] }, { w2_d2: [ w2_d2_p1 ] } ]}

As shown in the inverted lists above, the word w1 appears in three documents whose IDs are w1_d1, w1_d2, and w1_d3. The word w1 appears twice in document w1_d1, at positions w1_d1_p1 and w1_d1_p2, and once in each of documents w1_d2 and w1_d3, at positions w1_d2_p1 and w1_d3_p1 respectively. A sample output line is as follows:

{"circle":[{"642":[136] },{"120":[165] },{"1796":[75] },{"1862":[168] },{"611":[210] },{"646":[37] },{"519":[150] },{"1469":[944] },{"558":[108] },{"1463":[102] }]}

Please use the println function to print the result to standard output.
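
The assignment itself should be completed in the Scala template described below. Purely to illustrate the document-frequency computation, here is a hedged sketch using Spark's Java API; it only counts DF and prints the words whose DF is 10, it does not build the positional inverted lists required above, and all class and variable names are assumptions (only the HDFS path comes from this page).

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFlatMapFunction;

import scala.Tuple2;

public class DFCountSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("df-count"));
        JavaRDD<String> docs = sc.textFile("hdfs://192.168.70.141:8020/Assignment1");

        // Each line is "doc_id,doc_content"; emit one (word, doc_id) pair per word occurrence.
        JavaPairRDD<String, String> wordDoc = docs.flatMapToPair(
            (PairFlatMapFunction<String, String, String>) line -> {
                List<Tuple2<String, String>> pairs = new ArrayList<>();
                int comma = line.indexOf(',');
                if (comma < 0) return pairs;                   // skip malformed lines
                String docId = line.substring(0, comma);
                for (String w : line.substring(comma + 1).trim().split("\\s+")) {
                    if (!w.isEmpty()) pairs.add(new Tuple2<>(w, docId));
                }
                return pairs;                                  // Spark 1.x Java API expects an Iterable here
            });

        // DF = number of distinct documents containing the word.
        JavaPairRDD<String, Integer> df = wordDoc
                .distinct()
                .mapToPair(p -> new Tuple2<>(p._1(), 1))
                .reduceByKey((a, b) -> a + b);

        // Print the words whose DF is exactly 10 (the assignment additionally requires their inverted lists).
        for (Tuple2<String, Integer> entry : df.filter(p -> p._2() == 10).collect()) {
            System.out.println(entry._1());
        }
        sc.stop();
    }
}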

Program template

For the program template, see src/main/scala/com/dataman/demo/Assignment1.scala in the project directory. The file can be edited directly with the online editor.

Compile

We provide two ways to compile; if sbt is slow, you can try compiling with Maven instead. Execute the corresponding command in the project folder of the virtual machine.

sbt:

sbt clean assembly

If the compilation is successful, the compilation result is saved in target/scala-2.10/spark-demo-assembly-1.0.jar

maven:

mvn clean package

If the compilation is successful, the compilation result is saved in target/spark-demo-1.0.jar

Run the program

This experiment runs the Spark program in a Docker container. If you are not familiar with the concept of containers, it is recommended to read https://en.wikipedia.org/wiki/LXC first.

  1. Start the docker container:

docker run -it --net host -v /home/ubuntu/project:/tmp/project offlineregistry.dataman-inc.com/tsinghua/spark:1.5.0-hadoop2.6.0 bash

This command creates a Spark Docker container and enters the container's /opt/spark/dist directory, which is also Spark's home directory. It also mounts the host directory /home/ubuntu/project at /tmp/project inside the container, where you can find your compiled jar.

  2. Run Spark:

In the Spark directory, which is /opt/spark/dist in the container, run the following command:

bin/spark-submit --jars /tmp/project/target/spark-demo-1.0.jar --class com.dataman.demo.Assignment1 /tmp/project/target/spark-demo-1.0.jar > /tmp/project/data/answer

The output will be redirected to the data/answer file under the project directory.

  3. To exit Docker, use ctrl+p then ctrl+q; this ensures the container is not stopped. If you need to re-enter the container, use docker attach <containerID>, where the containerID can be found with docker ps.
