Connecting IDEA on Windows 10 to the Hadoop cluster in a virtual machine (follow along and you'll get it working)

Table of contents

Introduction

Environment

Prerequisites

Implementation

Test

Conclusion

Problems

Introduction:

        Our distributed-systems course requires using an IDE (IDEA or Eclipse) to write programs that perform file operations directly on a Hadoop cluster. The tutorials on connecting IDEA to a Hadoop cluster are currently a mixed bag, so after working through several of them I put together the complete process of connecting IDEA to a Hadoop cluster, shown below.

        If you think the article is not organized well, or there is something you don’t understand, please leave me a message.

Environment:

        Windows 10 (IDEA 2021.1.3)

        VMware Workstation 16 Pro (installation tutorials are easy to find online)

        Linux servers (Hadoop 2.7.7 cluster: 1 master, 3 slaves)

For the cluster setup, see: Hadoop cluster construction (super detailed)_Ruan Hahahahaha's blog-CSDN blog

For how IDEA connects to a Hadoop cluster, see: idea connects to local virtual machine Hadoop cluster to run wordcount - Xu Chunhui - Blog Park (cnblogs.com)

Prerequisites:

       1. Build a fully distributed Hadoop cluster in the virtual machines. Start the Hadoop cluster on the master node with start-all.sh, then run jps; output like the following indicates that the cluster was built successfully.
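
        For reference, a typical jps listing on the master node shows the NameNode, SecondaryNameNode, and ResourceManager daemons (the PIDs below are made up for illustration; the slave nodes should show DataNode and NodeManager instead):

2310 NameNode
2542 SecondaryNameNode
2715 ResourceManager
3001 Jps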

        Of course, you can also check through the web interface provided by Hadoop; generally we visit http://192.168.xx.101:50070 in the browser. (Note: sometimes this page does open, but you still need to check whether the DataNodes are running normally. It can happen that the DataNode configuration is wrong while the Hadoop cluster still starts successfully, and then subsequent file operations will not run properly.)

        Click Datanodes; an interface listing the DataNodes should appear, indicating that the Hadoop cluster is configured correctly.

        2. Install the IDEA development tool.

Implementation:

        Configure Hadoop on Windows

        1. Download hadoop-2.7.7.tar.gz to Windows (all Hadoop versions are available for download; I chose 2.7.7).

Hadoop is cross-platform, so there is no need to worry about incompatibility between Linux and Windows. Note, however, that JAVA_HOME in hadoop-2.7.7/etc/hadoop/hadoop-env.cmd (the Windows counterpart of hadoop-env.sh) needs to be changed to the JDK path on Windows.
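
For example, assuming the JDK from step 5 below is installed at E:\ProgramSoftware\java\JAVAHOME\jdk1.8.0_162, the line in hadoop-env.cmd would look roughly like this:

set JAVA_HOME=E:\ProgramSoftware\java\JAVAHOME\jdk1.8.0_162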

        2. Select an empty directory to decompress hadoop-2.7.7.tar.gz

        

        3. Add hadoop-2.7.7 to the environment variables

Variable name: HADOOP_HOME

Variable value: E:\xx\xx\xx\hadoop-2.7.7 (see the picture below first and then copy)

Add to Path: %HADOOP_HOME%\bin

Add to Path: %HADOOP_HOME%\sbin (see the picture below first and then copy it)

        4. Use the command line to check whether the environment variables are configured successfully:

hadoop version
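
        If the environment variables are configured correctly, the command should report the version; the first line of output looks something like this:

Hadoop 2.7.7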

        5. Install the JDK (JDK 8, any release)

        Extract it to a directory and add the environment variables (similar to the Hadoop configuration above):

Variable name: JAVA_HOME

Variable value: E:\ProgramSoftware\java\JAVAHOME\jdk1.8.0_162

Variable value: %JAVA_HOME%\bin

Variable value: %JAVA_HOME%\jre\bin

        Use java -version and javac to verify. (Note that both the bin and \jre\bin entries above must be configured; otherwise Hadoop will complain that it cannot find JAVA_HOME.)
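
        With the JDK above, the first line printed by java -version should look something like this:

java version "1.8.0_162"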

        6. Place winutils.exe into the hadoop-2.7.7\bin\ directory. (Download winutils from GitHub; choose a version equal to or higher than your own Hadoop version.)

        7. Copy winutils.exe and hadoop-2.7.7\bin\hadoop.dll into C:\Windows\System32.

        8. Open an empty directory with IDEA.

        9. Add Maven: click Add Framework Support

        Select Maven

        After Maven is added successfully, the main and test directories will appear
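
        The standard Maven layout that appears looks roughly like this (the exact folders IDEA creates may differ slightly):

├── pom.xml
└── src
    ├── main
    │   ├── java
    │   └── resources
    └── test
        └── java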

        10. Configure Maven: copy core-site.xml and hdfs-site.xml from the cluster's hadoop-2.7.7/etc/hadoop/ directory into the project's resources directory (a minimal core-site.xml example follows the log4j snippet below). You can also control the console log output level through log4j.properties; other log-level strategies can be found online:

log4j.rootLogger=debug,stdout,R 
log4j.appender.stdout=org.apache.log4j.ConsoleAppender 
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout 
log4j.appender.stdout.layout.ConversionPattern=%5p - %m%n 
log4j.appender.R=org.apache.log4j.RollingFileAppender 
log4j.appender.R.File=mapreduce_test.log 
log4j.appender.R.MaxFileSize=1MB 
log4j.appender.R.MaxBackupIndex=1 
log4j.appender.R.layout=org.apache.log4j.PatternLayout 
log4j.appender.R.layout.ConversionPattern=%p %t %c - %m%n 
log4j.logger.com.codefutures=DEBUG
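
        For reference, a minimal core-site.xml in resources only needs to point at the NameNode; the IP and port below are placeholders and must match your own cluster (hdfs-site.xml can simply be copied from the cluster as-is):

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://192.168.xx.101:9000</value>
    </property>
</configuration>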

       

        11. Configure pom.xml

        Initial state

        Add the following to pom.xml. Once it is added, IDEA will start aggressively downloading the required dependencies; after the download finishes, the entries that were shown in red in pom.xml will turn blue. (Note: the hadoop.version property must match your own Hadoop version.)

<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <hadoop.version>2.7.7</hadoop.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>${hadoop.version}</version>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>${hadoop.version}</version>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>${hadoop.version}</version>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>commons-cli</groupId>
        <artifactId>commons-cli</artifactId>
        <version>1.3.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
</dependencies>

Test:

        With the operations above, the connection from IDEA to the Hadoop cluster is basically in place. Now let's test it.

        1. Create a Java class under the java directory:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.log4j.BasicConfigurator;

import java.io.IOException;


public class HdfsTest {

    public static void main(String[] args) {
        // Automatically and quickly use the default Log4j configuration.
        BasicConfigurator.configure();
        try {
            // Change to your own IP and the path of the corresponding file
            String filename = "hdfs://192.168.47.131:9000/words.txt";
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Check the console: "the file is exist" is printed if the file exists,
            // otherwise "the file is not exist" is printed.
            if (fs.exists(new Path(filename))) {
                System.out.println("the file is exist");
            } else {
                System.out.println("the file is not exist");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}

        At this point, IDEA will most likely complain that no JDK is configured for the project; set the project SDK to your local JDK as shown below.

        2. After the configuration succeeds, run the program and check in the console whether words.txt exists. The file is present in my directory, so the console prints the file is exist.

        3. Implement a word frequency statistics program

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.log4j.BasicConfigurator;

/**
 * Word statistics MapReduce
 */
public class WordCount {

    /**
     *Mapper class
     */
    public static class WordCountMapper extends MapReduceBase implements Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        /**
         * The map method completes its work by reading the file
         * Use each word in the file as a key and set the value to 1.
         * Then set this key-value pair as the output of map, that is, the input of reduce
         */
        @Override
        public void map(Object key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            /**
             * StringTokenizer: String delimited parsing type
             * I have never found such a useful tool before
             * java.util.StringTokenizer
             * 1. StringTokenizer(String str) :
             * Construct a StringTokenizer object used to parse str.
             * Java's default delimiters are "space", "tab ('\t')", "line feed ('\n')", and "carriage return ('\r')".
             * 2. StringTokenizer(String str, String delim) :
             * Construct a StringTokenizer object used to parse str and provide a specified delimiter.
             * 3. StringTokenizer(String str, String delim, boolean returnDelims) :
             * Construct a StringTokenizer object used to parse str, provide a specified delimiter, and specify whether to return the delimiter.
             *
             * By default, the default delimiters in Java are "space", "tab ('\t')", "line feed ('\n')", and "carriage return ('\r')".
             */
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                output.collect(word, one);
            }
        }
    }

    /**
     * The input of reduce is the output of map; the values of words with the same key are accumulated,
     * which gives the count of each word. Finally the word is used as the key and its count as the value,
     * and the pair is written to the configured output file.
     */
    public static class WordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            result.set(sum);
            output.collect(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        //Quickly use log4j logging function
        BasicConfigurator.configure();
        //Data input path. The path here needs to be replaced with your own Hadoop address
        String input = "hdfs://192.168.139.100:9000/test/input/word.txt";
        /**
         * The output path is set to the out folder in the root directory of HDFS
         * Note: This folder should not exist, otherwise an error will occur
         */
        String output = "hdfs://192.168.139.100:9000/test/output1";

        JobConf conf = new JobConf(WordCount.class);
        //Set who submits
        conf.setUser("master");
        /**
         * Because the map-reduce job uses our own custom map and reduce classes,
         * the code has to be available as a jar package;
         * setJar points at the word-count jar that ships with the local Hadoop.
         */
        conf.setJar("E:\\ProgramSoftware\\java\\hadoop2.7.7\\hadoop-2.7.7\\share\\hadoop\\mapreduce\\hadoop-mapreduce-examples-2.7.7.jar");
        //Set job name
        conf.setJobName("wordcount");
        /**
         * Declare cross-platform submission of jobs
         */
        conf.set("mapreduce.app-submission.cross-platform","true");
        //very important statement
        conf.setJarByClass(WordCount.class);
        //Corresponding word string
        conf.setOutputKeyClass(Text.class);
        //The statistical number of corresponding words int type
        conf.setOutputValueClass(IntWritable.class);
        //Set mapper class
        conf.setMapperClass(WordCountMapper.class);
        /**
         * Set the merging function, and the output of the merging function is used as the input of the Reducer.
         * Improve performance and effectively reduce the amount of data transmission between map and reduce.
         * But the merge function cannot be abused. Need to be combined with specific business.
         * Since this application counts the number of words, using the merge function will not affect the results or the
         * Business logic results have an impact.
         * When it affects the result, the merge function cannot be used.
         * For example: when we count the business logic of the average word occurrence, we cannot use merge
         * function. If used at this time, it will affect the final result.
         */
        conf.setCombinerClass(WordCountReducer.class);
        //Set the reduce class
        conf.setReducerClass(WordCountReducer.class);
        /**
         * Set the input format; TextInputFormat is the default input format,
         * so this line could actually be omitted.
         * The key type it generates is LongWritable (the byte offset at which each line starts in the file)
         * and its value type is Text (the line of text).
         */
        conf.setInputFormat(TextInputFormat.class);
        /**
         * Set the output format; TextOutputFormat is the default output format.
         * Each record is written as a text line. Its key and value can be of any type;
         * TextOutputFormat calls toString() on them and writes the resulting strings as text. By default, keys and values are tab-delimited.
         */
        conf.setOutputFormat(TextOutputFormat.class);
        //Set the input data file path
        FileInputFormat.setInputPaths(conf, new Path(input));
        //Set the output data file path (this path cannot exist, otherwise it will be abnormal)
        FileOutputFormat.setOutputPath(conf, new Path(output));
        //Start mapreduce
        JobClient.runJob(conf);
        System.exit(0);
    }

}

        Finally, the result can be viewed on the master node in the Linux virtual machine:

 hdfs dfs -ls /test/output1/*
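
        To see the word counts themselves rather than just the file listing, you can also cat the result file. With the default single reducer, the old mapred API writes a file named part-00000, and each line holds a word, a tab, and its count (the command below assumes that default name):

 hdfs dfs -cat /test/output1/part-00000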

Conclusion:

        At this point, the configuration for connecting IDEA to the Hadoop cluster is complete. More operations can be implemented through the Configuration, FileSystem, FSDataInputStream, and FSDataOutputStream classes provided by Hadoop.
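
        As a minimal sketch of what those classes allow (assuming core-site.xml and hdfs-site.xml are still on the classpath as configured above; the file path is a placeholder), the following writes a small file to HDFS and reads it back:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWriteTest {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Placeholder path; change it to a file you are allowed to create on your cluster.
        Path file = new Path("/demo.txt");

        // Write a small text file to HDFS (overwrite if it already exists).
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs\n".getBytes("UTF-8"));
        }

        // Read it back and copy the bytes to the console.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        fs.close();
    }
}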

        If you run into problems, feel free to leave a message, and I will reply to the ones I can solve as soon as possible. Some of the problems I encountered during installation are listed below for everyone to browse.

Problems:

        1. After placing the Hadoop files in the Windows environment and adding the Hadoop environment variables as required, the following appeared:

JAVA_HOME is incorrectly set.Please update C:\hadoop-3.1.2\etc\hadoop\hadoop-env.cmd

On the first attempt, the Hadoop archive was uploaded from Linux to Windows, and the above error occurred. After consulting some material, I found that JAVA_HOME was indeed still set to the Linux path in the hadoop-env.cmd file. I tried to modify it with Notepad++; there are two main ways to modify it:

        (1)set JAVA_HOME=${JAVA_HOME}

        (2)set JAVA_HOME=xxxx\jdk1.8.0_162

After the modification the problem persisted, so I started checking whether the JDK configuration was at fault. Running java -version on the command line worked normally, but javac produced no response. I ignored this at first and went back to checking whether the Hadoop configuration had a problem, carefully comparing the configuration files against the environment variables and confirming they were correct. That pointed the finger back at javac, so I looked into why javac would not run and found the correct way to configure the JDK.

My Windows JDK configuration was missing the second Path entry (%JAVA_HOME%\jre\bin); after adding it, I ran hadoop version again and it succeeded.

        2. After following an Ubuntu server IP configuration tutorial, I found that www.baidu.com could not be pinged. The IP address, gateway, and subnet mask have to be configured consistently with the virtual machine's network settings. On inspection, the gateway given in the tutorial ended in 1, while the gateway in the virtual machine ends in 2. After changing it, the Internet could be accessed normally.

        3. When configuring the five main Hadoop cluster files, I found that most tutorials do not modify the hadoop-env.sh file in the etc/hadoop directory. If you keep the original export JAVA_HOME=${JAVA_HOME}, the JDK will not be found; you need to set it to the local JDK path.

        4. When configuring pom.xml, I found that the jdk.tools dependency would not compile properly because of version issues. The solution is to change its systemPath to point at the local JAVA_HOME (a bit crude, but it works).

        5. If you still cannot connect after completing everything, the most likely cause is that host aliases (master, s1) are used in the local hdfs-site.xml and core-site.xml; they should be replaced with the IP of the master host.

core-site.xml

hdfs-site.xml
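
In other words, the change in both files is along these lines (the IP is a placeholder for your master's address):

<value>hdfs://master:9000</value>   →   <value>hdfs://192.168.xx.101:9000</value>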


Origin blog.csdn.net/2201_75875170/article/details/133821093