Remotely debugging Spark programs and reading HBase on Hadoop from IDEA on Windows
Environment:
Win7
JDK 1.8
Hadoop 2.7.3 + winutils.exe tool
IntelliJ IDEA 2017.3 x64
IDEA 2017.3 Scala support package
Spark 2.1.1
Scala 2.11.4
Step 0 Configure system environment variables
0.1 JDK 1.8 and Scala 2.11.4 are configured in the usual way; no need to go into details.
0.2 Hadoop configuration under Windows (2.7.3 here):
Copy the hadoop-2.7.3 installation directory from the cluster into the root of any drive letter on Windows.
Download the winutils.exe tool for Hadoop 2.7.3. Link: https://pan.baidu.com/s/1pKWAGe3 password: zyi7
Replace the original Hadoop bin directory with the downloaded bin.
Add hadoop-2.7.3 to the system environment variables; hadoop-2.7.3/bin need not be configured separately.
Step 1 Configure IDEA
1.1 Download and install it (https://www.jetbrains.com/idea/)
After installing, do not start it yet; wait for the crack.
Crack (if you can afford it, please support the genuine product; it is, after all, an excellent tool):
Download the crack package. Link: https://pan.baidu.com/s/1eRSjwJ4 password: mo6d
Copy the crack package directly into the bin directory of the installation directory.
1.2 Configure the IDEA environment
Download the Scala support package for IDEA 2017.3.
Address: https://pan.baidu.com/s/1mixLiPU password: dbzu
Install the IDEA 2017.3 Scala support package (required).
Step 2 Development
2.1 Create a project (a Maven project, which is convenient for development; create one that supports both Java and Scala).
Choose the maven-archetype-quickstart archetype.
Ignore the content of the groupId for now, just get started; delete the -SNAPSHOT suffix from the version.
2.2 Modify the pom.xml file and add the dependencies for the frameworks used.
Add the following:
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<scala.version>2.11.4</scala.version>
<hbase.version>1.2.5</hbase.version>
<spark.version>2.1.1</spark.version>
<hadoop.version>2.7.3</hadoop.version>
</properties>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<!-- spark -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- hadoop -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
<version>${hadoop.version}</version>
</dependency>
<!-- hbase -->
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>${hbase.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-server</artifactId>
<version>${hbase.version}</version>
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main/java</sourceDirectory>
<testSourceDirectory>src/test/java</testSourceDirectory>
<plugins>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
<archive>
<manifest>
<mainClass></mainClass>
</manifest>
</archive>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.codehaus.mojo</groupId>
<artifactId>exec-maven-plugin</artifactId>
<version>1.3.1</version>
<executions>
<execution>
<goals>
<goal>exec</goal>
</goals>
</execution>
</executions>
<configuration>
<executable>java</executable>
<includeProjectDependencies>false</includeProjectDependencies>
<classpathScope>compile</classpathScope>
<mainClass>com.dt.spark.SparkApps.App</mainClass>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
</plugins>
</build>
2.3 Add the cluster configuration files to our project:
copy the Hadoop and HBase configuration files into the resources folder.
Step 3 Write the code and implement the cases
Case 1: Implement a word-count example in Java
Prepare the test file:
After editing words.txt, I put it on HDFS.
View the contents of words.txt:
The red box is our words.txt; as you can see, the words are separated by spaces.
The JavaWordCount code is as follows:
package com.shanshu.demo;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Pattern;
public class JavaWordCount {
private static final Pattern SPACE = Pattern.compile(" ");
public static void main(String[] args) throws Exception {
/*if (args.length < 1) {
System.err.println("Usage: JavaWordCount <file>");
System.exit(1);
}*/
System.setProperty("hadoop.home.dir","E:\\hadoop-2.7.3");
SparkSession spark = SparkSession
.builder().master("spark://192.168.10.84:7077")
.appName("JavaWordCount")
.getOrCreate();
spark.sparkContext()
.addJar("E:\\myIDEA\\sparkDemo\\out\\artifacts\\sparkDemo_jar\\sparkDemo.jar");
JavaRDD<String> lines = spark.read().textFile("hdfs://192.168.10.82:8020/user/jzz/word/words.txt").javaRDD();
JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
@Override
public Iterator<String> call(String s) {
return Arrays.asList(SPACE.split(s)).iterator();
}
});
JavaPairRDD<String, Integer> ones = words.mapToPair(
new PairFunction<String, String, Integer>() {
@Override
public Tuple2<String, Integer> call(String s) {
return new Tuple2<String, Integer>(s, 1);
}
});
JavaPairRDD<String, Integer> counts = ones.reduceByKey(
new Function2<Integer, Integer, Integer>() {
@Override
public Integer call(Integer i1, Integer i2) {
return i1 + i2;
}
});
List<Tuple2<String, Integer>> output = counts.collect();
for (Tuple2<?,?> tuple : output) {
System.out.println(tuple._1() + ": " + tuple._2());
}
spark.stop();
}
}
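The flatMap → mapToPair → reduceByKey pipeline above can be sanity-checked without a cluster. A minimal plain-Java sketch of the same counting logic (WordCountSketch is a hypothetical helper for illustration, not part of the Spark example):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class WordCountSketch {
    private static final Pattern SPACE = Pattern.compile(" ");

    // Same logic as flatMap + mapToPair + reduceByKey, on an in-memory "file"
    public static Map<String, Integer> count(Iterable<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            for (String word : SPACE.split(line)) {   // flatMap: line -> words
                counts.merge(word, 1, Integer::sum);  // mapToPair + reduceByKey
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> out = count(Arrays.asList("hello spark", "hello hbase"));
        out.forEach((w, n) -> System.out.println(w + ": " + n));
    }
}
```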
Please note: when @Override is used in the code, you need to raise the Java language level, otherwise an error will be reported.
Modify the Java version of the Project.
Modify the Java version of the Module.
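For reference, the @Override annotations that trigger the error are the ones on anonymous-class methods that implement an interface (FlatMapFunction, PairFunction, Function2 in the code above); that form is only accepted at language level 6 or higher, so set both the Project and the Module to at least 1.6 (1.8 matches the JDK used here). A minimal illustration (OverrideDemo is a made-up example class, not part of the project):

```java
public class OverrideDemo {
    public static String check() {
        // @Override on an interface-method implementation: rejected at language
        // level 5, accepted from level 6 on -- the same situation as the
        // anonymous Spark functions in JavaWordCount.
        Runnable holder = new Runnable() {
            @Override
            public void run() { }
        };
        holder.run();
        return "compiles at language level >= 1.6";
    }

    public static void main(String[] args) {
        System.out.println(check());
    }
}
```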
Note: the code is run locally, and the packaged jar is added inside the code (the addJar call).
The jar packaging steps are as follows:
i Add the path
Note: sometimes the generated jar needs to be copied to the cluster to run; to keep the compiled jar from becoming too large, delete these dependency jars.
ii Compile
iii The compiled result
iv Copy the directory of the jar package on disk
Write this path into the code (this must be done, otherwise it cannot run locally), as follows:
v Execute the code; after success you will see the following results.
Case 2: Read HBase data with Scala code
Preparation: create a table fruit in HBase with column family info, and insert the data (already done).
View it:
i Copy the HBase configuration files into IDEA's resources directory
ii Add the Scala jar
iii Create a scala folder under the main directory and mark scala as a source root
iv Copy the META-INF directory from the java directory into scala, delete the MANIFEST.MF file, and create a package
v Write the Scala code
The code is as follows:
package com.shanshu.scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}
object ReadHbase {
def main(args: Array[String]): Unit = {
val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.property.clientPort","2181")
conf.set("hbase.zookeeper.quorum","192.168.10.82")
val sparkConf = new SparkConf().setMaster("local[3]").setAppName("readHbase")
val sc = new SparkContext(sparkConf)
// set the name of the table to query
conf.set(TableInputFormat.INPUT_TABLE, "fruit")
val stuRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
classOf[org.apache.hadoop.hbase.client.Result])
// iterate over the results and print them
stuRDD.foreach({ case (_,result) =>
val key = Bytes.toString(result.getRow)
val name = Bytes.toString(result.getValue("info".getBytes,"name".getBytes))
val color = Bytes.toString(result.getValue("info".getBytes,"color".getBytes))
val num = Bytes.toString(result.getValue("info".getBytes,"num".getBytes))
val people = Bytes.toString(result.getValue("info".getBytes,"people".getBytes))
println("Row key:"+key+" Name:"+name+" color:"+color+" num:"+num+" people:"+people)
})
sc.stop()
}
}
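The Bytes.toString / getBytes pairs above are, in essence, UTF-8 conversions between Java strings and the byte[] values HBase stores in its cells. A plain-JDK sketch of that round trip (no HBase dependency; shown in Java rather than Scala so it is self-contained, and BytesRoundTrip is a made-up name):

```java
import java.nio.charset.StandardCharsets;

public class BytesRoundTrip {
    // Roughly what HBase's Bytes utility does: UTF-8 encode/decode
    public static byte[] toBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static String toString(byte[] b) {
        return b == null ? null : new String(b, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] cell = toBytes("apple");      // e.g. a value in the fruit table's info:name cell
        System.out.println(toString(cell));  // decodes back to the original string
    }
}
```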
vi Similarly, build the jar package (not necessary; only needed when executing on the cluster), and
delete the original jar, because this time we have to choose the Scala main class.
vii The results of the run are as follows: