
Students learning Spark Streaming often lack a hands-on project to practice on. This post uses a small practical project to tie together Spark Streaming, HBase, and Kafka.

1. Project Overview

1.1 Project Flow

Spark Streaming reads JSON-formatted data from an upstream Kafka topic. Within each batch it cleans and filters the data, then reads supplementary data from HBase, splices the two together into a new JSON string, and writes it to a downstream Kafka topic.
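Schematically (the topic and table names are the ones created in sections 2.2 and 2.3 below):

```
kafka_streaming_topic --> Spark Streaming --> hello_topic
      (JSON in)          clean / filter /     (JSON out)
                         HBase enrichment
                         (student table)
```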

1.2 Project Details

2. Preparing the Environment

2.1 Component Installation

First, install the required big data components. The versions used are as follows:

Spark 2.1.2

Kafka 0.10.0.1

HBase 1.2.0

ZooKeeper 3.4.5

2.2 HBase Table Creation

In HBase, create a table named student with a column family named cf, and insert two rows of sample data.
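For example, in the hbase shell (the rowkeys, qualifier name, and values here are only assumptions for illustration):

```shell
create 'student', 'cf'
put 'student', '1', 'cf:info', '{"id":"1","name":"Tom"}'
put 'student', '2', 'cf:info', '{"id":"2","name":"Jerry"}'
scan 'student'
```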

2.3 Kafka Topic Creation

Create two Kafka topics: kafka_streaming_topic (input) and hello_topic (output).
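With Kafka 0.10 this is done through ZooKeeper (the partition and replication counts are assumptions):

```shell
bin/kafka-topics.sh --create --zookeeper localhost:2181 \
  --replication-factor 1 --partitions 1 --topic kafka_streaming_topic
bin/kafka-topics.sh --create --zookeeper localhost:2181 \
  --replication-factor 1 --partitions 1 --topic hello_topic
```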

3. Code

3.1 Project Structure

 

A brief explanation of each part:

Output, Score, Output are three Java Beans

MsgHandler performs operations on the data stream: validating the JSON format, checking mandatory fields, filtering records by score >= 60, converting Beans to JSON, merging Beans, and so on (see the sketch after this list)

ConfigManager reads the configuration parameters

conf.properties holds the configuration information

StreamingDemo is the program's main entry point

HBaseUtils is the HBase utility class

StreamingDemoTest is the test class
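A minimal Scala sketch of what MsgHandler and HBaseUtils might look like. The original project may well be in Java, and the JSON library (fastjson), method names, and the cf:info qualifier are all assumptions, not the post's actual code:

```scala
import scala.collection.JavaConverters._
import com.alibaba.fastjson.JSON
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

object MsgHandler {
  // Validate the JSON format, check mandatory fields, and apply the score >= 60 filter.
  def isValid(msg: String, requiredFields: Seq[String]): Boolean =
    try {
      val obj = JSON.parseObject(msg)
      requiredFields.forall(obj.containsKey) && obj.getIntValue("score") >= 60
    } catch { case _: Exception => false }  // drop anything that is not valid JSON
}

object HBaseUtils {
  private val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())

  // Batch-get rows by id and return a map of (id -> studentJsonStr); the "info"
  // qualifier is an assumption about how the student JSON is stored.
  def batchGet(tableName: String, family: String, ids: Seq[String]): Map[String, String] = {
    val table = conn.getTable(TableName.valueOf(tableName))
    try {
      val gets = ids.map(id => new Get(Bytes.toBytes(id))).asJava
      table.get(gets).flatMap { r =>
        Option(r.getValue(Bytes.toBytes(family), Bytes.toBytes("info")))
          .map(v => Bytes.toString(r.getRow) -> Bytes.toString(v))
      }.toMap
    } finally table.close()
  }
}
```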

3.2 The Main Function

Initialize Spark, read the configuration information, and then read the Kafka data with KafkaUtils.createDirectStream.
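A minimal Scala sketch of this setup, using the spark-streaming-kafka-0-10 integration that matches Spark 2.1.2 and Kafka 0.10.0.1; the app name, broker address, group id, and batch interval are assumptions:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("StreamingDemo")
val ssc  = new StreamingContext(conf, Seconds(10))  // batch interval is an assumption

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "streaming_demo_group",
  "auto.offset.reset"  -> "latest"
)

// Direct stream over the input topic created in section 2.3.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("kafka_streaming_topic"), kafkaParams)
)
```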

 

Next, the following operations are completed in sequence (a sketch of the batch logic follows this list):

Clean and filter the data, returning an RDD of (id, ScoreBean) pairs

Collect the ids into a list, batch-query HBase with it, and build the results into a map resMap of (id, studentJsonStr) pairs, so that subsequent lookups are O(1)

Traverse each record, look up its match in resMap, and merge the two into a new Java Bean

Convert the Java Bean to a JSON string and write it to Kafka
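A sketch of the per-batch logic under the same assumptions; MsgHandler and HBaseUtils are the hypothetical versions sketched in section 3.1, and merging via fastjson's putAll stands in for whatever Bean-merging the project actually does:

```scala
import java.util.Properties
import com.alibaba.fastjson.JSON
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

stream.map(_.value())
  .filter(msg => MsgHandler.isValid(msg, Seq("id", "score")))      // clean and filter
  .map { msg => val obj = JSON.parseObject(msg); (obj.getString("id"), obj) }
  .foreachRDD { rdd =>
    rdd.foreachPartition { part =>
      val records = part.toList
      // One batched HBase read per partition, then O(1) lookups from resMap.
      val resMap = HBaseUtils.batchGet("student", "cf", records.map(_._1))
      val props = new Properties()
      props.put("bootstrap.servers", "localhost:9092")
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)
      records.foreach { case (id, scoreObj) =>
        resMap.get(id).foreach { studentJsonStr =>
          val merged = JSON.parseObject(studentJsonStr)
          merged.putAll(scoreObj)  // splice the score fields into the student record
          producer.send(new ProducerRecord[String, String]("hello_topic", id, merged.toJSONString))
        }
      }
      producer.close()
    }
  }

ssc.start()
ssc.awaitTermination()
```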

 

4. Results

Open a Kafka producer shell and write data to kafka_streaming_topic.

Open a Kafka consumer shell and consume hello_topic to observe the output.
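For example, assuming a broker on localhost:9092:

```shell
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic kafka_streaming_topic
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic hello_topic
```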

5. Summary

Through this small project, I hope you can master the basic Spark Streaming stream-processing operations: reading from and writing to Kafka, querying HBase, and Spark Streaming DStream operations. Space is limited, so the full code is not listed here piece by piece; the complete code is available at the origin link below.


Origin www.cnblogs.com/spark88/p/11225820.html