When learning Spark Streaming, students often lack hands-on project experience. This post uses a small practical project to tie together what you have learned about Spark Streaming, HBase, and Kafka.
1. Project Overview
1.1 Project Flow
Spark Streaming reads JSON-formatted data from an upstream Kafka topic. Within each batch, it cleans and filters the data, enriches the records with supplementary data read from HBase, splices the result into a new JSON string, and writes it to a downstream Kafka topic.
1.2 Project Details
2. Preparing the Environment
2.1 Component Installation
First, install the required big data components. The versions used are:
Spark 2.1.2
kafka 0.10.0.1
HBase 1.2.0
Zookeeper 3.4.5
2.2 HBase Table Creation
Create an HBase table named student with a column family named cf, and insert two rows of data.
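The table and column family names follow the text; the two sample rows below are hypothetical, since the original does not show the actual data. A minimal hbase shell session might look like:

```shell
# Create the student table with column family cf, then insert two sample rows.
# Row keys and column values here are illustrative placeholders.
hbase shell <<'EOF'
create 'student', 'cf'
put 'student', '001', 'cf:name', 'Tom'
put 'student', '002', 'cf:name', 'Jerry'
scan 'student'
EOF
```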
2.3 Kafka Topic Creation
Create two Kafka topics: kafka_streaming_topic and hello_topic.
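For Kafka 0.10.x, topics are created through ZooKeeper. The ZooKeeper address and the partition/replication settings below are placeholder assumptions; adjust them to your cluster:

```shell
# Kafka 0.10.x uses --zookeeper (newer releases use --bootstrap-server instead).
kafka-topics.sh --create --zookeeper localhost:2181 \
  --replication-factor 1 --partitions 1 --topic kafka_streaming_topic
kafka-topics.sh --create --zookeeper localhost:2181 \
  --replication-factor 1 --partitions 1 --topic hello_topic
```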
3. Code
3.1 Project Structure
A brief explanation:
Output and Score are Java Beans that model the records
MsgHandler performs the operations on the data stream: validating the JSON format, checking mandatory fields, filtering on score >= 60, converting Beans to JSON, merging Beans, and so on
ConfigManager reads the configuration parameters
conf.properties holds the configuration information
StreamingDemo is the program's main class
HBaseUtils is an HBase utility class
StreamingDemoTest is the test class
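The MsgHandler checks can be sketched without Spark. The sketch below assumes each message is a flat JSON string such as {"id":"001","score":82}; the real project would use a JSON library for validation, but plain regexes keep the example dependency-free, and the method name and field layout are illustrative:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MsgHandler {
    // Mandatory fields assumed for this sketch: "id" and a numeric "score".
    private static final Pattern ID = Pattern.compile("\"id\"\\s*:\\s*\"?(\\w+)\"?");
    private static final Pattern SCORE = Pattern.compile("\"score\"\\s*:\\s*(\\d+)");

    /** Returns "id,score" for valid records with score >= 60, empty otherwise. */
    public static Optional<String> parse(String msg) {
        Matcher id = ID.matcher(msg);
        Matcher score = SCORE.matcher(msg);
        if (!id.find() || !score.find()) return Optional.empty(); // mandatory-field check
        int s = Integer.parseInt(score.group(1));
        if (s < 60) return Optional.empty();                      // score filter
        return Optional.of(id.group(1) + "," + s);
    }

    public static void main(String[] args) {
        System.out.println(parse("{\"id\":\"001\",\"score\":82}")); // Optional[001,82]
        System.out.println(parse("{\"id\":\"002\",\"score\":40}")); // Optional.empty
        System.out.println(parse("not json"));                      // Optional.empty
    }
}
```

In the actual job this logic would run inside a DStream transformation, dropping malformed or filtered-out records before the HBase lookup.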
3.2 The Main Function
Initialize Spark, read the configuration information, and read the data from Kafka with KafkaUtils.createDirectStream.
Next, the job completes the following steps:
Clean and filter the data, returning an RDD of (id, ScoreBean) pairs
Collect the ids into a list, query HBase in bulk, and build the results into a map resMap of (id, studentJsonStr) pairs, enabling O(1) lookups afterwards
Iterate over each record, look up its result in resMap, and merge them into a new Java Bean
Convert the Java Bean to a JSON string and write it to Kafka
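The lookup-and-merge steps above can be sketched in plain Java. Here resMap is assumed to already hold the bulk HBase results as id -> student JSON fragment, so each scored record is merged with an O(1) map lookup; the field layout of the output record is illustrative, not taken from the original code:

```java
import java.util.HashMap;
import java.util.Map;

public class MergeDemo {
    /** Merge one (id, score) pair with the student info fetched from HBase. */
    static String merge(String id, int score, Map<String, String> resMap) {
        // Records whose id is missing from HBase get an empty student object.
        String student = resMap.getOrDefault(id, "{}");
        // Splice the two JSON fragments into the record sent downstream.
        return "{\"id\":\"" + id + "\",\"score\":" + score
             + ",\"student\":" + student + "}";
    }

    public static void main(String[] args) {
        // Stand-in for the bulk HBase query result.
        Map<String, String> resMap = new HashMap<>();
        resMap.put("001", "{\"name\":\"Tom\",\"class\":\"A\"}");

        System.out.println(merge("001", 82, resMap));
        // {"id":"001","score":82,"student":{"name":"Tom","class":"A"}}
    }
}
```

Building resMap once per batch (rather than querying HBase per record) is what keeps the per-record cost at O(1).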
4. Results
Start a Kafka producer shell and write data to kafka_streaming_topic.
Start a Kafka consumer shell and consume hello_topic.
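The two shells above can be started with the console tools that ship with Kafka 0.10.x; the broker and ZooKeeper addresses below are placeholders for your cluster:

```shell
# Terminal 1: write test records to the input topic.
kafka-console-producer.sh --broker-list localhost:9092 --topic kafka_streaming_topic

# Terminal 2: watch the enriched records arriving on the output topic.
kafka-console-consumer.sh --zookeeper localhost:2181 --topic hello_topic
```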
5. Summary
Through this small project, I hope you can master the basic Spark Streaming stream-processing operations: reading from and writing to Kafka, querying HBase, and operating on Spark Streaming DStreams. Space is limited, so not all of the code is listed here; the complete code is available at