Using Kafka + Spark Streaming + Cassandra to Build a Real-Time Data Processing Pipeline

Apache Kafka is a scalable, high-performance, low-latency platform that allows us to read and write streams of data like a messaging system. Kafka is easy to use from Java.

Spark Streaming is part of Apache Spark; it is a scalable, high-throughput, fault-tolerant real-time stream processing engine. Although it is written in Scala, it also provides a Java API.

Apache Cassandra is a distributed NoSQL database.

In this article, we will describe how to build a highly scalable, fault-tolerant real-time data processing pipeline from these three components.

Preparation

Before walking through the steps below, we need to create the Kafka topic and the corresponding Cassandra table, as follows:

Create a topic named messages in Kafka:

$KAFKA_HOME$\bin\windows\kafka-topics.bat --create \
 --zookeeper localhost:2181 \
 --replication-factor 1 --partitions 1 \
 --topic messages

Create a keyspace and table in Cassandra:

CREATE KEYSPACE vocabulary
    WITH REPLICATION = {
        'class' : 'SimpleStrategy',
        'replication_factor' : 1
    };
USE vocabulary;
CREATE TABLE words (word text PRIMARY KEY, count int);

Above, we created a keyspace named vocabulary and a table named words.

Adding dependencies

We use Maven for dependency management; the dependencies used by this project are as follows:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.3.0</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.3.0</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.3.0</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>2.3.0</version>
</dependency>
<dependency>
    <groupId>com.datastax.spark</groupId>
    <artifactId>spark-cassandra-connector_2.11</artifactId>
    <version>2.3.0</version>
</dependency>
<dependency>
    <groupId>com.datastax.spark</groupId>
    <artifactId>spark-cassandra-connector-java_2.11</artifactId>
    <version>1.5.2</version>
</dependency>

Data Pipeline Development

We will create a simple Java application with Spark that integrates with the Kafka topic we created earlier. The application reads the published messages and counts the frequency of each word in them; the results are then written to the Cassandra table. The overall data flow is: messages are published to Kafka, consumed and processed by Spark Streaming, and the resulting word counts are saved to Cassandra.

Now let's look at how the code is implemented, step by step.

Get JavaStreamingContext

The entry point of Spark Streaming is JavaStreamingContext, so we first need to obtain that object, as follows:

SparkConf sparkConf = new SparkConf();
sparkConf.setAppName("WordCountingApp");
sparkConf.set("spark.cassandra.connection.host", "127.0.0.1");
 
JavaStreamingContext streamingContext = new JavaStreamingContext(
  sparkConf, Durations.seconds(1));

Reading data from Kafka

With the JavaStreamingContext in hand, we can read the real-time stream of the corresponding topic from Kafka, as follows:

Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", "localhost:9092");
kafkaParams.put("key.deserializer", StringDeserializer.class);
kafkaParams.put("value.deserializer", StringDeserializer.class);
kafkaParams.put("group.id", "use_a_separate_group_id_for_each_stream");
kafkaParams.put("auto.offset.reset", "latest");
kafkaParams.put("enable.auto.commit", false);
Collection<String> topics = Arrays.asList("messages");
 
JavaInputDStream<ConsumerRecord<String, String>> messages = 
  KafkaUtils.createDirectStream(
    streamingContext, 
    LocationStrategies.PreferConsistent(), 
    ConsumerStrategies.<String, String> Subscribe(topics, kafkaParams));

We provide key and value deserializers in the program; these are built into Kafka. We can also implement custom deserializers according to our own needs.
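As an illustration only, here is a minimal sketch of what a custom value deserializer could look like (the class name LowerCaseStringDeserializer is hypothetical and not part of the original project):

import java.nio.charset.StandardCharsets;
import java.util.Map;
import org.apache.kafka.common.serialization.Deserializer;

// Hypothetical example: decode bytes as UTF-8 and normalize the text to lower case.
public class LowerCaseStringDeserializer implements Deserializer<String> {

  @Override
  public void configure(Map<String, ?> configs, boolean isKey) {
    // no configuration needed for this simple example
  }

  @Override
  public String deserialize(String topic, byte[] data) {
    if (data == null) {
      return null;
    }
    return new String(data, StandardCharsets.UTF_8).toLowerCase();
  }

  @Override
  public void close() {
    // nothing to release
  }
}

Such a class would then be registered with kafkaParams.put("value.deserializer", LowerCaseStringDeserializer.class) instead of StringDeserializer.class.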

Processing DStream

So far we have only defined how the data is read from Kafka; now let's see how to process the data we obtain:

JavaPairDStream<String, String> results = messages
  .mapToPair( 
      record -> new Tuple2<>(record.key(), record.value())
  );
JavaDStream<String> lines = results
  .map(
      tuple2 -> tuple2._2()
  );
JavaDStream<String> words = lines
  .flatMap(
      x -> Arrays.asList(x.split("\\s+")).iterator()
  );
JavaPairDStream<String, Integer> wordCounts = words
  .mapToPair(
      s -> new Tuple2<>(s, 1)
  ).reduceByKey(
      (i1, i2) -> i1 + i2
    );
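During development it can be handy to peek at the computed counts. Spark Streaming's print() outputs the first elements of each batch to the driver's console:

// Print the first few (word, count) pairs of every batch for debugging.
wordCounts.print();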

Sending data to Cassandra

Finally, we need to send the results to Cassandra; the code is quite simple.

wordCounts.foreachRDD(
    javaRdd -> {
      Map<String, Integer> wordCountMap = javaRdd.collectAsMap();
      for (String key : wordCountMap.keySet()) {
        List<Word> wordList = Arrays.asList(new Word(key, wordCountMap.get(key)));
        JavaRDD<Word> rdd = streamingContext.sparkContext().parallelize(wordList);
        javaFunctions(rdd).writerBuilder(
          "vocabulary", "words", mapToRow(Word.class)).saveToCassandra();
      }
    }
  );
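Note that javaFunctions and mapToRow are static imports from com.datastax.spark.connector.japi.CassandraJavaUtil. The Word class used above is a simple serializable bean whose fields match the columns of the words table; a minimal sketch could look like the following (the actual class in the tutorial repository may differ slightly):

import java.io.Serializable;

// Bean mapping a row of vocabulary.words; field and accessor names
// match the column names, which is what mapToRow(Word.class) relies on.
public class Word implements Serializable {
  private String word;
  private Integer count;

  public Word() {
  }

  public Word(String word, Integer count) {
    this.word = word;
    this.count = count;
  }

  public String getWord() { return word; }
  public void setWord(String word) { this.word = word; }
  public Integer getCount() { return count; }
  public void setCount(Integer count) { this.count = count; }
}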

Start the application

Finally, we need to start the Spark Streaming application, as follows:

streamingContext.start();
streamingContext.awaitTermination();

Using Checkpoints

In real-time streaming applications, it is often useful to persist the state of each batch. In the previous example we could only calculate word frequencies for the current batch; what if we want to calculate cumulative word frequencies across batches? This is where checkpoints come in. The data flow is the same as before, except that Spark Streaming now also reads and updates checkpointed state between batches.

To enable Checkpoints, we need to make some changes, as follows:

streamingContext.checkpoint("./.checkpoint");

Here we write the checkpoint data to a local directory named .checkpoint. In a real project, however, it is best to use an HDFS directory.
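For example (the namenode address and path below are placeholders, not values from the original project):

// Hypothetical HDFS checkpoint location; replace host, port and directory with your own.
streamingContext.checkpoint("hdfs://namenode:8020/spark/checkpoints");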

Now we can calculate the cumulative frequency of words with the following code:

JavaMapWithStateDStream<String, Integer, Integer, Tuple2<String, Integer>> cumulativeWordCounts = wordCounts
  .mapWithState(
    StateSpec.function( 
        (word, one, state) -> {
          int sum = one.orElse(0) + (state.exists() ? state.get() : 0);
          Tuple2<String, Integer> output = new Tuple2<>(word, sum);
          state.update(sum);
          return output;
        }
      )
    );
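The cumulative counts can then be written to Cassandra with the same foreachRDD pattern used earlier; a minimal sketch, assuming the same Word bean:

cumulativeWordCounts.foreachRDD(javaRdd -> {
  // Collect this batch's (word, cumulative count) pairs on the driver.
  List<Tuple2<String, Integer>> wordCountList = javaRdd.collect();
  for (Tuple2<String, Integer> tuple : wordCountList) {
    List<Word> wordList = Arrays.asList(new Word(tuple._1(), tuple._2()));
    JavaRDD<Word> rdd = streamingContext.sparkContext().parallelize(wordList);
    javaFunctions(rdd).writerBuilder(
      "vocabulary", "words", mapToRow(Word.class)).saveToCassandra();
  }
});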

Deploying the application

Finally, we can use spark-submit to deploy our application, as follows:

$SPARK_HOME$\bin\spark-submit \
  --class com.baeldung.data.pipeline.WordCountingAppWithCheckpoint \
  --master local[2] \
  target\spark-streaming-app-0.0.1-SNAPSHOT-jar-with-dependencies.jar

Finally, we can query the corresponding table in Cassandra to see the generated data, for example with SELECT * FROM vocabulary.words; in cqlsh. The complete code can be found at https://github.com/eugenp/tutorials/tree/master/apache-spark

