Kafka in Detail: The Hottest Core Technology in Big Data Development

In the big data era, if you don't know Kafka you are really falling behind (for a quick introduction, see the article "How to Master the Full Range of Kafka Core Technologies"). According to statistics, one third of the Fortune 500 companies use Kafka, including all of the top 10 travel companies, 7 of the top 10 banks, 8 of the top 10 insurance companies, and 9 of the top 10 telecommunications companies.

 

LinkedIn, Microsoft, and Netflix process a trillion messages a day with Kafka. Kafka is mainly used for large real-time data streams, for collecting data for real-time analysis, or both. Kafka can provide durability for in-memory microservices, and it can also feed events to complex event processing (CEP) systems and IoT/IFTTT-style automation systems.

 

Why Kafka?

 

Kafka is used in real-time streaming data architectures to provide real-time analytics. Because Kafka is a fast, scalable, fault-tolerant, and highly durable publish-subscribe messaging system, it supports some use cases (large data volumes with demanding response times) much better than JMS, RabbitMQ, and AMQP. Compared with those tools, Kafka offers higher throughput, greater reliability, and replication. This makes it better suited than traditional message-oriented middleware (MOM) for tracking service calls (you can track every call) or tracking IoT sensor data.

 

 

Kafka can be used with Flume/Flafka, Spark Streaming, Storm, HBase, Flink, and Spark for real-time ingestion, analysis, and processing of data streams. Kafka can feed data streams into Hadoop big data lakes. Kafka brokers support massive message streams for low-latency follow-up analysis in Spark or Hadoop. In addition, the Kafka subproject Kafka Streaming (officially named Kafka Streams) can be used for real-time analysis.

 

What are Kafka's use cases?

 

In short, Kafka is used for stream processing, website activity tracking, metrics collection and monitoring, log aggregation, real-time analytics, CEP, ingesting data into Spark and Hadoop, CQRS, message replay, error recovery, and as a distributed commit log for in-memory computing (microservices).

 

Who uses Kafka?

 

Many large companies that need to process large amounts of data quickly use Kafka. Kafka was originally developed at LinkedIn, which uses it to track activity data and operational metrics. Twitter uses it as part of Storm to provide its stream processing infrastructure. Square uses Kafka as a bus to move all system events (logs, custom events, metrics, and so on) to its various data centers, to output to Splunk and Graphite (dashboards), and to implement an Esper-like CEP alerting system. Spotify, Uber, Tumblr, Goldman Sachs, PayPal, Box, Cisco, CloudFlare, Netflix, and other companies use it as well.

 

Why is Kafka so popular?

 

First, the main reason is Kafka's excellent performance. It is very stable, provides reliable persistence, offers a flexible publish-subscribe message queue that scales well to N consumer groups, has robust replication, gives producers tunable consistency guarantees, and preserves ordering at the shard level (that is, within a Kafka topic partition).
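
To make the ordering guarantee concrete, here is a minimal in-memory sketch of key-based partitioning. It is not Kafka client code: the partition count, the event data, and the use of CRC32 as the hash are all assumptions made for illustration. The point is only that records sharing a key land in the same partition, so their relative order is preserved there.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # assumed partition count for this toy topic


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition index. CRC32 is a stand-in hash for this sketch."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# partition index -> append-only list of (key, value) records
partitions = defaultdict(list)

# Hypothetical events, interleaved across two keys.
events = [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "view"),
    ("user-2", "logout"),
    ("user-1", "logout"),
]
for key, value in events:
    partitions[partition_for(key)].append((key, value))


def values_for(key: str) -> list:
    """Read one key's values back from its partition, in stored order."""
    return [v for k, v in partitions[partition_for(key)] if k == key]


print(values_for("user-1"))
print(values_for("user-2"))
```

Ordering across different partitions is not guaranteed by this scheme, which is why the guarantee is stated at the partition level.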

 

Second, Kafka works well with systems that process data streams, letting those systems aggregate, transform, and load data into other stores. In addition, Kafka is simple to operate (configure and use), and how it works is easy to understand. Of course, none of these advantages would matter if Kafka processed data slowly, so doing "more with less" is Kafka's biggest advantage.

 

Why is Kafka so fast?

 

Kafka relies heavily on the operating system kernel to move data quickly, following the zero-copy principle. Kafka lets you batch data records into chunks. These batches travel end to end, from the producer to the file system (the Kafka topic log) to the consumer. Batching enables more efficient data compression and reduces I/O latency. Kafka writes its immutable commit log to disk sequentially, which avoids random disk access and slow disk seeks. Kafka scales horizontally through partitioning: it shards a topic log into hundreds (possibly thousands) of partitions spread across thousands of servers. This is how Kafka can carry massive loads.
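
The compression benefit of batching can be seen with a small standard-library experiment. This is only an illustration of the general principle, not Kafka's actual record-batch format: the record fields are invented, and zlib stands in for whichever codec a real deployment would configure.

```python
import json
import zlib

# Hypothetical sensor records; the field names are made up for this sketch.
records = [
    json.dumps(
        {"sensor_id": i % 5, "metric": "temperature", "value": 20 + i % 3}
    ).encode("utf-8")
    for i in range(200)
]

# Compress each record on its own (no batching).
individual_bytes = sum(len(zlib.compress(r)) for r in records)

# Compress all records together as one newline-delimited batch.
batch = b"\n".join(records)
compressed_batch = zlib.compress(batch)
batch_bytes = len(compressed_batch)

# The batch round-trips losslessly back into the original records.
restored = zlib.decompress(compressed_batch).split(b"\n")

print(f"compressed one by one: {individual_bytes} bytes")
print(f"compressed as a batch: {batch_bytes} bytes")
```

Because similar records share structure, the compressor finds far more redundancy across a batch than inside any single small record, and one large write also costs fewer I/O operations than many small ones.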

 

Kafka Streaming

 

Kafka is most commonly used to stream data to other systems in real time. Kafka acts as a middle layer that decouples your real-time data pipelines. Kafka core is not suited to direct computation such as data aggregation or CEP. Kafka Streaming, which is part of the Kafka ecosystem, provides the ability to do real-time analysis. Kafka can feed fast-lane systems (real-time and operational data systems) such as Storm, Flink, Spark Streaming, and your own services and CEP systems.

 

Kafka is also used to stream data for batch data analysis. It streams data into big data platforms or into an RDBMS, Cassandra, Spark, or even S3 for future analysis. These data stores usually support data analysis, reporting, data science, compliance auditing, and backups. With all that said, let's turn to the ultimate question:

 

So what exactly is Kafka?

Kafka is a distributed streaming platform for publishing and subscribing to streams of records. Kafka is used for fault-tolerant storage: it replicates topic log partitions to multiple servers. Kafka is designed so that your applications can process records as soon as they are produced. Kafka is fast and uses I/O efficiently by batching and compressing records. Kafka is used to decouple data streams, and to stream data into data lakes, real-time streaming applications, and analytics systems.

 

 

 

 

Kafka multi-language support

 

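Communication between Kafka clients and servers uses a TCP-based wire protocol that is versioned and documented. Kafka promises to maintain backward compatibility with older clients, and it supports many languages, including C#, Java, C, Python, and Ruby. The Kafka ecosystem also provides a REST proxy for easy integration over HTTP and JSON. Kafka also supports Avro schemas through the Confluent Schema Registry. Avro and the Schema Registry let clients written in many programming languages produce and read complex records, and they allow those records to evolve.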

 

What Kafka is used for

 

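Kafka supports building real-time streaming data pipelines. Kafka supports in-memory microservices (such as actors, Akka, Baratine.io, QBit, reactors, reactive, Vert.x, RxJava, and Spring Reactor). Kafka supports building real-time streaming applications that analyze, transform, react to, aggregate, and join real-time data streams and perform CEP.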

 

 

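Kafka can be used to help collect metrics or KPIs, to aggregate statistics from many sources, and to implement event sourcing (capturing all changes to application state as a sequence of events). It can be used together with in-memory microservices and actor systems to implement in-memory services, with Kafka serving as the external commit log for the distributed system.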
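
Event sourcing is easier to picture with a small example. The sketch below uses a plain Python list where a Kafka topic would be, and the account events are hypothetical. The current state is never stored directly; it is rebuilt by replaying the events in order.

```python
def apply_event(balances: dict, event: tuple) -> dict:
    """Return a new state with one (account, kind, amount) event applied."""
    account, kind, amount = event
    current = balances.get(account, 0)
    if kind == "deposit":
        updated = current + amount
    elif kind == "withdraw":
        updated = current - amount
    else:
        raise ValueError(f"unknown event kind: {kind}")
    return {**balances, account: updated}


def replay(events: list) -> dict:
    """Rebuild application state by folding over the event sequence."""
    state = {}
    for event in events:
        state = apply_event(state, event)
    return state


# Hypothetical event log; in practice these would be records in a topic.
event_log = [
    ("acct-1", "deposit", 100),
    ("acct-2", "deposit", 50),
    ("acct-1", "withdraw", 30),
]

current_state = replay(event_log)
state_after_two = replay(event_log[:2])  # state as of an earlier point in time
print(current_state)
print(state_after_two)
```

Because the log is the source of truth, a service that loses its in-memory state can recover by replaying from the start, and replaying a prefix of the log reconstructs the state at any earlier point.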

 

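Kafka can be used to replicate data between nodes, to resynchronize nodes, and to restore state. Although Kafka is mainly used for real-time data analysis and stream processing, it can also be used for log aggregation, messaging, click-stream tracking, audit trails, and more.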

 

Kafka as scalable message storage

 

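Kafka is a good storage system for records or messages. It acts like a high-speed file system for commit log storage and replication. These characteristics make Kafka suitable for all kinds of applications. Records written to Kafka topics are persisted to disk and replicated to other servers for fault tolerance. Since modern disks are fast and quite large, this approach works well. Kafka producers can wait for acknowledgment, so messages are durable: the producer's write does not complete until the message has been replicated. Kafka's disk structure scales well, and disks deliver very high throughput when streaming in large batches.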
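
The acknowledgment idea can be sketched with a toy model. Everything here is a simplification made for illustration: the `Replica` class and `send` function are hypothetical, and the model requires every follower to hold the record, whereas real Kafka waits only for the current in-sync replicas. The `acks` values mirror the producer setting of the same name.

```python
class Replica:
    """Toy broker replica holding an in-memory copy of one partition log."""

    def __init__(self, name, online=True):
        self.name = name
        self.online = online
        self.log = []

    def append(self, record):
        """Store the record if this replica is reachable; report success."""
        if not self.online:
            return False
        self.log.append(record)
        return True


def send(record, leader, followers, acks="all"):
    """Return True only when the write is acknowledged under the acks mode."""
    if not leader.append(record):
        return False
    copied = [follower.append(record) for follower in followers]
    if acks == "all":
        # Simplification: every follower must have the record.
        return all(copied)
    # acks == "1": the leader's own write is enough.
    return True


leader = Replica("broker-1")
followers = [Replica("broker-2"), Replica("broker-3", online=False)]

acked_all = send("order-1", leader, followers, acks="all")
acked_leader = send("order-2", leader, followers, acks="1")
print(acked_all, acked_leader)
```

The trade-off is latency against durability: waiting for replication before acknowledging is slower, but an acknowledged record then survives the loss of the leader.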

 

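In addition, Kafka clients and consumers can control their read position (the offset), which enables use cases such as replaying the log after a critical bug (that is, fix the bug and replay). And because offsets are tracked per consumer group, consumers can replay the log very flexibly.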
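
A toy model shows how per-group offsets make replay possible. The `TopicLog` class and its method names below are hypothetical and kept in memory; they only loosely mirror the poll and seek operations of real consumer clients.

```python
class TopicLog:
    """Toy append-only log; a record's offset is its index in the list."""

    def __init__(self):
        self.records = []
        self.group_offsets = {}  # consumer group name -> next offset to read

    def append(self, record):
        """Append a record and return the offset it was stored at."""
        self.records.append(record)
        return len(self.records) - 1

    def poll(self, group, max_records=10):
        """Return the next records for a group and advance its offset."""
        start = self.group_offsets.get(group, 0)
        batch = self.records[start:start + max_records]
        self.group_offsets[group] = start + len(batch)
        return batch

    def seek(self, group, offset):
        """Move a group's read position, for example to replay after a fix."""
        self.group_offsets[group] = offset


log = TopicLog()
for i in range(5):
    log.append(f"event-{i}")

first_pass = log.poll("billing")            # reads everything once
log.seek("billing", 2)                      # rewind after fixing a bug
replayed = log.poll("billing")              # re-reads from offset 2
audit_pass = log.poll("audit", max_records=2)  # independent position
print(replayed, audit_pass)
```

Reading never removes records from the log, so rewinding one group's offset has no effect on any other group.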

 

Kafka record retention

 

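The Kafka cluster retains all published records. If no limit is set, it keeps all records until disk space runs out. You can set a time-based limit (a configurable retention period), a size-based limit (configurable by storage space), or compaction (keeping the latest version of each record). Records in the topic log remain available until they are removed by a time, size, or compaction policy. Because Kafka always writes to the end of the topic log, its consumption speed is not affected by log size.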
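
The retention policies can be illustrated with a small in-memory sketch. It is a simplification: real Kafka applies retention to whole log segments rather than to individual records, and the record layout `(timestamp, key, value)` is an assumption made here. In a real cluster these behaviors are controlled by topic settings such as `retention.ms`, `retention.bytes`, and `cleanup.policy`.

```python
def retain_by_age(records, now, max_age_seconds):
    """Time-based retention: drop records older than the retention period."""
    return [r for r in records if now - r[0] <= max_age_seconds]


def compact(records):
    """Compaction: keep only the latest record for each key, in log order."""
    last_index = {key: i for i, (_ts, key, _value) in enumerate(records)}
    return [r for i, r in enumerate(records) if last_index[r[1]] == i]


# Hypothetical records as (timestamp_seconds, key, value).
records = [
    (100, "user-1", "a@example.com"),
    (200, "user-2", "b@example.com"),
    (300, "user-1", "a.new@example.com"),
]

recent = retain_by_age(records, now=350, max_age_seconds=200)
compacted = compact(records)
print(recent)
print(compacted)
```

Time and size limits suit event streams where old data loses value, while compaction suits changelog-style topics where only the latest value per key matters.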

 

How can you learn Kafka quickly?

 

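The column "Kafka Core Technology and Practice" (《Kafka核心技术与实战》) is a fresh take on the subject: Hu Xi (胡夕), currently director of the computing platform department at Renrendai (人人贷) and an active code contributor to Apache Kafka, uses lighter, easier-to-understand language and formats to help you pick up the latest hands-on Kafka experience.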

Origin blog.csdn.net/lele989/article/details/91879988