Application scenarios and differences between Hadoop and JStorm

1. Hadoop is an offline analysis tool for processing massive data, while Storm is a distributed, real-time data stream analysis tool. One focuses on offline analysis, the other on real-time stream analysis.
2. Hadoop focuses on the powerful analysis function of offline data, while Storm emphasizes the analysis of real-time data streams.
3. Hadoop has low real-time performance (minute level) but strong processing capability for large amounts of data (TB level), while Storm has high real-time performance (millisecond level) but weaker massive-data processing capability than Hadoop's.
4. Data source: Hadoop processes data, possibly terabytes of it, sitting in a folder on HDFS; Storm processes each new piece of data as it arrives in real time.
5. Processing flow: Hadoop is divided into a MAP stage and a REDUCE stage; in Storm the user defines the processing flow, which can contain multiple steps, each of which is either a data source (SPOUT) or processing logic (BOLT).
6. Whether it ends: a Hadoop job eventually ends; a Storm topology has no end state — when it reaches the last step it stops there until new data comes in, then starts from the beginning again.
7. Processing speed: Hadoop is designed to process large volumes of data on HDFS, so it is slow; Storm only has to process each piece of new data as it arrives, so it is fast.
8. Applicable scenarios: use Hadoop when a batch of data needs to be processed and timeliness does not matter — when processing is needed, a JOB is submitted. Use Storm when each new piece of data must be processed as it arrives and timeliness is required.
9. Comparison with MQ: Hadoop has no direct analogue. Storm can be viewed as N steps connected by message queues: after each step finishes processing, it sends a message to the next MQ, and the consumer listening on that MQ continues the processing.
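The two processing models above — a fixed MAP-then-REDUCE pipeline versus a user-defined SPOUT→BOLT flow over individual tuples — can be sketched in plain Python. This is a conceptual toy, not the real Hadoop or Storm API; all names are illustrative:

```python
# Toy contrast between Hadoop-style batch processing and Storm-style streaming.
# All names are illustrative; this is not the real Hadoop/Storm API.

from collections import defaultdict

def hadoop_style_wordcount(lines):
    """Batch model: a MAP stage over the whole dataset, then a REDUCE stage."""
    # MAP stage: emit (word, 1) pairs for the entire input at once
    mapped = [(word, 1) for line in lines for word in line.split()]
    # SHUFFLE + REDUCE stage: group by key and sum the counts
    counts = defaultdict(int)
    for word, n in mapped:
        counts[word] += n
    return dict(counts)  # the job ends once the whole batch is done

def storm_style_wordcount(stream):
    """Streaming model: a SPOUT feeds one tuple at a time into a BOLT chain."""
    counts = defaultdict(int)
    for line in stream:            # SPOUT: each new line is a fresh tuple
        for word in line.split():  # BOLT 1: split the line into words
            counts[word] += 1      # BOLT 2: update a running count
            # results are usable per tuple, not only after a whole job
    return dict(counts)

data = ["storm hadoop", "storm"]
print(hadoop_style_wordcount(data))  # result available after the job finishes
print(storm_style_wordcount(data))   # same totals, built incrementally
```

Both functions produce the same totals; the difference the article stresses is *when* results become available — after the whole batch in the Hadoop model, after every tuple in the Storm model.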


A typical scenario:
1. With Hadoop, the data must first be stored in HDFS and computed at the granularity of one file cut per minute (this granularity is already extremely fine; any finer and HDFS fills up with a pile of small files). By the time Hadoop starts computing, one minute has already passed; scheduling the task takes roughly another minute; then the job runs — assuming plenty of machines, a few minutes is enough — and writing the results to the database also takes a little time. So at least two minutes pass between the data being generated and the result finally being available.
2. With JStorm streaming computation, a program continuously monitors log generation as data is produced; each line is sent through a transmission system to the streaming system, which processes it directly and writes it to the database. Each piece of data is written as soon as it is generated, which can complete within milliseconds when resources are sufficient.
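The streaming path just described — a watcher tails the log, a transmission system forwards each line, and the processor writes each result row immediately — can be sketched as follows. The queue stands in for the transmission system and the list for the database; the names are illustrative, not a real JStorm API:

```python
# Minimal sketch of the streaming path: log watcher -> transport -> processor.
# queue.Queue stands in for the transmission system (e.g. an MQ); the list
# stands in for the result database. Names are illustrative, not a real API.

import queue
import threading

transport = queue.Queue()   # stands in for the transmission system
database = []               # stands in for the result store

def log_watcher(log_lines):
    """Monitors log generation and forwards each new line as it appears."""
    for line in log_lines:
        transport.put(line)
    transport.put(None)     # sentinel: no more data in this demo

def stream_processor():
    """Processes each record the moment it arrives, then writes it out."""
    while True:
        line = transport.get()
        if line is None:
            break
        database.append(line.upper())  # trivial per-record processing + "DB write"

watcher = threading.Thread(target=log_watcher, args=(["req a", "req b"],))
worker = threading.Thread(target=stream_processor)
worker.start()
watcher.start()
watcher.join()
worker.join()
print(database)  # each record was written individually, not as a batch
```

The key property mirrored here is that the processor never waits for a batch boundary: each record is handled and persisted the moment it crosses the transport, which is what makes millisecond-level latency possible.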


Reference (application scenarios and differences): https://www.zhihu.com/question/20098507
Reference (differences): http://blog.csdn.net/educast/article/details/41723471
