Applicability insights for Hadoop, Spark, HBase and Redis

Questions to frame the discussion:
1. What scenarios is Hadoop suitable for?
2. What scenarios is Spark suitable for?
3. What are the characteristics of HBase and Redis?




Recently I saw an online discussion about the applicability of Hadoop [1]. This year, big data technology has begun to spread from the Internet giants to small and medium-sized Internet companies and to traditional industries, so presumably many people are now weighing the applicability of the various "complex" big data technologies. Drawing on my work in Hadoop and other big data directions over the past few years, I will discuss the usage scenarios of several mainstream big data technologies: Hadoop, Spark, HBase, and Redis. (To be clear up front: the "Hadoop" in this article is a very "narrow" Hadoop, i.e., running MapReduce directly on HDFS; the same applies below.)

I have researched and used big data (including NoSQL) technologies in recent years, including Hadoop, Spark, HBase, Redis, and MongoDB. What these technologies have in common is that they are not suitable for supporting transactional applications, especially applications involving "money", such as order management or supermarket transactions; those domains are still dominated by traditional relational databases such as Oracle.

1. Hadoop Vs. Spark
Hadoop/MapReduce and Spark are both best suited to offline data analysis, but Hadoop is particularly suitable when the data volume of a single analysis is "large", while Spark fits scenarios where the data volume is not very large. "Large" here is relative to the memory capacity of the whole cluster, because Spark needs to hold the data in memory. In general, data volumes below 1TB cannot be considered very large, while volumes above 10TB count as "large". For example, a 20-node cluster (small by big data standards), with 64GB of memory per node (neither too small nor too large), has 1.28TB of memory in total. A cluster of this size can easily hold about 500GB of data in memory, and in that case Spark will run faster than Hadoop; after all, the MapReduce pipeline has operations such as spills that must write to disk.
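To make "holding data in memory" concrete, here is a minimal PySpark sketch of caching a dataset so repeated analyses avoid re-reading HDFS. The path and column name are hypothetical, and the DataFrame API shown postdates the Spark 1.0 mentioned below:

```python
from pyspark.sql import SparkSession

# Start a Spark session; on a real cluster this would point at YARN or standalone.
spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Hypothetical input: a few hundred GB of events on HDFS.
events = spark.read.parquet("hdfs:///data/events")

# Pin the dataset in executor memory; subsequent actions reuse it
# instead of re-reading HDFS (this is where Spark's speed advantage comes from).
events.cache()

print(events.count())                           # first action: reads HDFS, fills the cache
print(events.filter("status = 'ok'").count())   # second action: served from memory
```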

There are 2 points to mention here:
1) In general, for small and medium-sized Internet companies and enterprise-level big data applications, the data volume of a single analysis will not be "large", so Spark can be given priority, especially once Spark matures a bit further (Hadoop has reached release 2.5, while Spark has only just released 1.0). For example, at a provincial company of China Mobile (by enterprise standards, the carriers' data volumes are still quite large), a single analysis generally involves only a few hundred GB, rarely exceeds 1TB, and almost never exceeds 10TB, so they can consider gradually replacing Hadoop with Spark.

2) People often claim that Spark is better suited to "iterative" applications such as machine learning, but "better" is all it is. In general, on medium-sized data volumes, Spark is about 2-5 times faster even for applications outside that sweet spot. I ran a comparison test myself: 80GB of compressed data (more than 200GB after decompression) on a 10-node cluster, running a "sum+group-by" style application; MapReduce took 5 minutes, while Spark took only 2 minutes. A sketch of that workload follows.
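For reference, the "sum+group-by" workload above boils down to a single aggregation; here is a minimal PySpark sketch of it (input path and column names are made up for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sum-group-by").getOrCreate()

# Hypothetical input: one row per record, with a user id and a numeric fee.
records = spark.read.csv("hdfs:///data/records", header=True, inferSchema=True)

# The entire benchmark is one group-by plus a sum.
totals = records.groupBy("user_id").agg(F.sum("fee").alias("total_fee"))

totals.write.parquet("hdfs:///out/totals")
```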

2. HBase
For HBase, one claim you often hear is: HBase is only suitable for supporting offline analytical applications, especially as a back-end data source for MapReduce jobs. Quite a few people hold this view; even at a well-known domestic telecom equipment vendor, HBase is classified under the data analysis product line, with an explicit recommendation against using it for online applications. But is that really so? Let's first look at a few of its major deployments: Facebook's messaging applications, including the Messages, Chats, Emails, and SMS systems, all run on HBase; the backend of Taobao's web version of Ali Wangwang is HBase; Xiaomi's MiTalk also uses HBase; and a provincial China Mobile company's call detail record (CDR) query system was migrated last year from Oracle to a 32-node HBase cluster. These are all key applications at well-known large companies, which should settle the question.

In fact, judging from HBase's technical characteristics, it is particularly suitable for simple data writes (as in "messaging" applications) and for queries over massive volumes of simply structured data (as in "detail record" applications). Of the four HBase deployments mentioned above, Facebook Messages, web Ali Wangwang, and MiTalk are write-heavy messaging applications, while the mobile carrier's CDR query system is a query-heavy detail record application.
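To illustrate both access patterns, here is a minimal sketch using happybase, a Thrift-based Python client for HBase; the table name, column family, and rowkey layout are assumptions for illustration:

```python
import happybase

# Connect through an HBase Thrift gateway (assumed to run on this host).
connection = happybase.Connection('hbase-thrift-host', port=9090)
table = connection.table('cdr')

# Messaging-style workload: a simple keyed write.
# Rowkey = phone number + timestamp, so one user's records cluster together.
table.put(b'13800000000_20140601120000',
          {b'd:peer': b'13911111111', b'd:duration': b'120'})

# Detail-record-style workload: fetch one user's records by rowkey prefix.
for key, data in table.scan(row_prefix=b'13800000000_'):
    print(key, data)
```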

Another use of HBase is as a back-end data source for MapReduce, to support offline analytical applications. That certainly works, but the performance is debatable. For example, superlxw1234 experimentally compared "Hive over HBase" with "Hive over HDFS" and was surprised to find [2] that, apart from filtering on the rowkey, where the HBase-based setup was slightly faster than going directly against HDFS, the direct-HDFS setup was much faster for both full table scans and filtering on values. So much for the conventional wisdom! On this point, my own view from first principles is: when filtering on the rowkey, the more selective the filter, the better the HBase-based approach performs; the performance of the direct-HDFS approach, by contrast, is independent of filter selectivity.
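The difference in access paths shows up even at the client level. In the sketch below (same hypothetical table as above), the first scan seeks directly to a rowkey range, while the second must read every row and test its value server-side:

```python
import happybase

connection = happybase.Connection('hbase-thrift-host', port=9090)
table = connection.table('cdr')

# Rowkey filter: HBase seeks straight to the matching key range,
# so cost shrinks as the filter becomes more selective.
rows = table.scan(row_start=b'13800000000_20140601',
                  row_stop=b'13800000000_20140602')

# Value filter (HBase filter-language string): every row is read and
# checked server-side, which is effectively a full table scan.
slow_rows = table.scan(filter=b"ValueFilter(=, 'binary:120')")
```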

3. HBase Vs. Redis
HBase and Redis are functionally similar: both are NoSQL-class databases, and both support data sharding, among other things. There is really only one key difference: with HBase, once data has been written successfully, in principle it cannot be lost, because HBase keeps a Write-Ahead Log (functionally similar to Oracle's REDO log); with Redis, even with master-slave replication configured, data loss during failover is entirely possible (without replication, even more data can be lost), because Redis has no REDO-like log, and its replication is asynchronous.

The key difference is performance. Typically, Redis delivers read/write performance of around 100,000 ops/s with latency of roughly 10-70 microseconds [4][5], while a single HBase node generally does not exceed 1,000 ops/s, with latency between 1 and 5 milliseconds [3]. Even setting aside hardware differences, a 100x gap in read/write performance speaks for itself. Incidentally, Redis rewards careful tuning: for example, using numactl (or taskset) to pin the Redis process to different cores of the same CPU typically yields around a 30% performance gain [6], and in some special scenarios nearly doubles throughput.
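As a rough sanity check on figures of this order, here is a minimal redis-py micro-benchmark sketch (local server assumed). Note that a single unpipelined client usually measures well below the server's capacity, since the synchronous round trip dominates; pipelined tools such as redis-benchmark get closer to the quoted numbers:

```python
import time
import redis

r = redis.Redis(host='localhost', port=6379)  # assumes a local Redis server

N = 100_000
start = time.time()
for i in range(N):
    r.set(f'key:{i}', 'value')  # one synchronous round trip per operation
elapsed = time.time() - start

# Throughput and average latency; expect round-trip overhead to dominate
# without pipelining.
print(f'{N / elapsed:,.0f} ops/s, {elapsed / N * 1e6:.0f} us/op on average')
```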

From the functional and performance comparison above, it is easy to summarize the respective sweet spots of HBase and Redis:
1) When supporting simple "messaging" applications: if data loss cannot be tolerated, HBase is the only option; if you need high performance and can tolerate some data loss, Redis is well worth considering.

2) Redis is a natural fit for caching, but beyond that it can also serve as the "read store" in read/write-splitting architectures, in particular for holding analysis results produced by Hadoop or Spark (see the sketch after this list).
Many people believe Redis is only fit for use as a "cache". As I understand it, this view rests on two points: first, Redis by design can lose data; second, once the dataset no longer fits entirely in memory, its read/write performance drops sharply to a few hundred ops/s [6], a phenomenon also seen in Google's open-source LevelDB [7] and confirmed by Facebook's RocksDB team in their performance benchmarks [8]. But when Redis serves as a "read store", or supports "messaging" applications that can tolerate data loss, neither problem actually matters.
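As an illustration of the "read store" pattern, here is a sketch that publishes a (small) Spark aggregation result into Redis for low-latency reads; the paths, key layout, and host names are all assumptions:

```python
import redis
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("publish-to-redis").getOrCreate()

# Hypothetical offline analysis: total fee per user (same shape as the earlier sketch).
totals = (spark.read.parquet("hdfs:///data/records")
               .groupBy("user_id").agg(F.sum("fee").alias("total_fee")))

# Analysis results are small here, so collecting to the driver is acceptable;
# for large result sets one would write from the executors via foreachPartition.
r = redis.Redis(host='redis-host', port=6379)
pipe = r.pipeline()
for row in totals.collect():
    pipe.set(f"total_fee:{row['user_id']}", row['total_fee'])
pipe.execute()  # online services can now read these keys at sub-millisecond latency
```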


Reference: http://datainsight.blog.51cto.com/8987355/1426538
