Hadoop learning summary

The previous articles cover the core content of Hadoop. As we have seen, the key parts of Hadoop are the MapReduce computing framework and the HDFS distributed file system, with resource management handled by YARN. For more in-depth knowledge, you need to gain experience in real work and projects and look for solutions as problems arise.


Today's article summarizes Hadoop from a more macro perspective. Some of the points are my own thoughts, and some are opinions gathered from the Internet. I hope it will be helpful as you learn and use Hadoop in the future.


Setting aside the various frameworks built around Hadoop, let's first look at the modules Hadoop itself provides. Hadoop is mainly used for the storage and analysis of big data: HDFS handles storage, and MapReduce handles analysis.


HDFS is arguably the most widely used distributed file system at present. Whether we are talking about frameworks such as HBase or Hive, the bottom layer is still HDFS; what they do is make the ways of operating on HDFS more diverse.
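
To make "operating on HDFS" concrete, here is a minimal sketch using the HDFS Java API to write a small file and read it back. The namenode URI, port, and paths are made up for illustration; this is not a definitive setup, just the basic shape of the API.

```java
// Minimal HDFS read/write sketch; namenode address and paths are hypothetical.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point at the cluster's namenode (hypothetical host and port).
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path path = new Path("/demo/hello.txt");

        // Write a small file; HDFS splits it into blocks and replicates them across datanodes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back as a stream.
        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader = new BufferedReader(
                 new InputStreamReader(in, StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }

        fs.close();
    }
}
```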


The weakness of Hadoop lies mainly in the MapReduce part, because MapReduce is a batch system and is not well suited to interactive analysis. MapReduce can only process offline data, that is, data already stored on disk, and this approach does not fit many of today's applications. Spark therefore emerged as a new framework, and it replaces exactly the MapReduce part. So it should not be said that the appearance of Spark leaves Hadoop with no room to live: Spark only replaces MapReduce, and Spark has to live within the Hadoop environment in order to show its value.
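
For reference, this is roughly what a batch MapReduce job looks like: the canonical word-count example, sketched with the standard Hadoop MapReduce Java API. The class names and the input/output paths are placeholders, not code from the earlier articles.

```java
// Word count, the canonical MapReduce example; paths come from the command line.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Every run reads its input from disk and writes its output back to disk, which is exactly the batch, offline style discussed above.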


Hadoop is better suited to processing PB-scale data, a volume that most ordinary enterprises do not reach. As a result, real-world use of Hadoop is not actually that widespread, but the big domestic Internet companies (BAT: Baidu, Alibaba, Tencent) use it, and so do carriers such as China Mobile and China Unicom, as well as banks.


Therefore, whether to use Hadoop is determined by your data volume, and whether you need to use a certain framework is determined by your architecture and business requirements. If you only have gigabytes of data, there is no need to use Hadoop at all, and a database system can solve the problem well.


So, what are the main frameworks in the Hadoop ecosystem, and what role does each one play? Below is a brief introduction. I am still learning too and do not yet understand all of them in depth; we can dig deeper when the need arises, but at the very least we should know what tools are available to us.


Avro: a programming-language-independent data serialization system. Avro data is described by a schema, and the schema itself is written in JSON. Avro parses the schema and uses it to serialize and deserialize data, which makes the data self-describing and independent of the platform and programming language that consumes it. Through Avro we can write data to HDFS and read it back out.
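
As a small illustration, here is a sketch of Avro's generic Java API serializing one record to a file and reading it back; the schema, field names, and file name are invented for the example, and the Avro library is assumed to be on the classpath.

```java
// Minimal Avro serialize/deserialize sketch with a made-up "User" schema.
import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
    // The schema is ordinary JSON text.
    private static final String SCHEMA_JSON =
        "{ \"type\": \"record\", \"name\": \"User\", \"fields\": ["
        + "  { \"name\": \"name\", \"type\": \"string\" },"
        + "  { \"name\": \"age\",  \"type\": \"int\" } ] }";

    public static void main(String[] args) throws IOException {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
        File file = new File("users.avro");

        // Serialize: write a record described by the schema into an Avro data file.
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 30);
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);
            writer.append(user);
        }

        // Deserialize: the schema is stored with the file, so it can be read back
        // without coordinating the schema out of band.
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord record : reader) {
                System.out.println(record.get("name") + " " + record.get("age"));
            }
        }
    }
}
```

Because the schema travels with the data, a program written in any other language Avro supports could read users.avro back with no extra coordination.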


Flume: Flume acts like a pipeline connecting the various components of Hadoop. The pipeline carries streaming data (typically log data), and using Flume streamlines how that data is moved into the cluster.
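
As a rough sketch of pushing data into that pipeline, the Flume client SDK can send events to an agent. The host, port, and message below are hypothetical, and a running Flume agent with a matching Avro source, channel, and sink (for example into HDFS) is assumed.

```java
// Minimal Flume client sketch sending one event to a (hypothetical) agent.
import java.nio.charset.StandardCharsets;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeClientExample {
    public static void main(String[] args) throws EventDeliveryException {
        // Connect to a Flume agent's Avro source (hypothetical host and port).
        RpcClient client = RpcClientFactory.getDefaultInstance("flume-host", 41414);
        try {
            // Wrap one line of log data in an event and hand it to the pipeline;
            // the agent's channel and sink take it from there.
            Event event = EventBuilder.withBody("one line of log data", StandardCharsets.UTF_8);
            client.append(event);
        } finally {
            client.close();
        }
    }
}
```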


Sqoop: transfers data in both directions between relational databases and HDFS.


Pig: lets you access data on HDFS in a SQL-like way, using its own language called Pig Latin.


Hive: Pig Latin can only be said to be similar to SQL, whereas using Hive (its query language, HiveQL) is basically the same as writing SQL.
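
A minimal sketch of how close Hive feels to plain SQL: querying HiveServer2 through JDBC from Java. The connection URL, credentials, and the logs table are hypothetical, and the Hive JDBC driver is assumed to be on the classpath.

```java
// Minimal HiveServer2 JDBC sketch; host, database, and table are hypothetical.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // HiveServer2 commonly listens on port 10000.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hive-host:10000/default", "user", "");
             Statement stmt = conn.createStatement();
             // Plain SQL, translated by Hive into jobs over data stored on HDFS.
             ResultSet rs = stmt.executeQuery(
                 "SELECT category, COUNT(*) FROM logs GROUP BY category")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```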


HBase: if Hive provides a SQL-like way to operate on data in HDFS, HBase provides a NoSQL-style way to operate on it.
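
A minimal sketch of that NoSQL style using the HBase Java client: one put and one get by row key. The ZooKeeper quorum, table name, column family, and row key are hypothetical, and the table is assumed to already exist.

```java
// Minimal HBase put/get sketch; cluster address and table layout are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk-host");  // HBase is located via ZooKeeper.

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Write one cell: row key -> column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // Read the same row back by key, NoSQL style.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}
```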


Spark: this is the framework mentioned above that replaces MapReduce. Its biggest advantage is that computation happens in memory, which greatly improves speed. However, it should be clear that Spark is only a computing framework; it does not depend on a specific cluster. So when some people say that Spark has put an end to Hadoop, it can only be said that Hadoop's MapReduce can be replaced by Spark.
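
A minimal sketch of a Spark job written in Java that reads a log file from HDFS and counts error lines; the HDFS path and master setting are illustrative only.

```java
// Minimal Spark job in Java; the HDFS path and master URL are hypothetical.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("error-count")
            .setMaster("local[*]");  // typically "yarn" when running inside a Hadoop cluster
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Spark has no storage of its own; here it reads from HDFS.
            JavaRDD<String> lines = sc.textFile("hdfs:///logs/app.log");
            // Keep the filtered dataset in memory so repeated actions avoid re-reading disk;
            // this in-memory reuse is where the speed advantage over MapReduce comes from.
            JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR")).cache();
            System.out.println("error lines: " + errors.count());
            System.out.println("sample: " + errors.take(5));
        }
    }
}
```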


ZooKeeper: a distributed coordination service for Hadoop. Among other things, it helps your system handle partial failures properly.
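
A minimal sketch of the ZooKeeper Java client creating a znode and reading it back; the connection string and path are hypothetical. Real coordination recipes (locks, leader election, configuration watches) are built from primitives like these.

```java
// Minimal ZooKeeper client sketch; ensemble address and znode path are hypothetical.
import java.nio.charset.StandardCharsets;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
    public static void main(String[] args) throws Exception {
        // Connect to the ensemble; the watcher receives connection-state events.
        // (In production you would wait for the connection before issuing requests.)
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 30000, new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                System.out.println("event: " + event.getState());
            }
        });
        try {
            // A persistent znode that other processes can read or watch.
            String path = zk.create("/config-demo",
                    "v1".getBytes(StandardCharsets.UTF_8),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            byte[] data = zk.getData(path, false, null);
            System.out.println(path + " = " + new String(data, StandardCharsets.UTF_8));
        } finally {
            zk.close();
        }
    }
}
```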


The above are some commonly used frameworks in the Hadoop ecosystem; they provide solutions for different aspects of big data processing. We can turn to them first when a need arises, and if an existing component solves our problem well, so much the better.


As we can see from the above, Hadoop remains a preferred framework for big data processing. Even if we never get the chance to work in a Hadoop-related role, knowing something about this field is still well worth it; at the very least it adds one more tool to our toolbox. After all, in a world where data volumes grow explosively, none of us can guarantee that the system we are currently building or maintaining will never reach that scale.

