Big data study notes (1)

In the new year, the company is starting a big data center project, so the architecture team needs to research the related technologies. I had heard of big data technologies before, but never used them in actual projects and seldom paid attention to them. Here are some of my technical understandings:
1. Hadoop: my understanding is that it is a big data processing framework, really an ecosystem including a bunch of technologies such as HDFS, HBase, YARN, Hive, ZooKeeper, etc. Its processing model is MapReduce.
2. HDFS: a distributed file system for massive data; Hadoop batch jobs generally depend on resource files stored on HDFS.
3. HBase: a KV database, similar to Redis, except that HBase mainly uses disk storage while Redis mainly uses memory.
4. YARN: task scheduling and resource management.
5. MapReduce: a batch processing model that first splits the input among map tasks for processing, then aggregates the results through reduce, so the work is spread across the cluster.
6. Hive: MapReduce is a low-level interface and relatively hard to use; Hive lets users drive MapReduce through SQL-like scripts, making it a high-level interface. By analogy, MapReduce is like assembly language and Hive is like C.
7. ZooKeeper: a distributed coordination service.
8. Spark: a more advanced big data processing framework. Because Hadoop's MapReduce is relatively slow, Spark can achieve higher speed with fewer resources.
9. DAG: a directed acyclic graph, i.e., a processing flow that contains no loops.
10. RDD: Resilient Distributed Dataset, a read-only record set held in memory that supports only a limited set of operations such as join, group, etc.
11. Shark: because Hive originally supported only Hadoop, Shark was built to run Hive on Spark. Shark stopped being updated after Hive itself added Spark support.
12. Hive on Spark: Hive running with Spark as its execution engine.
13. Spark SQL: because Shark was adapted from Hive, it retained a lot of Hive code; after Shark development stopped, Spark SQL was redeveloped from scratch.
14. Spark Streaming: Spark's stream processing framework. Hadoop can only run batch jobs, while Spark Streaming can compute in near real time, for example when some services require real-time alerting, and it integrates seamlessly with the Spark technology stack. In fact it divides the real-time data into small batches by time, so latency only reaches the second level; reaching the millisecond level requires other technologies.
15. Storm: a big data stream processing framework. Compared with Spark it has better real-time performance, reaching the millisecond level, but lower throughput, since records are processed one by one.
16. Mesos: similar to YARN, also a resource scheduling manager.
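As a toy illustration of item 5 (plain Python, not Hadoop's actual API), the map → shuffle → reduce flow can be sketched as a word count:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: turn each input chunk into (word, 1) pairs."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's values into a final result."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big cluster", "data processing"]
result = reduce_phase(shuffle(map_phase(docs)))
# result == {"big": 2, "data": 2, "cluster": 1, "processing": 1}
```

In a real cluster the map tasks run in parallel on different machines and the shuffle moves data over the network; here all three phases run in one process.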
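Item 6's analogy can be shown in miniature: a word count expressed declaratively in SQL, using Python's built-in sqlite3 as a stand-in for Hive (real Hive compiles similar SQL into cluster jobs):

```python
import sqlite3

# An in-memory table of words; in Hive this would be a table over HDFS files.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")
for doc in ["big data big cluster", "data processing"]:
    conn.executemany("INSERT INTO words VALUES (?)",
                     [(w,) for w in doc.split()])

# One declarative line replaces hand-written map and reduce code.
counts = dict(conn.execute(
    "SELECT word, COUNT(*) FROM words GROUP BY word"))
# counts == {"big": 2, "cluster": 1, "data": 2, "processing": 1}
```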
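Item 9's "no loop" property can be checked programmatically; Kahn's topological-sort algorithm is one standard way to decide whether a directed graph is acyclic:

```python
from collections import deque

def is_dag(nodes, edges):
    """Kahn's algorithm: a graph is a DAG iff every node can be
    removed in topological order (zero in-degree first)."""
    indegree = {n: 0 for n in nodes}
    adjacency = {n: [] for n in nodes}
    for src, dst in edges:
        adjacency[src].append(dst)
        indegree[dst] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for nxt in adjacency[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    return visited == len(nodes)  # a cycle leaves some nodes unvisited

acyclic = is_dag("abc", [("a", "b"), ("b", "c")])  # True: a -> b -> c
cyclic = is_dag("abc", [("a", "b"), ("b", "a")])   # False: a -> b -> a loops
```

Spark builds such a DAG of operations before executing them, which is part of how it schedules work more flexibly than plain MapReduce.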
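Item 10's idea, read-only data plus a fixed menu of operations, can be sketched with a hypothetical MiniRDD class (not Spark's API; just an illustration of the principle):

```python
class MiniRDD:
    """Toy illustration of an RDD: immutable records plus a few fixed
    operations. Each transformation returns a NEW dataset; the original
    is never modified."""
    def __init__(self, records):
        self._records = tuple(records)  # read-only, like an RDD

    def map(self, fn):
        return MiniRDD(fn(r) for r in self._records)

    def filter(self, pred):
        return MiniRDD(r for r in self._records if pred(r))

    def group_by(self, key_fn):
        groups = {}
        for r in self._records:
            groups.setdefault(key_fn(r), []).append(r)
        return groups

    def collect(self):
        return list(self._records)

rdd = MiniRDD([1, 2, 3, 4])
doubled = rdd.map(lambda x: x * 2).filter(lambda x: x > 4)
# doubled.collect() == [6, 8]; rdd.collect() is still [1, 2, 3, 4]
```

The read-only property matters for fault tolerance: because a dataset is never mutated, a lost partition can be recomputed by replaying the transformations that produced it.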
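Item 14's micro-batching can be sketched in a few lines (micro_batches is a hypothetical helper, not Spark's API): events carry timestamps and are grouped into fixed time windows, so nothing is processed until its window's batch is cut.

```python
def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into fixed time windows,
    mimicking how Spark Streaming turns a stream into small batches."""
    batches = {}
    for ts, value in events:
        window = ts - (ts % batch_seconds)  # start of this event's window
        batches.setdefault(window, []).append(value)
    return batches

events = [(0, "a"), (1, "b"), (2, "c"), (3, "d")]
batches = micro_batches(events, 2)
# batches == {0: ["a", "b"], 2: ["c", "d"]}
```

The window size is exactly why latency bottoms out at the batch interval: an event arriving at t=0 is not handled until its whole window closes. Storm-style one-record-at-a-time processing avoids this wait, which is the latency/throughput trade-off of item 15.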
