Self-Studying Big Data: Where to Start?

Does big data technology feel too deep to get into? If you want to teach yourself big data, where should you start?

Today's post focuses on presenting the core big data techniques!

First of all, the term "big data". Big data is still data in nature, but it has some new, distinguishing characteristics, including: a wide range of data sources; diverse data formats (structured data, unstructured data, Excel files, text files, etc.); huge data volume (at the TB level at a minimum, perhaps even the PB level); and rapid data growth.

Extending from these four basic characteristics of big data, we think the key questions are the following:

1. Data sources are wide-ranging. How is the data collected?

With data coming from such a wide range of sources, what tools do we use to collect and aggregate it? For this, tools such as Sqoop, Camel, and DataX appeared.
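
For a concrete feel, a typical Sqoop job imports a table from a relational database into HDFS from the command line. This is a minimal sketch, assuming a MySQL source; the host, database, table, and credential paths are placeholders for illustration, not values from this article:

    # Import the "orders" table from MySQL into HDFS with 4 parallel map tasks
    sqoop import \
      --connect jdbc:mysql://db.example.com:3306/shop \
      --username etl_user \
      --password-file /user/etl/.db_password \
      --table orders \
      --target-dir /data/raw/orders \
      --num-mappers 4

Each map task pulls a slice of the table, so even the import itself is parallelized across the cluster.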

2. Once the data is collected, how do we store it?

To store the collected data conveniently, distributed file systems such as GFS, HDFS, and TFS appeared. Moreover, data grows very fast, which means the storage layer must be able to scale horizontally.
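
As a small taste of working with such a system, here are a few basic HDFS shell commands (the paths are made up for illustration):

    # Create a directory in HDFS and upload a local file into it
    hdfs dfs -mkdir -p /data/raw/logs
    hdfs dfs -put access.log /data/raw/logs/
    # Inspect the directory and its size
    hdfs dfs -ls /data/raw/logs
    hdfs dfs -du -h /data/raw
    # Durability comes from replication; here we ask for 3 copies of the file
    hdfs dfs -setrep -w 3 /data/raw/logs/access.log

Horizontal scaling then means adding DataNodes; the file system spreads blocks and replicas across them automatically.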

3. Once the data is stored, how do we quickly transform it into a consistent format, and how do we quickly compute the results we want?

For this, the MapReduce distributed computing framework proposed a solution. However, MapReduce requires a large amount of Java code, so analytic engines such as Hive and Pig appeared, which translate SQL(-like) queries into MapReduce jobs. Then, because ordinary MapReduce can only process data batch by batch, which takes too long when the ultimate goal is to get a result as each piece of input arrives, low-latency stream computing frameworks such as Storm/JStorm emerged.

However, if both batch and stream processing are needed, the approach above requires two clusters: a Hadoop cluster (HDFS + MapReduce + YARN) and a Storm cluster, which is hard to manage. Hence the one-stop Spark computing framework, which can do batch processing and can also do stream processing (essentially as micro-batches). Later, the Lambda and Kappa architectures emerged, providing general architectural patterns for this kind of business processing.
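
To make the contrast concrete, here is a minimal Spark word count in Scala (the input and output paths are hypothetical). These few lines replace what plain MapReduce would express as a Mapper class, a Reducer class, and a driver:

    import org.apache.spark.sql.SparkSession

    object WordCount {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("WordCount").getOrCreate()
        val sc = spark.sparkContext

        // Read text from HDFS, split it into words, and count each word
        val counts = sc.textFile("hdfs:///data/raw/logs/access.log")
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        counts.saveAsTextFile("hdfs:///data/out/wordcount")
        spark.stop()
      }
    }

For streaming, Spark exposes essentially the same programming model over micro-batches, which is why a single Spark cluster can stand in for the Hadoop-plus-Storm pairing described above.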
  
4. In addition, there are some auxiliary tools for improving efficiency and speeding things up:
  
Oozie, Azkaban: scheduled task/workflow scheduling tools (see the sketch after this list).
Hue, Zeppelin: graphical tools for managing task execution and viewing results.
Scala: the best language for writing Spark programs, though you can also choose Python.
Python: useful for writing scripts.
Alluxio, Kylin, etc.: tools that accelerate computation by pre-processing the stored data.
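
To show what "scheduled task scheduling" looks like in practice, here is a minimal Azkaban flow sketch: two .job files, where the second declares a dependency on the first. The job names and commands are invented for illustration:

    # ingest.job -- runs first
    type=command
    command=sqoop import --connect jdbc:mysql://db.example.com:3306/shop --table orders --target-dir /data/raw/orders

    # wordcount.job -- runs only after ingest succeeds
    type=command
    command=spark-submit --class WordCount wordcount.jar
    dependencies=ingest

Azkaban (or Oozie) then runs the flow on a schedule and enforces the dependency between the two jobs.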

These are the tools of the big data ecosystem you can draw on. A big data development training course can cover this knowledge in more detail and more completely; the content includes Linux and the Hadoop ecosystem, big data computing frameworks, cloud computing systems, and so on. We want only to build a bridge toward extraordinary achievement in life, and we look forward to interested friends joining us!
