What programming foundation does Big Data require? What are the Big Data learning steps?

What is big data?

Many friends have asked me: what exactly is big data? Let me summarize it in one sentence for each audience.

For friends outside the software industry

Based on some of your everyday consumer behavior at supermarkets, gas stations, restaurants, and other places, big data technology can now work out your approximate age range, whether you are married, whether you have children and roughly how old they are, whether you own a home, roughly what price range your car is in, and other such information.

For friends in the software industry

The programs we usually write run on a single machine, so their processing capacity, and therefore the amount of data they can handle, is limited. Big data technology is essentially what lets our code run distributed across many machines, processing vast amounts of data in parallel and extracting valuable, meaningful information from it.

Basic skills required for learning big data

  1. A Linux foundation is a must; at minimum you need to master basic operations on the Linux command line.

  2. Java SE fundamentals (including MySQL). Note that this is Java SE, not Java EE; the JavaWeb body of knowledge is not required for big data engineers.

The main areas of big data technology

Data collection

Flume, Kafka, Logstash, Filebeat, ...

Data storage

MySQL, Redis, HBase, HDFS, ...

Although MySQL does not strictly belong in the big data category, I have listed it here because you cannot do without it at work.

Data query

Hive, Impala, Elasticsearch, Kylin, ...

Data computation

Real-time computing

Storm, Spark Streaming, Flink, ...

Offline computing

Hadoop, Spark, ...

Other frameworks

ZooKeeper, ...

In fact, learning big data means learning the various frameworks that make up the ecosystem surrounding the big data circle.

Big data learning steps

Although many frameworks are listed above, you do not necessarily need to learn them all at the very beginning, and even at work you will not necessarily use all of them.

Below I will roughly list a learning order for the frameworks:

Note: the order listed below is just my personal recommendation and can be adjusted according to your actual situation.

Linux foundation and Java SE fundamentals (including MySQL)

These are basic skills. When you are just starting out it is impossible to be very proficient; at minimum, get reasonably fluent with the basic Linux commands, since you will use them constantly when learning the frameworks later, and you will grow more familiar with use. For Java SE, I suggest mainly looking at object orientation, collections, IO, multithreading, and JDBC operations; that is enough.
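As a rough self-check on the Java SE plus MySQL side, a minimal JDBC sketch like the one below covers most of what the later frameworks assume. The database, table, and credentials are made up for illustration; assume a local MySQL with the Connector/J driver on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class JdbcDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical local database and table, just for practice.
        String url = "jdbc:mysql://localhost:3306/test";
        try (Connection conn = DriverManager.getConnection(url, "root", "password");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT id, name FROM user WHERE id > ?")) {
            ps.setInt(1, 0);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getInt("id") + " -> " + rs.getString("name"));
                }
            }
        }
    }
}
```

If try-with-resources, the collections classes, and this kind of statement/result-set handling feel comfortable, you are ready for the frameworks below.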

ZooKeeper

ZooKeeper is the foundation of many big data frameworks. Its Chinese name means "zoo keeper," which fits: many of the current big data frameworks have animal logos, so ZooKeeper can in fact manage a lot of them. For this framework, mainly learn how to set up a single node and a cluster, and how to perform create, read, update, and delete operations on ZooKeeper nodes from the zkCli client.
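For a feel of the client side beyond zkCli, here is a minimal sketch of the same create/read/update/delete operations through the Java client, assuming a local standalone ZooKeeper on the default port 2181; the znode path and data are made up.

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkCrudDemo {
    public static void main(String[] args) throws Exception {
        // Assumes a local standalone ZooKeeper on the default port.
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await(); // wait for the session before issuing commands

        zk.create("/demo", "hello".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);      // create
        System.out.println(new String(zk.getData("/demo", false, null))); // read
        zk.setData("/demo", "world".getBytes(), -1); // update (-1 matches any version)
        zk.delete("/demo", -1);                      // delete
        zk.close();
    }
}
```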

Hadoop

At present, enterprises generally use Hadoop 2.x, so there is no need to study the Hadoop 1.x versions. Hadoop 2.x contains three big pieces:

HDFS: early on, the key is to learn the HDFS shell commands: upload, download, delete, move, view, and so on.
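Those shell commands have direct equivalents in the Java FileSystem API, which the later frameworks use under the hood. A hedged sketch, with made-up paths and a namenode address you would adjust to your own cluster:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumes a namenode at this address; adjust to your cluster.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        fs.copyFromLocalFile(new Path("/tmp/local.txt"), new Path("/demo/in.txt")); // upload
        fs.copyToLocalFile(new Path("/demo/in.txt"), new Path("/tmp/copy.txt"));    // download
        fs.rename(new Path("/demo/in.txt"), new Path("/demo/moved.txt"));           // move
        fs.delete(new Path("/demo/moved.txt"), false);                              // delete (non-recursive)
        fs.close();
    }
}
```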

MapReduce: here the focus is on understanding the MR principles and their implementation in code. At work you will actually write MR code only a few times, but you still have to understand the principles.
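The canonical MR example is word count, which shows the principle: the map phase emits (word, 1) pairs, the framework shuffles them by key, and the reduce phase sums each word's counts. A sketch along the lines of the standard Hadoop example:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: emit (word, 1) for every word in the input split.
    public static class WcMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String w : value.toString().split("\\s+")) {
                word.set(w);
                ctx.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts shuffled together for each word.
    public static class WcReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WcMapper.class);
        job.setReducerClass(WcReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```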

YARN: a preliminary understanding is enough. You only need to know that YARN is a resource management platform responsible for allocating resources to tasks. It handles resource scheduling not only for MapReduce but also for Spark and others; YARN is a general resource scheduling platform, and any framework that meets its conditions can run on YARN and have its resources scheduled.

Hive

Hive is a data warehouse, and all of its data is stored on HDFS. For the specific difference between a data warehouse and a database, you can search online; there are plenty of descriptions. If you are already fairly familiar with MySQL, Hive is much simpler to use. Using Hive mainly means writing HQL, which is Hive's SQL language and very similar to the SQL of the MySQL database; in follow-up study, mainly understand some of Hive's syntactic features. In fact, when Hive executes HQL, the underlying execution still runs MapReduce programs.
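As an illustration of how MySQL habits carry over, here is a hedged sketch that runs HQL through Hive's JDBC driver; the HiveServer2 address and the logs table are made up for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveHqlDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Assumes a local HiveServer2; the table is made up for illustration.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "");
             Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE IF NOT EXISTS logs (ts STRING, level STRING) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'");
            // This aggregate HQL is compiled down to a MapReduce job underneath.
            try (ResultSet rs = st.executeQuery(
                    "SELECT level, COUNT(*) FROM logs GROUP BY level")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + ": " + rs.getLong(2));
                }
            }
        }
    }
}
```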

Note: Hive itself is actually very powerful, and data warehouse design is also very important at work, but in early learning it is mainly enough to learn how to use it. You can take a good look at Hive later.

HBase

HBase is a NoSQL database of the key-value type, and its underlying data is stored on HDFS. When learning HBase, the main things to master are row-key design and column-family design. One characteristic to note: queries based on the rowkey are efficient and can come back within seconds, but queries on column-family columns, especially combined queries, perform poorly when the data volume is large.
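A minimal sketch of that fast path, a put and a get by rowkey through the Java client; the user table, the info column family, and the rowkey scheme are made up to illustrate rowkey design, and the table is assumed to have been created beforehand (e.g., in the hbase shell).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRowkeyDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "localhost"); // HBase locates its cluster via ZooKeeper
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user"))) {
            // Example rowkey design: id plus date, so related rows sort together for range scans.
            String rowkey = "100042_20190801";
            Put put = new Put(Bytes.toBytes(rowkey));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);
            // Get by rowkey is the fast access path HBase is designed around.
            Result r = table.get(new Get(Bytes.toBytes(rowkey)));
            System.out.println(Bytes.toString(
                    r.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}
```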

Redis

Redis is also a NoSQL database of the key-value type, but it is a purely in-memory database: the data in a Redis database is stored in memory. Its characteristic, therefore, is suitability for scenarios that need fast reads and writes; reads and writes can reach around 100,000 per second. It is not suitable for storing massive data, though, since a single machine's memory is limited after all.

Of course, Redis also supports clustering, so it can store larger amounts of data too. When learning Redis, mainly master the string, list, set, sorted set, and hash data types, the differences between them and how to use them, as well as pipelines, which are very useful when storing data in bulk, and the transaction feature.
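A small sketch of those data types and of pipelining, using the common Jedis client against an assumed local Redis; all keys are made up.

```java
import java.util.List;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;

public class RedisDemo {
    public static void main(String[] args) {
        // Assumes a local Redis server on the default port 6379.
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            jedis.set("greeting", "hello");          // string
            jedis.lpush("recent", "a", "b", "c");    // list
            jedis.sadd("tags", "big-data", "cache"); // set
            jedis.zadd("scores", 99.5, "alice");     // sorted set
            jedis.hset("user:1", "name", "alice");   // hash

            // Pipeline: batch many commands into one round trip, useful for bulk loads.
            Pipeline p = jedis.pipelined();
            for (int i = 0; i < 1000; i++) {
                p.set("key:" + i, String.valueOf(i));
            }
            p.sync();

            List<String> recent = jedis.lrange("recent", 0, -1);
            System.out.println(recent);
        }
    }
}
```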

Flume

Flume is a log collection tool and is quite commonly used; the most common case is collecting the data generated in log files. There are generally two pipelines. In one, the data Flume collects is stored into Kafka, for later real-time processing by Storm or Spark Streaming. In the other, the data Flume collects is stored onto HDFS, for later offline processing by Hadoop or Spark. When studying Flume, the key is actually learning to read the Flume documentation on the official website and to understand the various configuration parameters, because using Flume means writing various configurations.
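To give a flavor of what "writing configurations" means, here is a hedged sketch of an agent that tails a log file into HDFS; the agent name, command, and paths are made up, and the authoritative parameter reference is the official documentation.

```properties
# Hypothetical agent "a1": tail a log file and write events to HDFS.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://localhost:9000/flume/events/%Y%m%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1
```

Swapping the HDFS sink for a Kafka sink gives the other pipeline mentioned above; the structure of the configuration stays the same.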

Kafka

Kafka is a message queue. In real-time processing scenarios it often serves as an intermediate buffer layer, e.g., Flume -> Kafka -> Storm/Spark Streaming. When learning Kafka, mainly master concepts and principles such as topic, partition, and replica.
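A minimal producer sketch in Java, assuming a local broker and a made-up topic named "logs":

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumes a local broker and a pre-created topic named "logs".
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key determines the partition, so records with the same key stay ordered.
            producer.send(new ProducerRecord<>("logs", "host-1", "app started"));
        }
    }
}
```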

Storm

Storm is a real-time computing framework. The difference from Hadoop is that Hadoop processes massive data offline, while Storm processes each piece of data as it arrives, one record at a time, which guarantees the timeliness of the processing. When learning Storm, mainly learn how to write topologies, how to adjust a topology's parallelism, and how to integrate Storm with Kafka to consume data in real time.
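A hedged sketch of a tiny topology, assuming the org.apache.storm 1.x-style API; in real use the spout would be the Kafka spout rather than this toy one, and the names here are made up.

```java
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class StormDemo {
    // Spout: emits one tuple per second; in real use this would read from Kafka.
    public static class LineSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        public void open(Map conf, TopologyContext ctx, SpoutOutputCollector collector) {
            this.collector = collector;
        }
        public void nextTuple() {
            Utils.sleep(1000);
            collector.emit(new Values("hello storm"));
        }
        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("line"));
        }
    }

    // Bolt: processes each tuple the moment it arrives, one at a time.
    public static class PrintBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector collector) {
            System.out.println(input.getStringByField("line"));
        }
        public void declareOutputFields(OutputFieldsDeclarer d) { }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new LineSpout());
        builder.setBolt("print", new PrintBolt(), 2) // parallelism hint: 2 executors
               .shuffleGrouping("spout");
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("demo", new Config(), builder.createTopology());
    }
}
```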

Spark

Spark is also developing very well now and has grown into an ecosystem of its own, containing many technologies: Spark Core, Spark Streaming, Spark MLlib, and Spark GraphX.

The Spark ecosystem contains Spark Core for offline processing and Spark Streaming for real-time processing. One thing to note here: Storm and Spark Streaming are both real-time processing frameworks, but the main difference is that Storm genuinely processes records one by one, while Spark Streaming processes them batch by batch.

Spark contains many frameworks; at the beginning, it is enough to study Spark Core and Spark Streaming, which is what general big data work uses. Spark MLlib and Spark GraphX can wait until work requires them or you have time to study later.
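For a first taste of Spark Core, here is a hedged word-count sketch using the Java API in local mode; the input path is made up.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkCoreDemo {
    public static void main(String[] args) {
        // local[2] runs the job in-process with two threads, handy for learning.
        SparkConf conf = new SparkConf().setAppName("wordcount").setMaster("local[2]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("/tmp/input.txt"); // made-up path
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);
            counts.collect().forEach(t -> System.out.println(t._1() + ": " + t._2()));
        }
    }
}
```

Note the same shape as the MapReduce word count above, in far fewer lines; Spark Streaming applies the same operations to micro-batches of a stream.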

Elasticsearch

Elasticsearch is a full-text search engine for real-time queries over huge amounts of data, with support for distributed clusters; under the hood it is based on Lucene. At query time it supports fast fuzzy queries as well as count, distinct, sum, avg, and similar operations, but it does not support join operations.

Elasticsearch also has its own ecosystem: ELK (Elasticsearch, Logstash, Kibana) is a typical end-to-end solution for log collection, storage, fast search, and charts. When learning Elasticsearch, mainly learn how to perform CRUD operations against ES, the concepts of index, type, and document in ES, and how to design ES mappings.
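A small CRUD sketch, assuming an Elasticsearch 7.x single node and its high-level REST client; the logs index and its fields are made up.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.get.GetRequest;
import org.elasticsearch.action.get.GetResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class EsCrudDemo {
    public static void main(String[] args) throws Exception {
        // Assumes a local single-node ES on port 9200; index and fields are made up.
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {
            // Create: index a JSON document with an explicit id.
            IndexRequest index = new IndexRequest("logs").id("1")
                    .source("{\"level\":\"ERROR\",\"msg\":\"disk full\"}", XContentType.JSON);
            client.index(index, RequestOptions.DEFAULT);
            // Read: fetch the document back by id.
            GetResponse got = client.get(new GetRequest("logs", "1"), RequestOptions.DEFAULT);
            System.out.println(got.getSourceAsString());
        }
    }
}
```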

That is all I will list for now; there are still many other good technical frameworks in the big data ecosystem, but those can wait to be explored after you start working.

In fact, of the dozen or so frameworks listed above, it is best while learning to pick one or two to focus on in depth, covering the underlying principles, optimization, source code, and so on, so that you can stand out in interviews. Do not try to become proficient in every framework; for now that is unrealistic.

If you can generally use the frameworks above and have studied one or two of them in greater depth, then finding a satisfying big data job becomes a matter of course.

Origin blog.51cto.com/14459670/2422059