Big data is hot right now. If you are still in school and want to get into big data development without any work experience, what big data skills do you need, and to what depth?

Below are the skills you need for big data development work. I hope they help you.

Java

We all know Java has several directions: JavaSE, JavaEE, and JavaME. Which one should you learn for big data? You only need the standard edition, JavaSE. Technologies like Servlet, JSP, Tomcat, Struts, Spring, Hibernate, and MyBatis belong to the JavaEE direction and are not used much in big data; it's enough to just understand them. Of course, you do need to know how to connect Java to a database — JDBC is a must.

Some students ask: Hibernate or MyBatis can connect to databases too, so why not learn them? I'm not saying they're bad, but learning them would take you a lot of time, and in the end they're rarely used in this line of work — I've never seen anyone use either of them for big data processing. Of course, if you have energy to spare, you can study the principles behind Hibernate or MyBatis rather than just their APIs; that will deepen your understanding of Java database access, since at their core both are JDBC plus various layers of wrapping.
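Since JDBC is the one piece the text says you must master, here is a minimal sketch of the pattern. The URL, credentials, table, and class name are all hypothetical; actually getting a row back requires a running MySQL server and the Connector/J driver on the classpath — without them the query simply fails and the method returns null.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class JdbcSketch {
    /** Looks up a user's name by id; returns null if the query fails. */
    static String fetchUserName(String url, int id) {
        // try-with-resources closes the connection and statement automatically.
        try (Connection conn = DriverManager.getConnection(url, "root", "password");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT name FROM users WHERE id = ?")) {
            ps.setInt(1, id); // bind the parameter -- never concatenate SQL strings
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString("name") : null;
            }
        } catch (SQLException e) {
            // No server reachable / no driver on the classpath lands here.
            return null;
        }
    }

    public static void main(String[] args) {
        // Hypothetical URL -- needs a MySQL server plus Connector/J to succeed.
        System.out.println(fetchUserName("jdbc:mysql://localhost:3306/testdb", 1));
    }
}
```

The connect / prepare / bind / iterate-result-set cycle shown here is exactly what Hibernate and MyBatis generate under the hood.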




Linux

Big data software runs on Linux, so build a solid Linux foundation. Learning Linux well will help you pick up big data technologies much faster and give you a better understanding of the runtime and network configuration of Hadoop, Hive, HBase, Spark and other big data software, letting you step into far fewer pitfalls. Learning shell scripting will make it easier for you to read code and configure big data clusters, and it will also flatten the learning curve for whatever big data technologies come along in the future.

With this foundation in place, let's talk about which big data technologies to learn. You can follow the order I list below.


Hadoop

This is now the popular big data processing platform and has become almost synonymous with big data, so it's a must-learn. Hadoop includes several components: HDFS, MapReduce, and YARN. HDFS is where we store data, like a computer's hard disk — files are stored on it. MapReduce is what processes and computes the data; its defining trait is that no matter how large the data, given enough time it will finish — but that time may not be short, which is why it's called batch processing.
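The MapReduce model itself is easy to see in miniature. This toy word count in plain Java (no Hadoop involved; the class and method names are my own) mimics the two phases: the "map" step emits (word, 1) pairs, and the "reduce" step sums the counts per word.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MiniMapReduce {
    /** Word count in the MapReduce style: map lines to words, reduce by summing. */
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {                      // "map" phase: emit (word, 1)
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum); // "reduce" phase: sum per key
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("big data is big", "data data");
        System.out.println(wordCount(lines)); // {big=2, data=3, is=1}
    }
}
```

On a real cluster the map and reduce phases run on different machines and the framework shuffles the pairs between them, but the logical shape of the job is the same.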

YARN is an important component that embodies the idea of Hadoop as a platform. With it, other software in the big data ecosystem can run on Hadoop, so we can better exploit the large storage of HDFS and save resources. For example, we don't need to build a separate cluster just for Spark; we can run it directly on the existing Hadoop YARN.


In fact, once you've learned these Hadoop components you can already process big data, though you may still not have a very clear sense of what "big data" really means. Don't get hung up on that. Once you're working you'll run into plenty of scenarios with dozens or even hundreds of terabytes of data, and by then you won't find big data so charming — the bigger it gets, the bigger your headache. But don't be afraid of large-scale data either, because that's exactly where your value lies — enough to make the JavaEE, HTML5, and PHP folks, and the DBAs, envious.


Remember: reaching this point can serve as one milestone in your big data learning.


Zookeeper


This is a cure-all. You'll use it when installing Hadoop HA, and HBase will use it later too. It's generally used to store coordination information shared between services; this information is small, usually no more than 1 MB. The software that depends on ZooKeeper is what actually uses it — for us personally, it's enough to install it correctly and keep it running normally.


MySQL

We've covered processing big data; next comes MySQL, a tool for handling small data, because you'll need it when installing Hive. To what degree do you need to know MySQL? Enough to install it on Linux, get it running, do simple permission configuration, change the root password, and create a database. The main thing here is to learn SQL syntax, because Hive's syntax is very similar to it.


Sqoop


This is used to import data from MySQL into Hadoop. Of course, you can also skip it and simply export MySQL tables to files and put them on HDFS — the result is the same. In production, though, be mindful of the load this puts on MySQL.


Hive


For anyone who knows SQL syntax, this thing is a godsend. It makes processing big data simple — no more struggling to write MapReduce programs. Some people ask: what about Pig? It's about the same as Pig; mastering one of the two is enough.


Oozie


Now that you've learned Hive, I'm sure you'll need this. It can manage your Hive, MapReduce, or Spark scripts, check whether your jobs executed correctly, send you an alert and retry the job when something fails, and — most importantly — let you configure dependencies between tasks. I'm sure you'll come to love it; otherwise, staring at that big pile of scripts and a densely packed crontab will make you feel like dying.
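For a feel of what "configuring task dependencies" looks like, here is a rough sketch of an Oozie workflow definition. The names, script path, and schema versions are illustrative, not taken from a real deployment: a single Hive action with explicit success and failure transitions.

```xml
<workflow-app xmlns="uri:oozie:workflow:0.5" name="daily-hive-job">
  <start to="run-hive"/>
  <action name="run-hive">
    <hive xmlns="uri:oozie:hive-action:0.5">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>daily_report.hql</script>
    </hive>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <fail name="fail">
    <message>Hive job failed</message>
  </fail>
  <end name="end"/>
</workflow-app>
```

The `ok`/`error` transitions are where the retry-and-alert behavior described above gets wired in; chaining several actions this way replaces the crontab the text complains about.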


HBase


This is the NoSQL database in the Hadoop ecosystem. Its data is stored as key/value pairs, and each key is unique, so it can be used for deduplicating data. Compared with MySQL it can store far larger volumes of data, which is why it's often used as the storage destination once big data processing is complete.
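The key/value-with-unique-key idea is easy to demonstrate without HBase itself. In this plain-Java sketch (the class name and row keys are mine), writing the same row key twice simply overwrites the old value — which is exactly why such a store deduplicates.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class KeyValueDedup {
    // A sorted map stands in for an HBase table: one value per unique row key.
    private final TreeMap<String, String> table = new TreeMap<>();

    void put(String rowKey, String value) {
        table.put(rowKey, value); // same key again overwrites -> deduplication
    }

    String get(String rowKey) {
        return table.get(rowKey);
    }

    List<String> rowKeys() {
        return new ArrayList<>(table.keySet());
    }

    public static void main(String[] args) {
        KeyValueDedup store = new KeyValueDedup();
        store.put("user:1001", "alice");
        store.put("user:1002", "bob");
        store.put("user:1001", "alice-updated"); // duplicate key: overwritten, not added
        System.out.println(store.rowKeys());        // [user:1001, user:1002]
        System.out.println(store.get("user:1001")); // alice-updated
    }
}
```

Real HBase adds column families, versions, and distribution across region servers, but "one row per unique key, sorted by key" is the core model.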


Kafka


This is a rather handy queueing tool. What is a queue for? You know lining up to buy tickets? When there's a lot of data, it also needs to queue up to be processed. That way the colleague working with you won't cry out, "Why are you giving me so much data (say, files of several hundred GB)? How am I supposed to handle all that?" Don't blame him — he's not in big data. You can tell him: I'll put the data in a queue, and you take items one by one as you need them. Then he'll stop complaining and sheepishly go off to optimize his program right away,


because if he can't keep up, that's now his problem, not a problem with what you delivered. Of course, we can also use this tool to load real-time online data into a database or into HDFS. In that case you can pair it with a tool called Flume, which specializes in doing simple processing on data and writing it out to various receivers (such as Kafka).
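The "put it in a queue and let him take items one by one" idea can be sketched with a plain Java `BlockingQueue` rather than Kafka (class and method names are mine). A small bounded queue also shows backpressure: when the queue is full, the fast producer is forced to wait for the slower consumer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueDemo {
    /** Moves n records through a small bounded queue; returns how many were consumed. */
    static int transfer(int n) throws InterruptedException {
        // Small capacity: when the queue is full, put() blocks the fast producer.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(4);
        List<String> processed = new ArrayList<>();

        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= n; i++) {
                    queue.put("record-" + i); // blocks until the consumer makes room
                }
                queue.put("DONE"); // sentinel telling the consumer to stop
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    String record = queue.take(); // one item at a time, at his own pace
                    if (record.equals("DONE")) break;
                    processed.add(record);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
        return processed.size();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(transfer(10) + " records consumed");
    }
}
```

Kafka adds durability, partitioning, and replication on top of this, but the producer/consumer decoupling is the same shape.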


Spark


It exists to make up for the slowness of MapReduce-based data processing. Its defining trait is loading data into memory for computation instead of reading from painfully slow hard disks, which makes it especially well suited to iterative computation — so the algorithm crowd is particularly fond of it. It's written in Scala, but you can drive it from either Java or Scala, since both run on the JVM.
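A rough illustration of why in-memory iteration matters, in plain Java with no Spark (names and numbers are mine): the data is loaded once, then an iterative algorithm makes many passes over the same cached array. Here gradient descent converges on the value minimizing squared distance to the data, which is just the mean — a stand-in for the repeated-pass algorithms Spark accelerates.

```java
public class IterativeInMemory {
    /**
     * Gradient descent toward the value minimizing squared distance to the
     * data (i.e., the mean) -- a stand-in for iterative algorithms that make
     * many passes over the same dataset held in memory.
     */
    static double fit(double[] data, int iterations, double learningRate) {
        double estimate = 0.0;
        for (int it = 0; it < iterations; it++) {     // many passes ...
            double gradient = 0.0;
            for (double x : data) {                   // ... over data already in memory
                gradient += (estimate - x);
            }
            estimate -= learningRate * gradient / data.length;
        }
        return estimate;
    }

    public static void main(String[] args) {
        double[] data = {1.0, 2.0, 3.0, 4.0};
        System.out.println(fit(data, 100, 0.5)); // converges toward the mean, 2.5
    }
}
```

With MapReduce, each of those 100 passes would be a separate job re-reading its input from disk; keeping the dataset in memory between passes is Spark's whole point.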



Origin blog.51cto.com/14516202/2432585