Do you need to learn big data?

Do you need to learn big data? Many people have asked me this question. Every time I answered, I felt my reply was too one-sided, and there never seemed to be a good opportunity to summarize the topic properly, until I started writing this piece. Big data is an industry that has risen in recent years; the technology has developed rapidly and matured through many iterations, and new things keep emerging, so the only way to stay competitive is to keep learning.

Mind Map

Below is a mind map I put together. The content is divided into several large blocks, including distributed computing and query, distributed scheduling and management, persistent storage, the programming languages commonly used in big data, and more. Each category covers a great many open-source tools, which makes big data projects the kind of tough battle that programmers love to hate.

Learning big data development is not trivial. A complete beginner should first build a foundation in the Java language; in general, learning Java SE and EE takes about three months. After that you can move into the big data technology stack itself, which mainly means learning Hadoop, Spark, Storm, and so on.

(Mind map: overview of the big data technology landscape)

Programming Languages for Big Data

Java

Java can be called the most fundamental programming language in big data. In my experience over these years, a large share of big data developers transferred over from Java web development (this is not absolute, of course; I've even seen product managers move into big data development, which is pretty astonishing).

  • First, the essence of big data computing is nothing more than the storage and querying of huge amounts of data, which is close to the high-volume data access scenarios that back-end development already deals with.

  • Second, Java skills carry a natural advantage, because many big data components are written in Java, such as HDFS, Yarn, HBase, MapReduce, and Zookeeper. If you want to study them in depth and climb out of every kind of pit a production environment throws at you, you first have to learn to chew through the Java source code.

Speaking of chewing through source code: it will certainly be hard at the beginning, because it demands a deeper understanding of both the component itself and its implementation language, but practice makes perfect. Once you get past that stage and grow used to solving problems by reading the source, you'll find the source code really does taste great.

Scala

Scala and Java are very similar: both run on the JVM, and during development they can call each other seamlessly. Most of Scala's influence in the big data field comes from two community stars, Spark and Kafka, which everyone should know (I'll have articles introducing them from multiple angles later); their strong growth directly drove Scala's popularity in this field.
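
As a minimal sketch of that interoperability (the class and object names here are just for illustration), a Scala program can instantiate and call plain Java classes directly, with no bridge code:

```scala
// Interop.scala: Scala code calling Java standard-library classes directly.
import java.time.LocalDate            // a plain Java class
import java.util.{ArrayList => JList} // a Java collection, renamed on import

object Interop {
  def main(args: Array[String]): Unit = {
    val dates = new JList[LocalDate]()       // instantiate a Java class from Scala
    dates.add(LocalDate.now())
    dates.add(LocalDate.now().plusDays(1))
    dates.forEach(d => println(s"date: $d")) // Scala lambda passed to a Java method
  }
}
```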

Python and Shell

Shell needs little introduction; it is extremely common and belongs in every programmer's basic toolkit. Python is used more in the data mining field and for writing everyday scripts that are complex and hard to implement in shell.

Distributed Computing

What is distributed computing? Distributed computing studies how to split a problem that would require enormous computing power into many small parts, assign those parts to many servers for processing, and finally combine the partial results into the final answer.

For example, it's like a team lead splitting up a big project and having each team member develop one part; in the end everyone's code is merged and the project is done. It sounds simple, but anyone who has actually worked on a large project knows there is a lot involved in between.

For instance: how should the big project be split? How are tasks assigned? What about the work each person already has on hand? What if people's abilities differ? What if their development speeds differ? What if a member falls ill mid-project and takes long leave, who picks up their work? What if the team lead who directs and pushes everyone takes leave? What if the final code merge goes wrong? What if the project slips? What if the project dies altogether?

Think carefully about those ten deadly questions: each one corresponds to a problem that can arise in distributed computing. I'll leave the exact mapping for you to work out; it's fairly obvious. Some may feel these issues don't really matter in multi-person development and need no special handling, but a distributed computing system is different: every one of them is a serious and fundamental problem that needs a good solution.
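
To make the split-distribute-combine idea concrete, here is a minimal Spark word count in Scala; this is only a sketch, assuming a local Spark setup and a hypothetical input path:

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("word-count")
      .master("local[*]") // run locally, one worker thread per core
      .getOrCreate()

    // "Split": the file is divided into partitions spread across workers.
    val lines = spark.sparkContext.textFile("hdfs:///tmp/input.txt") // hypothetical path

    val counts = lines
      .flatMap(_.split("\\s+"))    // "Distribute": each partition is processed
      .map(word => (word, 1))      // independently on whichever node holds it
      .reduceByKey(_ + _)          // "Combine": partial counts are merged

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```

Each partition is counted without coordinating with the others, and only the small per-word tallies are shuffled and merged; that is exactly the split-assign-combine cycle described above.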

Finally, the currently popular distributed computing tools include:

  • Offline (batch) tools: Spark, MapReduce, etc.

  • Real-time tools: Spark Streaming, Storm, Flink, etc.

We'll talk about the differences between these and their respective use cases another time.

Distributed Storage

Traditional network storage systems use a centralized storage server to hold all data. A single storage server's I/O capacity is limited, which becomes the system's performance bottleneck, and a single server also cannot meet reliability and security requirements, especially for large-scale storage applications.

A distributed storage system spreads data across many independent devices. It uses a scalable architecture: multiple storage servers share the storage load, and a location server tracks where data is stored. This not only improves the system's reliability, availability, and access efficiency, it is also easy to scale out.

(Figure: HDFS storage architecture)

The figure above shows the HDFS storage architecture. As a distributed file system, HDFS offers both reliability and scalability: each piece of data is stored as three copies on different machines (two replicas on the same rack, one on another rack) so that no data is lost, the NameNode manages all metadata centrally, and the cluster can be scaled out at will.
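
As a small taste of what talking to HDFS looks like in code, here is a sketch that writes a file through the standard Hadoop FileSystem API and asks the NameNode for its metadata; the cluster address and file path are assumptions:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsDemo {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    conf.set("fs.defaultFS", "hdfs://namenode:8020") // hypothetical NameNode address
    val fs = FileSystem.get(conf)

    // Write a small file; HDFS replicates its blocks (3 copies by default).
    val path = new Path("/tmp/hello.txt")
    val out = fs.create(path)
    out.writeBytes("hello hdfs\n")
    out.close()

    // Metadata queries like this one are answered by the NameNode.
    val status = fs.getFileStatus(path)
    println(s"replication = ${status.getReplication}, size = ${status.getLen}")
    fs.close()
  }
}
```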

There are many mainstream distributed databases: HBase, MongoDB, Greenplum, Redis, and so on. None of them is simply better or worse; there is only a better or worse fit, because each database targets different use cases, so comparing them head-on is meaningless. In later articles I'll walk through their use cases, principles, and architectures one by one.

Distributed Scheduling and Management

These days everyone seems keen on talking about "decentralization", a trend perhaps driven by blockchain. But "centralization" is still important in the big data field, at least for now.

  • Managing a distributed cluster requires a component that allocates and schedules resources to the individual nodes; that thing is called Yarn.

  • You need a component that solves the "lock" problem in a distributed environment; that thing is called Zookeeper (see the sketch after this list).

  • You need a component that records the dependencies between tasks and schedules them on a timer; that thing is called Azkaban.

Of course, these "things" are not the only options; all of them have plenty of alternatives. I've only listed a few of the more commonly used examples here.
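
To illustrate the "lock" point above, here is a sketch of a Zookeeper-backed distributed mutex using the Apache Curator recipes; the connection string and lock path are assumptions:

```scala
import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.framework.recipes.locks.InterProcessMutex
import org.apache.curator.retry.ExponentialBackoffRetry

object LockDemo {
  def main(args: Array[String]): Unit = {
    // Connect to Zookeeper (hypothetical address) with exponential-backoff retries.
    val client = CuratorFrameworkFactory.newClient(
      "zk1:2181", new ExponentialBackoffRetry(1000, 3))
    client.start()

    // Every process that creates a mutex on the same znode path contends
    // for the same lock, even when the processes run on different machines.
    val lock = new InterProcessMutex(client, "/locks/demo")
    lock.acquire()
    try {
      println("holding the lock, doing critical work")
    } finally {
      lock.release() // the znode is released; the next waiter proceeds
    }
    client.close()
  }
}
```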

A Few More Words

Having answered that question, I want to say something else. I've been thinking for a long time recently and decided to start writing a series of articles to record what I've learned and thought about over these years. There is a lot of content and I didn't know where to start, so I drew the mind map at the top of this article to fix the overall direction. As everyone knows, the mainstream big data technologies change and iterate quickly and new things keep joining, so the contents of this map will also keep growing as circumstances change. I'll settle the details as I write; you are welcome to give me suggestions, and I'll keep updating the map and the outline below based on what gets written.

About the Grouping

Grouping the big data components above was actually rather agonizing, especially for a programmer with a touch of obsessive-compulsive disorder: some components could just as well sit in other groups, and I didn't want so many groups that the picture turns into a mess, so the grouping in the map above is somewhat subjective. It is certainly not absolute.

For example, a message queue like Kafka would not usually be grouped with databases or file systems such as HDFS, but they all provide distributed persistent storage, so I put them together. Then there is OpenTSDB, the time-series database: calling it a database is generous, since it is really just an application built on top of HBase. I think it is more about how data is queried and stored than about the storage itself, so I subjectively placed it in the "distributed computing and query" group, where the OLAP tools also went.

There are many cases like this; if you see things differently, feel free to raise objections and we can discuss.

Purpose

We all know that big data technology moves fast, and as programmers who want to stay competitive, we have no choice but to keep learning. My purpose in writing these articles is fairly simple: first, they serve as notes that help me organize my knowledge; second, I hope they help some people learn and understand big data.


Origin: blog.51cto.com/14342636/2416873