A Big Data Learning Roadmap for Beginners

The figure shows a simplified big data processing flow. Its main stages are data collection, data storage, data processing, and data application.

Language Basics

1. Java

Most big data frameworks are developed in Java, and nearly all of them provide a Java API. Java is also a mainstream back-end development language, so free online learning resources are plentiful. If you prefer to learn from books, these are good starting points:

"Java Programming Logic": a systematic introduction to Java written by a Chinese author; easy to follow and comprehensive;

"Core Java": the latest edition at the time of writing is the 10th, in two volumes (Volume I and Volume II). Volume II can be read selectively, because many of its chapters are rarely needed in day-to-day development.

Most frameworks now require Java 1.8 (Java 8) or later. Java 8 introduced functional programming features such as lambda expressions, which let you implement the same functionality with much more compact code. For example, calling the Spark API from Java 8 can take several times less code than from Java 7. For this reason, "Java 8 in Action" is recommended as additional reading.
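To make the difference concrete, here is a minimal sketch in plain Java that does not depend on Spark. It writes the same filter-and-transform logic once with a Java 7 style loop and once with a Java 8 stream pipeline. The class name `LambdaComparison` and the methods `longNamesJava7` and `longNamesJava8` are made up for this illustration; they are not part of any framework.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical example class, used only to compare the two styles.
class LambdaComparison {

    // Java 7 style: filter and transform with an explicit loop.
    static List<String> longNamesJava7(List<String> names, int minLength) {
        List<String> result = new ArrayList<>();
        for (String name : names) {
            if (name.length() >= minLength) {
                result.add(name.toUpperCase());
            }
        }
        return result;
    }

    // Java 8 style: the same logic as a stream pipeline with a lambda
    // and a method reference.
    static List<String> longNamesJava8(List<String> names, int minLength) {
        return names.stream()
                .filter(name -> name.length() >= minLength)
                .map(String::toUpperCase)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> frameworks = Arrays.asList("Spark", "Flink", "Hive", "Storm");
        // Both calls print [SPARK, FLINK, STORM]
        System.out.println(longNamesJava7(frameworks, 5));
        System.out.println(longNamesJava8(frameworks, 5));
    }
}
```

The saving is modest in a toy example like this one. It tends to grow with APIs that take many function arguments, which is the situation the Spark remark above refers to.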

2. Scala

Scala is a statically typed programming language that combines object-oriented and functional programming. It runs on the Java Virtual Machine and works seamlessly with Java class libraries. The well-known Kafka was developed largely in Scala.

Why learn Scala? The most popular computing frameworks today, Flink and Spark, both provide Scala APIs, and developing against them in Scala takes less code than in Java 8. Spark itself is written in Scala, so learning Scala helps you understand Spark more deeply. Again, for readers who like to learn from books, here are two introductory titles:

"Scala for the Impatient"

"Programming in Scala"

Note that if your time is limited, you do not have to finish learning Scala before moving on to the big data frameworks. Scala is concise and flexible, but it is somewhat more complex than Java. Concepts such as implicit conversions and implicit parameters can be hard to grasp at first. You can therefore get to know Spark first and learn Scala afterwards, because concepts like implicit conversions are used heavily in the Spark source code.

Linux Basics

Big data frameworks are usually deployed on Linux servers, so some knowledge of Linux is necessary. Among Linux books, the well-known "Brother Bird's Linux Private Kitchen" series is very comprehensive and a classic. If you want to get started quickly, "This Is How Linux Should Be Learned" is recommended; a free e-book version is available on its website.

Build Tools

The main automated build tool you need to master is Maven. Maven is widely used in big data work, mainly in three ways:

Managing the JAR packages of a project, which helps you build big data applications quickly;

Whether your project is written in Java or Scala, you need Maven to compile and package it before submitting it to a cluster to run;

Most big data frameworks manage their source code with Maven, so you need Maven whenever you have to compile an installation package from source.

Learning the Frameworks

1. Framework Classification

Many big data frameworks appear above; here is a summary by category:

Log collection frameworks: Flume, Logstash, Kibana (Kibana is the visualization front end commonly paired with Logstash)

Distributed file storage system: Hadoop HDFS

Database systems: MongoDB, HBase

Distributed computing frameworks: batch processing (Hadoop MapReduce); stream processing (Storm); hybrid processing (Spark, Flink)

Query and analysis frameworks: Hive, Spark SQL, Flink SQL, Pig, Phoenix

Cluster resource manager: Hadoop YARN

Distributed coordination service: ZooKeeper

Data migration tool: Sqoop

Task scheduling frameworks: Azkaban, Oozie

Cluster deployment and monitoring: Ambari, Cloudera Manager

The frameworks listed above are all mainstream, with active communities and abundant learning resources. It is recommended to start with Hadoop, because it is the cornerstone of the whole big data ecosystem and the other frameworks depend on it directly or indirectly. After that you can learn a computing framework. Spark and Flink are both mainstream hybrid processing frameworks. Spark appeared earlier, so it is more widely used. Flink is currently the most popular of the new generation of hybrid processing frameworks, and its many excellent features have won it favor with many companies. You can learn either or both, depending on personal preference or the needs of your work.

For the remaining frameworks there is no fixed learning order. If your study time is limited, it is enough at first to master one framework of each type. Log collection is an example: there are many such frameworks, but at the start you only need to master one well enough to get log collection done, and you can learn others in a targeted way later when work requires it.

2. Learning Materials

The most authoritative and comprehensive learning material for big data is the official documentation. Popular big data frameworks have active communities and iterate quickly, so published books lag noticeably behind the current versions. For this reason, books are not the best way to learn them. Fortunately, the official documentation of big data frameworks is generally well written: complete, focused, and supported by many diagrams. Of course, some excellent books have stood the test of time and remain classics. Here are some that I have read personally:

"Hadoop: The Definitive Guide" (2017)

"Kafka: The Definitive Guide" (2017)

"From Paxos to ZooKeeper: Distributed Consensus Principles and Practice" (2015)

"In-Depth Analysis of Spark: Core Architecture Design and Implementation Principles" (2015)

"Spark: The Definitive Guide" (2018)

"HBase: The Definitive Guide" (2012)

"Programming Hive" (2013)

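3. Video Learning Materials

Everything I recommended above is a book, and I rarely recommend video courses. The reason is that books have stood the test of time: those that are reprinted or rated highly on platforms such as Douban have been recognized by the public. Statistically they are more likely to be good and less likely to waste your time and energy, so I personally prefer official documentation or books to videos. Video courses lack a common rating platform and a mature rating mechanism, so their quality is uneven.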

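Development Tools

Here are some development tools commonly used in big data work:

Java IDE: either IDEA or Eclipse will do. From personal habit, I prefer IDEA;

VirtualBox: while learning, you will often need to set up services and clusters on virtual machines. VirtualBox is free, open-source virtual machine management software. It is lightweight but feature-rich, and it covers most everyday needs;

MobaXterm: big data frameworks are usually deployed on servers, and MobaXterm is recommended for connecting to them. It is also available free of charge, supports multiple connection protocols and drag-and-drop file upload, and can be extended with plugins;

Translate Man: a free translation extension for the browser. It uses Google's translation API, is highly accurate, supports translating selected text, and can help when reading official documentation.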



Origin blog.csdn.net/juan333/article/details/104280797