Big Data Development Learning Paths:
The first stage: Hadoop ecosystem technologies
1. Basic languages
Java: emphasize understanding and hands-on practice; get familiar with JVM memory management, multithreading, thread pools, design patterns, and parallelization, though deep mastery is not required.
Linux: installation, basic commands, network configuration, the Vim editor, process management, shell scripts, virtual machines, and so on.
Python: basic syntax, data structures, functions, conditionals, and loops.
2. Prepare the environment
Here we build a fully distributed cluster on a Windows computer, with one master and two slave nodes.
You need VMware, a Linux system (CentOS 6.5), and the Hadoop installation package; with these, prepare a fully distributed Hadoop cluster environment.
3. MapReduce
MapReduce is Hadoop's distributed offline computing framework and its core programming model.
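The map → shuffle → reduce flow can be sketched in plain Python. This is a toy word-count simulation of the programming model, not the Hadoop API; all function names are illustrative:

```python
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts collected for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hadoop spark hadoop", "spark streaming"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'hadoop': 2, 'spark': 2, 'streaming': 1}
```

In real Hadoop the shuffle also sorts and partitions keys across machines; the three-function structure is the part the programmer writes.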
4. HDFS
HDFS is Hadoop's distributed file system; it provides high-throughput access to application data on large data sets.
5. Yarn
Yarn is a resource-management platform responsible for allocating resources to tasks.
6. Hive
Hive is a data warehouse whose data is all stored on HDFS; using Hive mainly means writing HQL.
7. Spark
Spark is a fast, general-purpose computing engine designed for large-scale data processing.
8. Spark Streaming
Spark Streaming is a near-real-time processing framework; it processes data in micro-batches, batch by batch.
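The micro-batch idea can be sketched without Spark at all: incoming records are grouped into small batches, and each batch is processed as a unit. In this toy illustration a fixed batch size stands in for Spark Streaming's batch interval:

```python
def micro_batches(stream, batch_size):
    # Group incoming records into fixed-size batches, the way
    # Spark Streaming groups records arriving within one interval.
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# Each batch is then processed as a whole, e.g. summed.
stream = [3, 1, 4, 1, 5, 9, 2, 6]
batch_sums = [sum(b) for b in micro_batches(stream, batch_size=3)]
print(batch_sums)  # [8, 15, 8]
```

The latency floor of this model is the batch interval: a record is not visible in any result until its batch closes.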
9. Hive on Spark
With Spark as Hive's execution engine, Hive queries are submitted as Spark jobs and computed on the Spark cluster, which can improve Hive query performance.
10. Storm
Storm is a real-time computing framework: each new piece of data is processed immediately as it arrives, one record at a time, which guarantees the timeliness of data processing.
11. Zookeeper
Zookeeper is the foundation of many big data frameworks and serves as the cluster's coordination manager.
12. HBase
HBase is a NoSQL database: highly reliable, column-oriented, scalable, and distributed.
13. Kafka
Kafka is message middleware that acts as an intermediate buffer layer between producers and consumers.
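The buffering role can be sketched with a plain in-memory queue: producers append and consumers drain at their own pace, and neither side talks to the other directly. This is a toy analogy, not the Kafka client API:

```python
from collections import deque

# The deque stands in for a Kafka topic: producers append,
# consumers pop, and the queue absorbs any speed mismatch.
topic = deque()

def produce(messages):
    for msg in messages:
        topic.append(msg)

def consume(max_messages):
    # The consumer drains at its own pace, up to max_messages.
    out = []
    while topic and len(out) < max_messages:
        out.append(topic.popleft())
    return out

produce(["log-1", "log-2", "log-3"])
first = consume(2)   # ['log-1', 'log-2']
produce(["log-4"])   # the producer keeps going while messages wait
rest = consume(10)   # ['log-3', 'log-4']
```

Real Kafka adds durability, partitioning, and consumer groups on top of this basic decoupling idea.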
14. Flume
Flume is most commonly used to collect data from the log files produced by applications, and there are generally two pipelines.
In one, Flume delivers the collected data to Kafka, so that Storm or Spark Streaming can process it in real time.
In the other, Flume writes the collected data to HDFS for later offline processing with Hadoop or Spark.
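The two pipelines can be sketched as one collector fanning each log line out to two sinks; here plain lists stand in for Kafka and HDFS, and all names are illustrative:

```python
# Stand-ins for the two downstream systems.
kafka_sink = []  # consumed in real time by Storm / Spark Streaming
hdfs_sink = []   # kept for later offline Hadoop / Spark jobs

def collect(log_lines):
    # Like a Flume agent configured with two sinks: every
    # collected line is delivered to both destinations.
    for line in log_lines:
        kafka_sink.append(line)
        hdfs_sink.append(line)

collect(["GET /index 200", "POST /login 302"])
```

The point of the split is that the same raw logs feed both the real-time path and the offline path without the application knowing about either.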
The second stage: data-mining algorithms
1. Chinese word segmentation
Offline and online use of open-source segmentation libraries.
2. Natural language processing
Text-relevance algorithms.
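One simple text-relevance measure is cosine similarity over bag-of-words count vectors. A minimal sketch (whitespace tokenization; real Chinese text would first need the segmentation step above):

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    # Represent each text as a word-count vector, then
    # measure the cosine of the angle between the vectors.
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine_similarity("big data spark", "big data hadoop"))  # ~0.667
print(cosine_similarity("spark", "flume"))                     # 0.0
```

Production relevance ranking usually weights counts by TF-IDF first, so that common words contribute less than rare, discriminative ones.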
3. Recommendation algorithms
Content-based (CB) and collaborative filtering (CF) approaches, normalization, and applying Mahout.
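The collaborative-filtering (CF) idea fits in a few lines: score the items a target user has not rated by the ratings of similar users. This is a toy user-based CF sketch, not Mahout's API; all names are illustrative:

```python
import math

def similarity(u, v):
    # Cosine similarity between two users' rating dicts.
    common = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in common)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(target, others):
    # Weight every other user's ratings by their similarity to the
    # target, then rank the items the target has not rated yet.
    scores = {}
    for other in others:
        w = similarity(target, other)
        for item, rating in other.items():
            if item not in target:
                scores[item] = scores.get(item, 0.0) + w * rating
    return sorted(scores, key=scores.get, reverse=True)

alice = {"A": 5, "B": 4}
others = [{"A": 5, "B": 5, "C": 4}, {"A": 1, "D": 5}]
print(recommend(alice, others))  # ['C', 'D']
```

Item 'C' ranks first because it is recommended by the user whose tastes most resemble Alice's; this is the core intuition Mahout scales up to large rating matrices.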
4、分类算法
NB、SVM
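Naive Bayes (NB) can be sketched from first principles: pick the class that maximizes the prior times the product of per-word likelihoods, with add-one smoothing. A minimal multinomial NB, purely illustrative:

```python
import math
from collections import Counter, defaultdict

def train(samples):
    # samples: list of (text, label) pairs. Count words per class.
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for text, label in samples:
        class_counts[label] += 1
        for w in text.split():
            word_counts[label][w] += 1
            vocab.add(w)
    return word_counts, class_counts, vocab

def predict(text, model):
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for label in class_counts:
        # log prior + sum of log likelihoods with add-one smoothing
        lp = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.split():
            lp += math.log((word_counts[label][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = train([("spark hadoop hdfs", "tech"), ("cat dog pet", "animal")])
print(predict("hadoop spark", model))  # 'tech'
```

Working in log space avoids the numeric underflow that multiplying many small probabilities would cause.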
5、回归算法
LR、DecisionTree
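The simplest regression worth knowing cold is one-variable least squares, which has a closed-form solution. A minimal sketch (illustrative only; the list's "LR" may refer to either linear or logistic regression):

```python
def fit_line(xs, ys):
    # Ordinary least squares for y = slope * x + intercept,
    # via the closed-form solution for a single feature.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]        # exactly y = 2x + 1
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```

Decision-tree regression instead partitions the input space and predicts a constant per region, which handles non-linear relationships that a single line cannot.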
6、聚类算法
层次聚类、Kmeans
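K-means is compact enough to write out in full: alternately assign points to their nearest center and move each center to its cluster's mean. A minimal sketch of Lloyd's algorithm with fixed starting centers for determinism:

```python
import math

def kmeans(points, centers, iterations=10):
    # Lloyd's algorithm: repeatedly assign each point to its nearest
    # center, then move each center to the mean of its cluster.
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = []
        for i, cluster in enumerate(clusters):
            if cluster:
                dim = len(cluster[0])
                new_centers.append(tuple(
                    sum(p[d] for p in cluster) / len(cluster)
                    for d in range(dim)))
            else:
                new_centers.append(centers[i])  # keep an empty cluster's center
        centers = new_centers
    return centers, clusters

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, clusters = kmeans(points, centers=[(0, 0), (10, 10)])
print(centers)  # [(0.0, 0.5), (10.0, 10.5)]
```

In practice the starting centers are chosen randomly (or with k-means++), and the loop stops when assignments no longer change.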
7. Neural networks and deep learning
NN, TensorFlow.
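The smallest possible neural network is a single perceptron. As a toy sketch (nowhere near TensorFlow-scale), one neuron with the classic perceptron update rule can learn the logical AND function:

```python
def step(x):
    return 1 if x >= 0 else 0

def train_perceptron(samples, lr=0.1, epochs=20):
    # One neuron: two input weights plus a bias, nudged
    # toward the target after every misclassified sample.
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            pred = step(w[0] * x1 + w[1] * x2 + b)
            error = target - pred
            w[0] += lr * error * x1
            w[1] += lr * error * x2
            b += lr * error
    return w, b

and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(and_gate)
predict = lambda x1, x2: step(w[0] * x1 + w[1] * x2 + b)
print([predict(x1, x2) for (x1, x2), _ in and_gate])  # [0, 0, 0, 1]
```

Deep learning stacks many such units into layers and replaces the step function with differentiable activations so the whole network can be trained by backpropagation.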
The above is a detailed route for learning Hadoop development.
Which technologies does big data development require you to master?
(1) Java language fundamentals
Introduction to Java development, the Eclipse IDE, Java language basics, Java flow control, Java strings, Java arrays with classes and objects, number-handling classes and core techniques, I/O and reflection, multithreading, Swing programs and collection classes.
(2) HTML, CSS, and JavaScript
PC-side website layout, HTML5 + CSS3 basics, WebApp page layout, native JavaScript interaction development, Ajax asynchronous interaction, jQuery applications.
(3) JavaWeb and databases
Databases, the core of JavaWeb development, JavaWeb development internals.
Linux & the Hadoop ecosystem
The Linux system, an outline of Hadoop offline computing, the distributed database HBase, the data warehouse Hive, the data-migration tool Sqoop, and the distributed log framework Flume.
Distributed computing frameworks and the Spark & Storm ecosystem
(1) Distributed computing frameworks
The Python programming language, the Scala programming language, Spark big data processing, Spark Streaming big data processing, Spark MLlib machine learning, Spark GraphX graph computation; real project one: a Spark recommendation system (a real project from a first-tier company); real project two: Sina (www.sina.com.cn).
(2) Storm system technology architecture
Storm principles and fundamentals, the Kafka message queue, Redis tooling, Zookeeper, and a hands-on big data project covering data acquisition, data processing, data analysis, data presentation, and data application.
Big data analysis: AI (Artificial Intelligence)
Data analysis & preparing the working environment, basic data analysis, data visualization, machine learning with Python,
outdoor-equipment image-recognition analysis, neural networks & image recognition, natural language processing & social-network analysis, and a hands-on Python machine-learning project.