Learning big data development: where are the real difficulties?

Big data development engineer is a career many people aspire to, yet one reason or another keeps forcing them to give up. They know the prospects for big data are good, and they know that studying it can lead to a good job, but they still cannot bring themselves to see it through. Below, the difficulties students commonly run into while learning big data development are summarized, to help overcome them one by one.

Big data development has four stages:

1. Data acquisition

Data is collected in two ways, online and offline. Online data is usually gathered by crawling or by extracting it from existing application systems. At this stage we can build an acquisition platform that relies on automated crawlers (written in Python or Node.js), ETL tools, or a custom extract-and-transform engine to pull data from files, databases, and web pages. If this step is handled by an automated system, all of the raw data can be managed easily, the acquired data can be tagged from the start, the developers' work can be standardized, and the target data sources become much easier to manage.

The difficulty of data acquisition lies in the variety of data sources: MySQL, PostgreSQL, SQL Server, MongoDB, SQLite, as well as local files, Excel spreadsheets, and even Word documents. Structuring these heterogeneous sources and turning them into one organized part of our big data pipeline is essential.
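To make the multi-source problem concrete, here is a minimal sketch of the idea (all table names, fields, and sample values are hypothetical) that pulls rows from a relational database and from a flat-file export and normalizes both into one standard record shape, which is the first small job of any extraction engine:

```python
import csv
import io
import sqlite3

def normalize(source, row):
    """Map a raw row from any source into one standard record shape."""
    return {"source": source, "name": row["name"], "amount": float(row["amount"])}

# Source 1: a relational database (SQLite stands in for MySQL/PostgreSQL here).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (name TEXT, amount REAL)")
db.execute("INSERT INTO orders VALUES ('alice', 12.5)")
db.row_factory = sqlite3.Row  # rows become addressable by column name
rows = [normalize("sqlite", r) for r in db.execute("SELECT name, amount FROM orders")]

# Source 2: a flat-file export (CSV stands in for Excel or document exports).
csv_text = "name,amount\nbob,7.25\n"
rows += [normalize("csv", r) for r in csv.DictReader(io.StringIO(csv_text))]

print(rows)
```

However many sources are added, downstream code only ever sees the one normalized record format.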

2. Data aggregation

Data aggregation is the key step in a big data pipeline. Here you can standardize the data, clean it, consolidate it, and archive it, and data confirmed as usable can be sorted through a monitored process. The data at this stage is the sum total of the company's data assets, and beyond a certain volume it effectively becomes a fixed asset.

The difficulty of data aggregation lies in how to standardize the data: standardizing table names, classifying and labeling tables, recording each table's purpose and data volume, whether it receives incremental data, and whether the data is usable. This takes a great deal of effort across the business from top to bottom. If necessary, intelligent processing can be introduced, such as automatically assigning content labels based on training results and automatically recommending table and field names. There is also the question of how to import the raw data in the first place.
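One small piece of that standardization work can be sketched in a few lines (the naming convention and catalog fields below are hypothetical, not a standard): normalize every raw table name into a predictable form, and record each registered table's label and incremental-data flag in a metadata catalog.

```python
import re

def standardize_table_name(raw_name, domain):
    """Normalize a raw table name into <domain>_<snake_case> form."""
    name = re.sub(r"[^0-9a-zA-Z]+", "_", raw_name.strip()).lower().strip("_")
    return f"{domain}_{name}"

# A tiny metadata catalog: every aggregated table gets a classification label
# and a flag saying whether it receives incremental data.
catalog = {}

def register(raw_name, domain, label, incremental):
    std = standardize_table_name(raw_name, domain)
    catalog[std] = {"label": label, "incremental": incremental}
    return std

print(register("User Orders 2020", "sales", "transactions", True))
# → sales_user_orders_2020
```

A real catalog would live in a database and hold far more fields (owner, volume, update cadence), but the principle is the same: every table that enters the asset pool is registered and described.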

3. Data conversion and mapping

How do we make the data assets produced by aggregation available to specific consumers? The main consideration in this step is how the data will be applied: how to convert two or three tables into a single table that can provide a data service, and then update it incrementally on a schedule.

After the previous steps, this one is not especially difficult: convert and clean the data, map non-standard values to standard ones, merge the values of two fields into one field, or compute a statistics table from several source tables, and so on.
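The two conversions the paragraph mentions can be shown together in a toy example (the tables and fields are invented for illustration): merging two fields into one, and deriving a statistics table from more than one source table to produce a single service table.

```python
# Two hypothetical source tables produced by the aggregation step.
users = [
    {"id": 1, "first": "Ada", "last": "Lovelace"},
    {"id": 2, "first": "Alan", "last": "Turing"},
]
orders = [
    {"user_id": 1, "amount": 30.0},
    {"user_id": 1, "amount": 12.5},
    {"user_id": 2, "amount": 8.0},
]

# Statistic over the orders table: total spent per user.
by_user = {}
for o in orders:
    by_user[o["user_id"]] = by_user.get(o["user_id"], 0.0) + o["amount"]

# The service table: two name fields merged into one, joined with the statistic.
service_table = [
    {"user": f'{u["first"]} {u["last"]}',       # two fields -> one field
     "total_spent": by_user.get(u["id"], 0.0)}  # statistic from another table
    for u in users
]
print(service_table)
```

In production this would be a scheduled Hive or Spark SQL job writing to a result table, but the shape of the transformation is exactly this.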

4. Data applications

There are many ways to apply the data, some external and some internal. Should the large volume of prepared data assets be exposed to users through a RESTful API? Delivered to consuming applications through a streaming engine such as Kafka? Or composed directly into thematic data sets for your own applications to query? The demands on the data assets here are relatively high, so if the preliminary work has been done well, there is a high degree of freedom at this stage.
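The RESTful-API option can be sketched with nothing but the standard library (the asset name and payload below are hypothetical): a tiny read-only HTTP service that serves a finished data asset as JSON.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# A hypothetical data asset already produced by the pipeline.
ASSETS = {"daily_sales": [{"day": "2020-03-01", "total": 1234}]}

class AssetHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        asset = ASSETS.get(self.path.strip("/"))
        ok = asset is not None
        body = json.dumps(asset if ok else {"error": "not found"}).encode()
        self.send_response(200 if ok else 404)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

def serve(port=0):
    """Start the API on a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), AssetHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A production service would sit behind a real web framework with authentication and caching; the point here is only that a well-prepared asset can be exposed with very little extra code.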

The main difficulty in big data development is governance: how do you monitor the pipeline and plan the developers' work? If developers casually collect a pile of garbage data and connect it straight to the database, those small problems can be corrected in the short term. But as the volume of assets grows, they become a time bomb waiting to detonate, triggering a chain of effects on the data assets: chaotic data lowers the value of the assets and erodes customer confidence.

Big Data learning route

Java (Java SE, JavaWeb)
Linux (shell, high-concurrency architecture, Lucene, Solr)
Hadoop (Hadoop, HDFS, MapReduce, YARN, Hive, HBase, Sqoop, ZooKeeper, Flume)
Machine learning (R, Mahout)
Storm (Storm, Kafka, Redis)
Spark (Scala, Spark, Spark Core, Spark SQL, Spark Streaming, Spark MLlib, Spark GraphX)
Python (Python, Spark Python)
Cloud computing platform (Docker, KVM, OpenStack)

Glossary

1. Linux
Lucene: a full-text search engine library.
Solr: a Lucene-based full-text search server; it offers configurable, scalable, optimized query performance and provides a full-featured administration interface.

2. Hadoop
HDFS: a distributed storage system made up of a NameNode and DataNodes. The NameNode stores the metadata; the DataNodes store the actual data.
YARN: can be understood as the coordination mechanism for MapReduce; essentially it splits Hadoop's processing and scheduling into a ResourceManager and NodeManagers.
MapReduce: a software framework and parallel programming model.
Hive: a data warehouse that can be queried with SQL, which it executes as Map/Reduce programs. Used for computing trends or analyzing web logs; not suitable for real-time queries, because results take a long time to return.
HBase: a NoSQL database, very well suited to real-time queries over large data sets. Facebook stores its message data in HBase and analyzes the messages in real time.
ZooKeeper: a reliable coordination system for large distributed systems. Hadoop's distributed synchronization is implemented with ZooKeeper, for example active/standby switching between multiple NameNodes.
Sqoop: transfers data between databases, moving it back and forth between relational databases and HDFS.
Mahout: a scalable machine learning and data mining library, used for recommendation mining, clustering, classification, and frequent itemset mining.
Chukwa: a data collection system for monitoring large-scale distributed systems, built on top of HDFS and the Map/Reduce framework; it can display, monitor, and analyze the results.
Ambari: used to configure, manage, and monitor Hadoop clusters; web-based, with a friendly interface.
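The Map/Reduce programming model that Hadoop, Hive, and YARN all revolve around can be illustrated without a cluster. This plain-Python word count (a sketch, not Hadoop's actual API) follows the same map → shuffle → reduce shape a Hadoop job does:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word, like a Hadoop Mapper."""
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group all values by key, like Hadoop's sort/shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values, like a Hadoop Reducer."""
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big deal"])))
print(counts)  # → {'big': 2, 'data': 1, 'deal': 1}
```

Hadoop's value is that it runs these same three phases across thousands of machines, with HDFS holding the input and output and YARN scheduling the work.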

3. Cloudera
Cloudera Manager: integrated management, monitoring, and diagnostics.
Cloudera CDH (Cloudera's Distribution including Apache Hadoop): Cloudera's release of Hadoop with its own modifications, known as CDH.
Cloudera Flume: a log collection system that supports customized data senders for collecting data from logging systems.
Cloudera Impala: provides direct, interactive SQL access to data stored in Apache Hadoop's HDFS and HBase.
Cloudera Hue: a web manager comprising the Hue UI, Hue Server, and Hue DB. Hue provides a web-shell interface to all CDH components, and MapReduce jobs can be written in Hue.

4. Machine learning / R
R: a language and environment for statistical analysis and graphics; RHadoop integrates it with Hadoop.
Mahout: provides scalable implementations of classic machine learning algorithms, including clustering, classification, recommendation filtering, and frequent itemset mining, and can scale out to the cloud via Hadoop.

5. Storm
Storm: a distributed, fault-tolerant real-time stream computing system. It can be used for real-time analysis, online machine learning, information-flow processing, continuous computation, and distributed RPC, processing messages and updating databases in real time.
Kafka: a high-throughput distributed publish-subscribe messaging system that can handle all the action-stream data (browsing, searching, and so on) of a consumer-scale website. The same log data can be analyzed offline with Hadoop while still being processed in real time; through Hadoop's parallel loading mechanism, Kafka unifies online and offline message processing.
Redis: a key-value database written in C; it is networked, in-memory, and persists via logging.
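Kafka's publish-subscribe model, where one message stream feeds both a real-time consumer and an offline one, can be mimicked in miniature with the standard library (this toy broker is an illustration of the model, not of Kafka's actual protocol):

```python
from queue import Queue

class MiniBroker:
    """Toy publish-subscribe broker: one queue per subscriber per topic."""
    def __init__(self):
        self.topics = {}

    def subscribe(self, topic):
        q = Queue()
        self.topics.setdefault(topic, []).append(q)
        return q

    def publish(self, topic, message):
        # Every subscriber of the topic receives its own copy.
        for q in self.topics.get(topic, []):
            q.put(message)

broker = MiniBroker()
realtime = broker.subscribe("page_views")  # e.g. a Storm topology
offline = broker.subscribe("page_views")   # e.g. a Hadoop batch loader
broker.publish("page_views", {"user": "alice", "page": "/home"})
print(realtime.get()["page"], offline.get()["page"])
```

The real Kafka adds partitioned, replicated, durable logs and consumer offsets on top of this basic fan-out idea.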

6. Spark
Scala: a fully object-oriented programming language, similar to Java.

 

jblas: a fast linear algebra library for Java. It is based on BLAS and LAPACK, the de facto industry standard for matrix computation, and uses ATLAS's state-of-the-art infrastructure for all of its computational routines, which makes it very fast.

Spark: a general parallel computing framework similar to Hadoop MapReduce, implemented in Scala. Besides having MapReduce's advantages, it differs in that intermediate job output can be kept in memory, eliminating the need to read and write HDFS; Spark is therefore better suited than MapReduce to algorithms that need iteration, such as data mining and machine learning. It can operate on Hadoop file systems in parallel, with the third-party cluster framework Mesos supporting this behavior.
Spark SQL: a part of the Apache Spark big data framework that can be used for structured data processing and for running SQL-like queries over Spark data.
Spark Streaming: a real-time computation framework built on Spark, extending Spark's ability to process large streams of data.
Spark MLlib: MLlib is Spark's library of common machine learning algorithms; at the time of writing (May 2014) it supported binary classification, regression, clustering, and collaborative filtering, and it also includes a low-level gradient-descent optimization algorithm. MLlib depends on the jblas linear algebra library, and jblas itself depends on Fortran routines.

Spark GraphX: GraphX is Spark's API for graphs and graph-parallel computation. It provides a one-stop graph-data solution on Spark, conveniently and efficiently completing a whole pipeline of graph computations.

Fortran: the earliest high-level computer programming language, widely used in scientific and engineering computing.

BLAS: the Basic Linear Algebra Subprograms library, a collection of ready-made routines for a great many linear algebra operations.
LAPACK: a well-known open-source library covering the most common numerical linear algebra problems in science and engineering, such as solving systems of linear equations, linear least-squares problems, eigenvalue problems, and singular value problems.
ATLAS: an optimized version of the BLAS linear algebra library.
Spark Python: Spark is written in Scala, but to promote adoption and compatibility it also provides Java and Python interfaces.
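The gradient-descent optimizer that MLlib builds on is easy to illustrate in plain Python (a sketch of the idea, not MLlib's implementation). This example fits y ≈ w·x on toy data by repeatedly stepping against the gradient of the mean squared error:

```python
# Fit y = w * x on toy data with batch gradient descent (true w is 2.0).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w, learning_rate = 0.0, 0.05
for _ in range(200):
    # d/dw of (1/n) * sum((w*x - y)^2) is (2/n) * sum((w*x - y) * x)
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * grad

print(round(w, 4))  # → 2.0
```

MLlib runs this same loop distributed across a cluster, computing each iteration's gradient over partitions of the data in parallel, which is exactly the iterative workload Spark's in-memory model is good at.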

7. Python
Python: an object-oriented, interpreted programming language.


8. Cloud computing platform
Docker: an open-source application container engine.
KVM: Kernel-based Virtual Machine, the open-source virtualization module built into the Linux kernel.

 



Origin blog.csdn.net/mnbvxiaoxin/article/details/104654034