
Data collection

The first step in big data processing is to collect the data. Large projects today commonly use a microservice architecture deployed across many machines, so data must be collected from multiple servers, and the collection process must not disrupt normal business operations. This requirement gave rise to a variety of log collection tools, such as Flume and Logstash, which make it easy to configure complex data collection and aggregation.
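As an illustration, a minimal Flume agent configuration might tail an application log and forward events to HDFS. This is a hedged sketch: the agent name, file paths, and NameNode address below are all hypothetical examples.

```properties
# Hypothetical agent "a1": tail a local log file and ship events to HDFS
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: follow an application log (path is an example)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: roll files into HDFS, partitioned by date
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/logs/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1
```

The agent is then started with the standard `flume-ng agent` command, pointing at this properties file.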

Data storage

After collecting the data, the next question is: how should it be stored? The familiar answer is a traditional relational database such as MySQL or Oracle; these databases excel at storing structured data quickly and support random access. However, big data is typically semi-structured (e.g., log data) or even unstructured (e.g., video and audio). To store massive amounts of semi-structured and unstructured data, distributed file systems such as Hadoop HDFS, KFS, and GFS were developed. They can store structured, semi-structured, and unstructured data alike, and they scale out horizontally simply by adding machines.
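Once an HDFS cluster is running, files of any structure can be stored and retrieved through the standard HDFS shell. The commands below are illustrative only; the paths and file names are examples.

```
hdfs dfs -mkdir -p /data/logs        # create a directory in HDFS
hdfs dfs -put app.log /data/logs/    # upload a local file into the cluster
hdfs dfs -ls /data/logs              # list what is stored
hdfs dfs -cat /data/logs/app.log     # read the file back
```

Note that this interface is file-oriented: you read and write whole files or streams, which is exactly why random access to individual records needs something more, as the next paragraph explains.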

Distributed file systems solve the problem of storing massive data, but a good storage system must address both storage and access. For example, you may want random access to individual records, something traditional relational databases do well but distributed file systems do not. The demand for a storage solution that combines the advantages of distributed file systems and relational databases gave rise to HBase and MongoDB.
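To make the access pattern concrete, here is a toy in-memory sketch in Java. It is emphatically not how HBase is implemented; it only illustrates the two operations such stores add on top of a distributed file system: random reads by key and ordered range scans. All class and key names are invented for this example.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Toy sketch: a sorted key-value table, illustrating the access pattern
// that HBase-style stores provide on top of a distributed file system.
public class KeyValueSketch {
    private final TreeMap<String, String> table = new TreeMap<>();

    public void put(String rowKey, String value) {
        table.put(rowKey, value);
    }

    // Random access by row key, which a raw file-oriented store
    // (read/write whole files) does not give you directly.
    public String get(String rowKey) {
        return table.get(rowKey);
    }

    // Ordered range scan over row keys, the other HBase-style operation.
    public SortedMap<String, String> scan(String fromKey, String toKey) {
        return table.subMap(fromKey, toKey);
    }

    public static void main(String[] args) {
        KeyValueSketch t = new KeyValueSketch();
        t.put("user#001", "alice");
        t.put("user#002", "bob");
        t.put("order#900", "book");
        System.out.println(t.get("user#002"));        // bob
        System.out.println(t.scan("user#", "user$")); // only the user rows
    }
}
```

Real systems layer this sorted-key model over HDFS files (plus write-ahead logs and compaction), but the interface they expose to applications is essentially the one above.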

Data analysis


The most important stage of big data processing is data analysis, which usually falls into two categories: batch processing and stream processing.

Batch processing: uniform, offline processing of massive data accumulated over time; the corresponding frameworks include Hadoop MapReduce, Spark, and Flink.

Stream processing: processing data in motion, i.e., processing it as it is received; the corresponding frameworks include Storm, Spark Streaming, and Flink Streaming.

Batch and stream processing each have their own application scenarios. When results are not time-sensitive, or hardware resources are limited, batch processing can be used; when results are time-sensitive, stream processing is the better fit. As server hardware gets cheaper and demands on timeliness grow ever stricter, stream processing is becoming more and more common, for example in stock price prediction and e-commerce operational data analysis.
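The distinction above can be sketched in plain Java. This is an illustrative toy, not the API of any of the frameworks just named: the batch computation sees the complete dataset up front, while the streaming computation maintains a running result updated as each element arrives.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of batch vs. stream processing (no framework APIs).
public class BatchVsStream {

    // Batch: the whole dataset is available up front; compute over all of it.
    public static int batchSum(List<Integer> completeDataset) {
        return completeDataset.stream().mapToInt(Integer::intValue).sum();
    }

    // Stream: keep a running result, updated once per arriving element;
    // the answer is available at any moment, not only at the end.
    public static class RunningSum {
        private int sum = 0;

        public void onEvent(int value) { // called as each record arrives
            sum += value;
        }

        public int current() {
            return sum;
        }
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(3, 1, 4, 1, 5);
        System.out.println(batchSum(data)); // 14

        RunningSum rs = new RunningSum();
        for (int v : data) {
            rs.onEvent(v); // simulate events arriving one by one
        }
        System.out.println(rs.current()); // 14
    }
}
```

Both paths compute the same total here; the difference is latency, since the streaming version could have reported a partial sum after every event.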

Data applications

After data analysis is complete, the next step is applying the data, which depends on your actual business needs. For example, you can visualize the data, or use it to optimize a recommendation algorithm; this usage is very common today in personalized short-video recommendations, e-commerce product recommendations, and news feed recommendations. You can also use the data to train machine learning models. These are all separate domains with their own frameworks and technology stacks, so we stop here.

Learning Path

Big data has a relatively high barrier to entry. First of all, you need a foundation in a programming language.

1. Java

Most big data frameworks are developed in Java, and nearly all of them provide a Java API. Java is also the most mainstream backend development language, so free online learning resources are plentiful.

2. Scala

Scala is a statically typed programming language that integrates object-oriented and functional programming concepts. It runs on the Java virtual machine and interoperates seamlessly with all Java class libraries; the well-known Kafka is developed in Scala.

Why learn Scala? Because the currently hottest computing frameworks, Flink and Spark, both provide Scala APIs, and developing with them in Scala requires less code than in Java 8. Moreover, Spark itself is written in Scala, so learning Scala helps you understand Spark more deeply.
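As a reference point for that comparison, here is the classic word count written with Java 8 streams, the style the text compares Scala against (the Scala equivalent is typically a one-liner). This is plain Java, not the Spark API, and the input text is just an example.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Word count in Java 8 streams: split on whitespace, group by word, count.
public class WordCount {
    public static Map<String, Long> count(String text) {
        return Arrays.stream(text.toLowerCase().split("\\s+"))
                     .filter(w -> !w.isEmpty())
                     .collect(Collectors.groupingBy(Function.identity(),
                                                    Collectors.counting()));
    }

    public static void main(String[] args) {
        // e.g. prints a map with to=2, be=2, or=1, not=1
        System.out.println(count("to be or not to be"));
    }
}
```

Even in this concise Java 8 style, the grouping-and-counting collector is noticeably more ceremony than the equivalent `groupBy`/`mapValues` chain in Scala, which is the point the paragraph above is making.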

Linux Basics

Big Data frameworks are usually deployed on Linux servers, so it is necessary to have some knowledge of Linux.

Build tools

The main automated build tool to master is Maven. Maven is common in big data work mainly in the following three respects:

1. managing a project's JAR dependencies, helping you build big data applications quickly;

2. whether your project is developed in Java or Scala, it must be compiled and packaged with Maven before being submitted to a cluster environment;

3. most big data frameworks manage their source code with Maven, so when you need to build an installation package from source, you need Maven as well.
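A minimal sketch of point 2 is a `pom.xml` fragment declaring a big-data dependency for cluster submission. The coordinates and version below are examples only; check the current releases before use.

```xml
<!-- Fragment of a pom.xml; version and coordinates are examples only -->
<dependencies>
  <!-- Spark core for the Scala 2.12 build; scope "provided" because
       the cluster itself supplies Spark at runtime -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.0.0</version>
    <scope>provided</scope>
  </dependency>
</dependencies>
```

Running `mvn clean package` then produces the application JAR that you hand to the cluster, for example via `spark-submit`.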

Learning the frameworks

A simple classified summary of the frameworks:

Log collection framework: Flume, Logstash, Kibana

Distributed File Storage System: Hadoop HDFS

Database systems: MongoDB, HBase

Distributed computing framework:

· Batch framework: Hadoop MapReduce

· Stream processing framework: Storm

· Hybrid processing frameworks: Spark, Flink

Analysis framework: Hive, Spark SQL, Flink SQL, Pig, Phoenix

Cluster resource manager: Hadoop YARN

Distributed Coordination Services: Zookeeper

Data Migration Tool: Sqoop

Task scheduling framework: Azkaban, Oozie

Cluster deployment and monitoring: Ambari, Cloudera Manager

Listed above are the more mainstream big data frameworks; their communities are very active and learning resources abundant. It is recommended to start with Hadoop, because it is the cornerstone of the entire big data ecosystem, and other frameworks depend on it directly or indirectly. After that you can learn a computing framework. Spark and Flink are the mainstream hybrid processing frameworks: Spark appeared earlier, so it is more widely adopted, while Flink is the hottest new-generation hybrid processing framework, whose excellent properties have won it favor with many companies. Whether to learn one or both depends on your personal preference and actual work needs.



Source: www.cnblogs.com/1654kjl/p/12569064.html