17 Key Points You Must Understand About Big Data Applications

Disclaimer: This is an original article by the blogger, released under the CC 4.0 BY-SA license. If you reproduce it, please attach the original source link and this statement.
Original link: https://blog.csdn.net/qq_43958467/article/details/99630320

 


 

1. How do you choose between a data warehouse and an MPP database on a big data platform?

On the Hadoop platform, Hive is generally treated as the data warehouse, while Impala and Presto are typical representatives of MPP databases. The MPP architecture is mainly used for ad hoc query scenarios, that is, scenarios with high demands on query latency; a data warehouse has less stringent latency requirements and is better suited to offline analysis scenarios.

Hadoop is now the de facto standard for big data platforms, and Hive, the data warehouse in the Hadoop ecosystem, can serve as the standard data warehouse of such a platform.

For MPP-style database applications you can choose MyCat (a distributed MySQL architecture) or Impala (built on top of Hive and HBase), covering both symmetric and asymmetric distributed modes. A small sketch of both query paths follows.
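To make the split concrete, here is a minimal Python sketch, assuming PyHive and impyla are installed, that HiveServer2 listens on port 10000 and an Impala daemon on 21050 (the usual defaults), and that the host, database, and table names are placeholders: the same SQL goes to Hive for batch analysis and to Impala for an ad hoc query.

```python
# Minimal sketch: same SQL, two engines. Host, database and table names are
# illustrative placeholders; ports are the common HiveServer2/Impala defaults.
from pyhive import hive            # batch / offline warehouse path
from impala.dbapi import connect   # ad hoc, low-latency MPP path

SQL = "SELECT region, COUNT(*) AS cnt FROM sales_fact GROUP BY region"

# Offline analysis: Hive tolerates long-running, heavy aggregations.
hive_cur = hive.connect(host="warehouse-node", port=10000, database="dw").cursor()
hive_cur.execute(SQL)
print("Hive (batch):", hive_cur.fetchall())
hive_cur.close()

# Ad hoc query: the same SQL against Impala, where response time matters.
impala_conn = connect(host="warehouse-node", port=21050, database="dw")
impala_cur = impala_conn.cursor()
impala_cur.execute(SQL)
print("Impala (ad hoc):", impala_cur.fetchall())
impala_cur.close()
impala_conn.close()
```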

2. How is real-time recommendation implemented in big data analytics?

Real-time recommendation combines a real-time processing framework with a recommendation algorithm so that data is processed and recommendations are produced in real time. Real-time frameworks such as Storm, Flink, and Spark Streaming can be connected to Kafka to obtain the real-time data stream and process it inside the framework.

1. Real-time recommendation relies on a real-time computing framework such as Spark (Streaming) or Storm.

2. Data collection uses Flume plus Kafka, with Kafka acting as the data cache and distribution layer.

3. You also need recommendation algorithms suited to real-time use, for example recommendations based on the user's profile, on user behavior, or on item similarity, each implemented with a different algorithm. A minimal streaming sketch follows this list.
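As an illustration of the plumbing only (not of any particular recommendation algorithm), here is a minimal PySpark Structured Streaming sketch; the broker address kafka:9092 and the topic name user-events are assumptions, and the per-user click count stands in for a real scoring step.

```python
# Requires the spark-sql-kafka connector package on the Spark classpath.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("realtime-reco-sketch").getOrCreate()

# Consume the raw event stream from Kafka.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")
          .option("subscribe", "user-events")
          .load())

# Kafka values arrive as bytes; decode them and aggregate clicks per user per
# minute as a stand-in for the real recommendation/scoring logic.
clicks = (events
          .selectExpr("CAST(value AS STRING) AS user_id", "timestamp")
          .groupBy(F.window("timestamp", "1 minute"), "user_id")
          .count())

query = (clicks.writeStream
         .outputMode("update")
         .format("console")   # in practice: write to a serving store (Redis, HBase, ...)
         .start())
query.awaitTermination()
```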


 

3. What methods or tools make data governance efficient?

There is no single tool or method for data governance. It is a huge undertaking that may involve every department: technical staff, business staff, and, at critical moments, leadership decisions. Every company's data is different and so is the way it must be handled, but the basic approach is the same: sort out the data (metadata and master data), find the data quality problems, and then, through the organization, quality rules, or standards, coordinate the work of normalizing the data.

Data governance is hard manual work; there are no shortcuts and no silver-bullet tools. It is also a very important part of any big data project, because only when data quality meets the needs of the front-end applications can mining and analysis produce accurate results.

The specific processing method depends on the actual situation: the databases involved, the data types, the data volumes, and so on.

The governance process is essentially a sorting-out of the data in the business systems. Problems found along the way are fed back to the business departments, and unified quality and audit standards are established, which is like assigning a quality supervisor to the data produced by every business system. A small sketch of rule-based quality checks follows.
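As one concrete (and deliberately simple) example of such rule-based checks, here is a pandas sketch; the file name, column names, and rules are illustrative assumptions, not a prescribed standard.

```python
import pandas as pd

# Load one business system's export; in practice this would come from the
# staging area fed by the ETL jobs.
df = pd.read_csv("customer_master.csv")

report = {
    "row_count": len(df),
    "duplicate_ids": int(df["customer_id"].duplicated().sum()),
    "missing_phone": int(df["phone"].isna().sum()),
    "bad_created_at": int(pd.to_datetime(df["created_at"], errors="coerce").isna().sum()),
}

# Findings like these are what gets fed back to the owning business department
# and tracked against the agreed quality standard.
print(report)
```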



 

4. How do you choose a big data framework for log analysis?

ELK is the common component stack; upper-layer services can be packaged on top with whatever other components are required.

For log analysis: ELK, plus Redis and MySQL for hot data and hotspot analysis.

And so on; it depends on your business model and your developers' preferences.

Elastic's ELK stack is free and has been adopted by mainstream companies. Its components are lightweight and easy to use: going from collection to a dashboard takes almost no time, and the Kibana interface is excellent, with maps, reports, search, alerting, monitoring, and many other features. A small indexing sketch follows.
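To show where log events end up in such a stack, here is a minimal sketch using the elasticsearch-py client; the document= keyword assumes the 8.x client, and the local cluster URL, index name, and fields are illustrative assumptions.

```python
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

log_event = {
    "@timestamp": datetime.now(timezone.utc).isoformat(),
    "level": "ERROR",
    "service": "order-api",
    "message": "payment gateway timeout after 30s",
}

# In a real pipeline Filebeat/Logstash does this step; the call only shows
# where the data lands before Kibana's dashboards, search and alerting apply.
es.index(index="app-logs-2024.01", document=log_event)
```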

5. After a big data platform is built, what should operations and maintenance monitoring focus on?

Monitoring a big data platform covers both the hardware and the software level:

1. Hosts, network, disks, memory, CPU, and other resources.

In clusters of dozens of machines or more doing heavy computation, hardware wear is significant, hard-drive failures in particular, and under heavy computation the network often becomes the bottleneck, so these need constant attention.

2. The platform level.

Mainly monitor the status and load of each platform component and raise alerts promptly on anomalies.

3. The user level.

A big data platform serves internal customers, so resources are shared but also need to be isolated. You therefore have to monitor each user's resource usage on the platform, detect abnormal usage, and prevent one user from adversely affecting others and disrupting normal business.


 

After the platform is built, operations and maintenance monitoring mainly covers:

1. The state of the underlying distributed infrastructure and virtual machines (CPU, memory, network, disk, etc.).

2. The running state and alarm information of the various components (HDFS, MapReduce, Spark, Hive, HBase, Impala, Flume, Sqoop, etc.). A small polling sketch follows.
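As one way to implement the component-level part, here is a minimal sketch that polls the HDFS NameNode's JMX servlet; the host name is a placeholder, port 9870 assumes Hadoop 3.x (2.x uses 50070), and the thresholds are arbitrary examples.

```python
import requests

NAMENODE_JMX = ("http://namenode:9870/jmx"
                "?qry=Hadoop:service=NameNode,name=FSNamesystemState")

beans = requests.get(NAMENODE_JMX, timeout=5).json()["beans"][0]

dead = beans.get("NumDeadDataNodes", 0)
remaining = beans.get("CapacityRemaining", 0)
total = beans.get("CapacityTotal", 1)

# Alert on dead DataNodes or less than 10% capacity left; wire this into
# whatever alerting channel the platform team already uses.
if dead > 0 or remaining / total < 0.10:
    print(f"ALERT: dead datanodes={dead}, capacity remaining={remaining / total:.1%}")
```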

6. With large data volumes and complex data types, how do you guarantee performance?

How to guarantee the performance of a big data platform depends mainly on the scenario and the business requirements; not every workload needs high performance.

1. For OLTP-like scenarios, the platform has components such as HBase that provide high-throughput reads and writes with guaranteed performance (an HBase sketch follows this answer).

2. For OLAP scenarios, the platform has engines such as Impala, Kudu, Kylin, and Druid that guarantee query performance either in memory or through pre-computation.

3. For offline analysis scenarios there are engines such as Hive, Spark, and MapReduce that process massive data in a distributed way; in this scenario strict performance and response-time guarantees are not required.

More generally:

1. The underlying layers of a big data platform are all distributed architectures, which have strong scale-out ability: built from inexpensive PC servers, the platform's performance grows simply by adding servers.

2. Data processing on the platform also uses distributed techniques (e.g. MapReduce, Hive, HBase, HDFS).

3. Some processing is based on in-memory computing frameworks such as Spark. The performance requirements of a big data platform are not the same as those of a conventional interactive system: workloads divide into real-time and offline computation, where real-time computation has response-time requirements and offline computation does not have strict ones.
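A minimal sketch of the OLTP-style point above, using the happybase HBase client; the Thrift host, table name, column family, and row key are illustrative assumptions, and the HBase Thrift server must be running for this client to work.

```python
import happybase

conn = happybase.Connection("hbase-thrift-host")  # HBase Thrift gateway
table = conn.table("user_profile")

# Write: the row key is the user id; columns live under a column family ("cf").
table.put(b"user:10001", {b"cf:last_login": b"2024-01-15", b"cf:score": b"87"})

# Keyed read of the same row, the access pattern HBase serves with low latency.
row = table.row(b"user:10001")
print(row.get(b"cf:last_login"))

conn.close()
```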


 

7. How should data preprocessing be handled?

Steel-industry data is quite complex. How should IT staff who are not particularly familiar with the production process handle the data, and who should do the preprocessing?

Data preprocessing covers cleaning, integration, consolidation, and standardization of the data.

1. The preprocessing work is handled by the supplier building the big data project or by a company specializing in data governance.

2. In a big data project, preprocessing takes a great deal of time and is largely manual work. If the data is heavily business-specific, problems are inevitable, so it is best to have people who understand the business take part in the preprocessing.

Only high-quality data has analytical value, so the preprocessing step is especially important. Data is the digitized form of the business; for complex industry data, technical staff will not know how to process it so that it meets the analysis needs, so the business analysts must state concrete data-processing requirements and the technical staff then design processing that meets them. A minimal cleaning sketch follows.
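To ground the cleaning and standardization step, here is a minimal PySpark sketch; the input path, column names, and the Fahrenheit-to-Celsius conversion are illustrative assumptions for production-line sensor data, and the actual rules would come from the business side as argued above.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("preprocess-sketch").getOrCreate()

raw = spark.read.option("header", True).csv("/data/raw/furnace_readings.csv")

clean = (raw
         .dropDuplicates(["sensor_id", "reading_time"])           # de-duplicate
         .withColumn("temp_c", F.col("temp_f").cast("double"))     # fix the type
         .withColumn("temp_c", (F.col("temp_c") - 32) * 5.0 / 9)   # standardize units
         .na.drop(subset=["sensor_id"]))                           # drop unusable rows

# Land the cleaned data where the analysis layer expects it.
clean.write.mode("overwrite").parquet("/data/clean/furnace_readings")
```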

8. How do you migrate from a traditional data warehouse to a big data platform?

Many traditional data warehouses are built on Oracle. If we now want to move to a big data platform, what is a good migration plan, and what problems might we run into? Thanks!

1. Whether the warehouse sits on Oracle or another database, the big data platform has ETL tools for this kind of data transfer and stores the data uniformly in the HDFS distributed file system. On top of HDFS, Hive is used to build the data warehouse for offline batch computation, and HBase provides object storage that supports highly concurrent online queries and unstructured data analysis, meeting the front-end requirements.

2. The original data warehouse can still act as a data exchange and sharing platform, with data pushed to the platform in (near) real time: for example Sqoop imports the structured data, while Flume and Kafka collect semi-structured and unstructured data and land it in HDFS. A hedged ingestion sketch follows.
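As one possible ingestion path (Spark's JDBC reader rather than Sqoop, shown only as an alternative sketch), the following pulls an Oracle table into a Hive table; the JDBC URL, credentials, and table names are placeholders, and the Oracle JDBC driver must be on the Spark classpath.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("oracle-to-hive-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Read the source table over JDBC; fetchsize just batches the row transfer.
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCLPDB")
          .option("dbtable", "DW.ORDERS")
          .option("user", "etl_user")
          .option("password", "***")
          .option("fetchsize", "10000")
          .load())

# Land it as a Hive table for the offline batch layer described above.
orders.write.mode("overwrite").saveAsTable("ods.orders")
```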

9. Do traditional data warehouses need to move to a big data platform?

As a question about traditional warehouses: in which scenarios is it worth moving to a big data platform? What problems does the move solve, and what problems does it expose?


 

A big data platform uses a distributed architecture to solve the storage and analysis of massive data, problems at the hundreds-of-terabytes and petabyte scale that a traditional warehouse cannot handle. Because the architecture is new, the usage patterns differ: some workloads use SQL, some are programmed with Spark, some with MapReduce, so there is a learning cost. The platforms are also still maturing, particularly around user management, security, and metadata management, where problems remain, so they need to be used with care.

10. How does the big data storage layer achieve strong data consistency?

Consistency of the underlying data is achieved through HDFS's redundant-replica policy and its heartbeat detection mechanism.

1. Redundant replicas: HDFS handles node failures through data redundancy, that is, by keeping multiple copies of the data. The number of replicas can be set in the HDFS configuration file and defaults to 3; a write returns success only after the replicas have been written.

2. Heartbeat mechanism: node failures are detected through heartbeats. Every DataNode periodically sends a heartbeat signal to the NameNode. The NameNode detects missing heartbeats, marks DataNodes that have not sent one recently as down, and stops sending new I/O requests to them.

N = 3 (number of data replicas)

W = 1 (number of nodes that must acknowledge a write before it returns successfully; default 1)

R = 1 (number of nodes that must be read)

W + R < N

Hadoop cannot guarantee strong consistency of all data, but its replica mechanism guarantees a degree of consistency: if a DataNode goes down, its replicas are rebuilt on other DataNodes, keeping the replica count consistent; and when writing, multiple replicas can be written so that even if the machine holding one replica fails, the data as a whole is unaffected.
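A minimal sketch of setting the replica count explicitly from a client, using the hdfs (WebHDFS) Python package; the NameNode URL, user, and path are placeholder assumptions, and port 9870 assumes Hadoop 3.x.

```python
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="etl")

# Ask for 3 replicas explicitly; the cluster-wide default comes from
# dfs.replication in hdfs-site.xml, normally also 3.
client.write("/data/events/part-0001.json",
             data=b'{"user": 10001, "action": "click"}\n',
             replication=3,
             overwrite=True)

# status() reports, among other things, the replication factor of the file.
print(client.status("/data/events/part-0001.json"))
```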


 

11. How do you add disaster recovery to a big data platform? Are there mature ideas or plans?

1. Disaster recovery is about business continuity; the big data platform itself provides a multi-replica mechanism to keep the business running stably and reliably.

2. Big data platforms are mostly deployed on virtual machines or in containers; few are deployed directly on physical servers plus storage.

3. Virtualization and containers therefore bring strong business-continuity features of their own, such as virtual machine live migration, HA, and DRS.

12. What are the hardware requirements for the underlying big data platform?

1. Within the enterprise it is best to keep the configuration of all machines in the cluster consistent; otherwise one slow machine can easily drag down the whole job.

2. A big data platform is demanding on the network: in a cluster of dozens of machines on a gigabit network, a single big job can easily saturate the bandwidth.

3. The platform's demands on CPU and disk are lower than on the network, but they still cannot be too weak, otherwise I/O cannot keep up and jobs slow down.

4. Memory requirements are high, especially when Impala, Spark, MapReduce, Hive, HBase, and other components share resources on one platform; configure plenty of memory.

Agreeing with the above: x86 servers can be deployed in a distributed fashion. Pay particular attention to I/O performance; SSDs can be configured.

High throughput, high capacity, high bandwidth.

1. Hadoop is now the de facto standard for big data, and it emerged precisely to run on clusters of inexpensive commodity servers and, by divide and conquer, solve problems that traditional databases, traditional storage, and traditional computing models could not, making large-scale data processing possible.

2. The hardware requirements are therefore not too high; ordinary PC servers will do, but for higher performance you can add SSDs or more memory.

13. How do you develop big data talent?

People account for a large share of a successful big data transformation; how do you effectively develop qualified staff?

Big data work involves data collection, data cleansing, integration and governance, platform installation, commissioning, operation and maintenance, big data development, big data algorithm engineering, data mining, and more.

The demand for big data talent forms a pyramid: the greatest demand is at the bottom, for data collection, cleansing, and governance staff (largely labor-intensive work); above that sits platform installation and commissioning (which requires a Linux foundation); and at the top are big data mining and algorithm engineers.

If you are the user organization, you need to cultivate big data awareness in advance and recognize its importance and feasibility; for the later stages of the project, operation and maintenance training can be provided.


 

14. Which big data techniques and tools does user profiling use, and what should you pay attention to?

A user profile (user portrait) describes a user's overall characteristics with multi-dimensional data; the work involves feature extraction and tagging.

For example, user attributes, preferences, habits, behavior, exercise, and rest information are abstracted into a tagged user model. In plain terms, you tag the user, and a tag is a highly refined identifier distilled from the analysis of the user's information.

It involves data collection, data modeling, analysis, and mining, and the following points need attention:

1. Before building the profile, you need to know which feature dimensions and user behaviors you care about, so that you grasp the users' needs as a whole.

2. Building a user profile is not about pulling out isolated, typical labels; the surrounding contextual information must be brought in and considered together.

3. Profiles sometimes need to change, so distinguish short-term profiles from long-term ones. A minimal tagging sketch follows this list.
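A minimal pandas sketch of the tagging step only; the behavior columns, thresholds, and tag names are illustrative assumptions, not a recommended taxonomy.

```python
import pandas as pd

# Toy behavior aggregates per user over the last 30 days.
behavior = pd.DataFrame({
    "user_id": [1, 2, 3],
    "night_sessions_30d": [22, 2, 9],
    "sports_purchases_30d": [0, 5, 1],
})

def tag_user(row):
    tags = []
    if row["night_sessions_30d"] >= 15:
        tags.append("night_owl")          # habit tag
    if row["sports_purchases_30d"] >= 3:
        tags.append("sports_enthusiast")  # preference tag
    return tags

# Short-term tags like these are recomputed on a rolling window, separately
# from long-term attribute tags, as noted above.
behavior["tags"] = behavior.apply(tag_user, axis=1)
print(behavior[["user_id", "tags"]])
```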

15. What should you pay attention to when implementing a typical big data project?

The process is not essentially different from an ordinary project: requirements, analysis, design, development, and testing are all there. The difference is that a big data project does not use technology as simple as traditional SQL-based database development; it demands stronger programming skills and a stronger ability to troubleshoot, because everything runs distributed, which makes troubleshooting much more complicated.

1. Implementation involves interfacing with the customer's many business systems for data collection, followed by data cleansing, integration, standardization, data governance, data modeling, mining and analysis, and finally visualization.

2. When interfacing with the business systems, be sure to obtain each system's data dictionary (without it, understanding and analyzing the data is very difficult).

3. For the business analysis dimensions, the project manager needs to pin down the customer's requirements and determine the scope and boundaries of the system (otherwise the requirements and scope keep changing and the development cycle stretches out indefinitely).

4. Prepare the underlying environment and resources the platform needs (CPU, memory, disk, network, etc.); big data projects are fairly demanding here, disk capacity for example, especially when the data to be analyzed is logs or transaction records.


 

16. How do you choose an enterprise-class big data platform?

Today's big data platforms are basically Hadoop platforms, so selection mainly means choosing a Hadoop management platform. The mainstream vendors are Cloudera and Hortonworks, and domestically Huawei's FusionInsight and Transwarp's products. Relatively speaking, Cloudera has the bigger advantage and the higher market share; its management platform is very practical and a rare good helper for administrators.

Hadoop is now the de facto standard for big data, and an enterprise-class platform is best chosen from the open-source Hadoop ecosystem. The two biggest commercial promoters of open-source Hadoop are Cloudera (the CDH distribution, which runs on Linux) and Hortonworks (the HDP distribution, which can also run on Windows); choosing a product from either company is enough.

17. What are the strengths and weaknesses of Spark and Storm for real-time computing, and which scenarios suit each?

Spark Streaming and Storm are both real-time computing frameworks; both can process data in (near) real time. Spark Streaming is implemented on top of Spark Core: the incoming data within each time window is formed into an RDD, so its processing is called micro-batch, whereas Storm processes each record as it arrives, so its latency is lower than Spark Streaming's. Storm is therefore better suited to scenarios with extremely strict real-time requirements.

Strictly speaking, Spark Streaming in the Spark ecosystem is a (micro-)batch computing framework, i.e. near real time; being memory-based it can reach second-level latency. Big data work includes not only real-time computation but also offline batch processing and interactive queries. If, alongside real-time computation, you also need high-latency batch jobs and interactive queries, the Spark ecosystem should be the first choice: Spark Core for offline batch, Spark SQL for interactive queries, and Spark Streaming for real-time computation, the three seamlessly integrated, which gives the system very high scalability.

Storm is a pure record-at-a-time real-time framework that can reach millisecond latency. If you need a reliability or transaction mechanism, i.e. data must be processed exactly once, no more and no less, Storm is also worth considering.

To put it figuratively, Spark is like the mall's elevator, moving people in batches, while Storm is like the escalator, a continuous stream. A minimal micro-batch sketch follows.
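To make the micro-batch model concrete, here is a minimal PySpark DStream sketch (the older Spark Streaming API): records are grouped into 5-second batches, whereas Storm would hand each record to a bolt the moment it arrives. The socket source on localhost:9999 is just an assumption for illustration.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="microbatch-sketch")
ssc = StreamingContext(sc, batchDuration=5)   # every 5 seconds becomes one small RDD

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.pprint()   # printed once per 5-second batch, not once per record

ssc.start()
ssc.awaitTermination()
```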

 

So, what are the commonly used big data technologies?

Phase 1: Java SE + MySQL + Linux

Java foundations → OOP → Java collections → IO/NIO → Eclipse → IntelliJ IDEA → socket networking → MySQL → JDBC API → JVM memory → project practice → Linux (VMware, CentOS, directory structure, Linux commands)

Phase 2: the Hadoop ecosystem

Hadoop → MapReduce → Hive → Avro and Protobuf → ZooKeeper → HBase → Phoenix → Redis → Flume → SSM (Spring, SpringMVC, MyBatis) → Kafka distributed architecture

Phase 3: Spark, Storm, and their ecosystems

Scala → Spark jobs → Spark RDDs → Spark job deployment and resource allocation → Spark shuffle → Spark SQL → Spark Streaming → Spark ML → Azkaban

Phase 4: other topics

Python and data analysis, machine learning algorithms

Phase 5: project practice and integrated use of the technologies

What this phase requires: hands-on work with real enterprise big data business scenarios, requirements analysis, solution implementation, and integrated practical application of the technologies.

With a foundation in the Java programming language you can learn the big data technologies above. Big data is the direction of future development; it challenges our analytical skills and the way we understand the world, so keep up with the times, embrace change, keep growing, and master the core technologies of big data, for that is where the real value lies.
