Hadoop detailed documentation (1) Big data overview (with detailed explanation video)

Free video tutorial  https://www.51doit.com/  or contact the blogger on WeChat 17710299606

1 Big data background

Society today is developing rapidly. Technology is advanced, information circulates freely, people communicate ever more closely, and daily life keeps getting more convenient. Big data is a product of this high-tech era.

With the advent of the cloud era, big data has attracted more and more attention. The term usually describes the large volumes of unstructured and semi-structured data that a company creates, data that would take too much time and money to load into a relational database for analysis. Big data analysis is often associated with cloud computing, because real-time analysis of large data sets requires a framework like MapReduce to distribute the work across tens, hundreds, or even thousands of computers. [2]

Big data applications increasingly demonstrate their advantages and reach into more and more fields. In e-commerce, O2O, logistics and distribution, and other areas, big data helps companies develop new businesses and innovate their operating models. It has comprehensively improved the assessment of consumer behavior, product sales forecasting, precision marketing, and inventory replenishment.

In the Internet industry, "big data" refers to a phenomenon: Internet companies generate and accumulate data on users' online behavior in their daily operations, and the scale of this data is so large that it can no longer be measured in GB or TB.

How big is big data? A data set called "One Day on the Internet" tells us that in a single day, the content generated on the Internet could fill 168 million DVDs; 294 billion emails are sent (equivalent to two years of paper mail in the United States); 2 million community posts are published (equivalent to 770 years of text in Time magazine); and 378,000 mobile phones are sold, more than the 371,000 babies born worldwide each day. [1]

As of 2012, data volumes have jumped from the TB level (1024 GB = 1 TB) to the PB (1024 TB = 1 PB), EB (1024 PB = 1 EB), and even ZB (1024 EB = 1 ZB) level. Research by the International Data Corporation (IDC) shows that the amount of data generated globally was 0.49 ZB in 2008 and 0.8 ZB in 2009, grew to 1.2 ZB in 2010, and reached 1.82 ZB in 2011, equivalent to more than 200 GB for every person in the world. As of 2012, all printed materials ever produced by mankind amount to about 200 PB of data, and all the words ever spoken in human history amount to about 5 EB. According to IBM's research, 90% of all the data obtained by human civilization was generated within the past two years. By 2020, the scale of data generated in the world will reach 44 times that of today. [3] Every day, the world uploads more than 500 million pictures, and every minute 20 hours of video are shared. Yet even all the information that people create every day, including voice calls, emails, and messages as well as uploaded pictures, videos, and music, cannot match the amount of digital information created every day about people themselves.
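
The unit ladder and the per-person figure above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original article; the `convert` helper is a hypothetical name, and the world population of roughly 7 billion in 2011 is an assumption introduced only for this check.

```python
# Binary storage units, each 1024 times the previous one (as in the text).
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]


def convert(value: float, src: str, dst: str) -> float:
    """Convert a size between two units in UNITS using 1024-based steps."""
    steps = UNITS.index(src) - UNITS.index(dst)
    return value * (1024 ** steps)


# Assumption for illustration only: about 7 billion people in 2011.
WORLD_POPULATION_2011 = 7_000_000_000

# 1.82 ZB (the 2011 figure quoted from IDC) expressed in GB, then per person.
total_gb = convert(1.82, "ZB", "GB")
per_person_gb = total_gb / WORLD_POPULATION_2011

print(f"1 PB = {convert(1, 'PB', 'TB'):.0f} TB")
print(f"1.82 ZB is roughly {per_person_gb:.0f} GB per person")
```

Under that population assumption the result is comfortably above 200 GB per person, which is consistent with the claim in the text.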

This trend will continue. We are still in the initial stage of the so-called "Internet of Things", and as the technology matures, our equipment, vehicles, and rapidly developing "wearable" devices will be able to connect and communicate with each other. Advances in technology have reduced the cost of creating, capturing, and managing information to one-sixth of what it was in 2005. Since 2005, business investment in hardware, software, talent, and services has also increased by a full 50%, reaching 400 billion US dollars.

2 Features

Large amount of data (Volume)

The first feature is the large amount of data. Big data is measured starting from at least the PB level (about 1,000 TB), and extends to EB (about 1 million TB) and ZB (about 1 billion TB).

Many types (Variety)

The second feature is the variety of data types, including web logs, audio, video, pictures, geographic location information, and more. Handling so many types of data places higher demands on data processing capabilities.

Low value density (Value)

The third characteristic is that the data value density is relatively low. For example, with the widespread application of the Internet of Things, information perception is everywhere and information is massive, but the value density is low. How to "purify" the value of data more quickly through powerful machine algorithms is an urgent problem to be solved in the era of big data.

High speed and high efficiency (Velocity)

The fourth feature is fast processing speed and high timeliness requirements. This is the most significant feature that distinguishes big data from traditional data mining.

Existing technical architectures and approaches can no longer process such large amounts of data efficiently. For the organizations involved, if the information collected at huge expense cannot be processed in time to yield useful feedback, the loss outweighs the gain. The era of big data thus poses new challenges to our ability to manage data, and it also offers unprecedented room and potential for deeper, more comprehensive insight.

3 Application scenarios

Big data is everywhere. It has been applied in all walks of life, including finance, automobiles, catering, telecommunications, energy, sports, and entertainment, and every sector of society now bears its footprint.

  • In the manufacturing industry, industrial big data is used to raise the level of manufacturing, including product fault diagnosis and prediction, analysis of technological processes, improvement of production processes, optimization of energy consumption during production, industrial supply chain analysis and optimization, and production planning and scheduling.
  • In the financial industry, big data plays a major role in three major areas of financial innovation: high-frequency trading, social sentiment analysis, and credit risk analysis.
  • In the automotive industry, driverless cars built on big data and Internet of Things technology will enter our daily lives in the near future.
  • In the Internet industry, big data technology makes it possible to analyze customer behavior and to deliver product recommendations and targeted advertising.
  • In the telecommunications industry, big data technology is used for customer churn analysis, so that operators can spot customers who are likely to leave in time and introduce retention measures.
  • In the energy industry, with the development of smart grids, power companies can collect large amounts of data on users' power consumption, use big data technology to analyze consumption patterns, improve grid operation, design power demand response systems sensibly, and ensure the safe operation of the grid.
  • In the logistics industry, big data is used to optimize logistics networks, improve logistics efficiency, and reduce logistics costs.
  • In urban management, big data can be used to realize intelligent transportation, environmental monitoring, urban planning, and intelligent security.
  • In biomedicine, big data can help us realize epidemic prediction, smart medical care, and health management. It can also help us interpret DNA and uncover more of the secrets of life.
  • In sports and entertainment, big data can help us train teams, decide which film and television works to shoot, and predict the outcome of games.
  • In the security field, governments can use big data technology to build a strong national security system, companies can use big data to defend against cyber attacks, and the police can use big data to prevent crime.
  • In personal life, the "personal big data" associated with each person can be used to analyze personal habits and behavior and to provide more thoughtful, personalized services.

The value of big data goes far beyond that. The penetration of big data into all walks of life has greatly promoted social production and life, and it will have a major and far-reaching impact in the future.

4 Big data positions

  1. Data analyst
    Is familiar with the relevant business, is proficient in building data analysis frameworks, and masters the relevant analysis tools and basic analysis methods. Collects, organizes, and analyzes data, and uses the conclusions to provide guidance and analysis for management, sales, and operations.
  2. Data architect
    Guides the entire life cycle of a Hadoop solution, including requirements analysis, platform selection, technical architecture design, application design and development, testing, and deployment. Has an in-depth grasp of how to write MapReduce jobs and manage job flows to complete data computation, can use the general algorithms provided by Hadoop, and is proficient in the important components of the Hadoop ecosystem, such as Yarn, HBase, Hive, and Pig. Also develops platform monitoring and auxiliary operations and maintenance systems.
  3. Big data engineer
    Collects and processes large-scale raw data (including writing scripts, scraping web pages, calling APIs, and writing SQL queries); processes unstructured data into a form suitable for analysis and then analyzes it; and carries out analysis for business decisions as required by the project.
  4. Data warehouse manager
    Defines and implements information management strategies; coordinates and manages information management solutions; scopes, plans, and prioritizes multiple projects; and manages all aspects of the warehouse, such as data sourcing, migration, quality, design, and implementation.
  5. Database manager
    Improves the effectiveness of database tools and services; ensures that all data complies with legal requirements; ensures that information is protected and backed up; produces regular reports; monitors database performance; improves the technology in use; sets up new databases; checks data entry procedures; and troubleshoots problems.
  6. Business intelligence analyst
    Disseminates information about tools, reports, or metadata enhancements; conducts or coordinates tests to ensure that intelligence is consistent with defined requirements; uses business intelligence tools to identify or monitor existing and potential customers; synthesizes current business intelligence or trend data to support recommendations for action; maintains or updates business intelligence tools, databases, dashboards, systems, or methods; and manages the timely flow of business intelligence to users.

5 Core concepts

5.1 Concept

Big data refers to a collection of data that cannot be captured, managed, and processed with conventional software tools within an acceptable time frame. It is a massive, fast-growing, and diversified information asset that requires new processing models in order to deliver stronger decision-making power, insight discovery, and process optimization capabilities.

The term is usually used to describe the large volumes of unstructured and semi-structured data that a company creates. Loading this data into a relational database for analysis would cost a great deal of time and money.

  • The core technology for processing massive data:
    1. Storage of massive data:
      1. Distributed file system storage
        1. HDFS
    2. Operational processing of massive data:
      1. Distributed computing framework
        1. MapReduce, Spark, Flink, etc.
  • What is distributed
    1. Distributed storage means storing a file across many machines. A file system does this on our behalf. To the user it looks like an ordinary directory tree with a unified path, but that path is independent of the real paths on the individual machines. When a file is placed at some path in the file system, the system splits it into blocks and stores the blocks on different machines, without the user needing to know where each block is kept.
  • Storage framework
    1. Distributed File Storage System HDFS
    2. Distributed database systems: HBase, Elasticsearch, MongoDB
  • Calculation framework
    1. The core problem to be solved is to distribute the user's computing logic on multiple machines for parallel computing
    2. MapReduce computing framework - the computing framework in Hadoop
    3. Spark computing framework - offline batch processing and real-time stream processing
    4. Storm computing framework - real-time stream processing
  • Auxiliary tools
    1. Hive - data warehouse tool: accepts SQL and translates SQL statements into MapReduce or Spark programs for execution
    2. Flume - data collection
    3. Sqoop - data migration
    4. ElasticSearch - distributed data search engine
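
The description of distributed storage above (one unified path, with the file split into blocks that live on different machines) can be illustrated with a toy model. This is a minimal in-memory sketch and not how HDFS is implemented; `TinyDFS`, `split_into_blocks`, the tiny 8-byte block size, and the round-robin placement are all assumptions made for illustration. Real HDFS uses far larger blocks and also keeps multiple replicas of each block.

```python
from itertools import cycle


def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut the file content into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class TinyDFS:
    """Toy distributed file system: one logical path, blocks on many machines."""

    def __init__(self, machines: list[str], block_size: int = 8):
        self.block_size = block_size
        self.storage = {m: {} for m in machines}  # machine -> {block_id: bytes}
        self.namespace = {}                       # path -> [(machine, block_id)]
        self._next_machine = cycle(machines)      # round-robin placement (assumption)

    def put(self, path: str, data: bytes) -> None:
        """Split the data into blocks and spread them over the machines."""
        locations = []
        for n, block in enumerate(split_into_blocks(data, self.block_size)):
            machine = next(self._next_machine)
            block_id = f"{path}#blk{n}"
            self.storage[machine][block_id] = block
            locations.append((machine, block_id))
        self.namespace[path] = locations

    def get(self, path: str) -> bytes:
        """Reassemble the file; the caller never sees where the blocks live."""
        return b"".join(self.storage[m][b] for m, b in self.namespace[path])


dfs = TinyDFS(["node1", "node2", "node3"])
content = b"a fairly small example file"
dfs.put("/logs/access.log", content)

print(dfs.namespace["/logs/access.log"])  # which machine holds which block
print(dfs.get("/logs/access.log"))        # the original content, reassembled
```

The user only ever deals with the path `/logs/access.log`; the mapping from path to block locations stays inside the file system, which is the point the text makes.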

5.2 Core technology

1) Sqoop: Sqoop is an open source tool mainly used to transfer data between Hadoop (including Hive) and traditional relational databases such as MySQL and Oracle. It can import data from a relational database into Hadoop's HDFS, and it can also export HDFS data into a relational database.

2) Flume: Flume is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting massive volumes of log data, provided by Cloudera. Flume supports customizing various data senders in the log system to collect data. It also provides the ability to do simple processing on the data and write it to various (customizable) data receivers.

3) Kafka: Kafka is a high-throughput distributed publish-subscribe messaging system with the following characteristics:

(1) Message persistence through O(1) disk data structures, which maintain stable performance over the long term even with terabytes of stored messages.

(2) High throughput: even on very ordinary hardware, Kafka can support millions of messages per second.

(3) Support for partitioning messages across Kafka servers and clusters of consumer machines.

(4) Support for parallel data loading into Hadoop.
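
The ideas behind characteristics (1) and (3) can be sketched with a toy in-memory model: each partition is an append-only log, publishing is a constant-time append, and each consumer simply remembers an offset per partition. This is an illustration only and does not use the Kafka client API; `TinyTopic`, `TinyConsumer`, and the CRC32-based key hashing are hypothetical choices made for this sketch.

```python
from collections import defaultdict
from zlib import crc32


class TinyTopic:
    """Toy topic: a fixed number of partitions, each an append-only list."""

    def __init__(self, num_partitions: int):
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key: str, value: str) -> tuple[int, int]:
        """Append a message; the key decides the partition. Returns (partition, offset)."""
        partition = crc32(key.encode()) % len(self.partitions)
        self.partitions[partition].append(value)  # append only, never rewritten
        return partition, len(self.partitions[partition]) - 1

    def read(self, partition: int, offset: int) -> list[str]:
        """Return all messages in a partition starting at the given offset."""
        return self.partitions[partition][offset:]


class TinyConsumer:
    """Toy subscriber that tracks how far it has read in each partition."""

    def __init__(self, topic: TinyTopic):
        self.topic = topic
        self.offsets = defaultdict(int)  # partition -> next offset to read

    def poll(self, partition: int) -> list[str]:
        messages = self.topic.read(partition, self.offsets[partition])
        self.offsets[partition] += len(messages)
        return messages


topic = TinyTopic(num_partitions=3)
p1, o1 = topic.publish("user-42", "login")
p2, o2 = topic.publish("user-42", "click")  # same key, same partition, next offset

consumer = TinyConsumer(topic)
first = consumer.poll(p1)   # everything published so far in that partition
topic.publish("user-42", "logout")
second = consumer.poll(p1)  # only the message published since the last poll

print(first, second)
```

Because messages are kept in the log rather than deleted on delivery, several consumers can read the same partition independently, each with its own offsets.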

4) Storm: Storm is used for "continuous computation": it runs continuous queries over data streams and, as it computes, outputs the results to users in the form of a stream.
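
As a rough illustration of "continuous computation", the sketch below chains Python generators so that a result is emitted for every incoming item instead of once at the end of a batch. It does not use Storm itself; the three stage functions are hypothetical and only loosely analogous to a Storm source stage (spout) and processing stages (bolts).

```python
from collections import Counter
from typing import Iterable, Iterator


def sentence_source() -> Iterator[str]:
    """Stand-in for an unbounded stream; here just three sentences."""
    yield "big data"
    yield "big cluster"
    yield "data stream"


def split_words(sentences: Iterable[str]) -> Iterator[str]:
    """Processing stage 1: turn each sentence into individual words."""
    for sentence in sentences:
        yield from sentence.split()


def running_count(words: Iterable[str]) -> Iterator[tuple[str, int]]:
    """Processing stage 2: emit the updated count as soon as a word arrives."""
    counts = Counter()
    for word in words:
        counts[word] += 1
        yield word, counts[word]


# Results flow out one by one while the input is still being consumed.
results = []
for update in running_count(split_words(sentence_source())):
    results.append(update)
    print(update)
```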

5) Spark: Spark is currently one of the most popular open source in-memory computing frameworks for big data. It can run computations on big data stored in Hadoop.

6) Oozie: Oozie is a workflow scheduling and management system for managing Hadoop jobs.

7) HBase: HBase is a distributed, column-oriented open source database. Unlike a general relational database, it is well suited to storing unstructured data. It is a NoSQL ("not only SQL") database.
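
To make "column-oriented" and "suitable for unstructured data" more concrete, here is a toy model in which every row is a sparse map from `family:qualifier` column names to values, so different rows can hold entirely different columns. It is a sketch under stated assumptions, not the HBase client API; `TinyTable` and its methods are hypothetical, and features such as cell versions and timestamps are left out.

```python
from collections import defaultdict
from typing import Optional


class TinyTable:
    """Toy wide-column table: row key -> {"family:qualifier": value}."""

    def __init__(self, column_families: list[str]):
        # Column families are fixed up front; qualifiers are free-form per row.
        self.column_families = set(column_families)
        self.rows = defaultdict(dict)

    def put(self, row_key: str, column: str, value: str) -> None:
        family, _, _qualifier = column.partition(":")
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][column] = value

    def get(self, row_key: str, family: Optional[str] = None) -> dict:
        """Return one row, optionally restricted to a single column family."""
        row = self.rows.get(row_key, {})
        if family is None:
            return dict(row)
        return {c: v for c, v in row.items() if c.startswith(family + ":")}


users = TinyTable(["info", "stats"])
users.put("user1", "info:name", "Alice")
users.put("user1", "stats:logins", "17")
users.put("user2", "info:email", "bob@example.com")  # different columns than user1

print(users.get("user1", family="info"))
print(users.get("user2"))
```

No schema change is needed when `user2` stores a column that `user1` lacks, which is the main contrast with a fixed-column relational table.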

8) Hive: Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides simple SQL query functionality by converting SQL statements into MapReduce tasks for execution. Its advantages are a low learning cost and the ability to implement simple MapReduce statistics quickly through SQL-like statements, without developing dedicated MapReduce applications, which makes it well suited to statistical analysis in a data warehouse.
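
To give a feel for what "converting SQL statements into MapReduce tasks" means, the sketch below expresses a query like `SELECT page, COUNT(*) FROM visits GROUP BY page` as explicit map, shuffle, and reduce steps in plain Python. It is a conceptual illustration only, not Hive's actual query planner; the `visits` data and the three function names are made up for this example.

```python
from collections import defaultdict

# Hypothetical input table "visits" with two columns: user and page.
visits = [
    {"user": "u1", "page": "/home"},
    {"user": "u2", "page": "/cart"},
    {"user": "u1", "page": "/home"},
    {"user": "u3", "page": "/home"},
]


def map_phase(record: dict):
    """Map: emit (group-by key, 1) for every input row."""
    yield record["page"], 1


def shuffle(pairs):
    """Shuffle: bring all values that share a key together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list):
    """Reduce: aggregate the values of one key, here COUNT(*)."""
    return key, sum(values)


mapped = [pair for record in visits for pair in map_phase(record)]
result = dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())

print(result)
```

In a real cluster the map and reduce steps would run in parallel on many machines, with the shuffle moving data between them over the network; the user only writes the SQL.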

10) R language: R is a language and environment for statistical analysis and graphics. R is free, open source software that belongs to the GNU system. It is an excellent tool for statistical computing and statistical graphics.

11) Mahout: Apache Mahout is a scalable machine learning and data mining library.

12) ZooKeeper: ZooKeeper is an open source implementation of Google's Chubby. It is a reliable coordination system for large-scale distributed systems, providing functions such as configuration maintenance, naming services, distributed synchronization, and group services. ZooKeeper's goal is to encapsulate key services that are complex and error-prone, and to offer users a simple, easy-to-use interface backed by a system with high performance and stable functionality.

Origin blog.csdn.net/qq_37933018/article/details/107173898