Big Data (4) Mainstream Big Data Technology


1. A Few Words Up Front


To the good people who are being hurt and worn down:

   There are things we cannot choose, and hurt we cannot always avoid.

   But please remember, at all times:

   You may seem worthless to some people, and they may hurt you badly,

   but in the hearts of others you are a priceless treasure, and they will hold you more important than themselves.

   Just be yourself; there is no need to deliberately change anything. The people who love you will love you as you are.

   Remember your worth! You live in your own world, and in the hearts of those who love you!

   Stay clear-headed in adversity

2023.8.27


2. Big Data Technology

Mainstream big data technologies fall into two categories.

   The first category targets non-real-time, batch-processing business scenarios. It focuses on the storage, processing, analysis, and application of TB- to PB-scale data that traditional data-processing technology cannot handle within practical limits of time and space: user behavior analysis, order anti-fraud analysis, customer churn analysis, data warehousing, and so on. These scenarios do not require a real-time response. Typically, an organization extracts the day's data into the big data platform after business closes at night, obtains the computed results within a few hours, and uses them for the next day's business. The mainstream supporting technologies are HDFS, MapReduce, Hive, and the like.

   The second category targets real-time processing scenarios, such as Weibo applications, real-time social networking, and real-time order processing. These scenarios demand a strongly real-time response: when a user issues a request, the system must respond within a few seconds while guaranteeing data integrity. The mainstream supporting technologies here are HBase, Kafka, Storm, and the like.

(1)HDFS

 

   HDFS, the Hadoop Distributed File System, is a core component of the Apache Hadoop project. It is a distributed file system with high availability, high reliability, high scalability, and high fault tolerance. It can store and process large-scale data sets on a group of inexpensive computers, achieving parallel processing by distributing data and computing tasks across many nodes.

   HDFS is the core sub-project of Hadoop and the foundation for data storage and access across the whole Hadoop platform. It is layered on top of each node's native Linux file system, and other sub-projects such as MapReduce and HBase run on top of it. It is a distributed file system that is easy to use and manage.

   HDFS is a highly fault-tolerant system designed to be deployed on inexpensive machines. It provides high-throughput data access and is well suited to applications over large data sets. HDFS relaxes some POSIX constraints to allow streaming access to file system data.

HDFS technology has the following characteristics:

   1. Large-scale storage: HDFS handles data sets at the PB level and supports distributed storage and management of data files.
   2. High reliability: HDFS stores data redundantly, distributing it across nodes, so the failure of any single server does not affect data integrity or availability.
   3. High scalability: HDFS can run on hundreds of machines and supports dynamic expansion, so users can grow the cluster along with their data volume.
   4. High fault tolerance: HDFS keeps multiple replicas of each data block; if one node goes down, other nodes continue to serve the data.
   5. Efficiency: HDFS supports batch reads and writes and provides an efficient transfer mechanism for moving and processing data quickly within the cluster.
   6. High throughput: HDFS is optimized for large sequential reads and writes rather than low-latency random access.
   7. Streaming access: HDFS relaxes some POSIX requirements so that data in the file system can be accessed as a stream.

   In short, HDFS is one of the foundational technologies for big data storage, management, and processing, and it provides an efficient and reliable storage solution for enterprises across industries.
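The two mechanisms described above, fixed-size block splitting and block replication, can be sketched in a few lines of Python. This is a toy illustration only: the function names, the tiny block size, and the round-robin replica placement are invented for demonstration and are not part of any real HDFS API (real HDFS uses 128 MB blocks and rack-aware placement).

```python
# Toy sketch of HDFS-style block splitting and replica placement.
BLOCK_SIZE = 4  # real HDFS defaults to 128 MB; tiny here for demonstration

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Split a byte string into fixed-size blocks, as an HDFS client does."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes (round-robin
    here; real HDFS uses rack-aware placement)."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello hdfs world")
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

Because every block lives on three distinct nodes, losing any single node leaves at least two copies of each block available, which is the redundancy property described in point 2 above.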

(2)MapReduce

 

   MapReduce is a software framework for processing massive data in parallel on clusters built from thousands of commodity machines, with high stability and fault tolerance. MapReduce greatly simplifies complex processing logic by abstracting it into Mapper and Reducer classes: once the problem is understood, its logic is recast into a pattern that the MapReduce functions can express.

   A MapReduce job divides the input data set into independent splits, which map tasks process in a fully parallel, independent fashion. The framework sorts the outputs of the maps, and the sorted data becomes the input of the reduce tasks. Both the input and output of a job are stored in HDFS. The framework handles job scheduling, monitors jobs, and re-executes failed tasks.

   MapReduce is a distributed computing framework for large-scale data processing.

The MapReduce software architecture can be divided into the following three levels :

   ♦ Application layer: MapReduce application developers use the MapReduce API to write programs, decomposing a problem into tasks that can run in parallel. These tasks fall into two stages, map and reduce. In the map stage, the data is split into small pieces and processed as key-value pairs; the processed pairs are then grouped and merged by key, and the reduce stage produces the desired result.

   ♦ Computing layer: a MapReduce computing cluster consists of two types of nodes: a Master node and a group of Worker nodes. The Master node coordinates the whole computation, including dividing tasks, monitoring Worker nodes, and managing data transfer. Worker nodes execute the tasks assigned to them and return results to the Master. Each node in the computing layer may be a physical machine or a virtual machine, and all of them can communicate across the system.

   ♦ Storage layer: The MapReduce storage layer uses Hadoop Distributed File System (HDFS) to store large amounts of data. HDFS is a scalable and fault-tolerant file system that can replicate data to different nodes to ensure data reliability. HDFS provides efficient data storage and management methods for MapReduce.

   The above are the three levels of MapReduce software architecture. MapReduce enables efficient processing of large-scale data by breaking data into small pieces and distributing tasks among computing nodes.
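The map-group-reduce pattern described above can be shown with the classic word-count example. The sketch below is an in-memory illustration of the computational model, not the Hadoop API: `map_phase`, `shuffle`, and `reduce_phase` are invented names standing in for the mapper, the framework's sort/shuffle step, and the reducer.

```python
from collections import defaultdict

def map_phase(line):
    # Mapper: emit a (word, 1) pair for every word in an input split
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reducer: aggregate all the counts emitted for one word
    return (key, sum(values))

lines = ["big data big ideas", "data beats ideas"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

In a real cluster each input line (split) could be mapped on a different Worker node, and each key group reduced on yet another, which is exactly why the model parallelizes so well.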

(3)YARN

 

    Apache Hadoop YARN (Yet Another Resource Negotiator) is the resource management and application scheduling framework introduced in Hadoop 0.23. On top of YARN, many kinds of applications can run, such as MapReduce, Spark, and Storm. YARN itself no longer manages specific applications: resource management and application management are two loosely coupled modules.

   In a sense, YARN is a cloud operating system (Cloud OS). On this operating system, programmers can build a wide variety of applications, such as batch MapReduce programs, Spark programs, and streaming Storm jobs, all sharing the data and computing resources of the same Hadoop cluster.

   YARN is the resource manager of the Hadoop ecosystem. Its main job is to manage and schedule cluster resources in a unified way, allocating them among multiple applications and improving the resource utilization of Hadoop clusters. As a key component of Hadoop 2.0, YARN greatly expands Hadoop's application scenarios and supports multiple computing models, including MapReduce, Spark, and Storm.

The main functions of YARN include:

   1. Resource management: YARN can manage the resources of different nodes in the cluster and allocate resources to different applications to ensure the normal operation of the applications.

   2. Scheduling management: YARN schedules cluster resources according to configured policies and the requirements of each application, ensuring that applications share resources fairly.

   3. Application management: YARN can automatically manage the lifecycle of applications, including operations such as application startup, monitoring, restart and shutdown.

   4. Security management: YARN can provide powerful security management functions, including user authentication, authorization, and data encryption, to ensure the security and stability of the cluster.

   In short, as a key component of the Hadoop ecosystem, YARN provides reliable resource management and scheduling for running many kinds of applications, and it is widely used across industries such as the Internet, finance, and healthcare.
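The resource-management idea in point 1 above can be sketched as a toy scheduler: a manager tracks free capacity per node and grants container requests against it. The `ResourceManager` class, its `allocate` method, and first-fit placement are all invented for illustration; real YARN schedulers (Capacity, Fair) are far more sophisticated.

```python
class ResourceManager:
    """Toy YARN-style allocator: grants memory 'containers' on cluster nodes."""

    def __init__(self, nodes):
        # nodes: {node_name: free_memory_mb}
        self.free = dict(nodes)

    def allocate(self, app_id, memory_mb):
        """Grant a container on the first node with enough free memory
        (first-fit); return None if no node can satisfy the request."""
        for node, free in self.free.items():
            if free >= memory_mb:
                self.free[node] -= memory_mb
                return (app_id, node, memory_mb)
        return None  # a real scheduler would queue the request instead

rm = ResourceManager({"node1": 2048, "node2": 1024})
c1 = rm.allocate("mapreduce-job", 1536)  # fits on node1
c2 = rm.allocate("spark-job", 1024)      # node1 is too full now; lands on node2
```

The point of the sketch is the separation YARN introduced: applications ask for abstract containers of resources, and the scheduler decides where they run.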

(4)HBase

 

   HBase is an important non-relational database in the Hadoop platform. It can support PB-level data storage and processing capabilities through linearly scalable deployment.

   As a non-relational database, HBase is suitable for unstructured data storage, and its storage mode is based on columns.

   HBase is an open source distributed NoSQL database; within the Hadoop ecosystem it plays the role of Hadoop's database, alongside MapReduce and HDFS. Modeled on Google's BigTable paper, HBase is a highly reliable, highly scalable, high-performance database designed to run over large-scale data sets. HBase has the following characteristics:

   It uses column families as the basic storage unit and supports dynamic columns.

   It supports automatic partitioning (region splitting), automatic load balancing, and automatic failover.

   It handles semi-structured and unstructured data, with no fixed table schema.

   It supports highly concurrent reads and writes, multi-version data, data compression, and data caching.

   HBase is highly scalable: it can hold hundreds of billions of rows, and each row can have tens of thousands of columns.

   HBase is commonly used to store semi-structured data such as logs, social media data, sensor data, network data, images, and audio. It can dynamically adjust its storage and processing capacity as needed and can support query and analysis over large-scale data sets, making it an ideal choice for big data processing.
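The data model behind the characteristics above (column families, dynamic columns, multi-version cells) can be sketched with a plain dictionary: each cell is addressed by row key and `family:qualifier`, and keeps timestamped versions. The `ToyHBaseTable` class and its `put`/`get` methods are invented stand-ins, not the real HBase client API.

```python
class ToyHBaseTable:
    """Toy model of HBase's (row, column family:qualifier, timestamp) cells."""

    def __init__(self):
        # (row_key, "family:qualifier") -> list of (timestamp, value), newest first
        self.cells = {}

    def put(self, row, column, value, ts):
        versions = self.cells.setdefault((row, column), [])
        versions.append((ts, value))
        versions.sort(reverse=True)  # keep the newest version first

    def get(self, row, column):
        """Return the newest version of a cell, as HBase does by default."""
        versions = self.cells.get((row, column))
        return versions[0][1] if versions else None

t = ToyHBaseTable()
t.put("user1", "info:city", "Beijing", ts=1)
t.put("user1", "info:city", "Shanghai", ts=2)  # a newer version of the same cell
```

Note what the model does not require: no schema declares `info:city` in advance (dynamic columns), and the old `Beijing` version is still retained alongside the new one (multi-version data).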

(5)Hive

   Apache Hive is a data warehouse built on top of the Hadoop architecture, providing data refinement, querying, and analysis. Hive was originally developed by Facebook and is now used and developed by other companies as well, such as Netflix.

   Hive is an open source framework under the Apache Foundation and a Hadoop-based data warehouse tool. It maps structured data files onto warehouse tables and provides a simple SQL (Structured Query Language) query capability; the SQL statements are converted into MapReduce jobs for execution.

   Hive meets the needs of database administrators and analysts who know SQL but not MapReduce, letting them work on the big data analysis platform smoothly.

   As a data warehouse infrastructure, Hive maps structured data files onto database tables and provides a SQL-like query language, HiveQL (HQL), for querying the data. Hive is designed to make large data sets accessible to SQL developers: it converts HiveQL into MapReduce jobs for execution, using the Hadoop cluster to process massive data.

   Hive supports multiple data sources, including HDFS, HBase, and local file systems. It can store massive data in built-in storage formats such as plain text, serialized formats, and ORC, and it provides features such as data compression, partitioning, and bucketing to optimize performance.

   Hive also has good extensibility and a rich ecosystem. Its functionality can be extended through UDFs (user-defined functions) and UDAFs (user-defined aggregate functions), and it integrates with many third-party tools through interfaces such as JDBC and ODBC, including Tableau.

   In short, Hive is a powerful data warehouse tool, which has great practical value for scenarios that need to process large amounts of data.
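To show the kind of SQL-style aggregation Hive is built for without standing up a Hadoop cluster, the sketch below runs an equivalent `GROUP BY` with Python's built-in sqlite3. The table and column names are invented for the example; in Hive, the same statement would be compiled into a MapReduce job scanning files on HDFS rather than executed by an embedded engine.

```python
import sqlite3

# An in-memory SQLite database stands in for the warehouse here; only the
# SQL itself illustrates what HiveQL expresses.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (user_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("u1", 10.0), ("u1", 5.0), ("u2", 7.5)],
)

# In Hive, this GROUP BY would become a map stage (emit user_id/amount
# pairs) plus a reduce stage (sum per user) over the cluster.
totals = dict(conn.execute(
    "SELECT user_id, SUM(amount) FROM orders GROUP BY user_id"))
```

This is exactly the value proposition described above: the analyst writes one declarative statement, and the engine, whether SQLite locally or Hive over a cluster, decides how to execute it.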

(6)Kafka

 

   Apache Kafka is a distributed "publish-subscribe" messaging system, originally developed at LinkedIn and later donated to the Apache Software Foundation. By design, Kafka is a fast, scalable, inherently distributed, partitioned, and replicated commit-log service.

   Kafka is a distributed system that scales out easily. It provides high throughput for both publishing and subscribing, supports multiple subscribers, and automatically rebalances consumers when one fails. It is also well suited to real-time workloads.

   As a distributed stream processing platform, Kafka is characterized by high reliability, high scalability, and high throughput. Built on the publish/subscribe model, it is mainly used to record streaming data such as logs, events, and metrics.

Kafka's architecture includes the following components:

Broker: each node in a Kafka cluster is called a broker and is responsible for storing and serving data.

Topic: data records are stored in one or more topics. Each topic is divided into multiple partitions, and the partitions can be distributed across different brokers.

Producer: a producer sends data to a topic in the Kafka cluster and can specify which partition each record goes to.

Consumer: a consumer subscribes to topics in the Kafka cluster and processes their data. Consumers can form consumer groups; the consumers in a group jointly consume the data of one or more partitions.

Kafka is widely used in various scenarios, such as log collection, real-time data stream processing, event-driven architecture, etc. It is closely integrated with open source technologies such as Hadoop and Spark, and has become an indispensable part of the big data ecosystem.
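The broker/topic/partition/offset vocabulary above can be sketched as a toy in-memory broker. Everything here is invented for illustration: a partition is just an append-only list, the partitioner is a trivial byte sum rather than Kafka's real murmur2 hash, and each consumer group keeps its own read offset per partition.

```python
class ToyBroker:
    """Toy sketch of Kafka's log-structured publish/subscribe model."""

    def __init__(self, num_partitions=2):
        self.partitions = [[] for _ in range(num_partitions)]  # append-only logs
        self.offsets = {}  # (group, partition) -> next offset for that group

    def produce(self, key, value):
        # Records with the same key land in the same partition, which is how
        # Kafka preserves per-key ordering (partitioner simplified here).
        p = sum(key.encode()) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p

    def consume(self, group, partition):
        """Return the next unread record for this consumer group, or None."""
        off = self.offsets.get((group, partition), 0)
        if off >= len(self.partitions[partition]):
            return None
        self.offsets[(group, partition)] = off + 1
        return self.partitions[partition][off]

broker = ToyBroker()
p = broker.produce("user1", "click")
broker.produce("user1", "purchase")  # same key, so same partition, in order
```

Two details of the real system show through even in the toy: records are never removed when read (the log is retained), and two consumer groups reading the same partition each see the full stream independently.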
 

(7)Storm

   Storm is a free, open source, distributed, highly fault-tolerant real-time computation system. It handles continuous stream-computing tasks and is widely used in real-time analytics, online machine learning, ETL, and other fields.

   Storm is an open source distributed real-time computing system, mainly used to process a large amount of streaming data. It can acquire data in real time, process it, and send the processed data to other systems. Storm is highly scalable, fault-tolerant, and reliable, and can run in distributed clusters.

   The core concept in Storm is the topology. A topology describes how a data stream is processed and is composed of spouts and bolts. A spout is the data-source component, responsible for feeding data into the topology. A bolt receives data from a spout or another bolt, processes it, and passes the result on to the next bolt or to a sink. Each bolt in a topology can run in parallel, making data processing more efficient.

   Storm also has a built-in fault tolerance mechanism, which can automatically restart or switch to other nodes to run when a cluster node fails, realizing highly reliable distributed computing. At the same time, Storm also supports multiple data sources (such as Kafka, RabbitMQ, etc.) and data storage (such as HDFS, Cassandra, Redis, etc.), can process different types of data, and store the results in different data storages.

   In short, Storm is a powerful real-time computing framework that is widely used in real-time data processing in various industries.
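The spout-to-bolt pipeline described above can be sketched as a chain of Python generators: one function plays the spout, each downstream function plays a bolt. This only illustrates the dataflow shape; real Storm runs each component as parallel tasks across a cluster, and the function names here are invented.

```python
def sentence_spout():
    # Spout: the data source feeding tuples into the topology
    for sentence in ["storm processes streams", "streams of tuples"]:
        yield sentence

def split_bolt(stream):
    # Bolt: split each sentence tuple into word tuples
    for sentence in stream:
        for word in sentence.split():
            yield word

def count_bolt(stream):
    # Bolt: keep a running count per word. In real Storm, a "fields
    # grouping" on the word would route each word to the same task.
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Wire the topology: spout -> split bolt -> count bolt
word_counts = count_bolt(split_bolt(sentence_spout()))
```

Unlike the batch MapReduce word count, nothing here waits for the input to end before emitting downstream: each tuple flows through the chain as it is produced, which is the essence of the streaming model.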

Comparison of Storm and Hadoop

   Structure             Hadoop         Storm
   Master node           JobTracker     Nimbus
   Slave node            TaskTracker    Supervisor
   Application           Job            Topology
   Worker process name   Child          Worker
   Computational model   Map / Reduce   Spout / Bolt
  


Origin blog.csdn.net/weixin_69553582/article/details/132516011