The key technology of big data - big data collection

Contents of this article:

1. Foreword

2. The concept of big data collection

3. Big data collection steps

3.1. Big data collection steps (overall perspective)

3.2. Big data collection steps (data set perspective)

3.3. Big data collection steps (data set perspective)

4. The relationship between data sources and data types

4.1. Big data system data

4.2. The relationship between data sources and data types

5. Data types in the big data system

5.1. Structured data

5.2. Semi-structured data

5.3. Unstructured data

6. Big data collection technology

6.1. Web data collection

6.2. System log collection

6.3. Database acquisition

6.4. Other data (collection from sensing devices and similar sources)

7. Big data collection methods

8. Features of big data collection


1. Foreword

        I live in a small house, and my routine of shuttling between two places leaves me almost isolated from the outside world. I have also been lucky: the people around me are kind, and whatever they do, they would not hurt me on purpose. So I never expected to be drawn into disputes, to be deliberately misunderstood, and then to end up in mutual resentment. Some people quote out of context or falsify things in other ways, for example by playing a clipped segment of a recording made in a particular situation so that listeners misunderstand and get angry, or even by using high-tech emotional manipulation to push you into an emotional collapse and then letting others watch and say, "That is what this person is like." There are endless ways to smear someone.

        In an environment like this, the ability to self-regulate is important. Some time ago I posted some personal thoughts on a senior blogger's blog, and I record them here:

       Personally, I think that in the new environment it helps to add some "resilience". That is, no matter how the world changes (people, events, things), you can withstand blows, keep your inner peace and your ability to self-regulate, handle all kinds of accidents calmly and properly, and keep everything moving forward as normal.


       I find "resilience" hard to explain. It is what I have taken from the abnormal blows and attacks I have faced recently: whatever blow or difficulty I run into, I try to return to my original state. I will not change my character because of it, get upset because of it, blame others because of it, or give up what I want to do because of it.


       "Resilience" does not mean violating the laws of nature or deliberately hurting oneself. How should I put it? It is closer to a personal ability to withstand blows. Sometimes our sense of security is beyond our control, and in special circumstances the only answer is the resilience I just described: a resilience that cannot be beaten down, that relies on self-regulation to return to the original state, look past difficulties, and give yourself a sense of security.
For example:
          1. Someone picks a quarrel with you for no reason today and says ugly things to provoke you. You need to learn to regulate yourself.
          2. Many difficult things pile up, some of them beyond what you could originally bear. When you are exhausted and close to collapse, and no one around you understands, you have to heal all the pain yourself.
          3. Something you thought was perfectly planned suddenly runs into an inexplicable accident... How do you adjust and clean up the mess in the shortest time?
          4. Family and friends have problems of their own (disputes, illness, even unreasonable quarrels), your own project is urgent and going badly, and everything seems to go wrong when you are tired... The future looks confusing and you are at a loss. How do you wipe away the tears and face life with a smile?
          ...


        Being honed again and again through collapse and self-healing is hard, and I am still learning, but when I return I will still be the boy I was back then.


        Keep going in the rain. No one knows how fate will turn in the next second. The bitter whispers behind my back are a baptism for growing up. A self that cannot be copied lets me keep my style even when I am hurt. This is not temper, but ambition and courage. You can push me off a cliff, and I can learn to fly. I take orders from no one; I am independent, and my ears are for listening to my own heart!

        PS:

       To those friends who care about me:

       I'm ok, don't worry, I'm still the same person I was, I haven't changed. I have a lot of things I want to say, but I can't say them. I can only say one sentence: Please take care of yourselves. I am still waiting for the day when the weather clears up, and you will tell me what happened at that time.

  Sober in adversity

2023.9.5

2. The concept of big data collection

        By field, the key technologies of big data generally include big data collection, big data preprocessing, big data storage and management, big data processing, big data analysis and mining, and big data display and application (big data retrieval, big data visualization, big data application, big data security, etc.).

        Big data collection is the first stage of the big data life cycle and the cornerstone of the big data industry. It is also the entry point to big data analysis and a crucial part of it.

        Big data collection refers to the process of collecting and organizing large amounts of data through various technical means. The collected data can come from many sources and can be structured, semi-structured, or unstructured, for example website data, social media data, emails, log files, sensor readings, and enterprise application data.

        Collection usually relies on a range of tools and platforms, such as web crawlers, data mining, and natural language processing.

        In big data applications, collection is a very important part of the whole data processing pipeline. The ability to collect useful information from massive data is one of the key factors in the development of big data.

3. Big data collection steps

3.1. Big data collection steps (overall perspective)

Big data collection usually includes the following steps:

  1. Determine the scope and purpose of data collection: determine the time, location, data type, data format, and data volume of data collection.
  2. Adopt appropriate technology for different data sources: for example, collecting data from sensors may require the use of IoT technology, and collecting data from social media may require the use of APIs.
  3. Design data collection and processing process: including data extraction, transformation and loading (Extract, Transform, Load, ETL for short).
  4. Ensure data accuracy and integrity: ensure data quality by cleaning, deduplicating, formatting, and other operations.
  5. Data storage: store the collected data in an appropriate database or data warehouse for subsequent data analysis and application.
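
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample CSV text, the `orders` table, and the function names `extract`, `transform`, and `load` are all hypothetical, and an in-memory SQLite database stands in for whatever database or data warehouse a real project would use.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; a real source could be a file, an API, or a queue.
RAW_CSV = """user_id,city,amount
1, Beijing ,10.5
2,Shanghai,
1, Beijing ,10.5
3,shenzhen,7
"""

def extract(text):
    """Extract: read raw rows from the CSV text."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: normalize formatting, drop incomplete rows, deduplicate."""
    seen, clean = set(), []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:  # cleaning: skip rows with a missing amount
            continue
        record = (int(row["user_id"]), row["city"].strip().title(), float(amount))
        if record in seen:  # deduplication
            continue
        seen.add(record)
        clean.append(record)
    return clean

def load(records, conn):
    """Load: store the cleaned records in a table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (user_id INTEGER, city TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
stored = conn.execute(
    "SELECT user_id, city, amount FROM orders ORDER BY user_id"
).fetchall()
print(stored)
```

Keeping extract, transform, and load as separate functions makes it easier to swap the data source or the storage target without touching the cleaning rules.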

3.2. Big data collection steps (data set perspective)

3.3. Big data collection steps (data set perspective)

From the data set perspective, big data collection goes through the following steps:

Collection requirements, rule configuration, task scheduling, task monitoring, data collation, data release, data transaction, data delivery

        During big data collection, privacy and data security must be taken into account so that the collected data is not stolen or misused.

4. The relationship between data sources and data types

4.1. Big data system data

        In a big data system, traditional data is divided into business data and industry data. New data sources, which traditional data systems did not take into account, include content data, online behavior data, and offline behavior data.

New data sources include:

▷ Online behavior data: page data, interaction data, form data, session data, etc.

▷ Offline behavior data: data gathered by technologies ranging from face recognition and fingerprint recognition, which capture biometric features, to WiFi probes and iBeacon, which capture device characteristics. All of these technologies aim to collect and analyze offline data.

▷ Content data: application logs, electronic documents, machine data, voice data, social media data, etc.

        Each recognition technology plays its own role in different fields. These technologies can run as standalone systems or be integrated in various ways.

4.2. The relationship between data sources and data types

        Different types of data call for different processing methods and technologies, for example distributed frameworks such as Hadoop and Spark for processing large data sets, and machine learning algorithms for classifying and labeling unstructured data. Understanding the relationship between data sources and data types is therefore very important in big data processing.

        In the big data system, the relationship between data sources and data types is shown in the following figure:

        In the big data system, data sources and data types are closely related.

        A data source is the starting point of big data storage and processing. Data sources include sensors, websites, social media, IoT devices, mobile applications, cloud storage, databases, and so on.

        Different data sources tend to produce different types of data. For example, sensor readings are often structured, social media posts and comments are usually semi-structured, and photos and videos are unstructured.

5. Data types in the big data system

        Data types in the big data system describe the form and format of data. There are three main types: structured data, semi-structured data, and unstructured data.

5.1. Structured data

        Structured data is data organized according to a fixed format and set of rules, with clear relationships between its data elements, so it can be easily categorized, indexed, searched, and queried. It is typically presented as tables, for example tabular data in relational databases, spreadsheets, and CSV (comma-separated values) files.

        Typical fields hold numbers, dates, text, amounts, timestamps, currencies, ratios, certificate numbers, addresses, phone numbers, and email addresses, each with a well-defined data type and field name. Structured data can be analyzed and processed with SQL queries and other data analysis tools. Its clear structure makes it widely used in data analysis, machine learning, and artificial intelligence applications, as well as in enterprise data management systems and business reports.

        Structured data has the following characteristics:

1. Data is organized according to a fixed structure, and each data item has a clearly defined data type and attributes;

2. Storage is simple and clear, usually in a relational database, which makes query and analysis convenient;

3. Processing and management are relatively easy, and standardized languages such as SQL can be used;

4. Accuracy and consistency are easier to enforce, which helps maintain and manage data quality;

5. Processing methods are relatively fixed, and common statistical and machine learning algorithms can be used for analysis and mining.
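
As a small illustration of the characteristics listed above, the sketch below stores a few rows in a table with a fixed schema and summarizes them with SQL. The `sales` table, its columns, and the sample rows are hypothetical, and an in-memory SQLite database is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Fixed schema: every column has a declared name and type (hypothetical table).
conn.execute(
    "CREATE TABLE sales ("
    "order_id INTEGER PRIMARY KEY, "
    "region TEXT NOT NULL, "
    "amount REAL NOT NULL, "
    "sold_on TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (1, "north", 120.0, "2023-09-01"),
        (2, "south", 80.0, "2023-09-01"),
        (3, "north", 30.0, "2023-09-02"),
    ],
)

# Because the structure is known in advance, aggregation is a one-line query.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```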

5.2. Semi-structured data

        Semi-structured data is data that does not fit the traditional relational data model. It sits between structured and unstructured data: it has no fixed, predefined schema and is less standardized than structured data, but it carries tags or markers that identify and describe its elements. XML, JSON, and YAML are typical formats.

        Semi-structured data is mainly used in Web applications, text processing, semantic analysis, and similar fields, where it meets the need for flexible data processing. Common sources include log files, social media data, and sensor data.

        Semi-structured data usually has the following characteristics:

1. The data has some structure, but not strictly in the form of a table, and it can contain multiple levels of nesting.

2. Fields can be added or removed dynamically as needed, without defining a table structure in advance.

3. The data adapts flexibly to different application scenarios and needs.

4. The data is usually stored and transmitted in formats such as XML and JSON.
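
The characteristics above show up clearly when nested JSON records are flattened into rows. In this sketch the sample records, the field names, and the `flatten` helper are all hypothetical; the point is only that optional and nested fields have to be handled explicitly because no schema guarantees they exist.

```python
import json

# Hypothetical social media records: the second one lacks "tags" but adds "geo".
RAW = """[
  {"id": 1, "user": {"name": "Li", "tags": ["vip"]}, "text": "hello"},
  {"id": 2, "user": {"name": "Wang"}, "text": "hi", "geo": {"city": "Hangzhou"}}
]"""

def flatten(record):
    """Map one nested record to a flat row, tolerating missing fields."""
    user = record.get("user", {})
    return {
        "id": record["id"],
        "name": user.get("name"),
        "tags": user.get("tags", []),               # optional list
        "city": record.get("geo", {}).get("city"),  # optional nested field
    }

rows = [flatten(r) for r in json.loads(RAW)]
for row in rows:
    print(row)
```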

5.3. Unstructured data

        Unstructured data is data without a predefined structure, such as text documents, audio, video, and images. Such data is usually complex and diverse, cannot easily be converted into tables or two-dimensional matrices, and is difficult to handle with traditional structured storage and management methods. Special techniques and tools are therefore needed to analyze and process it.

        Processing this type of data often requires techniques such as text analytics, natural language processing, and image processing. The use of unstructured data is growing, and it is highly valuable in artificial intelligence, machine learning, and other fields.
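
A very simple form of text analytics is counting the most frequent terms in free text. The sketch below is a toy illustration, not a full natural language processing pipeline: the sample sentence, the stop-word set, and the `top_terms` helper are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical free-text note; real input might be documents or chat logs.
TEXT = (
    "The camera failed. Restarting the camera fixed the failure, "
    "but the camera log was lost."
)
STOPWORDS = {"the", "but", "was", "a", "of"}

def top_terms(text, n=3):
    """Lowercase the text, split it into words, and count non-stop-words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)

terms = top_terms(TEXT, n=1)
print(terms)
```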

        Therefore, different types of data sources often have an impact on the type of data, and different types of data also require different technologies and methods for processing and analysis.

6. Big data collection technology

        The collection of big data can be divided into four categories from the data sources:

6.1. Web data collection

(This picture comes from the Internet www.yisu.com/news/id_335.html)

        Web data collection refers to obtaining data from websites through web crawlers or the websites' public APIs.

        A web crawler starts from the URLs of one or more seed pages and fetches the content of each page. While crawling, it keeps extracting new URLs from the current page and adding them to a queue, until a configured stop condition is met.

        In this way, unstructured and semi-structured data can be extracted from web pages and saved to a local storage system in structured form.
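
The crawl loop described above can be sketched with the Python standard library. To keep the example runnable offline, page fetching is passed in as a function, and a hypothetical in-memory dictionary `SITE` plays the role of a website; the names `LinkParser`, `crawl`, and `max_pages` are illustrative choices, and the stop condition here is simply a page limit. A real crawler would also need to respect robots.txt, rate limits, and the site's terms of use.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collects the href values of <a> tags found in one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, fetch, max_pages=10):
    """Breadth-first crawl; fetch(url) returns HTML text or None."""
    queue, seen, pages = deque(seeds), set(seeds), {}
    while queue and len(pages) < max_pages:  # stop condition
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

# Hypothetical in-memory "website" used instead of real HTTP requests.
SITE = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/">home</a> <a href="/c">C</a>',
    "http://example.test/b": "no links here",
}

pages = crawl(["http://example.test/"], SITE.get, max_pages=5)
print(list(pages))
```

The `seen` set prevents the crawler from queuing the same URL twice, which is what keeps the loop from cycling between pages that link to each other.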

6.2. System log collection

(This picture comes from the network developer.aliyun.com/article/594990)

        System log collection refers to collecting the log information generated inside computer systems, such as logs produced by operating systems, applications, and network devices. These logs help security managers and system administrators monitor system status in real time, detect failures or anomalies, and respond promptly to keep the system running safely and stably.

        Typically, a log collection agent installed on each machine ships logs to a central log server or centralized log management platform, where they are stored and managed for later query, analysis, and reporting. The collected logs can be used for troubleshooting, security audits, and compliance supervision.

        High availability, high reliability, and scalability are the basic requirements of a log collection system. Mainstream log collection tools generally adopt a distributed architecture and can collect and transmit log data at rates of hundreds of MB per second.
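
Before logs are shipped to a central platform, a collection agent usually turns raw lines into structured records. The sketch below shows only that parsing step, under an assumed line layout of `<timestamp> <host> <app>[<pid>]: <message>`; the sample lines, the pattern `LINE_RE`, and the `parse_lines` helper are hypothetical, and real log formats vary widely.

```python
import re

# Assumed layout: "<timestamp> <host> <app>[<pid>]: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\S+) (?P<host>\S+) (?P<app>[\w\-]+)\[(?P<pid>\d+)\]: (?P<msg>.*)$"
)

def parse_lines(lines):
    """Split raw lines into parsed records and lines that did not match."""
    records, rejected = [], []
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match:
            records.append(match.groupdict())
        else:
            rejected.append(line)  # keep unparsed lines for later inspection
    return records, rejected

SAMPLE = [
    "2023-09-05T10:00:01Z web01 nginx[312]: GET /index.html 200\n",
    "2023-09-05T10:00:02Z web01 sshd[99]: Failed password for root\n",
    "garbled line\n",
]

records, rejected = parse_lines(SAMPLE)
print(len(records), "parsed,", len(rejected), "rejected")
```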

6.3. Database acquisition

        Database collection usually refers to gathering large amounts of data from different sources into a centralized database for analysis and application. The sources can include sensors, websites, social networks, mobile devices, and so on. The goal is to gather enough data for in-depth analysis and mining, reveal underlying trends and patterns, and support better-informed business decisions.

        The following aspects need to be considered:

1. Data type: the data can be structured, semi-structured, or unstructured. Each type calls for different collection methods and tools, so tools should be chosen to match the data type.

2. Data source: data usually comes from multiple sources, including sensors, databases, websites, and social networks. Collection methods and tools should be chosen according to the characteristics of each source.

3. Collection technology: available technologies include crawlers, ETL, and others. The appropriate one should be chosen according to the type and source of the data.

4. Collection frequency: the frequency should be set according to the data source and data type so that the data stays timely and accurate.

5. Storage and processing: the collected data needs to be stored and processed for later analysis and application, so suitable storage and processing technology must be chosen.

        Traditional enterprises typically store their data in relational databases such as MySQL and Oracle.
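
When data is pulled out of a relational database, reading it in fixed-size batches keeps memory use bounded. The sketch below uses Python's built-in `sqlite3` module as a stand-in for MySQL or Oracle; the `customers` table, the batch size, and the `read_in_batches` helper are assumptions for illustration only.

```python
import sqlite3

def read_in_batches(conn, query, batch_size=2):
    """Yield query results in lists of at most batch_size rows."""
    cursor = conn.execute(query)
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield batch

# Hypothetical source table standing in for an enterprise database.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
source.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(i, f"customer-{i}") for i in range(1, 6)],
)

batches = list(read_in_batches(source, "SELECT id, name FROM customers ORDER BY id"))
sizes = [len(b) for b in batches]
print(sizes)
```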

6.4. Other data (collection from sensing devices and similar sources)

        Sensing device data collection refers to automatically acquiring signals, pictures, or videos through sensors, cameras, and other smart terminals. A big data intelligent sensing system needs to support intelligent identification, positioning, tracking, access, transmission, signal conversion, monitoring, preliminary processing, and management of massive structured, semi-structured, and unstructured data. Its key technologies cover intelligent identification, sensing, adaptation, transmission, and access for big data sources.

7. Big data collection methods

▷ 1. Batch collection: collecting a large amount of data from a website or system at one time, then analyzing and processing it.

▷ 2. Real-time collection: collecting data as it is produced so that it can be processed and analyzed in real time.

▷ 3. Incremental collection: periodically collecting only the data added since the last run, on top of the existing data, to obtain the latest records.

▷ 4. Automatic collection: using automated programs to collect data, reducing manual intervention and improving efficiency.

▷ 5. Cooperative collection: obtaining shared data through cooperation with other institutions or organizations and using it for big data analysis.
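
Incremental collection is often implemented with a "watermark": the collector remembers the last record it has seen and asks only for newer ones on the next run. The sketch below is one possible way to do this, assuming a hypothetical `events` table with an auto-incrementing `id`; the `collect_increment` helper and the sample rows are illustrative.

```python
import sqlite3

def collect_increment(conn, last_id):
    """Return rows with id greater than last_id, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id", (last_id,)
    ).fetchall()
    new_watermark = rows[-1][0] if rows else last_id
    return rows, new_watermark

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO events (payload) VALUES (?)", [("a",), ("b",)])

# First run: nothing collected yet, so the watermark starts at 0.
first, mark = collect_increment(conn, 0)

# New data arrives between runs.
conn.execute("INSERT INTO events (payload) VALUES ('c')")

# Second run picks up only the new row; a third run finds nothing new.
second, mark2 = collect_increment(conn, mark)
third, mark3 = collect_increment(conn, mark2)
print(first, second, third)
```

In practice the watermark would be persisted between runs (for example in a small state file or table) so that a restart does not re-collect old data.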

8. Features of big data collection

Compared with traditional data collection technology, big data collection technology has the following characteristics:

▷ 1. Larger scale: Big data collection technology can handle data at a larger scale, including structured, semi-structured, and unstructured data.

▷ 2. Faster speed: Big data collection technology can acquire data quickly and process it in real time or near real time, so decisions can be made faster.

▷ 3. Greater diversity: Big data collection technology can collect data from different sources, including social media, sensors, logs, and videos.

▷ 4. Higher accuracy: Big data collection technology can handle more precise, finer-grained data and improve data quality through operations such as summarizing and classifying the data.

▷ 5. Higher degree of automation: Big data collection technology can acquire and process data automatically, reducing manual intervention and errors.

▷ 6. Lower cost: The cost of big data collection technology, including hardware and software, is usually lower than that of traditional data collection technology.

Origin blog.csdn.net/weixin_69553582/article/details/132674761