What are the common methods of processing massive data?

  Processing massive data is an essential skill for big data engineers. Mining and analyzing PB-level data to discover valuable information gives enterprises and governments a basis for sound decisions. The following are the commonly used methods for processing massive data.

 

1. Bloom filter

  A Bloom filter is a bit-vector data structure with good space and time efficiency that can be used to test whether an element belongs to a set. Its advantages are that insertion and query take constant time, and that queries do not require storing the elements themselves, which gives it good privacy properties. Its accuracy is slightly lower by design, however: data it reports as absent is definitely absent, but data it reports as present may not actually be present. It is therefore suitable for scenarios that can tolerate a low false-positive rate.
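
  As a minimal sketch of the idea (not the implementation any particular engine uses), the following Python class builds a bit array and simulates k independent hash functions by salting a single hash; the sizes m and k are arbitrary illustrative choices:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: m-bit array, k salted hash functions."""
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, item):
        # Derive k bit positions by salting a single hash function.
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))   # True (possibly present)
print(bf.might_contain("user:99"))   # usually False (definitely absent)
```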

 

2. Hash

  Hash refers to a hash function: a function that compresses a message of arbitrary length into a digest of fixed length. Different processing requirements call for different hash functions, and there are corresponding hash methods for strings, integers, and arrays. Commonly used hash construction methods include the direct addressing method, digit analysis method, mid-square method, folding method, random number method, and division-remainder method.
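
  To make two of the construction methods above concrete, here is an illustrative Python sketch of the division-remainder method for integers and a simple polynomial hash for strings (the table size 97 and base 31 are conventional example choices, not prescribed values):

```python
def remainder_hash(key, table_size=97):
    """Division-remainder method: h(key) = key mod p, p usually prime."""
    return key % table_size

def string_hash(s, table_size=97):
    """Polynomial string hash folding each character into the digest."""
    h = 0
    for ch in s:
        h = (h * 31 + ord(ch)) % table_size
    return h

print(remainder_hash(123456))   # 123456 mod 97
print(string_hash("hadoop"))    # a bucket index in [0, 97)
```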

 

3. BitMap

  BitMap is a method that uses an array of bits to indicate whether particular values exist; it supports fast lookup, membership testing, and deletion. Generally speaking, it is applicable when the data range is within about ten times the range of an int. A Bloom filter can be regarded as an extension of the BitMap.
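
  A minimal Python sketch of a BitMap, assuming the values fall in a known range [0, size); duplicates collapse into a single bit, which is why the structure also works for deduplication:

```python
class BitMap:
    """One bit per value in [0, size): set, test, and clear membership."""
    def __init__(self, size):
        self.bits = bytearray((size + 7) // 8)

    def set(self, n):
        self.bits[n // 8] |= 1 << (n % 8)

    def test(self, n):
        return bool(self.bits[n // 8] & (1 << (n % 8)))

    def clear(self, n):
        self.bits[n // 8] &= ~(1 << (n % 8))

bm = BitMap(1_000_000)
for n in (42, 99, 42):          # the duplicate 42 sets the same bit twice
    bm.set(n)
print(bm.test(42), bm.test(7))  # True False
```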

 

4. Heap

  A heap is a special data structure in computer science: an array object that can be viewed as a tree. To find the top k of n numbers, first build a min-heap from the first k numbers, then read the remaining elements in turn and compare each with the heap top. If the current element is smaller than or equal to the top, read the next element; if it is larger, replace the top with the current element and re-heapify. A max-heap finds the k smallest elements, a min-heap finds the k largest, and a pair of heaps (double heap) finds the median.
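
  The top-k procedure described above takes only a few lines with Python's heapq module (a min-heap): keep a k-element heap whose top is the smallest of the current top k, and replace the top whenever a larger element arrives:

```python
import heapq

def top_k_largest(stream, k):
    """Min-heap of size k; the heap top is the smallest of the top k."""
    heap = []
    for x in stream:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:             # larger than the current heap top
            heapq.heapreplace(heap, x)
    return sorted(heap, reverse=True)

print(top_k_largest([5, 1, 9, 3, 7, 8, 2], 3))  # [9, 8, 7]
```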

 

5. Double-layer bucketing

  Double-layer bucketing is not a data structure but an algorithmic idea, similar to divide and conquer. When the range of the elements is so large that a direct addressing table cannot be used, the range is narrowed step by step through repeated partitioning until the remaining work fits within an acceptable range. This method is generally suitable for finding the kth largest number, finding the median, and finding numbers that are unique or duplicated.
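
  As an illustrative sketch, the following finds the median of a stream of bounded non-negative integers in two passes: the first pass only counts how many values fall in each coarse bucket, and the second pass sorts just the one bucket that must contain the median. The bucket count and value bound are assumptions for the example:

```python
def median_two_pass(read_numbers, num_buckets=1024, max_val=2**32):
    """Bucket narrowing: count per bucket first, then re-scan only
    the bucket known to contain the median."""
    width = max_val // num_buckets
    counts = [0] * num_buckets
    total = 0
    for n in read_numbers():          # pass 1: count per coarse bucket
        counts[n // width] += 1
        total += 1
    target, seen = total // 2, 0
    for b, c in enumerate(counts):    # locate the bucket with the median
        if seen + c > target:
            break
        seen += c
    inside = sorted(n for n in read_numbers() if n // width == b)
    return inside[target - seen]      # pass 2: sort only that bucket

data = [7, 3, 9, 1, 5, 11, 2]
print(median_two_pass(lambda: iter(data)))  # 5
```

  Because only the per-bucket counts are held in memory during the first pass, the working set stays small no matter how large the file is; a real implementation would read the numbers from disk on each pass.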

 

6. Database optimization methods

  Massive amounts of data are stored in databases, and extracting useful information from them requires database optimization. Common database optimization methods include data partitioning, indexing, caching mechanisms, batch processing, optimizing query statements, and data mining on sampled data.
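
  As a small illustration of two of these techniques, batch processing and indexing, here is a sketch using Python's built-in sqlite3 module and a hypothetical logs table (any real schema and engine would differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # hypothetical example database
conn.execute("CREATE TABLE logs (user_id INTEGER, action TEXT)")

# Batch processing: one executemany instead of 10,000 single inserts.
conn.executemany("INSERT INTO logs VALUES (?, ?)",
                 [(i % 100, "click") for i in range(10_000)])

# Indexing: turns a full-table scan into an index lookup for this query.
conn.execute("CREATE INDEX idx_logs_user ON logs(user_id)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM logs WHERE user_id = 7").fetchall()
print(plan)  # the plan now references idx_logs_user
```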

 

7. Inverted index

  An inverted index is currently the most common storage structure used by search engine companies. Under full-text search, it stores a mapping from each word to its locations in a document or a set of documents. When handling complex multi-keyword queries, logical operations such as union and intersection can be completed directly on the inverted lists, and the matching documents are accessed only after the result set has been obtained. In this way, querying records is converted into operations on sets of addresses rather than random access to every record, thereby increasing search speed.
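
  A toy Python sketch of the idea, over a hypothetical three-document collection: the index maps each word to the set of documents containing it, so a multi-keyword AND query becomes a set intersection over the inverted lists:

```python
from collections import defaultdict

docs = {  # hypothetical document collection
    1: "big data processing with mapreduce",
    2: "bloom filter for big data",
    3: "inverted index for search engines",
}

# Build: map each word to the set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

# Multi-keyword query = set intersection over the inverted lists.
result = index["big"] & index["data"]
print(sorted(result))  # [1, 2]
```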

 

8. External sorting

  External sorting is the sorting of large files. Because of memory limitations, the content to be sorted cannot all be read into memory at once; instead, the file is sorted through multiple rounds of data exchange between internal memory and external storage. The most commonly used external sorting method is merge sorting: first generate several sub-files and sort each of them, then merge the sorted sub-files repeatedly so that the ordered runs grow longer and longer, until the entire file forms a single sorted run in external storage.

  External sorting is suitable for sorting and deduplicating large data sets, but its disadvantage is that it consumes a great deal of I/O and is not very efficient.
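
  A compact Python sketch of merge-based external sorting: sort fixed-size chunks into temporary run files, then stream a k-way merge over them with heapq.merge. The chunk_size in the demo is deliberately tiny so the example actually exercises the merge:

```python
import heapq, os, tempfile

def _write_run(sorted_lines):
    f = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
    f.writelines(sorted_lines)
    f.close()
    return f.name

def external_sort(lines, chunk_size=100_000):
    """Sort lines too numerous for memory: sort fixed-size chunks
    into temp files, then k-way merge the sorted runs."""
    runs, chunk = [], []
    for line in lines:
        chunk.append(line)
        if len(chunk) >= chunk_size:
            runs.append(_write_run(sorted(chunk)))
            chunk = []
    if chunk:
        runs.append(_write_run(sorted(chunk)))
    files = [open(path) for path in runs]
    try:
        yield from heapq.merge(*files)   # streaming k-way merge
    finally:
        for f in files:
            f.close()
            os.remove(f.name)

data = [f"{n:06d}\n" for n in (42, 7, 99, 13, 58)]
print([s.strip() for s in external_sort(data, chunk_size=2)])
# ['000007', '000013', '000042', '000058', '000099']
```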

 

9. Trie tree

  A trie is a multi-way tree structure for fast string retrieval. The principle is to exploit the common prefixes of strings to reduce space overhead. It is often used in search engine systems for document word-frequency statistics. Its advantages: it minimizes unnecessary string comparisons, and its query efficiency is higher than that of a hash table. It is suitable for situations where the data volume is large and highly repetitive, but the set of distinct strings is small enough to fit in memory.
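
  A minimal Python trie for word-frequency counting, as described above: words sharing a prefix share nodes along that prefix, and each terminal node carries a count:

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.count = 0              # word frequency at this terminal node

def insert(root, word):
    node = root
    for ch in word:                 # shared prefixes reuse existing nodes
        node = node.children.setdefault(ch, TrieNode())
    node.count += 1

def frequency(root, word):
    node = root
    for ch in word:
        if ch not in node.children:
            return 0
        node = node.children[ch]
    return node.count

root = TrieNode()
for w in ["map", "map", "mapreduce", "heap"]:
    insert(root, w)
print(frequency(root, "map"), frequency(root, "mapreduce"))  # 2 1
```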

 

10. MapReduce

  MapReduce is one of the core technologies of cloud computing: a distributed programming model that simplifies parallel computing.
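
  The model can be illustrated with a single-process word count in Python; the shuffle step that a real MapReduce framework performs across machines is simulated here by grouping the intermediate (word, 1) pairs by key:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # map: emit a (word, 1) pair for each word in the document
    return [(word, 1) for word in document.split()]

def reduce_phase(word, counts):
    # reduce: sum the counts emitted for one key
    return word, sum(counts)

documents = ["big data big value", "data mining"]

# shuffle: group intermediate pairs by key
groups = defaultdict(list)
for word, one in chain.from_iterable(map_phase(d) for d in documents):
    groups[word].append(one)

print(dict(reduce_phase(w, c) for w, c in groups.items()))
# {'big': 2, 'data': 2, 'value': 1, 'mining': 1}
```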

 

  The above are the commonly used methods for processing massive data; choose among them according to the characteristics of the data to be processed.
