Big Data Storage: Differences and Usage Scenarios of HDFS, MongoDB, and HBase

One, HDFS

HDFS: Suited to large-file storage; files can be appended to but not modified in place. A good fit for Hadoop offline data analysis and as a storage layer for Apache Spark.

  • Storing a large number of small files in HDFS carries high overhead (every file's metadata lives in the NameNode's memory), so HDFS is best suited to large files; many small files can be merged into larger ones before processing.
  • HDFS is optimized for high throughput, not for low-latency access.
  • HDFS suits streaming reads; it does not support multiple concurrent writers to one file, random writes, or overwriting existing file contents.
  • HDFS is best for write-once, read-many workloads.
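The small-files mitigation mentioned above can be sketched in plain Python: pack many small files into one large container of length-prefixed records, loosely analogous to storing them inside a Hadoop SequenceFile. The function names and record layout here are illustrative, not a real HDFS API.

```python
import io
import struct

def pack_small_files(files: dict) -> bytes:
    """Pack many small files into one large blob of length-prefixed
    (name, payload) records -- the idea behind merging small files
    into a single large file before handing them to HDFS."""
    out = io.BytesIO()
    for name, payload in files.items():
        encoded = name.encode("utf-8")
        out.write(struct.pack(">I", len(encoded)))   # name length
        out.write(encoded)
        out.write(struct.pack(">I", len(payload)))   # payload length
        out.write(payload)
    return out.getvalue()

def unpack_small_files(blob: bytes) -> dict:
    """Inverse of pack_small_files: recover the original files."""
    files, buf = {}, io.BytesIO(blob)
    while True:
        header = buf.read(4)
        if not header:
            break
        name = buf.read(struct.unpack(">I", header)[0]).decode("utf-8")
        size = struct.unpack(">I", buf.read(4))[0]
        files[name] = buf.read(size)
    return files
```

One large packed blob replaces thousands of NameNode entries with a single one, at the cost of having to read records sequentially.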

Typical HDFS scenarios

  • Datasets ranging from gigabytes and terabytes up to petabytes
  • File counts in the millions
  • Clusters of 10K+ nodes

Two, HBase

HBase: acts as a store for incremental data captured from many sources. The source might be a web crawler, advertising performance data (which ads users saw and for how long), or time-series readings of various parameters.
Facebook uses HBase counters to count how many times people Like particular web pages. Content creators and page owners get near-real-time counts of how many users like their pages, so they can decide more quickly what content to provide. To support this, Facebook built a system called Facebook Insight, which needs a scalable storage backend. The company considered several options, including relational databases, in-memory databases, and Cassandra, and ultimately chose HBase. With HBase, Facebook can scale the service horizontally to millions of users while reusing its existing experience running large HBase clusters. The system processes tens of billions of events per day and records hundreds of metrics.
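The counter pattern above can be sketched without a cluster. In real HBase this is the atomic per-cell increment (the Java API's `incrementColumnValue`, or `incr` in the HBase shell); the class below is a toy stand-in that mimics the same guarantee with a lock, and all names are illustrative.

```python
import threading
from collections import defaultdict

class CounterTable:
    """Toy stand-in for HBase atomic counters: each (row, column)
    cell holds a count that concurrent writers can bump atomically."""
    def __init__(self):
        self._lock = threading.Lock()
        self._cells = defaultdict(int)   # (row, column) -> count

    def increment(self, row: str, column: str, amount: int = 1) -> int:
        # HBase serves increments atomically per cell; a lock gives
        # the equivalent guarantee in this single-process sketch.
        with self._lock:
            self._cells[(row, column)] += amount
            return self._cells[(row, column)]

likes = CounterTable()
likes.increment("page:123", "metrics:like")
likes.increment("page:123", "metrics:like")
```

Because the increment is atomic on the server side, millions of Like events can be counted without a read-modify-write race in client code.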

  • Semi-structured or unstructured data, and data whose schema changes over time
  • Recording very sparse data (most columns empty for any given row)
  • Data that needs multiple versions retained
  • Very large data volumes

Building an Internet search index with HBase (BigTable)

  • A crawler continuously fetches new pages and stores them in BigTable, one row per page.
  • A MapReduce job runs over the entire table and generates the index that the web search application will use.
  • A user issues a web search request.
  • The search application queries the prebuilt index, or fetches individual documents directly from BigTable.
  • Search results are returned to the user.

Three, MongoDB

MongoDB: log collection and storage, distributed storage of small files, and data storage for microblog-style Internet applications.

  • Suits data without strict transactional requirements, such as object data and JSON-format documents
  • Very high performance makes it a good fit for real-time inserts, updates, and queries, with strong scalability
  • Suitable as a caching layer

MongoDB suits the following scenarios:

  • Website data: MongoDB is well suited to real-time inserts, updates, and queries, and provides the replication and high scalability that real-time site data storage requires.
  • Caching: thanks to its high performance, MongoDB also works as a caching layer in an information infrastructure. A persistent cache built on MongoDB survives restarts and keeps the underlying data sources from being overloaded.
  • Large-volume, low-value data: storing such data in a traditional relational database can be expensive, so programmers previously often fell back on plain files.
  • Highly scalable deployments: MongoDB suits databases spanning dozens or hundreds of servers.
  • Object and JSON storage: MongoDB's BSON format is well suited to storing and querying document-shaped data.
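The document model behind these scenarios can be sketched in memory: schemaless JSON documents, an `_id` assigned on insert, and query-by-example matching. Real code would use pymongo's `insert_one`/`find`/`update_one` against a live server; this stand-in only illustrates the shape of those operations, and all names are invented.

```python
import json
from itertools import count

class DocumentCollection:
    """In-memory sketch of a MongoDB-style collection: schemaless
    JSON documents, each assigned an _id on insert, queried by
    example (a dict of field -> required value)."""
    def __init__(self):
        self._docs = {}
        self._ids = count(1)

    def insert_one(self, doc: dict) -> int:
        doc_id = next(self._ids)
        # Round-trip through JSON to enforce JSON-serializable docs.
        stored = json.loads(json.dumps(doc))
        stored["_id"] = doc_id
        self._docs[doc_id] = stored
        return doc_id

    def find(self, query: dict) -> list:
        return [d for d in self._docs.values()
                if all(d.get(k) == v for k, v in query.items())]

    def update_one(self, query: dict, changes: dict) -> bool:
        for doc in self.find(query):
            doc.update(changes)   # no fixed schema to violate
            return True
        return False

posts = DocumentCollection()
posts.insert_one({"user": "ada", "text": "hello", "tags": ["intro"]})
posts.insert_one({"user": "bob", "text": "hi"})  # different shape is fine
posts.update_one({"user": "ada"}, {"likes": 1})
```

Note that the two documents have different fields: this "documents of varying shape in one collection" property is exactly what makes MongoDB comfortable for object and JSON storage.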

Scenarios where MongoDB is a poor fit:

  • Highly transactional systems, such as banking or accounting: traditional relational databases remain the better fit for applications that need many complex, atomic transactions.
  • Traditional business intelligence: BI tools generate highly optimized queries against purpose-built databases, so a data warehouse is usually the more suitable choice.
  • Problems that require SQL.

Origin blog.csdn.net/u013250861/article/details/113732898