1. Introduction to NoSQL
1.1 What is NoSQL
NoSQL stands for "Not Only SQL" and refers to non-relational databases.
NoSQL is a general term:
- It refers to databases that do not follow the traditional RDBMS model
- The data is non-relational, and SQL is not the main query language
- It targets database scalability and availability problems
- It generally does not aim to provide atomicity or strong consistency guarantees
1.2 Why use NoSQL
As the Internet has grown, traditional relational databases have hit bottlenecks in meeting requirements such as:
- Highly concurrent reads and writes
- Large storage capacity
- High availability
- High scalability
- Low cost
Comparison of NoSQL and relational databases
The main differences are as follows:
Aspect | NoSQL | Relational database |
---|---|---|
Common databases | HBase, MongoDB, Redis | Oracle, DB2, MySQL |
Storage format | Documents, key-value pairs, graph structures | Tables with rows and columns |
Storage design | Encourages redundancy | Normalized, avoids duplication |
Scaling | Horizontal (scale-out), distributed | Vertical (limited horizontal scaling) |
Query method | Unstructured queries | Structured Query Language (SQL) |
Transactions | Generally no transactional consistency | Transactions supported |
Performance | High read and write throughput | Lower read and write throughput at scale |
Cost | Simple to deploy, open source, low cost | High cost |
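The "Storage design" row can be illustrated with a small sketch. The record layout and field names below are hypothetical; the point is only that a normalized layout needs a lookup to assemble an order, while a document layout duplicates the customer name inside each order.

```python
# Relational style: normalized tables linked by key (hypothetical sample data).
customers = {1: {"name": "Ann"}}
orders = [{"order_id": 100, "customer_id": 1, "total": 25.0}]

# Document style: the customer name is duplicated inside each order document.
order_docs = [{"order_id": 100, "customer": {"id": 1, "name": "Ann"}, "total": 25.0}]


def order_view_relational(order):
    """Assemble the full order; this needs a lookup, i.e. a join."""
    return {**order, "customer_name": customers[order["customer_id"]]["name"]}


def order_view_document(doc):
    """Everything is already in one record, so no join is required."""
    return {"order_id": doc["order_id"], "customer_name": doc["customer"]["name"]}


relational_name = order_view_relational(orders[0])["customer_name"]
document_name = order_view_document(order_docs[0])["customer_name"]
print(relational_name, document_name)
```

The trade-off is that renaming a customer touches one row in the normalized layout but every order document in the redundant one.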
1.3 Features of NoSQL
- Eventual consistency
- The application takes on more responsibility for maintaining consistency and handling transactions
- Redundant data storage
- NoSQL != big data
  - NoSQL products help solve big data storage problems
  - Big data covers more than data storage, for example:
    - Hadoop
    - Kafka
    - Spark, etc.
1.4 Basic Concepts of NoSQL
- Three cornerstones
  - CAP, BASE, eventual consistency
- Indexing and querying
- MapReduce
- Sharding
- CAP theorem
  - A distributed database can guarantee at most two of the following three properties
    - Consistency
    - Availability
    - Partition tolerance
- NoSQL does not guarantee ACID
  - It provides "eventual consistency" instead
- BASE
  - Basically Available
    - The core functionality stays available
  - Soft state
    - State may be out of sync for a while
  - Eventual consistency
    - After some time, the data reaches a consistent state
  - The core idea: even if strong consistency cannot be achieved, the application can choose a suitable way to reach eventual consistency
- Eventual consistency
  - The end result is consistent, but the data is not consistent at every moment
  - Data such as account balances and inventory must be strongly consistent
  - Information such as a product catalog does not require strong consistency
  - Variants of eventual consistency include:
    - Causal consistency
    - Read-your-writes consistency
    - Session consistency
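A minimal sketch of eventual consistency, assuming two in-memory replicas that resolve conflicts with last-write-wins timestamps and converge through an explicit sync step. The `Replica` class and `sync` function are hypothetical names for illustration and are not taken from any particular NoSQL product.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A toy replica storing key -> (timestamp, value)."""
    name: str
    data: dict = field(default_factory=dict)

    def write(self, key, value, ts):
        # Last-write-wins: keep the value with the newest timestamp.
        current = self.data.get(key)
        if current is None or ts > current[0]:
            self.data[key] = (ts, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def sync(a, b):
    """Exchange entries in both directions so the two replicas converge."""
    for key, (ts, value) in list(a.data.items()):
        b.write(key, value, ts)
    for key, (ts, value) in list(b.data.items()):
        a.write(key, value, ts)


r1, r2 = Replica("r1"), Replica("r2")
r1.write("catalog:42", "old title", ts=1)
sync(r1, r2)
r1.write("catalog:42", "new title", ts=2)   # only r1 has seen this write

stale = r2.read("catalog:42")   # soft state: r2 still returns the old value
sync(r1, r2)
fresh = r2.read("catalog:42")   # after syncing, both replicas agree
print(stale, "->", fresh)
```

A stale read like this is acceptable for catalog data, but it is exactly what must not happen for an account balance, which is why such data needs strong consistency.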
Index and query
- Indexing
  - Most NoSQL databases index by key
  - Some NoSQL databases allow secondary indexes
  - HBase uses HDFS, which is append-only
    - Writes are logged and written in batches
    - Files are recreated and sorted
- Query
  - There is usually no dedicated query language; queries are typically written in a scripting language
  - Some databases have started to support SQL queries
  - Some can be queried with MapReduce code
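The difference between a key index and a secondary index can be sketched as follows. `TinyStore` is a hypothetical in-memory class and does not reflect how any specific database implements its indexes.

```python
from collections import defaultdict


class TinyStore:
    """Key-value store with one optional secondary index (illustrative only)."""

    def __init__(self, indexed_field):
        self.rows = {}                      # primary index: key -> record
        self.indexed_field = indexed_field
        self.secondary = defaultdict(set)   # field value -> set of keys

    def put(self, key, record):
        old = self.rows.get(key)
        if old is not None:
            # Remove the stale secondary-index entry before overwriting.
            self.secondary[old[self.indexed_field]].discard(key)
        self.rows[key] = record
        self.secondary[record[self.indexed_field]].add(key)

    def get(self, key):
        """Lookup by key: the access path every key-indexed store offers."""
        return self.rows.get(key)

    def find_by_field(self, value):
        """Lookup by a non-key field, served by the secondary index."""
        return sorted(self.secondary.get(value, set()))


store = TinyStore(indexed_field="city")
store.put("u1", {"name": "Ann", "city": "Oslo"})
store.put("u2", {"name": "Bo", "city": "Lima"})
store.put("u3", {"name": "Cy", "city": "Oslo"})
store.put("u2", {"name": "Bo", "city": "Oslo"})   # update moves u2 in the index

print(store.find_by_field("Oslo"))
print(store.find_by_field("Lima"))
```

Without the secondary index, answering `find_by_field` would require scanning every record, and every write now pays the extra cost of keeping the index up to date.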
MapReduce and Sharding
- MapReduce
  - Not Hadoop's MapReduce specifically, although the concept is related
  - Used for data processing and querying
- Sharding
  - A partitioning scheme
  - Shards can be replicated, which helps with disaster recovery
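A rough sketch of sharding with replication, assuming hash-based partitioning and a simple "next node in the list" placement rule. The node names, shard count, and replication factor are all hypothetical.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard with a stable hash (not Python's salted hash())."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def replica_nodes(shard, nodes, replication_factor):
    """Place each shard on consecutive nodes so losing one node loses no data."""
    return [nodes[(shard + i) % len(nodes)] for i in range(replication_factor)]


nodes = ["node-a", "node-b", "node-c"]   # hypothetical node names
NUM_SHARDS = 6

placement = {s: replica_nodes(s, nodes, replication_factor=2) for s in range(NUM_SHARDS)}
shard = shard_for("user:1001", NUM_SHARDS)
print(shard, placement[shard])
```

Because every shard lives on two different nodes, any single node can fail and each shard still has a surviving copy, which is the disaster recovery benefit mentioned above.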
1.5 NoSQL classification
NoSQL databases fall mainly into the following four categories:
Category | Examples | Typical application scenarios |
---|---|---|
Key-value store | Redis, MemcacheDB, Voldemort | Content caching, etc. |
Wide-column store | Cassandra, HBase | Distributed storage of massive data sets |
Document store | CouchDB, MongoDB | Web applications (can be seen as an upgraded key-value store) |
Graph database | Neo4j, InfoGrid, InfiniteGraph | Social networks, recommendation systems, and other workloads centered on relationship graphs |
Key-Value Store Database (Key-Value)
Column Store Database (Wide Column Store)
Document Store
Graph Databases
1.6 The relationship between NoSQL, BI and big data
- BI (Business Intelligence)
  - It is a complete set of solutions
  - BI applications are built around data models and depend on them
  - BI tools mainly support standard SQL, so their support for NoSQL is weaker than for relational databases
- NoSQL is closely related to big data
  - Column-store databases are commonly used in big data scenarios, for example HBase on Hadoop
2. Introduction to HBase
2.1 HBase overview
- HBase is a leading NoSQL database
  - It is a column-oriented storage database
  - It behaves like a distributed, sorted map
  - It is based on the Google Bigtable paper
  - It uses HDFS as its storage layer and relies on HDFS for reliability
- HBase features
  - Fast data access, with response times of about 2-20 milliseconds
  - Random reads and writes, at 20k to 100k+ operations per second per node
  - Scalability, up to 20,000+ nodes
2.2 HBase development history
Year | Event |
---|---|
2006 | Google published the Bigtable paper |
2007 | The first version of HBase was released together with Hadoop 0.15.0 |
2008 | HBase became a subproject of Hadoop |
2010 | HBase became an Apache top-level project |
2011 | Cloudera launched CDH3, based on HBase 0.90.1 |
2012 | HBase 0.94 released |
2013-2014 | HBase 0.96 and 0.98 released |
2015-2016 | HBase 1.0, 1.1, and 1.2.4 released |
2017 | HBase 1.3 released |
2018 | HBase 1.4 and 2.0 released |
2.3 HBase user groups
2.4 HBase application scenarios
- Application scenario 1
  - Incremental data: time series data
  - High-capacity, high-speed writes
- Application scenario 2
  - Information exchange: messaging
  - High-capacity, high-speed reads and writes
- Application scenario 3
  - Content serving: web back-end applications
  - High-capacity, high-speed reads and writes
2.5 Apache HBase Ecosystem
Technologies in the HBase ecosystem:
- Lily: CRM based on HBase
- OpenTSDB: time series data management on HBase
- Kylin: OLAP
- Phoenix: SQL access to HBase
- Splice Machine: OLTP based on HBase
- Apache Tephra: transaction support for HBase
- TiDB: distributed SQL database
- Apache Omid: optimized transaction management
- YARN Application Timeline Server v2 migrates its storage to HBase
- Hive metadata storage can be migrated to HBase
- Ambari Metrics Server uses HBase for data storage
2.6 HBase architecture
1. Physical architecture
HBase adopts a Master/Slave architecture.
- HMaster
  - The master node of the HBase cluster; multiple HMasters can be configured for HA
  - Manages Regions and assigns them to RegionServers
  - Balances load across RegionServers
  - Detects failed RegionServers and reassigns their Regions
- RegionServer
  - Manages and maintains Regions
  - One RegionServer contains one WAL, one BlockCache (read cache), and multiple Regions
  - One Region contains multiple Stores, each corresponding to one column family
  - One Store consists of one MemStore and multiple StoreFiles
  - One StoreFile corresponds to one HFile and belongs to one column family
  - HFiles and the WAL are stored as files on HDFS
  - Clients interact with RegionServers directly
- Region and Table
2. Logical architecture: Row
- The Rowkey (row key) is unique, and rows are sorted by it
- Columns do not have to be declared up front; they can be defined when a record is inserted
- Each row can define its own columns, even ones that other rows do not use
- Related columns are grouped into column families
- Multiple versions of a value are kept, each identified by a unique timestamp
- The value type can differ between versions
- All HBase data is stored as bytes
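The logical model above can be pictured as a nested map: row key, then column family, then column qualifier, then timestamp, then a byte value. The following is a minimal in-memory sketch of that idea; `ToyTable` is a hypothetical class and is not the HBase client API.

```python
import time


class ToyTable:
    """rowkey -> family -> qualifier -> {timestamp: bytes}; illustrative only."""

    def __init__(self, families):
        self.families = set(families)   # column families are fixed up front
        self.rows = {}

    def put(self, rowkey, family, qualifier, value, ts=None):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        if not isinstance(value, bytes):
            raise TypeError("values are stored as bytes")
        ts = time.time_ns() if ts is None else ts
        # Qualifiers are created on insert; no per-column schema is needed.
        cell = self.rows.setdefault(rowkey, {}).setdefault(family, {}).setdefault(qualifier, {})
        cell[ts] = value

    def get(self, rowkey, family, qualifier):
        """Return the newest version of a cell, or None if it does not exist."""
        versions = self.rows.get(rowkey, {}).get(family, {}).get(qualifier)
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self):
        """Return row keys in sorted byte order, the order a scan would visit them."""
        return sorted(self.rows)


t = ToyTable(families=["info"])
t.put(b"row2", "info", "name", b"Bo", ts=1)
t.put(b"row1", "info", "name", b"Ann", ts=1)
t.put(b"row1", "info", "age", b"30", ts=1)      # row1 has a column row2 lacks
t.put(b"row1", "info", "name", b"Anne", ts=2)   # newer version of the same cell

latest_name = t.get(b"row1", "info", "name")
print(latest_name, t.scan())
```

Only the column families are declared when the table is created; individual columns appear as rows are written, which is why two rows in the same table can have different columns.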
2.7 HBase data management
- Data management directory
  - System catalog table hbase:meta
    - Stores metadata, etc.
  - Files in the HDFS directory
  - Region instances on the servers
- HBase data on HDFS
  - Can be repaired through the HDFS files
  - Repair path
    - RegionServer -> Table -> Region -> RowKey -> column family
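Part of the metadata kept in hbase:meta is which RegionServer hosts each Region and which row key range that Region covers. The sketch below shows how such a catalog can be searched for the Region that owns a row key. The `META` list, the host names, and the `locate` function are hypothetical and only mimic the idea, not the real table layout.

```python
import bisect

# Hypothetical catalog for one table: (region start key, hosting server).
# An empty start key marks the first region; entries are sorted by start key.
META = [
    (b"", "rs1.example.com"),
    (b"g", "rs2.example.com"),
    (b"p", "rs3.example.com"),
]
START_KEYS = [start for start, _ in META]


def locate(rowkey):
    """Return the server whose region covers rowkey (start <= rowkey < next start)."""
    idx = bisect.bisect_right(START_KEYS, rowkey) - 1
    return META[idx][1]


print(locate(b"apple"), locate(b"grape"), locate(b"zebra"))
```

Because row keys are sorted, a binary search over the region start keys is enough to find the right Region without touching the data itself.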
2.8 HBase architecture features
- Strong consistency
- Automatic scaling
  - A Region is split automatically when it grows too large
  - HDFS is used to scale out storage and manage space
- Write recovery
  - Uses the WAL (Write-Ahead Log)
- Integration with Hadoop
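Write recovery with a WAL can be sketched as follows, assuming a simplified write path in which every edit is appended to a log before it is applied to an in-memory MemStore. `ToyRegionServer` is a hypothetical class; a Python list stands in for the durable log that HBase keeps on HDFS.

```python
class ToyRegionServer:
    """Write path sketch: append to the WAL first, then update the MemStore."""

    def __init__(self, wal=None):
        self.wal = [] if wal is None else wal   # stands in for a durable log
        self.memstore = {}                      # in-memory, lost on a crash

    def put(self, rowkey, value):
        self.wal.append((rowkey, value))   # 1. log the edit durably
        self.memstore[rowkey] = value      # 2. apply it in memory

    @classmethod
    def recover(cls, wal):
        """Rebuild the MemStore of a crashed server by replaying its WAL in order."""
        server = cls(wal=list(wal))
        for rowkey, value in wal:
            server.memstore[rowkey] = value
        return server


rs = ToyRegionServer()
rs.put(b"row1", b"v1")
rs.put(b"row2", b"v2")
rs.put(b"row1", b"v3")

surviving_wal = list(rs.wal)   # the log survives the crash; the MemStore does not
del rs

recovered = ToyRegionServer.recover(surviving_wal)
print(recovered.memstore)
```

Replaying the log in its original order matters: the later edit to `row1` must overwrite the earlier one, so the recovered state matches what the server held before the crash.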