Solr: SolrCloud

Distributed indexing

Document shard assignment

A document is assigned to one and only one shard per collection. Solr uses a component called a document router to determine which shard a document should be assigned to. There are two basic document-routing strategies supported by SolrCloud: compositeId (default) and implicit.

Solr uses the MurmurHash algorithm because it is fast and produces an even distribution of hash values, which keeps the number of documents in each shard roughly balanced.
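To make the hash-based assignment concrete, here is a minimal Python sketch. It implements the 32-bit MurmurHash3 variant and maps each hash onto one of N equal, contiguous hash ranges. The function names (murmur3_32, shard_for, shard_counts), the UTF-8 encoding of IDs, and the equal-range split are assumptions made for illustration; the sketch is not guaranteed to reproduce the shard assignments of a real SolrCloud cluster.

```python
def _rotl32(x, r):
    """Rotate a 32-bit integer left by r bits."""
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """32-bit MurmurHash3 (x86 variant) of a byte string, as an unsigned int."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF
    n = len(data)
    full = n - (n % 4)

    # Body: mix in one little-endian 4-byte block at a time.
    for i in range(0, full, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF

    # Tail: mix in the remaining 1 to 3 bytes, if any.
    tail = data[full:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k

    # Finalization: force all bits of the hash to avalanche.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document ID to a shard index in [0, num_shards).

    Assumption: the 32-bit hash space is split into num_shards equal,
    contiguous ranges, and the shard is the range the hash falls into.
    """
    return (murmur3_32(doc_id.encode("utf-8")) * num_shards) >> 32


def shard_counts(doc_ids, num_shards: int):
    """Count how many of the given IDs land on each shard."""
    counts = [0] * num_shards
    for doc_id in doc_ids:
        counts[shard_for(doc_id, num_shards)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical IDs; the counts should come out roughly equal.
    print(shard_counts((f"doc-{i}" for i in range(10000)), 4))
```

Because the shard is derived only from the document ID, the same ID always maps to the same shard, which is what lets any node work out where a document belongs.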

Adding documents

You can send update requests to any node in the cluster, and the request will be forwarded to the correct shard leader.

STEP 1: SEND THE UPDATE REQUEST USING CLOUDSOLRSERVER

STEP 2: ROUTE THE DOCUMENT TO THE CORRECT SHARD

STEP 3: LEADER ASSIGNS VERSION ID

STEP 4: FORWARD REQUEST TO REPLICAS

STEP 5: ACKNOWLEDGE WRITE SUCCESS
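The five steps can be modeled with a small in-memory Python sketch. Everything here is a toy: the Core and Shard classes, the add_document function, the counter-based version numbers, and the use of zlib.crc32 as a stand-in for Solr's real hashing are all assumptions for illustration. Only the field name _version_ is taken from Solr itself.

```python
import itertools
import zlib


class Core:
    """One copy of a shard's index, held as an in-memory dict (toy model)."""

    def __init__(self, name):
        self.name = name
        self.docs = {}

    def apply(self, doc):
        self.docs[doc["id"]] = dict(doc)
        return True  # acknowledge the write


class Shard:
    """A shard with one leader core and zero or more replica cores."""

    def __init__(self, name, leader, replicas):
        self.name = name
        self.leader = leader
        self.replicas = replicas
        # Assumption: a simple counter stands in for real version IDs.
        self._versions = itertools.count(1)

    def index(self, doc):
        # Step 3: the leader assigns a version ID and indexes locally.
        versioned = {**doc, "_version_": next(self._versions)}
        self.leader.apply(versioned)
        # Step 4: the leader forwards the versioned document to replicas.
        acks = [replica.apply(versioned) for replica in self.replicas]
        # Step 5: success is acknowledged once every replica has applied it.
        return all(acks)


def route(doc_id, shards):
    """Step 2: pick the shard for a document (crc32 is a stand-in hash)."""
    return shards[zlib.crc32(doc_id.encode("utf-8")) % len(shards)]


def add_document(doc, shards):
    """Step 1: entry point a client would call to send an update."""
    return route(doc["id"], shards).index(doc)


if __name__ == "__main__":
    cluster = [
        Shard("shard1", Core("s1-leader"), [Core("s1-replica")]),
        Shard("shard2", Core("s2-leader"), [Core("s2-replica")]),
    ]
    print(add_document({"id": "doc-1", "title": "hello"}, cluster))
```

In this model the leader and its replicas end up holding identical copies of the document, including the version the leader assigned.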

Near real-time search

NRT makes documents visible in search results within seconds of their being indexed, hence the use of the near qualifier. To allow documents to be visible in NRT, Solr provides a soft commit mechanism, which skips the costly aspects of hard commits, such as flushing documents stored in memory to disk.
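One way to request a soft commit is through parameters on the update request. The Python sketch below only builds the request URL and sends nothing. softCommit and commitWithin are Solr update request parameters; the helper name update_url, the host, and the collection name are placeholder assumptions.

```python
from urllib.parse import urlencode


def update_url(base_url, collection, soft_commit=False, commit_within_ms=None):
    """Build an update URL, optionally asking Solr for a soft commit.

    base_url and collection are placeholders for your own cluster.
    """
    params = {}
    if soft_commit:
        # Make the update searchable without a full hard commit.
        params["softCommit"] = "true"
    if commit_within_ms is not None:
        # Ask Solr to commit within the given number of milliseconds.
        params["commitWithin"] = str(commit_within_ms)
    path = f"{base_url.rstrip('/')}/{collection}/update"
    query = urlencode(params)
    return f"{path}?{query}" if query else path


if __name__ == "__main__":
    print(update_url("http://localhost:8983/solr", "mycollection", soft_commit=True))
```

With the placeholder values above, the function returns http://localhost:8983/solr/mycollection/update?softCommit=true.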

If you use NRT search, your cache autowarming settings and warming queries must execute faster than your soft commit frequency.

Although NRT search is a powerful feature, you do not have to use it with SolrCloud. It’s perfectly acceptable not to use soft commits, and we recommend avoiding them unless you really need indexed documents to be visible in near real-time. One drawback of soft commits is that your caches are constantly being invalidated.

Node recovery process

SolrCloud supports two basic recovery scenarios: peer sync and snapshot replication. The recovery process for these two scenarios is differentiated by how many update requests (add, delete, update) the recovering node missed while it was offline.

  1. Peer sync: If the outage was short-lived and the recovering node missed only a few updates, it recovers by pulling the missed updates from the shard leader’s update log. The upper limit on missed updates is currently hardcoded to 100. If the number of missed updates exceeds this limit, the recovering node pulls a full index snapshot from the shard leader.
  2. Snapshot replication: If a node is offline long enough to fall too far out of sync with the shard leader, it uses Solr’s HTTP-based replication to copy a snapshot of the leader’s index.
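The choice between the two scenarios comes down to a single threshold check, sketched here in Python. The function and constant names are hypothetical; only the limit of 100 missed updates comes from the description above.

```python
# Limit on missed updates described above; the name is hypothetical.
PEER_SYNC_LIMIT = 100


def choose_recovery(missed_updates, limit=PEER_SYNC_LIMIT):
    """Return which recovery scenario applies to a recovering node."""
    if missed_updates < 0:
        raise ValueError("missed_updates must be non-negative")
    if missed_updates <= limit:
        # Few enough updates to replay from the leader's update log.
        return "peer_sync"
    # Too far behind: copy a full index snapshot from the leader.
    return "snapshot_replication"


if __name__ == "__main__":
    for missed in (3, 100, 101):
        print(missed, choose_recovery(missed))
```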

-------------------------------------------------

Distributed search

Once you shard your index, you have a new problem: you must query all shards to get a complete result set. Querying across all shards in a collection to create a unified result set is known as a distributed query. The distrib parameter determines if a query is distributed or local; when SolrCloud mode is enabled, distrib defaults to true.
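As an illustration of the distrib parameter, this Python sketch builds a query URL and adds distrib=false only when a local, single-core query is wanted. The helper name select_url, the host, and the collection name are placeholder assumptions; no request is actually sent.

```python
from urllib.parse import urlencode


def select_url(base_url, collection, q, distrib=True):
    """Build a query URL; distrib=False restricts the query to the local index.

    base_url and collection are placeholders for your own cluster.
    """
    params = {"q": q}
    if not distrib:
        # In SolrCloud mode distrib defaults to true, so only send it to opt out.
        params["distrib"] = "false"
    return f"{base_url.rstrip('/')}/{collection}/select?{urlencode(params)}"


if __name__ == "__main__":
    print(select_url("http://localhost:8983/solr", "mycollection", "title:solr"))
    print(select_url("http://localhost:8983/solr", "mycollection", "title:solr", distrib=False))
```

Comparing the hit counts of a local query against a distributed one is a quick way to see how many matching documents a single shard holds.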

Multistage query process

Distributed queries work differently than nondistributed queries because Solr needs to gather results from all shards, then merge them into a single response to the client. Solr uses a multistage query process to execute distributed queries.

STEP 1: CLIENT SENDS QUERY TO ANY NODE

STEP 2: QUERY CONTROLLER RECEIVES REQUEST

STEP 3: QUERY STAGE

STEP 4: GET FIELDS STAGE
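The query stage and get fields stage can be sketched as a scatter-gather over in-memory shards. This Python model is a simplification under stated assumptions: each shard is a dict, scores are precomputed numbers standing in for real relevance scoring, and the function names are hypothetical.

```python
import heapq


def query_stage(shard, rows):
    """Return a shard's top `rows` (score, doc_id) pairs, without stored fields."""
    return heapq.nlargest(
        rows, ((doc["score"], doc_id) for doc_id, doc in shard.items())
    )


def distributed_query(shards, rows):
    """Merge per-shard results, then fetch fields only for the final top docs."""
    # Query stage: every shard reports IDs and scores for its own top docs.
    candidates = []
    for shard_name, shard in shards.items():
        for score, doc_id in query_stage(shard, rows):
            candidates.append((score, doc_id, shard_name))

    # The controller merges the candidates into one global top `rows` list.
    top = heapq.nlargest(rows, candidates)

    # Get fields stage: fetch stored fields only from shards that own a winner.
    return [
        {"id": doc_id, "score": score, **shards[shard_name][doc_id]["fields"]}
        for score, doc_id, shard_name in top
    ]


if __name__ == "__main__":
    # Hypothetical data: scores are precomputed for the sake of the example.
    shards = {
        "shard1": {
            "a": {"score": 0.9, "fields": {"title": "A"}},
            "b": {"score": 0.2, "fields": {"title": "B"}},
        },
        "shard2": {
            "c": {"score": 0.7, "fields": {"title": "C"}},
            "d": {"score": 0.5, "fields": {"title": "D"}},
        },
    }
    print(distributed_query(shards, rows=2))
```

Splitting the work this way means stored fields are read only for the documents that make the final result page, not for every candidate each shard returns.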

Distributed search limitations

Unfortunately, not all Solr query features work in distributed mode. Specifically, there are three main limitations you should be aware of:

  1. Inverse document frequency (idf) is based on the frequency of a term in the local index only. It is used when scoring documents, so there can be some bias introduced when ranking documents in a distributed query. Because documents are randomly distributed across shards (by default), the idf for a term in shard1 is typically close to the idf for a term across all shards.
  2. Joins do not work in distributed mode unless you use the custom hashing solution.
  3. To use Solr’s grouping functionality in SolrCloud, you need to use custom hashing to co-locate documents that will be collapsed into the same group.
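One way to picture custom hashing is the shardKey!docId convention of the compositeId router, where documents that share a prefix are routed together. The Python sketch below is a simplified model, not Solr's code: zlib.crc32 is a stand-in for MurmurHash, the function names are hypothetical, and routing on only the upper 16 bits is an assumption made so that a shared prefix always lands on one shard.

```python
import zlib


def _hash32(text):
    """Stand-in 32-bit hash (crc32), used here instead of MurmurHash."""
    return zlib.crc32(text.encode("utf-8")) & 0xFFFFFFFF


def composite_hash(doc_id):
    """Upper 16 bits come from the prefix, lower 16 bits from the rest of the ID."""
    if "!" in doc_id:
        prefix, rest = doc_id.split("!", 1)
        return (_hash32(prefix) & 0xFFFF0000) | (_hash32(rest) & 0x0000FFFF)
    return _hash32(doc_id)


def shard_for(doc_id, num_shards):
    """Route on the upper 16 bits, so IDs sharing a prefix share a shard."""
    return ((composite_hash(doc_id) >> 16) * num_shards) >> 16


if __name__ == "__main__":
    # Hypothetical IDs: all documents for one customer share a prefix.
    for doc_id in ("acme!order-1", "acme!order-2", "acme!invoice-9"):
        print(doc_id, shard_for(doc_id, 4))
```

With IDs built this way, documents that need to be joined or grouped together can be given a common prefix so that they are indexed on the same shard.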


Reposted from ylzhj02.iteye.com/blog/2090511