Solr, MongoDB and Hadoop Comparison

An interesting phenomenon has emerged in the IT world over the past few years: many new technologies arrive and immediately embrace "big data", while slightly older technologies bolt big-data features onto their own offerings to avoid falling behind. The result is that the boundaries between different technologies are blurring. Consider a search engine like Elasticsearch or Solr that stores JSON documents, MongoDB, which stores JSON documents, or a pile of JSON documents stored in HDFS on a Hadoop cluster. With these three setups you can accomplish many of the same things.
Can Elasticsearch serve as a NoSQL database? At first glance the question sounds odd, but it describes a perfectly reasonable scenario. Likewise, MongoDB layers MapReduce on top of its sharding to do some of what Hadoop does. And of course, with the many tools that run on top of Hadoop (Hive, HBase, Pig, and the like), you can query the data in a Hadoop cluster in a variety of ways.

So can we now say that Hadoop, MongoDB, and Elasticsearch are all exactly the same? Obviously not! Each tool has scenarios it suits best, yet each has enough flexibility to fill several different roles. The question then becomes: "What is the most appropriate use case for each of these technologies?" Let's take a look.
Elasticsearch has moved beyond its original pure search-engine role and now adds analytics and visualization features, but at its core it remains a full-text search engine. Built on Lucene, it supports extremely fast queries and a rich query syntax. If you have millions of documents that need to be located by keyword, Elasticsearch is definitely the best choice. And if your documents are JSON, you can treat Elasticsearch as a kind of lightweight "NoSQL database". Elasticsearch is not a proper database engine, however, and is weaker at complex queries and aggregations, although statistical facets can provide some summary figures for a given query; facets in Elasticsearch mainly exist to support faceted browsing.
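To make that concrete, here is a minimal sketch of a keyword query combined with a facet-style terms aggregation. The local node address, the "articles" index, and its fields are all hypothetical, and it uses the official elasticsearch Python client; note that aggregations have since superseded facets in Elasticsearch itself.

```python
# A minimal sketch, assuming a local Elasticsearch node, a hypothetical
# "articles" index with a keyword-mapped "tag" field, and the official
# "elasticsearch" Python client.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="articles",
    body={
        "query": {"match": {"body": "big data"}},        # fast keyword match
        "aggs": {"by_tag": {"terms": {"field": "tag"}}}, # facet-style counts
        "size": 10,
    },
)

for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))
for bucket in resp["aggregations"]["by_tag"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```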

If you are looking for a smallish collection of documents matching a keyword query, and you want faceted navigation over those results, then Elasticsearch is the better choice. If you need to perform more complex calculations, execute server-side scripts against the data, or easily run MapReduce jobs, then MongoDB or Hadoop enters the picture.
MongoDB is a NoSQL database designed for high scalability, with automatic sharding and a number of additional performance optimizations. It is a document-oriented database that stores data as JSON (to be precise, as BSON, a binary format that extends JSON with, for example, native data types). MongoDB also provides a text index type to support full-text search, so here the line between it and Elasticsearch blurs again: both can run basic keyword searches over collections of documents.
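A minimal sketch of that text-index support, using the PyMongo driver against a local mongod; the "demo" database, the "articles" collection, and its fields are hypothetical:

```python
# Create a text index and run a basic keyword search, sorted by relevance.
from pymongo import MongoClient, TEXT

client = MongoClient("mongodb://localhost:27017")
articles = client["demo"]["articles"]

# Index the fields we want to be full-text searchable.
articles.create_index([("title", TEXT), ("body", TEXT)])

cursor = articles.find(
    {"$text": {"$search": "big data"}},
    {"score": {"$meta": "textScore"}, "title": 1},
).sort([("score", {"$meta": "textScore"})])

for doc in cursor:
    print(doc["title"], doc["score"])
```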
Where MongoDB goes beyond Elasticsearch is in its support for server-side JavaScript, aggregation pipelines, MapReduce, and capped collections. With an aggregation pipeline you process the documents in a collection in multiple steps through a sequence of pipeline stages; stages can produce entirely new documents or remove documents from the final result. This is a powerful way to filter, process, and transform data as it is retrieved. MongoDB can also execute map/reduce jobs over a collection, using custom JavaScript functions for the map and reduce phases, which gives it great flexibility to perform almost any kind of computation or transformation on the selected data.
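For example, here is a minimal aggregation-pipeline sketch (PyMongo again, with hypothetical collection and field names) that filters, regroups, and trims documents in successive stages:

```python
# A minimal aggregation-pipeline sketch; database, collection, and field
# names are hypothetical.
from pymongo import MongoClient

articles = MongoClient("mongodb://localhost:27017")["demo"]["articles"]

pipeline = [
    {"$match": {"status": "published"}},                  # filter documents out
    {"$group": {"_id": "$author",                         # regroup by author
                "articles": {"$sum": 1},
                "avg_length": {"$avg": "$word_count"}}},  # derive new fields
    {"$sort": {"articles": -1}},
    {"$limit": 5},                                        # trim the final result
]

for row in articles.aggregate(pipeline):
    print(row["_id"], row["articles"], row["avg_length"])
```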
Another extremely useful MongoDB feature is the "capped collection". A user defines a maximum size for the collection, writes to it can then proceed blindly, and the oldest data is rolled over as needed, which makes capped collections a natural fit for capturing logs and other streaming data for analysis.
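A minimal sketch of creating and writing to a capped collection with PyMongo; the names and the 100 MB cap are arbitrary assumptions:

```python
# A capped-collection sketch: fixed maximum size, writes proceed blindly,
# and the oldest documents roll over automatically. Names are hypothetical.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["demo"]

if "events" not in db.list_collection_names():
    db.create_collection("events", capped=True, size=100 * 1024 * 1024)

db.events.insert_one({"level": "info", "msg": "service started"})
```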

So while the possible use cases for Elasticsearch and MongoDB overlap, they are not the same tool. But what about Hadoop? Hadoop is MapReduce, and MongoDB already supports MapReduce natively! Is there any scenario that calls for Hadoop specifically, where MongoDB is merely adequate?
There is! Hadoop is the original home of MapReduce, and it provides a far more flexible and powerful environment for processing large amounts of data. There is no doubt that it can handle scenarios that Elasticsearch or MongoDB cannot.

To see this more clearly, look at how Hadoop uses HDFS to abstract storage away from the associated computation. With data sitting in HDFS, any job can operate on it, whether written against the core MapReduce API or, via Hadoop streaming, in effectively any language you like. And with Hadoop 2 and YARN, even the core programming model has been abstracted away: you are no longer tied to MapReduce, and you could, for example, implement MPI on top of YARN and write jobs that way.
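To make the streaming model concrete, here is a minimal word-count sketch: two tiny scripts shown in one listing, with hypothetical file and HDFS paths. Streaming pipes raw input lines to the mapper's stdin and the sorted mapper output to the reducer's stdin, so plain scripts suffice:

```python
# ---- mapper.py ----
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")          # emit one (word, 1) pair per token

# ---- reducer.py ----
import itertools
import sys

pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
for word, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
    print(f"{word}\t{sum(int(n) for _, n in group)}")   # keys arrive sorted

# Submitted with something like (paths and jar name are assumptions):
#   hadoop jar hadoop-streaming.jar \
#       -input /data/text -output /data/wordcount \
#       -mapper mapper.py -reducer reducer.py \
#       -file mapper.py -file reducer.py
```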
Additionally, the Hadoop ecosystem provides a complementary set of tools, built on top of HDFS and core MapReduce, for querying, analyzing, and processing data. Hive provides an SQL-like language that lets business analysts query the data with a familiar syntax. HBase provides a column-oriented database on top of Hadoop. Pig and Sizzle offer two further programming models for querying Hadoop data. For data stored in HDFS you can bring Mahout's machine-learning capabilities into your toolset, and with RHadoop you can use the R statistical language directly for advanced statistical analysis of Hadoop data.
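As a small illustration of Hive's SQL-like interface, here is a hedged sketch that queries a hypothetical "pageviews" table from Python through the third-party PyHive package, assuming a HiveServer2 instance on its default port:

```python
# A minimal Hive query sketch via PyHive; host, port, table, and columns
# are all assumptions for illustration.
from pyhive import hive

conn = hive.connect(host="localhost", port=10000)
cur = conn.cursor()

# Familiar SQL syntax, compiled by Hive into jobs over data in HDFS.
cur.execute("""
    SELECT url, COUNT(*) AS hits
    FROM pageviews
    WHERE dt = '2014-01-01'
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10
""")

for url, hits in cur.fetchall():
    print(url, hits)
```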

So although Hadoop and MongoDB also have partially overlapping use cases and share some useful features (such as seamless horizontal scaling), there are still scenarios specific to each. If all you need is keyword queries and simple analysis, Elasticsearch will do the job; if you need to query documents and run more complex analyses over them, MongoDB is a very good fit; and if you have a huge amount of data that requires many different kinds of complex processing and analysis, Hadoop provides the broadest range of tools and the greatest flexibility.
One eternal truth is to pick the tool best suited to the job at hand. In the big-data space, where technologies emerge in an endless stream and the boundaries between them are quite blurred, that choice is genuinely hard. As we have seen, though, each scenario has a technology that fits it best, and the difference matters. The good news is that you are not limited to a single tool or technique; depending on the scenario you face, you can build an integrated system. For example, Elasticsearch and Hadoop are known to work well together: Elasticsearch serves fast keyword queries while Hadoop jobs handle the heavier, more complex analytics.

Ultimately it takes broad research and careful analysis to identify the most appropriate option. Before adopting any technology or platform, verify it carefully to understand which scenarios it suits, where it can be optimized, and what trade-offs it demands. Start with a small proof-of-concept project; once it has proven itself, roll the technology onto the real platform and scale it up step by step.
Follow these guidelines and you can navigate the big-data landscape successfully and be rewarded accordingly.

Original article (Chinese): http://f.dataguru.cn/thread-607258-1-1.html
