
Spark's in-memory computing means that it can keep RDDs resident in memory (spilling to disk when memory is insufficient), which reduces disk I/O. Its disadvantages, as I see them, are: 1. In terms of resource scheduling, Spark differs from Hadoop: Spark executes tasks in a multi-threaded model, while Hadoop uses a multi-process model. In essence this is not really a flaw in Spark, just the result of a trade-off. 2. Spark is, at heart, a distributed system built on the idea of in-memory computing, so getting the most out of its performance advantages places higher demands on cluster resources, memory in particular (it can still run when memory is tight). In plain terms, it is more expensive. The remaining gaps will be filled in later.


1. Where does Spark's in-memory computing mainly show up?
(a) Compared with MapReduce, Spark's biggest speed gain is on repeated computations: Spark can reuse the relevant cached data, while MapReduce clumsily keeps doing disk I/O. (b) To improve fault tolerance, MapReduce persists all intermediate results to disk, whereas Spark keeps them in memory by default.
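A minimal PySpark sketch of that reuse (the input path is a placeholder): the cached RDD is computed once and then served from memory by later actions, whereas a MapReduce chain would go back to disk at every step.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()
sc = spark.sparkContext

# Hypothetical input path; replace with a real dataset.
lines = sc.textFile("hdfs:///data/events.log")

# Parse once, then keep the result resident in memory.
parsed = lines.map(lambda line: line.split("\t")).cache()

# Both actions below reuse the cached 'parsed' RDD instead of
# re-reading and re-parsing the file from disk each time.
total = parsed.count()
errors = parsed.filter(lambda fields: fields and fields[0] == "ERROR").count()

print(total, errors)
spark.stop()
```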

2. What are Spark's main shortcomings at present?
(a) The JVM's memory overhead is too large: 1 GB of data often consumes around 5 GB of memory -> Project Tungsten is trying to solve this problem;
(b) There is no effective shared-memory mechanism between different Spark applications -> Project Tachyon is trying to introduce distributed memory management so that different Spark applications can share cached data.
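As a rough illustration of point (a): since Tungsten, Spark can keep part of its working set off the JVM heap to cut object and GC overhead. A minimal sketch; the 2 GB figure is an arbitrary example, not a recommendation.

```python
from pyspark.sql import SparkSession

# Enable Tungsten-managed off-heap memory to reduce JVM object and GC overhead.
spark = (
    SparkSession.builder
    .appName("offheap-demo")
    .config("spark.memory.offHeap.enabled", "true")
    .config("spark.memory.offHeap.size", "2g")   # example size only
    .getOrCreate()
)
```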


Weaknesses:
1. Unstable: the cluster occasionally hangs. Suitable only for running computations, not for serving requests directly.
2. Data partitioning is not good enough, which can leave compute tasks unevenly distributed across the machines in the cluster (see the sketch after this list).
3. Task scheduling is not good enough.
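For point 2, the usual workaround is to repartition so that work is spread over more tasks. A minimal sketch, with invented sizes and partition counts:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-demo").getOrCreate()
sc = spark.sparkContext

# An RDD held in only two partitions, so at most two tasks can run in parallel.
rdd = sc.parallelize(range(1_000_000), numSlices=2)

# Shuffle the records across more partitions so tasks are spread more
# evenly over the cluster; 16 is an arbitrary example.
balanced = rdd.repartition(16)

# Compare partition sizes before and after.
print(rdd.glom().map(len).collect())
print(balanced.glom().map(len).collect())

spark.stop()
```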


1. The object-relational mismatch: squeezing object-oriented "round objects" into relation-oriented "square tables" is difficult and laborious, and all of it can be avoided.
2. The static, rigid, inflexible nature of the relational model makes it very hard to change schemas to keep up with changing business needs. For the same reason, the database often holds a development team back when it wants to practice agile software development.
3. The relational model is poorly suited to expressing semi-structured data, and industry analysts and researchers agree that semi-structured data is the next big thing in information management.
4. A network is a very efficient data-storage structure. It is no coincidence that the human brain is a huge network, and that the World Wide Web is likewise structured as a mesh. The relational model can express network-oriented data, but it is very weak at traversing the network and extracting information (see the sketch after this list).
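To make point 4 concrete, here is how a two-hop "friends of friends" traversal looks as a Cypher query sent from the official neo4j Python driver; the connection details, labels, and relationship type are placeholders, and a relational database would express the same thing as self-joins on a friendship table.

```python
from neo4j import GraphDatabase

# Connection details are placeholders.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# A two-hop traversal: people known by the people Alice knows.
query = """
MATCH (p:Person {name: $name})-[:KNOWS]->()-[:KNOWS]->(fof)
RETURN DISTINCT fof.name AS friend_of_friend
"""

with driver.session() as session:
    for record in session.run(query, name="Alice"):
        print(record["friend_of_friend"])

driver.close()
```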
Although Neo4j is a relatively new open-source project, it is already used in production systems with more than 100 million nodes, relationships, and properties, and it meets enterprise needs for robustness and performance:
full support for JTA and JTS, two-phase-commit (2PC) distributed ACID transactions, configurable isolation levels, and large-scale, testable transaction recovery. These are not just lip service: Neo4j has run in demanding 24/7 environments for more than three years. It is mature, robust, and fully ready for deployment.
Graph: here, in data-structure terms, a network formed from a collection of trees.
Neo4j is an embedded, disk-based, fully transactional Java persistence engine that stores data in graphs (networks) rather than in tables. Neo4j offers massive scalability: it can handle graphs of billions of nodes/relationships/properties on a single machine and can scale out across multiple machines running in parallel. Compared with relational databases, graph databases are good at handling large volumes of complex, highly interconnected, loosely structured data that changes rapidly and is queried frequently; in a relational database such queries turn into large numbers of table joins and therefore cause performance problems. Neo4j focuses on the performance degradation that traditional RDBMSs suffer on join-heavy queries. By modeling data as a graph, Neo4j traverses nodes and edges at a speed that is independent of the amount of data making up the graph. In addition, Neo4j provides very fast graph algorithms, recommender systems, and OLAP-style analytics, none of which are available in today's RDBMSs.
People are curious about Neo4j because it is a "network-oriented database". In this model, domain data is expressed in a "node space": compared with the tables, rows, and columns of the traditional model, the node space is a network of nodes, relationships, and properties (key-value pairs). Relationships are first-class objects that can themselves carry properties, which describe the context in which nodes interact. The network model is a natural fit for problem domains that are inherently hierarchical or interconnected, such as semantic-web applications. Neo4j's creators found that such inheritance-shaped, semi-structured data does not fit the traditional relational database model.
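A small sketch of that "node space", again via the Python driver (the credentials, labels, and property keys are invented): two nodes with properties, connected by a relationship that carries a property of its own, since relationships are first-class objects here.

```python
from neo4j import GraphDatabase

# Connection details are placeholders.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Two nodes with key-value properties, plus a relationship that itself
    # carries a property ("since").
    session.run(
        """
        MERGE (a:Person {name: $a})
        MERGE (b:Person {name: $b})
        MERGE (a)-[r:KNOWS]->(b)
        SET r.since = $since
        """,
        a="Alice", b="Bob", since=2020,
    )

driver.close()
```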

First, both Hadoop and Apache Spark are big data frameworks, but they exist for different purposes. Hadoop is essentially a distributed data infrastructure: it distributes huge data sets across multiple nodes in a cluster of commodity machines for storage, which means you do not need to buy and maintain expensive server hardware.

At the same time, Hadoop also indexes and tracks this data, making big data processing and analysis more efficient than ever. Spark, by contrast, is a tool specifically for processing data held in distributed storage; it does not store distributed data itself.

The two can also be combined.

Besides the universally acknowledged distributed storage component, HDFS, Hadoop also provides a data-processing component called MapReduce. So we could skip Spark entirely and use Hadoop's own MapReduce to do the data processing.

Conversely, Spark does not have to depend on Hadoop to survive. But, as mentioned above, it provides no file management system of its own, so it must be integrated with a distributed file system to work: Hadoop's HDFS, or another cloud-based data platform. Still, Spark is most often used on top of Hadoop; after all, the combination is widely considered the best.
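A minimal sketch of that pairing: Spark doing the processing, HDFS doing the storage. The path and the "region" column are invented for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-demo").getOrCreate()

# Spark handles the processing; HDFS (or S3, etc.) handles the storage.
# The path and schema below are placeholders.
df = spark.read.csv("hdfs:///warehouse/sales/*.csv", header=True, inferSchema=True)

df.groupBy("region").count().show()

spark.stop()
```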

The following is the most concise and clear analysis of MapReduce, excerpted from the Internet:

Suppose we want to count all the books in the library. You count shelf 1 and I count shelf 2: that is "Map". The more of us there are, the faster the counting goes.

Now we get together and add up everyone's counts. That is "Reduce".
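The classic word count maps directly onto that analogy; a PySpark sketch (the input path is a placeholder), where the per-shelf counting is the map/flatMap step and the summing up is the reduce step.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
sc = spark.sparkContext

# "Map": each worker counts the words on its own slice of the input.
words = sc.textFile("hdfs:///books/*.txt").flatMap(lambda line: line.split())
pairs = words.map(lambda w: (w, 1))

# "Reduce": the per-worker counts are added together, word by word.
counts = pairs.reduceByKey(lambda a, b: a + b)

for word, n in counts.take(10):
    print(word, n)

spark.stop()
```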

Spark's data processing speed crushes MapReduce's

Spark is much faster than MapReduce because of how differently it processes data. MapReduce works in steps: "Read data from the cluster, perform one processing step, write the result back to the cluster; read the updated data from the cluster, perform the next processing step, write the result back to the cluster; and so on," explains Kirk Borne, data scientist at Booz Allen Hamilton.

In contrast, Spark does the entire data analysis in memory, in near real time: "Read the data from the cluster, do all the necessary analytical processing, write the results back to the cluster, done," Borne said. Spark's batch processing is nearly 10 times faster than MapReduce, and its in-memory data analysis is nearly 100 times faster.

If the data to be processed and the required results are mostly static, and you have the patience to wait for batch jobs to finish, MapReduce's approach is perfectly acceptable.

But if you need to analyze streaming data, such as readings collected by sensors in a factory, or if your application requires multiple passes over the data, then you should probably use Spark.

Most machine learning algorithms require multiple passes over the data. Beyond that, Spark is commonly used for scenarios such as real-time marketing, online product recommendation, network security analysis, and machine log monitoring.
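For the streaming case, a minimal Structured Streaming sketch, using Spark's built-in "rate" test source as a stand-in for real sensor data (a real job would read from Kafka, sockets, or files):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# The 'rate' source emits (timestamp, value) rows continuously.
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Count events per 10-second window as they arrive.
counts = stream.groupBy(window(stream.timestamp, "10 seconds")).count()

# Print the running counts to the console; runs until stopped.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())

query.awaitTermination()
```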

Disaster recovery

The two take very different approaches to disaster recovery, but both work well. Because Hadoop writes data to disk after each processing step, it is naturally resilient to system failures.

Spark's data objects are stored in what are called Resilient Distributed Datasets (RDDs). "These data objects can be kept in memory or on disk, so RDDs can also provide full disaster-recovery capability," Borne noted.
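A sketch of that choice (the RDD contents are arbitrary): the storage level decides whether partitions live in memory, on disk, or both, and either way a lost partition can be recomputed from the RDD's lineage.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-persist-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1_000_000)).map(lambda x: x * x)

# Keep partitions in memory, spilling to disk when memory runs out.
# If an executor is lost, the missing partitions are rebuilt from lineage.
rdd.persist(StorageLevel.MEMORY_AND_DISK)

print(rdd.sum())
spark.stop()
```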
