Real-Time Big Data: Spark Is Catching Fire

In IT, whenever the last "hot new thing" fails to live up to its hype, another inevitably appears to take its place. The hot topic in big data right now is how to analyze massive, distributed data sets both quickly and accurately.

In today's big data landscape, Hadoop serves as the software for storing and distributing massive data, while MapReduce acts as the engine that processes it. Together, they can batch-process data whose timeliness requirements are not demanding.

So how should near-real-time big data analysis be done? Apache Spark, the most advanced next-generation open source engine, has already created the conditions for streaming video, sensor data, transaction analysis, machine learning, and predictive modeling. It can also be applied to genomic research, packet inspection, malware detection, and more.

Spark is not just a batch engine like MapReduce. Algorithms that must iterate over large data sets can keep their intermediate results in Spark's in-memory cache. MapReduce, by contrast, must write the results of each step to disk before the next step in the pipeline can read them. This fast, in-memory processing of resilient distributed datasets (RDDs) can be called the core capability of Apache Spark.

Salient Federal Solutions has been committed to building analytics products on Spark for government agencies. Dave Vennergrund, the company's director of predictive analytics, said: "Once you perform operations on data sets, they can be chained together so that transformations complete quickly, and that work can be spread across multiple machines at once. That allows us to react quickly."

Spark's supporters believe it holds an advantage over its competitors in both scalability and speed. It works well on small data sets and still performs outstandingly when scaled up to petabytes. In a benchmark competition in November 2014, Apache Spark sorted 100 terabytes of data three times faster than Hadoop MapReduce, using a cluster one-tenth the size.

A recent survey by the software development company Typesafe indicates that interest in Spark is growing. The data show that 13% of respondents are already using Spark, about 30% are evaluating it, and 20% plan to start using it at some point this year. Another 6% expect to adopt Spark in 2016. The remaining 28% are either unfamiliar with Spark or consider it not yet mature.

Cindy Walker, vice president of Salient's data analytics center, said: "On the government side, the ones doing early testing, evaluation, and deployment are the departments that have sandboxes and R&D budgets. Many of our customers are now deploying big data in-memory analytics and streaming solutions but have not yet drawn a baseline for those capabilities, so we are currently using Spark to help them set reasonable goals."

Although Spark will not replace MapReduce, it will eventually become part of the larger big data analytics field and push data processing to become ever faster.

The Apache Spark ecosystem includes the following components:

Spark Core: the platform's underlying execution engine, supporting a large number of applications along with Java, Scala, and Python application programming interfaces (APIs).

Spark SQL (Structured Query Language): lets users explore their data with SQL queries.

Spark Streaming: analyzes live data streams, such as feeds from Twitter, by processing them in small batches.

Machine Learning Library (MLlib): a distributed machine learning framework that delivers high-quality algorithms up to 100 times faster than MapReduce.

GraphX: helps users represent text and tabular data in graph form and discover different relationships within the data.

SparkR: a package for the R statistical language. R users can access Spark functionality from within the R shell.

BlinkDB: a massively parallel engine that lets users run SQL queries over huge amounts of data, trading a little accuracy for speed; it is very useful when speed matters more than precision.

Source: blog.csdn.net/yuyuy0145/article/details/92430608