Seven tools that fire up Spark's big data engine

Spark is taking the data processing world by storm. Let's look at the major tools that fuel Spark's big data platform.

Apache Spark not only makes big data processing faster; it also makes it easier, more powerful, and more convenient. Spark isn't a single technology but a collection of components, with new features and performance improvements arriving all the time and each component maturing along the way.
This article walks through each major part of the Spark ecosystem: what it does, why it matters, how it's growing, where it falls short, and where it's likely headed.

Spark Core
The heart of Spark is the aptly named Spark Core. In addition to coordinating and scheduling jobs, Spark Core provides the basic abstraction for data processing in Spark: the Resilient Distributed Dataset (RDD).
RDDs support two kinds of operations: transformations and actions. Transformations derive new RDDs from existing ones; actions compute a result from an RDD (counting its elements, for example).
Spark is fast in part because transformations and actions work on data kept in memory. Transformations are evaluated lazily, meaning they run only when an action needs their results; the downside is that lazy evaluation can make it hard to pin down what's running slowly.
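To make the distinction concrete, here is a minimal Scala sketch (the input file name and local master URL are assumptions for illustration): the transformations only record lineage, and no work happens until the count() action runs.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-demo").setMaster("local[*]"))

    // Transformations are lazy: these lines only record lineage; no job runs yet.
    val lines  = sc.textFile("events.log")          // hypothetical input file
    val errors = lines.filter(_.contains("ERROR"))  // transformation -> new RDD
    val words  = errors.flatMap(_.split(" "))       // transformation -> new RDD

    // count() is an action: only now does Spark execute the whole pipeline.
    println(s"words in error lines: ${words.count()}")

    sc.stop()
  }
}
```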
Spark's speed is constantly improving. The JVM's memory management is a frequent source of trouble for Spark, so Project Tungsten aims to sidestep the JVM's memory and garbage-collection subsystems and manage memory more efficiently itself.

Spark API
Spark is written primarily in Scala, so Scala has long been the language of Spark's main API. But three other, far more widely used languages are also supported: Java (on whose runtime Spark also depends), Python, and R.
In general, you're best off picking the language you know best, since the features you need will most likely be supported directly in it. There is one exception: machine learning support in SparkR is still weak, with only a small set of algorithms available so far. That is bound to change over time, though.

Spark SQL
Never underestimate the power or convenience of being able to run SQL queries against bulk data. Spark SQL provides a common mechanism for executing SQL queries (and requesting columnar DataFrames) over data managed by Spark, including queries piped in through ODBC/JDBC connectors. You don't even need a formal data source: Spark 1.6 added support for querying flat files in any supported format directly, in the style of Apache Drill.
Spark SQL isn't really meant for updating data, since that would run against the grain of Spark's design. Results can be written back out as a new Spark data source (a new Parquet table, say), but UPDATE queries are not supported, and don't expect them anytime soon. Most of the effort going into Spark SQL is aimed at improving its performance, since it also forms the basis of Spark Streaming.
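As a sketch of that read-query-write flow, here is roughly what it looks like with the Spark 2.0 SparkSession API (the file names are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

object SqlDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sql-demo").master("local[*]").getOrCreate()

    // Query a flat file directly; no formal data source is needed.
    val users = spark.read.parquet("users.parquet")
    users.createOrReplaceTempView("users")

    // Plain SQL against the data; the result comes back as a new DataFrame.
    val adults = spark.sql("SELECT name, age FROM users WHERE age >= 18")
    adults.show()

    // Results are written out as a *new* data source; UPDATE is not supported.
    adults.write.parquet("adults.parquet")

    spark.stop()
  }
}
```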
Spark Streaming
Spark was designed from the start to support many processing styles, including stream processing; hence Spark Streaming. The conventional wisdom about Spark Streaming is that it's half-baked: use it only if you don't need split-second latency, or if you're not already invested in another stream processing solution (say, Apache Storm).
But Storm is losing favor; its long-time user Twitter has since moved to its own project, Heron. What's more, Spark 2.0 promises a new "structured streaming" mode for running interactive Spark SQL queries over live data, including queries that use Spark's machine learning library. Whether its performance is high enough to beat the competition remains to be seen, but it deserves serious consideration.
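For reference, here is a minimal sketch of the classic DStream API (the socket host and port are assumptions for illustration). It also shows where the latency criticism comes from: the stream is chopped into micro-batches, so latency can never drop below the batch interval.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-demo").setMaster("local[2]")
    // Micro-batch model: the stream is cut into 10-second batches of RDDs.
    val ssc = new StreamingContext(conf, Seconds(10))

    // Count words in text arriving on a local socket (hypothetical host/port).
    val counts = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```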

MLlib (Machine Learning)
Machine learning has a reputation for being both magical and hard. MLlib lets you run many common machine learning algorithms on data held in Spark, making these kinds of analytics far easier and more accessible to Spark users.
The set of algorithms available in MLlib is large and grows with each revision of the framework. That said, some categories of algorithms are still missing -- anything involving deep learning, say. Third parties are taking advantage of Spark's popularity to fill the void; Yahoo, for example, performs deep learning with CaffeOnSpark, which drives the Caffe deep learning framework from Spark.
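As a taste of the API, here is a small sketch using MLlib's RDD-based KMeans; the input file of space-separated feature vectors is hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object MllibDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("mllib-demo").setMaster("local[*]"))

    // Parse a hypothetical file of space-separated numeric feature vectors.
    val points = sc.textFile("points.txt")
      .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))
      .cache()

    // Cluster the points into 3 groups, running at most 20 iterations.
    val model = KMeans.train(points, 3, 20)
    model.clusterCenters.foreach(println)

    sc.stop()
  }
}
```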

GraphX (Graph Computing)
Depicting the relationships among millions of entities usually calls for a graph, a data structure that describes how those entities interrelate. Spark's GraphX API lets you run graph operations on data using Spark's own methods, so the heavy lifting of building and transforming such graphs is offloaded to Spark. GraphX also includes several common graph algorithms, such as PageRank and label propagation.
A major limitation of GraphX, as it stands, is that it's best suited to static graphs; processing a graph whose vertices keep being added can take a severe performance hit. And if you're already running a full-fledged graph database, GraphX is unlikely to replace it yet.
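A short sketch of what this looks like in practice, using a hand-built three-vertex graph (the user names and "follows" relation are invented for illustration) and the built-in PageRank:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphxDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("graphx-demo").setMaster("local[*]"))

    // Vertices are users; edges say who follows whom.
    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))
    val graph = Graph(vertices, edges)

    // Built-in PageRank, iterated until ranks converge within 0.001.
    val ranks = graph.pageRank(0.001).vertices
    ranks.collect().foreach { case (id, rank) => println(s"$id -> $rank") }

    sc.stop()
  }
}
```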

SparkR (R on Spark)
The R language gives statisticians an environment for numerical analysis and machine learning work. Spark added R support in June 2015 to match its support for Python and Scala.
Besides giving potential Spark developers one more language to work in, SparkR lets R programmers do many things they couldn't do before, such as work with datasets that exceed a single machine's memory, or run analyses across multiple processes or multiple machines at once.
SparkR also lets R programmers use Spark's MLlib machine learning module to build generalized linear models. Sadly, not all of MLlib's functionality is supported in SparkR yet, though Spark is closing the gaps in R support with each successive revision.

Original title: 7 tools to fire up Spark's big data engine
