Introduction to Getting Started with Apache Spark

I first heard about Spark in late 2013, when I became interested in Scala, the language Spark is written in. Some time later, I did a fun data science project that tried to predict survival on the Titanic. It turned out to be a great way to get further introduced to Spark concepts and programming, and I highly recommend it to any aspiring programmer thinking about getting started with Spark.

Today, Spark is used by many giants, including Amazon, eBay, and Yahoo!. Many organizations run Spark on clusters with thousands of nodes. According to the Spark FAQ, the largest known Spark cluster has over 8000 nodes. Spark is indeed a technology worth considering and learning.

This post will introduce you to Spark, including use cases and examples. The information in it comes from the Apache Spark website and the book Learning Spark: Lightning-Fast Big Data Analysis.

What Is Apache Spark? A Brief Introduction

Spark is an Apache project that is billed as "lightning-fast cluster computing". It has a thriving open source community and is currently the most active Apache project.

Spark provides a faster and more general data processing platform. Compared to Hadoop, Spark lets you run programs up to 100x faster in memory, or 10x faster on disk. Last year, Spark beat Hadoop in the 100 TB Daytona GraySort contest, finishing 3x faster while using only one tenth the number of machines, and it has also become the fastest open source engine for sorting petabyte-scale data.

Spark also makes it possible to write code more quickly, since it puts more than 80 high-level operators at your disposal. To illustrate this, let's take a look at the "Hello World!" of big data: the word count example. Written in MapReduce, this would take around 50 lines of code, but in Spark (with Scala) you can do it roughly like this:
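
The following is a minimal sketch, assuming sc is an already-initialized SparkContext and the HDFS paths are placeholders:

    // read the input, split each line into words, and count occurrences of each word
    val textFile = sc.textFile("hdfs://...")
    val counts = textFile.flatMap(line => line.split(" "))
                         .map(word => (word, 1))
                         .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://...")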

Another important part of learning how to use Apache Spark is the interactive shell (REPL) that it provides out of the box. Using the REPL, you can test the outcome of each line of code without first needing to write and execute an entire job. The path to working code is therefore much shorter, and ad hoc data analysis becomes possible.
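
For example, a quick spark-shell session (launched with ./bin/spark-shell, which provides sc automatically) might look like the sketch below; the file path is just an example, and each line can be evaluated and inspected on its own:

    val lines = sc.textFile("README.md")            // load a local text file into an RDD
    lines.count()                                    // how many lines does it have?
    val sparkLines = lines.filter(_.contains("Spark"))
    sparkLines.first()                               // peek at the first matching line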

Spark also provides some other key features:

  • APIs are currently available for Scala, Java, and Python, with support for other languages (such as R) coming soon.
  • Integrates well with the Hadoop ecosystem and data sources (HDFS, Amazon S3, Hive, HBase, Cassandra, etc.).
  • Can run on clusters managed by Hadoop YARN or Apache Mesos, and can also run standalone.

The Spark core is complemented by a set of powerful, higher-level libraries that can be seamlessly used in the same application. These libraries currently include SparkSQL, Spark Streaming, MLlib (for machine learning), and GraphX, each of which is described in more detail below. Additional Spark libraries and extensions are also currently under development.

Spark Core

Spark Core is the base engine for large-scale parallel and distributed data processing. It is responsible for:

  • Memory management and fault recovery
  • Scheduling, distributing and monitoring jobs on a cluster
  • Interacting with storage systems

Spark introduces the concept of a Resilient Distributed Dataset (RDD), an immutable, fault-tolerant, distributed collection of objects that can be operated on in parallel. An RDD can contain any type of object and is created by loading an external dataset or by distributing a collection from the driver program.

RDD supports two types of operations:

  • A transformation is an operation (such as map, filter, join, union, and so on) that is performed on an RDD and yields a new RDD containing the result.
  • An action is an operation (such as reduce, count, first, and so on) that returns a value after running a computation on an RDD.

In Spark, transformations are "lazy", meaning that they do not compute their results right away. Instead, they just "remember" the operation to be performed and the dataset (e.g., a file) it is to be performed on. The transformations are only actually computed when an action is called, and the result is returned to the driver program. This design enables Spark to run more efficiently. For example, if a big file is transformed in various ways and passed to a first action, Spark only needs to process and return the result for the first line, rather than doing the work for the entire file.
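
As a rough illustration (the path and the log format are hypothetical), the transformations below do no work until the action on the last line is called:

    // lazy transformations: nothing is read or computed yet
    val lines  = sc.textFile("hdfs://.../large-file.txt")
    val errors = lines.filter(_.contains("ERROR"))
    val pairs  = errors.map(line => (line.split("\t")(0), line))

    // the first action triggers the computation, and Spark only does as much
    // work as is needed to return a single element
    val sample = pairs.first()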

By default, each transformed RDD may be recomputed every time you run an action on it. However, you may also persist an RDD in memory using the persist or cache method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it.
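
A minimal sketch of caching, again with a hypothetical file and filter:

    val lines  = sc.textFile("hdfs://.../large-file.txt")
    val errors = lines.filter(_.contains("ERROR"))

    errors.cache()     // keep this RDD in memory once it has been computed

    errors.count()                                 // first action: reads the file and fills the cache
    errors.filter(_.contains("timeout")).count()   // served from the cached RDD, much faster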

SparkSQL

SparkSQL is a Spark component that allows us to query data via SQL or the Hive Query Language. It originated as the Apache Hive port to run on top of Spark (in place of MapReduce) and is now integrated with the Spark stack. In addition to providing support for a wide variety of data sources, it also makes it possible to weave SQL queries together with code transformations, which results in a very powerful tool. Below is an example of a Hive-compatible query:
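
This sketch uses the Spark 1.x HiveContext API; the table and the input path come from the standard Spark examples and are only placeholders:

    // sc is an existing SparkContext
    val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

    sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
    sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

    // queries are expressed in HiveQL
    sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)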

Spark Streaming

Spark Streaming supports real-time processing of streaming data, such as production web server log files (e.g., via Apache Flume or HDFS/S3), social media such as Twitter, and various messaging queues such as Kafka. Under the hood, Spark Streaming receives the input data streams and divides the data into batches; the Spark engine then processes those batches and generates the final stream of results, also in batches.

The Spark Streaming API closely matches that of the Spark Core API, which makes it easy for programmers to work in the worlds of both batch and streaming data.
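
As a rough sketch of the Spark 1.x streaming API (the hostname and port are placeholders for a real text stream source), a streaming word count over one-second batches could look like this:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("NetworkWordCount")
    val ssc  = new StreamingContext(conf, Seconds(1))      // 1-second batches

    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.print()                                          // print a sample of each batch's counts

    ssc.start()               // start receiving and processing data
    ssc.awaitTermination()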

MLlib

MLlib is a machine learning library that provides a wide variety of algorithms designed to run on a cluster for classification, regression, clustering, collaborative filtering, and so on (check out Toptal's article on machine learning for more information). Some of these algorithms also work with streaming data, such as linear regression using ordinary least squares or k-means clustering (and more are on the way). Apache Mahout (a machine learning library for Hadoop) has already moved away from MapReduce and joined forces with Spark MLlib.
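
As a hedged illustration of the MLlib API (Spark 1.x RDD-based API, with a hypothetical input file containing one space-separated point per line), k-means clustering could be sketched like this:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // parse each line into a dense vector and cache the result for the iterative algorithm
    val data = sc.textFile("hdfs://.../kmeans_data.txt")
    val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

    val clusters = KMeans.train(parsedData, 2, 20)   // 2 clusters, up to 20 iterations

    // within-set sum of squared errors, a rough measure of clustering quality
    println("WSSSE = " + clusters.computeCost(parsedData))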

GraphX

GraphX is a library for manipulating graphs and performing graph-parallel operations. It provides a uniform tool for ETL, exploratory analysis, and iterative graph computations. In addition to built-in operations for graph manipulation, it provides a library of common graph algorithms such as PageRank.
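
A brief sketch of running PageRank with GraphX (the edge-list file is hypothetical, with one "sourceId destinationId" pair per line):

    import org.apache.spark.graphx.GraphLoader

    // load a graph from an edge list and run PageRank until the ranks converge within the given tolerance
    val graph = GraphLoader.edgeListFile(sc, "hdfs://.../followers.txt")
    val ranks = graph.pageRank(0.0001).vertices

    // show the ten highest-ranked vertices
    ranks.sortBy(_._2, ascending = false).take(10).foreach(println)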

How to Use Apache Spark: An Event Detection Use Case

Now that we have answered the question "What is Apache Spark?", let's think about what kinds of problems or challenges it is best suited for.

Recently, I came across an article about an experiment to detect earthquakes by analyzing a Twitter stream. Interestingly, it showed that this technique was likely to inform you of an earthquake in Japan more quickly than the Japan Meteorological Agency. Even though the article used different technology, I think it is a great example of how we could put Spark to use with simplified code snippets and without the glue code.

First, we would have to filter out the tweets that seem relevant, such as those mentioning "earthquake" or "shaking". We could easily use Spark Streaming for that purpose, as shown below:
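
A hedged sketch using the spark-streaming-twitter integration (ssc is an assumed StreamingContext, and the Twitter OAuth credentials are assumed to be configured elsewhere):

    import org.apache.spark.streaming.twitter.TwitterUtils

    val tweets = TwitterUtils.createStream(ssc, None)

    // keep only tweets whose text looks earthquake-related
    val relevant = tweets.filter { status =>
      val text = status.getText.toLowerCase
      text.contains("earthquake") || text.contains("shaking")
    }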

Then, we would have to run some semantic analysis on the tweets to determine whether they appear to be referring to an earthquake occurring right now. Tweets like "Earthquake!" or "Now it is shaking", for example, would be considered positive matches, whereas tweets like "Attending an earthquake conference" or "The earthquake yesterday was scary" would not. The authors of the paper used a support vector machine (SVM) for this purpose. We will do the same here, but could also try a streaming version. A resulting code example using MLlib would look like the following:
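
This is a sketch based on the standard MLlib binary classification example; the LIBSVM-formatted tweet dataset (sample_earthquake_tweets.txt) is hypothetical and would have to be prepared from labeled tweets:

    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
    import org.apache.spark.mllib.util.MLUtils

    // load labeled tweet features and split into training (60%) and test (40%) sets
    val data = MLUtils.loadLibSVMFile(sc, "sample_earthquake_tweets.txt")
    val Array(training, test) = data.randomSplit(Array(0.6, 0.4), seed = 11L)
    training.cache()

    // train the SVM and clear the default threshold so predict() returns raw scores
    val model = SVMWithSGD.train(training, 100)
    model.clearThreshold()

    // evaluate on the test set using area under the ROC curve
    val scoreAndLabels = test.map(point => (model.predict(point.features), point.label))
    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    println("Area under ROC = " + metrics.areaUnderROC())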

If we are satisfied with the prediction rate of the model, we can move on to the next stage and react whenever we detect an earthquake. To detect one, we need a certain number (i.e., density) of positive tweets within a defined time window (as described in the article). Note that, for tweets with Twitter location services enabled, we could also extract the location of the earthquake. Armed with this knowledge, we could use SparkSQL to query an existing Hive table (storing the users interested in receiving earthquake notifications), retrieve their email addresses, and send them a personalized warning email, as follows:
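
A hedged sketch, again using the Spark 1.x HiveContext API; the earthquake_warning_users table and the sendEmail helper are hypothetical:

    // sc is an existing SparkContext
    val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

    // hypothetical helper that would format and send the warning email
    def sendEmail(row: org.apache.spark.sql.Row): Unit =
      println(s"Sending earthquake warning to ${row.getString(0)} <${row.getString(1)}>")

    sqlContext.sql("SELECT name, email FROM earthquake_warning_users")
              .collect()
              .foreach(sendEmail)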

Other Apache Spark Use Cases

Of course, potential use cases for Spark extend far beyond earthquake detection.

Here is a quick (and certainly far from exhaustive) sampling of other use cases that require fast processing of large and varied data, and for which Spark is also very well suited:

In the gaming industry, processing and discovering patterns in the firehose of real-time in-game events, and being able to respond to them immediately, is a capability that can yield a lucrative business, for purposes such as player retention, targeted advertising, auto-adjustment of complexity level, and so on.

In the e-commerce industry, real-time transaction information could be passed to a streaming clustering algorithm like k-means or a collaborative filtering algorithm like ALS. The results could then be combined with other unstructured data sources, such as customer comments or product reviews, and used over time to continually improve and adapt the system's recommendations.

In the finance or security industry, the Spark stack could be applied to fraud or intrusion detection systems, or to risk-based authentication. Very good results can be achieved by analyzing huge amounts of compressed logs and combining them with external data sources, such as information about data breaches and compromised accounts (see, for example, https://haveibeenpwned.com/) and information from the connection/request, such as IP address or time.

Conclusion

To sum up, Spark helps to simplify the challenging and computationally intensive task of processing high volumes of real-time or compressed data, both structured and unstructured, while seamlessly integrating complex capabilities such as machine learning and graph algorithms. Spark brings big data processing down to earth. Give it a try!

