Microsoft releases .NET for Apache Spark, efficient and powerful

Yesterday I saw the news that Microsoft's market value had passed one trillion dollars. Why? There are many possible reasons, but I think a large part of it comes down to the CEO's open source strategy. From Microsoft's cloud, to open-sourcing .NET, to WSL, Microsoft has gone from an opponent of open source to one of its supporters. Recently, at Spark + AI Summit, Microsoft released the open source .NET for Apache Spark, adding another chapter in the big data field. In this article, Bugs introduces the project.

Overview

As we have covered before, Apache Spark is the Apache Foundation's most popular open source, distributed, in-memory big data processing engine. Spark can be used for batch processing, real-time stream processing, machine learning, and data querying.

The .NET for Apache Spark project provides .NET bindings for the Spark APIs, so that .NET developers can do big data analysis more easily. Spark already had official support for Scala, Java, R, and Python, and .NET now joins them.

.NET for Apache Spark is released as an open source project under the .NET Foundation. It is hosted on GitHub, where the full source code is available (repository: github.com/dotnet/spark).

Introduction

.NET for Apache Spark provides high-performance .NET APIs for working with Spark. It is built on Spark's interop layer, which is designed to give multiple languages high-performance access to Spark, and it supports C#, F#, and other .NET languages.

Through the .NET APIs, we can access all components of Apache Spark, including Spark SQL, DataFrames, Streaming, MLlib, and so on.
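
As an illustration of reaching Spark SQL from .NET, here is a minimal sketch, not taken from the original article. It assumes the Microsoft.Spark NuGet package is referenced and that the program is launched through spark-submit. The file name "people.json" and its "name" and "age" fields are hypothetical.

```csharp
using Microsoft.Spark.Sql;

namespace SparkSqlSketch
{
    internal static class Program
    {
        private static void Main()
        {
            // Create (or reuse) a Spark session
            SparkSession spark = SparkSession
                .Builder()
                .AppName("spark_sql_sketch")
                .GetOrCreate();

            // Hypothetical input: a JSON file with "name" and "age" fields
            DataFrame people = spark.Read().Json("people.json");

            // Register the DataFrame as a temporary view so SQL can refer to it
            people.CreateOrReplaceTempView("people");

            // Run a Spark SQL query and print the result
            DataFrame adults = spark.Sql("SELECT name, age FROM people WHERE age >= 18");
            adults.Show();

            spark.Stop();
        }
    }
}
```

The same DataFrame can be queried through either the DataFrame methods or SQL text, so which one to use is mostly a matter of taste.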

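.NET for Apache Spark is compliant with .NET Standard, a formal specification of .NET APIs, so it can be brought into .NET code at any time in a plug-in fashion and is easy to extend. Existing .NET projects, code, and coding habits carry over seamlessly into .NET for Apache Spark development. Because it targets .NET Standard 2.0, it runs cross-platform on Linux, macOS, and Windows. It also supports cloud deployments: it is available by default in Azure HDInsight and can be installed in Azure Databricks and similar services.

Getting started with examples

Using .NET for Apache Spark requires .NET Core and Spark to be installed first, including: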

.NET Core 2.1 SDK

Java 1.8

Apache Spark 2.4.1

Microsoft.Spark.Worker

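Once the software above is installed and configured, you can start developing Spark applications. Here are two simple examples, one in C# and one in F#.

C# example: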

using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

// Create a Spark session
var spark = SparkSession
    .Builder()
    .AppName("word_count_sample")
    .GetOrCreate();

// Create a DataFrame
DataFrame dataFrame = spark.Read().Text("input.txt");

// Manipulate and view the data
var words = dataFrame.Select(Split(dataFrame["value"], " ").Alias("words"));
words.Select(Explode(words["words"])
    .Alias("word"))
    .GroupBy("word")
    .Count()
    .Show();

F# example:

open Microsoft.Spark.Sql

// Create a Spark session
let spark =
    SparkSession.Builder()
        .AppName("word_count_sample")
        .GetOrCreate()

// Create a DataFrame
let df = spark.Read().Text("input.txt")

// Manipulate and view the data
let words = df.Select(Functions.Split(df.["value"], " ").Alias("words"))
words.Select(Functions.Explode(words.["words"]).Alias("word"))
    .GroupBy("word")
    .Count()
    .Show()

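Performance analysis

An important aspect of data analysis is being able to operate on and analyze data with high performance. A lot of performance testing was done on .NET for Apache Spark before release: the team ran the TPC-H benchmark against the preview version, and the results show that .NET for Apache Spark performs well. The TPC-H benchmark consists of a suite of business-oriented queries. The chart below compares the performance of .NET Core with Python and Scala on the TPC-H query set.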

[Figure: per-query TPC-H performance of .NET for Apache Spark, Python, and Scala on Apache Spark]

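The chart above shows the per-query performance of .NET for Apache Spark, Python, and Scala on Apache Spark. .NET for Apache Spark performs well. Moreover, in cases where UDF performance is critical, such as Query 1, where 3 billion rows of non-string data are passed between the JVM and the .NET CLR, .NET for Apache Spark is 2x faster than Python.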
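
For context, a UDF (user-defined function) is ordinary .NET code that Spark calls for each row, which is why rows have to cross the boundary between the JVM and the CLR. Below is a minimal sketch of a C# UDF. It is not the benchmark query, and it assumes the Microsoft.Spark NuGet package and the same hypothetical "input.txt" file used in the examples earlier.

```csharp
using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

namespace UdfSketch
{
    internal static class Program
    {
        private static void Main()
        {
            SparkSession spark = SparkSession
                .Builder()
                .AppName("udf_sketch")
                .GetOrCreate();

            // Each line of the text file becomes one row in the "value" column
            DataFrame lines = spark.Read().Text("input.txt");

            // Wrap a plain C# lambda as a UDF; Spark sends each row's value
            // to the .NET worker process, which runs the lambda
            Func<Column, Column> lineLength = Udf<string, int>(line => line.Length);

            // Apply the UDF like any built-in column function
            lines.Select(lines["value"], lineLength(lines["value"]).Alias("length"))
                .Show();

            spark.Stop();
        }
    }
}
```

Built-in functions such as Split run inside the JVM, so a UDF is usually worth its transfer cost only when no built-in function does the job.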

The total execution time (in seconds, lower is better) for all 22 queries in the TPC-H benchmark is shown in the chart below.

[Figure: total execution time in seconds for all 22 TPC-H queries (lower is better)]

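The data comes from internal runs of the TPC-H benchmark, using warm execution on Ubuntu 16.04.

Of course, because the benchmark used the preview of .NET for Apache Spark, which has not yet been heavily optimized, the performance of the official release is expected to improve further.

Outlook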

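This release is only the first step of a long journey for .NET for Apache Spark. The team has also published a roadmap, and the items worth looking forward to include: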

A simplified getting-started experience, documentation, and samples

Native integration with developer tools such as Visual Studio, Visual Studio Code, and Jupyter notebooks

.NET support for user-defined aggregate functions

Idiomatic C# and F# APIs and examples (for example, writing queries with LINQ)

Out-of-the-box support for Azure Databricks, Kubernetes, and others

Making .NET for Apache Spark part of Spark Core

Origin www.cnblogs.com/1994jinnan/p/12324628.html