.NET for Apache Spark preview version officially released

Original article: https://www.iteblog.com/archives/2544.html

On April 25, 2019, Microsoft’s Rahul Potharaju, Terry Kim, and Tyson Condie presented the talk "Introducing .NET Bindings for Apache Spark" at the Spark + AI Summit 2019 conference and announced the official release of the .NET for Apache Spark preview version.

.NET is a free software framework developed by Microsoft. It is aimed at agile software development, rapid application development, platform independence and network transparency, and is used to build many different types of applications. Current programming language rankings also suggest that .NET is one of the most widely used development platforms in the world, and its flagship language, C#, is listed as one of the most popular programming languages in various articles and statistics.

C# ranks eighth among the most popular programming languages in the Stack Overflow developer survey, and it ranked sixth among the most popular programming languages on GitHub in 2018. Although so many developers use C#, there has so far been no good big data solution for them. To address this gap, Microsoft brought us .NET for Apache Spark.
The goal of .NET for Apache Spark is to let .NET developers use all of the Apache Spark APIs, since Apache Spark currently supports only the Scala, Java, Python and R programming languages. Microsoft has contributed a great deal to open source projects in recent years, so .NET for Apache Spark is naturally an open source project as well (project address: https://github.com/dotnet/spark), released under the MIT license.

What is .NET for Apache Spark

.NET for Apache Spark provides C# and F# developers with a high-performance API for accessing Apache Spark. Through this .NET API, users can access all components of Apache Spark, including Spark SQL, DataFrames, Streaming, MLlib and so on. The project also lets .NET developers reuse their existing knowledge, skills, code and libraries.
Support for C#/F# in Spark is built on a new Spark interoperability layer (interop layer), which makes Spark easier to extend. In the long run, this extensibility can be used to add support for other languages to Spark; see SPARK-26257 for details. The architecture of .NET for Apache Spark is as follows:

.NET for Apache Spark conforms to .NET Standard 2.0 and can be used on Linux, macOS and Windows, just like the rest of .NET. It is available by default in Azure HDInsight and can be installed in Azure Databricks and other environments.

Use .NET for Apache Spark

Before using .NET for Apache Spark, you need to install some prerequisite software (the original article links to the details). Once that is done, Spark applications can be written in C# or F#. The following are WordCount programs written in C# and F# respectively:
C# version of WordCount: [code screenshot not preserved]
F# version of WordCount: [code screenshot not preserved]
As can be seen, this is very similar to Spark's native API.
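As a minimal sketch of what a C# WordCount could look like (not the article's original listing), assuming the Microsoft.Spark NuGet package is referenced and the program is launched through spark-submit. The file name input.txt, the application name and the class name are hypothetical placeholders:

```csharp
using Microsoft.Spark.Sql;

namespace WordCountSketch
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create (or reuse) a Spark session; the app name is a placeholder.
            SparkSession spark = SparkSession
                .Builder()
                .AppName("word_count_sketch")
                .GetOrCreate();

            // Read a text file into a DataFrame with one string column named "value".
            // "input.txt" is a hypothetical path.
            DataFrame lines = spark.Read().Text("input.txt");

            // Split each line on spaces, flatten to one word per row,
            // then count occurrences and sort by frequency, highest first.
            DataFrame counts = lines
                .Select(Functions.Split(Functions.Col("value"), " ").Alias("words"))
                .Select(Functions.Explode(Functions.Col("words")).Alias("word"))
                .GroupBy("word")
                .Count()
                .OrderBy(Functions.Col("count").Desc());

            counts.Show();

            spark.Stop();
        }
    }
}
```

The chain of Select, GroupBy, Count and OrderBy mirrors the DataFrame calls a Scala or Python Spark program would use, which is the similarity the text refers to.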

How does .NET for Apache Spark perform

According to Microsoft's official tests, the first preview version of .NET for Apache Spark performed well on the popular TPC-H benchmark, which consists of a set of business-oriented queries. The following figure compares the performance of .NET Core with Python and Scala on the TPC-H query set.
The figure above shows the per-query performance of .NET for Apache Spark compared with Python and Scala. .NET for Apache Spark holds up well against both. In addition, where UDF performance is critical, such as query 1, in which 3 billion rows of non-string data are passed between the JVM and the CLR, .NET for Apache Spark is 2 times faster than Python.
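To make the UDF case more concrete, here is a small sketch of a user-defined function in C#, assuming the Microsoft.Spark NuGet package. It is not the TPC-H query 1 code; the inline SQL data, the column name n and the class name are hypothetical and chosen only for illustration:

```csharp
using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

namespace UdfSketch
{
    class Program
    {
        static void Main(string[] args)
        {
            SparkSession spark = SparkSession
                .Builder()
                .AppName("udf_sketch")
                .GetOrCreate();

            // Hypothetical two-row input with a single integer column "n".
            DataFrame numbers = spark.Sql("SELECT 1 AS n UNION ALL SELECT 2 AS n");

            // Wrap a C# lambda as a Spark UDF. The lambda body executes in the
            // .NET process (CLR), so every value of "n" is handed over from the
            // JVM to the CLR and the result is handed back.
            Func<Column, Column> doubled = Udf<int, int>(n => n * 2);

            numbers
                .Select(Col("n"), doubled(Col("n")).Alias("doubled"))
                .Show();

            spark.Stop();
        }
    }
}
```

Because each row crosses the JVM/CLR boundary when a UDF like this runs, the cost of that hand-off is what dominates in UDF-heavy queries such as the one described above.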
