Spark: Why does Python significantly outperform Scala in my use case?

maestromusica:

To compare Spark's performance when using Python and Scala, I created the same job in both languages and compared the runtimes. I expected both jobs to take roughly the same amount of time, but the Python job took only 27 min, while the Scala job took 37 min (almost 40% longer!). I implemented the same job in Java as well, and it also took 37 minutes. How is it possible that Python is so much faster?

Minimal verification example:

Python job:

import pyspark

# Configuration
conf = pyspark.SparkConf()
conf.set("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
conf.set("spark.executor.instances", "4")
conf.set("spark.executor.cores", "8")
sc = pyspark.SparkContext(conf=conf)

# 960 Files from a public dataset in 2 batches
input_files = "s3a://commoncrawl/crawl-data/CC-MAIN-2019-35/segments/1566027312025.20/warc/CC-MAIN-20190817203056-20190817225056-00[0-5]*"
input_files2 = "s3a://commoncrawl/crawl-data/CC-MAIN-2019-35/segments/1566027312128.3/warc/CC-MAIN-20190817102624-20190817124624-00[0-3]*"

# Count occurrences of a certain string
logData = sc.textFile(input_files)
logData2 = sc.textFile(input_files2)
a = logData.filter(lambda value: value.startswith('WARC-Type: response')).count()
b = logData2.filter(lambda value: value.startswith('WARC-Type: response')).count()

print(a, b)

Scala job:

// Configuration
import org.apache.spark.{SparkConf, SparkContext}

val config = new SparkConf()
config.set("spark.executor.instances", "4")
config.set("spark.executor.cores", "8")
val sc = new SparkContext(config)
sc.setLogLevel("WARN")
sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")

// 960 Files from a public dataset in 2 batches 
val input_files = "s3a://commoncrawl/crawl-data/CC-MAIN-2019-35/segments/1566027312025.20/warc/CC-MAIN-20190817203056-20190817225056-00[0-5]*"
val input_files2 = "s3a://commoncrawl/crawl-data/CC-MAIN-2019-35/segments/1566027312128.3/warc/CC-MAIN-20190817102624-20190817124624-00[0-3]*"

// Count occurrences of a certain string
val logData1 = sc.textFile(input_files)
val logData2 = sc.textFile(input_files2)
val num1 = logData1.filter(line => line.startsWith("WARC-Type: response")).count()
val num2 = logData2.filter(line => line.startsWith("WARC-Type: response")).count()

println(s"Lines with a: $num1, Lines with b: $num2")

Just by looking at the code, the two jobs seem identical. I also looked at the DAGs, and they didn't provide any insight (or at least I lack the know-how to come up with an explanation based on them).

I would really appreciate any pointers.

user10938362:

Your basic assumption, that Scala or Java should be faster for this specific task, is simply wrong. You can easily verify it with minimal local applications. The Scala one:

import scala.io.Source
import java.time.{Duration, Instant}

object App {
  def main(args: Array[String]) {
    val Array(filename, string) = args

    val start = Instant.now()

    // Count lines starting with the given prefix (mirrors the Spark filter);
    // the count itself is discarded, only the timing matters
    Source
      .fromFile(filename)
      .getLines
      .filter(line => line.startsWith(string))
      .length

    val stop = Instant.now()
    val duration = Duration.between(start, stop).toMillis
    println(s"${start},${stop},${duration}")
  }
}

The Python one:

import datetime
import sys

if __name__ == "__main__":
    _, filename, string = sys.argv
    start = datetime.datetime.now()
    with open(filename) as fr:
        # Not idiomatic or the most efficient but that's what
        # PySpark will use
        sum(1 for _ in filter(lambda line: line.startswith(string), fr))

    end = datetime.datetime.now()
    duration = round((end - start).total_seconds() * 1000)
    print(f"{start},{end},{duration}")

Results (300 repetitions each, Python 3.7.6, Scala 2.11.12), on Posts.xml from the hermeneutics.stackexchange.com data dump, with a mix of matching and non-matching patterns:

[Boxplot of duration in milliseconds for the programs above]

  • Python 273.50 (258.84, 288.16)
  • Scala 634.13 (533.81, 734.45)

As you can see, Python is not only systematically faster, but also more consistent (lower spread).
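For reference, a minimal harness sketch for collecting numbers like these, assuming both programs print start,stop,duration (milliseconds last) as above; the interval method used for the figures isn't stated, so the normal-approximation confidence interval here is an assumption, and the invocations at the end are hypothetical:

import statistics
import subprocess

def time_runs(cmd, n=300):
    # Run the benchmark n times; both programs print "start,stop,duration"
    # with the duration in milliseconds as the last comma-separated field.
    samples = []
    for _ in range(n):
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
        samples.append(float(out.strip().split(",")[-1]))
    return samples

def mean_ci(samples, z=1.96):
    # Mean with a normal-approximation 95% confidence interval (assumption:
    # the intervals quoted above may have been computed differently).
    m = statistics.mean(samples)
    half = z * statistics.stdev(samples) / len(samples) ** 0.5
    return m, (m - half, m + half)

# Hypothetical invocations (pattern and program names are placeholders):
# mean_ci(time_runs(["python3", "app.py", "Posts.xml", "<row"]))
# mean_ci(time_runs(["scala", "App", "Posts.xml", "<row"]))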

The take-away message: don't believe unsubstantiated FUD. Languages can be faster or slower on specific tasks or in specific environments (for example, here Scala can be hit by JVM startup and/or GC and/or JIT), but if you see claims like "XYZ is 4x faster" or "XYZ is slow as compared to ZYX (...) approximately 10x slower", it usually means that someone wrote really bad code to test things.

Edit:

To address some concerns raised in the comments:

  • In the OP's code, data is passed mostly in one direction (JVM -> Python) and no real serialization is required (this specific path just passes the bytestring as-is and decodes it as UTF-8 on the other side). That's as cheap as it gets when it comes to "serialization". A schematic sketch of this path follows this list.
  • What is passed back is just a single integer per partition, so in that direction the impact is negligible.
  • Communication is done over local sockets (all communication on the worker beyond the initial connect and auth is performed using the file descriptor returned from local_connect_and_auth, and it's nothing else than a socket-associated file). Again, as cheap as it gets when it comes to communication between processes.
  • Considering the difference in raw performance shown above (much higher than what you see in your program), there is a lot of margin for the overheads listed above.
  • This case is completely different from cases where either simple or complex objects have to be passed to and from the Python interpreter in a form that is accessible to both parties as pickle-compatible dumps (most notable examples include old-style UDFs and some parts of old-style MLlib).
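To make the first point concrete, here is a schematic sketch of the per-partition work on the Python side for this particular job. It is an illustration only, not PySpark's actual worker (that lives in pyspark/worker.py), and the framing details are simplified:

import struct

def count_partition(sock_file, prefix):
    # Schematic: the JVM streams length-prefixed raw bytes; the Python side
    # decodes UTF-8, applies the predicate, and sends back a single integer.
    count = 0
    while True:
        header = sock_file.read(4)
        if len(header) < 4:
            break
        (length,) = struct.unpack(">i", header)  # 4-byte big-endian length prefix
        if length < 0:  # a negative length marks the end of the data section
            break
        line = sock_file.read(length).decode("utf-8")
        if line.startswith(prefix):
            count += 1
    return count  # the only value that travels back to the JVM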

Edit 2:

Since jasper-m was concerned about startup cost here, one can easily prove that Python still has a significant advantage over Scala even if the input size is significantly increased.

Here are the results for 2003360 lines / 5.6G (the same input, just duplicated multiple times, 30 repetitions), which far exceeds anything you can expect in a single Spark task.
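One way to build such an enlarged input, as a sketch (the file names and the repetition count are hypothetical; pick N so the result reaches roughly 5.6G):

import shutil

N = 40  # hypothetical; chosen so the output reaches the target size
with open("Posts_big.xml", "wb") as out:
    for _ in range(N):
        with open("Posts.xml", "rb") as src:
            shutil.copyfileobj(src, out)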

[Boxplot of duration in milliseconds for the enlarged input]

  • Python 22809.57 (21466.26, 24152.87)
  • Scala 27315.28 (24367.24, 30263.31)

Please note non-overlapping confidence intervals.
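A quick arithmetic check of that non-overlap, using the intervals quoted above:

# Intervals (low, high) from the runs above; two intervals overlap only if
# each one's low end is at or below the other's high end.
python_ci = (21466.26, 24152.87)
scala_ci = (24367.24, 30263.31)
overlap = python_ci[0] <= scala_ci[1] and scala_ci[0] <= python_ci[1]
print(overlap)  # False: Scala's low end (24367.24) exceeds Python's high end (24152.87)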
