https://www.youtube.com/watch?v=Wg2boMqLjCg
Understanding the Shuffle in Spark
-Common causes of inefficiency
Understanding when code runs on the driver vs. the workers
-Common causes of errors
How to factor your code
-For reuse between batch and streaming
Part 1: Understanding the Shuffle in Spark
A shuffle occurs when all records with the same key must be transferred to the same worker node.
1) reduceByKey vs. groupByKey
> reduceByKey is more efficient
> groupByKey can cause out-of-disk (and out-of-memory) problems
> reduceByKey, aggregateByKey, foldByKey, and combineByKey are preferred over groupByKey because they perform map-side combining before the network transfer, so they shuffle less data and use less disk
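The map-side combine can be sketched in plain Python (no Spark required); the partition contents and counts below are invented for illustration, but the comparison shows why pre-aggregating per partition shrinks the shuffle:

```python
# Simulation: how many records cross the "network" under a
# groupByKey-style shuffle vs. a reduceByKey-style shuffle.
from collections import defaultdict

partitions = [
    [("a", 1), ("a", 1), ("b", 1)],   # records held by worker 1
    [("a", 1), ("b", 1), ("b", 1)],   # records held by worker 2
]

# groupByKey: every record is shuffled as-is.
group_by_key_shuffled = sum(len(p) for p in partitions)

# reduceByKey: each worker combines locally first (map-side combine),
# so at most one record per (partition, key) pair is shuffled.
reduce_by_key_shuffled = 0
for p in partitions:
    local = defaultdict(int)
    for k, v in p:
        local[k] += v                      # local pre-aggregation
    reduce_by_key_shuffled += len(local)   # one combined record per key

print(group_by_key_shuffled, reduce_by_key_shuffled)  # 6 4
```

With skewed keys the gap grows: millions of identical keys collapse to one combined record per partition before the transfer.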
2) Join a large table with a small table
ShuffledHashJoin vs. BroadcastHashJoin
> ShuffledHashJoin: all the data will be shuffled
> BroadcastHashJoin: broadcast the small table to all worker nodes
> Use .toDebugString() or EXPLAIN to double check
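The broadcast idea can be simulated in plain Python; the table contents and names below are invented. Each "worker" holds a full copy of the small table and joins its partition of the large table locally, so the large table never moves:

```python
# Simulation of a broadcast hash join: the small table is copied to
# every worker; each worker probes it while scanning its own partition.
small_table = {"us": "United States", "fr": "France"}   # broadcast copy

large_partitions = [
    [("us", 100), ("fr", 200)],
    [("us", 300), ("de", 400)],   # "de" has no match in the small table
]

joined = []
for partition in large_partitions:       # runs independently per worker,
    for key, value in partition:         # no shuffle of the large side
        if key in small_table:           # inner-join semantics
            joined.append((key, value, small_table[key]))

print(joined)
```

In real Spark the same effect comes from broadcasting the small side (e.g. a broadcast hint or a small enough table under the auto-broadcast threshold); the plan check with EXPLAIN confirms which join was chosen.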
3) Join a medium table with a large table
Before the join, transform/filter the large table down to only the rows that can match, then shuffle the (now smaller) result
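A minimal sketch of the filter-before-shuffle idea, in plain Python with invented keys and row counts: only rows whose key can possibly match the medium table survive into the shuffle.

```python
# Simulation: pre-filtering the large table by the medium table's key set
# (a semi-join style pruning) before the expensive shuffle/join.
medium_keys = {"a", "b"}                  # keys present in the medium table

large_table = [("a", 1), ("b", 2), ("c", 3), ("d", 4), ("a", 5)]

# Filter first: rows with keys absent from the medium table are dropped
# locally, so they are never shuffled.
filtered = [row for row in large_table if row[0] in medium_keys]

print(len(large_table), len(filtered))  # 5 3
```

In Spark this corresponds to a filter (or left-semi join against the medium table's keys) applied to the large table before the full join.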
4) In Practice: Detecting Shuffle Problems
Check the Spark UI pages for task level detail about your Spark Job.
Things to Look for:
> Tasks that take much longer to run than others
> Speculative tasks being launched
> Partitions (shards) with much more input or shuffle output than others
Part 2: Execution on the Driver vs. Workers
1)
The main program is executed on the Spark driver.
Transformations are executed on the Spark workers.
Actions may transfer data from the workers to the driver.
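A common cause of errors here is mutating a driver-side variable inside a transformation: each worker runs against a serialized copy of the closure, so the driver's variable never changes. The plain-Python simulation below (invented names, deep copies standing in for closure serialization) shows the pattern:

```python
# Simulation: workers receive a *copy* of driver-side variables captured
# in a closure; mutating the copy does not update the driver.
import copy

counter = {"n": 0}                       # lives on the driver

def run_on_worker(partition, closure_vars):
    local = copy.deepcopy(closure_vars)  # stand-in for closure serialization
    for _ in partition:
        local["n"] += 1                  # updates the worker's copy only
    return local["n"]

partitions = [[1, 2, 3], [4, 5]]
worker_counts = [run_on_worker(p, counter) for p in partitions]

print(worker_counts, counter["n"])       # [3, 2] 0 -- driver counter unchanged
```

In real Spark, driver-visible aggregation like this should use an Accumulator (or an action such as count()) rather than a closed-over variable.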
2) collect()
collect() sends all the partitions to the single driver
Don't call collect() on a large RDD
> use count() or take(N) instead
> or write the results out with saveAsTextFile()
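The difference can be sketched in plain Python (invented partition layout): collect() materializes every row on the driver, while take(n) stops as soon as it has n rows, so driver memory stays bounded.

```python
# Simulation: collect() pulls every partition to the driver;
# take(n) scans partitions in order and exits early.
partitions = [list(range(1000)), list(range(1000, 2000))]

def collect(parts):
    return [row for p in parts for row in p]      # all rows hit the driver

def take(parts, n):
    out = []
    for p in parts:                               # scan partitions in order
        for row in p:
            out.append(row)
            if len(out) == n:
                return out                        # early exit
    return out

print(len(collect(partitions)), take(partitions, 3))  # 2000 [0, 1, 2]
```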