Does anyone know exactly what spark.yarn.executor.memoryOverhead is used for and why it may be using up so much space? If I could, I would love to have a peek inside this stack. Spark's description is as follows:
The amount of off-heap memory (in megabytes) to be allocated per executor. This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. This tends to grow with the executor size (typically 6-10%).
The problem I'm having is that when running Spark queries on large datasets (> 5TB), I have to set the executor memoryOverhead to 8GB; otherwise the job throws an exception and dies. What is being stored in this container that it needs 8GB per container?
I've also noticed that this error doesn't occur in standalone mode, because it doesn't use YARN.
Note: my configuration for this job is:
executor memory = 15G
executor cores = 5
yarn.executor.memoryOverhead = 8GB
max executors = 60
offHeap.enabled = false
@Henry : I think that equation uses the executor memory (in your case, 15G) and outputs the overhead value.
// Below calculation uses executorMemory, not memoryOverhead
math.max((MEMORY_OVERHEAD_FACTOR * executorMemory).toInt, MEMORY_OVERHEAD_MIN)
The goal is to calculate the overhead as a percentage of the real executor memory, which is what RDDs and DataFrames use. I will add that when using Spark on YARN, the YARN configuration settings have to be tuned carefully to match the Spark properties (as the referenced blog suggests). You might also want to look at tiered storage to offload RDDs, for example with a storage level such as MEMORY_AND_DISK.
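To make the arithmetic concrete, here is a minimal sketch of that calculation applied to the numbers in the question. The constants (a 0.10 factor and a 384 MiB floor) are my assumption based on the Spark 2.x YARN sources; older releases used a smaller factor, so check the version you run. The object and method names are made up for illustration and are not Spark APIs.

```scala
// Illustrative sketch only: OverheadSketch and its methods are hypothetical names.
object OverheadSketch {
  // Assumed defaults from the Spark 2.x YARN module; verify against your version.
  val MemoryOverheadFactor: Double = 0.10
  val MemoryOverheadMinMiB: Int = 384

  /** Default overhead (MiB) derived from the executor heap size (MiB). */
  def defaultOverheadMiB(executorMemoryMiB: Int): Int =
    math.max((MemoryOverheadFactor * executorMemoryMiB).toInt, MemoryOverheadMinMiB)

  /** Memory requested from YARN per executor container: heap plus overhead. */
  def containerRequestMiB(executorMemoryMiB: Int, overheadMiB: Int): Int =
    executorMemoryMiB + overheadMiB

  def main(args: Array[String]): Unit = {
    val executorMemoryMiB = 15 * 1024      // executor memory = 15G
    val configuredOverheadMiB = 8 * 1024   // memoryOverhead = 8GB, as set in the question
    val maxExecutors = 60

    val defaultMiB = defaultOverheadMiB(executorMemoryMiB)
    val perContainerMiB = containerRequestMiB(executorMemoryMiB, configuredOverheadMiB)

    println(s"default overhead for a 15G executor: $defaultMiB MiB")
    println(s"container request with 8GB overhead: $perContainerMiB MiB")
    println(s"total across $maxExecutors executors: ${perContainerMiB.toLong * maxExecutors} MiB")
  }
}
```

Under those assumed constants, the default for a 15G executor would be about 1.5GB, so the 8GB you had to configure is several times what the formula would pick on its own.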
Per recent Spark docs, you can't actually set the heap size that way. You need to use `spark.executor.memory` to do so.
https://spark.apache.org/docs/2.1.1/configuration.html#runtime-environment
Just an FYI, Spark 2.1.1 doesn't allow setting the heap space in `extraJavaOptions`:
java.lang.Exception: spark.executor.extraJavaOptions is not allowed to specify max heap memory settings (was ''-Xmx20g''). Use spark.executor.memory instead.
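For reference, a rough sketch of how the settings could be split so that this exception is avoided: heap size goes in `spark.executor.memory`, the overhead in `spark.yarn.executor.memoryOverhead` (in megabytes), and `extraJavaOptions` carries only non-heap JVM flags. The values mirror the question's configuration; the G1 flag is just a placeholder for "some other JVM option", not a recommendation. It assumes spark-core is on the classpath.

```scala
import org.apache.spark.SparkConf

object ExecutorConfSketch {
  // Builds a configuration that keeps heap sizing out of extraJavaOptions.
  def build(): SparkConf =
    new SparkConf()
      .set("spark.executor.memory", "15g")                   // heap size belongs here, not in -Xmx
      .set("spark.executor.cores", "5")
      .set("spark.yarn.executor.memoryOverhead", "8192")     // megabytes
      .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC") // placeholder non-heap flag
}
```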