Databricks Community

No other output is available, not even output from cells that did run successfully.

Also, I'm unable to connect to the Spark UI or view the logs. It makes an attempt to load each of them, but after some time an error message appears saying it's unable to load.

This happens on a job that runs every Sunday. I've tried varying the cluster configuration (spot vs. on-demand, number of instances, instance types, etc.) and nothing seems to fix it. The job runs two notebooks via the dbutils.notebook.run utility, and I am able to run each notebook independently as its own job; it only fails when they are run together.
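For context, the orchestration is essentially just two sequential dbutils.notebook.run calls. A minimal sketch of what the parent notebook looks like is below; the child notebook paths and the timeout are hypothetical placeholders, not the real ones:

# Parent notebook: runs each child notebook as an ephemeral job, one after the other.
# dbutils.notebook.run(path, timeout_seconds, arguments) returns whatever the child
# passes to dbutils.notebook.exit(), and raises if the child fails or times out.
result_a = dbutils.notebook.run("/Jobs/notebook_a", 7200, {})
result_b = dbutils.notebook.run("/Jobs/notebook_b", 7200, {})
print(result_a, result_b)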

Any suggestions for figuring out what's going on? At this point, I'm thinking of breaking this up into two jobs and trying to stagger them far enough apart that the first is sure to finish before the second starts.

I don't have proof of this, but I suspect this is just a memory issue where Spark either hangs (presumably stuck in GC) or is killed by the OS. I watched the logs as the job progressed and noticed that GC cycles were happening more frequently as it approached the point where the job typically has died or hung. I re-ran the job using a larger instance size and it zipped right past where it had died/hung in the past. Of course, the real test will be running this notebook as part of the larger job it typically runs with.

I'm not sure if this is related, but I ran this job yesterday and the cell where it usually fails on the weekly run never completed this time. The job typically takes about 2 hours, but I noticed it was still running at the 14-hour mark. I canceled the job before looking at the Spark UI/logs, and now that the job has finished in a failed state, I am unable to load the Spark UI or view the logs; I get the same error message described above.

It seems like there's something about how this job fails that circumvents Databricks' ability to restore the logs or the UI. I remember something like this happening in the past, and it was related to a job outputting UTF-8 characters; I think Databricks fixed that issue. This job should not do that, as it's a counting job and all text is pre-sanitized to contain only ASCII or numeric IDs.


I've been unable to find any background on this issue. After digging into the Spark logs, I've also found a reference to a GC issue. More specifically:

java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.Arrays.copyOf(Arrays.java:3332)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
    at java.lang.StringBuilder.append(StringBuilder.java:136)
    ...

I should note this is a simple object declaration. No data is being processed by the culprit cell.
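Since the error points at the driver heap, one thing worth checking is how much driver memory the cluster actually has and how full the heap is when that cell runs. A hedged sketch is below; it assumes the usual Databricks notebook globals (spark, sc), and note that sc._jvm is an internal handle, so treat the heap numbers as best-effort:

# Driver heap size the cluster was launched with (set in the cluster's Spark
# config via spark.driver.memory; it cannot be changed from a running notebook).
print(sc.getConf().get("spark.driver.memory", "not explicitly set"))

# Rough view of current JVM heap usage on the driver, via the JVM Runtime
# exposed through py4j.
runtime = sc._jvm.java.lang.Runtime.getRuntime()
used_mb = (runtime.totalMemory() - runtime.freeMemory()) / (1024 * 1024)
max_mb = runtime.maxMemory() / (1024 * 1024)
print("driver heap used ~%d MB of %d MB" % (used_mb, max_mb))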

Most of the time this is an out-of-memory condition on the driver node. Check the driver log and the worker/executor logs in the Spark UI.

Also check whether you are collecting a huge amount of data onto the driver node, e.g. with collect(). A rough sketch of the distinction is below.
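To illustrate the point above, here is a short sketch of what pulls everything back to the driver versus what keeps the work distributed; df and the table name are just placeholders:

rows = df.collect()                     # risky: every row is materialized on the driver
row_count = df.count()                  # safer: only a single number returns to the driver
preview = df.limit(100).collect()       # safer: at most 100 rows come back
df.write.mode("overwrite").saveAsTable("tmp.result")   # stays distributed end to end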
