Running a Hortonworks Hadoop cluster (HDP-3.1.0.0) and getting a bunch of

Failed on local exception: java.io.IOException: Too many open files

errors when running Spark jobs that up until this point have worked fine.

I have seen many other questions like this where the answer is to increase the ulimit settings for open files and processes (this is also in the HDP docs), and I'll note that I believe mine are still at the system defaults, but...

My question is: why is this only happening now, when the Spark jobs have been running fine for months?

The Spark jobs have been running for months without incident and I have made no recent code changes. I don't know enough about the internals of Spark to theorize about why things would go wrong only now (it would seem odd to me if open file descriptors simply accumulate in the course of running Spark, but that seems to be what is happening).
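One way I figure I could check whether descriptors really do build up is to watch the file-descriptor count of a long-lived daemon on the node named in the error, e.g. the YARN NodeManager (a sketch; the pgrep pattern below assumes a standard NodeManager process and is not taken from my logs):

# find the NodeManager PID on the affected node (e.g. hw005)
NM_PID=$(pgrep -f org.apache.hadoop.yarn.server.nodemanager.NodeManager)

# count its open file descriptors now...
ls /proc/${NM_PID}/fd | wc -l

# ...or watch the count every 60 seconds to see whether it keeps climbing
watch -n 60 "ls /proc/${NM_PID}/fd | wc -l"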

Just as an example, this code...

from pyspark.sql import SparkSession

sparkSession = SparkSession.builder.appName("GET_TABLE_COUNT").getOrCreate()
# number of executors (plus driver) currently registered with the driver
sparkSession._jsc.sc().getExecutorMemoryStatus().keySet().size()

now generates errors like...

[2020-05-12 19:04:45,810] {bash_operator.py:128} INFO - 20/05/12 19:04:45 INFO Client: Application report for application_1579648183118_19918 (state: ACCEPTED)
[2020-05-12 19:04:46,813] {bash_operator.py:128} INFO - 20/05/12 19:04:46 INFO Client: Application report for application_1579648183118_19918 (state: ACCEPTED)
[2020-05-12 19:04:47,816] {bash_operator.py:128} INFO - 20/05/12 19:04:47 INFO Client: Application report for application_1579648183118_19918 (state: ACCEPTED)
[2020-05-12 19:04:48,818] {bash_operator.py:128} INFO - 20/05/12 19:04:48 INFO Client: Application report for application_1579648183118_19918 (state: ACCEPTED)
[2020-05-12 19:04:49,820] {bash_operator.py:128} INFO - 20/05/12 19:04:49 INFO Client: Application report for application_1579648183118_19918 (state: ACCEPTED)
[2020-05-12 19:04:50,822] {bash_operator.py:128} INFO - 20/05/12 19:04:50 INFO Client: Application report for application_1579648183118_19918 (state: ACCEPTED)
[2020-05-12 19:04:51,828] {bash_operator.py:128} INFO - 20/05/12 19:04:51 INFO Client: Application report for application_1579648183118_19918 (state: FAILED)
[2020-05-12 19:04:51,829] {bash_operator.py:128} INFO - 20/05/12 19:04:51 INFO Client:
[2020-05-12 19:04:51,829] {bash_operator.py:128} INFO -      client token: N/A
[2020-05-12 19:04:51,829] {bash_operator.py:128} INFO -      diagnostics: Application application_1579648183118_19918 failed 2 times due to Error launching appattempt_1579648183118_19918_000002. Got exception: java.io.IOException: DestHost:destPort hw005.co.local:45454 , LocalHost:localPort hw001.co.local/172.18.4.46:0. Failed on local exception: java.io.IOException: Too many open files
[2020-05-12 19:04:51,829] {bash_operator.py:128} INFO - at sun.reflect.GeneratedConstructorAccessor808.newInstance(Unknown Source)
[2020-05-12 19:04:51,829] {bash_operator.py:128} INFO - at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

My RAM and ulimit settings on the cluster look like...

[root@HW001]# clush -ab free -h
---------------HW001---------------
              total        used        free      shared  buff/cache   available
Mem:            31G        9.0G        1.1G        1.7G         21G         19G
Swap:          8.5G         44K        8.5G
---------------HW002---------------
              total        used        free      shared  buff/cache   available
Mem:            31G        7.3G        5.6G        568M         18G         22G
Swap:          8.5G        308K        8.5G
---------------HW003---------------
              total        used        free      shared  buff/cache   available
Mem:            31G        6.1G        4.0G        120M         21G         24G
Swap:          8.5G        200K        8.5G
---------------HW004---------------
              total        used        free      shared  buff/cache   available
Mem:            31G        2.9G        2.8G        120M         25G         27G
Swap:          8.5G         28K        8.5G
---------------HW005---------------
              total        used        free      shared  buff/cache   available
Mem:            31G        2.9G        4.6G        120M         23G         27G
Swap:          8.5G         20K        8.5G
---------------airflowetl---------------
              total        used        free      shared  buff/cache   available
Mem:            46G        5.3G         13G        2.4G         28G         38G
Swap:          8.5G        124K        8.5G
[root@HW001]#
[root@HW001]# clush -ab ulimit -a
---------------HW[001-005] (5)---------------
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 127886
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 127886
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
---------------airflowetl---------------
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 192394
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 192394
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

Don't know much about Hadoop admin, but just looking at the Ambari dashboard, the cluster does not seem to be overly taxed...

(screenshot: Ambari dashboard showing cluster metrics)

(though I could not actually check the RM web UI, since it just throws a "too many open files" error).

Anyone with more spark/hadoop experience know why this would be happening now?

Your problem may be caused by several things:

  1. First, I think 1024 is not enough; you should increase it (see the sketch after this list).
  2. Open files may be increasing day after day (an application may stream more data from/into split files).
  3. A Spark application may also be importing/opening more libraries today than it did before.
  4. And so on.
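For reference, raising the open-file limit on the cluster nodes is typically done through /etc/security/limits.conf or a drop-in file under /etc/security/limits.d/. A minimal sketch follows, assuming typical HDP service accounts (yarn, hdfs); the file name and the 65536 value are only illustrations, not values from this thread:

# append example limits for the service users (illustrative file name and values)
cat >> /etc/security/limits.d/90-hadoop-limits.conf <<'EOF'
yarn  -  nofile  65536
yarn  -  nproc   65536
hdfs  -  nofile  65536
EOF

# verify for a given user after a fresh login
su - yarn -c 'ulimit -n'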

Please check the files opened by the user that runs the Spark jobs to find the possible cause:

lsof -u myUser           # list open files for that user
lsof -u myUser | wc -l   # or just count them

Also check lsof against specific directories (lsof +D <directory>), and find out how many open files each job has, how many jobs are running, etc.
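To see which processes hold the most descriptors, a quick per-PID tally helps (a sketch; myUser stands for whatever account actually runs the Spark/YARN processes):

# count open files per PID for that user, largest first (PID is column 2 of lsof output)
lsof -u myUser | awk 'NR>1 {count[$2]++} END {for (p in count) print count[p], p}' | sort -rn | head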

Hello @rvillanueva ,

You can check how many threads are used by a user by running ps -L -u <username> | wc -l

If the user's limits are hit (check open files with ulimit -n and processes/threads with ulimit -u, run as that user), then the user can't spawn any more threads or open more files; a concrete example follows the list below. The most likely reasons in this case are:

  1. The same user is running other jobs and has open files on the node where it tries to launch/spawn the container.
  2. System threads might have been excluded from the count.
  3. Check which applications are running and what their current open file counts are.
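For example (a sketch; yarn stands in for whatever user actually runs the jobs):

# thread count for that user (one header line included)
ps -L -u yarn | wc -l

# that user's own process and open-file limits, for comparison
su - yarn -c 'ulimit -u -n'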

Kindly check the application log (application_XXX), if available, and see in which phase it throws the exception and on which node the issue occurs.
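If log aggregation is enabled, the aggregated log for the failed application can be pulled with the YARN CLI, for example with the application id from the output above:

# fetch the aggregated application logs and search around the exception
yarn logs -applicationId application_1579648183118_19918 | grep -B 2 -A 10 "Too many open files"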
