Running a Hortonworks Hadoop cluster (HDP-3.1.0.0) and getting a bunch of
Failed on local exception: java.io.IOException: Too many open files
errors when running Spark jobs that, up until this point, have worked fine.
I have seen many other questions like this where the answer is to increase the ulimit settings for open files and processes (this is also in the HDP docs), and I'll note that I believe mine are still at the system default settings, but...
My question is: why is this only happening now, when the Spark jobs have previously been running fine for months?
The jobs have run for months without incident and I have made no recent code changes. I don't know enough about the internals of Spark to theorize about why things could be going wrong only now (it would seem odd to me if open files just build up over the course of running Spark, but that seems to be what is happening).
Just as an example, this code...

from pyspark.sql import SparkSession

sparkSession = SparkSession.builder.appName("GET_TABLE_COUNT").getOrCreate()
sparkSession._jsc.sc().getExecutorMemoryStatus().keySet().size()
now generates errors like...
[2020-05-12 19:04:45,810] {bash_operator.py:128} INFO - 20/05/12 19:04:45 INFO Client: Application report for application_1579648183118_19918 (state: ACCEPTED)
[2020-05-12 19:04:46,813] {bash_operator.py:128} INFO - 20/05/12 19:04:46 INFO Client: Application report for application_1579648183118_19918 (state: ACCEPTED)
[2020-05-12 19:04:47,816] {bash_operator.py:128} INFO - 20/05/12 19:04:47 INFO Client: Application report for application_1579648183118_19918 (state: ACCEPTED)
[2020-05-12 19:04:48,818] {bash_operator.py:128} INFO - 20/05/12 19:04:48 INFO Client: Application report for application_1579648183118_19918 (state: ACCEPTED)
[2020-05-12 19:04:49,820] {bash_operator.py:128} INFO - 20/05/12 19:04:49 INFO Client: Application report for application_1579648183118_19918 (state: ACCEPTED)
[2020-05-12 19:04:50,822] {bash_operator.py:128} INFO - 20/05/12 19:04:50 INFO Client: Application report for application_1579648183118_19918 (state: ACCEPTED)
[2020-05-12 19:04:51,828] {bash_operator.py:128} INFO - 20/05/12 19:04:51 INFO Client: Application report for application_1579648183118_19918 (state: FAILED)
[2020-05-12 19:04:51,829] {bash_operator.py:128} INFO - 20/05/12 19:04:51 INFO Client:
[2020-05-12 19:04:51,829] {bash_operator.py:128} INFO - client token: N/A
[2020-05-12 19:04:51,829] {bash_operator.py:128} INFO - diagnostics: Application application_1579648183118_19918 failed 2 times due to Error launching appattempt_1579648183118_19918_000002. Got exception: java.io.IOException: DestHost:destPort hw005.co.local:45454 , LocalHost:localPort hw001.co.local/172.18.4.46:0. Failed on local exception: java.io.IOException: Too many open files
[2020-05-12 19:04:51,829] {bash_operator.py:128} INFO - at sun.reflect.GeneratedConstructorAccessor808.newInstance(Unknown Source)
[2020-05-12 19:04:51,829] {bash_operator.py:128} INFO - at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
My RAM and ulimit settings on the cluster look like...
[root@HW001]# clush -ab free -h
---------------HW001---------------
       total   used   free  shared  buff/cache  available
Mem:     31G   9.0G   1.1G    1.7G         21G        19G
Swap:   8.5G    44K   8.5G
---------------HW002---------------
       total   used   free  shared  buff/cache  available
Mem:     31G   7.3G   5.6G    568M         18G        22G
Swap:   8.5G   308K   8.5G
---------------HW003---------------
       total   used   free  shared  buff/cache  available
Mem:     31G   6.1G   4.0G    120M         21G        24G
Swap:   8.5G   200K   8.5G
---------------HW004---------------
       total   used   free  shared  buff/cache  available
Mem:     31G   2.9G   2.8G    120M         25G        27G
Swap:   8.5G    28K   8.5G
---------------HW005---------------
       total   used   free  shared  buff/cache  available
Mem:     31G   2.9G   4.6G    120M         23G        27G
Swap:   8.5G    20K   8.5G
---------------airflowetl---------------
       total   used   free  shared  buff/cache  available
Mem:     46G   5.3G    13G    2.4G         28G        38G
Swap:   8.5G   124K   8.5G
[root@HW001]#
[root@HW001]# clush -ab ulimit -a
---------------HW[001-005] (5)---------------
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 127886
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 127886
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
---------------airflowetl---------------
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 192394
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 192394
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
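(Side note: the numbers above are from a fresh root shell; I assume what really matters is the limit the running YARN/HDFS daemons were started with, which could differ. A rough way to check that on a worker node, assuming a single NodeManager process per host, would be:)

# find the NodeManager PID and read its effective open-files limit
NM_PID=$(pgrep -f org.apache.hadoop.yarn.server.nodemanager.NodeManager | head -n 1)
grep 'open files' /proc/$NM_PID/limits
# count the file descriptors that process currently holds
ls /proc/$NM_PID/fd | wc -l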
I don't know much about Hadoop admin, but just looking at the Ambari dashboard, the cluster does not seem to be overly taxed (though I could not actually check the RM web UI, since it just throws a "too many open files" error).
Does anyone with more Spark/Hadoop experience know why this would be happening now?
Your problem may be caused by several reasons:
- Firstly, I think 1024 is not enough; you should increase it (see the sketch after this list).
- Open files may be increasing day after day (an application may stream more data from/into split files).
- A Spark application may also import/open more libraries today than it did before.
- etc.
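As a sketch, raising it could look something like this (the file path, user names and values are only an illustration based on common defaults, not an official recommendation):

# /etc/security/limits.d/90-hadoop.conf  -- applied on every cluster node
yarn  -  nofile  32768
yarn  -  nproc   65536
hdfs  -  nofile  32768
hdfs  -  nproc   65536
# restart the services (new limits only apply to new sessions), then verify:
# su - yarn -c 'ulimit -n'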
Please check the files opened by the user that runs the Spark jobs to find the possible cause:
lsof -u myUser
(pipe it through | wc -l for a count)
Also check lsof +D <directory>, and find out how many files are open per job, how many jobs are running, etc.
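For example (myUser is a placeholder for the account that submits the Spark jobs, and the directory is only an assumption based on HDP's default YARN local dirs):

# total open files for the submitting user
lsof -u myUser | wc -l
# open-file count per PID for that user, highest first
lsof -u myUser | awk '{print $2}' | sort | uniq -c | sort -rn | head
# open files under the YARN local dirs
lsof +D /hadoop/yarn/local | wc -l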
Hello @rvillanueva,

You can check how many threads are used by a user by running ps -L -u <username> | wc -l. If that user's open files limit (ulimit -n for the user) is hit, the user can't spawn any more threads. The most likely reasons in this case are:
- The same user is running other jobs and holding open files on the node where it tries to launch/spawn the container.
- System threads might have been excluded from the count.
- Check which applications are running and how many files each currently has open.
Kindly check the application log (application_XXX), if available, and see in which phase it throws the exception and on which node the issue occurs (see the sketch below).
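A rough sketch of those checks (myUser is a placeholder for the submitting user; the application ID is the failed one from the log above):

# threads currently used by the job-submitting user
ps -L -u myUser | wc -l
# that user's own limits
su - myUser -c 'ulimit -n; ulimit -u'
# pull the aggregated YARN application log (requires log aggregation) and search it
yarn logs -applicationId application_1579648183118_19918 > app_19918.log
grep -n "Too many open files" app_19918.log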