VPC errors during Amazon EMR cluster operations
The following errors are common to VPC configuration in Amazon EMR.
Topics
Invalid subnet configuration
On the Cluster Details page, in the Status field, you see an error similar to the following:
The subnet configuration was invalid: Cannot find route to InternetGateway in main RouteTable
rtb-id
for vpc
vpc-id
.
To solve this problem, you must create an Internet Gateway and attach it to your VPC. For more information, see Adding an internet gateway to your VPC .
Alternatively, verify that you have configured your VPC with Enable DNS resolution and Enable DNS hostname support enabled. For more information, see Using DNS with your VPC .
Missing DHCP Options Set
You see a step failure in the cluster system log (syslog) with an error similar to the following:
ERROR org.apache.hadoop.security.UserGroupInformation (main): PriviledgedActionException as:hadoop (auth:SIMPLE) cause:java.io.IOException: org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id '
application_id
' doesn't exist in RM.
or
ERROR org.apache.hadoop.streaming.StreamJob (main): Error Launching job : org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id '
application_id
' doesn't exist in RM.
To solve this problem, you must configure a VPC that includes a DHCP Options Set whose parameters are set to the following values:
Note
If you use the AWS GovCloud (US-West) region, set domain-name to
us-gov-west-1.compute.internal
instead of the value
used in the following example.
domain-name
=
ec2.internal
Use
ec2.internal
if your region is US East (N. Virginia). For other regions, use
region-name
.compute.internal
. For example in us-west-2, use
domain-name
=
us-west-2.compute.internal
.
domain-name-servers
=
AmazonProvidedDNS
For more information, see DHCP Options Sets .
Permissions errors
A failure in the
stderr
log for a step indicates that an Amazon S3
resource does not have the appropriate permissions. This is a 403 error and the error
looks like:
Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID:
REQUEST_ID
If the ActionOnFailure is set to
TERMINATE_JOB_FLOW
, then this
would result in the cluster terminating with the state,
SHUTDOWN_COMPLETED_WITH_ERRORS
.
A few ways to troubleshoot this problem include:
If you are using an Amazon S3 bucket policy within a VPC, make sure to give access to all buckets by creating a VPC endpoint and selecting Allow all under the Policy option when creating the endpoint.
Make sure that any policies associated with S3 resources include the VPC in which you launch the cluster.
Try running the following command from your cluster to verify you can access the bucket
hadoop fs -copyToLocal s3://
path-to-bucket
/tmp/
You can get more specific debugging information by setting the
log4j.logger.org.apache.http.wire
parameter to
DEBUG
in
/home/hadoop/conf/log4j.properties
file on the cluster.
You can check the
stderr
log file after trying to access the
bucket from the cluster. The log file will provide more detailed
information:
Access denied for getting the prefix for bucket - us-west-2.elasticmapreduce with path samples/wordcount/input/ 15/03/25 23:46:20 DEBUG http.wire: >> "GET /?prefix=samples%2Fwordcount%2Finput%2F&delimiter=%2F&max-keys=1 HTTP/1.1[\r][\n]" 15/03/25 23:46:20 DEBUG http.wire: >> "Host: us-west-2.elasticmapreduce.s3.amazonaws.com[\r][\n]"
Errors that result in
START_FAILED
Before AMI 3.7.0, for VPCs where a hostname is specified, Amazon EMR maps the internal hostnames of the subnet with custom domain addresses as follows:
ip-
.
For example, if the hostname was
X.X.X.X.customdomain.com
.tld
ip-10.0.0.10
and the VPC has the
domain name option set to customdomain.com, the resulting hostname mapped by Amazon EMR would
be
ip-10.0.1.0.customdomain.com
. An entry is added in
/etc/hosts
to resolve the hostname to 10.0.0.10. This behavior
is changed with AMI 3.7.0 and now Amazon EMR honors the DHCP configuration of the VPC
entirely. Previously, customers could also use a bootstrap action to specify a hostname
mapping.
If you would like to preserve this behavior, you must provide the DNS and forward resolution setup you require for the custom domain.
Cluster
Terminated with
errors
and NameNode fails to start
When launching an EMR cluster in a VPC which makes use of a custom DNS domain name, your cluster may fail with the following error message in the console:
Terminated with errors On the master instance(
instance-id
), bootstrap action 1 returned a non-zero return code
The failure is a result of the NameNode not being able to start up. This will result
in the following error found in the NameNode logs, whose Amazon S3 URI is of the form:
s3://
:
mybucket
/
logs
/
cluster-id
/daemons/
master instance-id
/hadoop-hadoop-namenode-
master node hostname
.log.gz
2015-07-23 20:17:06,266 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem (main): Encountered exception loading fsimage java.io.IOException: NameNode is not formatted. org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:212) org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1020) org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:739) org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:537) org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:596) at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:765) org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:749) org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1441) org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1507)
This is due to a potential issue where an EC2 instance can have multiple sets of fully qualified domain names when launching EMR clusters in a VPC, which makes use of both an