How do I resolve the "Unable to infer schema" exception in AWS Glue?
My AWS Glue job fails with one of the following exceptions:
"AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'" "AnalysisException: u'Unable to infer schema for ORC. It must be specified manually.;'"
Short description
This error usually happens when AWS Glue tries to read a Parquet or ORC file that isn't stored in an Apache Hive-style partitioned path with the key=value structure. AWS Glue expects the Amazon Simple Storage Service (Amazon S3) source file paths to use key=value pairs. For example, if the AWS Glue job processes files at s3://s3-bucket/parquet-data/, the files must use the following partition structure:
s3://s3-bucket/parquet-data/year=2018/month=10/day=10/file1.parquet
If your Parquet or ORC files are stored in a hierarchical structure without key=value pairs, the AWS Glue job fails with the "Unable to infer schema" exception. For example:
s3://s3-bucket/parquet-data/year/month/day/file1.parquet
Resolution
Use one of the following methods to resolve the error.
Restructure your data
Copy the files into a new S3 bucket and use Hive-style partitioned paths. Run the job again.
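The following is a minimal sketch of one way to script the copy with boto3. The bucket names, prefix, and path depth are assumptions; adjust them to match your data. For very large objects or datasets, consider the AWS CLI or S3 Batch Operations instead.

import boto3

s3 = boto3.client("s3")

SRC_BUCKET = "s3-bucket"              # assumed source bucket
DST_BUCKET = "s3-bucket-partitioned"  # assumed destination bucket
PREFIX = "parquet-data/"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SRC_BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]              # for example: parquet-data/2018/10/10/file1.parquet
        parts = key.split("/")
        if len(parts) != 5:
            continue                  # skip keys that don't match the expected depth
        _, year, month, day, filename = parts
        # Rewrite the key in Hive-style key=value form.
        new_key = f"parquet-data/year={year}/month={month}/day={day}/{filename}"
        s3.copy_object(
            Bucket=DST_BUCKET,
            Key=new_key,
            CopySource={"Bucket": SRC_BUCKET, "Key": key},
        )

After the copy completes, point the AWS Glue job at the new bucket and run it again.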
Replace partition column names with asterisks
If restructuring your data isn't feasible, create the DynamicFrame directly from Amazon S3. In the Amazon S3 path, replace all partition column names with asterisks (*). When you use this method, AWS Glue includes only the data in the DynamicFrame, not the partition columns.
For example, assume that your files are stored in an S3 bucket with the following partition structure:
s3://s3-bucket/parquet-data/year/month/day/files.parquet
To process all files in the s3://s3-bucket/parquet-data/ path, create the DynamicFrame:
dynamic_frame0 = glueContext.create_dynamic_frame_from_options(
    "s3",
    connection_options={"paths": ["s3://s3-bucket/parquet-data/*/*/*"]},
    format="parquet",
    transformation_ctx="dynamic_frame0",
)
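To confirm that AWS Glue can now infer the schema, you can print the schema and record count as a quick check. This verification step is optional and not part of the original job logic.

# Optional check: the schema should list the data columns (but not the partition columns).
dynamic_frame0.printSchema()
print(dynamic_frame0.count())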
Use a map transformation to add partition columns
To include the partition columns in the DynamicFrame, first create a DataFrame and add a column that contains the Amazon S3 file path of each record. Then, create the DynamicFrame and apply a map transformation that parses the file path into partition columns, as shown in the following example. Before you use the sample code, replace the Amazon S3 paths, and set your partition column names with the index values that match your path structure.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import *

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read the files from the hierarchical path into a DataFrame,
# and record each row's source file path in a new column.
df = spark.read.parquet("s3://s3-bucket/parquet-data/*/*/*")
modified_df = df.withColumn('partitions_column', input_file_name())
dyf_0 = DynamicFrame.fromDF(modified_df, glueContext, "dyf_0")

def modify_col(x):
    # Split the file path (for example, s3://s3-bucket/parquet-data/2018/10/10/file1.parquet)
    # and map the path segments to partition columns.
    # Adjust the column names and index values to match your path structure.
    if x['partitions_column']:
        new_columns = x['partitions_column'].split('/')
        x['year'], x['month'], x['day'] = new_columns[4], new_columns[5], new_columns[6]
        del x['partitions_column']
    return x

# Apply the map transformation to add the partition columns to the DynamicFrame.
mapped_dyf = Map.apply(frame=dyf_0, f=modify_col)
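If you want to persist the result, you can write the mapped DynamicFrame back to Amazon S3 and commit the job. This is a minimal sketch; the output path is an assumption, so replace it with your own location.

# Assumed output location; replace with your own S3 path.
glueContext.write_dynamic_frame.from_options(
    frame=mapped_dyf,
    connection_type="s3",
    connection_options={"path": "s3://s3-bucket/parquet-data-output/"},
    format="parquet",
)
job.commit()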