from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import col, coalesce, lit

# Define an explicit schema for the nested objects so "manager" is always
# read as a struct, even when every record in the file has "manager": null
schema = StructType([
    StructField("id", StringType(), True),
    StructField("positionData", StructType([
        StructField("manager", StructType([
            StructField("id", StringType(), True),
            StructField("employeeNumber", StringType(), True)
        ]), True)
    ]), True)
])

# Read the JSON data with the multiline option and the explicit schema
df = spark.read.option("multiline", "true").json("Files/testing.json", schema=schema)

# With the schema applied, a null manager resolves to a null id instead of an error
df = df.withColumn("ManagerId", coalesce(col("positionData.manager.id"), lit(None)))
display(df)
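Why this should work: when you supply the schema yourself, Spark never infers "manager" as a string, so the path positionData.manager.id always resolves; for Case 2 it simply evaluates to null. A quick check (the output shown is what I would expect, not verified against your exact files):

df.select("id", "ManagerId").show(truncate=False)
# Case 1 row -> ManagerId = 00000002-0000-0000-0000-000000000000
# Case 2 row -> ManagerId = null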
Please try this and let me know if you have further queries.
I had to strip out some of the other stuff that is not relevant to the topic.
Case 1: the positionData.manager object has id and employeeNumber fields that I need to grab:
{
  "id": "00000001-0000-0000-0000-000000000000",
  "positionData": {
    "manager": {
      "id": "00000002-0000-0000-0000-000000000000",
      "employeeNumber": "1234"
    }
  }
}
Case 2: the manager is simply null:
{
  "id": "00000001-0000-0000-0000-000000000000",
  "positionData": {
    "manager": null
  }
}
In Case 2, we cannot navigate down to col("positionData.manager.id") because it doesn't exist. Hence the error. This seems to happen regardless of the function used (when/otherwise or coalesce).
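A minimal illustration of what I believe is happening (the printSchema output below is my assumption of what inference produces, not copied from a run): without an explicit schema, Spark infers the type of "manager" from the data it samples. If every record it sees has "manager": null, the field is inferred as a plain string, and col("positionData.manager.id") then fails when the query is analyzed, before any row is evaluated.

df = spark.read.option("multiline", "true").json("Files/testing.json")
df.printSchema()
# Expected when only Case 2 records are present:
# root
#  |-- id: string (nullable = true)
#  |-- positionData: struct (nullable = true)
#  |    |-- manager: string (nullable = true)   <-- inferred as string, not struct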
I have no control over the incoming JSON structure.
Any suggestion would be appreciated.
Thanks in advance.
@Anonymous Unfortunately, the COALESCE function does the same thing as the WHEN / OTHERWISE function: it evaluates all of the paths offered, even though it is only going to take ONE of those paths. In my case, one of the paths results in an error as shown above, and even though the logic of the function will never return that element, it still needs to evaluate it.
A bit of 'air code' (untested pseudo-code) as an example:
COALESCE ( NULL, "some string not null", 1/0 )
This will error out on the one-divided-by-zero path even though the logic is to return "some string not null".
Any other suggestions?
While I have not tried it, I don't think it is going to work as-is:
>>In this example, coalesce will first try to access the value of the column "positionData.manager.id". If it's null, it will return None instead.<<
The issue is NOT that there is a NULL value in the column "positionData.manager.id"; it is that the column does not exist and cannot be found. If it exists at all, it is part of some nested JSON structure.
If I have data like the Case 2 sample from the original post
and I try to reference col("positionData.manager.id") then I get this error:
AnalysisException: [INVALID_EXTRACT_BASE_FIELD_TYPE] Can't extract a value from "manager". Need a complex type [STRUCT, ARRAY, MAP] but got "STRING".
It might be possible to use nested COALESCE statements and / or WHEN/OTHERWISE functions.
I'm going to have to experiment. Thanks for the tip.
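One way to sidestep the expression-level branching entirely (a sketch, untested; the helper nested_field_exists is hypothetical, not a built-in) is to inspect the schema Spark actually produced and only reference the nested path when it exists as a struct field:

from pyspark.sql.types import StructType
from pyspark.sql.functions import col, lit

def nested_field_exists(schema, dotted_path):
    """Return True if every hop in the dotted path is a field of a struct."""
    current = schema
    for part in dotted_path.split("."):
        if not isinstance(current, StructType) or part not in current.fieldNames():
            return False
        current = current[part].dataType
    return True

if nested_field_exists(df.schema, "positionData.manager.id"):
    df = df.withColumn("ManagerId", col("positionData.manager.id"))
else:
    # Path was never inferred as a struct, so just emit a null column
    df = df.withColumn("ManagerId", lit(None).cast("string"))

Because the check happens in Python against df.schema rather than inside a Spark expression, nothing ever has to resolve the invalid column path, so the AnalysisException never fires.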
Hi @ToddChitt,
Thanks for using Fabric Community.
PySpark provides a function called coalesce that allows you to specify a sequence of columns. The first non-null value in the sequence is returned. You can use this function to handle null values in your when expression like this:
from pyspark.sql.functions import col, coalesce, lit

df = df.withColumn("ManagerId", coalesce(col("positionData.manager.id"), lit(None)))
In this example, coalesce will first try to access the value of the column "positionData.manager.id". If it's null, it will return None instead.
Can you please check this - pyspark.sql.functions.coalesce — PySpark 3.1.1 documentation (apache.org)
Hope this is helpful. Please let me know in case of further queries.