Hello all. I am trying to learn PySpark from this website:
Good info, but I am stuck. I borrowed the simple JSON code that looks like this:
{ "RecordNumber": 2, "Zipcode": 704 },
{ "RecordNumber": 10, "Zipcode": 709 }
]
And I can read that in a data frame. But unfortunately, my data has an array name at the top, like this:
{ "data": [
{ "RecordNumber": 2, "Zipcode": 704 },
{ "RecordNumber": 10, "Zipcode": 709 }
I read these two documents into two data frames, then run a select on each in PySpark:
from pyspark.sql.functions import col
dfJSON1 = df1.select(col("RecordNumber"), col("Zipcode"))
dfJSON2 = df2.select(col("data.RecordNumber"), col("data.Zipcode"))
dfJSON1.show()
dfJSON2.show()
The two results differ: the first data frame shows two records, as expected, but the second shows only a single row, with each column holding an array of values.
What am I missing to get the second data frame to show two records, like the first? This can't be that hard.
Thanks in advance.
Hi @ToddChitt,
Wouldn't it be possible to use a couple of SQL functions like explode and col for this?
I found that suggested approach in this blog:
https://medium.com/towards-data-engineering/transforming-json-to-lakehouse-tables-with-microsoft-fab...
Below is an example, based on your JSON code, from one of my test notebooks.
# Apply transformation to the dataframe
from pyspark.sql.functions import col, explode

# Flatten the "data" array: one row per array element
exploded_df = df.select(explode(col("data")).alias("data"))

# Promote the struct fields to top-level columns
tf_df = exploded_df.select(
    col("data.RecordNumber").alias("RecordNumber"),
    col("data.Zipcode").alias("Zipcode")
)
display(tf_df)

dfJSON1 = tf_df.select(col("RecordNumber"), col("Zipcode"))
dfJSON1.show()
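To see why the explode is needed: your second file is a single JSON object whose "data" field holds the whole array, so selecting data.RecordNumber returns one row of arrays instead of one row per element. In plain-Python terms (just an illustration with the standard json module, not Spark code), the difference looks roughly like this:

```python
import json

# The same shape as the second file: one object wrapping an array
doc = json.loads(
    '{ "data": [ { "RecordNumber": 2, "Zipcode": 704 }, '
    '{ "RecordNumber": 10, "Zipcode": 709 } ] }'
)

# Without exploding: a single "row" whose columns are whole arrays
single_row = {
    "RecordNumber": [r["RecordNumber"] for r in doc["data"]],
    "Zipcode": [r["Zipcode"] for r in doc["data"]],
}

# Exploding the array (what explode(col("data")) does): one row per element
rows = [(r["RecordNumber"], r["Zipcode"]) for r in doc["data"]]

print(single_row)  # one row of arrays
print(rows)        # two separate rows
```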
@Expiscornovus Thanks for the quick response.
Your sample code worked great. Now it's up to me to figure out how to shred the multi-level nested arrays in my actual JSON documents.
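My working theory is that the same pattern repeats level by level: explode the outer array, then explode the inner array out of the resulting column. As a plain-Python sketch of that shredding (the "orders"/"items" field names are made up, not my real schema):

```python
import json

# Hypothetical two-level nesting: each order wraps an array of items
doc = json.loads(
    '{ "orders": [ '
    '{ "OrderId": 1, "items": [ { "Sku": "A" }, { "Sku": "B" } ] }, '
    '{ "OrderId": 2, "items": [ { "Sku": "C" } ] } ] }'
)

# Level 1: one record per order (explode(col("orders")) in PySpark)
orders = doc["orders"]

# Level 2: one record per item, carrying the parent OrderId along
# (a second explode on the inner "items" column in PySpark)
rows = [(o["OrderId"], i["Sku"]) for o in orders for i in o["items"]]

print(rows)
```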
I will check out that blog and try to learn a little more about PySpark.
Thanks