![]() |
冷冷的胡萝卜 · SpringBoot学习笔记(八)——JWT ...· 2 周前 · |
![]() |
爱热闹的葫芦 · Column.IsIdentity ...· 2 周前 · |
![]() |
兴奋的玉米 · Mapping Attributes✨ | ...· 2 周前 · |
![]() |
文武双全的小笼包 · 《lua 程序设计》读书笔记(2):数值 ...· 1 周前 · |
![]() |
刀枪不入的马铃薯 · ASP.NET Core 中的配置 | ...· 1 周前 · |
![]() |
帅气的钥匙扣 · 我国国产首台质子治疗系统获批上市----中国 ...· 1 月前 · |
![]() |
眼睛小的野马 · OR新媒体 | ...· 3 月前 · |
![]() |
发财的西装 · Docker 部署 Nginx ...· 5 月前 · |
![]() |
听话的楼房 · 爱上ASMR:一个让你的网站 ...· 9 月前 · |
![]() |
风流的火柴 · Field type: deviceid· 10 月前 · |
In this article, we are going to discuss how to parse a column of json strings into their own separate columns. Here we will parse or read json string present in a csv file and convert it into multiple dataframe columns using Python Pyspark.
For parsing json string we’ll use from_json() SQL function to parse the column containing json string into StructType with the specified schema. If the string is unparseable, it returns null.
The movie_input.csv file contains 15 records containing movie details(title, rating, releaseYear and genre) present in a JSON string. We want to read this file and parse the json string to extract the movie details into their own separate columns title, rating, releaseYear and genre.
This function requires two required parameters:
a column of json strings').getOrCreate()
df
=
spark.read.load(
'movie_input.csv'
, header
=
True
,
format
=
"csv"
)
df.show()
df.printSchema()
Output:
Once we have read the data into a dataframe, now let’s convert the JSON column into multiple columns using from_json(). As mentioned above this function takes the column name with JSON string and the JSON schema as arguments, so let’s create the schema that represents our data.
T.StructField(
'title'
, T.StringType(),
True
),
T.StructField(
'rating'
, T.StringType(),
True
),
T.StructField(
'releaseYear'
, T.StringType(),
True
),
T.StructField(
'genre'
, T.StringType(),
True
)
Now, lets use the from_json() function which returns the Column struct with all the json columns.
import
pyspark.sql.functions as F
mapped_df
=
df.withColumn(
"movie"
, F.from_json(
"movie"
, schema))
mapped_df.show(truncate
=
False
)
mapped_df.printSchema()
Output:
And finally, we explode the json struct to flatten it using the select method. We could have selected on cols movie.title, movie.rating, ..etc. But the better approach is to use * wildcard character which would select all the columns which has movie. prefix.
import
pyspark.sql.functions as F
parsed_df
=
mapped_df.select(F.col(
'id'
), F.col(
"movie.*"
))
parsed_df.show(truncate
=
False
)
parsed_df.printSchema()
For this, we’ll be using to_json() and json_tuple() in addition to the above method.
from
pyspark.sql
import
SparkSession
import
pyspark.sql.functions as F
import
pyspark.sql.types as T
if
__name__
=
=
"__main__"
:
spark
=
SparkSession.builder.appName('Parse a\
column of json strings').getOrCreate()
df
=
spark.createDataFrame(
[
"1"
,
"{'color': 'red', 'value': '#f00'}"
],
[
"2"
,
"{'color': 'green', 'value': '#0f0'}"
],
[
"3"
,
"{'color': 'blue', 'value': '#00f'}"
],
[
"4"
,
"{'color': 'cyan', 'value': '#0ff'}"
],
[
"5"
,
"{'color': 'magenta', 'value': '#f0f'}"
],
[
"6"
,
"{'color': 'yellow', 'value': '#ff0'}"
],
[
"7"
,
"{'color': 'black', 'value': '#000'}"
],
).toDF(
'id'
,
'colors'
)
df.show(truncate
=
False
)
df.printSchema()
df
=
df.withColumn(
"colors"
,
F.from_json(df.colors,
T.MapType(T.StringType(),
T.StringType())))
df.show(truncate
=
False
)
df.printSchema()
df
=
df.withColumn(
"colors"
, F.to_json(df.colors))
df.show(truncate
=
False
)
df.printSchema()
df
=
df.select(
'id'
, F.json_tuple(F.col(
"colors"
),
"color"
,
"value"
)
).toDF(
'id'
,
'color'
,
'value'
)
df.show(truncate
=
False
)
df.printSchema()
Output:
+---+-------------------------------------+ |id |colors | +---+-------------------------------------+ |1 |{'color': 'red', 'value': '#f00'} | |2 |{'color': 'green', 'value': '#0f0'} | |3 |{'color': 'blue', 'value': '#00f'} | |4 |{'color': 'cyan', 'value': '#0ff'} | |5 |{'color': 'magenta', 'value': '#f0f'}| |6 |{'color': 'yellow', 'value': '#ff0'} | |7 |{'color': 'black', 'value': '#000'} | +---+-------------------------------------+ |-- id: string (nullable = true) |-- colors: string (nullable = true) +---+---------------------------------+ |id |colors | +---+---------------------------------+ |1 |{color -> red, value -> #f00} | |2 |{color -> green, value -> #0f0} | |3 |{color -> blue, value -> #00f} | |4 |{color -> cyan, value -> #0ff} | |5 |{color -> magenta, value -> #f0f}| |6 |{color -> yellow, value -> #ff0} | |7 |{color -> black, value -> #000} | +---+---------------------------------+ |-- id: string (nullable = true) |-- colors: map (nullable = true) | |-- key: string | |-- value: string (valueContainsNull = true) +---+----------------------------------+ |id |colors | +---+----------------------------------+ |1 |{"color":"red","value":"#f00"} | |2 |{"color":"green","value":"#0f0"} | |3 |{"color":"blue","value":"#00f"} | |4 |{"color":"cyan","value":"#0ff"} | |5 |{"color":"magenta","value":"#f0f"}| |6 |{"color":"yellow","value":"#ff0"} | |7 |{"color":"black","value":"#000"} | +---+----------------------------------+ |-- id: string (nullable = true) |-- colors: string (nullable = true) +---+-------+-----+ |id |color |value| +---+-------+-----+ |1 |red |#f00 | |2 |green |#0f0 | |3 |blue |#00f | |4 |cyan |#0ff | |5 |magenta|#f0f | |6 |yellow |#ff0 | |7 |black |#000 | +---+-------+-----+ |-- id: string (nullable = true) |-- color: string (nullable = true) |-- value: string (nullable = true)Like Article
![]() |
爱热闹的葫芦 · Column.IsIdentity Property (Microsoft.Web.Management.DatabaseManager) | Microsoft Learn 2 周前 |
![]() |
风流的火柴 · Field type: deviceid 10 月前 |