添加链接
link管理
链接快照平台
  • 输入网页链接,自动生成快照
  • 标签化管理网页链接

In this article, we are going to discuss how to parse a column of json strings into their own separate columns. Here we will parse or read json string present in a csv file and convert it into multiple dataframe columns using Python Pyspark.

Example 1: Parse a Column of JSON Strings Using pyspark.sql.functions.from_json

For parsing json string we’ll use from_json() SQL function to parse the column containing json string into StructType with the specified schema. If the string is unparseable, it returns null.

The movie_input.csv file contains 15 records containing movie details(title, rating, releaseYear and genre) present in a JSON string. We want to read this file and parse the json string to extract the movie details into their own separate columns title, rating, releaseYear and genre.

This function requires two required parameters:

  • col: Name of column that contains the json string.
  • schema: a StructType or ArrayType of StructType to use when parsing the json column.
  • Python3

    a column of json strings').getOrCreate()
    df = spark.read.load( 'movie_input.csv' , header = True , format = "csv" )
    df.show()
    df.printSchema()

    Output:

    Once we have read the data into a dataframe, now let’s convert the JSON column into multiple columns using from_json(). As mentioned above this function takes the column name with JSON string and the JSON schema as arguments, so let’s create the schema that represents our data.

    Python3

    T.StructField( 'title' , T.StringType(), True ),
    T.StructField( 'rating' , T.StringType(), True ),
    T.StructField( 'releaseYear' , T.StringType(), True ),
    T.StructField( 'genre' , T.StringType(), True )

    Now, lets use the from_json() function which returns the Column struct with all the json columns.

    Python3

    import pyspark.sql.functions as F
    mapped_df = df.withColumn( "movie" , F.from_json( "movie" , schema))
    mapped_df.show(truncate = False )
    mapped_df.printSchema()

    Output:

    And finally, we explode the json struct to flatten it using the select method. We could have selected on cols movie.title, movie.rating, ..etc. But the better approach is to use * wildcard character which would select all the columns which has movie. prefix.

    Python3

    import pyspark.sql.functions as F
    parsed_df = mapped_df.select(F.col( 'id' ), F.col( "movie.*" ))
    parsed_df.show(truncate = False )
    parsed_df.printSchema()

    Example 2: Parse a column of json strings using to_json() and json_tuple()

    For this, we’ll be using to_json() and json_tuple() in addition to the above method.

  • First, we’ll map the JSON string column to MapType with the help of from_json() function.
  • Then we’ll convert the MapType column to JSON string. If the JSON data was incorrect, the function would throw an exception.
  • Finally, we’ll create new columns for the JSON column according to the field names passed.
  • Python3

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F
    import pyspark.sql.types as T
    if __name__ = = "__main__" :
    spark = SparkSession.builder.appName('Parse a\
    column of json strings').getOrCreate()
    df = spark.createDataFrame(
    [ "1" , "{'color': 'red', 'value': '#f00'}" ],
    [ "2" , "{'color': 'green', 'value': '#0f0'}" ],
    [ "3" , "{'color': 'blue', 'value': '#00f'}" ],
    [ "4" , "{'color': 'cyan', 'value': '#0ff'}" ],
    [ "5" , "{'color': 'magenta', 'value': '#f0f'}" ],
    [ "6" , "{'color': 'yellow', 'value': '#ff0'}" ],
    [ "7" , "{'color': 'black', 'value': '#000'}" ],
    ).toDF( 'id' , 'colors' )
    df.show(truncate = False )
    df.printSchema()
    df = df.withColumn( "colors" ,
    F.from_json(df.colors,
    T.MapType(T.StringType(),
    T.StringType())))
    df.show(truncate = False )
    df.printSchema()
    df = df.withColumn( "colors" , F.to_json(df.colors))
    df.show(truncate = False )
    df.printSchema()
    df = df.select( 'id' , F.json_tuple(F.col( "colors" ),
    "color" , "value" )
    ).toDF( 'id' , 'color' , 'value' )
    df.show(truncate = False )
    df.printSchema()

    Output:

    +---+-------------------------------------+
    |id |colors                               |
    +---+-------------------------------------+
    |1  |{'color': 'red', 'value': '#f00'}    |
    |2  |{'color': 'green', 'value': '#0f0'}  |
    |3  |{'color': 'blue', 'value': '#00f'}   |
    |4  |{'color': 'cyan', 'value': '#0ff'}   |
    |5  |{'color': 'magenta', 'value': '#f0f'}|
    |6  |{'color': 'yellow', 'value': '#ff0'} |
    |7  |{'color': 'black', 'value': '#000'}  |
    +---+-------------------------------------+
     |-- id: string (nullable = true)
     |-- colors: string (nullable = true)
    +---+---------------------------------+
    |id |colors                           |
    +---+---------------------------------+
    |1  |{color -> red, value -> #f00}    |
    |2  |{color -> green, value -> #0f0}  |
    |3  |{color -> blue, value -> #00f}   |
    |4  |{color -> cyan, value -> #0ff}   |
    |5  |{color -> magenta, value -> #f0f}|
    |6  |{color -> yellow, value -> #ff0} |
    |7  |{color -> black, value -> #000}  |
    +---+---------------------------------+
     |-- id: string (nullable = true)
     |-- colors: map (nullable = true)
     |    |-- key: string
     |    |-- value: string (valueContainsNull = true)
    +---+----------------------------------+
    |id |colors                            |
    +---+----------------------------------+
    |1  |{"color":"red","value":"#f00"}    |
    |2  |{"color":"green","value":"#0f0"}  |
    |3  |{"color":"blue","value":"#00f"}   |
    |4  |{"color":"cyan","value":"#0ff"}   |
    |5  |{"color":"magenta","value":"#f0f"}|
    |6  |{"color":"yellow","value":"#ff0"} |
    |7  |{"color":"black","value":"#000"}  |
    +---+----------------------------------+
     |-- id: string (nullable = true)
     |-- colors: string (nullable = true)
    +---+-------+-----+
    |id |color  |value|
    +---+-------+-----+
    |1  |red    |#f00 |
    |2  |green  |#0f0 |
    |3  |blue   |#00f |
    |4  |cyan   |#0ff |
    |5  |magenta|#f0f |
    |6  |yellow |#ff0 |
    |7  |black  |#000 |
    +---+-------+-----+
     |-- id: string (nullable = true)
     |-- color: string (nullable = true)
     |-- value: string (nullable = true)
    Like Article
    We use cookies to ensure you have the best browsing experience on our website. By using our site, you acknowledge that you have read and understood our Cookie Policy & Privacy Policy Got It !
    Please go through our recently updated Improvement Guidelines before submitting any improvements.
    This article is being improved by another user right now. You can suggest the changes for now and it will be under the article's discussion tab.
    You will be notified via email once the article is available for improvement. Thank you for your valuable feedback!
    Please go through our recently updated Improvement Guidelines before submitting any improvements.
    Suggest Changes
    Help us improve. Share your suggestions to enhance the article. Contribute your expertise and make a difference in the GeeksforGeeks portal.