pyspark.sql.DataFrameReader.json — PySpark 3.1.2 documentation



DataFrameReader.json(path, schema=None, primitivesAsString=None, prefersDecimal=None, allowComments=None, allowUnquotedFieldNames=None, allowSingleQuotes=None, allowNumericLeadingZero=None, allowBackslashEscapingAnyCharacter=None, mode=None, columnNameOfCorruptRecord=None, dateFormat=None, timestampFormat=None, multiLine=None, allowUnquotedControlChars=None, lineSep=None, samplingRatio=None, dropFieldIfAllNull=None, encoding=None, locale=None, pathGlobFilter=None, recursiveFileLookup=None, allowNonNumericNumbers=None, modifiedBefore=None, modifiedAfter=None)

Loads JSON files and returns the results as a DataFrame.

JSON Lines (newline-delimited JSON) is supported by default. For JSON (one record per file), set the multiLine parameter to true.

If the schema parameter is not specified, this function goes through the input once to determine the input schema.

New in version 1.4.0.

Parameters

path str, list or RDD

a string representing the path to the JSON dataset, a list of paths, or an RDD of strings storing JSON objects.

schema pyspark.sql.types.StructType or str, optional

an optional pyspark.sql.types.StructType for the input schema or a DDL-formatted string (for example, col0 INT, col1 DOUBLE).

primitivesAsString str or bool, optional

infers all primitive values as a string type. If None is set, it uses the default value, false.

prefersDecimal str or bool, optional

infers all floating-point values as a decimal type. If the values do not fit in decimal, then it infers them as doubles. If None is set, it uses the default value, false.

allowComments str or bool, optional

ignores Java/C++ style comments in JSON records. If None is set, it uses the default value, false.

allowUnquotedFieldNames str or bool, optional

allows unquoted JSON field names. If None is set, it uses the default value, false.

allowSingleQuotes str or bool, optional

allows single quotes in addition to double quotes. If None is set, it uses the default value, true.

allowNumericLeadingZero str or bool, optional

allows leading zeros in numbers (e.g. 00012). If None is set, it uses the default value, false.

allowBackslashEscapingAnyCharacter str or bool, optional

allows quoting of any character using the backslash quoting mechanism. If None is set, it uses the default value, false.

mode str, optional

allows a mode for dealing with corrupt records during parsing. If None is set, it uses the default value, PERMISSIVE.

PERMISSIVE : when it meets a corrupted record, puts the malformed string into a field configured by columnNameOfCorruptRecord, and sets malformed fields to null. To keep corrupt records, a user can set a string type field named columnNameOfCorruptRecord in a user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When inferring a schema, it implicitly adds a columnNameOfCorruptRecord field in an output schema.

DROPMALFORMED : ignores the whole corrupted records.

FAILFAST : throws an exception when it meets corrupted records.

columnNameOfCorruptRecord str, optional

allows renaming the new field having malformed string created by PERMISSIVE mode. This overrides spark.sql.columnNameOfCorruptRecord. If None is set, it uses the value specified in spark.sql.columnNameOfCorruptRecord.

dateFormat str, optional

sets the string that indicates a date format. Custom date formats follow the formats at datetime pattern. This applies to date type. If None is set, it uses the default value, yyyy-MM-dd.

timestampFormat str, optional

sets the string that indicates a timestamp format. Custom date formats follow the formats at datetime pattern. This applies to timestamp type. If None is set, it uses the default value, yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX].

multiLine str or bool, optional

parses one record, which may span multiple lines, per file. If None is set, it uses the default value, false.

allowUnquotedControlChars str or bool, optional

whether to allow JSON strings to contain unquoted control characters (ASCII characters with value less than 32, including tab and line feed characters).

encoding str or bool, optional

forcibly sets one of the standard basic or extended encodings for the JSON files, for example UTF-16BE or UTF-32LE. If None is set, the encoding of input JSON will be detected automatically when the multiLine option is set to true.

lineSep str, optional

defines the line separator that should be used for parsing. If None is set, it covers all \r, \r\n and \n.

samplingRatio str or float, optional

defines the fraction of input JSON objects used for schema inference. If None is set, it uses the default value, 1.0.

dropFieldIfAllNull str or bool, optional

whether to ignore columns of all null values or empty arrays/structs during schema inference. If None is set, it uses the default value, false.

locale str, optional

sets a locale as language tag in IETF BCP 47 format. If None is set, it uses the default value, en-US. For instance, locale is used while parsing dates and timestamps.

pathGlobFilter str or bool, optional

an optional glob pattern to only include files with paths matching the pattern. The syntax follows org.apache.hadoop.fs.GlobFilter. It does not change the behavior of partition discovery.

recursiveFileLookup str or bool, optional

recursively scan a directory for files. Using this option disables partition discovery.

allowNonNumericNumbers str or bool

allows the JSON parser to recognize a set of “Not-a-Number” (NaN) tokens as legal floating number values. If None is set, it uses the default value, true.

+INF : for positive infinity, as well as alias of +Infinity and Infinity.

-INF : for negative infinity, alias -Infinity.

NaN : for other not-a-numbers, like the result of division by zero.

modifiedBefore

an optional timestamp to only include files with modification times occurring before the specified time. The provided timestamp must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)

modifiedAfter

an optional timestamp to only include files with modification times occurring after the specified time. The provided timestamp must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)

Examples

>>> df1 = spark.read.json('python/test_support/sql/people.json')
>>> df1.dtypes
[('age', 'bigint'), ('name', 'string')]
>>> rdd = sc.textFile('python/test_support/sql/people.json')
>>> df2 = spark.read.json(rdd)
>>> df2.dtypes
[('age', 'bigint'), ('name', 'string')]