mode : str, optional
allows a mode for dealing with corrupt records during parsing. If None is
set, it uses the default value, PERMISSIVE.
- PERMISSIVE: when it meets a corrupted record, puts the malformed string into a field configured by columnNameOfCorruptRecord, and sets malformed fields to null. To keep corrupt records, a user can set a string type field named columnNameOfCorruptRecord in a user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When inferring a schema, it implicitly adds a columnNameOfCorruptRecord field to the output schema.
- DROPMALFORMED: ignores whole corrupted records.
- FAILFAST: throws an exception when it meets corrupted records.
columnNameOfCorruptRecord : str, optional
allows renaming the new field that holds the malformed string created by
PERMISSIVE mode. This overrides spark.sql.columnNameOfCorruptRecord. If None
is set, it uses the value specified in spark.sql.columnNameOfCorruptRecord.
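For example, a minimal sketch of keeping corrupt records in an explicit column, assuming an active SparkSession spark and that the corrupt-record column uses the default name _corrupt_record:

>>> from pyspark.sql.types import StructType, StructField, StringType, LongType
>>> schema = StructType([
...     StructField("age", LongType(), True),
...     StructField("name", StringType(), True),
...     StructField("_corrupt_record", StringType(), True)])
>>> df = spark.read.json('python/test_support/sql/people.json', schema=schema,
...     mode='PERMISSIVE', columnNameOfCorruptRecord='_corrupt_record')

Rows that parse cleanly leave _corrupt_record as null; malformed lines appear there verbatim, with the other fields set to null.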
dateFormat : str, optional
sets the string that indicates a date format. Custom date formats follow
the formats at Datetime Patterns
(https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html).
This applies to date type. If None is set, it uses the default value,
yyyy-MM-dd.
timestampFormat : str, optional
sets the string that indicates a timestamp format. Custom date formats
follow the formats at Datetime Patterns
(https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html).
This applies to timestamp type. If None is set, it uses the default value,
yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX].
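For instance, a sketch that overrides both formats (the path and patterns here are illustrative only; spark is an active SparkSession):

>>> df = spark.read.json('path/to/events.json',
...     dateFormat='yyyy/MM/dd',
...     timestampFormat='yyyy/MM/dd HH:mm:ss')

The formats only take effect for fields read as date or timestamp types, for example through a user-supplied schema.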
multiLine : str or bool, optional
parse one record, which may span multiple lines, per file. If None is set,
it uses the default value, false.
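For example, a sketch for reading files that each hold one pretty-printed JSON document spanning several lines (the path is hypothetical):

>>> df = spark.read.json('path/to/pretty_printed.json', multiLine=True)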
allowUnquotedControlChars : str or bool, optional
allows JSON Strings to contain unquoted control characters (ASCII characters
with value less than 32, including tab and line feed characters) or not.
encoding : str or bool, optional
forcibly sets one of the standard basic or extended encodings for the JSON
files, for example UTF-16BE or UTF-32LE. If None is set, the encoding of the
input JSON will be detected automatically when the multiLine option is set
to true.
lineSep : str, optional
defines the line separator that should be used for parsing. If None is set,
it covers all \r, \r\n and \n.
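For example, a sketch for input whose records are terminated by Windows-style line endings (the path is hypothetical):

>>> df = spark.read.json('path/to/records.json', lineSep='\r\n')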
samplingRatio : str or float, optional
defines the fraction of input JSON objects used for schema inferring.
If None is set, it uses the default value, 1.0.
dropFieldIfAllNull : str or bool, optional
whether to ignore columns of all null values or empty array/struct during
schema inference. If None is set, it uses the default value, false.
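For example, a sketch in which the always-null field b would be ignored during schema inference (assuming an active SparkContext sc and SparkSession spark):

>>> rdd = sc.parallelize(['{"a": 1, "b": null}', '{"a": 2, "b": null}'])
>>> df = spark.read.json(rdd, dropFieldIfAllNull=True)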
locale : str, optional
sets a locale as a language tag in IETF BCP 47 format. If None is set,
it uses the default value, en-US. For instance, locale is used while
parsing dates and timestamps.
pathGlobFilter : str or bool, optional
an optional glob pattern to only include files with paths matching the
pattern. The syntax follows org.apache.hadoop.fs.GlobFilter. It does not
change the behavior of partition discovery.
recursiveFileLookup : str or bool, optional
recursively scan a directory for files. Using this option disables
partition discovery.
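For example, a sketch that recursively picks up only files ending in .json under a directory (the path is hypothetical):

>>> df = spark.read.json('data/logs', pathGlobFilter='*.json',
...     recursiveFileLookup=True)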
allowNonNumericNumbers : str or bool
allows the JSON parser to recognize a set of "Not-a-Number" (NaN) tokens as
legal floating point number values. If None is set, it uses the default
value, true.
- +INF: for positive infinity, as well as the aliases +Infinity and Infinity.
- -INF: for negative infinity, alias -Infinity.
- NaN: for other not-a-numbers, like the result of division by zero.
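For example, a sketch that parses such tokens from an in-memory dataset (assuming an active SparkContext sc and SparkSession spark):

>>> rdd = sc.parallelize(['{"v": 1.5}', '{"v": NaN}', '{"v": -INF}'])
>>> df = spark.read.json(rdd, allowNonNumericNumbers=True)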
modifiedBefore
an optional timestamp to only include files with modification times
occurring before the specified time. The provided timestamp must be in the
following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
modifiedAfter
an optional timestamp to only include files with modification times
occurring after the specified time. The provided timestamp must be in the
following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00); see the
additional example at the end of the Examples section below.
Examples
>>> df1 = spark.read.json('python/test_support/sql/people.json')
>>> df1.dtypes
[('age', 'bigint'), ('name', 'string')]
>>> rdd = sc.textFile('python/test_support/sql/people.json')
>>> df2 = spark.read.json(rdd)
>>> df2.dtypes
[('age', 'bigint'), ('name', 'string')]
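A further sketch using the modifiedBefore/modifiedAfter filters described above (the directory path is hypothetical):

>>> df3 = spark.read.json('data/events',
...     modifiedAfter='2020-06-01T13:00:00',
...     modifiedBefore='2021-01-01T00:00:00')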