SparkSession.createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True)

Creates a DataFrame from an RDD, a list or a pandas.DataFrame.

When schema is a list of column names, the type of each column will be inferred from data.

When schema is None, it will try to infer the schema (column names and types) from data, which should be an RDD of either Row, namedtuple, or dict.
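
For illustration, a minimal sketch of schema inference from a namedtuple (this example is not part of the original doctests):

>>> from collections import namedtuple
>>> Person = namedtuple("Person", ["name", "age"])
>>> spark.createDataFrame(sc.parallelize([Person("Alice", 1)])).collect()
[Row(name='Alice', age=1)]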

When schema is pyspark.sql.types.DataType or a datatype string, it must match the real data, or an exception will be thrown at runtime. If the given schema is not pyspark.sql.types.StructType, it will be wrapped into a pyspark.sql.types.StructType as its only field, and the field name will be “value”. Each record will also be wrapped into a tuple, which can be converted to a Row later.
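
As a hedged sketch of this wrapping behavior (not part of the original doctests), a non-struct DataType produces a single column named “value”:

>>> from pyspark.sql.types import IntegerType
>>> spark.createDataFrame([1, 2, 3], IntegerType()).collect()
[Row(value=1), Row(value=2), Row(value=3)]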

If schema inference is needed, samplingRatio is used to determine the ratio of rows used for schema inference. The first row will be used if samplingRatio is None.
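
A minimal sketch of the effect of samplingRatio (illustrative, not part of the original doctests; the data is made up): with a sampling ratio set, inference scans a sampled fraction of all rows rather than only the leading rows, so None values at the front do not leave the column type undetermined.

>>> from pyspark.sql import Row
>>> sparse = sc.parallelize([Row(x=None)] * 200 + [Row(x=1.5)] * 200)
>>> spark.createDataFrame(sparse, samplingRatio=0.5).dtypes
[('x', 'double')]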

New in version 2.0.0.

Changed in version 2.1.0: Added verifySchema.

Parameters
data : RDD or iterable

an RDD of any kind of SQL data representation (e.g. Row, tuple, int, boolean, etc.), or a list, or a pandas.DataFrame.

schema : pyspark.sql.types.DataType, str or list, optional

a pyspark.sql.types.DataType or a datatype string or a list of column names, default is None. The data type string format equals pyspark.sql.types.DataType.simpleString, except that the top-level struct type can omit the struct<> wrapper and atomic types use typeName() as their format, e.g. use byte instead of tinyint for pyspark.sql.types.ByteType. We can also use int as a short name for pyspark.sql.types.IntegerType (a short sketch of these string names follows the parameter list).

samplingRatio : float, optional

the sampling ratio of rows used for schema inference

verifySchema : bool, optional

verify the data types of every row against the schema. Enabled by default.
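
A short sketch of the datatype string names mentioned under schema (illustrative, not part of the original doctests):

>>> spark.createDataFrame([(1,)], "a: byte").dtypes
[('a', 'tinyint')]
>>> spark.createDataFrame([(1,)], "a: int").dtypes
[('a', 'int')]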

Returns
DataFrame

Notes

Usage with spark.sql.execution.arrow.pyspark.enabled=True is experimental.

Examples

>>> l = [('Alice', 1)]
>>> spark.createDataFrame(l).collect()
[Row(_1='Alice', _2=1)]
>>> spark.createDataFrame(l, ['name', 'age']).collect()
[Row(name='Alice', age=1)]
>>> d = [{'name': 'Alice', 'age': 1}]
>>> spark.createDataFrame(d).collect()
[Row(age=1, name='Alice')]
>>> rdd = sc.parallelize(l)
>>> spark.createDataFrame(rdd).collect()
[Row(_1='Alice', _2=1)]
>>> df = spark.createDataFrame(rdd, ['name', 'age'])
>>> df.collect()
[Row(name='Alice', age=1)]
>>> from pyspark.sql import Row
>>> Person = Row('name', 'age')
>>> person = rdd.map(lambda r: Person(*r))
>>> df2 = spark.createDataFrame(person)
>>> df2.collect()
[Row(name='Alice', age=1)]
>>> from pyspark.sql.types import *
>>> schema = StructType([
...    StructField("name", StringType(), True),
...    StructField("age", IntegerType(), True)])
>>> df3 = spark.createDataFrame(rdd, schema)
>>> df3.collect()
[Row(name='Alice', age=1)]
>>> spark.createDataFrame(df.toPandas()).collect()  
[Row(name='Alice', age=1)]
>>> import pandas
>>> spark.createDataFrame(pandas.DataFrame([[1, 2]])).collect()  
[Row(0=1, 1=2)]
>>> spark.createDataFrame(rdd, "a: string, b: int").collect()
[Row(a='Alice', b=1)]
>>> rdd = rdd.map(lambda row: row[1])
>>> spark.createDataFrame(rdd, "int").collect()
[Row(value=1)]
>>> spark.createDataFrame(rdd, "boolean").collect() 
Traceback (most recent call last):
Py4JJavaError: ...
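
A hedged sketch of the verifySchema flag (illustrative, not part of the original doctests): with verification enabled (the default), a value that does not match the declared type is rejected when the DataFrame is built from local data; passing verifySchema=False skips that check.

>>> from pyspark.sql.types import StructType, StructField, IntegerType
>>> schema = StructType([StructField("age", IntegerType(), True)])
>>> spark.createDataFrame([("not a number",)], schema)
Traceback (most recent call last):
TypeError: ...
>>> df = spark.createDataFrame([("not a number",)], schema, verifySchema=False)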