PySpark Big Data Processing - 康行天下

PySpark

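PySpark uses a multi-process architecture in which Python and the JVM run as separate processes; both the Driver and the Executors run a Python process alongside a JVM process. When a PySpark script is submitted with spark-submit, the Driver runs the Python script directly and launches the JVM from Python; RDD and DataFrame operations called in Python are forwarded to the Java API through Py4j. On the Executor side it is the other way round: the Executor JVM processes are launched first, and each JVM then starts Python child processes to run Python UDFs, using sockets for inter-process communication.
PySpark relies on the open-source library Py4j. Creating the SparkContext object on the Python side actually launches the JVM and creates a SparkContext object on the Scala side.
For Spark's built-in operators, calls to the RDD and DataFrame APIs in Python go through the JVM to the Scala APIs, so execution is no different from using Scala directly. When a UDF is needed, however, the Executor has to start a Python worker child process to run the UDF logic.
After starting the Python child process, the Executor creates a socket and connects to it. All RDD data has to be serialized and sent over the socket, and the results are serialized in the same way and sent back to the JVM.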
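To make the serialization round trip concrete, here is a minimal pure-Python sketch. It is not PySpark's actual wire protocol: socket.socketpair() merely stands in for the JVM-to-worker connection, pickle for the serializer, and the udf function and length-prefixed framing are assumptions made for illustration.

```python
import pickle
import socket
import struct


def send_msg(sock, obj):
    """Pickle obj and send it with a 4-byte big-endian length prefix."""
    payload = pickle.dumps(obj)
    sock.sendall(struct.pack(">I", len(payload)) + payload)


def recv_exact(sock, n):
    """Read exactly n bytes from the socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed early")
        buf += chunk
    return buf


def recv_msg(sock):
    """Receive one length-prefixed pickled object."""
    (size,) = struct.unpack(">I", recv_exact(sock, 4))
    return pickle.loads(recv_exact(sock, size))


def udf(x):
    # Hypothetical user-defined function run on the "Python worker" side.
    return x * 2


# socketpair() stands in for the JVM <-> Python worker connection.
jvm_side, worker_side = socket.socketpair()

send_msg(jvm_side, [1, 2, 3])                    # "JVM" ships a serialized batch
batch = recv_msg(worker_side)                    # worker deserializes it
send_msg(worker_side, [udf(x) for x in batch])   # worker sends results back
result = recv_msg(jvm_side)                      # "JVM" deserializes the results

jvm_side.close()
worker_side.close()
```

Every batch is serialized and deserialized twice, once in each direction, which is where the communication overhead of Python UDFs comes from.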
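Drawbacks of PySpark:
Communication overhead between Python and the JVM.
You still need to learn Spark's distributed programming model.
Note: Databricks introduced the Koalas API so that users can write distributed Spark jobs in a style close to single-machine Pandas, which is friendlier to data scientists. Koalas was released publicly in 2019 and aims to provide a package on top of Spark with the same interface as Python's Pandas.
Reference: PYSPARK 原理解析
Client installation: pip install pyspark
Run pyspark in a terminal to enter the pyspark shell.
This is similar to $SPARK_HOME/bin/spark-shell and $SPARK_HOME/bin/sparkR: you write and run code interactively. Entering the shell establishes the connection to the cluster and requests compute resources.
Option 1: pyspark < script.py is still shell mode, just with input redirection.
Option 2: spark-submit script.py
The [OPTION] arguments accepted by the command are the usual Spark options, for example: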
--name ${app_name}
--master yarn --deploy-mode cluster
--executor-memory 15G
--driver-memory 30G
--num-executors 100
--executor-cores 4
--queue root
--conf spark.driver.userClassPathFirst=true
--jars ...
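When the client and the workers run different Python versions, some operations fail in the pyspark shell. The exact cause is unclear to me; it may be related to the serialization used for communication. Set the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables so that both sides use the same Python version.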
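Adding dependency files

Before the job starts
When launching spark-shell or submitting a job, specify local dependency files through command-line options, much as with Hadoop.
For example, to load base64, a library written in Java (implemented underneath with JNI calls into C++ code):
spark-shell --files libbase64.so --jars base64.jar ...
Reference: 在 Apache Spark 中使用 JNI 调用 C/C++ 代码 (calling C/C++ code through JNI in Apache Spark), with code examples.
After the job starts
To fetch dependency files (dynamic libraries, data files, model parameters, etc.) after startup, use SparkContext's addFile() method to add a file or directory from a local or HDFS path. The Spark Driver and Executors can then call SparkFiles.get() to obtain the file's absolute path. Note: local directories are not supported in cluster mode.
sparkContext.addFile(path, recursive=False)
addFile ships the added local file to all Workers, so every Worker can access it correctly. Workers place the file in a temporary directory. It therefore suits cases where the file is small and the computation is heavy; the larger the file, the longer the network transfer takes.
If the added file is a common archive format, Spark also calls an external command such as tar to unpack it automatically.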
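Deleting an HDFS directory

Like Hadoop, Spark fails when saving to a directory that already exists, so the directory must be deleted first. Spark does not expose a direct API for this, however; deletion goes through the Hadoop FileSystem API using the Hadoop configuration.
Python reference code:
def rm_hdfs_dir(path):
    # spark is a SparkSession object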
    Path = spark.sparkContext._gateway.jvm.org.apache.hadoop.fs.Path
    URI = spark.sparkContext._gateway.jvm.java.net.URI
    FileSystem = spark.sparkContext._gateway.jvm.org.apache.hadoop.fs.FileSystem
    fs = FileSystem.get(URI(path), spark.sparkContext._jsc.hadoopConfiguration())
    fs.delete(Path(path), True)
# rm_hdfs_dir(path)
# rdd.repartition(partition).saveAsTextFile(path)
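For the Scala version, see: Spark删除HDFS文件的两种方式 (two ways to delete HDFS files in Spark).
sc.textFile only supports UTF-8 by default when reading, and saveAsTextFile likewise writes UTF-8, so take care with GBK and other encodings; it is best to preprocess such files into UTF-8 first.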
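One way to do that preprocessing is to re-encode the file before handing it to Spark. The following is a minimal sketch in plain Python; the helper name convert_to_utf8 and the temporary file names are made up for the example, and it assumes the source file really is GBK.

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding="gbk"):
    """Re-encode a text file to UTF-8, streaming line by line."""
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


# Demo with a temporary GBK file (the file names are hypothetical).
tmp_dir = tempfile.mkdtemp()
gbk_path = os.path.join(tmp_dir, "input.gbk.txt")
utf8_path = os.path.join(tmp_dir, "input.utf8.txt")

with open(gbk_path, "wb") as f:
    f.write("大数据\n".encode("gbk"))

convert_to_utf8(gbk_path, utf8_path)

with open(utf8_path, "rb") as f:
    converted = f.read()
```

The converted file can then be read with sc.textFile as usual.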
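Analyzing application speed
A Spark job can be slow for many reasons; as with Hadoop, you may run into data skew, a poorly chosen number of cores, and similar problems.
Open the application's tracking URL to reach the Spark UI.

Click the Executors tab to see the cores that were granted and the number of tasks currently running (Active Tasks).

If many tasks are waiting while few cores are available, the job is short of resources. When PySpark is not granted the full amount of resources it requested, it still runs with whatever portion it obtained.
Common RDD operators
An RDD (Resilient Distributed Dataset) is immutable; a new RDD can only be derived from it through transformation operators, while action operators compute results from it.

Spark follows a functional programming style, which suits multi-core parallel computation.
The common RDD operators are listed below. References:
Official documentation: https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations
Chinese tutorial: Spark 系列（四）: RDD常用算子详解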
map(func)
Return a new distributed dataset formed by passing each element of the source through a function func.
filter(func)
Return a new dataset formed by selecting those elements of the source on which func returns true.
flatMap(func)
Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
mapPartitions(func)
Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.
mapPartitionsWithIndex(func)
Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.
sample(withReplacement, fraction, seed)
Sample a fraction fraction of the data, with or without replacement, using a given random number generator seed.
union(otherDataset)
Return a new dataset that contains the union of the elements in the source dataset and the argument.
intersection(otherDataset)
Return a new RDD that contains the intersection of elements in the source dataset and the argument.
distinct([numPartitions])
Return a new dataset that contains the distinct elements of the source dataset.
groupByKey([numPartitions])
When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.  Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance.  Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numPartitions argument to set a different number of tasks.
reduceByKey(func, [numPartitions])
When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
aggregateByKey(zeroValue)(seqOp, combOp, [numPartitions])
When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
sortByKey([ascending], [numPartitions])
When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
join(otherDataset, [numPartitions])
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
cogroup(otherDataset, [numPartitions])
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.
cartesian(otherDataset)
When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).
pipe(command, [envVars])
Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned as an RDD of strings.
coalesce(numPartitions)
Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.
repartition(numPartitions)
Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.
repartitionAndSortWithinPartitions(partitioner)
Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. This is more efficient than calling repartition and then sorting within each partition because it can push the sorting down into the shuffle machinery.
reduce(func)
Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
collect()
Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
count()
Return the number of elements in the dataset.
first()
Return the first element of the dataset (similar to take(1)).
take(n)
Return an array with the first n elements of the dataset.
takeSample(withReplacement, num, [seed])
Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.
takeOrdered(n, [ordering])
Return the first n elements of the RDD using either their natural order or a custom comparator.
saveAsTextFile(path)
Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
saveAsSequenceFile(path)  (Java and Scala)
Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop's Writable interface. In Scala, it is also available on types that are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc).
saveAsObjectFile(path)  (Java and Scala)
Write the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile().
countByKey()
Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key.
foreach(func)
Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.  Note: modifying variables other than Accumulators outside of the foreach() may result in undefined behavior. See Understanding closures for more details.
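Passing extra arguments to functions; closures
The func passed to the RDD operators above only receives the data as its argument. There are three ways to pass additional arguments in:
1. Closure
def test(abc):
    def my_map_partitions(partition):
        # ... do something with partition and abc ...
        yield from partition  # placeholder: pass the rows through unchanged
    return my_map_partitions
df.rdd.mapPartitions(test(abc))
2. Anonymous function

rdd.flatMap(lambda j: processDataLine(j, arg1, arg2))
3. Currying

functools.partial
from functools import partial
def my_func(arg, j):
    ...  # function body omitted in the original
rdd.map(partial(my_func, arg))
You can also use the curry decorator provided by the toolz library.
In terms of execution speed, closure > anonymous function > currying / partials according to the referenced answer, although anonymous functions may fit everyday habits better.
Reference: Stack Overflow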
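The three approaches can be compared side by side without a cluster. In this sketch Python's built-in map over a list stands in for rdd.map, and scale and make_scaler are hypothetical helper names chosen for the example.

```python
from functools import partial


def scale(value, factor, offset):
    # Hypothetical per-element function that needs two extra arguments.
    return value * factor + offset


def make_scaler(factor, offset):
    # 1. Closure: the inner function captures factor and offset.
    def scaler(value):
        return scale(value, factor, offset)
    return scaler


data = [1, 2, 3]  # stands in for the elements of an RDD

by_closure = list(map(make_scaler(10, 1), data))
# 2. Anonymous function: the lambda closes over the literal arguments.
by_lambda = list(map(lambda v: scale(v, 10, 1), data))
# 3. functools.partial: pre-binds the extra arguments by keyword.
by_partial = list(map(partial(scale, factor=10, offset=1), data))
```

All three produce the same result; they differ only in how the extra arguments are bound to the function that Spark serializes and ships to the workers.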
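Common debugging statements
rdd.first() # returns the first element of the RDD
rdd.take(n) # returns the first n elements
print(rdd) # shows the RDD type
print(type(rdd))
print(df) # shows the DataFrame's column names and types
df.show() # displays the rows as a table, like a MySQL query result
# after collect() the data is plain Python objects in driver memory and can be printed
for x in word_count_rdd.collect():
    print(x[0] + ", " + str(x[1]))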
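Number of partitions and executors
rdd.partitions.size  (Scala)

rdd.partitions.length  (Scala)

rdd.getNumPartitions()  (Python)
Relationship between partitions and parallel cores:

Partitions carry over Hadoop's notion of splitting data into blocks: the number of part files in the output equals the number of partitions, and all partitions can be processed fully in parallel only when the number of cores is >= the number of partitions; otherwise some tasks run one after another.

The commonly cited guideline is to set spark.default.parallelism to 2-4 times num-executors * executor-cores. Likewise, when calling repartition, choose a partition count larger than the number of available cores.
The Spark RDD Programming Guide recommends 2-4 partitions per CPU core in the cluster ("Typically you want 2-4 partitions for each CPU in your cluster.").
https://tech.meituan.com/2016/04/29/spark-tuning-basic.html
https://luminousmen.com/post/spark-tips-partition-tuning  (this one suggests 4x.)
Tuning the Number of Partitions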
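The sizing rule is simple arithmetic. The sketch below applies it to the --num-executors 100 --executor-cores 4 settings shown earlier; the function names are made up for illustration, and the "waves" figure assumes one task per partition and one core per task.

```python
import math


def suggested_partitions(num_executors, executor_cores, low=2, high=4):
    """Return (total_cores, min_partitions, max_partitions) for the 2-4x rule."""
    total_cores = num_executors * executor_cores
    return total_cores, total_cores * low, total_cores * high


def task_waves(num_partitions, total_cores):
    """Number of sequential rounds needed to run one task per partition."""
    return math.ceil(num_partitions / total_cores)


cores, lo, hi = suggested_partitions(num_executors=100, executor_cores=4)
waves = task_waves(num_partitions=1000, total_cores=cores)
```

With 400 cores the rule suggests roughly 800 to 1600 partitions, and 1000 partitions would run in 3 waves of tasks.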
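For sc.textFile, the number of partitions is max(sc.defaultMinPartitions, number of HDFS blocks).

For sc.parallelize, the number of partitions is sc.defaultParallelism, e.g. sc.parallelize(1 to 100).partitions.size (Scala).

If spark.default.parallelism is not configured, sc.defaultParallelism depends on the cluster manager: in local mode it is the number of cores on the local machine, and in most cluster modes it is the total number of cores across all executors or 2, whichever is larger.

In a Jupyter notebook I found both defaultMinPartitions and defaultParallelism to be 2, which matched the number of CPU cores of the single executor.
Checking the number of executors
sc.getConf().get("spark.executor.instances")

An executor is a process and can use multiple cores in parallel. The total parallelism is the total number of cores (num-executors * executor-cores).

Reference: Stack Overflow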
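repartition/coalesce
rdd.repartition(numPartitions)

df.repartition(numPartitions); if the number of partitions is not specified (repartitioning by columns only), the default partition count is used (spark.sql.shuffle.partitions for DataFrames).
Note: RDD.repartition actually calls coalesce(numPartitions, shuffle=True), which performs a global shuffle of all rows across the partitions on all nodes.

coalesce lets you control whether to shuffle, with one constraint: when shuffle is False it can only reduce the number of partitions, not increase it.
Because a shuffle has to move data and is expensive, use coalesce(numPartitions, shuffle=False) when reducing the number of partitions if a shuffle is not needed.

A shuffle redistributes all rows of all partitions globally and produces new partitions.

Without a shuffle, partitions can only be merged: the data of the surplus partitions is copied as is into other partitions. Since the data may already be unevenly distributed across partitions, coalesce(shuffle=False) can make this worse and cause data skew, so that the overall running time ends up longer than with a shuffle. Whether to shuffle therefore depends on the situation. For example, after a filter the number of partitions stays the same but the amount of data per partition may become unbalanced; a shuffle is usually needed then.
repartition serves the following purposes:
Mainly, redistributing data so that data skew does not drag down the overall running time.
Setting a smaller number of partitions before the final output to control the number of output files and avoid too many small files. By default all nodes are used and the number of partitions can be very large.
A commonly used partitioning scheme is RoundRobinPartitioning, which places all rows into the new partitions in round-robin fashion so that the data ends up almost evenly distributed. Sometimes, in a processing chain, the data produced upstream is extremely unevenly distributed, which makes downstream processing troublesome: a few long-running tasks can hold up the whole job. Adding a repartition at that point can improve the job's running time dramatically.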
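The difference between merging partitions and redistributing rows can be simulated in plain Python. This is a simplified model rather than Spark's implementation: partitions are lists, coalesce is assumed to group contiguous parent partitions, and the round-robin dealer ignores Spark's randomized starting position.

```python
def coalesce_no_shuffle(partitions, n):
    """Merge whole partitions into n groups without moving individual rows.

    Simplification: contiguous parent partitions are grouped together.
    """
    groups = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        groups[i * n // len(partitions)].extend(part)
    return groups


def repartition_round_robin(partitions, n):
    """Deal every row to the n target partitions in turn (full shuffle)."""
    groups = [[] for _ in range(n)]
    k = 0
    for part in partitions:
        for row in part:
            groups[k % n].append(row)
            k += 1
    return groups


# Skewed input: after a filter, partition 0 kept far more rows than the rest.
skewed = [list(range(90)), [90, 91], [92, 93, 94], [95, 96, 97, 98, 99]]

merged_sizes = [len(p) for p in coalesce_no_shuffle(skewed, 2)]
shuffled_sizes = [len(p) for p in repartition_round_robin(skewed, 2)]
```

Merging leaves one partition with 92 rows and the other with 8, whereas the round-robin shuffle yields 50 and 50, at the cost of moving every row.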
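Notes:

repartition does not change the relative order of rows within a partition, so it cannot be used to shuffle rows globally at random.

repartitionAndSortWithinPartitions sorts within each partition while repartitioning (a random sort key can be given), which is more efficient than repartitioning first and then calling sortBy.

A DataFrame can be put in random order with orderBy(F.rand()) (but orderBy is a global sort, which affects performance).
The API of repartitionAndSortWithinPartitions is:
RDD.repartitionAndSortWithinPartitions(numPartitions=None, partitionFunc=<function portable_hash>, ascending=True, keyfunc=<function RDD.<lambda>>)
Usage example:
from random import random
# sort within partitions
rdd.repartitionAndSortWithinPartitions(numPartitions=100, keyfunc=lambda x: random())
# global sort
rdd.sortBy(lambda x: random())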
https://www.jianshu.com/p/391d42665a30
https://stackoverflow.com/questions/31610971/spark-repartition-vs-coalesce
Sorting an RDD
RDDs can be sorted, for example:
Sort key-value data by key: sortByKey(ascending=False)
sortBy(func)
repartitionAndSortWithinPartitions(numPartitions, keyfunc)
Take the top records: top(n, key=key_func). This uses less memory than a global sort followed by take(n) (think of how a min-heap of size n works).

top_word_count = word_count_rdd.top(3, key=lambda record: (record[1], record[0]))
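The heap-based idea can be sketched in plain Python: each partition keeps only its n best candidates, and the small candidate lists are then merged. This illustrates the technique rather than Spark's source code; the partitions and word counts are invented data.

```python
import heapq


def top_n(partitions, n, key=None):
    """Per-partition top-n with a bounded heap, then merge the candidates."""
    candidates = []
    for part in partitions:
        # heapq.nlargest keeps at most n items in memory per partition.
        candidates.extend(heapq.nlargest(n, part, key=key))
    return heapq.nlargest(n, candidates, key=key)


# Hypothetical (word, count) records spread over three partitions.
word_counts = [
    [("spark", 7), ("rdd", 3)],
    [("python", 9), ("jvm", 1), ("udf", 4)],
    [("shuffle", 5)],
]
top3 = top_n(word_counts, 3, key=lambda record: (record[1], record[0]))
```

Only n records per partition ever need to be held, instead of a fully sorted copy of the data.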
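map and mapPartitions
map operates on each element of the RDD, whereas mapPartitions (and likewise foreachPartition) operates on the iterator of each partition. If the mapping needs to create extra objects repeatedly (for example, writing RDD data to a database over JDBC: map opens one connection per element, mapPartitions one per partition), mapPartitions is far more efficient than map.

In other words, map calls func once per element, while mapPartitions calls func once per partition on that partition's batch of data. There are fewer calls, but the function has to iterate over the elements itself. Because mapPartitions can iterate over all the data in a partition, it can be used to compute statistics (for batch normalization, say), but watch out for running out of memory.
Example code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Example').getOrCreate()
data = [('James','Smith','M',3000),
  ('Anna','Rose','F',4100),
  ('Robert','Williams','M',6200),
]
columns = ["firstname","lastname","gender","salary"]
df = spark.createDataFrame(data=data, schema = columns)
df.show()
#Example 1: a generator that yields rows while iterating
def reformat(partitionData):
    for row in partitionData:
        yield [row.firstname+","+row.lastname,row.salary*10/100]
#Example 2: builds the whole list first and then returns an iterator; beware of OOM
def reformat2(partitionData):
  updatedData = []
  for row in partitionData:
    name=row.firstname+","+row.lastname
    bonus=row.salary*10/100
    updatedData.append([name,bonus])
  return iter(updatedData)
# the function passed to mapPartitions must return an iterator
# df is left unchanged so that both examples read the original columns
df1=df.rdd.mapPartitions(reformat).toDF(["name","bonus"])
df1.show()
df2=df.rdd.mapPartitions(reformat2).toDF(["name","bonus"])
df2.show()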
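mapValues, flatMap, flatMapValues
mapValues(function)

The keys of the original RDD stay unchanged and are paired with the new values to form the elements of the new RDD.
flatMap(function)

Similar to map, except that map produces exactly one output element per input element, while flatMap can produce several output elements per input element, which are flattened into the RDD. Example function: lambda x: [(x[0], y) for y in x[1]]

It can be seen as the inverse of an aggregation.
flatMapValues(function)

flatMapValues is similar to mapValues, but it applies to the values of an RDD of key-value pairs: the function maps each element's value to a sequence of values, and each of these is paired with the original key to form a series of new key-value pairs.

It can be seen as the inverse of groupByKey.
cache and persist
By default an RDD that is used several times is recomputed each time. If you know an RDD will be reused, call .cache() to keep its computed partitions in memory. This uses more memory but usually reduces running time.

The related persist() function accepts a storage level, such as disk only, or memory first with spill to disk when memory is insufficient.

Note that neither is an action; both are still lazy.
collect
collect copies the data from all workers into the driver's memory, so it needs a lot of driver memory and should be used with care.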
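Globally unique IDs
Some applications need to number every record of an RDD, filter or otherwise process the numbers, and later look up the corresponding rows by number. For example, sample (id, probability) pairs according to their probability with a sampling function, then join back on id.

Depending on the need, IDs can be consecutive or non-consecutive. Some options are listed below.
Consecutive IDs:
rdd.zipWithIndex()
Spark SQL: select row_number() over(order by xx)
pyspark.sql.functions.row_number().over(pyspark.sql.Window.orderBy(ColName))
Non-consecutive IDs:
rdd.zipWithUniqueId()
pyspark.sql.functions.monotonically_increasing_id()
Hive SQL: select java_method('java.util.UUID','randomUUID'); whether this works in Spark remains to be verified
zipWithIndex returns tuples of the original element and its index. The operation has to traverse the data twice: the first pass counts the rows in each partition in order to assign each partition a starting id, and the second pass numbers the rows within each partition from that starting id. It is therefore best to cache() the RDD beforehand. The number of partitions is unchanged, ids start from 0, and the order follows the original order within each partition.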

sc.parallelize(["a", "b", "c", "d"], 3).zipWithIndex().collect()

[('a', 0), ('b', 1), ('c', 2), ('d', 3)]
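The two passes can be simulated on plain Python lists. This is a sketch of the numbering scheme, not Spark code; how the four elements are split into three partitions is an assumption made for the example.

```python
from itertools import accumulate


def zip_with_index(partitions):
    """Simulate zipWithIndex over a list of partitions."""
    # Pass 1: count the rows in each partition to get its starting index.
    sizes = [len(part) for part in partitions]
    starts = [0] + list(accumulate(sizes))[:-1]
    # Pass 2: number the rows inside each partition from its start offset.
    return [
        [(row, start + i) for i, row in enumerate(part)]
        for part, start in zip(partitions, starts)
    ]


indexed = zip_with_index([["a"], ["b"], ["c", "d"]])
flat = [pair for part in indexed for pair in part]
```

Flattening the partitions in order gives the same pairs as the collect() output above.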
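zipWithUniqueId numbers the rows differently from zipWithIndex and only needs a single pass over the data.

The first element of each partition gets the partition index as its unique id.

The N-th element of each partition gets (unique id of the previous element) + (total number of partitions of the RDD).

This amounts to numbering the rows of the partitions in rotation. Because partitions usually hold different amounts of data, the ids have gaps. If the partitions are perfectly balanced, the ids are fully consecutive.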
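Put as a formula, the k-th element (counting from 0) of partition i out of n partitions receives the id i + k * n. A small simulation on plain lists, with an assumed split of five elements into three partitions, shows where the gaps come from.

```python
def zip_with_unique_id(partitions):
    """Simulate zipWithUniqueId: item k of partition i gets id i + k * n."""
    n = len(partitions)
    return [
        [(row, i + k * n) for k, row in enumerate(part)]
        for i, part in enumerate(partitions)
    ]


unique = zip_with_unique_id([["a"], ["b", "c"], ["d", "e"]])
ids = sorted(uid for part in unique for _, uid in part)
```

Here id 3 is never assigned because partition 0 holds only one element, so the ids are unique but not consecutive.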
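The SQL row_number() function adds a monotonically increasing, consecutive column after sorting by some column. Afterwards the number of partitions becomes 1 and the id column starts from 1. Because the order by without a partition clause moves all rows to a single executor, it is very slow, hard to accept for large data, and may cause OOM.
pyspark.sql.functions.monotonically_increasing_id() guarantees monotonically increasing (but not consecutive) ids. It puts the partition ID in the upper 31 bits of a 64-bit long integer and the row number within the partition in the lower 33 bits. The number of partitions is unchanged.

Usage: df.select(monotonically_increasing_id().alias('id'))

Or: df.withColumn("monotonically_increasing_id", F.monotonically_increasing_id())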
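The bit layout explains why the generated ids jump by large amounts between partitions. The following sketch reproduces the layout with plain integer arithmetic; monotonic_id and split_id are illustrative helper names, not PySpark functions.

```python
def monotonic_id(partition_id, row_in_partition):
    """Compose an id: partition id in the upper 31 bits, row number in the lower 33."""
    assert 0 <= partition_id < (1 << 31)
    assert 0 <= row_in_partition < (1 << 33)
    return (partition_id << 33) | row_in_partition


def split_id(value):
    """Recover (partition_id, row_in_partition) from a generated id."""
    return value >> 33, value & ((1 << 33) - 1)


first_of_p0 = monotonic_id(0, 0)
first_of_p1 = monotonic_id(1, 0)
third_of_p1 = monotonic_id(1, 2)
```

The first row of partition 1 gets 8589934592 (2 to the power 33), so ids are consecutive only within a partition.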
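If the data is small enough that OOM on a single node is not a concern, and you want consecutive increasing ids without changing the order within each partition, combine row_number with monotonically_increasing_id:

F.row_number().over(Window.orderBy(F.col('monotonically_increasing_id')))
In databases such as Microsoft SQL Server, ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) can be used to avoid a sort that would reorder the rows; NULL can be replaced with any constant.
If monotonically increasing ids are not required, a UUID, widely used in the Java world and in industry, is enough: globally unique, distributed, and fast.
A wrapper function that uses zipWithIndex to add a consecutive, monotonically increasing column to a DataFrame: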
from pyspark.sql import types as T
def dfZipWithIndex(df, offset=0, idxName="rowId"):
    '''
        Enumerates dataframe rows in native order, like rdd.zipWithIndex(),
        but on a dataframe and preserves schema
        :param df: source dataframe
        :param offset: adjustment to zipWithIndex()'s index
        :param idxName: name of the index column
    '''
    # spark is an existing SparkSession
    new_schema = T.StructType(
                    [T.StructField(idxName,T.LongType(),True)]    # new added field in front
                    + df.schema.fields                            # previous schema
                )
    zipped_rdd = df.rdd.zipWithIndex()
    new_rdd = zipped_rdd.map(lambda tup: ([tup[1] + offset] + list(tup[0])))
    return spark.createDataFrame(new_rdd, new_schema)
https://stackoverflow.com/questions/30304810/dataframe-ified-zipwithindex
https://towardsdatascience.com/adding-sequential-ids-to-a-spark-dataframe-fa0df5566ff6
randomUUID： https://stackoverflow.com/a/58625717
https://stackoverflow.com/questions/44105691/row-number-without-order-by
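The $"string" syntax
Spark statement (Scala): df.select($"name", $"age" + 1).show()

The $"string" form is not built-in Scala or Python syntax; it is a string interpolator defined by Spark, available in Scala only:
See the Scala source: org.apache.spark.sql.SQLImplicits

It returns a Column object, which can be used in expressions that transform the column's values.
  implicit class StringToColumn(val sc: StringContext) {
    def $(args: Any*): ColumnName = {
      new ColumnName(sc.s(args: _*))
    }
  }
If you only select a column by name without building an expression on it, the $ is not needed, for example

dataframe.select("columnname").show or

dataframe.select(col("columnname")).show
Reference: Stack Overflow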
Reading text files
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext
textRDD1 = sc.textFile("hobbit.txt")
textRDD2 = spark.read.text('hobbit.txt').rdd
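Differences between the two approaches:

sc.textFile(String path, int minPartitions) returns RDD[String]

spark.read.text(String path) returns a Dataset[Row], i.e. a DataFrame, with a single column named value by default.
The path can be an HDFS file or directory (a directory must not contain subdirectories, otherwise an error reports that it is not a file), which is the same requirement as for Hadoop streaming input. Because Hadoop supports glob expansion in paths, /path/to/dir_of_dir/* can be used to specify multiple directories.
Multiple input paths

Option 1: separate paths with commas, optionally combined with wildcards, to read everything into one RDD:

sc.textFile("/my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file")
Option 2: union several RDDs: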
r1 = sc.textFile("xxx1")
r2 = sc.textFile("xxx2")
rdds = [r1, r2, ...]
bigRdd = sc.union(rdds)
Spark SQL
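Spark SQL supports ANSI SQL syntax and introduces the DataFrame at the API level. Reference: official API documentation.
Run $SPARK_HOME/bin/spark-sql to enter the interactive SQL shell.
The following describes how SparkContext, SparkSession, SQLContext and HiveContext relate to and differ from one another.

Reference: cnblogs
SparkContext
The driver program uses the SparkContext to connect to and communicate with the cluster; it helps run Spark jobs and coordinates with resource managers such as YARN or Mesos.
Through the SparkContext you can access other contexts, such as SQLContext and HiveContext.
Through the SparkContext you can set configuration parameters for a Spark job.
spark-shell creates a SparkContext from the configuration at startup and assigns it to the variable sc.

To create a SparkContext in code, configure it with a SparkConf.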
//set up the spark configuration
val sparkConf = new SparkConf().setAppName("hirw").setMaster("yarn")
//get SparkContext using the SparkConf
val sc = new SparkContext(sparkConf)
SQLContext/HiveContext
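SQLContext is the entry point to Spark SQL. HiveContext is the entry point to Hive; it inherits from SQLContext and adds Hive-specific functionality.

How to create them:
// scala,   sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// or, with Hive support:
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
# python
from pyspark import HiveContext
sqlContext = HiveContext(sc)
Once you have a SQLContext, you can start working with DataFrames, Datasets, and so on.
SparkSession
SparkSession was introduced in Spark 2.0 and unifies access to the different contexts. SQLContext and HiveContext are kept for backward compatibility. On newer versions, SparkSession is recommended.
The four evolved as follows: SparkContext -> [ (SQLContext -> HiveContext) / SparkSession ]
PySpark example of creating a SparkSession: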
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql import functions as F
from pyspark.sql import types as T
spark = SparkSession \
    .builder \
    .appName("DemoApp") \
    .config("spark.some.config.option", "some-value") \
    .enableHiveSupport() \
    .getOrCreate()
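# config is optional and sets configuration properties
# in the interactive shells of Spark >= 2.0, Hive support is typically enabled by default; in a standalone application, call enableHiveSupport() explicitly
# sc = spark.sparkContext returns the SparkContext object
Running SQL statements
sql1 = 'select count(1) as cnt from xx'
r = sqlContext.sql(sql1) # returns a DataFrame
r.show() # displays the query result like a MySQL client
r = r.repartition(1).cache() # cache in memory
cnt = r.first()["cnt"]  # value of the cnt column in the first row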
DataFrame
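In data science, Python and R share a common data abstraction: the DataFrame.

In Python, the DataFrame concept corresponds to the Pandas library, which works together with very popular libraries such as Numpy, Matplotlib and scikit-learn.

However, pandas is a single-machine tool, while the Spark DataFrame is its big-data counterpart. The concepts are similar and the syntax is close, but there are differences.

A pandas DataFrame is mutable, whereas Spark DataFrames and RDDs are both immutable. In addition, a Spark DataFrame does not maintain row numbers.
Many DataFrame operators share names with RDD operators; the parameters differ, but the functionality is similar.
DataFrame vs RDD
Spark's DataFrame belongs to Spark SQL and can be converted to and from an RDD. Because both are immutable, converting between them, or creating a new DataFrame by modifying an existing one, can be quite expensive.
RDD to DataFrame: a schema is required (a list of column names or a schema object; df.schema returns the schema of an existing DataFrame)
rdd.toDF(schema)
spark.createDataFrame(rdd, schema)
DataFrame to RDD: df.rdd returns a pyspark.RDD of Row objects.
Example code for converting in both directions: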
df_schema = df.schema
df_new_schema = df_schema.add("tail", T.ArrayType(T.IntegerType())) # add a new column to the schema
df_new_rdd = df.rdd.mapPartitions(lambda x: add_col(x, data_list))  # add_col is a user-defined function that appends the column data
df_new = spark.createDataFrame(df_new_rdd, df_new_schema)
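How to choose between DataFrame and RDD?

Compared with DataFrames, RDDs are lower level. When developing an application, use a DataFrame whenever it meets the need; one reason is that the framework optimizes the DataFrame execution plan, whereas a hand-written RDD pipeline may not be optimal. Timing comparisons have reported DataFrame code running faster on average than the equivalent RDD code.
Modifying a Column
withColumn() is a DataFrame function used to add a new column, change the values of an existing column, convert a column's data type, or derive a new column from existing ones. It is a transformation, so it only executes when an action is called.
withColumn(colName : String, col : Column) : DataFrame

It creates the new column colName from col and returns a new DataFrame. col can be built with functions such as F.concat().

If colName is the same as an existing column name, the existing column's values are replaced.
Use withColumn to process a column and cast(pyspark.sql.types) to convert its type, e.g. StringType(), DoubleType().
For example, to change the type of a DataFrame column: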
>>> rawdata.printSchema()
 |-- house name: string (nullable = true)
 |-- price: string (nullable = true)
>>> rawdata=rawdata.withColumn('price',rawdata['price'].cast("float").alias('price'))
# alias names the new column; it should be unnecessary here
>>> rawdata.printSchema()
 |-- house name: string (nullable = true)
 |-- price: float (nullable = true)
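Add a column with a constant value: withColumn('label', F.lit(0))

You can also use select('col1', 'col2', .., F.lit(0).alias('label'))
Renaming a column

withColumnRenamed(existing, new)
DataFrame "select as" usage
df.selectExpr("Name as name", "Age as age")
df.select(F.col("Name").alias("name"), F.col("Age").alias("age"))
Global sort:
df = df.orderBy(F.rand())
rdd = rdd.sortBy(lambda x: random())

Distribute the data to partitions first, then sort the data within each partition:

rdd = rdd.repartitionAndSortWithinPartitions(numPartitions=100, keyfunc=lambda x: random())
Note that sortByKey and repartitionAndSortWithinPartitions require the elements of the input RDD to be (key, value) tuples, and keyfunc is applied to the key; sortBy accepts arbitrary elements and takes the sort key from keyfunc.

So if the RDD was read with sc.textFile(xx), where each element is a string, map it to tuples first, e.g. line -> (line, 1) or line -> line.split(separator). After sorting, map again to the required output format, e.g. to restore the line: x -> x[0].

Without converting to tuples, sorting directly with one of the key-based functions fails with: ValueError: too many values to unpack (expected 2)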
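SQL integration
As mentioned above, the DataFrame belongs to Spark SQL and integrates seamlessly with SQL statements. Some tasks are more concise when written in SQL, such as aggregation and windowing.

Because HiveQL supports a broader SQL syntax, HiveContext rather than SQLContext is commonly used when the data is stored in Hive tables.
First, register the DataFrame as a temporary view with a given name: DataFrame.createOrReplaceTempView(table_name)

Note: older versions used DataFrame.registerTempTable(table_name), which has been deprecated since Spark 2.0.0.

The view lives as long as the SparkSession that created it; when it is no longer needed, drop it explicitly with spark.catalog.dropTempView(table_name).

After registration, SQL can be run against it:

sqlContext.sql("SELECT ... FROM table_name ...") returns a DataFrame.
filter
where() is an alias of filter() and is used for filtering.

df.filter(condition) works like filter on an RDD: the rows for which condition is true are returned.

condition has two forms: a Column of types.BooleanType, or a SQL expression string:
A column can be referenced in two ways: (1) df.colname (2) F.col("colname")

For example: df.colname != x. Multiple expressions can be combined with and, or, not (& | ~).
SQL expression, for example df.filter("colname <> x"); membership tests use df.colname.isin([...])
Other available functions: the string matching functions startswith(), endswith(), contains()

array_contains() tests whether an array contains an element
like, rlike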
# like - SQL LIKE pattern
df2.filter(df2.name.like("%rose%")).show()
# rlike - SQL RLIKE pattern (LIKE with Regex)
#This check case insensitive
df2.filter(df2.name.rlike("(?i)^*rose$")).show()
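Reference: https://sparkbyexamples.com/pyspark/pyspark-where-filter/
union
DataFrame.unionAll(other) is an alias of DataFrame.union(other).

It behaves like UNION ALL in SQL: the two DataFrames must have the same number of columns, which are matched by position, and the partitions are simply concatenated without merging or shuffling. To reduce the number of partitions without reordering the data, use coalesce.

Several RDDs can be unioned with bigRdd = sc.union([r1, r2, ...])

A DataFrame union only combines two DataFrames at a time, but it can be combined with reduce to merge several.
from functools import reduce  # needed in Python 3.x; reduce is a builtin in Python 2
from pyspark.sql import DataFrame
# Option 1
bigDf = reduce(DataFrame.unionAll, [df1, df2, df3])
# Option 2
def unionAll(*dfs):
    return reduce(DataFrame.unionAll, dfs)
bigDf = unionAll(df1, df2, df3)
Reference: Stack Overflow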
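join example: df_neg = df_neg.join(url_pw_df, on = [url_pw_df.url_id == F.col('choice')], how='inner')

The number of output partitions of a join depends on spark.sql.shuffle.partitions, whose default is 200; it controls the number of partitions used when shuffling data for joins and aggregations. It is a Spark SQL specific setting and, besides being set in the configuration, can be changed in code:

spark.conf.set("spark.sql.shuffle.partitions", 200)
The number of shuffle partitions is hard to tune. If the data volume does not fluctuate much, consider configuring off-heap memory to ease memory pressure: more data fits in a limited amount of memory, and performance is somewhat better. Off-heap memory management comes from Project Tungsten.

Main settings:
spark.memory.offHeap.enabled enables off-heap memory
spark.memory.offHeap.size in bytes; do not exceed the memory of a single executor
sort/orderBy
orderBy is an alias of sort(). Like RDD.sortBy(keyfunc, ascending=True, numPartitions=None), it performs a global sort: the data is range-partitioned across executors and each partition is sorted, so the partitions taken in order are globally sorted. Do not confuse it with sort by in Spark SQL/Hive, which sorts within partitions. The DataFrame counterpart of SQL sort by is sortWithinPartitions(), which sorts only within each partition rather than globally and is much more efficient.

Example: df.orderBy('country', 'age')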
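How Spark sorts big data globally:

Running an external multi-way merge sort entirely on a single node would be very inefficient. Spark's design for distributed sorting is as follows (reference: 排序算子sortByKey):
1. The Driver submits a sampling job in which the Executors sample the data of every partition; sampling is one full scan of the data.
2. The Driver collects the samples and the data volume of each partition, and assigns key ranges weighted by data volume.
3. The Driver starts the sort by submitting ShuffleMapTasks; each Executor assigns its rows to partitions by range and writes them directly to shuffle files.
4. The Driver submits ResultTasks; the Executors read the shuffle files belonging to the same partition, merge them (values of equal keys are not combined) and sort them.
5. After receiving the ResultTask results, the Driver concatenates the data of the different partitions.
Because all worker nodes take part in the sort, it is reasonably fast; memory is the main thing to watch.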
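The steps above can be imitated on a single machine to see why the result is globally ordered. This is a simplified sketch, not Spark's implementation: partitions are plain lists, the boundaries come from a fixed-size unweighted sample, and range_sort is a name invented for the example.

```python
import bisect
import random


def range_sort(partitions, num_out, sample_per_part=20, seed=0):
    """Simulate a global sort: sample, range-partition, then sort locally."""
    rng = random.Random(seed)
    # Sample each partition and pick range boundaries from the sorted sample.
    sample = []
    for part in partitions:
        sample.extend(rng.sample(part, min(sample_per_part, len(part))))
    sample.sort()
    bounds = [sample[len(sample) * i // num_out] for i in range(1, num_out)]
    # "Shuffle": send every row to the output partition that owns its range.
    out = [[] for _ in range(num_out)]
    for part in partitions:
        for row in part:
            out[bisect.bisect_right(bounds, row)].append(row)
    # Each output partition is sorted independently (in parallel in Spark).
    for part in out:
        part.sort()
    return out


data_rng = random.Random(42)
data = [[data_rng.randrange(1000) for _ in range(50)] for _ in range(4)]
sorted_parts = range_sort(data, num_out=3)
# Concatenating the partitions in order yields the global order.
flattened = [row for part in sorted_parts for row in part]
```

Because every key in partition i is no larger than any key in partition i + 1, no final merge across partitions is needed; how evenly the rows are spread depends on the quality of the sample.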
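The DataFrame equivalent of reduceByKey
For a key-value pair RDD, rdd.reduceByKey(lambda a,b: a+b) aggregates the values of equal keys. DataFrames offer a similar API (the first two snippets use Scala's $"..." syntax; in PySpark write df.groupBy("key").agg(F.sum("value").alias("value"))):

If the output column name does not matter:

df.groupBy($"key").sum("value")

To keep the value column name unchanged, use agg:

df.groupBy($"key").agg(sum($"value").alias("value"))
A SQL query also works:

df.createOrReplaceTempView("df")

sqlContext.sql("SELECT key, SUM(value) AS value FROM df GROUP BY key")
collect_list
F.collect_list(col) combined with agg turns a group of values into a list; collect_set gives a list without duplicates.

For example: df.groupby("id").agg(F.collect_list(col1), F.collect_set(col2))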
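Appendix: Spark ML examples
Spark MLlib includes many classical statistical machine learning methods. When no special or more complex algorithm is needed, MLlib lets you take full advantage of Spark's big-data processing.
TF-IDF
Below is an example that extracts the top 10 keywords of each document with TF-IDF; the Scala version calls the mllib Scala API. Compared with computing TF-IDF on a single machine with libraries such as scikit-learn or gensim, the distributed MLlib is clearly faster on large corpora.
CountVectorizer and HashingTF are both used to count term frequencies. HashingTF is a transformer that hashes each word and counts occurrences per hash index, so hash collisions are possible; CountVectorizer first has to be fitted to build a vocabulary. IDF is an Estimator: it takes the TF feature vectors, computes the inverse document frequencies, and multiplies them with TF to obtain TF-IDF.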
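The computation itself is small enough to sketch in plain Python. The version below is illustrative only: zlib.crc32 is a deterministic stand-in for the hash function Spark uses, the three toy documents are invented, and the IDF uses the smoothed form log((m + 1) / (df + 1)) described in the MLlib documentation, where m is the number of documents and df the number of documents containing the term.

```python
import math
import zlib
from collections import Counter


def hashing_tf(words, num_features):
    """Term frequencies keyed by hashed index; different words may collide."""
    # zlib.crc32 is only a deterministic stand-in for Spark's hash function.
    return Counter(zlib.crc32(w.encode("utf-8")) % num_features for w in words)


def fit_idf(tf_vectors):
    """IDF per index, using the smoothed form log((m + 1) / (df + 1))."""
    m = len(tf_vectors)
    df = Counter(idx for tf in tf_vectors for idx in tf)
    return {idx: math.log((m + 1) / (count + 1)) for idx, count in df.items()}


def tfidf(tf, idf):
    """Multiply each term frequency by the IDF of its index."""
    return {idx: freq * idf[idx] for idx, freq in tf.items()}


docs = [["spark", "rdd", "spark"], ["spark", "python"], ["python", "udf"]]
tfs = [hashing_tf(d, num_features=1 << 18) for d in docs]
idf = fit_idf(tfs)
scores = [tfidf(tf, idf) for tf in tfs]

spark_idx = zlib.crc32(b"spark") % (1 << 18)
```

A smaller num_features, such as the 20 used in the Python example below, makes collisions between different words much more likely, which is the trade-off HashingTF makes for not having to build a vocabulary.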
Python version:
from pyspark.ml.feature import HashingTF,IDF,Tokenizer
# rdd = sc.textFile(fn)...
sentenceData = spark.createDataFrame(rdd).toDF("sentence", "label")
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
tfidf = idfModel.transform(featurizedData)
tfidf.select("features", "label").show()
Scala version:
import org.apache.spark.mllib.feature.{HashingTF, IDF}
val fn = "hdfs:///user/xxx/article.jieba_cut.utf8"
val documents = sc.textFile(fn).map(_.split("\t"))
val prefix = documents.map(_.dropRight(1)) // drop the last column
val doc = documents.map(a=>{val l=a.length; a(l-1).split(" ").toSeq})
// Alternatively, convert to a DataFrame directly when reading the file, e.g.:
// val df = spark.read.option("header", "false").option("delimiter", "\t").csv(fn).toDF("words", "other");
// val words_rdd = df.select("words").rdd.map(row => row.getAs[String](0).split(" ").toSeq)
// HashingTF: count term frequencies
val hashingTF = new HashingTF()
val tf = hashingTF.transform(doc)
tf.cache()
val idf = new IDF().fit(tf)
val tfidf = idf.transform(tf)
// var wordMap = doc.flatMap { row => row.map{ w => (hashingTF.indexOf(w), w) } }.collect().toMap
val wordMap = doc.flatMap(row => row).distinct().map(w=> (hashingTF.indexOf(w), w)).collectAsMap()
val keyWords = tfidf.map { x => { var v=x.toSparse; v.indices.zip(v.values).sortWith((a, b) => { a._2 > b._2 }).take(10).map(x => (wordMap.get(x._1).get, x._2))} }
val fn_out = fn + ".kw"
documents.zip(keyWords).map(e=>{val kw=e._2.map(_._1).mkString(" "); e._1.mkString("\t")+"\t"+kw}).saveAsTextFile(fn_out)
// --------------------------------------
// CountVectorizer: count term frequencies, using the DataFrame-based spark.ml API
val df = doc.toDF("words")
val cvTF = new org.apache.spark.ml.feature.CountVectorizer().setInputCol("words").setOutputCol("rawFeatures").fit(df)
val tf = cvTF.transform(df)
val idf = new org.apache.spark.ml.feature.IDF().setInputCol("rawFeatures").setOutputCol("features").fit(tf)
val tfidf = idf.transform(tf)
val wordMap = cvTF.vocabulary.zipWithIndex.map { case (w, i) => (i, w) }.toMap
val keyWords = tfidf.select("features").rdd.map { x => { var v=x.getAs[org.apache.spark.ml.linalg.SparseVector](0); v.indices.zip(v.values).sortWith((a, b) => { a._2 > b._2 }).take(10).map(x => (wordMap.get(x._1).get, x._2))} }