```bash
# set hadoop conf dir
export HADOOP_CONF_DIR=/usr/lib/hadoop
```
Set SPARK_HOME in interpreter setting page
If you want to use multiple versions of Spark, then you need to create multiple Spark interpreters and set SPARK_HOME separately for each. e.g.

- Create a new Spark interpreter spark33 for Spark 3.3 and set its SPARK_HOME in the interpreter setting page.
- Create a new Spark interpreter spark34 for Spark 3.4 and set its SPARK_HOME in the interpreter setting page.
Besides setting SPARK_HOME in the interpreter setting page, you can also use inline generic configuration to put the configuration together with the code for more flexibility. e.g.
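For example, the first paragraph of a note can set SPARK_HOME via the generic inline configuration. A minimal sketch, assuming %spark.conf takes whitespace-separated property name and value pairs and that the path below is only an illustration:

```
%spark.conf
SPARK_HOME /usr/lib/spark
```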
Set master
After setting SPARK_HOME, you need to set the spark.master property either in the interpreter setting page or via inline configuration. The value may vary depending on your Spark cluster deployment type.
For example,

- local[*] in local mode
- spark://master:7077 in standalone cluster
- yarn-client in Yarn client mode (not supported in Spark 3.x; refer below for how to configure yarn-client in Spark 3.x)
- yarn-cluster in Yarn cluster mode (not supported in Spark 3.x; refer below for how to configure yarn-cluster in Spark 3.x)
- mesos://host:5050 in Mesos cluster
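For example, an inline configuration paragraph for a standalone cluster might look like the following sketch (the master URL is only an illustration):

```
%spark.conf
spark.master spark://master:7077
```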
That's it. In this way, Zeppelin works with any version of Spark and any deployment type without rebuilding Zeppelin.
For further information about Spark & Zeppelin version compatibility, please refer to the "Available Interpreters" section on the Zeppelin download page.
Note that without setting SPARK_HOME, Spark runs in local mode with the included version of Spark. The included version may vary depending on the build profile, and it has limited functionality, so it is always recommended to set SPARK_HOME.
Yarn client mode and local mode run the driver on the same machine as the Zeppelin server, which is risky in production because the machine may run out of memory when many Spark interpreters are running at the same time. So we suggest you only allow yarn-cluster mode by setting zeppelin.spark.only_yarn_cluster in zeppelin-site.xml.
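For example, in zeppelin-site.xml:

```xml
<property>
  <name>zeppelin.spark.only_yarn_cluster</name>
  <value>true</value>
</property>
```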
Configure yarn mode for Spark 3.x
Specifying yarn-client or yarn-cluster in spark.master is no longer supported in Spark 3.x; instead you need to use spark.master and spark.submit.deployMode together.
| Mode | spark.master | spark.submit.deployMode |
|---|---|---|
| Yarn Client | yarn | client |
| Yarn Cluster | yarn | cluster |
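For example, yarn-cluster mode for Spark 3.x can be configured either in the interpreter setting page or via an inline configuration paragraph such as the following sketch:

```
%spark.conf
spark.master yarn
spark.submit.deployMode cluster
```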
Interpreter binding mode
The default interpreter binding mode is globally shared, which means all notes share the same Spark interpreter.
So we recommend you use isolated per note mode, which means each note has its own Spark interpreter without affecting the others. But this may exhaust your machine's resources if too many Spark interpreters are created, so we recommend always using yarn-cluster mode in production if you run Spark on a hadoop cluster. And you can use inline configuration via %spark.conf in the first paragraph to customize your Spark configuration.
You can also choose scoped mode. In scoped per note mode, Zeppelin creates a separate Scala compiler/Python shell for each note but shares a single SparkContext/SQLContext/SparkSession.
SparkContext, SQLContext, SparkSession, ZeppelinContext
SparkContext, SQLContext, SparkSession (for Spark 2.x and 3.x) and ZeppelinContext are automatically created and exposed as the variables sc, sqlContext, spark and z, respectively, in the Scala, Python and R environments.
Note that the Scala, Python and R environments share the same SparkContext, SQLContext, SparkSession and ZeppelinContext instances.
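For example, a minimal Scala sketch that uses the pre-created variables (no extra setup is assumed or needed):

```scala
%spark
// sc, sqlContext, spark and z are injected by Zeppelin; do not create them yourself
println(sc.version)                                        // SparkContext
val df = spark.range(0, 3).toDF("id")                      // SparkSession
df.createOrReplaceTempView("ids")
sqlContext.sql("select count(*) as cnt from ids").show()   // SQLContext
```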
Yarn Mode
Zeppelin supports both yarn client and yarn cluster mode (yarn cluster mode is supported from 0.8.0). For yarn mode, you must specify SPARK_HOME & HADOOP_CONF_DIR.
Usually you only have one hadoop cluster, so you can set HADOOP_CONF_DIR in zeppelin-env.sh, which is applied to all Spark interpreters. If you want to use Spark against multiple hadoop clusters, then you need to define HADOOP_CONF_DIR in the interpreter setting or via inline generic configuration.
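For example, an inline configuration paragraph for an interpreter that targets a second hadoop cluster might look like the following sketch (the HADOOP_CONF_DIR path is only an illustration):

```
%spark.conf
HADOOP_CONF_DIR /etc/hadoop/cluster2/conf
spark.master yarn
spark.submit.deployMode cluster
```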
K8s Mode
Regarding how to run Spark on K8s in Zeppelin, please check this doc.
PySpark
There are 2 ways to use PySpark in Zeppelin:
- Vanilla PySpark
- IPySpark
Vanilla PySpark (Not Recommended)
The vanilla PySpark interpreter is almost the same as the vanilla Python interpreter, except that the Spark interpreter injects SparkContext, SQLContext and SparkSession via the variables sc, sqlContext and spark.
By default, Zeppelin uses IPython in %spark.pyspark when IPython is available (Zeppelin checks whether IPython's prerequisites are met); otherwise it falls back to the vanilla PySpark implementation.
IPySpark (Recommended)
You can use IPySpark explicitly via %spark.ipyspark. The IPySpark interpreter is almost the same as the IPython interpreter, except that the Spark interpreter injects SparkContext, SQLContext and SparkSession via the variables sc, sqlContext and spark.
For the IPython features, you can refer to the Python Interpreter doc.
SparkR
Zeppelin supports SparkR via %spark.r, %spark.ir and %spark.shiny. Here is the configuration for the SparkR interpreter.
| Spark Property | Default | Description |
|---|---|---|
| zeppelin.R.cmd | R | R binary executable path. |
| zeppelin.R.knitr | true | Whether to use knitr or not. (It is recommended to install knitr and use it in Zeppelin.) |
| zeppelin.R.image.width | 100% | R plotting image width. |
| zeppelin.R.render.options | out.format = 'html', comment = NA, echo = FALSE, results = 'asis', message = F, warning = F, fig.retina = 2 | R plotting options. |
| zeppelin.R.shiny.iframe_width | 100% | IFrame width of the Shiny app. |
| zeppelin.R.shiny.iframe_height | 500px | IFrame height of the Shiny app. |
| zeppelin.R.shiny.portRange | : | A Shiny app launches a web app on some port; this property specifies the port range in the format 'start:end', e.g. '5000:5001'. The default is ':', which means any port. |
Refer to the R doc for how to use R in Zeppelin.
SparkSql
The Spark SQL interpreter shares the same SparkContext/SparkSession with the other Spark interpreters. That means any table registered in Scala, Python or R code can be accessed by Spark SQL.
For example:
```scala
%spark
case class People(name: String, age: Int)
val df = spark.createDataFrame(List(People("jeff", 23), People("andy", 20)))
df.createOrReplaceTempView("people")
```

```sql
%spark.sql
select * from people
```
You can write multiple SQL statements in one paragraph, separated by semicolons.
SQL statements in one paragraph run sequentially, but SQL statements in different paragraphs can run concurrently with the following configuration:

- Set zeppelin.spark.concurrentSQL to true to enable concurrent SQL execution; underneath, Zeppelin switches Spark to the fair scheduler. Also set zeppelin.spark.concurrentSQL.max to control the maximum number of SQL statements running concurrently (see the inline configuration sketch below).
- Configure pools by creating fairscheduler.xml under your SPARK_CONF_DIR; check the official Spark doc Configuring Pool Properties.
- Set the pool property via a paragraph-local property, e.g.
```
%spark(pool=pool1)
sql statement
```
This pool feature is also available for all versions of Spark in Scala and PySpark; for SparkR, it is only available starting from Spark 2.3.0.
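As a sketch, assuming the generic inline configuration (%spark.conf) in the first paragraph of the note can also carry these interpreter properties, the concurrency settings above could be written as follows (the max value of 10 is only an illustration):

```
%spark.conf
zeppelin.spark.concurrentSQL true
zeppelin.spark.concurrentSQL.max 10
```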
Dependency Management
For the Spark interpreter, it is not recommended to use Zeppelin's Dependency Management for managing third-party dependencies (%spark.dep is removed from Zeppelin 0.9 as well). Instead, you should set the standard Spark properties as follows:
| Spark Property | Spark Submit Argument | Description |
|---|---|---|
| spark.files | --files | Comma-separated list of files to be placed in the working directory of each executor. Globs are allowed. |
| spark.jars | --jars | Comma-separated list of jars to include on the driver and executor classpaths. Globs are allowed. |
| spark.jars.packages | --packages | Comma-separated list of Maven coordinates of jars to include on the driver and executor classpaths. The coordinates should be groupId:artifactId:version. If spark.jars.ivySettings is given, artifacts will be resolved according to the configuration in the file; otherwise artifacts will be searched for in the local maven repo, then maven central and finally any additional remote repositories given by the command-line option --repositories. |
As they are general Spark properties, you can set them via inline configuration, in the interpreter setting page, or in zeppelin-env.sh via the environment variable SPARK_SUBMIT_OPTIONS.
For example:

```bash
export SPARK_SUBMIT_OPTIONS="--files <my_file> --jars <my_jar> --packages <my_package>"
```
Note that SPARK_SUBMIT_OPTIONS is deprecated and will be removed in a future release.
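Alternatively, you can set these properties via inline configuration in the first paragraph of a note; the Maven coordinate below is only an illustration:

```
%spark.conf
spark.jars.packages org.apache.commons:commons-lang3:3.12.0
```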
ZeppelinContext
Zeppelin automatically injects ZeppelinContext as the variable z in your Scala/Python environment. ZeppelinContext provides some additional functions and utilities.
See Zeppelin-Context for more details. For the Spark interpreter, you can use z to display a Spark Dataset/DataFrame.
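For example, a minimal Scala sketch (the DataFrame here is just an illustration):

```scala
%spark
// build a small DataFrame and render it with Zeppelin's table/chart UI via z
val df = spark.createDataFrame(Seq(("jeff", 23), ("andy", 20))).toDF("name", "age")
z.show(df)
```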
Setting up Zeppelin with Kerberos
The logical setup involves Zeppelin, a Kerberos Key Distribution Center (KDC), and Spark on YARN.
There are several ways to make Spark work with a kerberos-enabled hadoop cluster in Zeppelin.

- Share one single hadoop cluster. In this case you just need to specify zeppelin.server.kerberos.keytab and zeppelin.server.kerberos.principal in zeppelin-site.xml; the Spark interpreter will use these settings by default.
- Work with multiple hadoop clusters. In this case you can specify spark.yarn.keytab and spark.yarn.principal to override zeppelin.server.kerberos.keytab and zeppelin.server.kerberos.principal.
Configuration Setup
On the server where Zeppelin is installed, install the Kerberos client modules and configuration (krb5.conf). This is to make the server communicate with the KDC.
Add the two properties below to the Spark configuration ([SPARK_HOME]/conf/spark-defaults.conf):

- spark.yarn.principal
- spark.yarn.keytab
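For example (the principal and keytab path below are placeholders for illustration only):

```properties
# [SPARK_HOME]/conf/spark-defaults.conf
spark.yarn.principal    zeppelin@EXAMPLE.COM
spark.yarn.keytab       /etc/security/keytabs/zeppelin.service.keytab
```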
User Impersonation
In yarn mode, the user who launches the Zeppelin server is used to launch the Spark yarn application. This is not a good practice.
Most of the time, you will enable shiro in Zeppelin and want the login user to submit the Spark yarn app. For this purpose, you need to enable user impersonation for more security control. To enable user impersonation, follow these steps.
Step 1. Enable user impersonation in hadoop's core-site.xml. E.g. if you are using the user zeppelin to launch Zeppelin, add the following to core-site.xml, then restart both hdfs and yarn.
```xml
<property>
  <name>hadoop.proxyuser.zeppelin.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.zeppelin.hosts</name>
  <value>*</value>
</property>
```
Step 2. Enable interpreter user impersonation in the Spark interpreter's setting. (Enable shiro first, of course.)
Step 3 (Optional). If you are using a kerberos cluster, then you need to set zeppelin.server.kerberos.keytab and zeppelin.server.kerberos.principal in zeppelin-site.xml to the user (i.e. the user in Step 1) you want to impersonate.
Community
Join our community to discuss with others.