Local mode

Local mode is an excellent way to learn and experiment with Spark. It also provides a convenient development environment for analyses, reports, and applications that you plan to eventually deploy to a multi-node Spark cluster.

To work in local mode, first install a version of Spark for local use. You can do this with the spark_install() function, for example (the Spark version shown below is illustrative; pick one supported by your sparklyr release):
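
    library(sparklyr)

    # Download and install a local copy of Spark for use in local mode
    spark_install(version = "2.4")  # illustrative version, not prescriptive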

Recommended properties

The following are the recommended Spark properties to set when connecting via R:

  • sparklyr.cores.local - Defaults to using all of the available cores. It does not need to be set unless there is a reason to use fewer cores than are available for a given Spark session.

  • sparklyr.shell.driver-memory - The upper limit is the amount of RAM available on the machine minus what is needed for OS operations.

  • spark.memory.fraction - Defaults to 60% of the requested memory per executor. For more information, see the Memory Management Overview page on the official Spark website.

Connection example
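
Below is an illustrative local connection that sets the properties above; the specific values are assumptions, so tune them to your machine:

    library(sparklyr)

    conf <- spark_config()
    conf$`sparklyr.cores.local` <- 4             # assumed core count
    conf$`sparklyr.shell.driver-memory` <- "8G"  # assumed driver memory
    conf$spark.memory.fraction <- 0.6

    sc <- spark_connect(master = "local", config = conf)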

YARN

The Running on YARN page on Spark's official website is the best place to start for configuration reference, and it is worth bookmarking. Both cluster administrators and users can benefit from this document. If Spark is new to the company, the YARN tuning article, courtesy of Cloudera, does a great job of explaining how the Spark/YARN architecture works.

Recommended properties

The following are the recommended Spark properties to set when connecting via R:

  • spark.executor.memory - The maximum possible is managed by the YARN cluster. See the Executor Memory Error section for more information.

  • spark.executor.cores - Number of cores assigned per Executor.

  • spark.executor.instances - Number of executors to start. This property is acknowledged by the cluster if spark.dynamicAllocation.enabled is set to “false”.

  • spark.dynamicAllocation.enabled - Overrides the mechanism that Spark provides to dynamically adjust resources. Disabling it provides more control over the number of executors that can be started, which in turn impacts the amount of storage available for the session. For more information, see the Dynamic Resource Allocation page on the official Spark website.

Client mode

Using yarn-client as the value of the master argument in spark_connect() makes the server where R is running the driver of the Spark session. Here is a sample connection; the property values and spark_home path below are illustrative:
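
    library(sparklyr)

    conf <- spark_config()
    conf$spark.executor.memory <- "2g"   # assumed per-executor memory
    conf$spark.executor.cores <- 2       # assumed cores per executor
    conf$spark.executor.instances <- 3   # assumed executor count
    conf$spark.dynamicAllocation.enabled <- "false"

    sc <- spark_connect(master = "yarn-client",
                        spark_home = "/usr/lib/spark",  # hypothetical path to the cluster's Spark
                        config = conf)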

The server will need to have copies of at least two files: yarn-site.xml and hive-site.xml. There may be other files needed based on your cluster's individual setup.

This is an example of connecting to a Cloudera cluster; the spark_home path below is a common CDH parcel location, shown as an assumption:
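
    library(sparklyr)

    sc <- spark_connect(master = "yarn-client",
                        spark_home = "/opt/cloudera/parcels/CDH/lib/spark",  # typical CDH parcel path, adjust as needed
                        config = spark_config())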

Kerberos

There are two options for authenticating against a kerberized cluster:

  • Authenticate with kinit before connecting, so that a valid ticket is cached for the session.
  • A preferred option may be to use the out-of-the-box integration with Kerberos that the commercial version of RStudio Server offers.
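
A minimal sketch of the kinit route, assuming a keytab is available (the principal and keytab path are hypothetical):

    # Cache a Kerberos ticket from a keytab, then connect as usual
    system2("kinit", args = c("-kt", "/home/analyst/analyst.keytab",
                              "analyst@EXAMPLE.COM"))

    sc <- sparklyr::spark_connect(master = "yarn-client",
                                  config = sparklyr::spark_config())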
Standalone mode

Recommended properties

The following are the recommended Spark properties to set when connecting via R:

The default behavior in Standalone mode is to create one executor per worker, so a cluster with three worker nodes will have three executors set up. The basic properties that can be set are:

  • spark.executor.memory - The requested memory cannot exceed the actual RAM available.

  • spark.memory.fraction - Defaults to 60% of the requested memory per executor. For more information, see the Memory Management Overview page on the official Spark website.

  • spark.executor.cores - The requested cores cannot exceed the number of cores available on each worker.

Dynamic Allocation

If dynamic allocation is disabled, Spark will attempt to assign all of the available cores evenly across the cluster. The property that controls this is spark.dynamicAllocation.enabled.

For example, the Standalone cluster used for this article has 3 worker nodes, each with 14.7GB of RAM and 4 cores. This means there are 12 cores in total (3 workers × 4 cores) and 44.1GB of RAM (3 workers × 14.7GB).

If the spark.executor.cores property is set to 2 and dynamic allocation is disabled, Spark will spawn 6 executors (12 total cores ÷ 2 cores per executor). The spark.executor.memory property should be set so that its value multiplied by 6 (the number of executors) does not exceed the total available RAM. In this case it can safely be set to 7GB, since the total memory requested will be 6 × 7GB = 42GB, under the available 44.1GB.

Connection example
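
Below is an illustrative Standalone connection using the values worked out above (the master URL is a placeholder):

    library(sparklyr)

    conf <- spark_config()
    conf$spark.executor.memory <- "7g"
    conf$spark.executor.cores <- 2
    conf$spark.dynamicAllocation.enabled <- "false"

    sc <- spark_connect(master = "spark://master-node:7077",  # placeholder master URL
                        config = conf)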