Multiple Spark Worker Instances on a single Node. Why more of less is more than less.

Uli Bethke Big Data, Hadoop, Spark

If you are running Spark in standalone mode on memory rich nodes it can be beneficial to have multiple worker instances on the same node as a very large heap size has two disadvantages:

- Garbage collector pauses can hurt throughput of Spark jobs.
- Heap size of >32 GB can't use CompressedOoops. So 35 GB is actually less than 32 GB.

Mesos and YARN can, out of the box, support packing multiple, smaller executors onto the same physical host, so requesting smaller executors doesn’t mean your application will have fewer overall resources. In Spark’s Standalone mode each worker can have only a single executor. This limitation will likely be removed in Spark 1.4.0. For more information refer to the issue SPARK-1706.

To avoid this issue it is possible to launch multiple Spark worker instances on a single node. The number of worker instances is controlled by the environment variable SPARK_WORKER_INSTANCES. If you do set this, make sure to also set SPARK_WORKER_CORES explicitly to limit the cores per worker, or else each worker will try to use all cores.

Example of command which launches multiple workers on each slave node:

This will launch three worker instances on each node. Each worker instance will use two cores.

Also it is possible to manually start workers and connect to Spark’s master node. To do this use command:

It will start a worker on the node where the command is executed. To start multiple workers just use this command multiple times. Make sure that you correctly configure the environment variables at your node.

Sonra

We are a Big Data company based in Ireland. We are experts in data lake implementations, clickstream analytics, real time analytics, and data warehousing on Hadoop. We can help with your Big Data implementation. Get in touch.