Multiple Spark Worker Instances on a single Node. Why more of less is more than less.

Uli Bethke Big Data, Hadoop, Spark

If you are running Spark in standalone mode on memory-rich nodes, it can be beneficial to run multiple worker instances on the same node, because a very large heap size has two disadvantages:

- Garbage collector pauses can hurt the throughput of Spark jobs.
- Heap sizes above 32 GB can't use CompressedOops, so a 35 GB heap can actually hold fewer objects than a 32 GB heap.

Mesos and YARN can, out of the box, support packing multiple, smaller executors onto the same physical host, so requesting smaller executors doesn’t mean your application will have fewer overall resources. In Spark’s Standalone mode each worker can have only a single executor. This limitation will likely be removed in Spark 1.4.0. For more information refer to the issue SPARK-1706.
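On YARN, for example, requesting several modest executors rather than one huge one might look like the sketch below. The application name, class, and jar are placeholders; the `--num-executors`, `--executor-memory`, and `--executor-cores` flags are standard `spark-submit` options:

```shell
# Hypothetical YARN submission: four modest executors instead of one large one.
# Total resources (4 x 8 GB, 4 x 2 cores) are unchanged, but each JVM heap
# stays well under the 32 GB CompressedOops threshold.
./bin/spark-submit \
  --master yarn-cluster \
  --num-executors 4 \
  --executor-memory 8g \
  --executor-cores 2 \
  --class com.example.MyApp \
  my-app.jar
```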

To avoid this issue it is possible to launch multiple Spark worker instances on a single node. The number of worker instances is controlled by the environment variable SPARK_WORKER_INSTANCES. If you do set this, make sure to also set SPARK_WORKER_CORES explicitly to limit the cores per worker, or else each worker will try to use all cores.

Example of a configuration which launches multiple workers on each slave node:
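One way to do this is via `conf/spark-env.sh` on each slave node; the memory value below is a hypothetical figure you would size to your own hardware:

```shell
# conf/spark-env.sh on each slave node
export SPARK_WORKER_INSTANCES=3   # three worker daemons per node
export SPARK_WORKER_CORES=2       # each worker may use at most two cores
export SPARK_WORKER_MEMORY=24g    # hypothetical per-worker memory; size to your node

# Then (re)start the cluster from the master node:
./sbin/stop-all.sh
./sbin/start-all.sh
```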

This will launch three worker instances on each node. Each worker instance will use two cores.

It is also possible to manually start workers and connect them to Spark's master node. To do this, use the following command:
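A sketch of the manual launch, assuming the default master port of 7077; replace `<master-host>` with your master's hostname:

```shell
# Start one worker on the current node, pointing it at the master.
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://<master-host>:7077
```

When starting a second worker on the same node, give it distinct ports (for example via the worker's `--port` and `--webui-port` options) so the daemons do not collide.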

This will start a worker on the node where the command is executed. To start multiple workers, simply run the command multiple times. Make sure that the environment variables are configured correctly on your node.

Sonra

We are a Big Data company based in Ireland. We are experts in data lake implementations, clickstream analytics, real-time analytics, and data warehousing on Hadoop. We can help with your Big Data implementation. Get in touch.

About the author

Uli Bethke LinkedIn Profile

Uli has 18 years' hands-on experience as a consultant, architect, and manager in the data industry. He frequently speaks at conferences. Uli has architected and delivered data warehouses in Europe, North America, and South East Asia. He is a traveler between the worlds of traditional data warehousing and big data technologies.

Uli is a regular contributor to blogs and books, holds an Oracle ACE award, and chairs the Hadoop User Group Ireland. He is also a co-founder and VP of the Irish chapter of DAMA, a not-for-profit global data management organization. He also co-founded the Irish Oracle Big Data User Group.