Day 18/100

Long Running Spark Streaming Jobs

Fault Tolerance

  • Neither Spark nor YARN was originally designed for real-time processing, but both have adapted to the needs of near-real-time processing.
  • For a Spark job in cluster mode, the driver and the YARN Application Master share a single JVM, so any error in the Spark driver stops our long-running job.
  • To tackle that, we can set spark.yarn.maxAppAttempts higher than its default of 2. For a streaming application that is supposed to run for days and weeks, even a limit of 4 attempts can be exhausted within hours, which is why we also need a validity interval that resets the attempt counter.
  • Another similar and important setting is the number of allowed executor failures; the default of max(2 * num executors, 3) is not sufficient here, so we can set these:
          --conf spark.yarn.maxAppAttempts=4
          --conf spark.yarn.am.attemptFailuresValidityInterval=1h
          --conf spark.yarn.max.executor.failures={8 * num_executors}
          --conf spark.yarn.executor.failuresValidityInterval=1h
    
  • Also, you might want to increase the maximum number of task failures, which defaults to 4: --conf spark.task.maxFailures=8. A combined spark-submit sketch follows this list.
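  • Putting it together, a minimal sketch of a submit command with these settings (the class name and jar are placeholders of my own, not from the original post):
          spark-submit \
            --master yarn \
            --deploy-mode cluster \
            --conf spark.yarn.maxAppAttempts=4 \
            --conf spark.yarn.am.attemptFailuresValidityInterval=1h \
            --conf spark.yarn.max.executor.failures={8 * num_executors} \
            --conf spark.yarn.executor.failuresValidityInterval=1h \
            --conf spark.task.maxFailures=8 \
            --class com.example.StreamingApp \
            streaming-app.jar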

Performance

  • It is suggested that we create a dedicated YARN queue for the streaming job; otherwise its resources might get choked up by other jobs.
  • If the streaming job and its Spark actions are idempotent, you can consider enabling speculation, which helps clear stuck or slow tasks: --conf spark.speculation=true. A short submit sketch follows.
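  • A minimal sketch of pinning the job to its own queue and turning speculation on (the queue name "streaming" is a placeholder and the queue itself has to be defined in the YARN scheduler config):
          spark-submit \
            --master yarn \
            --deploy-mode cluster \
            --queue streaming \
            --conf spark.speculation=true \
            ...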

Logging

  • Standard Spark logging via log4j and YARN log aggregation does not work well for streaming jobs, because the aggregated logs only become accessible once the application finishes, i.e. we would need to kill the job to read its logs.
  • One recommendation is to use the Elasticsearch, Logstash and Kibana (ELK) stack, and to always push at least the following fields:
    • YARN application id
    • YARN container hostname
    • Executor id (driver is 000001, executors start from 000002)
    • YARN attempt (to check number of driver restarts)
  • We need the following properties to ship the custom log4j file (with the ELK appender changes) to the driver and executors:
             --conf spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties \
             --conf spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties \
             --files /path/to/log4j.properties
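  • A minimal sketch of what such a log4j.properties could look like, assuming a Logstash instance listening with its log4j input on a hypothetical host and port:
             # keep console output so YARN log collection still works
             log4j.rootLogger=INFO, console, logstash

             log4j.appender.console=org.apache.log4j.ConsoleAppender
             log4j.appender.console.layout=org.apache.log4j.PatternLayout
             log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n

             # ship serialized log events to Logstash (hypothetical host/port)
             log4j.appender.logstash=org.apache.log4j.net.SocketAppender
             log4j.appender.logstash.RemoteHost=logstash.example.com
             log4j.appender.logstash.Port=4560
             log4j.appender.logstash.ReconnectionDelay=10000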
    

Monitoring

  • The problem with the Spark UI is that it keeps statistics only for a limited number of batches, and after a restart all metrics are gone.
  • One option is to use Graphite and Grafana in combination, but if you already have something like Datadog, we can integrate that as well. [docs.datadoghq.com/integrations/spark]

  • This is also very interesting to try out on my own: logging into ELK and monitoring with either Datadog or Graphite + Grafana. A sketch of the Graphite sink config is below.
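  • For the Graphite option, Spark's built-in metrics system can push to Graphite through a metrics.properties file shipped with the job; the host, port and prefix below are placeholders:
          # shipped with: --files /path/to/metrics.properties \
          #               --conf spark.metrics.conf=metrics.properties
          *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
          *.sink.graphite.host=graphite.example.com
          *.sink.graphite.port=2003
          *.sink.graphite.period=10
          *.sink.graphite.unit=seconds
          *.sink.graphite.prefix=spark-streaming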

Graceful Shutdown

  • When a Spark Streaming application is killed or dies, the shutdown can happen in the middle of a batch, which is not an ideal state to be in.
  • Using something like this might be super helpful: gist.github.com/GrigorievNick/bf920e32f70cb.. (a rough sketch of the general idea is below).
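  • This is not the gist's exact code, just a minimal Scala sketch of the common marker-file pattern; the marker path, batch interval and socket source are placeholders of my own:
          import org.apache.hadoop.fs.{FileSystem, Path}
          import org.apache.spark.SparkConf
          import org.apache.spark.streaming.{Seconds, StreamingContext}

          object GracefulShutdownSketch {
            def main(args: Array[String]): Unit = {
              val conf = new SparkConf().setAppName("streaming-graceful-shutdown")
              val ssc = new StreamingContext(conf, Seconds(30))

              // placeholder source and output; replace with the real pipeline
              ssc.socketTextStream("localhost", 9999).print()

              ssc.start()

              // hypothetical HDFS marker file; creating it signals the job to stop
              val markerPath = new Path("/tmp/streaming-shutdown-marker")
              val checkIntervalMs = 10000L
              var stopped = false
              while (!stopped) {
                // true if the context terminated on its own within the timeout
                stopped = ssc.awaitTerminationOrTimeout(checkIntervalMs)
                if (!stopped) {
                  val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)
                  if (fs.exists(markerPath)) {
                    // let the in-flight batch finish before stopping everything
                    ssc.stop(stopSparkContext = true, stopGracefully = true)
                    stopped = true
                  }
                }
              }
            }
          }
  • There is also spark.streaming.stopGracefullyOnShutdown=true, which asks the shutdown hook to stop the context gracefully, but that only helps when the JVM actually gets to run its shutdown hooks (e.g. on SIGTERM, not SIGKILL).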

Reference - mkuthan.github.io/blog/2016/09/30/spark-str..