Day 18/100

Long Running Spark Streaming Jobs

Fault Tolerance

  • Neither Spark nor YARN was originally designed for real-time processing, but both have adapted to the needs of near-real-time processing.
  • For a Spark job in cluster mode, the driver and the YARN Application Master share a single JVM, so any error in the Spark driver stops our long-running job.
  • To tackle that, we can set spark.yarn.maxAppAttempts higher than its default of 2. For a streaming application that is supposed to run for days and weeks, even a limit of 4 attempts can be exhausted within hours, which is why we also need a validity interval that resets the attempt counter.
  • Another similar and important setting is the number of allowed executor failures; the default of max(2 * num executors, 3) is not sufficient here, so we can set these:
          --conf spark.yarn.maxAppAttempts=4
          --conf spark.yarn.am.attemptFailuresValidityInterval=1h
          --conf spark.yarn.max.executor.failures={8 * num_executors}
          --conf spark.yarn.executor.failuresValidityInterval=1h
    
  • Also, you might want to increase the maximum number of task failures, which defaults to 4: --conf spark.task.maxFailures=8. A combined spark-submit sketch follows this list.
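  • Putting it together, a minimal sketch of a submit command with these settings (the class name and jar are placeholders of my own, not from the original post):
          spark-submit \
            --master yarn \
            --deploy-mode cluster \
            --conf spark.yarn.maxAppAttempts=4 \
            --conf spark.yarn.am.attemptFailuresValidityInterval=1h \
            --conf spark.yarn.max.executor.failures={8 * num_executors} \
            --conf spark.yarn.executor.failuresValidityInterval=1h \
            --conf spark.task.maxFailures=8 \
            --class com.example.StreamingApp \
            streaming-app.jar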

Performance

  • It is suggested that we create a dedicated YARN queue for the streaming job; otherwise its resources might get choked up by other jobs.
  • If the streaming job and its Spark actions are idempotent, you can consider enabling speculation, which helps clear stuck or slow tasks: --conf spark.speculation=true. A short submit sketch follows.
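  • A minimal sketch of pinning the job to its own queue and turning speculation on (the queue name "streaming" is a placeholder and the queue itself has to be defined in the YARN scheduler config):
          spark-submit \
            --master yarn \
            --deploy-mode cluster \
            --queue streaming \
            --conf spark.speculation=true \
            ...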

Logging

  • Standard Spark logging via log4j and YARN log aggregation does not work well for streaming jobs, because the aggregated logs only become accessible once the application finishes, i.e. we would need to kill the job to read its logs.
  • One recommendation is to use the Elasticsearch, Logstash and Kibana (ELK) stack, and to always push at least the following fields:
    • YARN application id
    • YARN container hostname
    • Executor id (driver is 000001, executors start from 000002)
    • YARN attempt (to check number of driver restarts)
  • We need the following properties to ship the custom log4j file (with the ELK appender changes) to the driver and executors:
             --conf spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties \
             --conf spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties \
             --files /path/to/log4j.properties
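  • A minimal sketch of what such a log4j.properties could look like, assuming a Logstash instance listening with its log4j input on a hypothetical host and port:
             # keep console output so YARN log collection still works
             log4j.rootLogger=INFO, console, logstash

             log4j.appender.console=org.apache.log4j.ConsoleAppender
             log4j.appender.console.layout=org.apache.log4j.PatternLayout
             log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n

             # ship serialized log events to Logstash (hypothetical host/port)
             log4j.appender.logstash=org.apache.log4j.net.SocketAppender
             log4j.appender.logstash.RemoteHost=logstash.example.com
             log4j.appender.logstash.Port=4560
             log4j.appender.logstash.ReconnectionDelay=10000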
    

Monitoring

  • The problem with the Spark UI is that it keeps statistics only for a limited number of batches, and after a restart all metrics are gone.
  • One option is to use Graphite and Grafana in combination, but if you already have something like Datadog, we can integrate that as well. [docs.datadoghq.com/integrations/spark]

  • This is also very interesting to try out on my own: logging into ELK and monitoring with either Datadog or Graphite + Grafana. A sketch of the Graphite sink config is below.
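  • For the Graphite option, Spark's built-in metrics system can push to Graphite through a metrics.properties file shipped with the job; the host, port and prefix below are placeholders:
          # shipped with: --files /path/to/metrics.properties \
          #               --conf spark.metrics.conf=metrics.properties
          *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
          *.sink.graphite.host=graphite.example.com
          *.sink.graphite.port=2003
          *.sink.graphite.period=10
          *.sink.graphite.unit=seconds
          *.sink.graphite.prefix=spark-streaming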

Graceful Shutdown

  • When a Spark Streaming application is killed or dies, the shutdown can happen in the middle of a batch, which is not an ideal state to be in.
  • Using something like this might be super helpful: gist.github.com/GrigorievNick/bf920e32f70cb.. (a rough sketch of the general idea is below).
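  • This is not the gist's exact code, just a minimal Scala sketch of the common marker-file pattern; the marker path, batch interval and socket source are placeholders of my own:
          import org.apache.hadoop.fs.{FileSystem, Path}
          import org.apache.spark.SparkConf
          import org.apache.spark.streaming.{Seconds, StreamingContext}

          object GracefulShutdownSketch {
            def main(args: Array[String]): Unit = {
              val conf = new SparkConf().setAppName("streaming-graceful-shutdown")
              val ssc = new StreamingContext(conf, Seconds(30))

              // placeholder source and output; replace with the real pipeline
              ssc.socketTextStream("localhost", 9999).print()

              ssc.start()

              // hypothetical HDFS marker file; creating it signals the job to stop
              val markerPath = new Path("/tmp/streaming-shutdown-marker")
              val checkIntervalMs = 10000L
              var stopped = false
              while (!stopped) {
                // true if the context terminated on its own within the timeout
                stopped = ssc.awaitTerminationOrTimeout(checkIntervalMs)
                if (!stopped) {
                  val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)
                  if (fs.exists(markerPath)) {
                    // let the in-flight batch finish before stopping everything
                    ssc.stop(stopSparkContext = true, stopGracefully = true)
                    stopped = true
                  }
                }
              }
            }
          }
  • There is also spark.streaming.stopGracefullyOnShutdown=true, which asks the shutdown hook to stop the context gracefully, but that only helps when the JVM actually gets to run its shutdown hooks (e.g. on SIGTERM, not SIGKILL).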

Reference - mkuthan.github.io/blog/2016/09/30/spark-str..