Day 17/100

Spark Streaming Misc

Spark Streaming's issues with S3/HDFS as a stream source

With reference to spoddutur.github.io/spark-notes/s3-filesyst.., it is evident that using S3/HDFS as the source for a streaming application works just fine at the POC level. The actual problems show up when we try to deploy it.

  • Identifying the new files to process, using the checkpoint and S3/HDFS API calls, takes a good amount of time: nearly 50% of the batch time if your batch interval is 30 seconds.
  • Not to mention the high cost that the List API calls on S3 will add up to.
  • Also, these filesystems are based on eventual consistency.
  • But we can't simply stop using them as a streaming source. The options are

I think even if we end up using a secondary index like Netflix's s3mper, we would still need to implement our own streaming source in Spark, so sql-streaming-sqs gives us an edge here.
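To make the cost difference concrete, here is a toy sketch in plain Python (no Spark or AWS calls). It is not how Spark or sql-streaming-sqs is implemented; the function names `discover_by_listing` and `discover_by_queue`, the key names, and the file counts are all made up for illustration. It only shows that a listing-based source has to scan every key on each batch, while a notification queue hands over just the new keys.

```python
from collections import deque
from typing import Deque, List, Set, Tuple


def discover_by_listing(all_keys: List[str], seen: Set[str]) -> Tuple[List[str], int]:
    """Listing-based discovery: scan every key and diff against the seen set.

    Returns the new keys and the number of keys scanned in this batch.
    """
    new_keys = [key for key in all_keys if key not in seen]
    seen.update(new_keys)
    return new_keys, len(all_keys)


def discover_by_queue(queue: Deque[str]) -> Tuple[List[str], int]:
    """Queue-based discovery: drain the notifications that arrived since the last batch.

    Returns the new keys and the number of messages read in this batch.
    """
    new_keys = list(queue)
    queue.clear()
    return new_keys, len(new_keys)


# Hypothetical bucket that already holds 10,000 processed files.
bucket = [f"data/part-{i:05d}.json" for i in range(10_000)]
seen = set(bucket)
notifications: Deque[str] = deque()

# Three new files land before the next batch; each one also emits a notification.
for i in range(10_000, 10_003):
    key = f"data/part-{i:05d}.json"
    bucket.append(key)
    notifications.append(key)

listed_new, listed_cost = discover_by_listing(bucket, seen)
queued_new, queued_cost = discover_by_queue(notifications)

print(listed_cost, queued_cost)  # 10003 3
```

Both approaches find the same three files, but the listing cost keeps growing with the total number of files in the directory, whereas the queue cost follows only the number of new arrivals.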

Spark Streaming Monitoring

  • The other important and interesting aspect of Spark Streaming is monitoring on top of the streaming job.
  • found two interesting reads on the same,
  • Basically, the idea is to be able to monitor Spark Streaming metrics such as batches processed, files processed, number of rows processed, etc.
  • I think this kind of visualisation is a must, and it can help a lot in creating monitors and dashboards on top of long-running Spark applications.
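As a rough idea of what such a monitor could aggregate, here is a minimal sketch in plain Python. It assumes a Structured Streaming job and assumes each progress event is a dict shaped like the entries of PySpark's `StreamingQuery.recentProgress` (with `batchId`, `numInputRows`, and `durationMs.triggerExecution`). The helper `summarize_progress` and the sample events are hypothetical, not real job output.

```python
from typing import Dict, Iterable


def summarize_progress(progress_events: Iterable[Dict]) -> Dict[str, float]:
    """Aggregate per-batch progress events into a few dashboard-friendly numbers."""
    batches = 0
    rows = 0
    trigger_ms = 0
    for event in progress_events:
        batches += 1
        # Missing fields are treated as zero so a partial event does not break the summary.
        rows += int(event.get("numInputRows", 0))
        trigger_ms += int(event.get("durationMs", {}).get("triggerExecution", 0))
    avg_trigger_ms = trigger_ms / batches if batches else 0.0
    return {"batches": batches, "rows": rows, "avg_trigger_ms": avg_trigger_ms}


# Made-up events for illustration; in a real job they might come from query.recentProgress.
sample_events = [
    {"batchId": 0, "numInputRows": 120, "durationMs": {"triggerExecution": 900}},
    {"batchId": 1, "numInputRows": 80, "durationMs": {"triggerExecution": 1100}},
]

summary = summarize_progress(sample_events)
print(summary)
```

A summary like this could be pushed to whatever metrics store backs the dashboards; the aggregation itself stays independent of the sink.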

Both of the above topics have really fascinated me, and I am going to code something along these lines this coming weekend or within a few days. Will keep you all posted.