Day 17/100
Spark Streaming Misc
Spark Streaming's issue with S3/HDFS as a stream source
With reference to spoddutur.github.io/spark-notes/s3-filesyst.., it is evident that using S3/HDFS as the source for a streaming application works just fine at the POC level. The actual problems show up when we try to deploy it:
- Identifying new files to process, via the checkpoint and S3/HDFS API calls, takes a significant amount of time, nearly 50% of the batch interval if your batch size is 30 seconds.
- Not to mention the high costs that the List API calls on S3 add up to.
- Also, these filesystems (S3 in particular) only guarantee eventual consistency.
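To make the first point concrete, here is a minimal, purely illustrative sketch (all names are my own, not Spark's) of the "list everything, then diff against the checkpoint" loop that a file-based streaming source effectively runs every micro-batch. On S3, that full listing per batch is exactly what eats time and racks up List API cost:

```python
# Hypothetical sketch of checkpoint-based new-file discovery.
# list_bucket() stands in for an S3/HDFS listing call; 'seen' stands in
# for the set of already-processed files kept in the checkpoint.

def discover_new_files(list_bucket, seen):
    all_keys = set(list_bucket())   # full LIST on every batch -> slow and costly on S3
    new = sorted(all_keys - seen)   # only the diff is actually new work
    seen |= all_keys                # the checkpointed set grows with the bucket
    return new

# toy "bucket" that grows between batches
bucket = ["logs/part-0000.json"]
seen = set()

print(discover_new_files(lambda: bucket, seen))  # ['logs/part-0000.json']
bucket.append("logs/part-0001.json")
print(discover_new_files(lambda: bucket, seen))  # ['logs/part-0001.json']
```

Note that the cost of the listing is proportional to everything in the bucket, not just the new files, which is why it gets worse the longer the job runs.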
- But we can't simply avoid using them as a streaming source. The options are:
  - github.com/Netflix/s3mper
  - docs.databricks.com/spark/latest/structured.. [Won't recommend this, as it is not open source]
  - github.com/apache/bahir/tree/master/sql-str.. [I am going to test this out, and will probably write down how that went in the next few days]
I think even if we end up using a secondary index like Netflix's s3mper, we would still need to implement our own streaming Source in Spark. So sql-streaming-sqs gives us an edge here.
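The core idea behind the SQS approach can be sketched in a few lines. This is my own toy stand-in, not the sql-streaming-sqs code: S3 pushes an event notification to a queue for every new object, and the source simply drains the queue, so per-batch work is proportional to the number of new files rather than the size of the bucket:

```python
from collections import deque

# Hypothetical illustration of notification-driven discovery:
# no listing calls, just drain pending events each batch.

class QueueBackedSource:
    def __init__(self):
        self.queue = deque()   # stand-in for an SQS queue of S3 events

    def notify(self, key):
        """Called when S3 emits an event notification for a new object."""
        self.queue.append(key)

    def next_batch(self):
        """Drain only the pending events; O(new files) per batch."""
        batch = list(self.queue)
        self.queue.clear()
        return batch

src = QueueBackedSource()
src.notify("logs/part-0002.json")
src.notify("logs/part-0003.json")
print(src.next_batch())  # ['logs/part-0002.json', 'logs/part-0003.json']
print(src.next_batch())  # [] -> no listing needed to learn nothing is new
```

An empty batch costs nothing here, whereas in the listing approach an empty batch still pays for a full scan of the bucket.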
Spark Streaming Monitoring
- The other important and interesting aspect of Spark Streaming is monitoring on top of a streaming job.
- I found two interesting reads on the same.
- The basic idea is to be able to monitor Spark Streaming metrics like batches processed, files processed, number of rows processed, etc.
- I think this kind of visualisation is a must; it can help a lot in creating monitors and dashboards on top of long-running Spark applications.
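As a rough sketch of what such monitoring could look like: collect per-batch metrics (batch id, rows processed, duration) into a history that a dashboard can read. In real Spark this would hang off a streaming query listener; the class below is a plain-Python stand-in of my own, not a Spark API:

```python
# Hypothetical per-batch metrics collector for a streaming job.
# A dashboard or alerting system would poll history/totals().

class BatchMetricsCollector:
    def __init__(self):
        self.history = []

    def on_batch_complete(self, batch_id, num_rows, duration_ms):
        # record one completed micro-batch
        self.history.append(
            {"batch": batch_id, "rows": num_rows, "duration_ms": duration_ms}
        )

    def totals(self):
        # aggregates suitable for a summary panel
        return {
            "batches": len(self.history),
            "rows": sum(m["rows"] for m in self.history),
        }

mon = BatchMetricsCollector()
mon.on_batch_complete(0, 1200, 850)
mon.on_batch_complete(1, 900, 610)
print(mon.totals())  # {'batches': 2, 'rows': 2100}
```

From here it is a small step to push each record to a time-series store and alert when batch duration approaches the batch interval.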
Both of the above topics have really fascinated me, and I am going to code out something along these lines over the coming weekend, or within a few days. Will keep you all posted.