Day 17/100
Spark Streaming Misc
Spark Streaming's issue with S3/HDFS as a stream source
With reference to spoddutur.github.io/spark-notes/s3-filesyst.., it is evident that using S3/HDFS as the source for a streaming application works just fine at the POC level. The actual problems show up when we try to deploy it:
- Identifying new files to process, via the checkpoint and S3/HDFS API calls, takes a significant amount of time, nearly 50% of the batch interval if your batch size is 30 seconds.
- Not to mention the high costs that the List API calls on S3 add up to.
- Also, these filesystems (S3 in particular) only guarantee eventual consistency.
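To make the first point concrete, here is a minimal, purely illustrative sketch (all names are my own, not Spark's) of the "list everything, then diff against the checkpoint" loop that a file-based streaming source effectively runs every micro-batch. On S3, that full listing per batch is exactly what eats time and racks up List API cost:

```python
# Hypothetical sketch of checkpoint-based new-file discovery.
# list_bucket() stands in for an S3/HDFS listing call; 'seen' stands in
# for the set of already-processed files kept in the checkpoint.

def discover_new_files(list_bucket, seen):
    all_keys = set(list_bucket())   # full LIST on every batch -> slow and costly on S3
    new = sorted(all_keys - seen)   # only the diff is actually new work
    seen |= all_keys                # the checkpointed set grows with the bucket
    return new

# toy "bucket" that grows between batches
bucket = ["logs/part-0000.json"]
seen = set()

print(discover_new_files(lambda: bucket, seen))  # ['logs/part-0000.json']
bucket.append("logs/part-0001.json")
print(discover_new_files(lambda: bucket, seen))  # ['logs/part-0001.json']
```

Note that the cost of the listing is proportional to everything in the bucket, not just the new files, which is why it gets worse the longer the job runs.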
- But we can't simply avoid using them as a streaming source. The options are:
  - github.com/Netflix/s3mper
  - docs.databricks.com/spark/latest/structured.. [Won't recommend this, as it is not open source]
  - github.com/apache/bahir/tree/master/sql-str.. [I am going to test this out, and will probably write down how that went in the next few days]
I think even if we end up using a secondary index like Netflix's s3mper, we would still need to implement our own streaming Source in Spark. So sql-streaming-sqs gives us an edge here.
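The core idea behind the SQS approach can be sketched in a few lines. This is my own toy stand-in, not the sql-streaming-sqs code: S3 pushes an event notification to a queue for every new object, and the source simply drains the queue, so per-batch work is proportional to the number of new files rather than the size of the bucket:

```python
from collections import deque

# Hypothetical illustration of notification-driven discovery:
# no listing calls, just drain pending events each batch.

class QueueBackedSource:
    def __init__(self):
        self.queue = deque()   # stand-in for an SQS queue of S3 events

    def notify(self, key):
        """Called when S3 emits an event notification for a new object."""
        self.queue.append(key)

    def next_batch(self):
        """Drain only the pending events; O(new files) per batch."""
        batch = list(self.queue)
        self.queue.clear()
        return batch

src = QueueBackedSource()
src.notify("logs/part-0002.json")
src.notify("logs/part-0003.json")
print(src.next_batch())  # ['logs/part-0002.json', 'logs/part-0003.json']
print(src.next_batch())  # [] -> no listing needed to learn nothing is new
```

An empty batch costs nothing here, whereas in the listing approach an empty batch still pays for a full scan of the bucket.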
Spark Streaming Monitoring
- The other important and interesting aspect of Spark Streaming is monitoring on top of a streaming job.
- I found two interesting reads on the same.
- The basic idea is to be able to monitor Spark Streaming metrics like batches processed, files processed, number of rows processed, etc.
- I think this kind of visualisation is a must; it can help a lot in creating monitors and dashboards on top of long-running Spark applications.
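As a rough sketch of what such monitoring could look like: collect per-batch metrics (batch id, rows processed, duration) into a history that a dashboard can read. In real Spark this would hang off a streaming query listener; the class below is a plain-Python stand-in of my own, not a Spark API:

```python
# Hypothetical per-batch metrics collector for a streaming job.
# A dashboard or alerting system would poll history/totals().

class BatchMetricsCollector:
    def __init__(self):
        self.history = []

    def on_batch_complete(self, batch_id, num_rows, duration_ms):
        # record one completed micro-batch
        self.history.append(
            {"batch": batch_id, "rows": num_rows, "duration_ms": duration_ms}
        )

    def totals(self):
        # aggregates suitable for a summary panel
        return {
            "batches": len(self.history),
            "rows": sum(m["rows"] for m in self.history),
        }

mon = BatchMetricsCollector()
mon.on_batch_complete(0, 1200, 850)
mon.on_batch_complete(1, 900, 610)
print(mon.totals())  # {'batches': 2, 'rows': 2100}
```

From here it is a small step to push each record to a time-series store and alert when batch duration approaches the batch interval.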
Both of the above topics have really fascinated me, and I am going to code out something along these lines over the coming weekend, or within a few days. Will keep you all posted.