original post

https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html

note

  1. moves ETL from periodic batch jobs to real-time streaming
  2. transforms AWS CloudTrail audit logs into an efficient, partitioned Parquet data warehouse; related notebook: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/8599738367597028/2070341989008532/3601578643761083/latest.html
  3. val rawRecords = spark.readStream.option("maxFilesPerTrigger", "100") — the maxFilesPerTrigger option caps how many new files each micro-batch reads, which gives earlier access to the final Parquet data (see the sketch below this list)
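
A minimal sketch of that read, assuming a SparkSession named spark is in scope (as in a Databricks notebook), a trimmed-down hypothetical cloudTrailSchema (the notebook's real schema covers many more fields), and an assumed inputPath for the CloudTrail JSON logs:

    import org.apache.spark.sql.types._

    // Hypothetical, trimmed-down CloudTrail schema; the notebook's real
    // schema covers many more fields.
    val cloudTrailSchema = new StructType()
      .add("Records", ArrayType(new StructType()
        .add("eventTime", StringType)
        .add("eventName", StringType)
        .add("awsRegion", StringType)))

    // Assumed input location of the CloudTrail JSON files.
    val inputPath = "/cloudtrail/logs/*"

    // maxFilesPerTrigger caps how many new files each micro-batch picks up,
    // so the first Parquet output lands sooner.
    val rawRecords = spark.readStream
      .option("maxFilesPerTrigger", "100")
      .schema(cloudTrailSchema)
      .json(inputPath)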

I’d say it is, more precisely, mini-batch, low-interval, fast ETL; end-to-end delay can still come from:

  • the micro-batch trigger interval (see the sketch after this list)
  • database/storage I/O latency
  • computation latency when there is on-the-fly aggregation
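
For the first bullet, the trigger interval is an explicit knob. A sketch continuing from the rawRecords stream above, using a console sink just for illustration (Trigger.ProcessingTime is the Spark 2.2+ spelling; Spark 2.1 itself used ProcessingTime("10 seconds") from org.apache.spark.sql.streaming):

    import org.apache.spark.sql.streaming.Trigger

    // A micro-batch fires at most every 10 seconds, so 10s is a floor on
    // end-to-end latency regardless of I/O or compute speed.
    val debugQuery = rawRecords.writeStream
      .format("console")
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .start()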

key methods: readStream and writeStream
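
Putting the two together, a sketch of the overall pipeline shape from the post, under the hypothetical schema above (the field names, paths, and date derivation are my assumptions, not the notebook's exact code):

    import org.apache.spark.sql.functions._
    import spark.implicits._

    // Flatten the nested Records array, derive a date column for
    // partitioning, and continuously append partitioned Parquet files.
    val cloudTrailEvents = rawRecords
      .select(explode($"Records").as("record"))
      .select(
        $"record.eventName".as("eventName"),
        $"record.awsRegion".as("awsRegion"),
        to_date($"record.eventTime".cast("timestamp")).as("date"))

    // The file sink requires a checkpoint location so it can track
    // progress and recover exactly-once output after failures.
    val streamingQuery = cloudTrailEvents.writeStream
      .format("parquet")
      .partitionBy("date")
      .option("path", "/cloudtrail/parquet/")
      .option("checkpointLocation", "/cloudtrail/checkpoint/")
      .start()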