original post
https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html
note
- from periodic to real-time
- transform AWS CloudTrail audit logs into an efficient, partitioned, Parquet data warehouse (see the transform sketch below)
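a rough sketch of that transform (my own sketch, not the notebook's code), assuming rawRecords is the streaming DataFrame from the readStream call quoted below, and that the input follows CloudTrail's usual JSON layout with a top-level Records array; field names are illustrative:

  import org.apache.spark.sql.functions._

  // CloudTrail files wrap events in a top-level "Records" array: flatten it to
  // one row per event, keep a few illustrative fields, and derive a date column
  // to partition the Parquet warehouse by. Adjust field names to whatever schema
  // the notebook actually defines.
  val cloudTrailEvents = rawRecords
    .select(explode(col("Records")).as("record"))
    .select(
      to_timestamp(col("record.eventTime"), "yyyy-MM-dd'T'HH:mm:ss'Z'").as("timestamp"),
      col("record.eventName").as("eventName"),
      col("record.awsRegion").as("awsRegion"))
    .withColumn("date", to_date(col("timestamp")))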
related notebook: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/8599738367597028/2070341989008532/3601578643761083/latest.html

  val rawRecords = spark.readStream.option("maxFilesPerTrigger", "100")

option maxFilesPerTrigger caps how many input files each micro-batch reads, to get earlier access to the final Parquet data
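a fuller sketch of that read (the path and cloudTrailSchema are assumptions, not from the post; streaming JSON sources normally need an explicit schema):

  // Cap each micro-batch at 100 newly discovered files, so the first Parquet
  // output becomes queryable sooner instead of waiting on one giant initial
  // batch over the whole backlog of log files.
  val rawRecords = spark.readStream
    .option("maxFilesPerTrigger", "100")
    .schema(cloudTrailSchema)                              // schema defined elsewhere
    .json("s3a://my-bucket/AWSLogs/*/CloudTrail/*/*/*/*")  // hypothetical input path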
I’d say it’s mini-batch, low-interval, fast ETL; more precisely, there can still be delay due to:
- the batch/trigger interval (see the trigger sketch after this list)
- DB I/O delay
- computation delay if there is on-the-fly aggregation
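the trigger interval is set on the write side; a sketch (output and checkpoint paths are made up; Trigger.ProcessingTime is the Spark 2.2+ spelling, the 2.1-era post used the ProcessingTime case class):

  import org.apache.spark.sql.streaming.Trigger

  // Fire a micro-batch at most every 10 seconds; that cadence is a floor on
  // end-to-end latency, on top of per-batch I/O and computation time.
  val query = cloudTrailEvents.writeStream
    .format("parquet")
    .partitionBy("date")                                   // the partitioned warehouse layout
    .option("path", "/tmp/cloudtrail-parquet")             // hypothetical output path
    .option("checkpointLocation", "/tmp/cloudtrail-ckpt")  // hypothetical checkpoint dir
    .trigger(Trigger.ProcessingTime("10 seconds"))
    .start()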
key methods: readStream and writeStream
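readStream gives a streaming DataFrame; writeStream...start() returns a StreamingQuery handle you then manage. a tiny sketch using the standard StreamingQuery API, with query taken from the writeStream sketch above:

  println(query.status)        // whether a trigger is active / data is available
  println(query.lastProgress)  // metrics for the most recent micro-batch (null before the first one)
  query.awaitTermination()     // block until the stream stops or fails
  // query.stop()              // or stop it explicitly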