original post
https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html
note
- from periodic to real-time
- transform AWS CloudTrail audit logs into an efficient, partitioned, Parquet data warehouse (see the transform sketch below)
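a rough sketch of that transform (my own sketch, not the notebook's code), assuming rawRecords is the streaming DataFrame from the readStream call quoted below, and that the input follows CloudTrail's usual JSON layout with a top-level Records array; field names are illustrative:

  import org.apache.spark.sql.functions._

  // CloudTrail files wrap events in a top-level "Records" array: flatten it to
  // one row per event, keep a few illustrative fields, and derive a date column
  // to partition the Parquet warehouse by. Adjust field names to whatever schema
  // the notebook actually defines.
  val cloudTrailEvents = rawRecords
    .select(explode(col("Records")).as("record"))
    .select(
      to_timestamp(col("record.eventTime"), "yyyy-MM-dd'T'HH:mm:ss'Z'").as("timestamp"),
      col("record.eventName").as("eventName"),
      col("record.awsRegion").as("awsRegion"))
    .withColumn("date", to_date(col("timestamp")))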
related notebook: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/8599738367597028/2070341989008532/3601578643761083/latest.html

  val rawRecords = spark.readStream.option("maxFilesPerTrigger", "100")

option maxFilesPerTrigger caps how many input files each micro-batch reads, to get earlier access to the final Parquet data
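a fuller sketch of that read (the path and cloudTrailSchema are assumptions, not from the post; streaming JSON sources normally need an explicit schema):

  // Cap each micro-batch at 100 newly discovered files, so the first Parquet
  // output becomes queryable sooner instead of waiting on one giant initial
  // batch over the whole backlog of log files.
  val rawRecords = spark.readStream
    .option("maxFilesPerTrigger", "100")
    .schema(cloudTrailSchema)                              // schema defined elsewhere
    .json("s3a://my-bucket/AWSLogs/*/CloudTrail/*/*/*/*")  // hypothetical input path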
I’d say it’s mini-batch, low-interval, fast ETL; more precisely, there can still be delay due to:
- the batch/trigger interval (see the trigger sketch after this list)
- DB I/O delay
- computation delay if there is on-the-fly aggregation
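the trigger interval is set on the write side; a sketch (output and checkpoint paths are made up; Trigger.ProcessingTime is the Spark 2.2+ spelling, the 2.1-era post used the ProcessingTime case class):

  import org.apache.spark.sql.streaming.Trigger

  // Fire a micro-batch at most every 10 seconds; that cadence is a floor on
  // end-to-end latency, on top of per-batch I/O and computation time.
  val query = cloudTrailEvents.writeStream
    .format("parquet")
    .partitionBy("date")                                   // the partitioned warehouse layout
    .option("path", "/tmp/cloudtrail-parquet")             // hypothetical output path
    .option("checkpointLocation", "/tmp/cloudtrail-ckpt")  // hypothetical checkpoint dir
    .trigger(Trigger.ProcessingTime("10 seconds"))
    .start()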
key methods: readStream and writeStream
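readStream gives a streaming DataFrame; writeStream...start() returns a StreamingQuery handle you then manage. a tiny sketch using the standard StreamingQuery API, with query taken from the writeStream sketch above:

  println(query.status)        // whether a trigger is active / data is available
  println(query.lastProgress)  // metrics for the most recent micro-batch (null before the first one)
  query.awaitTermination()     // block until the stream stops or fails
  // query.stop()              // or stop it explicitly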