Spark Dataframe and Dataset

Posted on September 9, 2017

Slides

https://github.com/keypointt/reading/blob/master/spark/2017_spark_spring_SF_IBM_dataframe.pdf

Notes

1. How DF, DS, and RDD Work

bucketing use case

SPARK-19008 lets generated code work with primitive values (for example, int) directly

It avoids boxing/unboxing overhead when a Dataset program calls a lambda, written in Scala, that operates on a primitive type.
In that case, Catalyst can directly call the method <primitiveType> apply(<primitiveType>) instead of Object apply(Object).
PR: https://github.com/apache/spark/pull/17172
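
The difference between the two call shapes can be sketched in plain Java. This is only an analogy, not Spark's generated code: the class name BoxingSketch and both helper methods are made up for illustration, with Function<Integer, Integer> standing in for Object apply(Object) and IntUnaryOperator standing in for the primitive variant.

```java
import java.util.function.Function;
import java.util.function.IntUnaryOperator;

public class BoxingSketch {
    // Generic path: the lambda is invoked through Object apply(Object),
    // so every int is boxed to Integer on the way in and unboxed on the way out.
    static long sumBoxed(int[] values, Function<Integer, Integer> f) {
        long total = 0;
        for (int v : values) {
            total += f.apply(v); // boxes v, then unboxes the result
        }
        return total;
    }

    // Primitive path: int applyAsInt(int) keeps the values unboxed throughout.
    static long sumPrimitive(int[] values, IntUnaryOperator f) {
        long total = 0;
        for (int v : values) {
            total += f.applyAsInt(v);
        }
        return total;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4};
        // Both paths compute the same result; only the calling convention differs.
        System.out.println(sumBoxed(data, x -> x * 2));
        System.out.println(sumPrimitive(data, x -> x * 2));
    }
}
```

Both methods return the same sum; the primitive version simply skips allocating an Integer for each element, which is the kind of overhead the change above targets.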

SPARK-14083 aims to let a future version of Spark analyze the Java bytecode of lambda expressions and combine them
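
These notes do not describe how SPARK-14083 would do this, so the following is only a rough illustration of what "combining" lambdas can mean: two consecutive map functions fused into one, so the data is traversed with a single function call per element. The class LambdaFusionSketch and its methods are hypothetical, and it uses plain Java streams rather than any Spark API.

```java
import java.util.function.IntUnaryOperator;
import java.util.stream.IntStream;

public class LambdaFusionSketch {
    // Unfused: two separate map steps, one per lambda.
    static int[] mapTwice(int[] values, IntUnaryOperator f, IntUnaryOperator g) {
        return IntStream.of(values).map(f).map(g).toArray();
    }

    // Fused: the two lambdas are combined into a single function first,
    // then applied in one map step. andThen applies f, then g.
    static int[] mapFused(int[] values, IntUnaryOperator f, IntUnaryOperator g) {
        IntUnaryOperator combined = f.andThen(g);
        return IntStream.of(values).map(combined).toArray();
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3};
        IntUnaryOperator addOne = x -> x + 1;
        IntUnaryOperator timesTwo = x -> x * 2;
        // Both variants produce the same elements.
        System.out.println(java.util.Arrays.toString(mapTwice(data, addOne, timesTwo)));
        System.out.println(java.util.Arrays.toString(mapFused(data, addOne, timesTwo)));
    }
}
```

Here the combination is done by hand with andThen; the point of bytecode analysis would be for the engine to see inside the lambdas and perform this kind of rewrite itself.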

Related paper: Jimple: Simplifying Java Bytecode for Analyses and Transformations (1998)
- http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.58.7708
Related tool: the T. J. Watson Libraries for Analysis (WALA) provide static analysis capabilities for Java bytecode and related languages, and for JavaScript.

Reference:

https://spark-summit.org/2017/events/demystifying-dataframe-and-dataset/