Slides

https://github.com/keypointt/reading/blob/master/spark/2017_spark_spring_SF_LinkedIn_graph.pdf

Notes

1. Graph Partitioning && Vertex Representation

random walks vertex representation.png

A layered lookup table is used here: each vertex stores an offset and a length into a flat destination array (a CSR-style adjacency layout). This could easily be posed as a design question in its own right.

  • The dst_vertex_id table should be a global one, broadcast to all executors.
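To make the offset + length layout concrete, here is a minimal sketch of such a layered lookup table. All names (`build_adjacency`, `neighbors_of`) are illustrative, not from the talk; in the real system the flat `dst_vertex_id` array would be the broadcast piece.

```python
def build_adjacency(edges, num_vertices):
    """Pack (src, dst) edges into a flat dst array plus per-vertex
    (offset, length) entries -- the layered lookup table."""
    buckets = [[] for _ in range(num_vertices)]
    for src, dst in edges:
        buckets[src].append(dst)
    dst_vertex_id = []  # flat destination array (the broadcast candidate)
    offset, length = [], []
    for neighbors in buckets:
        offset.append(len(dst_vertex_id))
        length.append(len(neighbors))
        dst_vertex_id.extend(neighbors)
    return offset, length, dst_vertex_id

def neighbors_of(v, offset, length, dst_vertex_id):
    # Neighbor lookup is two array reads plus a slice -- no hashing.
    return dst_vertex_id[offset[v]:offset[v] + length[v]]

edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
offset, length, dst = build_adjacency(edges, 3)
print(neighbors_of(0, offset, length, dst))  # [1, 2]
```

Storing only (offset, length) per vertex keeps the per-partition index small while the heavyweight neighbor array stays a single contiguous, shareable structure.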

2. zipPartitions operator to reduce shuffling

ramdom-walks-shuffle-reduce.png

Start with a subset of the walkers, not all of them.

Spark’s zipPartitions operator efficiently leverages the routing table.
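The point of `zipPartitions` (a Scala RDD operator, not exposed in PySpark) is that two co-partitioned datasets can be combined partition-by-partition with no shuffle. A minimal local simulation of that semantics, with hypothetical walker/routing data:

```python
def zip_partitions(parts_a, parts_b, f):
    """Simulate Spark's zipPartitions: partitions with the same index
    are combined by f, with no data movement between partitions."""
    assert len(parts_a) == len(parts_b), "RDDs must have the same partitioning"
    return [f(a, b) for a, b in zip(parts_a, parts_b)]

# Walker state partitioned the same way as the routing table (illustrative data).
walker_parts = [[("w1", 0)], [("w2", 5)]]          # (walker_id, current_vertex)
routing_parts = [{0: "executor-0"}, {5: "executor-1"}]  # vertex -> location

def route(walkers, table):
    # Runs entirely within one partition pair; no shuffle needed.
    return [(w, table[v]) for w, v in walkers]

print(zip_partitions(walker_parts, routing_parts, route))
# [[('w1', 'executor-0')], [('w2', 'executor-1')]]
```

Because the walkers and the routing table share a partitioning, each lookup is local to its partition, which is exactly what lets the shuffle be avoided.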

Advancing only p% of the random walkers in each stage helps co-locate results and reduce shuffling.
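A toy simulation of that staging idea (not actual Spark code; `random_walk_in_stages` and its parameters are assumptions for illustration): each stage advances only a p-fraction of the walkers, so a stage touches a smaller slice of the graph and carries less state to move around.

```python
import random

def random_walk_in_stages(adjacency, walkers, steps, p, seed=0):
    """Advance roughly p% of the walkers per stage instead of all of them."""
    rng = random.Random(seed)
    positions = list(walkers)
    for _ in range(steps):
        # Choose ~p% of walker indices to move this stage.
        k = max(1, int(len(positions) * p))
        for i in rng.sample(range(len(positions)), k):
            nbrs = adjacency[positions[i]]
            if nbrs:
                positions[i] = rng.choice(nbrs)
    return positions

adjacency = {0: [1, 2], 1: [2], 2: [0]}
final = random_walk_in_stages(adjacency, walkers=[0, 1, 2, 0], steps=10, p=0.5)
print(final)
```

In the distributed setting the same idea means each stage's active walkers can be batched against the partitions they currently sit in, rather than shuffling the full walker set every step.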
