
Lessons Learned from Building LinkedIn’s AI Data Platform

In a recent presentation, Félix GV from LinkedIn shared lessons learned from building LinkedIn’s AI data platform, Venice.

Introduction
AI at LinkedIn
AI Ecosystem
Venice: LinkedIn’s AI Data Platform
Data Infrastructure Components
Conclusion
Appendix: Tools and Frameworks

Introduction

Félix GV discussed the challenges of building LinkedIn’s AI infrastructure and the solutions the team arrived at. He highlighted how complex it is to integrate machine learning systems into real-world applications, and emphasized that the model itself depends on robust surrounding infrastructure.

Author: Félix GV, Principal Staff Engineer

AI at LinkedIn

LinkedIn uses AI in many applications, including People You May Know (PYMK) and the main feed. These applications operate on massive datasets and rely on recommendation systems that score and rank entities so that users see the most relevant content.
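To make "score and rank" concrete, here is a minimal sketch of the pattern. It is not LinkedIn's implementation: the feature names, the weights, and the linear scorer are all hypothetical stand-ins for a trained model.

```python
from typing import Callable, Dict, List, Tuple

Features = Dict[str, float]


def rank_candidates(
    candidates: Dict[str, Features],
    score: Callable[[Features], float],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Score every candidate and return the top_k highest-scoring ones."""
    scored = [(cid, score(feats)) for cid, feats in candidates.items()]
    # Sort by score, highest first; break ties by id so the order is deterministic.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


# Hypothetical weights standing in for a trained model.
WEIGHTS = {"shared_connections": 0.5, "same_company": 2.0}


def linear_score(feats: Features) -> float:
    """Weighted sum of the features; unknown features contribute nothing."""
    return sum(WEIGHTS.get(name, 0.0) * value for name, value in feats.items())


candidates = {
    "member_a": {"shared_connections": 4.0, "same_company": 0.0},
    "member_b": {"shared_connections": 1.0, "same_company": 1.0},
    "member_c": {"shared_connections": 0.0, "same_company": 0.0},
}
top = rank_candidates(candidates, linear_score, top_k=2)
```

In a production system the scorer would be a served model and the features would be fetched from a feature store at request time, which is why lookup latency matters so much.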

AI Ecosystem

Initially, LinkedIn’s AI tools were fragmented, which caused inefficiencies. To address this, the company built an integrated AI platform that serves both AI researchers and engineers. The platform covers feature management, model creation, deployment, serving, and maintenance, so the whole AI workflow lives in one place.

Venice: LinkedIn’s AI Data Platform

Key Platform Components

Frame: A virtual feature store abstracting over multiple storage types.
King Kong: Kubernetes-based deep learning training infrastructure.
FedEx: Feature productionization pipeline.
Model Cloud: Inference platform for serving models efficiently.
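The idea of a virtual feature store that abstracts over multiple storage types can be sketched as a thin routing layer. This is an illustrative assumption about the general pattern, not Frame's actual API: the class names, method names, and feature names below are all hypothetical.

```python
from typing import Dict, List, Protocol


class FeatureBackend(Protocol):
    """Anything that can return a feature value for an entity."""

    def get(self, entity_id: str, feature: str) -> float: ...


class InMemoryBackend:
    """Stand-in for a real storage system (online store, offline table, ...)."""

    def __init__(self, rows: Dict[str, Dict[str, float]]) -> None:
        self._rows = rows

    def get(self, entity_id: str, feature: str) -> float:
        return self._rows[entity_id][feature]


class VirtualFeatureStore:
    """Routes each feature name to whichever backend holds it."""

    def __init__(self) -> None:
        self._routes: Dict[str, FeatureBackend] = {}

    def register(self, feature: str, backend: FeatureBackend) -> None:
        self._routes[feature] = backend

    def get_features(self, entity_id: str, features: List[str]) -> Dict[str, float]:
        # Callers ask for features by name and never see the storage layer.
        return {f: self._routes[f].get(entity_id, f) for f in features}


# Two hypothetical backends holding different features for the same member.
online = InMemoryBackend({"member_1": {"connection_count": 312.0}})
offline = InMemoryBackend({"member_1": {"profile_views_30d": 45.0}})

store = VirtualFeatureStore()
store.register("connection_count", online)
store.register("profile_views_30d", offline)
features = store.get_features("member_1", ["connection_count", "profile_views_30d"])
```

The benefit of this shape is that model code depends only on feature names, so a feature can move between storage systems without touching its consumers.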

Venice’s Role

Venice is designed for derived data, that is, data computed from other data, such as ML features. It supports high-throughput ingestion from both batch and streaming sources and serves the low-latency reads that online AI applications depend on. Because Venice is self-service, it is easy to adopt, which lets AI engineers focus on their business needs.
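As a rough mental model of a store fed by both batch and streaming sources, consider the toy sketch below. It is not Venice's API or architecture: the class, its method names, and the swap-on-push behavior are assumptions made purely for illustration, and a real system would also handle partitioning, replication, and durability.

```python
from typing import Dict, Iterable, Optional, Tuple


class DerivedDataStore:
    """Toy key-value store fed by batch pushes and streaming writes."""

    def __init__(self) -> None:
        self._data: Dict[str, float] = {}

    def push_batch(self, records: Iterable[Tuple[str, float]]) -> None:
        # Build the new dataset off to the side, then swap it in with a single
        # reference assignment so readers never observe a half-loaded state.
        staged = dict(records)
        self._data = staged

    def write_stream(self, key: str, value: float) -> None:
        # Streaming updates are applied on top of the current dataset.
        self._data[key] = value

    def get(self, key: str) -> Optional[float]:
        # Reads are plain in-memory lookups; a missing key yields None.
        return self._data.get(key)


store = DerivedDataStore()
store.push_batch([("member_1", 0.12), ("member_2", 0.80)])  # e.g. nightly job output
store.write_stream("member_1", 0.95)                        # e.g. fresh event
fresh = store.get("member_1")
untouched = store.get("member_2")
missing = store.get("member_3")
```

One simplification to keep in mind: in this toy version a later batch push discards earlier streaming writes, whereas a production system needs a deliberate policy for reconciling the two sources.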

Data Infrastructure Components

LinkedIn’s data infrastructure includes various open-source tools:

Apache Spark: Batch processing.
Kubeflow: Deep learning workflows.
OpenHouse: Operational catalog.
Hadoop: Large-scale storage.
Apache Samza: Stream processing.
Apache Kafka: Pub/Sub system.
Brooklin: Change data capture.
Apache Pinot: Online analytical processing.
Venice: Feature storage for online inference.

Conclusion

LinkedIn’s AI data platform, Venice, exemplifies the importance of a well-integrated infrastructure in deploying large-scale AI applications. By leveraging open-source tools and developing specialized components, LinkedIn ensures efficient, scalable, and reliable AI operations.

Performance is the Best Feature!

For more detailed insights, you can watch the full presentation on InfoQ: Lessons Learned from Building LinkedIn’s AI Data Platform.

Appendix: Tools and Frameworks

Tools and Frameworks Mentioned

Apache Spark: Batch processing.
Kubeflow: Deep learning workflows.
OpenHouse: Operational catalog.
Hadoop: Large-scale storage.
Apache Samza: Stream processing.
Apache Kafka: Pub/Sub system.
Brooklin: Change data capture.
Apache Pinot: Online analytical processing.
Venice: Feature storage for online inference.
avro-util: A collection of utilities and libraries that help Java projects work better with Avro.
