Lessons Learned from Building LinkedIn’s AI Data Platform
In a recent presentation, Felix GV from LinkedIn shared insights into the construction of LinkedIn’s AI data platform, Venice.
Table of Contents
- Introduction
- AI at LinkedIn
- AI Ecosystem
- Venice: LinkedIn’s AI Data Platform
- Data Infrastructure Components
- Conclusion
- Appendix: Tools and Frameworks
Introduction
Felix GV discussed the challenges and solutions in building LinkedIn’s AI infrastructure. He highlighted the complexity of integrating machine learning systems into real-world applications, emphasizing the importance of robust surrounding infrastructure.
author: Félix GV - principal staff engineer
AI at LinkedIn
LinkedIn uses AI for various applications, including People You May Know (PYMK) and the main feed. These applications involve massive data and require sophisticated recommendation systems to score and rank entities, ensuring users receive the most relevant content.
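To make "score and rank" concrete, here is a minimal sketch in Python. It is not LinkedIn's model: the feature names, the weights, and the linear scoring function are all hypothetical, chosen only to show the shape of the step where each candidate entity gets a score and the top few are returned.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Candidate:
    """An entity to be ranked, with its already-fetched features (hypothetical names)."""
    entity_id: str
    features: Dict[str, float]


def linear_score(weights: Dict[str, float]) -> Callable[[Candidate], float]:
    """Return a scoring function; a real system would call a trained model here."""
    def score(candidate: Candidate) -> float:
        return sum(weights.get(name, 0.0) * value
                   for name, value in candidate.features.items())
    return score


def rank(candidates: List[Candidate],
         score_fn: Callable[[Candidate], float],
         top_k: int) -> List[Tuple[str, float]]:
    """Score every candidate and return the top_k (entity_id, score) pairs, best first."""
    scored = [(c.entity_id, score_fn(c)) for c in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Illustrative data only.
candidates = [
    Candidate("member_a", {"shared_connections": 12.0, "same_company": 1.0}),
    Candidate("member_b", {"shared_connections": 3.0, "same_company": 0.0}),
    Candidate("member_c", {"shared_connections": 7.0, "same_company": 1.0}),
]
weights = {"shared_connections": 0.125, "same_company": 0.5}

top = rank(candidates, linear_score(weights), top_k=2)
print(top)  # [('member_a', 2.0), ('member_c', 1.375)]
```

The point of the sketch is that scoring needs every candidate's features at request time, which is why feature lookup latency matters so much for these applications.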
AI Ecosystem
Initially, LinkedIn’s AI tools were fragmented, which caused inefficiencies. To address this, the company developed an integrated AI platform that serves both AI researchers and engineers. The platform covers feature management, model creation, deployment, serving, and maintenance, so the whole AI workflow is handled in one place.
Venice: LinkedIn’s AI Data Platform
Key Components of the Surrounding AI Platform
- Frame: A virtual feature store abstracting over multiple storage types.
- King Kong: Kubernetes-based deep learning training infrastructure.
- FedEx: Feature productionization pipeline.
- Model Cloud: Inference platform for serving models efficiently.
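The idea of a virtual feature store that abstracts over multiple storage types can be sketched as follows. This is not Frame's API: `FeatureSource`, `InMemorySource`, and `VirtualFeatureStore` are hypothetical names, and the in-memory dictionaries stand in for whatever offline and online backends a real deployment would use.

```python
from typing import Dict, List, Optional, Protocol


class FeatureSource(Protocol):
    """Anything that can return a feature value for an entity (hypothetical interface)."""
    def get(self, entity_id: str, feature: str) -> Optional[float]: ...


class InMemorySource:
    """Stand-in for a concrete backend such as an offline table or an online key-value store."""
    def __init__(self, rows: Dict[str, Dict[str, float]]) -> None:
        self._rows = rows

    def get(self, entity_id: str, feature: str) -> Optional[float]:
        return self._rows.get(entity_id, {}).get(feature)


class VirtualFeatureStore:
    """Routes each feature name to the backend that holds it, hiding storage details from callers."""
    def __init__(self) -> None:
        self._routes: Dict[str, FeatureSource] = {}

    def register(self, feature: str, source: FeatureSource) -> None:
        self._routes[feature] = source

    def get_features(self, entity_id: str, features: List[str]) -> Dict[str, Optional[float]]:
        result: Dict[str, Optional[float]] = {}
        for name in features:
            source = self._routes.get(name)
            if source is None:
                raise KeyError(f"no source registered for feature {name!r}")
            result[name] = source.get(entity_id, name)
        return result


# Illustrative data only: two backends, one caller-facing lookup.
offline = InMemorySource({"member_a": {"connection_count": 500.0}})
online = InMemorySource({"member_a": {"recent_clicks": 3.0}})

store = VirtualFeatureStore()
store.register("connection_count", offline)
store.register("recent_clicks", online)

features = store.get_features("member_a", ["connection_count", "recent_clicks"])
print(features)  # {'connection_count': 500.0, 'recent_clicks': 3.0}
```

The caller asks for features by name and never learns which backend served them, which is the property the word "virtual" points at.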
Venice’s Role
Venice is designed for derived data, supporting high-throughput ingestion from batch and streaming sources, and providing low-latency responses essential for AI applications. Venice’s self-service nature ensures ease of use, allowing AI engineers to focus on their business needs.
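The combination described here (bulk loads from batch jobs, incremental updates from streams, and fast point reads) can be illustrated with a toy in-memory store. This is not Venice's API or implementation: `ToyDerivedStore` and its methods are hypothetical, and the choice to replace the whole dataset on each batch push is an assumption made for the sketch.

```python
from typing import Dict, Iterable, Optional, Tuple


class ToyDerivedStore:
    """Toy key-value store fed by batch pushes and stream writes (hypothetical, in-memory)."""

    def __init__(self) -> None:
        self._live: Dict[str, str] = {}
        self._version = 0

    def batch_push(self, records: Iterable[Tuple[str, str]]) -> int:
        """Load a complete dataset, then make it live in one step.

        The new dataset is built off to the side and swapped in with a single
        assignment, so a reader sees either the old data or the new data.
        """
        staged = dict(records)
        self._live = staged
        self._version += 1
        return self._version

    def stream_write(self, key: str, value: str) -> None:
        """Apply one incremental update on top of the current dataset."""
        self._live[key] = value

    def get(self, key: str) -> Optional[str]:
        """Point lookup by key: the read path an online model would call."""
        return self._live.get(key)


# Illustrative data only.
derived = ToyDerivedStore()
v1 = derived.batch_push([("member_a", "embedding-batch-1"),
                         ("member_b", "embedding-batch-1")])
derived.stream_write("member_a", "embedding-fresh")

print(derived.get("member_a"))  # embedding-fresh
print(derived.get("member_b"))  # embedding-batch-1

v2 = derived.batch_push([("member_a", "embedding-batch-2")])
print(derived.get("member_b"))  # None
```

A production system would add persistence, partitioning, and replication; the sketch only shows how batch and streaming writes can feed a single read path.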
Data Infrastructure Components
LinkedIn’s data infrastructure includes various open-source tools:
- Apache Spark: Batch processing.
- Kubeflow: Deep learning workflows.
- OpenHouse: Operational catalog.
- Hadoop: Large-scale storage.
- Apache Samza: Stream processing.
- Apache Kafka: Pub/Sub system.
- Brooklin: Change data capture.
- Apache Pinot: Online analytical processing.
- Venice: Feature storage for online inference.
Conclusion
LinkedIn’s AI data platform, Venice, shows how much large-scale AI applications depend on well-integrated infrastructure. By combining open-source tools with purpose-built components, LinkedIn aims to keep its AI operations efficient, scalable, and reliable.
For more detailed insights, you can watch the full presentation on InfoQ: Lessons Learned from Building LinkedIn’s AI Data Platform.
Appendix: Tools and Frameworks
Tools and frameworks mentioned in addition to those listed under Data Infrastructure Components:
- avro-util: A collection of utilities and libraries that help Java projects work better with Avro.