Presented by O'Reilly and Intel AI

Put AI to Work
June 18-21, 2019
Beijing, China

AVA: a Cloud-Native Deep Learning Platform at Qiniu

This will be presented in Chinese

Chaoguang Li (Qiniu), Bin Fan (Alluxio)
14:00-14:40 Friday, June 21, 2019
Implementing AI
Location: Auditorium

Prerequisite Knowledge

- Basic knowledge of the machine learning software stack
- Basic knowledge of cloud computing

What you'll learn

- Machine learning infrastructure in depth
- Design and implementation of cloud-native infrastructure
- Performance optimization and tuning

Description

AVA is a general-purpose deep learning training platform built by the Atlab Lab at Qiniu Cloud, which specializes in deep learning for computer vision. It provides deep learning as a service to our internal and external users. The platform is built on a stack of open source software, including TensorFlow, Caffe, Alluxio, and others, together with KODO, the object storage service provided by Qiniu. In this talk, we focus on the platform's internal infrastructure, from design to implementation.

One key goal of AVA is to serve different machine learning users and applications, each of which may have different tools in mind. The platform must therefore serve a large volume of training data from KODO to all of the machine learning frameworks in use, including TensorFlow, Caffe, PyTorch, MXNet, and others. In addition, the platform was designed for the cloud from day one, so computation is naturally separated from the training data (stored in KODO), and the network easily becomes the bottleneck when transferring large volumes of training data to the GPU machines. We therefore need to serve data efficiently while also keeping development and maintenance costs down. To do so, we use Alluxio to connect to KODO and present the training data through its POSIX interface to the different machine learning frameworks, such as TensorFlow and Caffe, and we deploy Alluxio as the unified data access layer that retrieves data and accelerates training tasks. For our training tasks, which read a large number of sample files such as videos and pictures, file read and write performance improved by more than 50%, and the cost of storage system capacity was significantly reduced.
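As a rough illustration of what a POSIX interface means for the training code, the sketch below reads sample files with ordinary file APIs only. It is not AVA's implementation: the mount point `/mnt/alluxio-fuse`, the `ALLUXIO_FUSE_MOUNT` environment variable, the `training-data` directory, the suffix list, and the `iter_samples` helper are all hypothetical names chosen for this example, and it assumes the Alluxio namespace backed by KODO has already been mounted as a local directory.

```python
import os
from pathlib import Path
from typing import Iterator, Tuple

# Hypothetical mount point (an assumption, not taken from the talk): the
# directory where the Alluxio POSIX (FUSE) interface exposes the data.
ALLUXIO_MOUNT = Path(os.environ.get("ALLUXIO_FUSE_MOUNT", "/mnt/alluxio-fuse"))

# Illustrative set of sample file types (pictures and videos).
SAMPLE_SUFFIXES = {".jpg", ".jpeg", ".png", ".mp4"}


def iter_samples(root: Path) -> Iterator[Tuple[str, bytes]]:
    """Yield (relative_path, raw_bytes) for each sample file under root.

    Only standard file operations are used, so the same loop works on a
    local disk or on any directory exposed through a POSIX mount.
    """
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in SAMPLE_SUFFIXES:
            yield path.relative_to(root).as_posix(), path.read_bytes()


if __name__ == "__main__":
    dataset_dir = ALLUXIO_MOUNT / "training-data"  # hypothetical dataset path
    if dataset_dir.is_dir():
        for name, data in iter_samples(dataset_dir):
            print(name, len(data))
    else:
        print(f"{dataset_dir} is not mounted; nothing to read")
```

Because the loop depends only on the file system, a data loader in any framework could consume the same directory without storage-specific client code, which is the property the paragraph above attributes to the unified data access layer.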

In this talk, we will share the motivation, the architecture, and the lessons we learned while building and maintaining this platform. We hope our experience of building machine learning infrastructure at scale can shed light on machine learning infrastructure in the cloud.


Chaoguang Li

Qiniu

Chaoguang has worked in distributed systems for more than 10 years. He worked at IBM on the first generation of DS8000 SSD tiered storage, and was then the chief architect of the all-flash storage Dorado Cache at Huawei. He currently leads the deep learning platform at Qiniu.


Bin Fan

Alluxio

Bin Fan is a software engineer at Alluxio and a PMC member of the Alluxio project. Previously, Bin worked at Google building next-generation storage infrastructure, where he won Google’s Technical Infrastructure award. He holds a PhD in computer science from Carnegie Mellon University.
