AVA is a general-purpose deep learning training platform, built by AtLab at Qiniu Cloud (a lab specializing in deep learning for computer vision), to provide deep learning as a service to our internal and external users. The platform is built on a stack of open source software including TensorFlow, Caffe, and Alluxio, as well as KODO, Qiniu's object storage service. In this talk, we will focus on its internal infrastructure, from design to implementation.
One key goal of AVA is to serve different machine learning users and applications, each of which may prefer different tools. The platform is therefore required to serve large volumes of training data from KODO to all the major machine learning frameworks, including TensorFlow, Caffe, PyTorch, and MXNet. In addition, the platform was designed for the cloud from day one, so computation is naturally separated from the training data stored in KODO, and the network easily becomes the bottleneck when transferring large-scale training data to the GPU machines. We must therefore serve the data efficiently while keeping development and maintenance costs low. To achieve this, we leveraged Alluxio to connect to KODO and present training data through its POSIX interface to all of the machine learning frameworks, deploying Alluxio as a unified data access layer that both retrieves and accelerates access to training data. For training tasks that read large numbers of sample files such as videos and images, file read and write performance improved by more than 50%, while significantly reducing storage capacity costs.
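To make the unified-access idea concrete, here is a minimal sketch of what the POSIX layer buys the frameworks: once Alluxio is mounted via FUSE, any framework can enumerate and open KODO-backed samples as ordinary local files, with no storage-specific client code. The mount path and helper function below are illustrative assumptions, not AVA's actual code.

```python
import glob
import os

# Hypothetical Alluxio-FUSE mount point; in a deployment like AVA,
# training data in KODO would be exposed here (path is an assumption).
ALLUXIO_MOUNT = os.environ.get("ALLUXIO_MOUNT", "/mnt/alluxio")

def list_samples(mount_point, pattern="*.jpg"):
    """Enumerate sample files under the POSIX mount. TensorFlow, Caffe,
    PyTorch, or MXNet data loaders can then open() these paths directly,
    with no KODO- or Alluxio-specific API calls."""
    return sorted(
        glob.glob(os.path.join(mount_point, "**", pattern), recursive=True)
    )
```

Because every framework already knows how to read local files, this one mount point replaces per-framework storage connectors, which is where the development and maintenance savings come from.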
In this talk, we will share the motivation, the architecture, and the lessons we learned while building and maintaining this platform. We hope our experience building machine learning infrastructure at scale can shed light on running machine learning infrastructure in the cloud.
Chaoguang has been working on distributed systems for more than 10 years. He worked at IBM on the first generation of SSD tiered storage for DS8000 and was then the chief architect of Dorado Cache, the all-flash storage system at Huawei. He currently leads the deep learning platform at Qiniu.
Bin Fan is a software engineer at Alluxio and a PMC member of the Alluxio project. Previously, Bin worked at Google building next-generation storage infrastructure, where he won Google’s Technical Infrastructure award. He holds a PhD in computer science from Carnegie Mellon University.
©2019, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • firstname.lastname@example.org