AVA is a general deep learning training platform built by the Atlab lab at Qiniu Cloud (which specializes in deep learning for computer vision) to provide deep learning capabilities to internal and external users. The platform is built on a stack of open source software, including TensorFlow, Caffe, and Alluxio, as well as KODO, Qiniu's object storage service.
Chaoguang Li and Bin Fan explore the design and implementation of AVA's internal infrastructure, examining the motivation, the architecture, and the lessons learned from building and maintaining the platform. Their experience building machine learning infrastructure at scale can inform your own work on machine learning infrastructure in the cloud.
A primary goal of AVA is to serve different ML users and applications, each requiring different tools, which means the platform must deliver large amounts of training data from KODO to all of the supported ML frameworks, including TensorFlow, Caffe, PyTorch, and MXNet. Because the platform was designed for the cloud from day one, there is no later migration to a cloud environment, and computation is naturally separated from the training data source (stored in KODO). In this setup, the network easily becomes the bottleneck when transferring large volumes of training data to the GPU machines, so the company must serve the data efficiently while keeping development and maintenance costs low. As a result, Qiniu deployed Alluxio as a unified data access layer: Alluxio connects to KODO and presents the training data through its portable operating system interface (POSIX) to all of the machine learning frameworks, such as TensorFlow and Caffe, both retrieving data and accelerating training tasks. For training tasks that read large numbers of sample files, such as videos and images, file read and write performance improved by more than 50%, which significantly reduced storage system capacity costs.
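The point of the POSIX layer is that every framework can consume the data with ordinary file I/O instead of a storage-specific SDK. A minimal sketch of what that looks like from the application side (the mount path, helper names, and file extensions are illustrative assumptions, not AVA's actual code):

```python
import os

# Hypothetical FUSE mount point where Alluxio exposes the KODO training
# data as an ordinary POSIX directory tree (path chosen for illustration).
MOUNT_POINT = "/mnt/alluxio/training-data"

def list_samples(root, extensions=(".jpg", ".png", ".mp4")):
    """Walk a POSIX-mounted dataset directory and collect sample file paths."""
    samples = []
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.lower().endswith(extensions):
                samples.append(os.path.join(dirpath, name))
    return samples

def read_sample(path):
    """Plain open()/read(): any framework (TensorFlow, Caffe, PyTorch,
    MXNet) can ingest data this way, with no object-storage client code."""
    with open(path, "rb") as f:
        return f.read()
```

Because Alluxio caches hot files close to the GPU machines, repeated epochs over the same samples are served from cache rather than re-fetched from KODO over the network, which is where the reported read/write speedup comes from.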
Chaoguang Li is a senior architect and director of AI at Qiniu Cloud. He has worked in distributed systems for more than 10 years. Previously, he worked at IBM on the first generation of SSD tiered storage for the DS8000 and later served as chief architect of the all-flash Dorado Cache at Huawei.
Bin Fan is a software engineer at Alluxio and a PMC member of the Alluxio project. Previously, Bin worked at Google, building next-generation storage infrastructure, where he won Google’s technical infrastructure award. He holds a PhD in computer science from Carnegie Mellon University.
©2019, O'Reilly Media, Inc. • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.