AVA is a general-purpose deep learning training platform built by the Atlab Lab at Qiniu Cloud, which specializes in deep learning for computer vision, to bring deep learning to internal and external users. The platform is built on a stack of open source software, including TensorFlow, Caffe, and Alluxio, together with KODO, the object storage service provided by Qiniu.
Chaoguang Li, Haoyuan Li, and Bin Fan explore AVA's internal infrastructure from design to implementation, discuss its motivation and architecture, and share lessons learned from building and maintaining the platform.
A primary goal of AVA is to serve different ML users and applications that require different tools. The platform therefore has to deliver large volumes of training data stored in KODO to every supported ML framework, including TensorFlow, Caffe, PyTorch, and MXNet. Because the platform was designed for the cloud from day one, no later migration to a cloud environment is needed, and computation is naturally separated from the training data source (KODO). That separation makes the network an easy bottleneck when large volumes of training data are transferred to the GPU machines, so the platform must serve data efficiently while keeping development and maintenance costs low. To meet these requirements, the team deployed Alluxio as the unified data access layer: it connects to KODO, presents the training data to the different machine learning frameworks (such as TensorFlow and Caffe) through its portable operating system interface (POSIX), and accelerates data retrieval for training tasks. For training tasks that read a large number of sample files, such as videos and pictures, file read and write performance improved by more than 50%, which significantly reduced the cost of storage system capacity.
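To illustrate why a POSIX interface keeps the frameworks decoupled from the storage backend, here is a minimal sketch of a data loader that uses only ordinary file I/O. It is not AVA's actual code: the mount point `/mnt/alluxio-fuse`, the `ALLUXIO_FUSE_ROOT` environment variable, the suffix list, and the helper names `list_samples` and `read_batches` are all assumptions made for illustration. The point is that the same code would work against a local directory or against a POSIX mount backed by object storage.

```python
import os
from pathlib import Path
from typing import Iterable, Iterator, List

# Hypothetical mount point of the POSIX (FUSE) view of the training data.
# The real path depends on how the mount is configured; this is an assumption.
ALLUXIO_FUSE_ROOT = os.environ.get("ALLUXIO_FUSE_ROOT", "/mnt/alluxio-fuse")

# Assumed sample formats (pictures and videos); adjust to the actual dataset.
SAMPLE_SUFFIXES = {".jpg", ".jpeg", ".png", ".mp4"}


def list_samples(root: str) -> List[Path]:
    """Return all sample files under root, sorted for a reproducible order."""
    return sorted(
        p
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in SAMPLE_SUFFIXES
    )


def read_batches(paths: Iterable[Path], batch_size: int) -> Iterator[List[bytes]]:
    """Yield lists of raw file contents, batch_size files at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[bytes] = []
    for path in paths:
        # Plain open()/read(): nothing here is specific to the storage backend.
        with open(path, "rb") as f:
            batch.append(f.read())
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch
```

A training script could then iterate with `for batch in read_batches(list_samples(ALLUXIO_FUSE_ROOT), 32): ...` and hand the raw bytes to whichever framework's decoder it uses; any caching or acceleration happens below the file system interface, not in this code.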
Chaoguang Li is the senior architect and director of AI at Qiniu Cloud. He has been working in distributed systems for more than 10 years. Previously, he worked on the first generation of SSD tiered storage for the DS8000 at IBM and was the chief architect of the all-flash storage Dorado Cache at Huawei.
Bin Fan is a software engineer at Alluxio and a PMC member of the Alluxio project. Previously, Bin worked at Google, building next-generation storage infrastructure, where he won Google’s technical infrastructure award. He holds a PhD in computer science from Carnegie Mellon University.
Haoyuan (H.Y.) Li is the founder, chairman, and CTO of Alluxio. He holds a PhD in computer science from UC Berkeley’s AMPLab, where he created the Alluxio (formerly Tachyon) open source data orchestration system, cocreated Apache Spark Streaming, and became an Apache Spark founding committer. He also holds an MS from Cornell University and a BS from Peking University, both in computer science.
©2019, O'Reilly Media, Inc.