June 18-21, 2019
Beijing, CN

Atom: A cloud native deep learning platform at Supremind

This will be presented in Chinese

Chaoguang Li (Qiniu), Bin Fan (Alluxio), Haoyuan Li (Alluxio)
14:00-14:40 Friday, June 21, 2019
Implementing AI
Location: Auditorium

Prerequisite Knowledge

  • A basic knowledge of the machine learning (ML) software stack and the cloud

What you'll learn

  • Understand machine learning infrastructure in depth and the design and implementation of cloud native infrastructure
  • Learn performance optimization and tuning

Description

AVA is a general-purpose deep learning training platform built by the Atlab Lab at Qiniu Cloud (which specializes in deep learning for computer vision) to provide deep learning capabilities to internal and external users. The platform is built on a stack of open source software, including TensorFlow, Caffe, and Alluxio, as well as KODO, the object storage service provided by Qiniu.

Chaoguang Li, Haoyuan Li, and Bin Fan explore AVA's internal infrastructure from design to implementation, discuss its motivation and architecture, and share lessons learned building and maintaining the platform.

A primary goal of AVA is to serve ML users and applications that require different tools, so the platform must deliver large volumes of training data stored in KODO to every supported ML framework, including TensorFlow, Caffe, PyTorch, and MXNet. Because the platform was designed for the cloud from day one, no later migration to a cloud environment is needed, and computation is naturally separated from the training data source (KODO). With that separation, the network easily becomes the bottleneck when transferring large-scale training data to the GPU machines, so the platform must serve data efficiently while keeping development and maintenance costs low. To that end, the team deployed Alluxio as a unified data access layer: it connects to KODO, presents the training data to all the ML frameworks through a POSIX (Portable Operating System Interface) file interface, and accelerates data retrieval for training tasks. For training tasks that read large numbers of sample files such as videos and pictures, file read and write performance improved by more than 50%, which significantly reduced the storage system capacity cost.
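
To illustrate why a POSIX interface lets every framework share one data path, here is a minimal Python sketch, assuming the training data is exposed as an ordinary mounted directory. The mount path, the extension list, and the helper names (`ALLUXIO_MOUNT`, `iter_samples`, `demo`) are hypothetical and do not come from the talk; the demo uses a temporary local directory as a stand-in for the mount.

```python
import tempfile
from pathlib import Path
from typing import Iterator, List, Tuple

# Hypothetical mount point; the session description does not name the real path.
ALLUXIO_MOUNT = "/mnt/alluxio/training-data"

# Assumed sample types (the description mentions pictures and videos).
SAMPLE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".mp4"}


def iter_samples(root: str) -> Iterator[Tuple[str, bytes]]:
    """Yield (relative_path, raw_bytes) for every sample file under root.

    Only ordinary file calls are used, so the same loop works on a local
    directory or on a POSIX-style mount of a remote object store.
    """
    root_path = Path(root)
    for path in sorted(root_path.rglob("*")):
        if path.is_file() and path.suffix.lower() in SAMPLE_EXTENSIONS:
            with open(path, "rb") as f:
                yield path.relative_to(root_path).as_posix(), f.read()


def demo() -> List[Tuple[str, int]]:
    """Run iter_samples against a temporary directory standing in for the mount."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "cat").mkdir()
        (root / "dog").mkdir()
        (root / "cat" / "001.jpg").write_bytes(b"a")
        (root / "dog" / "002.png").write_bytes(b"bb")
        (root / "notes.txt").write_bytes(b"not a sample")
        # Return file names with their sizes instead of the raw bytes.
        return [(name, len(data)) for name, data in iter_samples(tmp)]


if __name__ == "__main__":
    for name, size in demo():
        print(name, size)
```

In a real deployment, a training job would presumably pass the mounted path (for example, `ALLUXIO_MOUNT`) to `iter_samples` or to its framework's own file-based data loader. No framework-specific storage connector is needed in that case.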

Chaoguang Li


Chaoguang Li is the senior architect and director of AI at Qiniu Cloud. He has worked on distributed systems for more than 10 years. Previously, he worked on the first generation of SSD tiered storage for the DS8000 at IBM and was the chief architect of the all-flash storage Dorado Cache at Huawei.

Bin Fan


Bin Fan is a software engineer at Alluxio and a PMC member of the Alluxio project. Previously, Bin worked at Google, building next-generation storage infrastructure, where he won Google’s technical infrastructure award. He holds a PhD in computer science from Carnegie Mellon University.

Haoyuan Li


Haoyuan (H.Y.) Li is the founder, chairman, and CTO of Alluxio. He holds a PhD in computer science from UC Berkeley’s AMPLab, where he created the Alluxio (formerly Tachyon) open source data orchestration system, cocreated Apache Spark Streaming, and became an Apache Spark founding committer. He also holds an MS from Cornell University and a BS from Peking University, both in computer science.