Jeff Dean’s Talk on Large-Scale Deep Learning

Jeff Dean is a Google Senior Fellow and leads the Google Brain project. He spoke at Y Combinator in August 2017. The video is available on YouTube, and the slides are on Scribd.

Google uses machine learning almost everywhere. Jeff highlighted a few of the most interesting applications, including machine translation. Recently they switched Google Translate to deep neural networks trained end-to-end. It was interesting to learn how they optimized the performance of the algorithm: they run different layers of the model on different GPUs. I had never seen that approach to parallelization before.
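
Here is a minimal sketch of that kind of layer-wise model parallelism. I'm using PyTorch rather than Google's TensorFlow stack, and the network, sizes, and two-GPU split are my own illustration, not the actual Translate pipeline:

```python
import torch
import torch.nn as nn

dev0, dev1 = "cuda:0", "cuda:1"   # assumes two GPUs are available

class TwoGPUSeq2Seq(nn.Module):
    """Toy layer-wise model parallelism: encoder layers live on one
    GPU, decoder layers on another, activations hop between them."""
    def __init__(self, vocab=32000, emb=256, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb).to(dev0)
        self.encoder = nn.LSTM(emb, hidden, num_layers=2).to(dev0)
        self.decoder = nn.LSTM(hidden, hidden, num_layers=2).to(dev1)
        self.project = nn.Linear(hidden, vocab).to(dev1)

    def forward(self, tokens):                    # tokens: (seq_len, batch)
        h, _ = self.encoder(self.embed(tokens.to(dev0)))
        h, _ = self.decoder(h.to(dev1))           # copy activations to GPU 1
        return self.project(h)
```

The point is that each GPU holds only part of the model, so the model can be bigger than any single GPU's memory, at the cost of copying activations between devices.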

They went even further and started using machine learning for performance optimization itself. A reinforcement learning model learned which devices to place tensor operations on in order to minimize computation time. The result was 19% faster computation compared to device placement by a human expert.
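
The outer loop is easy to picture: sample a placement from a policy, time the computation, and use the negative runtime as the reward. Below is a toy version in which the timing is replaced by a made-up simulator; the real system times actual TensorFlow graphs, so everything here is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
n_ops, n_devices = 6, 2
logits = np.zeros((n_ops, n_devices))         # policy: one categorical per op
baseline = 0.0

def simulated_runtime(placement):
    """Stand-in for timing a real graph on real hardware."""
    preferred = np.array([0, 0, 0, 1, 1, 1])  # made-up device affinities
    return 1.0 + 0.5 * np.sum(placement != preferred)

for step in range(2000):
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    placement = np.array([rng.choice(n_devices, p=p) for p in probs])
    reward = -simulated_runtime(placement)    # faster run = higher reward
    advantage = reward - baseline
    baseline += 0.05 * (reward - baseline)    # running-mean baseline
    for op, dev in enumerate(placement):      # REINFORCE update
        grad = -probs[op]
        grad[dev] += 1.0                      # d log(pi) / d logits
        logits[op] += 0.1 * advantage * grad
```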

A large area of research is learning to learn.

  • Current: solution = machine learning expertise + data + computation
  • Future: solution = data + 100X computation

Deep learning talent is scarce and expensive, while computation gets cheaper every year, so it makes sense to try to automate this work. Google is trying two approaches: learning model architectures and learning optimizers.

The idea behind architecture search is to use reinforcement learning: generate ten models, train them for a few hours, and use the loss of the generated models as the reinforcement learning signal. This work appeared at ICLR 2017; a preprint is on Arxiv. A model learned with this algorithm did well on the CIFAR-10 image recognition task and the Penn Treebank language modeling task. So far this approach is too computationally expensive and is feasible only for small models, but over time it will be able to handle larger ones.
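
Stripped of the controller, the outer loop looks something like the sketch below. The actual work trains a recurrent controller with policy gradients; I'm replacing it with random sampling and a fake scoring function just to show the shape of the loop:

```python
import random

SEARCH_SPACE = {"layers": [1, 2, 4, 8], "width": [32, 64, 128, 256]}

def train_child(arch):
    """Stand-in for training a child model for a few hours and
    returning its validation accuracy (here: a made-up score)."""
    return 0.9 - 0.01 * abs(arch["layers"] - 4) - 0.0005 * abs(arch["width"] - 128)

history = []
for trial in range(10):                       # "generate ten models"
    arch = {k: random.choice(v) for k, v in SEARCH_SPACE.items()}
    reward = train_child(arch)                # the RL signal for the controller
    history.append((reward, arch))

best_reward, best_arch = max(history, key=lambda t: t[0])
print(best_reward, best_arch)
```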

Google researchers also used reinforcement learning to learn optimizers. A recurrent neural network constructs a weight update function from a few primitives, a child model is trained with this optimizer, and its accuracy serves as the reinforcement learning signal. A few optimizers learned with this approach achieved better results than popular handcrafted optimizers like SGD, Momentum, Adam, and RMSProp. See the preprint on Arxiv.
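
One of the discovered update rules, as far as I can tell from the preprint, combines the gradient g with its running average m into something like w ← w − lr · e^(sign(g)·sign(m)) · g (the paper calls this family PowerSign; treat the exact form here as my reconstruction, not a quote). A NumPy sketch on a toy quadratic:

```python
import numpy as np

def powersign_step(w, grad, m, lr=0.1, beta=0.9):
    """Update rule built from simple primitives: the gradient g,
    its running average m, and their signs."""
    m = beta * m + (1 - beta) * grad
    w = w - lr * np.exp(np.sign(grad) * np.sign(m)) * grad
    return w, m

# minimize f(w) = ||w||^2 as a toy problem
w, m = np.array([3.0, -2.0]), np.zeros(2)
for _ in range(50):
    grad = 2 * w                      # analytic gradient of ||w||^2
    w, m = powersign_step(w, grad, m)
print(w)                              # close to the optimum at [0, 0]
```

The intuition: when the current gradient agrees in sign with its history, the step is scaled up (e^1); when they disagree, it is scaled down (e^-1).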

What will machine learning researchers do once all the tedious work of model architecture and hyperparameter tuning is automated away? Most likely, humans and machine learning systems will work together: humans can create handcrafted architectures and use them to narrow down the search space for automated learning. Researchers' time will also be freed up to think about which problems to solve and how to use machine learning to solve them.

More computational power is needed, and a different kind of computation. That's why Google designed the TPU, the Tensor Processing Unit: reduced precision, optimized for tensor operations. In May 2017 Google revealed the second version of the TPU, designed for both training and inference, with 180 teraflops of computation and 64 GB of memory per device. It will be available through Google Cloud later this year.
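
Reduced precision is easy to demo. TPUs use their own formats (int8 in the first version, bfloat16 in the second); stock NumPy doesn't have those, so the float16 example below is only an analogy for the memory/accuracy trade-off:

```python
import numpy as np

a = np.random.randn(512, 512).astype(np.float32)
b = np.random.randn(512, 512).astype(np.float32)

full = a @ b                                            # 32-bit matmul
half = (a.astype(np.float16) @ b.astype(np.float16)).astype(np.float32)

print("bytes per matrix:", a.nbytes, "->", a.astype(np.float16).nbytes)
print("max abs error at reduced precision:", np.abs(full - half).max())
```

Neural nets tolerate this kind of noise remarkably well, which is exactly why hardware can drop precision in exchange for more operations per second.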

Besides computation, deep learning needs large datasets. Currently, machine learning researchers build different models to solve different problems, and each model needs a lot of data to train. One way to get more data-efficient algorithms is to train a single large model that can do 1000 different things. Adding the 1001st thing then doesn't require much data, because the model can reuse the representations it already learned for the previous 1000 tasks.
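
The smallest possible illustration of that idea is a multi-task network with a shared trunk, where a new task only adds a small head. This is a PyTorch sketch of my own, not the (much larger) system Jeff has in mind:

```python
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """One shared trunk, many small task heads. Adding task N+1
    means adding one head that reuses everything the trunk learned."""
    def __init__(self, n_tasks, in_dim=128, hidden=512, out_dim=10):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.heads = nn.ModuleList(
            nn.Linear(hidden, out_dim) for _ in range(n_tasks)
        )

    def add_task(self, out_dim=10):
        # a new head starts from scratch but sits on top of
        # representations trained on all the earlier tasks
        self.heads.append(nn.Linear(self.heads[0].in_features, out_dim))

    def forward(self, x, task_id):
        return self.heads[task_id](self.trunk(x))
```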

Watch the video on YouTube.