The field of deep learning is very active: arguably there are one or two breakthroughs every week, along with a stream of research papers, industry news, startups, and investments. How do you keep up with it all?
There are a few newsletters with well-curated links and summaries:
Papers and code:
- The AI section of Arxiv.org is useful if you are looking for the latest research papers.
- GitXiv collects links to the source code accompanying deep learning papers on Arxiv.
Good regular podcasts about deep learning:
Data is the most important component of building a machine learning model. Recently, researchers from Google trained a CNN model for image classification on 300 million images and demonstrated that, even at the scale of hundreds of millions of examples, adding more data improves model performance. Apparently, more data is better. But where can you get large datasets if you are doing research on text classification?
I found references to a few large text classification datasets in the “Text Understanding from Scratch” paper by Xiang Zhang and Yann LeCun. The paper describes a character-level CNN model for text classification; the authors benchmark several CNN architectures and a few simple baseline models on these datasets. A more recent version of this paper, “Character-level Convolutional Networks for Text Classification”, contains more experimental results but omits some details on dataset usage: which fields to use, how to truncate long texts, etc. If you are looking for information about the datasets, read the older paper. If you want to learn more about character-level CNN models, read the newer one.
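To give a feel for how these models consume raw text: the character-level approach in the Zhang & LeCun papers starts by quantizing each string into a fixed-size one-hot matrix over a small character alphabet, truncating long texts and zero-padding short ones. Here is a minimal sketch of that encoding step; the alphabet below approximates the one used in the papers, and the helper name `quantize` is my own.

```python
import numpy as np

# Alphabet roughly following the char-CNN papers: lowercase letters,
# digits, punctuation, and newline. Characters outside it are ignored.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+=<>()[]{}\n"
CHAR_INDEX = {c: i for i, c in enumerate(ALPHABET)}
MAX_LEN = 1014  # texts longer than this are truncated, shorter are zero-padded


def quantize(text: str) -> np.ndarray:
    """Encode a string as a (len(ALPHABET), MAX_LEN) one-hot matrix."""
    m = np.zeros((len(ALPHABET), MAX_LEN), dtype=np.float32)
    for pos, ch in enumerate(text.lower()[:MAX_LEN]):
        idx = CHAR_INDEX.get(ch)  # out-of-alphabet chars leave an all-zero column
        if idx is not None:
            m[idx, pos] = 1.0
    return m
```

The resulting matrix is what the first 1-D convolutional layer of such a model slides over, treating characters the way an image CNN treats pixels.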
Somebody uploaded the datasets to Google Drive, so you can download them here.
If you know of other large text classification datasets, please share them in the comments to this post.