Large Text Classification Datasets

Data is the most important component in building a machine learning model. Researchers from Google recently trained a CNN image classification model on 300 million images and demonstrated that, even at a scale of hundreds of millions of examples, adding more data improves model performance. Evidently, more data is better. But where can you get large datasets if you are doing research on text classification?

I found good references to a few large text classification datasets in the paper “Text Understanding from Scratch” by Xiang Zhang and Yann LeCun. The paper describes a character-level CNN model for text classification. The authors provide benchmarks of different CNN architectures and a few simple models on several datasets. A more recent version of this paper, “Character-level Convolutional Networks for Text Classification”, contains more experimental results but omits some details on dataset usage: which fields to use, how to truncate long texts, etc. If you are looking for information about the datasets, read the older paper. If you want to learn more about character-level CNN models, read the newer one.

Somebody uploaded the datasets to Google Drive, so you can download them here.

If you know of other large text classification datasets, please share them in the comments to this post.

Kaggle Competition: Intel & MobileODT Cervical Cancer Screening

I started looking at Kaggle competitions to practice my machine learning skills. One of the currently running competitions is framed as an image classification problem. Intel partnered with MobileODT to start a Kaggle competition to develop an algorithm that identifies a woman’s cervix type based on images.

The training set contains 1481 images split into three types. Kagglers can use 6734 additional images, but some of them come from duplicate patients and some are of lower quality. Test sets for the two stages of the competition are available. For each image in the test set, Kagglers have to submit a set of predicted probabilities, one for each of the 3 classes. The total prize pool is $100,000.
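As a rough illustration of the submission described above (one probability per class for every test image), here is a minimal sketch. The column names `image_name`, `Type_1`, `Type_2` and `Type_3` and the helper `build_submission` are assumptions made for this example, so check the competition page for the exact file format.

```python
import csv
import io

# Assumed class column names; the real submission header may differ.
CLASS_NAMES = ["Type_1", "Type_2", "Type_3"]


def normalize(scores):
    """Scale non-negative scores so that they sum to 1."""
    total = sum(scores)
    if total <= 0:
        # No usable signal: fall back to a uniform distribution.
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]


def build_submission(predictions):
    """Build CSV text from a dict of image name -> list of 3 class scores."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["image_name"] + CLASS_NAMES)
    for name in sorted(predictions):
        probs = normalize(predictions[name])
        writer.writerow([name] + [f"{p:.6f}" for p in probs])
    return buffer.getvalue()


if __name__ == "__main__":
    # Made-up scores for two hypothetical test images.
    print(build_submission({"0.jpg": [0.2, 0.5, 0.3], "1.jpg": [3, 1, 0]}))
```

Normalizing before writing keeps every row a valid probability distribution even if the model outputs raw scores.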

I tried to approach the problem in a naïve way: just get a pre-trained Inception V3 image classification model and fine-tune it on this dataset.


NIPS 2016 Reviews

NIPS (the Conference and Workshop on Neural Information Processing Systems) is a machine learning and computational neuroscience conference held every December. It was first proposed in 1986, and for a long time it was a small conference. Interest in NIPS increased significantly when deep learning started demonstrating great results in image recognition, speech recognition, and many other areas. Last year NIPS had 2500+ papers submitted and 5000+ people in attendance.


2016 Review

I treat this blog as my lab journal. I will keep posting random thoughts along with better-written articles on particular topics. This post is an attempt to summarize my activity in some areas in 2016 (see the overview of ML and AI in 2016 in general in my previous post).

Machine Learning and Artificial Intelligence in 2016

2016 was an interesting year. The AI winter is over, but this time AI is almost a synonym for deep learning. Major technology companies (Google, Microsoft, Facebook, Amazon and Apple) announced new products and services built using machine learning. DeepMind’s AlphaGo beat the world champion in Go. Salesforce bought MetaMind to build a deep learning lab. Apple promised to open up its deep learning research.


Deep Learning for Spell Checking

I use spell checking every day. It is built into word processors, browsers, and smartphone keyboards. It helps to create better documents and makes communication clearer. More sophisticated spell checkers can also find grammatical and stylistic errors.

How do you add spell checking to your application? A very simple implementation by Peter Norvig is just 22 lines of Python code.

In the article “Deep Spelling”, Tal Weiss wrote that he tried to use this code and found it to be slow. The code is slow because it brute-forces all possible combinations of edits to the original text.
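To see why brute-forcing edits gets expensive, here is a minimal sketch in the spirit of that approach. It is not Norvig's exact code: the function names `edits1` and `edits2` and the lowercase-only alphabet are assumptions made for this illustration.

```python
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Return all strings one edit away from word.

    An edit is a deletion, a transposition of adjacent characters,
    a replacement, or an insertion.
    """
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def edits2(word):
    """Return all strings that are two edits away from word."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}


if __name__ == "__main__":
    word = "speling"
    print(len(edits1(word)), "candidates at one edit")
    print(len(edits2(word)), "candidates at two edits")
```

A word of length n produces up to 54n + 25 candidates at one edit (n deletions, n - 1 transpositions, 26n replacements and 26(n + 1) insertions), and applying the edits a second time multiplies that number again. Each candidate then has to be looked up in a dictionary, which is where the time goes.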

An interesting approach is to use deep learning. Just create an artificial dataset by adding spelling errors to correct English texts. And you had better have lots of text! The author used the one-billion-word dataset released by Google. Then train a character-level sequence-to-sequence model with LSTM layers to convert text with spelling errors into correct text. Tal got very interesting results. Read the article for details.
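The dataset creation step can be sketched as follows. This is only an illustration under stated assumptions: the four error types, the 5% error rate, and the helper names `add_noise` and `make_pairs` are mine, not details taken from the “Deep Spelling” article.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def add_noise(text, error_rate, rng):
    """Return a copy of text with random character-level spelling errors."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if rng.random() < error_rate:
            kind = rng.choice(["delete", "insert", "replace", "transpose"])
            if kind == "delete":
                pass  # drop the character
            elif kind == "insert":
                out.append(rng.choice(ALPHABET))
                out.append(ch)
            elif kind == "replace":
                out.append(rng.choice(ALPHABET))
            elif i + 1 < len(text):
                # transpose: swap this character with the next one
                out.append(text[i + 1])
                out.append(ch)
                i += 1
            else:
                # transpose requested on the last character: keep it as is
                out.append(ch)
        else:
            out.append(ch)
        i += 1
    return "".join(out)


def make_pairs(sentences, error_rate=0.05, seed=0):
    """Build (noisy input, clean target) pairs for a seq2seq model."""
    rng = random.Random(seed)
    return [(add_noise(s, error_rate, rng), s) for s in sentences]


if __name__ == "__main__":
    for noisy, clean in make_pairs(["the quick brown fox jumps over the lazy dog"], 0.1):
        print(noisy, "->", clean)
```

The noisy string is the model input and the clean string is the target. A fixed seed keeps the generated dataset reproducible between runs.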

Good-quality spell checkers can be very useful for chatbots. Most chatbots rely on simple NLP techniques, and a typical NLP pipeline includes syntax analysis and part-of-speech tagging, which can easily break if the input message is not grammatically correct or has spelling errors. Perhaps fixing spelling errors earlier in the NLP pipeline can improve the accuracy of natural language understanding.

It would be worth trying to train such a spell checker model on another, more conversational dataset.

Have you tried to use any models like this in your apps?

Intelligence Platform Stack

The machine intelligence field has been growing at breakneck speed since 2012. That year Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton achieved the best result in image classification on the LSVRC-2010 ImageNet dataset using convolutional neural networks. It’s amazing that end-to-end training of a deep neural network worked better than sophisticated computer vision systems with handcrafted feature engineering pipelines that researchers had been refining for decades.

Since then, the deep learning field has attracted the attention of machine learning researchers, software engineers, entrepreneurs, venture investors, and even artists and musicians. Deep learning algorithms have surpassed human-level performance in image classification and conversational speech recognition, and have won a Go match against an 18-time world champion. New applications of deep learning emerge every day, and tons of research papers are published. It’s hard to keep up. We live in a very interesting time, and the future is already here.
