June | 2016 | Pavel Surmenok

One of the common natural language understanding problems is text classification. Over last few decades, machine learning researchers have been moving from the simplest “bag of words” model to more sophisticated models for text classification.

Bag of words model uses only information about which words are used in the text. Adding TFIDF to the bag of words helps to track relevancy of each word to the document. Bag of n-grams enables using partial information about structure of the text. Recurrent neural networks, like LSTM, can capture dependencies between words even if they are far from each other. LSTM learns structure of sentences from the raw data, but we still have to provide a list of words. Word2vec algorithm adds knowledge about word similarity, which helps a lot. Convolutional neural networks can also help to process word-based datasets.

A trend is to learn using raw data, and provide machine learning models with an access to more information about text structure. A logical next step would be to feed a stream of characters to the model and let it learn all about the words. What can be cruder than a stream of characters? An additional benefit is that the model can learn misspellings and emoticons. Also, the same model can be used for different languages, even those where segmentation into words is not possible.

Continue reading →

Pavel Surmenok

Monthly Archives: June 2016

Character-level Convolutional Networks for Text Classification