Character-level Convolutional Networks for Text Classification

Text classification is one of the most common natural language understanding problems. Over the last few decades, machine learning researchers have been moving from the simplest “bag of words” model to more sophisticated models for text classification.

The bag-of-words model uses only information about which words appear in the text. Adding TF-IDF weighting to the bag of words helps to track how relevant each word is to the document. A bag of n-grams captures partial information about the structure of the text. Recurrent neural networks, such as LSTMs, can capture dependencies between words even when they are far apart. An LSTM learns sentence structure from raw data, but we still have to provide a list of words. The word2vec algorithm adds knowledge about word similarity, which helps a lot. Convolutional neural networks can also be applied to word-based datasets.
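For reference, here is a quick sketch of what the word-based baselines look like in code, using scikit-learn. The library choice, the toy texts, and the default settings are my own illustration, not the exact setups from the article:

```python
# Bag of words, TF-IDF, and bag of n-grams on two toy sentences.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

texts = ["the movie was great", "the movie was not great at all"]

bow = CountVectorizer().fit_transform(texts)                       # word counts only
tfidf = TfidfVectorizer().fit_transform(texts)                     # counts reweighted by rarity
ngrams = CountVectorizer(ngram_range=(1, 2)).fit_transform(texts)  # adds word pairs, i.e. some structure

print(bow.shape, tfidf.shape, ngrams.shape)
```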

The trend is to learn from raw data and to give machine learning models access to more information about text structure. A logical next step is to feed a stream of characters to the model and let it learn everything about words on its own. What could be more raw than a stream of characters? An additional benefit is that the model can learn misspellings and emoticons. The same model can also be used for different languages, even those where segmentation into words is not possible.

 

The article “Character-level Convolutional Networks for Text Classification” (Xiang Zhang, Junbo Zhao, Yann LeCun) explores the use of character-level convolutional networks (ConvNets) for text classification. The authors compare the performance of several models on several large-scale datasets.

The datasets contain from 120,000 to 3,600,000 training samples and from 2 to 14 classes. The smallest dataset is AG’s News: news articles divided into 4 classes, with 30,000 articles per class in the training set. The largest is Amazon Review Polarity: 2 polarity classes with 1,800,000 reviews each.

The character-level ConvNet was compared with traditional and state-of-the-art models: bag-of-words and its TF-IDF variant, bag-of-n-grams and its TF-IDF variant, bag-of-means on word embeddings, a word-based ConvNet, and a word-based LSTM.

The character-level ConvNet contains 6 convolutional layers and 3 fully-connected layers.
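To make the architecture concrete, here is a minimal PyTorch sketch of a 6-convolution, 3-fully-connected character-level network. The layer counts come from the article; the filter counts, kernel sizes, alphabet size, and input length are assumptions that should be checked against the paper and the Torch 7 code linked below:

```python
# A sketch of a character-level ConvNet: 6 conv layers + 3 fully-connected layers.
# Hyperparameters (256 filters, kernel sizes 7/3, 70-character alphabet, 1014-character
# input) are assumptions, not guaranteed to match the paper exactly.
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    def __init__(self, alphabet_size=70, input_length=1014, num_classes=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(alphabet_size, 256, kernel_size=7), nn.ReLU(), nn.MaxPool1d(3),
            nn.Conv1d(256, 256, kernel_size=7), nn.ReLU(), nn.MaxPool1d(3),
            nn.Conv1d(256, 256, kernel_size=3), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3), nn.ReLU(), nn.MaxPool1d(3),
        )
        # The flattened size depends on input_length; compute it once with a dummy pass.
        with torch.no_grad():
            flat = self.conv(torch.zeros(1, alphabet_size, input_length)).numel()
        self.fc = nn.Sequential(
            nn.Linear(flat, 1024), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(1024, 1024), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(1024, num_classes),
        )

    def forward(self, x):  # x: (batch, alphabet_size, input_length), one-hot characters
        return self.fc(self.conv(x).flatten(1))
```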

 

The results are quite interesting. The n-gram and n-gram TF-IDF models are the best for smaller datasets, up to several hundred thousand samples. But when the dataset size grows to several million samples, the character-level ConvNet performs better.

The ConvNet tends to work better on less curated texts. For example, it performs better on the Amazon Reviews dataset: Amazon reviews are raw user input, whereas users may write more carefully on Yahoo! Answers.

The choice of alphabet matters: the ConvNet works better when it does not distinguish between upper-case and lower-case characters.
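For illustration, here is a minimal sketch of character quantization with a lower-case-only alphabet. The exact character set and the 1014-character cutoff below are assumptions for the example, not the paper’s precise alphabet:

```python
# Character quantization: each character from a fixed alphabet becomes a one-hot
# column; characters outside the alphabet (including upper-case letters, which are
# lower-cased first) map to all-zero columns.
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+=<>()[]{}"
CHAR_TO_IDX = {c: i for i, c in enumerate(ALPHABET)}

def quantize(text, max_length=1014):
    """Encode a text as a (len(ALPHABET), max_length) one-hot matrix."""
    encoded = np.zeros((len(ALPHABET), max_length), dtype=np.float32)
    for pos, char in enumerate(text.lower()[:max_length]):
        idx = CHAR_TO_IDX.get(char)
        if idx is not None:
            encoded[idx, pos] = 1.0
    return encoded

print(quantize("Hello, world!").sum())  # counts the characters that are in the alphabet
```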

 

Another overview of this paper

Discussion on Reddit

Torch 7 code from Xiang Zhang

Keras (Theano) implementation

 

Other useful links:

Understanding Convolutional Neural Networks for NLP

A Set of Character-Based Models for Sentiment Analysis, Ad Blocking and other tasks

2016 is the Year of Chatbots

When Apple introduced the App Store in 2008, developers’ attention moved from web-based apps to native mobile apps.

A few years later the app market stabilized. Facebook, Amazon, and Google apps dominate in their verticals. Consumers don’t want to install new apps anymore. According to comScore’s mobile app report, most US smartphone owners download zero apps in a typical month, and a “staggering 42% of all app time spent on smartphones occurs on the individual’s single most used app”.

More than half of the time we spend on our phones goes to talking, texting, or email, according to Experian’s report.

Continue reading

RE-WORK Virtual Assistant Summit Presentation Notes

At the end of January, RE-WORK organized the Virtual Assistant Summit, which took place in San Francisco at the same time as the RE-WORK Deep Learning Summit.

Craig Villamor wrote a nice overview of the key topics discussed at the summit.

I didn’t attend these conferences, but I watched a few of the presentations that RE-WORK kindly uploaded to YouTube. I would like to share the notes I took while watching these videos. I could have misinterpreted something, so please keep that in mind and watch the original videos for details.

Continue reading

Deep Learning Hardware

Deep learning is computationally intensive, but model training and model querying have very different computational complexity. The query phase is fast: you apply a function to a vector of input parameters (a forward pass) and get the results.

Model training is much more intensive. Deep learning requires large training datasets to produce good results; datasets with millions of samples are common now, e.g. the ImageNet dataset contains over 1 million images. Training is an iterative process: you do a forward pass on each sample of the training set, do a backward pass to adjust the model parameters, and repeat the process several times (epochs). Thus training requires millions or even billions of times more computation than a single forward pass, and a model can have billions of parameters to adjust.
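A rough back-of-the-envelope estimate makes the gap concrete. The numbers below are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope: training cost expressed in forward-pass equivalents.
samples = 1_000_000   # roughly the size of ImageNet (assumption)
epochs = 50           # a plausible number of passes over the training set (assumption)
backward_factor = 2   # a backward pass costs roughly 2x a forward pass (assumption)

training_cost = samples * epochs * (1 + backward_factor)
print(f"~{training_cost:,} forward-pass equivalents")  # ~150,000,000
```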

Continue reading

Convolutional Neural Networks to Map Human Brain

It’s interesting that The New York Times published an article about brain cryonics, immortality, connectomics, transhumanism, and mind uploading. Kim Suozzi, who died of cancer at age 23, chose to have her brain preserved in the hope of being revived sometime in the future. One of the options is to scan the brain and map the connections between individual neurons.

“I can see within, say, 40 years that we would have a method to generate a digital replica of a person’s mind,” said Winfried Denk, a director at the Max Planck Institute of Neurobiology in Germany, who has invented one of several mapping techniques.

“The mapping technique pioneered by Dr. Denk and others involves scanning brains in impossibly thin sheets with an electron microscope. Stacked together on a computer, the scans reveal a three-dimensional map of the connections between each neuron in the tissue, the critical brain anatomy known as the connectome.”

The author doesn’t dive into the details of reconstructing a map of neuron connections, though. As Yann LeCun points out, “connectomics efforts use 3D convolutional nets to analyze the volumetric brain images and to reconstruct the neural circuits.”

As strange as it may sound, neuroscientists use artificial neural networks to reconstruct models of human neural networks. Yet another good use of deep learning techniques.

Import data from SQLite to Microsoft SQL Server

I’m trying to solve one of the recent Kaggle competitions: “ICDM 2015: Drawbridge Cross-Device Connections”. The competition provides data on device and browser usage and asks you to determine which cookies belong to the individual using a device.

The data for this competition is available in two formats: CSV files and an SQLite database. A relational database looks more suitable for ad-hoc queries because SQL is quite a powerful tool: you can easily join tables, and filter and group data. It does lack some of the statistical analysis capabilities you get in R or other tools specialized for statistics, though.

SQLite is an awesome technology for small embedded databases, but good GUI applications for querying SQLite databases are hard to come by. I was also concerned that the SQLite query execution engine is not very smart and that the SQLite dialect of SQL is not as rich as SQL Server’s or Oracle’s, so I decided to import the SQLite database into Microsoft SQL Server 2012.
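As an illustration of the general idea (not necessarily the approach described in the full post), one could copy a table row by row with a small Python script using sqlite3 and pyodbc. The connection string, file name, and table name below are placeholders, and the script assumes the target table already exists with a matching schema:

```python
# Copy one table from an SQLite file into SQL Server via ODBC.
import sqlite3
import pyodbc

src = sqlite3.connect("competition.sqlite")  # placeholder file name
dst = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=Drawbridge;Trusted_Connection=yes;"  # placeholder connection string
)

table = "cookie_basic"  # hypothetical table name
rows = src.execute(f"SELECT * FROM {table}").fetchall()

if rows:
    placeholders = ", ".join("?" * len(rows[0]))
    cur = dst.cursor()
    cur.fast_executemany = True  # speeds up bulk inserts in pyodbc
    cur.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    dst.commit()
```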

Continue reading

Data Deduplication in Relational Databases

A huge part of database-related work is making sure that the data is consistent. In the real world, data is never ideal, and whenever you need to use data from existing sources, you have to understand what is right and what is wrong there and know how to work around data quality issues. The two most frequent data integrity issues in relational databases are missing data and duplicate data. A record/document is missing if it was never written to the database by an application or was mistakenly deleted. A record/document is duplicated if it was recorded more than once.

Why would an application write the same record more than once? A user or upstream code could send the same document twice, and the application might not handle this case. Or a user could send an incorrect record the first time and a corrected one later, and the application could be designed to save all records instead of modifying existing ones.
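As a toy illustration of one common deduplication strategy, keeping only the latest version of each record, here is a small pandas sketch. The column names and data are made up for the example and are not taken from the full post:

```python
# Keep only the most recent row per business key (order_id).
import pandas as pd

records = pd.DataFrame({
    "order_id":   [1, 1, 2],
    "amount":     [100, 120, 50],  # the second row for order 1 is a later correction
    "created_at": pd.to_datetime(["2016-01-01", "2016-01-02", "2016-01-01"]),
})

deduplicated = (records
                .sort_values("created_at")
                .drop_duplicates(subset="order_id", keep="last"))
print(deduplicated)
```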

Continue reading

ReSharper Architecture Tools

JetBrains introduced the Architecture tools in ReSharper 8. They let you visualize dependencies between projects and types.

To start using them, open a solution in Visual Studio and go to the ReSharper – Architecture menu. The “Show Project Dependency Diagram” item opens a new tab with a graph of project dependencies:

(Screenshot: ReSharper project dependency diagram)

Continue reading

Software Archaeology

The history of human civilization spans thousands of years. Archaeologists investigate fragile clues left by former cultures in order to better understand the past. It is not easy, because many artifacts are partially or completely destroyed.

It would be a mistake to think that only archaeologists deal with information loss. It is just as easy to lose information about systems and organizations built quite recently, only a few years ago. In “Institutional memory and reverse smuggling”, an engineer tells the story of a petrochemical company where knowledge about the plant’s design and processes was lost after decades of operation, and a former engineer had to be brought in to smuggle the knowledge back into the company.

Continue reading

Refactoring Databases

Even if you had a perfect database design from the start, requirements have likely changed, and you have to change the database schema accordingly. Just as there is legacy code, there can be legacy data, with a schema designed for use cases that are no longer relevant.

It is usually harder to fix data design than code design, but it is doable. And as with code, you can apply refactoring techniques: improving the design without changing behavior.

The book “Refactoring Databases: Evolutionary Database Design” describes the basics of database refactoring and lists different kinds of changes to relational databases. It starts by explaining why refactoring is a good thing, why it is important to make small incremental changes, and which organizational obstacles can get in the way of implementing these techniques. You can imagine how hard it can be to make database changes if DBAs and application developers are on different teams and any database change requires coordination between them, sometimes through a strict change management process. Agile methodologies, with their focus on cross-functional teams and short development cycles, can help here.

Continue reading