Building a spell checker using deep learning is a great idea. After reading Tal Weiss’s article about his character-based model for spell checking, I wanted to run his code, see how well it works for real applications, and work on improvements. This became my fun one-month side project for January 2017.
What’s it all about
The goal of the project is an algorithm that can correct spelling errors in English text. If it works with decent accuracy, I will use it in an NLP pipeline for chatbots: I hope that spelling correction can improve the accuracy of downstream tasks in the pipeline.
The original DeepSpell code by Tal Weiss is on GitHub. The code reads 286 MB of news text from a file that is part of a billion word dataset. The preprocessing pipeline cleans the text, splits it into lines, injects artificial spelling errors, and breaks every line into chunks no longer than 40 characters. The result is a set of question/answer pairs, where a question is a string with spelling errors and an answer is the same string in its original, correctly spelled form. The dataset is split into a training set and a dev (validation) set in a 90/10 proportion. Then an LSTM sequence-to-sequence model is trained to predict the original text from the text with typos.
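To make the error-injection step concrete, here is a minimal sketch of what such a pipeline stage can look like. The function names, error types, and error rate are my own illustration, not the actual DeepSpell code:

```python
import random

MAX_CHUNK_LEN = 40
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def add_noise(line, error_rate=0.05):
    """Inject artificial typos: random deletions, insertions,
    substitutions, and transpositions of characters."""
    chars = list(line)
    i = 0
    while i < len(chars):
        if random.random() < error_rate:
            op = random.choice(["delete", "insert", "substitute", "transpose"])
            if op == "delete":
                del chars[i]
                continue  # the next character shifted into position i
            elif op == "insert":
                chars.insert(i, random.choice(ALPHABET))
                i += 1
            elif op == "substitute":
                chars[i] = random.choice(ALPHABET)
            elif op == "transpose" and i + 1 < len(chars):
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
        i += 1
    return "".join(chars)

def make_pairs(lines):
    """Yield (question, answer) pairs of chunks no longer than 40 characters."""
    for line in lines:
        for start in range(0, len(line), MAX_CHUNK_LEN):
            answer = line[start:start + MAX_CHUNK_LEN]
            yield add_noise(answer), answer
```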
Docker
The first challenge is to install all the required software on my machine. As you know, installing a bunch of dependencies to run a deep learning algorithm is often a painful and error-prone process. I simplified it a bit using Docker. When you package an application in a Docker container, all dependencies are nicely isolated, and you don’t need to worry that the code requires a different version of Python/Theano/Keras/CUDA than the one already installed on your machine.
NVIDIA Docker can be used to run Docker containers with GPU access.
TensorFlow
Tal wrote that he chose Theano as a back-end because TensorFlow is painful to install and slow. I decided to run the code with TensorFlow anyway, because Docker makes it easy to do. Google provides an official TensorFlow Docker image, and there is a separate image for running computations on a GPU, so both CPU and GPU are supported.
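Switching the back-end requires no changes to the model code. As a minimal illustration (this is the standard Keras 1.x mechanism, not anything DeepSpell-specific):

```python
import os

# The backend can also be set in the "backend" field of
# ~/.keras/keras.json; the environment variable takes
# precedence if both are present.
os.environ["KERAS_BACKEND"] = "tensorflow"

from keras import backend as K
print(K.backend())  # prints "tensorflow"
```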
Build script
The billion word dataset is a 10.5 GB archive. Downloading this file from inside a Docker container would be inefficient, because every time you instantiate a container it would download 10 gigabytes again. Instead, I wrote a build script which downloads the archive, extracts the one file used for training, and places it in a directory available to the Docker build.
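Here is a rough Python sketch of the idea behind that script (the original may just as well be a shell script; the URL and file paths below are illustrative placeholders, not necessarily the exact ones used):

```python
import os
import tarfile
import urllib.request

# Placeholder values: point these at the actual billion word
# benchmark archive and the training file inside it.
ARCHIVE_URL = "http://www.statmt.org/lm-benchmark/benchmark.tar.gz"
ARCHIVE = "benchmark.tar.gz"
TRAINING_FILE = "training-monolingual/news.en-00001-of-00100"
DEST_DIR = "data"  # a directory visible to `docker build`

# Download the archive only once, outside of any container.
if not os.path.exists(ARCHIVE):
    urllib.request.urlretrieve(ARCHIVE_URL, ARCHIVE)

# Extract just the one file needed for training.
with tarfile.open(ARCHIVE, "r:gz") as tar:
    tar.extract(TRAINING_FILE, path=DEST_DIR)
```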
Performance
The next challenge is to make the code work on my machine. The first time I ran it on my 8 GB MacBook Pro, I hit an out-of-memory problem: the Python process consumed 6.9 GB of memory and crashed with a “Killed” message. The cause was that the code tried to load the entire dataset into memory and run all pre-processing steps on it at once.
I rewrote the pre-processing code to clean each line separately. Vectorization also has to process each batch separately, and the model training code has to work with a batch generator instead of the full X and y matrices.
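A minimal sketch of the generator approach, assuming a compiled Keras `model` and a hypothetical `vectorize_batch` helper that one-hot encodes a list of question/answer pairs (the names and numbers are placeholders; the call signature matches the Keras 1.x versions mentioned below):

```python
def batch_generator(pairs, batch_size, vectorize_batch):
    """Yield (X, y) matrices one batch at a time, so the full
    one-hot-encoded dataset never has to fit in memory."""
    while True:  # Keras expects generators to loop forever
        batch = []
        for pair in pairs:
            batch.append(pair)
            if len(batch) == batch_size:
                yield vectorize_batch(batch)
                batch = []

# Keras 1.x arguments: samples_per_epoch/nb_epoch rather than
# the steps_per_epoch/epochs used in later versions.
model.fit_generator(
    batch_generator(train_pairs, 128, vectorize_batch),
    samples_per_epoch=25600,
    nb_epoch=10,
    validation_data=batch_generator(dev_pairs, 128, vectorize_batch),
    nb_val_samples=2560)
```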
Logging and checkpoints
Keras allows applying callbacks at certain points during model training. You can use callbacks, for example, to save intermediate training results to a CSV file or to write logs for TensorBoard. Keras also lets you save model checkpoints in the middle of the training process.
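Attaching these callbacks takes only a few lines. The callback classes below are standard Keras 1.x; the file names and paths are just examples:

```python
from keras.callbacks import CSVLogger, ModelCheckpoint, TensorBoard

callbacks = [
    # Append per-epoch loss and accuracy to a CSV file
    CSVLogger("training_log.csv"),
    # Write logs that TensorBoard can visualize
    TensorBoard(log_dir="./logs"),
    # Save weights whenever the monitored metric improves
    ModelCheckpoint("checkpoints/weights.{epoch:02d}.hdf5",
                    save_best_only=True),
]

# Passed to model.fit(...) or model.fit_generator(...)
# via the callbacks=callbacks argument.
```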
Concerns
Spelling errors in the dataset are artificial, so the model learns to fix artificial typos. Will it work as well on real typos? I haven’t checked that yet.
The dataset is based on news articles. Language in the news is different from language on Reddit or Twitter. Will the model work well on text from other sources?
The model trains on strings of up to 40 characters, which is often only part of a sentence. Is that enough context? Would it work better if trained on whole sentences?
To do
Currently, the DeepSpell code runs on Keras 1.1.2 and TensorFlow 0.12-rc1. I tried to upgrade to Keras 1.2.1 and TensorFlow 1.0.0-rc1, but model compilation fails with that configuration. It needs investigation; maybe something changed in the Keras or TensorFlow API.
Code
See my changes on GitHub.
If you are doing any research in this area, please let me know!