Natural Language Pipeline for Chatbots

Chatbot developers usually use two technologies to make the bot understand the meaning of user messages: machine learning and hardcoded rules. See more details on chatbot architecture in my previous article.

Machine learning can help you identify the intent of a message and extract named entities. It is quite powerful but requires lots of data to train the model. A rule of thumb is to have around 1,000 examples per class for classification problems.

If you don’t have enough labeled data, you can handcraft rules that identify the intent of a message. Rules can be as simple as “if a sentence contains the words ‘pay’ and ‘order’, then the user is asking to pay for an order”. The simplest implementation in your favorite programming language could look like this:

def is_payment_request(message):
    # Naive substring check: both keywords must appear somewhere in the message.
    return 'pay' in message and 'order' in message

Any intent classification code can make errors of two types. False positives: the user doesn’t express an intent, but the chatbot identifies one. False negatives: the user expresses an intent, but the chatbot doesn’t find it. This simple solution will make lots of errors:

The user can use the words “pay” and “order” in different sentences: “I made an order by mistake. I won’t pay.”
A keyword is a substring of another word: “Can I use paypal for order #123?”
Spelling errors: “My orrder number is #123. How can I pay?”
Different forms of words: “How can I pay for my orders?”

Your chatbot needs a preprocessing NLP pipeline to handle typical errors. It may include these steps:

Spellcheck

Get the raw input and fix spelling errors. You can do something very simple or build a spell checker using deep learning.
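
As a very simple illustration, here is a minimal sketch that snaps each unknown word to the closest entry of a small vocabulary using Python's standard difflib module. The vocabulary, the cutoff value, and the function name correct_word are assumptions made for this example; a real bot would need a much larger word list or a proper spell checker.

```python
import difflib

# Hypothetical domain vocabulary; a real bot would use a far larger word list.
VOCABULARY = ["order", "pay", "refund", "cancel", "membership", "number"]

def correct_word(word, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary word, or the word unchanged if none is close enough."""
    lowered = word.lower()
    if lowered in vocabulary:
        return lowered
    # get_close_matches ranks candidates by similarity ratio and drops those below cutoff.
    matches = difflib.get_close_matches(lowered, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print(correct_word("orrder"))  # order
print(correct_word("paypal"))  # paypal (not similar enough to "pay" to be changed)
```

The cutoff is a trade-off: set it too low and valid words outside the vocabulary get "corrected" into something else.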

Split into sentences

It is very helpful to analyze every sentence separately. Splitting the text into sentences is easy: you can use one of the NLP libraries, e.g. NLTK, StanfordNLP, or SpaCy.
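
To show the idea without installing anything, here is a naive sketch based on a regular expression. It is not how the libraries above work: it simply breaks after ".", "!" or "?" followed by whitespace, so it will mis-split abbreviations such as "e.g." that a real sentence tokenizer handles. The function name split_sentences is made up for this example.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

print(split_sentences("I made an order by mistake. I won't pay."))
# ['I made an order by mistake.', "I won't pay."]
```

With sentences separated, a rule that requires both "pay" and "order" can be applied to each sentence on its own, which avoids the first error in the list above.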

Split into words

This is also very important because hardcoded rules typically operate on words. The same NLP libraries can do it.
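
A minimal sketch of why this matters, assuming a simple regex tokenizer instead of a library one (tokenize and contains_words are hypothetical helper names): once the rule checks whole tokens, "paypal" no longer counts as "pay".

```python
import re

def tokenize(sentence):
    """Lowercase the sentence and return its word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", sentence.lower())

def contains_words(sentence, *keywords):
    """True only if every keyword appears as a whole token in the sentence."""
    tokens = set(tokenize(sentence))
    return all(keyword in tokens for keyword in keywords)

print(contains_words("Can I use paypal for order #123?", "pay", "order"))  # False
print(contains_words("How can I pay for my order?", "pay", "order"))       # True
```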

POS tagging

Some words have multiple meanings, for example “charge” as a noun and “charge” as a verb. Knowing the part of speech can help to disambiguate the meaning. You can use the same NLP libraries, or Google SyntaxNet, which is a little more accurate and supports multiple languages.

Lemmatize words

One word can have many forms: “pay”, “paying”, “paid”. In many cases, an exact form of the word is not important for writing a hardcoded rule. If preprocessing code can identify a lemma, a canonical form of the word, it helps to simplify the rule. Lemmatization, identifying lemmas, is based on dictionaries which list all forms of every word. The most popular dictionary for English is WordNet. NLTK and some other libraries allow using it for lemmatization.
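
Here is a toy sketch of dictionary-based lemmatization. The small hand-written LEMMAS table is only a stand-in for a real dictionary such as WordNet, and the names are made up for this example.

```python
# Hand-written stand-in for a real lemma dictionary; maps word forms to lemmas.
LEMMAS = {
    "paying": "pay",
    "paid": "pay",
    "pays": "pay",
    "orders": "order",
    "ordered": "order",
}

def lemmatize(tokens, lemmas=LEMMAS):
    """Replace each token with its lemma; unknown tokens are kept as they are."""
    return [lemmas.get(token, token) for token in tokens]

print(lemmatize(["how", "can", "i", "pay", "for", "my", "orders"]))
# ['how', 'can', 'i', 'pay', 'for', 'my', 'order']
```

A rule written against the lemmas "pay" and "order" now also matches "paid" and "orders".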

Entity recognition: dates, numbers, proper nouns

Dates and numbers can be expressed in different formats: “3/1/2016”, “1st of March”, “next Wednesday”, “2016-03-01”, “123”, “one hundred”, etc. It may be helpful to convert them to a unified format before doing pattern matching. Other entities which require special treatment: locations (countries, regions, cities, street addresses, places), people, phone numbers.
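
As a small sketch of converting dates to a unified format, the code below tries a short list of formats with the standard datetime module and returns an ISO date. It only covers the two numeric formats from the examples above, and it assumes that “3/1/2016” means month/day/year. Expressions like “next Wednesday” need a dedicated date parser and are simply left unrecognized here. The names DATE_FORMATS and normalize_date are hypothetical.

```python
from datetime import datetime

# Formats tried in order. Assumption: slash-separated dates are month/day/year.
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y"]

def normalize_date(text):
    """Return the date as an ISO string (YYYY-MM-DD), or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None

print(normalize_date("3/1/2016"))        # 2016-03-01
print(normalize_date("2016-03-01"))      # 2016-03-01
print(normalize_date("next Wednesday"))  # None
```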

Find concepts/synonyms

If you want to search for a dog breed, you don’t want to list all the dog breeds in the rule, because there are hundreds of them. It helps if the preprocessing code identifies a dog breed in the message and marks the word with a special tag. Then you can just look for that tag when applying the rule.

WordNet can be used to identify common concepts. You may need to add domain-specific concept libraries, e.g. a list of drug names if you are building a healthcare bot.
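
A minimal sketch of such tagging, assuming a tiny hand-made concept list in place of WordNet or a domain library (CONCEPT_OF, tag_concepts and has_concept are names invented for this example):

```python
# Hypothetical concept dictionary: word -> concept tag.
CONCEPT_OF = {
    "beagle": "DOG_BREED",
    "poodle": "DOG_BREED",
    "labrador": "DOG_BREED",
}

def tag_concepts(tokens):
    """Pair every token with its concept tag, or None if it has no known concept."""
    return [(token, CONCEPT_OF.get(token)) for token in tokens]

def has_concept(tagged_tokens, concept):
    """A rule can check for the tag instead of listing every breed."""
    return any(tag == concept for _, tag in tagged_tokens)

tagged = tag_concepts(["do", "you", "have", "a", "poodle"])
print(has_concept(tagged, "DOG_BREED"))  # True
```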

After preprocessing is done you have a nice clean list of sentences and lists of words inside each sentence. Each word is marked with a part of speech and concepts, and you have a lemma for every word. The next step is to define patterns for intent identification.

You can invent your own pattern language using common logical operators AND, OR, NOT. The rule can look like this if you create an internal DSL (domain-specific language) based on Python:

r = Rule(
    And(
        Or('cancel', 'close'),
        'membership'),
    Respond('Would you like to cancel your membership immediately?'))
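
The classes behind such a rule can be quite small. Below is one possible sketch, not a reference implementation: it assumes that a rule is applied to the set of lemmas of a single sentence, and that Rule.apply returns the response text or None. The Not class, the apply method and the _matches helper are my additions for illustration.

```python
def _matches(pattern, words):
    """A plain string matches if it is in the word set; operators match recursively."""
    if isinstance(pattern, str):
        return pattern in words
    return pattern.matches(words)

class And:
    def __init__(self, *parts):
        self.parts = parts
    def matches(self, words):
        return all(_matches(part, words) for part in self.parts)

class Or:
    def __init__(self, *parts):
        self.parts = parts
    def matches(self, words):
        return any(_matches(part, words) for part in self.parts)

class Not:
    def __init__(self, part):
        self.part = part
    def matches(self, words):
        return not _matches(self.part, words)

class Respond:
    def __init__(self, text):
        self.text = text

class Rule:
    def __init__(self, pattern, action):
        self.pattern = pattern
        self.action = action
    def apply(self, words):
        """Return the response text if the pattern matches, otherwise None."""
        return self.action.text if _matches(self.pattern, words) else None

r = Rule(
    And(Or('cancel', 'close'), 'membership'),
    Respond('Would you like to cancel your membership immediately?'))
print(r.apply({'i', 'want', 'to', 'close', 'my', 'membership'}))
```

Because the rule is just a tree of Python objects, new operators can be added as new classes with a matches method.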

Alternatively, you can invent an external DSL, which can be more readable, but you will need extra work to create a compiler or an interpreter for that language. If you use the ChatScript language, it can look like this:

u: (<<[cancel close] membership>>)
    Would you like to cancel your membership immediately?

Do you use a chatbot engine with hardcoded rules? Have you developed your own? What issues have you encountered when building or using a chatbot engine? Please share in the comments!

Kaushik Govindarajan

Hi Pavel, the article serves as a good intro for a newbie. In response to the last part of your post, I am currently building an intent classifier and I have a rough idea of how to train on static questions that contain hardcoded text. I am having problems training on sentences with temporal data. For example, consider this sentence and its intent:
Book a table for 2 at 7pm tomorrow – restaurant_booking
Now in this case “7pm tomorrow” is an arbitrary value and a user can enter any time expression. So in this case how should I structure my training data in order to train a more accurate model?
- surmenok
  
  The intent is restaurant_booking. “7pm tomorrow” is an entity, so you should treat it as an entity recognition problem. Parsing dates and times can be quite hard because there are many ways people can express them. This article about x.ai architecture for natural language understanding may help: https://x.ai/blog/a-peek-at-x-ais-data-science-architecture/ They deal with datetime parsing a lot.
  - Kaushik Govindarajan
    
    Thanks, I am aware of the entity extraction part but I was wondering if there is a way to tell the classifier that a time expression is expected in this part of the sentence. This might significantly improve the performance of the classifier.
    - surmenok
      
      Yes. One way to do this is to perform entity recognition first, and then add features like HasTime (0 or 1), HasDate (0 or 1). Using such features together with raw text of the message can be very helpful.
      - Kaushik Govindarajan
        
        That sounds like a good approach. So while converting my words to vectors I need to add additional features for the entities. While predicting, the NER has to be done first, and based on its output I can predict the intent as well. The only thing that concerns me is that the number of features will keep increasing as my entities increase, and hence the classification algorithm may not perform well.
        Thanks
        
        surmenok
        
        Another idea. Entity recognition can be done by assigning a tag to each word (e.g. https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging) )
        If you perform this kind of entity tagging first, then you can just feed a sequence of (word, tag) pairs into the classification model instead of a sequence of words.
        This way you don’t increase the complexity of the model too much.
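
A rough sketch of this idea in Python, assuming a toy regular expression in place of a trained entity tagger. The TIME_WORD pattern and the names bio_tag and has_time are made up for illustration; has_time corresponds to the HasTime feature mentioned earlier in this thread.

```python
import re

# Toy pattern for time words; a real system would use a trained entity tagger.
TIME_WORD = re.compile(r"^(\d{1,2}(:\d{2})?(am|pm)|today|tomorrow|tonight)$")

def bio_tag(tokens):
    """Tag each token as B-TIME (begins a time expression), I-TIME (continues it) or O."""
    tagged = []
    inside = False
    for token in tokens:
        if TIME_WORD.match(token.lower()):
            tagged.append((token, "I-TIME" if inside else "B-TIME"))
            inside = True
        else:
            tagged.append((token, "O"))
            inside = False
    return tagged

def has_time(tagged):
    """Binary feature: 1 if any token belongs to a time expression, else 0."""
    return int(any(tag != "O" for _, tag in tagged))

pairs = bio_tag("Book a table for 2 at 7pm tomorrow".split())
print(pairs[-2:])       # [('7pm', 'B-TIME'), ('tomorrow', 'I-TIME')]
print(has_time(pairs))  # 1
```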