Chatbot developers usually use two technologies to make the bot understand the meaning of user messages: machine learning andhardcoded rules. See more details on chatbot architecture in my previous article.
Machine learning can help you to identify the intent of the message and extract named entities. It is quite powerful but requires lots of data to train the model. Rule of thumb is to have around 1000 examples for each class for classification problems.
If you don’t have enough labeled data then you can handcraft rules which will identify the intent of a message. Rules can be as simple as “if a sentence contains words ‘pay’ and ‘order’ then the user is asking to pay for an order”. And the simplest implementation in your favorite programming language could look like this:
def isRefundRequest(message): return 'pay' in message or 'order' in message |
Any intent classification code can make errors of two types. True positives: the user doesn’t express an intent, but the chatbot identifies an intent. False positives: the user expresses an intent, but the chatbot doesn’t find it. This simple solution will make lots of errors:
- The user can use words “pay” and “order” in different sentences: “I make an order by mistake. I won’t pay.”
- A keyword is a substring of another word: “Can I use paypal for order #123?”
- Spelling errors: “My orrder number is #123. How can I pay?”
- Different forms of words: “How can I pay for my orders?”
Your chatbot needs a preprocessing NLP pipeline to handle typical errors. It may include these steps:
- Spellcheck
Get the raw input and fix spelling errors. You can do something very simple or build a spell checker using deep learning.
- Split into sentences
It is very helpful to analyze every sentence separately. Splitting the text into sentences is easy, you can use one of NLP libraries, e.g. NLTK, StanfordNLP, SpaCy.
- Split into words
This is also very important because hardcoded rules typically operate with words. Same NLP libraries can do it.
- POS tagging
Some words have multiple meanings, for an example “charge” as a noun and “charge” as a verb. Knowing a part of speech can help to disambiguate the meaning. You can use same NLP libraries, or Google SyntaxNet, that is a little bit more accurate and supports multiple languages.
- Lemmatize words
One word can have many forms: “pay”, “paying”, “paid”. In many cases, an exact form of the word is not important for writing a hardcoded rule. If preprocessing code can identify a lemma, a canonical form of the word, it helps to simplify the rule. Lemmatization, identifying lemmas, is based on dictionaries which list all forms of every word. The most popular dictionary for English is WordNet. NLTK and some other libraries allow using it for lemmatization.
- Entity recognition: dates, numbers, proper nouns
Dates and numbers can be expressed in different formats: “3/1/2016″, “1st of March”, “next Wednesday”, “2016-03-01″, “123″, “one hundred”, etc. It may be helpful to convert them to unified format before doing pattern matching. Other entities which require special treatment: locations (countries, regions, cities, street addresses, places), people, phone numbers.
- Find concepts/synonyms
If you want to search for a breed of a dog, you don’t want to list all the dog breeds in the rule, because there are hundreds of them. It is nice if preprocessing code identified a dog breed in the message and marked the word with a special tag. Then you can just look for that tag when applying the rule.
WordNet can be used to identify common concepts. You may need to add domain specific concept libraries, e.g. a list of drug names if you are building a healthcare bot.
After preprocessing is done you have a nice clean list of sentences and lists of words inside each sentence. Each word is marked with a part of speech and concepts, and you have a lemma for every word. The next step is to define patterns for intent identification.
You can invent your own pattern language using common logical operators AND, OR, NOT. The rule can look like this if you create an internal DSL (domain-specific language) based on Python:
r = Rule( And( Or('cancel', 'close'), 'membership', Respond('Would you like to cancel your membership immediately?')) |
Alternatively, you can invent external DSL, which can be more readable, but you will need extra work to create a compiler or an interpreter for that language. If you use a ChatScript language, it can look like this:
u: (<<[cancel close] membership>>) Would you like to cancel your membership immediately?
Do you use a chatbot engine with hardcoded rules? Have your developed your own? What issues have you encountered when building or using a chatbot engine? Please share in comments!