There are a number of ways to implement chatbots. Organisations like API.ai, Gupshup, and Recast provide easy-to-use APIs for building intelligent bots. All of these bots use what is known as a rule-based model. With a rule-based model, a developer defines a set of rules which the bot follows while responding to queries. The most popular way of implementing a rule-based model is AIML (Artificial Intelligence Markup Language), an XML-based language which allows developers to define patterns and templates. Whenever the bot recognizes a pattern, it responds with the corresponding template.
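To make this concrete, here is a minimal, hypothetical AIML category (the pattern and template are made up for illustration): when the user's input matches the pattern, the bot replies with the template.

```xml
<!-- A minimal, hypothetical AIML rule -->
<category>
  <pattern>HELLO BOT</pattern>
  <template>Hello! How can I help you today?</template>
</category>
```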
Though it is easy enough for anyone to use AIML and make their own chatbot, it is extremely difficult to cover all the possible cases the bot might encounter. The bot will eventually fail to respond when it is tested with an unknown pattern.
So how else can we create chatbots? What about a bot which can learn from actual human conversations? This is where machine learning comes into play.
In traditional programming, to solve a problem we spell out the solution step by step. For example, if you were programming an app to distinguish between different types of chairs, you would probably write code like:
```python
# Hand-written rules: one function per feature we would check explicitly
def number_of_legs():
    do_something()

def type_of_material():
    do_something()

...
```
But in machine learning, we define the problem statement and then tell our program: here are a bunch of images of different types of chairs, learn which one is which on your own. So in machine learning we do not define the steps to solve a particular problem; we let the computer learn the steps by itself. Sometimes we may not even know what some of those steps are. Let's leave this here. I'll be coming up with a proper series of posts to get started with machine learning!
Types of models
Let us get back to chatbots. There are different types of models which can be used in a machine-learning based bot. They can be broadly classified into two types:
- Retrieval-based models
- Generative models
Retrieval-based models choose a response from a predefined collection of responses based on the query. They do not generate any new sentences, so we don't need to worry about grammar.
Generative models, on the other hand, are smarter: they generate a response word by word based on the query. Because of this, the generated responses are prone to grammatical errors.
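To make the distinction concrete, here is a tiny sketch of a retrieval-based bot (not the project's code; the query–response pairs and the `respond` helper are made up). It simply picks the stored response whose query is most similar to the user's query, so it never produces a new sentence:

```python
# A toy retrieval-based bot: pick the canned response whose stored query
# best matches the user's query (TF-IDF + cosine similarity).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical query -> response pairs
pairs = [
    ("hi how are you", "Hello. I am fine."),
    ("what is your name", "I am a demo bot."),
    ("bye", "Goodbye!"),
]

vectorizer = TfidfVectorizer()
query_matrix = vectorizer.fit_transform([q for q, _ in pairs])

def respond(user_query):
    # Score the user's query against every stored query and return the
    # response attached to the best match.
    scores = cosine_similarity(vectorizer.transform([user_query]), query_matrix)
    return pairs[scores.argmax()][1]

print(respond("hey, how are you?"))  # -> "Hello. I am fine."
```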
Sequence to Sequence (seq2seq)
The sequence-to-sequence model is based on the encoder-decoder architecture from Learning Phrase Representations using RNN Encoder-Decoder. Basically, it consists of two RNNs (Recurrent Neural Networks): an encoder and a decoder. The encoder takes a sequence (sentence) as input and processes one symbol (word) at a time.
It converts the sequence of symbols into a fixed-size vector that encodes only the important information in the sequence, discarding the unnecessary parts. This is made possible by LSTMs (Long Short-Term Memory units), which can control the information passing through them using gates. A simple LSTM can be visualized as follows:
In simple terms, the LSTM uses its forget gate to throw away unnecessary information, allowing only the necessary information to flow forward. Christopher Olah has written an article which explains the working of an LSTM properly; you can read it here.
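For reference, these are the standard LSTM update equations at a single time step, in the same notation Olah uses: σ is the sigmoid function, x_t the input, h_t the hidden state and C_t the cell state.

```latex
f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)            % forget gate
i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)            % input gate
\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)     % candidate cell state
C_t = f_t * C_{t-1} + i_t * \tilde{C}_t                 % new cell state
o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)            % output gate
h_t = o_t * \tanh(C_t)                                  % new hidden state
```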
A seq2seq model can be represented as follows:
Here, every block you see is an LSTM. Each hidden state influences the next hidden state, and the final hidden state can be seen as the summary of the sequence. This state is called the context or thought vector, as it represents the intention of the sequence. From the context, the decoder generates another sequence, one symbol (word) at a time. At each time step, the decoder is influenced by the context and by the previously generated symbols.
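To give a feel for how the encoder and decoder fit together, here is a minimal sketch using tf.keras. This is not the project's code: the vocabulary size, embedding size and hidden size are placeholders, and during training the decoder would be fed the target response shifted by one position (teacher forcing).

```python
from tensorflow.keras import layers, Model

vocab_size = 20000   # assumed vocabulary size
embed_dim = 128      # assumed embedding size
hidden_units = 256   # assumed LSTM state size

# Encoder: reads the query one token at a time; its final LSTM state is the
# context (thought vector) summarising the sequence.
encoder_inputs = layers.Input(shape=(None,), name="encoder_tokens")
enc_emb = layers.Embedding(vocab_size, embed_dim)(encoder_inputs)
_, state_h, state_c = layers.LSTM(hidden_units, return_state=True)(enc_emb)

# Decoder: generates the response one token at a time, conditioned on the
# context (encoder state) and the previously generated tokens.
decoder_inputs = layers.Input(shape=(None,), name="decoder_tokens")
dec_emb = layers.Embedding(vocab_size, embed_dim)(decoder_inputs)
dec_out, _, _ = layers.LSTM(hidden_units, return_sequences=True,
                            return_state=True)(dec_emb,
                                               initial_state=[state_h, state_c])
probs = layers.Dense(vocab_size, activation="softmax")(dec_out)

model = Model([encoder_inputs, decoder_inputs], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```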
Tokenizing a sentence and padding
Before the dataset is usable, we have to tokenize the sentences and pad them to a fixed length. There are four special symbols we need to understand before we dive deeper:
- EOS : End of sentence
- PAD : Filler
- GO : Start decoding
- UNK : Unknown; word not in vocabulary
To understand padding, consider a small conversation.
> Hi! How are you?
= Hello. I am fine.
Here, if we consider a fixed length of 10, the sentences would each be padded to 10 elements. The above conversation would look like:
> ["PAD", "PAD", "PAD", "PAD", "?", "you", "are", "How", "!", "Hi"]
= ["GO", "Hello", ".", "I", "am", "fine", ".", "PAD", "PAD", "PAD"]
Bucketing
Padding works well when all the sentences in the dataset have roughly the same number of words. Now consider the above conversation when the longest sentence in our dataset is 100 words long: with a fixed length of 100, the padded query would contain 94 "PAD" symbols, and the few meaningful words would be buried in padding.
Bucketing solves this problem by putting sentences into buckets of different sizes. Consider this list of buckets: [(5, 10), (10, 15), (20, 25), (40, 50)]. If a query is 4 tokens long and its response is 4 tokens long, the pair goes into the bucket (5, 10): the query is padded to length 5 and the response to length 10. While running the model (training or predicting), we use a different model for each bucket, compatible with that bucket's query and response lengths. All these models share the same parameters and hence function exactly the same way.
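A simple sketch of how a (query, response) pair might be assigned to a bucket; the `pick_bucket` helper is made up for illustration:

```python
buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]

def pick_bucket(query_len, response_len):
    # Return the first (smallest) bucket that fits both sequences.
    for q_max, r_max in buckets:
        if query_len <= q_max and response_len <= r_max:
            return (q_max, r_max)
    raise ValueError("sentence pair is too long for every bucket")

print(pick_bucket(4, 4))    # -> (5, 10): pad query to 5, response to 10
print(pick_bucket(18, 22))  # -> (20, 25)
```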
Word Embedding
Word Embedding is a technique for the dense representation of words in a low-dimensional vector space. Each word can be seen as a point in this space, represented by a fixed-length vector. A word embedding would look like the following:
The word embedding in our project, with a vocabulary of 20,000 words in a 3-dimensional space, would look like:
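As a toy illustration (with made-up numbers, not the project's actual embedding), an embedding can be thought of as a lookup table mapping each word in the vocabulary to a dense vector:

```python
import numpy as np

vocab = {"PAD": 0, "GO": 1, "EOS": 2, "UNK": 3, "hi": 4, "hello": 5}
embedding_dim = 3
# Each row is the dense vector (a point in the embedding space) for one word.
embedding_matrix = np.random.uniform(-1.0, 1.0, size=(len(vocab), embedding_dim))

def embed(word):
    return embedding_matrix[vocab.get(word, vocab["UNK"])]

print(embed("hi"))      # a 3-dimensional vector
print(embed("zebra"))   # unknown word -> the UNK vector
```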
Code Explanation
The code for DeepConversations will be explained in the next post. Follow the blog for updates!
Fork the Source
The code for a seq2seq model implementation has been open-sourced and can be found on my GitHub profile. Fork the source here.