Character-Based Language Model

19 Nov. 2020

By: Aryan Gulati
Have you ever wondered how Gmail's automatic reply works? Or how WhatsApp suggests the next word while you are texting? Or even how we can generate musical notes? The general way of generating a sequence of text is to train a model to predict the next word/character given all previous words/characters. Such a model is called a Statistical Language Model.

What is a statistical language model? A statistical language model is a probability distribution over sequences of words, learned from the text it is trained on. Its main task is to predict the next character given all previous characters in a sequence, i.e. it generates text character by character.

One of the simplest ways to build such a model is a Markov chain. A Markov chain is based on the principle of "memorylessness": the next state of the process depends only on the current state, not on the sequence of states that preceded it. This simple assumption makes the calculation of conditional probabilities easy and enables the algorithm to be applied in a number of scenarios. (In real-life problems, we generally use the Hidden Markov Model, a much-evolved version of the Markov chain.) A Markov chain is a stochastic process (meaning a system observed at certain times, where the observed value at each time is a random variable) containing random variables that transition from one state to another according to certain assumptions and definite probabilistic rules.

Let's derive this mathematically. Let the random process be {X_m, m = 0, 1, 2, ...}. This process is a Markov chain only if

P(X_{m+1} = j | X_m = i, X_{m-1} = i_{m-1}, ..., X_0 = i_0) = P(X_{m+1} = j | X_m = i)

for all m, j, i, i_0, i_1, ..., i_{m-1}. For a finite number of states, S = {0, 1, 2, ..., r}, this is called a finite Markov chain. Here, P(X_{m+1} = j | X_m = i) represents the probability of transitioning from state i to state j. We are assuming that the transition probabilities are independent of time, i.e. P(X_{m+1} = j | X_m = i) does not depend on the value of m. Therefore, the chain is fully summarised by the fixed transition probabilities P(X_{m+1} = j | X_m = i).

Markov Chain In Python

To run a simple weather model, I'll be using Python (a short sketch of it appears after the foresight section below). Now let's get started with coding! See this notebook for the character-based language model: https://github.com/aryangulati/Character-Based-Language-Model

⦿ Do a basic installation of Python: download it from https://www.python.org/downloads/
⦿ To practice in a Jupyter Notebook, run pip install jupyter on the command line (see https://jupyter.readthedocs.io/en/latest/install.html for basic installation).

Foresight about Markov Chains

Let's have a look at the following text:

text = "arty arts arty arts arty arts arty arts"

We can say the following:

⦿ When the current word is "arty", the next word will be "arts" 100% of the time.
⦿ When the current word is "arts", the next word will be "arty" 75% of the time, and it will be a newline the rest (25%) of the time.

That's it! We have now defined a Markov chain. From this definition of probabilities, we can now help users write similar text, or we can generate completely new texts autonomously.

⦿ Let's start with "arts".
⦿ The generator knows the probabilities of the possible next words after "arts": 75% "arty", 25% newline.
⦿ If we are writing an auto-suggesting keyboard, we can now place "arty" prominently and something like "enter" less prominently.
⦿ If we are building a text generator, we pick the next word randomly: either completely at random (50% "arty", 50% newline) or in proportion ("arty" has three times the chance of a newline). We're using the second approach here, as it generates text that is closer to the original.
⦿ Let's say we ended up with "arty". The generated string is now "arts arty".
⦿ Next, we take the last word ("arty") and repeat the process.
⦿ The probabilities for "arty" are simple (100% "arts"). If we are writing an auto-suggestion keyboard, we can even insert "arts" without asking the user (if that's the kind of domain we're in).
⦿ We now have "arts arty arts".
⦿ And so on, repeatedly.
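Before coding the text generator, here is the simple weather model promised above, as a minimal sketch. The two states and the transition probabilities are made-up numbers for illustration, not real weather data.

import random

# Hypothetical weather chain: the states and probabilities below are
# assumptions for illustration, not taken from any real dataset.
transitions = {
    "Sunny": {"Sunny": 0.8, "Rainy": 0.2},
    "Rainy": {"Sunny": 0.4, "Rainy": 0.6},
}

def simulate(start, days):
    # Walk the chain: each next state depends only on the current state.
    state = start
    sequence = [state]
    for _ in range(days - 1):
        next_states = list(transitions[state])
        weights = [transitions[state][s] for s in next_states]
        state = random.choices(next_states, weights=weights)[0]
        sequence.append(state)
    return sequence

print(simulate("Sunny", 7))  # e.g. ['Sunny', 'Sunny', 'Rainy', 'Rainy', ...]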
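And here is the foresight walkthrough above turned into code. This is a minimal word-level sketch of the idea (the linked notebook builds the full character-based model); the names chain and generate are mine, not from any library.

import random
from collections import defaultdict

text = "arty arts arty arts arty arts arty arts"

# Build the chain: record which token follows which. A newline token is
# appended so that the end of the text becomes a state as well.
tokens = text.split() + ["\n"]
chain = defaultdict(list)
for current, nxt in zip(tokens, tokens[1:]):
    chain[current].append(nxt)

# chain["arts"] is now ["arty", "arty", "arty", "\n"], so picking an
# element uniformly at random reproduces the 75%/25% proportions.
def generate(start, max_words=10):
    word = start
    output = [word]
    for _ in range(max_words - 1):
        nxt = random.choice(chain[word])
        if nxt == "\n":  # reached the end-of-text state
            break
        output.append(nxt)
        word = nxt
    return " ".join(output)

print(generate("arts"))  # e.g. "arts arty arts arty"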
Markov Chain Real-World Applications

⦿ PageRank: the algorithm used by Google Search to rank web pages in its search results. Every web page can be seen as a state, and the links or references between pages as transitions with certain probabilities. So, irrespective of your search, the chance of landing on a particular web page has a fixed probability.
⦿ Typing word prediction: Markov chains are used to predict upcoming words, e.g. in the auto-completion and suggestions of Gmail and WhatsApp.
⦿ Text generation: Markov chains can generate dummy text, produce large essays, and compile speeches. They are also used in the name generators you see on the web.
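To make the PageRank point concrete, here is a toy sketch. The three-page link graph and the resulting numbers are illustrative assumptions, not Google's actual data; only the mechanism is the real idea: pages as states, links as equally likely transitions, plus the standard damping factor.

# Toy PageRank over a hypothetical three-page web; the links are made up.
links = {
    "A": ["B", "C"],  # page A links to pages B and C
    "B": ["C"],
    "C": ["A"],
}
pages = list(links)
damping = 0.85  # the standard damping factor
rank = {p: 1 / len(pages) for p in pages}

# Power iteration: each page spreads its rank evenly over its outgoing
# links, and the ranks settle to the chain's stationary distribution.
for _ in range(50):
    new_rank = {p: (1 - damping) / len(pages) for p in pages}
    for page, outgoing in links.items():
        for target in outgoing:
            new_rank[target] += damping * rank[page] / len(outgoing)
    rank = new_rank

print(rank)  # "C" ends up highest: it is linked from both "A" and "B"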