Getting Started with Text Generation Using Markov Chains

Oluwatomisin📈📉
4 min read · May 16, 2022

INTRO

The concept behind Markov chains is that each event in a sequence depends only on the event that came immediately before it, not on the whole history. In text terms: the next word is predicted from the current word alone.

Markov chains have many applications, including speech recognition, genetics, and web page ranking, but in our case we will use them for text generation, as the title implies.

Our aim in this project is to generate comedic text in the style of a comedian of our choosing, based on routines by that comedian.

STRATEGIES

NB: If you are familiar with my previous two articles, the starting methodology is almost the same. In NLP, almost every project you work on starts with these steps:

  1. Loading the dataset
  2. Data cleaning and preprocessing
  3. Creating the Markov chains
  4. Building our text generator

LOADING THE DATASET

Using the famous pandas library, we will load our dataset into our working environment. The dataset we will be working with contains transcripts of comedy routines by twelve different comedians, and I will be choosing my favorite among them. The dataset is in pickle format (.pkl); don't worry if you don't know what a pickle file is, pandas handles it for you. I recommend reading the pandas documentation; I have put a link to it in the resources section.

Looking at the dataset, we can see that the comedians' names have been set as the index, so getting the comedian we want is already simple: we just have to use the DataFrame's loc accessor (by name) or iloc (by position).

NB: Any dataset loaded by pandas is returned as a DataFrame.
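As a rough sketch of what this looks like (the file name corpus.pkl, the transcript column, and the comedian Ali Wong are placeholders; substitute whatever your copy of the dataset uses):

import pandas as pd

# Load the pickled corpus; pandas reads .pkl files directly.
data = pd.read_pickle('corpus.pkl')  # placeholder file name

# The comedians' names are the index, so loc selects a routine by
# name; iloc would select the same row by its integer position.
routine = data.loc['Ali Wong', 'transcript']
print(routine[:200])  # peek at the first 200 characters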

DATA CLEANING AND PREPROCESSING

Merely looking at the text, we can see that cleaning is required, the most obvious issue being the square brackets enclosing the audience's reactions. Punctuation will also be removed, and the text will be converted to lowercase so that “Yes” and “yes” are treated the same. Let’s create a function to perform this task for us.
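Here is a minimal sketch of such a function (it assumes the routine variable from the loading step; the regular expressions are one reasonable way to do this, not the only one):

import re
import string

def clean_text(text):
    # Lowercase so "Yes" and "yes" are treated the same.
    text = text.lower()
    # Drop audience reactions such as [laughter] or [applause].
    text = re.sub(r'\[.*?\]', '', text)
    # Strip punctuation characters (after the bracketed spans are gone,
    # since the brackets themselves are punctuation).
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    return text

clean_routine = clean_text(routine)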

You might be wondering why a common step in NLP text cleaning has been skipped: the removal of stop words. When working with sequences of text, stop-word removal is not needed, because a word still depends on the previous word even if that predecessor is a stop word; keeping stop words also makes the generated text read more naturally. Now to the most interesting part:

BUILDING THE MARKOV MODEL

What we are trying to do here is make every unique word in the text corpus a key in a dictionary, and then append to each key every word that appears immediately after it in the text. This way, the more often a word occurs after a specific word, the more likely it is to be selected as the next word when we build our text generator.

We use defaultdict to mitigate the problem of appending to a key that does not yet exist in the dictionary. For a deeper understanding of defaultdict, I have linked an article in the resources section below that explains it well.
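A sketch of the chain-building step (the function name is mine; the logic is exactly as described above):

from collections import defaultdict

def markov_chain(text):
    words = text.split()
    # defaultdict(list) gives every new key an empty list, so we can
    # append without first checking whether the key exists.
    chain = defaultdict(list)
    for current_word, next_word in zip(words[:-1], words[1:]):
        chain[current_word].append(next_word)
    # Repeated followers stay in the list, so frequent transitions are
    # proportionally more likely to be sampled later.
    return dict(chain)

chain = markov_chain(clean_routine)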

TEXT GENERATOR

Here, using the Markov chains, we will generate text. We will use the random library to choose a random word from those that occur after the current word.
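A sketch of the generator, assuming the chain built above (the default word count and the final capitalization are stylistic choices of mine):

import random

def generate_sentence(chain, count=15):
    # Seed with a random word from the chain's keys.
    word = random.choice(list(chain.keys()))
    sentence = [word]
    for _ in range(count - 1):
        followers = chain.get(word)
        if not followers:  # dead end: this word only ever ended the text
            break
        # Sample the next word; frequent followers are more likely
        # because they appear more often in the list.
        word = random.choice(followers)
        sentence.append(word)
    return ' '.join(sentence).capitalize() + '.'

print(generate_sentence(chain))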

Below, we can see the texts generated using the Markov chains. Read through them and let me know what you think.

Final Remarks:

If you have gotten to this point, kindly follow my page; I will be uploading articles relating to NLP every week. If you have an NLP technique you want me to write about, let me know in the comments. Thanks so much, and see you on the other side.

Resources:

More on defaultdict: https://medium.com/swlh/python-collections-defaultdict-dictionary-with-default-values-and-automatic-keys-305540540d2a

Source code on my GitHub: https://github.com/oluwatomsin/Text_Generation

Pandas documentation: https://pandas.pydata.org/docs/getting_started/intro_tutorials/01_table_oriented.html

Subscribe to my page: https://amusatomisin65.medium.com/subscribe


Oluwatomisin📈📉

I am a data scientist. I have been developing my skills for more than two years, most especially in machine learning. I will be sharing what I have learned with you.