Making natural human language accessible to computer programs is the goal of the field of natural language processing (NLP). For NLP in Python, you can use the NLTK package, the Natural Language Toolkit.
A large portion of the data that you might be analysing is unstructured and contains human-readable text. Before you can analyse that data programmatically, you first need to preprocess it. This tutorial introduces the kinds of text preprocessing tasks that NLTK can perform so that you're prepared to apply them in future projects. You'll also see how to do some basic text analysis and create visualisations.
If you're comfortable with the fundamentals of using Python and want to get started with NLP, then this tutorial is for you.
An Introduction to Natural Language Processing With Python
Make sure Python is installed before doing anything else; this tutorial uses Python 3.9. If you don't already have Python installed, check out the Python 3 Installation & Setup Guide to get started.
Once that's done, the next step is to install NLTK with pip. It's recommended that you install it in a virtual environment. For more information on virtual environments, check out Python Virtual Environments: A Primer.
- You will be installing version 3.5 for this tutorial:
$ python -m pip install nltk==3.5
- You’ll also need to install NumPy and Matplotlib in order to build visualisations for named entity recognition:
$ python -m pip install numpy matplotlib
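Beyond pip packages, NLTK also relies on downloadable data files for several of the tasks in this tutorial. If a later step raises a LookupError, a small helper like the one below can fetch what's missing. This is a sketch based on an assumption about your setup: the resource names listed are the ones commonly needed for tokenizing, stop words, POS tagging, and lemmatizing, but your environment may already have them cached.

```python
# Data packages commonly needed for this tutorial's tasks.
REQUIRED_RESOURCES = ("punkt", "stopwords", "averaged_perceptron_tagger", "wordnet")

def download_nltk_data():
    """Fetch the NLTK data packages above; they're cached after the first run."""
    import nltk  # imported here so only this helper requires NLTK

    for resource in REQUIRED_RESOURCES:
        nltk.download(resource, quiet=True)
```

Call download_nltk_data() once before working through the examples below.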
- Tokenizing
Tokenizing makes it simple to split text into words or sentences. This will enable you to work with shorter passages of text that are still relatively coherent and meaningful even when read separately from the rest of the text. It's the first step in structuring unstructured data so that it can be analysed more easily.
When you're analysing text, you'll be tokenizing by word and by sentence. Here's what both forms of tokenization bring to the table:
- Tokenizing by word: Words are the building blocks of natural language. They're the smallest unit of meaning that still makes sense on its own. Tokenizing your text word by word allows you to identify words that come up particularly often. For instance, if you were analysing a group of job ads, you might find that the word "Python" comes up often. That could suggest a high demand for Python knowledge, but you'd need to look deeper to learn more.
- Tokenizing by sentence: By tokenizing by sentence, you can examine the relationships between the words and gain a deeper understanding of the sentence’s context. Does the hiring manager dislike Python because there are many unfavourable words surrounding the word “Python”? Are there more terms from the field of herpetology than from the field of software development, indicating that you might be dealing with a completely different type of python than you anticipated?
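The job-ad idea in the first bullet can be sketched with nothing but the standard library. The postings below are made-up examples, and splitting on whitespace is only a stand-in for real tokenization:

```python
from collections import Counter

# Hypothetical job-posting snippets -- illustrative data only.
postings = [
    "Senior Python developer wanted",
    "Python and SQL experience required",
    "We use Python for data pipelines",
]

# Lowercase and split each posting, then count word frequencies.
words = [word.casefold() for post in postings for word in post.split()]
print(Counter(words).most_common(1))  # [('python', 3)]
```

Counting frequent words like this is a first hint about a corpus; NLTK's tokenizers, introduced next, handle punctuation and contractions that plain split() misses.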
Here's how to import the relevant parts of NLTK so that you can tokenize by word and by sentence:
>>> from nltk.tokenize import sent_tokenize, word_tokenize
You can now create a string to tokenize after importing the items you required. An appropriate quote from Dune is as follows:
>>> example_string = """
… Muad’Dib learned rapidly because his first training was in how to learn.
… And the first lesson of all was the basic trust that he could learn.
… It’s shocking to find how many people do not believe they can learn,
... and how many more believe learning to be difficult."""
You can use sent_tokenize() to split up example_string into sentences:
>>> sent_tokenize(example_string)
["Muad'Dib learned rapidly because his first training was in how to learn.",
'And the first lesson of all was the basic trust that he could learn.',
"It's shocking to find how many people do not believe they can learn, and how many more believe learning to be difficult."]
Tokenizing example_string by sentence gives you a list of three strings that are sentences:
- Muad'Dib learned rapidly because his first training was in how to learn.
- And the first lesson of all was the basic trust that he could learn.
- It's shocking to find how many people do not believe they can learn, and how many more believe learning to be difficult.
Now try tokenizing example_string by word:
>>> word_tokenize(example_string)
You got a list of strings that NLTK considers to be words, such as "Muad'Dib", "training", and "how". But the following strings were also considered to be words: "'s" and ",", along with the other punctuation marks.
Notice how "It's" was split at the apostrophe to give you "It" and "'s", while "Muad'Dib" was left whole? This happened because NLTK knows that "It" and "'s" (a contraction of "is") are two distinct words, so it counted them separately. But "Muad'Dib" isn't a recognised contraction like "It's", so it wasn't read as two separate words and was left intact.
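This contraction and punctuation handling is exactly what a naive str.split() misses. Here's a quick standard-library-only comparison, using a line from the quote above, so you can see what NLTK's tokenizer adds:

```python
quote = "It's shocking to find how many people do not believe they can learn!"

# str.split() only cuts on whitespace, so contractions stay whole
# and punctuation stays glued to the neighbouring word.
naive_tokens = quote.split()
print(naive_tokens[0])   # It's   -- not split into "It" and "'s"
print(naive_tokens[-1])  # learn! -- the exclamation mark isn't separated
```

NLTK's word_tokenize() would instead emit "It", "'s", and a separate "!" token, which is what makes downstream filtering and counting reliable.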
- Remove Stop Words
Stop words are words you want to ignore, so when you’re processing text, you filter them out. Common words like “in,” “is,” and “an” are frequently used as stop words because they don’t add much meaning to a text on their own.
To remove stop words, import the relevant parts of NLTK like this:
>>> from nltk.corpus import stopwords
>>> from nltk.tokenize import word_tokenize
Here's a quote from Worf that you can filter:
>>> worf_quote = "Sir, I protest. I am not a merry man!"
Now tokenize worf_quote by word and store the resulting list in words_in_quote:
>>> words_in_quote = word_tokenize(worf_quote)
['Sir', ',', 'I', 'protest', '.', 'I', 'am', 'not', 'a', 'merry', 'man', '!']
Now that you have a list of the words in worf_quote, the next step is to create a set of stop words to filter words_in_quote. For this example, you'll focus on stop words in "english":
>>> stop_words = set(stopwords.words("english"))
Next, create an empty list to hold the words that survive the filter:
>>> filtered_list = []
You created an empty list, filtered_list, to hold all the words in words_in_quote that aren't stop words. Now you can use stop_words to filter words_in_quote:
>>> for word in words_in_quote:
...    if word.casefold() not in stop_words:
...         filtered_list.append(word)
You iterated over words_in_quote with a for loop and added all the words that weren't stop words to filtered_list. You used .casefold() on word so that you could ignore whether the letters in word were uppercase or lowercase. This is worth doing because stopwords.words("english") includes only lowercase versions of stop words.
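As a brief standard-library aside on why .casefold() rather than .lower(): casefolding performs more aggressive Unicode case folding, which matters if your text isn't plain ASCII. The German sharp s is the classic example:

```python
# For ASCII text, casefold() and lower() agree.
print("NOT".casefold())     # not
print("NOT".lower())        # not

# For some Unicode characters they differ: the German sharp s
# casefolds to "ss", so "Straße" and "STRASSE" compare equal.
print("Straße".lower())     # straße
print("Straße".casefold())  # strasse
```

For the English examples in this tutorial the two behave identically, but .casefold() is the safer habit for caseless matching.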
Alternatively, you could use a list comprehension to make a list of all the words in your text that aren't stop words:
>>> filtered_list = [
...     word for word in words_in_quote if word.casefold() not in stop_words
... ]
You don’t make an empty list and then add items to the end of it when you use a list comprehension. Instead, you define both the list and its items simultaneously. It’s common to think of using a list comprehension as more Pythonic.
Check out the words that made it into the filtered list:
['Sir', ',', 'protest', '.', 'merry', 'man', '!']
A few words, such as “am” and “a,” were filtered out, but you also filtered out “not,” which has an impact on the sentence’s overall meaning. Worf won’t take kindly to this.
Depending on the type of analysis you want to conduct, words like “I” and “not” might seem too significant to filter out. This is why:
- A pronoun, such as “I,” is a context word rather than a content word:
- Content words inform you of the subjects discussed in the text or the author’s feelings on those subjects.
- Context words give you insight into writing style. You can observe patterns in how authors use context words in order to quantify their writing style. Once you've quantified a known author's writing style, you can analyse a text written by an unknown author to see how closely it follows that style, in an attempt to identify the author.
- Even though the word “not” is technically an adverb, it is still on the NLTK list of English stop words. You can download the list of stop words and edit it to remove the word “not” or to add other words.
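If keeping negations matters for your analysis, you can shrink the stop word list before filtering. The sketch below uses a small hand-written stand-in for NLTK's English stop word list, so the exact words in stop_words are illustrative rather than the real list:

```python
# A small stand-in for NLTK's English stop word list -- illustrative only.
stop_words = {"i", "am", "a", "not", "the", "is", "in", "an"}

# Remove the negations you want to keep from the stop list.
negations = {"not", "no", "nor"}
trimmed_stop_words = stop_words - negations

words_in_quote = ["I", "am", "not", "a", "merry", "man"]
filtered = [w for w in words_in_quote if w.casefold() not in trimmed_stop_words]
print(filtered)  # ['not', 'merry', 'man'] -- the negation survives
```

The same set subtraction works on the real set(stopwords.words("english")), since it's just a Python set of strings.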
- Stemming
Stemming is a text processing task in which you reduce words to their root, which is the core part of the word. For example, the words "helping" and "helper" share the root "help". Stemming allows you to zero in on a word's basic meaning rather than all the details of how it's being used. NLTK has more than one stemmer, but you'll be using the Porter stemmer.
To start stemming, import the relevant parts of NLTK like this:
>>> from nltk.stem import PorterStemmer
>>> from nltk.tokenize import word_tokenize
Once you’ve finished importing, use PorterStemmer() to create a stemmer:
>>> stemmer = PorterStemmer()
The next step is to create a string to stem. Here's one you can use:
>>> string_for_stemming = """
… The crew of the USS Discovery discovered many discoveries.
... Discovering is what explorers do."""
You must first separate every word in that string before you can stem it:
>>> words = word_tokenize(string_for_stemming)
Now that you have a list of all the tokenized words from the string, take a look at what's in words:
['The', 'crew', 'of', 'the', 'USS', 'Discovery', 'discovered', 'many', 'discoveries', '.', 'Discovering', 'is', 'what', 'explorers', 'do', '.']
Using stemmer.stem() in a list comprehension, make a list of the words in words that have been stemmed:
>>> stemmed_words = [stemmer.stem(word) for word in words]
Take a look at what's in stemmed_words:
['the', 'crew', 'of', 'the', 'uss', 'discoveri', 'discov', 'mani', 'discoveri', '.', 'discov', 'is', 'what', 'explor', 'do', '.']
Here's what happened to all the words that start with "discov" or "Discov":
|Original Word|Stemmed Version|
|'Discovery'|'discoveri'|
|'discovered'|'discov'|
|'discoveries'|'discoveri'|
|'Discovering'|'discov'|
|'discovery'|'discoveri'|
Those results look a little inconsistent. Why would "discovery" give you "discoveri" when "discovering" gives you "discov"?
There are two ways stemming can go wrong: understemming and overstemming.
Understemming happens when two related words should be reduced to the same stem but aren't. This is a false negative.
Overstemming happens when two unrelated words are reduced to the same stem even though they shouldn't be. This is a false positive.
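Both failure modes are easy to reproduce with a deliberately crude suffix-stripping stemmer. naive_stem below is a made-up illustration of the idea, not how the Porter algorithm actually works:

```python
def naive_stem(word):
    """Strip the first matching suffix -- a deliberately crude stemmer."""
    for suffix in ("ity", "ies", "ing", "ed", "e", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

# Overstemming: two unrelated words collapse to the same stem.
print(naive_stem("universe"), naive_stem("university"))  # univers univers

# Understemming: two related words fail to share a stem.
print(naive_stem("running"), naive_stem("ran"))  # runn ran
```

Real stemmers use much more careful rules than this, but they still make both kinds of mistakes, which is why the Porter results above look erratic.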
The Porter stemming algorithm dates from 1979, so it's a little on the older side. The Snowball stemmer, also known as Porter2, is an improvement on the original and is also available through NLTK, so you can use that one in your own projects. It's also worth noting that the purpose of the Porter stemmer is not to produce complete words but to find variant forms of a word.
Fortunately, there are other techniques you can use, like lemmatizing, which you’ll see later in this tutorial, to distil words to their essence. We must first discuss parts of speech, though.
- Tagging Parts of Speech
Part of speech is a grammatical concept that refers to the functions that words perform when they are combined in sentences. The process of labelling the words in your text with their appropriate part of speech is known as POS tagging.
Some sources also include the category articles (like "a" or "the") in the list of parts of speech, but other sources consider them to be adjectives. NLTK uses the word determiner to refer to articles.
To tag parts of speech, import the pertinent NLTK components as follows:
>>> from nltk.tokenize import word_tokenize
Now create some text to tag. You can use this Carl Sagan quote:
>>> sagan_quote = """
… If you wish to make an apple pie from scratch,
... you must first invent the universe."""
Use word_tokenize to separate the words in that string and store them in a list:
>>> words_in_sagan_quote = word_tokenize(sagan_quote)
Now call nltk.pos_tag() on your new list of words:
>>> import nltk
>>> nltk.pos_tag(words_in_sagan_quote)
All the words in the quote are now in a separate tuple, with a tag that represents their part of speech. But what do the tags mean? Here's how to get a list of tags and their meanings:
>>> nltk.help.upenn_tagset()
You can use the following summary to get familiar with NLTK’s POS tags:
|Tags that start with|Deal with|
|JJ|Adjectives|
|NN|Nouns|
|RB|Adverbs|
|PRP|Pronouns|
|VB|Verbs|
Now that you know what the POS tags mean, you can see that your tagging was fairly successful:
- "Pie" was tagged NN because it's a singular noun.
- "You" was tagged PRP because it's a personal pronoun.
- "Invent" was tagged VB because it's the base form of a verb.
However, how would NLTK deal with tagging the parts of speech in a text that is essentially unintelligible? Although technically meaningless, the nonsense poem Jabberwocky is written in a way that allows English speakers to derive some sort of meaning from it.
Create a string to hold this poem’s passage:
>>> jabberwocky_excerpt = """
... 'Twas brillig, and the slithy toves did gyre and gimble in the wabe:
... all mimsy were the borogoves, and the mome raths outgrabe."""
Use word_tokenize to separate the words in the excerpt and store them in a list:
>>> words_in_excerpt = word_tokenize(jabberwocky_excerpt)
Now call nltk.pos_tag() on your new list of words:
>>> nltk.pos_tag(words_in_excerpt)
Words like “and” and “the,” which are commonly used in English, were correctly classified as conjunctions and determiners, respectively. Given the context of the poem, a human English speaker would likely also interpret the nonsense word “slithy” as an adjective. Well done, NLTK!
- Lemmatizing
Now that you're familiar with the parts of speech, you can circle back to lemmatizing. Like stemming, lemmatizing reduces words to their core meaning, but it will give you a complete English word that makes sense on its own instead of just a fragment of a word like "discoveri".
To start lemmatizing, import the relevant parts of NLTK like this:
>>> from nltk.stem import WordNetLemmatizer
Make a lemmatizer and use it:
>>> lemmatizer = WordNetLemmatizer()
Let's start by lemmatizing a plural noun:
>>> lemmatizer.lemmatize("scarves")
'scarf'
"scarves" gave you "scarf", so that's already a bit more sophisticated than what you would have gotten with the Porter stemmer, which is "scarv". Next, create a string with more than one word to lemmatize:
>>> string_for_lemmatizing = "The friends of DeSoto love scarves."
Tokenize that string now, word by word:
>>> words = word_tokenize(string_for_lemmatizing)
Here's your list of words:
['The', 'friends', 'of', 'DeSoto', 'love', 'scarves', '.']
Make a list of every word in words after it has been lemmatized:
>>> lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
Here's the list you get:
['The', 'friend', 'of', 'DeSoto', 'love', 'scarf', '.']
That seems correct. The words “friends” and “scarves,” which were previously plural nouns, are now singular nouns.
But what would happen if you lemmatized a word that looked very different from its lemma? Try lemmatizing "worst":
>>> lemmatizer.lemmatize("worst")
'worst'
You got the result 'worst' because lemmatizer.lemmatize() assumed that "worst" was a noun by default. You can make it clear that you want "worst" to be an adjective:
>>> lemmatizer.lemmatize("worst", pos="a")
'bad'
You added the parameter pos="a" to make sure that "worst" was treated as an adjective; the default parameter for pos is "n" for noun. You got "bad", which looks very different from your original word and is nothing like what you'd get from stemming. This is because "worst" is the superlative form of the adjective "bad", and lemmatizing reduces superlatives as well as comparatives to their lemmas.
Since you know how to use NLTK to tag parts of speech, you can try tagging your words before lemmatizing them to avoid mixing up homographs, which are words that are spelled the same but have different meanings and can be different parts of speech.
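One common way to combine the two steps is to translate each Penn Treebank tag produced by nltk.pos_tag() into the single-letter pos value that WordNetLemmatizer.lemmatize() accepts. penn_to_wordnet is a hypothetical helper name introduced here for illustration; the tag-prefix mapping itself follows the POS tag summary above:

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to WordNetLemmatizer's pos= letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun is the lemmatizer's default

# e.g. pos_tag gives ("worst", "JJS"), so you'd lemmatize with pos="a".
print(penn_to_wordnet("JJS"))  # a
print(penn_to_wordnet("VBD"))  # v
```

With this helper, you could lemmatize each (word, tag) pair as lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)), so "worst" tagged JJS would correctly reduce to "bad".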
You now have access to a whole new world of unstructured data to explore. Now that you've covered the fundamentals of text analytics tasks, you can go find some texts to analyse and see what you can learn about the texts themselves, their authors, and the topics they're about.
You know how to:
- Find a text to analyse
- Prepare your text for analysis
- Analyse your text
- Create visualisations based on your analysis