Natural Language Processing: A Beginner's Introduction
Originally written on: 11th May, 2019.
Natural Language Processing
There are a few concepts that are absolutely essential for NLP; only a few of them are discussed in this blog post.
This blog consists of a few parts:
- Stemming and lemmatization
- TF-IDF in NLP
- Cosine similarity
Stemming
Stemming algorithms work by cutting off the end or the beginning of a word, taking into account a list of common prefixes and suffixes that can be found in an inflected word. This indiscriminate cutting can be successful on some occasions, but not always, which is why this approach has some limitations.
Stemming refers to a crude heuristic process that chops off the ends of words in the hope of reducing related forms to a common base correctly most of the time, and it often includes the removal of derivational affixes.
Overstemming and Understemming
- However, because stemming is usually based on heuristics, it is far from perfect.
- In fact, it commonly suffers from two issues in particular: overstemming and understemming.
- Overstemming occurs when too much of a word is cut off. This can result in nonsensical stems, where all the meaning of the word is lost or muddled, or in words being resolved to the same stem even though they probably should not be.
- Take the four words university, universal, universities, and universe. A stemming algorithm that resolves these four words to the stem “univers” has overstemmed. While it might be nice to have universal and universe stemmed together and university and universities stemmed together, all four do not fit. A better resolution might have the first two resolve to “univers” and the latter two resolve to “universi.” But enforcing rules that make that so might result in more issues arising.
- Understemming is the opposite issue. It occurs when we have several words that actually are forms of one another. It would be nice for them all to resolve to the same stem, but unfortunately, they do not.
- This can be seen if we have a stemming algorithm that stems the words data and datum to “dat” and “datu.” You might be thinking, well, just resolve these both to “dat.” However, then what do we do with date? Is there a good general rule, or are we just enforcing a very specific rule for a very specific example?
Those questions quickly become issues
when it comes to stemming. Enforcing new rules and heuristics can quickly get
out of hand.
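To make this concrete, here is a minimal sketch using NLTK’s PorterStemmer on the example words above (the exact stems can vary slightly between NLTK versions, so treat the behaviour noted in the comments as typical rather than guaranteed):

```python
# pip install nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Overstemming: Porter typically collapses all four of these to the same stem.
for word in ["university", "universal", "universities", "universe"]:
    print(word, "->", stemmer.stem(word))

# Understemming: related forms (data/datum) that do NOT end up with the same stem.
for word in ["data", "datum", "date"]:
    print(word, "->", stemmer.stem(word))
```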
Stemming Algorithm Examples
Two stemming algorithms that are worth discussing are the Porter stemmer and the Snowball stemmer from NLTK. While I won’t go into a lot of detail about either, I will highlight a little bit about them so that you can know even more than I did when I first started using them.
- Porter stemmer: This stemming algorithm is an older one. It’s from the 1980s and its main concern is removing the common endings of words so that they can be resolved to a common form. It’s not too complex and development on it is frozen. Typically, it’s a nice basic stemmer to start with, but it’s not really advised for any production/complex application. Instead, it has its place in research as a nice, basic stemming algorithm that can guarantee reproducibility. It is also a very gentle stemming algorithm compared to others.
- Snowball stemmer: This algorithm is also known as the Porter2 stemming algorithm. It is almost universally accepted as better than the Porter stemmer, even being acknowledged as such by the individual who created the Porter stemmer. That being said, it is also more aggressive than the Porter stemmer. A lot of the things added to the Snowball stemmer were because of issues noticed with the Porter stemmer. There is about a 5% difference in the way that Snowball stems versus Porter.
- Lancaster stemmer: Just for fun, the Lancaster stemming algorithm is another one you can use. It is the most aggressive stemming algorithm of the bunch. However, if you use the stemmer in NLTK, you can add your own custom rules to this algorithm very easily, which makes it a good choice for that. One complaint about this stemming algorithm is that it is sometimes overly aggressive and can transform words into strange stems. Just make sure it does what you want before you go with this option!
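As a quick, rough comparison, here is a small sketch that runs the three NLTK stemmers side by side on a handful of sample words (the word list is my own, and outputs may differ slightly between NLTK versions):

```python
# pip install nltk
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")     # the "Porter2" algorithm described above
lancaster = LancasterStemmer()            # the most aggressive of the three

words = ["running", "generously", "university", "organization", "maximum"]

print(f"{'word':<15}{'Porter':<15}{'Snowball':<15}{'Lancaster':<15}")
for w in words:
    print(f"{w:<15}{porter.stem(w):<15}{snowball.stem(w):<15}{lancaster.stem(w):<15}")
```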
Lemmatization
Lemmatization, on the other hand, takes into consideration the morphological analysis of the words. To do so, it is necessary to have detailed dictionaries which the algorithm can look through to link an inflected form back to its lemma. You can see how it works with the same example words as before: for instance, “universities” is linked back to the lemma “university” instead of being chopped to “univers”.
Lemmatization refers to doing things properly, with the use of a vocabulary and morphological analysis of words.
An important difference to highlight is that a lemma is the base form of all its inflectional forms, whereas a stem isn't necessarily a real word at all.
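A minimal sketch of lemmatization with NLTK’s WordNetLemmatizer (it needs the WordNet data downloaded, the expected outputs in the comments assume a standard English WordNet installation, and passing the part of speech often matters):

```python
# pip install nltk
import nltk
nltk.download("wordnet", quiet=True)        # the dictionary the lemmatizer looks words up in

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("universities"))      # expected: university
print(lemmatizer.lemmatize("geese"))             # expected: goose
print(lemmatizer.lemmatize("running", pos="v"))  # expected: run (the POS tag matters)
```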
TF-IDF in Natural Language Processing
What is TF-IDF? In this section, we’ll be exploring a text mining method called TF-IDF. TF-IDF, which stands for term frequency–inverse document frequency, is a statistic that measures how important a term is relative to a document and to a corpus, a collection of documents. The TF-IDF of a term is given by the equations:
- TF-IDF(term) = TF(term in a document) * IDF(term)
- TF(term) = (# of times the term appears in the document) / (total # of terms in the document)
- IDF(term) = log(total # of documents / # of documents containing the term)
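Here is a minimal sketch that implements these three formulas literally in plain Python (the toy corpus is made up for illustration, and I use the natural log; production libraries usually apply smoothed variants of IDF):

```python
import math

def tf(term, document):
    # Term frequency: occurrences of `term` divided by the total number of terms.
    words = document.lower().split()
    return words.count(term) / len(words)

def idf(term, corpus):
    # Inverse document frequency: log(total docs / docs containing the term).
    containing = sum(1 for doc in corpus if term in doc.lower().split())
    return math.log(len(corpus) / containing)   # assumes the term occurs somewhere

def tf_idf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)

# A made-up corpus of three tiny "blog posts".
corpus = [
    "python makes building games in python fun",
    "the weather is nice today",
    "the cricket match was exciting",
]

print(tf_idf("python", corpus[0], corpus))   # relatively high: rare in the corpus, frequent in doc 0
print(tf_idf("the", corpus[1], corpus))      # low: appears in most documents
```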
To explain TF-IDF,
let’s walk through a concrete example. Say you are sifting through some blog
posts about building games in Python.
- In post #1, the word “Python” appears once in five pages. In post #2, “Python” appears dozens of times in two pages. The term frequency (TF) measures how often a term shows up in a single document. If a term appears relatively frequently in a document, it is most likely an important term. In our example, the word “Python” appears quite frequently in post #2, so it would have a high term frequency, and we may infer that it is an important word in post #2. On the other hand, “Python” appears only once in post #1, so it would have a low term frequency. Observe that dividing by the length of the document allows us to compare documents of different lengths fairly.
- Now suppose the word Python doesn’t appear in any of our other blog posts. We draw the conclusion that the word Python isn’t very relevant to most of our blog posts, but very relevant to post #2, where it appeared dozens of times. The inverse document frequency (IDF) tells us how important a term is to a collection of documents. A good example of how IDF comes into play is the word “the.” We know that just about every document contains “the,” so the term isn’t really special anymore, thereby producing a very low IDF. Now let’s contrast “the” with “Python” in our example. “Python” appears rarely in the other posts, so its IDF should be high. In fact, “Python” now carries a weight signalling that, in any document in which it appears, it is important to that document.
TF-IDF is useful when we are concerned with whole words in our text. A cool usage of TF-IDF is in automated blog post tagging, where running TF-IDF on every blog post helps single out words that can be used as tags. We can also use TF-IDF vectors as features for machine learning, for example to power recommendations.
When we multiply TF and IDF, we observe that the larger the number, the more important the term is to that document. We can then compute the TF-IDF for each word in each document and create a vector.
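In practice you would rarely build these vectors by hand. A sketch using scikit-learn’s TfidfVectorizer (note that scikit-learn applies a smoothed IDF and L2-normalises each vector by default, so its numbers differ slightly from the raw formulas above; get_feature_names_out requires a reasonably recent scikit-learn):

```python
# pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "python makes building games in python fun",
    "the weather is nice today",
    "the cricket match was exciting",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)    # one TF-IDF vector per document

print(vectorizer.get_feature_names_out())          # the vocabulary: one dimension per word
print(tfidf_matrix.toarray().round(2))             # dense view of the document vectors
```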
Cosine Similarity
A commonly used approach to match
similar documents is based on counting the maximum number of common words
between the documents.
But this approach has an inherent flaw: as the size of the documents increases, the number of common words tends to increase even if the documents talk about different topics.
Cosine similarity helps overcome this fundamental flaw in the ‘count-the-common-words’ or Euclidean distance approach.
What is Cosine Similarity and why is it advantageous?
- Cosine similarity is a metric used to determine how similar two documents are irrespective of their size.
- Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space. In this context, the two vectors I am talking about are arrays containing the word counts of two documents.
- As a similarity metric, how does cosine similarity differ from the number of common words?
- When plotted in a multi-dimensional space, where each dimension corresponds to a word in the document, the cosine similarity captures the orientation (the angle) of the documents and not the magnitude. If you want the magnitude, compute the Euclidean distance instead.
- Cosine similarity is advantageous because even if two similar documents are far apart by Euclidean distance because of their size (say, the word ‘cricket’ appears 50 times in one document and 10 times in another), they can still have a small angle between them. The smaller the angle, the higher the similarity.
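For reference, and in the same plain style as the TF-IDF formulas above, the cosine similarity of two count vectors A and B (this is the “math formula” referred to in the example below) is:
- cos(θ) = (A · B) / (||A|| * ||B||), i.e. the dot product of the two vectors divided by the product of their lengths (magnitudes).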
Cosine Similarity Example
Let’s suppose you have 3 documents based on a couple of star cricket players, Sachin Tendulkar and Dhoni. Doc A (on Dhoni) and Doc B (on Sachin) are taken from the players’ Wikipedia pages, and the third document, Doc C, is a smaller snippet from Dhoni’s Wikipedia page.
As you can see, all three documents
are connected by a common theme – the game of Cricket.
Our objective is to quantitatively
estimate the similarity between the documents.
For ease of understanding, let’s
consider only the top 3 common words between the documents: ‘Dhoni’, ‘Sachin’
and ‘Cricket’.
You would expect Doc A and Doc C, that is, the two documents on Dhoni, to have a higher similarity than Doc A and Doc B, because Doc C is essentially a snippet from Doc A itself.
However, if we go by the number of common words, the two larger documents will have the most words in common and will therefore be judged most similar, which is exactly what we want to avoid.
The results would be more congruent
when we use the cosine similarity score to assess the similarity.
Let me explain.
Let’s project the documents into a 3-dimensional space, where each dimension is the frequency count of one of the three words: ‘Sachin’, ‘Dhoni’ or ‘Cricket’. When plotted in this space, Doc Dhoni_Small and the main Doc Dhoni are oriented close together, even though they are far apart by magnitude, while the Sachin document points in a noticeably different direction.
It turns out that the closer the documents are by angle, the higher the cosine similarity (cos θ).
As you include more words from the documents, it becomes harder to visualize the higher-dimensional space, but you can still compute the cosine similarity directly using the formula given above.
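Here is a small sketch of that computation with NumPy; the three count vectors are made-up numbers purely for illustration, with dimensions ordered as [‘Sachin’, ‘Dhoni’, ‘Cricket’]:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (A . B) / (||A|| * ||B||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical word counts for ['Sachin', 'Dhoni', 'Cricket'] in each document.
doc_a = np.array([10, 120, 80])   # Doc A: Dhoni (Wikipedia page)
doc_b = np.array([130, 15, 90])   # Doc B: Sachin (Wikipedia page)
doc_c = np.array([2, 30, 20])     # Doc C: Dhoni_Small (snippet of Doc A)

print("A vs B:", round(cosine_similarity(doc_a, doc_b), 3))
print("A vs C:", round(cosine_similarity(doc_a, doc_c), 3))   # highest, as expected
print("B vs C:", round(cosine_similarity(doc_b, doc_c), 3))
```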
We will be discussing more on NLP in the next blog post.
Credits/citations
1) Analytics India
2) Analytics India Mag
3) Stanford.edu -> NLP
4) Machine Learning Plus.