Natural Language Processing: A Beginner's Introduction

Originally written on: 11th May, 2019.


Natural Language Processing

There are a few concepts that are absolutely essential for NLP. Only a few of them are discussed in this blog post.


This blog post consists of a few parts:
  1. Stemming and lemmatization
  2. TF-IDF in NLP
  3. Cosine similarity
Stemming
Stemming algorithms work by cutting off the end or the beginning of the word, taking into account a list of common prefixes and suffixes that can be found in an inflected word. This indiscriminate cutting can be successful on some occasions, but not always, which is why this approach has some limitations.



Stemming refers to a crude heuristic process that chops off the ends of words in the hope of reducing inflected words to a common base form correctly most of the time, and it often includes the removal of derivational affixes.
Overstemming and Understemming
  1. However, because stemming is usually based on heuristics, it is far from perfect.
  2. In fact, it commonly suffers from two issues in particular: overstemming and understemming.
  3. Overstemming occurs when too much of a word is cut off. This can result in nonsensical stems, where the meaning of the word is lost or muddled, or in words being resolved to the same stem even though they probably should not be.
  4. Take the four words university, universal, universities, and universe. A stemming algorithm that resolves these four words to the stem “univers” has overstemmed. While it might be nice to have universal and universe stemmed together and university and universities stemmed together, all four do not fit. A better resolution might have the first two resolve to “univers” and the latter two resolve to “universi.” But enforcing rules that make that so might result in more issues arising.
  5. Understemming is the opposite issue. It occurs when several words that really are forms of one another do not resolve to the same stem, even though it would be nice if they did.
  6. This can be seen if a stemming algorithm stems the words data and datum to “dat” and “datu.” You might be thinking, well, just resolve both to “dat.” But then what do we do with date? Is there a good general rule, or are we just enforcing a very specific rule for a very specific example?
Those questions quickly become issues when it comes to stemming. Enforcing new rules and heuristics can quickly get out of hand.
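To see both failure modes in practice, here is a minimal sketch using NLTK's PorterStemmer; the exact stems you get depend on the stemmer (and its version), so treat the printed output as illustrative rather than definitive.

```python
# Illustrating overstemming and understemming with NLTK's Porter stemmer.
# Requires: pip install nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Potential overstemming: distinct concepts may collapse to one stem.
for word in ["university", "universal", "universities", "universe"]:
    print(word, "->", stemmer.stem(word))

# Potential understemming: related forms may end up with different stems.
for word in ["data", "datum", "date"]:
    print(word, "->", stemmer.stem(word))
```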
Stemming Algorithm Examples

Two stemming algorithms that are worth discussing are the Porter stemmer and the Snowball stemmer from NLTK. While I won’t go into a lot of detail about either, I will highlight a little bit about them so that you can know even more than I did when I first started using them.


  1. Porter stemmer: This stemming algorithm is an older one. It dates from the 1980s, and its main concern is removing the common endings of words so that they can be resolved to a common form. It’s not too complex, and development on it is frozen. Typically, it’s a nice, basic stemmer to start with, but it’s not really advised for production or complex applications. Instead, it has its place in research as a simple, well-understood stemming algorithm that guarantees reproducibility. It is also a very gentle stemming algorithm compared to others.
  2. Snowball stemmer: This algorithm is also known as the Porter2 stemming algorithm. It is almost universally accepted as better than the Porter stemmer, even being acknowledged as such by the individual who created the Porter stemmer. That said, it is also more aggressive than the Porter stemmer. A lot of what was added to the Snowball stemmer addresses issues noticed with the Porter stemmer. There is roughly a 5% difference in the way Snowball stems versus Porter.
  3. Lancaster stemmer: Just for fun, the Lancaster stemming algorithm is another one you can use. It is the most aggressive stemmer of the bunch. However, if you use it through NLTK, you can add your own custom rules to it very easily, which makes it a good choice when you need that. One complaint about this algorithm is that it is sometimes overly aggressive and can transform words into strange stems, so make sure it does what you want before you go with this option.
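If you want to compare them side by side, the sketch below runs all three NLTK stemmers over a handful of words; the exact outputs will differ between the gentler Porter stemmer and the more aggressive Snowball and Lancaster stemmers.

```python
# Comparing the Porter, Snowball and Lancaster stemmers from NLTK.
# Requires: pip install nltk
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")   # a.k.a. the Porter2 algorithm
lancaster = LancasterStemmer()

for word in ["running", "generously", "friendship", "maximum"]:
    print(f"{word:12} porter={porter.stem(word):10} "
          f"snowball={snowball.stem(word):10} lancaster={lancaster.stem(word)}")
```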


Lemmatization

Lemmatization, on the other hand, takes into consideration the morphological analysis of the words. To do so, it needs detailed dictionaries that the algorithm can look through to link an inflected form back to its lemma.


Lemmatization refers to doing things properly, with the use of a vocabulary and morphological analysis of words.

An important difference to highlight is that a lemma is the base form of all of its inflectional forms, whereas a stem isn't.
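As a contrast with the stemmers above, here is a minimal lemmatization sketch using NLTK's WordNetLemmatizer, which looks words up in the WordNet dictionary (the expected outputs in the comments assume the standard WordNet data has been downloaded).

```python
# Lemmatization with NLTK's WordNet-based lemmatizer.
# Requires: pip install nltk
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)   # one-time download of the dictionary
nltk.download("omw-1.4", quiet=True)   # needed by some NLTK versions

lemmatizer = WordNetLemmatizer()

# The part-of-speech tag matters: the same surface form can map to
# different lemmas depending on whether it is a noun, verb or adjective.
print(lemmatizer.lemmatize("studies", pos="n"))  # expected: study
print(lemmatizer.lemmatize("running", pos="v"))  # expected: run
print(lemmatizer.lemmatize("better", pos="a"))   # expected: good
```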




TF-IDF in Natural Language Processing
What is TF-IDF? In this part of the blog post, we’ll be exploring a text-mining method called TF-IDF. TF-IDF, which stands for term frequency-inverse document frequency, is a statistic that measures how important a term is relative to a document and to a corpus (a collection of documents). The TF-IDF of a term is given by the equations:
  1. TF-IDF(term) = TF(term in a document) * IDF(term)
  2. TF(term) = (# of times the term appears in the document) / (total # of terms in the document)
  3. IDF(term) = log(total # of documents / # of documents containing the term)
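To make the formulas concrete, here is a minimal Python sketch that implements them directly on a tiny made-up corpus (the three example documents below are hypothetical and exist only for illustration).

```python
# A direct, from-scratch implementation of the TF and IDF formulas above.
import math

docs = [
    "python makes building games fun".split(),
    "python python games in python".split(),
    "board games and card games".split(),
]

def tf(term, doc):
    # term frequency: occurrences of the term / total terms in the document
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # inverse document frequency: log(total docs / docs containing the term)
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

for i, doc in enumerate(docs, start=1):
    print(f"doc {i}: tf-idf('python') = {tf_idf('python', doc, docs):.3f}")
```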
To explain TF-IDF, let’s walk through a concrete example. Say you are sifting through some blog posts about building games in Python.

  1. In post #1, the word “Python” appears once in five pages. In post #2, “Python” appears dozens of times in two pages. The term frequency (TF) measures how often a term shows up in a single document. If a term appears relatively frequently in a document, it is most likely an important term. In our example, the word “Python” appears quite frequently in post #2, so it would have a high term frequency, and we may infer that it is an important word in post #2. On the other hand, “Python” appears only once in post #1, so it would have a low term frequency. Note that dividing by the length of the document lets us compare documents of different lengths fairly.
  2. Now suppose the word Python doesn’t appear in any of our other blog posts. We draw the conclusion that the word Python isn’t very relevant to most of our blog posts, but very relevant to post #2, where it appeared dozens of times. The inverse document frequency (IDF) tells us how important a term is to a collection of documents. A good example of how IDF comes into play is for the word “the.” We know that just about every document contains “the,” so the term isn’t really special anymore, thereby producing a very low IDF. Now let’s contrast “the” with “Python” in our example. “Python” appears rarely in the other posts, so its IDF should be high. In fact, “Python” now carries a weight signalling that in any document in which it appears, it is important to that document.



TF-IDF is useful when we are concerned with whole words in our text. A cool use of TF-IDF is automated blog-post tagging, where running TF-IDF on every blog post helps single out words that can be used as tags. We can also use TF-IDF vectors as features for machine learning, for example to power recommendations.

When we multiply TF and IDF, we observe that the larger the number, the more important the term is to that document. We can then compute the TF-IDF for each word in each document and create a vector per document.
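If you'd rather not write that bookkeeping yourself, scikit-learn's TfidfVectorizer is one option; below is a minimal sketch (the example posts are hypothetical, and scikit-learn's TF-IDF variant adds smoothing and normalization, so its numbers differ slightly from the plain formula above).

```python
# Building one TF-IDF vector per document with scikit-learn.
# Requires: pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical blog posts standing in for a real corpus.
posts = [
    "Building games in Python is fun",
    "Python Python Python: a deep dive into Python games",
    "Board games and card games for rainy days",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(posts)   # one row per post

# Inspect the weight of the word "python" in each post.
python_col = vectorizer.vocabulary_["python"]
print(tfidf_matrix[:, python_col].toarray().ravel())
```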

Cosine Similarity


A commonly used approach to matching similar documents is based on counting the number of words the documents have in common.
But this approach has an inherent flaw. That is, as the size of the documents increases, the number of common words tends to increase even if the documents talk about different topics.
The cosine similarity helps overcome this fundamental flaw in the ‘count-the-common-words’ or Euclidean distance approach.
What is Cosine Similarity and why is it advantageous?

  1. Cosine similarity is a metric used to determine how similar the documents are irrespective of their size.
  2. Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space. In this context, the two vectors I am talking about are arrays containing the word counts of two documents.
  3. As a similarity metric, how does cosine similarity differ from the number of common words?
  4. When plotted on a multi-dimensional space, where each dimension corresponds to a word in the document, the cosine similarity captures the orientation (the angle) of the documents and not the magnitude. If you want the magnitude, compute the Euclidean distance instead.
  5. The cosine similarity is advantageous because even if two similar documents are far apart by Euclidean distance because of their size (say, the word ‘cricket’ appears 50 times in one document and 10 times in another), they could still have a small angle between them. The smaller the angle, the higher the similarity.
Cosine Similarity Example
Let’s suppose you have 3 documents about a couple of star cricket players, Sachin Tendulkar and Dhoni. Two of the documents (A and B) are from the Wikipedia pages of the respective players and the third document (C) is a smaller snippet from Dhoni’s Wikipedia page.



All three documents are connected by a common theme: the game of cricket.
Our objective is to quantitatively estimate the similarity between the documents.
For ease of understanding, let’s consider only the top 3 common words between the documents: ‘Dhoni’, ‘Sachin’ and ‘Cricket’.
You would expect Doc A and Doc C, that is, the two documents on Dhoni, to have a higher similarity than Doc A and Doc B, because Doc C is essentially a snippet from Doc A itself.
However, if we go by the number of common words, the two larger documents will have the most words in common and will therefore be judged the most similar, which is exactly what we want to avoid.
The results would be more congruent when we use the cosine similarity score to assess the similarity.
Let me explain.
Let’s project the documents into a 3-dimensional space, where each dimension is the frequency count of one of the words ‘Sachin’, ‘Dhoni’ or ‘Cricket’.



When plotted in this space, Doc Dhoni_Small and the main Doc Dhoni are oriented closer together, even though they are far apart in magnitude.
It turns out that the closer the documents are by angle, the higher the cosine similarity (cos θ).




As you include more words from the documents, it becomes harder to visualize the higher-dimensional space, but you can still compute the cosine similarity directly from the word-count vectors, as in the sketch below.
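Here is a minimal sketch of that computation, using cos θ = (A · B) / (‖A‖ ‖B‖) on made-up word counts for ‘Sachin’, ‘Dhoni’ and ‘Cricket’ (the counts below are hypothetical, chosen only to mirror the example).

```python
# Cosine similarity between word-count vectors, computed with NumPy.
import numpy as np

# Rows: Doc A (Dhoni), Doc B (Sachin), Doc C (Dhoni_Small).
# Columns: hypothetical counts of 'sachin', 'dhoni', 'cricket'.
counts = np.array([
    [ 10, 120, 90],   # Doc A
    [130,  15, 85],   # Doc B
    [  2,  30, 20],   # Doc C
])

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print("A vs B:", cosine_similarity(counts[0], counts[1]))
print("A vs C:", cosine_similarity(counts[0], counts[2]))  # highest, as expected
```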

We will be discussing more NLP topics in the next blog post.

Credits/citations

1) Analytics India
2) Analytics India Mag
3) Stanford.edu -> NLP
4) Machine Learning Plus.
