Main ideas

The NLP workflow follows the same steps as most other classification algorithms:

  • Compile documents. Gather the data, which is usually raw text.
  • Featurize documents. Get the text in a format that ML algorithms understand.
  • Compare features for classification. Use ML techniques to build the model.

Unstructured text ⇒ Compile documents ⇒ Featurize them ⇒ Compare features
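
The three steps above can be sketched in a few lines of plain Python. This is only an illustration, not a production method: the toy documents, their labels, and the helper names (featurize, overlap, classify) are all assumptions made for the example, and the "model" is simply "pick the label of the training document that shares the most words".

```python
from collections import Counter

# Step 1: compile documents (hypothetical toy data with made-up labels).
training_docs = [
    ("the house is blue", "color"),
    ("the house is red", "color"),
    ("the dog runs fast", "animal"),
    ("the cat sleeps all day", "animal"),
]

def featurize(text):
    """Step 2: turn raw text into word counts (a bag of words)."""
    return Counter(text.lower().split())

def overlap(features_a, features_b):
    """Count the word occurrences shared by two bags of words."""
    return sum((features_a & features_b).values())

def classify(text):
    """Step 3: return the label of the training doc sharing the most words."""
    features = featurize(text)
    scored = [(label, overlap(features, featurize(doc)))
              for doc, label in training_docs]
    best_label, _ = max(scored, key=lambda pair: pair[1])
    return best_label

print(classify("a red house"))
print(classify("the dog sleeps"))
```

A real pipeline would replace the word-overlap score with a trained ML model, but the flow (compile, featurize, compare) stays the same.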

How does the algorithm work

In NLP the featurization is done through vectorization:

  • Corpus of D documents, e.g. a = “The House is Blue”, b = “The House is Red”.
  • Build an index of relevant, meaningful keywords, e.g. (house, blue, red).
  • Vectorize documents, e.g. a = “The House is Blue” ⇒ (1,1,0) and b = “The House is Red” ⇒ (1,0,1).
  • Compare the docs as follows:

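Use cosine similarity to compare: similarity(a,b) = cos(θ) = (a · b) / (‖a‖ ‖b‖), where θ is the angle between the two document vectors. A value close to 1 means the docs are very similar; 0 means they share no keywords.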
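
As a minimal sketch of the vectorization and comparison steps, the snippet below builds binary keyword vectors for the two example documents and computes their cosine similarity. The function names (vectorize, cosine_similarity) and the choice of binary 0/1 weights are assumptions for illustration; the standard library is used on purpose so nothing needs to be installed.

```python
import math

# Keyword index taken from the example above.
index = ["house", "blue", "red"]

def vectorize(text, index):
    """Binary vector: 1 if the keyword occurs in the text, else 0."""
    words = set(text.lower().split())
    return [1 if keyword in words else 0 for keyword in index]

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (|u| * |v|); defined as 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

a = vectorize("The House is Blue", index)  # [1, 1, 0]
b = vectorize("The House is Red", index)   # [1, 0, 1]
print(a, b, cosine_similarity(a, b))       # similarity is approximately 0.5
```

The two docs share one keyword out of two each, so the similarity lands at roughly 0.5.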

Characterize the terms:
Term Frequency TF(t,d) = how often term t appears in doc d ⇒ Importance of the term t within doc d
Inverse Doc Frequency IDF(t) = log (D / DF(t)), where DF(t) is the number of docs that contain t ⇒ Importance of the term within corpus D
TF-IDF(t,d) = TF(t,d) × IDF(t). This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.
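
A small worked example may help here. The sketch below computes TF-IDF by hand for the two-document corpus. Several details are assumptions, since the notes do not fix them: TF is taken as the term count divided by the document length, IDF uses the natural logarithm, and the helper names (tf, idf, tf_idf) are made up for the example.

```python
import math

# Toy corpus from the example: D = 2 documents.
corpus = {"a": "the house is blue", "b": "the house is red"}
tokenized = {name: text.split() for name, text in corpus.items()}

def tf(term, tokens):
    """Term frequency: occurrences of the term divided by the doc length."""
    return tokens.count(term) / len(tokens)

def idf(term, tokenized):
    """Inverse doc frequency: log(D / number of docs containing the term)."""
    doc_freq = sum(1 for tokens in tokenized.values() if term in tokens)
    if doc_freq == 0:
        return 0.0  # term never seen in the corpus
    return math.log(len(tokenized) / doc_freq)

def tf_idf(term, doc_name):
    """Weight of a term within one document of the corpus."""
    return tf(term, tokenized[doc_name]) * idf(term, tokenized)

for term in ("house", "blue"):
    print(term, tf_idf(term, "a"))
```

"house" appears in every doc, so its IDF is log(2/2) = 0 and its weight vanishes, while "blue" appears only in doc a and keeps a positive weight. This is the "offset by the frequency of the word in the corpus" effect described above.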

More references

Natural Language Processing With Python
The Best Way to Learn Practical NLP