The NLP workflow follows the same steps as other classification algorithms:
- Compile documents. Gather the data, which is usually raw text.
- Featurize documents. Get the text in a format that ML algorithms understand.
- Compare features for classification. Use ML techniques to build the model.
Unstructured text ⇒ Compile documents ⇒ Featurize them ⇒ Compare features
How does the algorithm work?
In NLP the featurization is done through vectorization:
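- Start from a corpus of D documents, e.g. a = “The House is Blue”, b = “The House is Red”.
- Build an index of relevant, meaningful keywords, e.g. (house, blue, red).
- Vectorize documents by marking which index keywords each one contains, e.g. a = “The House is Blue” ⇒ (1,1,0) and b = “The House is Red” ⇒ (1,0,1).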
- Compare the docs as follows:
Use cosine similarity to compare: similarity(a,b) = cos(θ) = (a · b) / (‖a‖ ‖b‖), where θ is the angle between the two document vectors.
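The vectorization and comparison steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the helper names `vectorize` and `cosine_similarity` are hypothetical, tokenization is assumed to be a plain lowercase whitespace split, and the keyword index is assumed to be chosen by hand.

```python
import math

def vectorize(doc, index):
    """Binary vector: 1 if the index keyword occurs in the document, else 0."""
    words = set(doc.lower().split())  # assumption: naive whitespace tokenization
    return [1 if term in words else 0 for term in index]

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (|u| * |v|); returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

index = ["house", "blue", "red"]            # hand-picked keyword index
a = vectorize("The House is Blue", index)   # [1, 1, 0]
b = vectorize("The House is Red", index)    # [1, 0, 1]
similarity = cosine_similarity(a, b)        # approximately 0.5
print(a, b, similarity)
```

The two documents share one of their two keywords, so the dot product is 1 and each vector has length √2, which gives a similarity of about 0.5.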
Characterize the terms:
– Term Frequency TF(t,d) ⇒ Importance of the term t within doc d (how often t occurs in d)
– Inverse Doc Frequency IDF(t) = log (D/DF(t)), where D is the number of documents in the corpus and DF(t) is the number of documents that contain t ⇒ Importance of the term within the corpus
– TF-IDF(t,d) = TF(t,d) × IDF(t) ⇒ This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.
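As a rough sketch of how these weights could be computed by hand, the Python below uses one common variant: TF as the term count divided by the document length, and IDF as the natural log of the number of documents over the number of documents containing the term. The function names are hypothetical, tokenization is again a plain whitespace split, and all words (including stop words) count toward the document length; libraries typically apply smoothing and normalization that this sketch omits.

```python
import math
from collections import Counter

def tokenize(doc):
    return doc.lower().split()  # assumption: naive whitespace tokenization

def term_frequency(term, doc_tokens):
    """Share of the document's tokens that are `term`."""
    return Counter(doc_tokens)[term] / len(doc_tokens)

def inverse_document_frequency(term, corpus_tokens):
    """log(D / DF(t)); returns 0.0 for a term that appears in no document."""
    n_docs = len(corpus_tokens)
    doc_freq = sum(1 for tokens in corpus_tokens if term in tokens)
    if doc_freq == 0:
        return 0.0
    return math.log(n_docs / doc_freq)

def tf_idf(term, doc_tokens, corpus_tokens):
    return term_frequency(term, doc_tokens) * inverse_document_frequency(term, corpus_tokens)

corpus = [tokenize("The House is Blue"), tokenize("The House is Red")]
house_weight = tf_idf("house", corpus[0], corpus)  # "house" is in every doc, so IDF = log(1) = 0
blue_weight = tf_idf("blue", corpus[0], corpus)    # 0.25 * log(2)
print(house_weight, blue_weight)
```

In this two-document corpus "house" gets a weight of zero because it appears everywhere and so does not help tell the documents apart, while "blue" gets a positive weight because it appears in only one of them.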