Traditional NLP
Applications
- Text Classification
- Sequence Labelling: NER, POS
- Dependency Parsing
- Text-to-Speech
Concepts
Regular Expressions
- TO-DO
Word Type
- The number of distinct words in a corpus
- If the set of words in the vocabulary is $V$, the number of types is the vocabulary size $|V|$
Word Instance
- The total number $N$ of running words in a corpus
Corpus
- Plural corpora
- A computer-readable collection of text or speech
Code Switching
- The phenomenon of speakers or writers using multiple languages in a single communicative act
Utterance
- The spoken correlate of a sentence
- e.g. I do uh main- mainly business data processing
- This utterance has two kinds of disfluencies
- Fragment: the broken-off word main-
- Filled pause: words like uh and um
Heaps' Law / Herdan's Law
The relationship between the number of types $|V|$ and the number of instances $N$: $|V| = kN^\beta$, where $k$ and $\beta$ are positive constants and $0 < \beta < 1$
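A minimal Python sketch of counting word instances vs. word types on a toy corpus (whitespace tokenization is a simplifying assumption; a real tokenizer would handle punctuation):

```python
from collections import Counter

def type_token_counts(text):
    """Count word instances (tokens) and word types in a text.

    Assumes lowercased whitespace tokenization for illustration.
    """
    tokens = text.lower().split()
    types = Counter(tokens)
    return len(tokens), len(types)

corpus = "the cat sat on the mat and the cat slept"
n_instances, n_types = type_token_counts(corpus)
print(n_instances, n_types)  # 10 instances, 7 types
```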
Lemma / Citation Form
- A set of lexical forms having the same stem, the same major part-of-speech, and the same word sense
- e.g. Inflected forms like cats and cat are different wordforms but have the same lemma
Word Normalization
- The process of transforming text into a standard or canonical form
- At least three tasks are commonly applied
- Tokenizing (segmenting) words
- Normalizing word formats
- Segmenting sentences
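A minimal sketch of the tokenization and word-format-normalization steps, using an illustrative regex and case folding (the regex and function names are assumptions, not a standard API); sentence segmentation is sketched separately under Sentence Segmentation below:

```python
import re

def tokenize(text):
    """Split text into word tokens (keeping simple clitics) and punctuation."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

def normalize(tokens):
    """Normalize word formats; here, just case folding (lowercasing)."""
    return [t.lower() for t in tokens]

print(normalize(tokenize("Don't worry, Mr. Smith!")))
# ["don't", 'worry', ',', 'mr', '.', 'smith', '!']
```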
Stop Words
- Commonly used words that are often filtered out before processing text
- These words are considered to have little value in terms of meaning
- e.g. and, the, a, is, are, to etc.
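A small illustrative filter; the stop word list here is a hypothetical fragment, and real lists contain hundreds of entries:

```python
# A deliberately tiny, made-up stop word list for illustration
STOP_WORDS = {"and", "the", "a", "is", "are", "to", "of", "in"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["the", "cat", "is", "on", "the", "mat"]))
# ['cat', 'on', 'mat']
```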
Morphology
- The study of the way words are built up from morphemes
- Stems: the central morpheme of the word, supplying the main meaning
- Affixes: adding "additional" meanings of various kinds
Phoneme
- The technical term for a basic unit of sound in a language
- e.g. /k/
Grapheme
- The technical term for a letter or group of letters that spells a single sound (phoneme)
- e.g. the following table shows 4 different graphemes to represent 1 phoneme, the /k/ sound
| Word | Grapheme |
| --- | --- |
| cat | c |
| kite | k |
| duck | -ck |
| school | ch |
Morpheme
- The smallest meaning-bearing unit of a language
- e.g. unwashable has the morphemes un-, wash, -able
Case Folding
- The process of converting all characters in a text to a uniform case
Lemmatization
- The task of determining that two words have the same root, despite their surface differences
- e.g. The words am, are, is have the shared lemma be
Stemming
- Chopping off word-final affixes
- e.g. Porter stemmer
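The sketch below is a naive suffix-stripping stemmer for illustration only, not the actual Porter algorithm (which applies ordered rewrite rules such as sses -> ss with extra conditions); it shows both the idea and why crude chopping can produce non-words:

```python
# Naive suffix stripping; NOT the real Porter stemmer
SUFFIXES = ["ness", "ing", "ed", "s"]

def naive_stem(word):
    """Chop off the first matching word-final affix, keeping a minimal stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

for w in ["running", "cats", "happiness"]:
    print(w, "->", naive_stem(w))
# running -> runn, cats -> cat, happiness -> happi
```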
Sentence Segmentation
- The most useful cues for segmenting a text into sentences are punctuation, like periods, question marks, and exclamation points
- In general, sentence segmentation methods work by first deciding whether a period is part of the word or is a sentence-boundary marker
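A minimal rule-based segmenter along these lines, assuming a small hand-made abbreviation list (illustrative, not exhaustive):

```python
# Periods inside known abbreviations are not treated as sentence boundaries
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e."}

def segment_sentences(text):
    """Split on ., ?, ! at the end of a token, unless the token is a
    known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".?!" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(segment_sentences("Dr. Smith arrived late. Did anyone notice? No!"))
# ['Dr. Smith arrived late.', 'Did anyone notice?', 'No!']
```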
Minimum Edit Distance
- The minimum edit distance between two strings is defined as the minimum number of editing operations (operations like insertion, deletion, substitution) needed to transform one string into another
- The Levenshtein distance between two sequences uses the simplest weighting, in which each of the three operations has a cost of 1
- Can be computed by dynamic programming, which also results in an alignment of the two strings
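A standard dynamic-programming implementation with unit costs (Levenshtein distance); the backtrace that recovers the alignment is omitted for brevity:

```python
def min_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Minimum edit distance via dynamic programming.

    D[i][j] holds the minimum cost of transforming source[:i] into target[:j].
    With all three operation costs set to 1 this is the Levenshtein distance.
    """
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + del_cost
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitute = D[i - 1][j - 1] + (0 if source[i - 1] == target[j - 1] else sub_cost)
            D[i][j] = min(D[i - 1][j] + del_cost,   # delete from source
                          D[i][j - 1] + ins_cost,   # insert into source
                          substitute)               # substitute (or copy)
    return D[n][m]

print(min_edit_distance("intention", "execution"))  # 5 with unit costs
```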
Perplexity
The perplexity of a test sequence $W = w_1 w_2 \ldots w_N$ computed with a bigram language model: $\text{perplexity}(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}$
- The higher the probability of the word sequence, the lower the perplexity
- The lower the perplexity of a model on the data, the better the model
- Minimizing the perplexity is equivalent to maximizing the test set probability according to the language model
- Can also be thought of as the weighted average branching factor of a language
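A sketch of computing bigram perplexity in log space (the bigram_prob callable is a placeholder for any estimated bigram model):

```python
import math

def bigram_perplexity(tokens, bigram_prob):
    """Perplexity of a token sequence under a bigram model.

    bigram_prob(prev, word) is assumed to return P(word | prev); log
    probabilities are summed to avoid numerical underflow.
    """
    log_prob, n = 0.0, 0
    for prev, word in zip(tokens, tokens[1:]):
        log_prob += math.log(bigram_prob(prev, word))
        n += 1
    return math.exp(-log_prob / n)

# Sanity check: a uniform bigram distribution over a 10-word vocabulary
# has a branching factor, and hence a perplexity, of 10.
uniform = lambda prev, word: 0.1
print(bigram_perplexity(["<s>", "a", "b", "c", "d"], uniform))  # ~10.0
```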
Smoothing
- TO-DO
- smoothing, interpolation and backoff
Statistical Significance Test
- TO-DO
- Used to determine whether we can be confident that one version of a model is better than another
- Bootstrap test
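A sketch of the paired bootstrap test under one common formulation: resample the test set with replacement and count how often the resampled accuracy gap reaches twice the observed gap (details such as the threshold vary across presentations):

```python
import random

def paired_bootstrap_pvalue(correct_a, correct_b, num_samples=10000, seed=0):
    """Paired bootstrap test (sketch) for comparing two classifiers.

    correct_a / correct_b mark (1/0) whether model A / model B got each
    test example right; delta is A's accuracy minus B's accuracy.
    The p-value is estimated as the fraction of resampled test sets whose
    delta is at least twice the delta observed on the real test set.
    """
    rng = random.Random(seed)
    n = len(correct_a)
    observed = (sum(correct_a) - sum(correct_b)) / n
    exceed = 0
    for _ in range(num_samples):
        idx = [rng.randrange(n) for _ in range(n)]
        delta = sum(correct_a[i] - correct_b[i] for i in idx) / n
        if delta >= 2 * observed:
            exceed += 1
    return exceed / num_samples

# A p-value below a chosen threshold (e.g. 0.05) suggests model A's
# advantage is unlikely to be an artifact of this particular test set.
a_correct = [1, 1, 1, 0, 1, 0, 1, 1]
b_correct = [1, 0, 1, 0, 1, 0, 0, 1]
print(paired_bootstrap_pvalue(a_correct, b_correct, num_samples=2000))
```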
Model Card
Documents a ML model with information like:
- Training algorithms and parameters
- Training data sources, motivation, and preprocessing
- Evaluation data sources, motivation, and preprocessing
- Intended use and users
- Model performance across demographic or other groups and environmental situations
Part of Speech (POS)
- A category of words that have similar grammatical properties:
- e.g. noun, verb, pronoun, adjective, adverb, preposition, etc.
Natural Language Inference (NLI)
The task of determining whether a "hypothesis" is:
- true (entailment)
- false (contradiction)
- undetermined (neutral)
given a premise
Models
N-gram Language Models
Given the bigram assumption for the probability of an individual word, we can compute the probability of a complete word sequence: $P(w_{1:n}) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})$
- An n-gram is a sequence of n words
- Markov assumption: the probability of a word depends only on the previous word
- An n-gram model looks $n-1$ words into the past
- Estimate n-gram probability through Maximum Likelihood Estimation (MLE)
For example, the bigram probability of a word $w_n$ given a previous word $w_{n-1}$: $P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}$
- Compute the count of the bigram $C(w_{n-1} w_n)$, and normalize by the sum of all the bigrams that share the same first word $w_{n-1}$
- The sum of all bigram counts that start with a given word must be equal to the unigram count for that word
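A minimal MLE bigram estimator over a toy corpus, padding each sentence with <s> and </s> markers:

```python
from collections import Counter

def train_bigram_mle(sentences):
    """Estimate bigram probabilities by MLE: count each bigram and
    normalize by the count of its first word."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens[:-1])              # every token that starts a bigram
        bigrams.update(zip(tokens, tokens[1:]))
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

probs = train_bigram_mle(["I am Sam", "Sam I am", "I do not like green eggs and ham"])
print(probs[("<s>", "I")])  # 0.666... = 2/3
print(probs[("I", "am")])   # 0.666... = 2/3
```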
Dealing with scale in large n-gram models:
- Use log probabilities, since multiplying enough n-grams together would result in numerical underflow
- Although for pedagogical purposes we have only described bigram models, when there is sufficient training data we use trigram models, which condition on the previous two words, or 4-gram or 5-gram models
- For these larger n-grams, we’ll need to assume extra contexts to the left and right of the sentence end
- The infini-gram project allows n-grams of arbitrary length
Bag-Of-Words
- Tokenization: Split the text into individual tokens
- Vocabulary Creation: An unordered set of words with their position ignored, keeping only their frequency in the text document
- Vectorization: Represent each document as a vector, where each element corresponds to the count of a word from the vocabulary
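A compact sketch of the three steps (whitespace tokenization and raw counts are simplifying assumptions):

```python
from collections import Counter

def bag_of_words_vectors(documents):
    """Turn raw documents into count vectors over a shared vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]            # tokenization
    vocabulary = sorted({tok for doc in tokenized for tok in doc})    # vocabulary creation
    vectors = []
    for doc in tokenized:                                             # vectorization
        counts = Counter(doc)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

vocab, vecs = bag_of_words_vectors(["the cat sat", "the cat saw the dog"])
print(vocab)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(vecs)   # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```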
Naive Bayes
A probabilistic classifier: for a document $d$, out of all classes $c \in C$, the classifier returns the class $\hat{c}$ which has the maximum posterior probability given the document: $\hat{c} = \operatorname*{argmax}_{c \in C} P(c \mid d)$
- We call Naive Bayes a generative model because it makes an implicit assumption about how a document is generated:
- A class is sampled from $P(c)$ (the prior probability of the class)
- Then the words are generated by sampling from $P(d \mid c)$ (the likelihood of the document)
We can represent a document as a set of features $f_1, f_2, \ldots, f_n$:
- Estimating the probability of every possible combination of features would require a huge number of parameters and impossibly large training sets
- Naive Bayes classifiers make two simplifying assumptions:
- Bag-of-words assumption: assume position doesn't matter
- Naive Bayes assumption: conditional independence assumption
Final equation for the class chosen by a Naive Bayes classifier: $c_{NB} = \operatorname*{argmax}_{c \in C} P(c) \prod_{i \in \text{positions}} P(w_i \mid c)$
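A sketch of multinomial Naive Bayes training and prediction with add-one (Laplace) smoothing, computed in log space; the tiny sentiment data is made up for illustration:

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Train a multinomial Naive Bayes classifier with add-one smoothing.

    Returns log priors log P(c), log likelihoods log P(w|c), and the vocabulary.
    """
    classes = set(labels)
    vocab = {w for doc in docs for w in doc}
    log_prior, log_likelihood = {}, defaultdict(dict)
    for c in classes:
        class_docs = [doc for doc, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for doc in class_docs for w in doc)
        total = sum(counts.values()) + len(vocab)   # add-one smoothing denominator
        for w in vocab:
            log_likelihood[c][w] = math.log((counts[w] + 1) / total)
    return log_prior, log_likelihood, vocab

def predict(doc, log_prior, log_likelihood, vocab):
    """Return argmax_c of log P(c) + sum_i log P(w_i|c), ignoring unknown words."""
    scores = {c: log_prior[c] + sum(log_likelihood[c][w] for w in doc if w in vocab)
              for c in log_prior}
    return max(scores, key=scores.get)

docs = [["good", "great", "fun"], ["boring", "bad"], ["great", "plot"], ["bad", "acting"]]
labels = ["pos", "neg", "pos", "neg"]
model = train_naive_bayes(docs, labels)
print(predict(["great", "fun"], *model))  # pos
```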
Naive Bayes as a Language Model: