1. Semantic Roles
1.1 Commonly used thematic roles
1.2 Issues with thematic roles
- No universally agreed-upon set of roles
- Items with the same role may not behave quite the same
To avoid these problems, define roles that are either:
- Specific to individual verbs only (PropBank)
- Specific to small groups of verbs (FrameNet)
2. Labelling
2.1 PropBank
Frame Roles:
- Arg0-PAG: loader
- Arg1-GOL: beast of burden
- Arg2-PPT: cargo
- Arg3-MNR: instrument
2.2 FrameNet
==Meanings are relativised to scenes!==
- Tries to capture relationships among word and phrase meanings by assigning them the same frame
- $\approx 1000$ frames represent scenarios
- Frames are explained with textual descriptions and linguistic examples
Resources:
- FrameNets for several languages
- Some data annotated with Frame elements from FrameNet
- SEMAFOR: a frame-semantic parser
Distributional hypothesis:
- Can infer meaning just by looking at the contexts a word occurs in
- Meaning IS the contexts a word occurs in
- either way, similar contexts imply similar meanings
1. Distributional semantics
1.1 Basic idea
Represent each word $w_i$ as a vector of its contexts
Distributional semantic models also called vector-space models
1.2 Context
1.2.1 Defining the context
- usually ignore stopwords
- usually use a large window around the target word
1.2.2 Weights
Mutual information: $$ \text{PMI}(x,y) = \log_2 \frac{P(x,y)}{P(x)P(y)} $$
- PMI tells us how much more / less likely the cooccurrence is than if the words were independent
- Problem:
- in practice, PMI is computed with counts, so it is over-sensitive to the chance co-occurrence of infrequent words
- Improvement: use positive PMI (PPMI)
- change all negative PMI values to 0 (because for infrequent words, not enough data to accurately determine negative PMI values)
- smoothing in PMI computation (see J & M)
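A minimal sketch of PPMI computed from a raw co-occurrence count matrix (the variable names and the toy counts are illustrative assumptions, not from the slides):

```python
import numpy as np

def ppmi(counts):
    """counts[i, j] = co-occurrence count of target word i with context word j."""
    total = counts.sum()
    p_xy = counts / total                      # joint probabilities P(x, y)
    p_x = p_xy.sum(axis=1, keepdims=True)      # marginal P(x) over target words
    p_y = p_xy.sum(axis=0, keepdims=True)      # marginal P(y) over context words
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_xy / (p_x * p_y))
    return np.maximum(pmi, 0)                  # PPMI: clip negative PMI values to 0

# toy example: 2 target words x 3 context words
counts = np.array([[10.0, 0.0, 3.0],
                   [ 2.0, 8.0, 1.0]])
print(ppmi(counts))
```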
1.3 Compositionality in vector space
Compositionality implies some operator $\oplus$ such that $$ \text{meaning}(w_1w_2) = \text{meaning}(w_1) \oplus \text{meaning}(w_2) $$
2. Measure similarity
- Euclidean distance
- Dot product
- Normalised dot product (the cosine of the angle between vectors)
- Other
- Jaccard measure
- Dice measure
- Jenson-Shannon divergence
- …
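A minimal sketch of the normalised dot product (cosine) measure from the list above, with made-up context-count vectors:

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two context vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# toy context-count vectors (illustrative only)
lemon  = np.array([5.0, 1.0, 0.0, 2.0])
orange = np.array([4.0, 0.0, 1.0, 3.0])
truth  = np.array([0.0, 6.0, 2.0, 0.0])

print(cosine(lemon, orange))   # high: similar contexts
print(cosine(lemon, truth))    # low: dissimilar contexts
```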
3. Evaluation
- Extrinsic: IR, QA, automatic essay marking, …
- Intrinsic
- relatedness judgement
- word association
3.1 Relatedness judgement
E.g. on a scale of 1-10, how related are the two words?
lemon-truth = 1
lemon-orange = 10
- Answers depend a lot on how the question is asked
3.2 Word association
Say the first word that comes to mind
Collect the data and provide probabilities of responses
3.3 Dimension reduction: learn a more compact space
- LSA (Latent Semantic Analysis)
- Neural network methods
Goal:
- a machine which can answer questions in natural language
- has access to a knowledge base (KB)
- has access to vast quantities of English text
1. Semantics
1.1 Lexical and Sentential
Lexical semantics: the meaning of individual words
Sentential semantics: how word meanings combine (e.g. who did that to whom; when, how, why…)
1.2 Problems
- Senses: words may have different meanings (need to disambiguate between them)
- Synonyms: words may have the same meaning
- Ontology (hyponym & hypernym): words can refer to a subset or superset of the concept referred to by another word (A is a B relationships)
- Similarity & gradation: words may be related in other ways
- Inference
PS: some of these problems can be solved with a good ontology (e.g. WordNet)
2. WordNet
WordNet is a hand-built resource containing 117,000 synsets
Relations:
- hyponym / hypernym (IS-A: chair-furniture)
- meronym (PART-WHOLE: leg-chair)
- antonym (OPPOSITES: good - bad)
3. Word sense ambiguity
Words can have multiple senses
A way to define senses: different sense = different translation
3.1 Semantics ambiguity types
- Polysemy: a word has related meanings
- Homonymy: a word has unrelated meanings
- Synonymy: different words can mean the same thing
3.2 Word sense disambiguation (WSD)
Task: Given a sense ambiguous word, find the sense in a given context
Classifier:
- Naive Bayes
- Decision lists
- Decision trees
3.2.1 Naive Bayes for WSD
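A sketch of the decision rule, i.e. standard Naive Bayes applied over a word's senses (the notation $S(w)$ for the sense inventory is mine, not from the slides): $$ \hat{s} = \underset{s \in S(w)}{\operatorname{argmax}}\ P(s) \prod_{i=1}^{n} P(f_i \mid s) $$ where $f_1 \dots f_n$ are the context features listed below.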
3.2.2 Features
- Directly neighbouring words
- Any content words in a 50 word window
- Syntactically related words
- Syntactic role in the sentence
- Topic of the text
- POS tag, surrounding POS tags
3.3 Evaluation
Extrinsic: test as part of IR, QA, or MT system
Intrinsic: evaluate classification accuracy or precision / recall against gold-standard senses
Baseline: choose the most frequent sense
3.4 Issues with WSD
- Not always clear how fine-grained the gold-standard should be
- Difficult / expensive to annotate corpora with fine-grained senses
- Classifier must be trained separately for each word
- hard to learn anything for infrequent or unseen words
- requires new annotations for each new word
- motivates unsupervised and semi-supervised methods
4. Semantic classes
Methods: define coarse-grained semantic categories like PERSON, LOCATION, ARTIFACT
- Named entity recognition (NER): recognising and classifying proper names in text; important for many applications
- Supersense tagging
1. Parser evaluation
Correct: a constituent is correct if a gold constituent spans the same sentence positions
Harsher measure: also require the constituent labels to match
- precision
- recall
- F-score (balances precision and recall): $$ \frac{2pr}{p+r} $$
2. Lexicalisation
2.1 Introduction
Problem: replacing one word with another with the same POS will never result in a different parsing decision, even though it should
Solution: adding the lexical head of the phrase
2.2 Pros & Cons
Pros:
- makes the grammar more specific
Cons:
- leads to huge grammar blowup and very sparse data
2.3 Improvement
- complex smoothing schemes (similar to N-gram interpolation / backoff)
- increasing emphasis on automatically learned subcategories
2.4 Lexicalised constituency parse
Direct dependency parsing:
Transition-based (shift-reduce)
Train a classifier as the oracle to predict next action (Shift, LeftArc, or RightArc), and proceed left-to-right through the sentence
$O(n)$ time
Graph-based
From the fully connected directed graph of all possible edges, choose the best ones that form a tree
Edge-factored models: a classifier assigns a nonnegative score to each possible edge; a maximum spanning tree algorithm finds the spanning tree with the highest total score
$O(n^2)$ time
Head Rules
3. Phrase Structure & Dependencies
In languages with free word order, phrase structure grammars don’t make as much sense
3.1 Pros and Cons
Pros:
- Sensible framework for free word order languages
- Identifies syntactic relations directly
- Dependency pairs / chains can make good features in classifiers, for information extraction
- Parsers can be very fast
Cons:
- The assumption of asymmetric binary relations is not always right (e.g. how to parse coordination like “dogs and cats”)
3.2 Edge labels
Labels: subject, direct object, determiner, adjective modifier, adverbial modifier
3.3 Dependency Paths
For information extraction tasks involving real-world relationships between entities, chains of dependencies can provide good features
Projectivity:
A sentence’s dependency parse is projective if every subtree occupies a contiguous span of the sentence
- i.e. the dependency parse can be drawn on top of the sentence without any crossing edges
Nonprojectivity:
- Nonprojectivity is rare in English, but quite common in many languages
4. Dependency Parsing
4.1 CKY
CKY can be adapted, but efficiency is not good: the obvious approach is $O(Gn^5)$; the Eisner algorithm can bring it down to $O(Gn^3)$
4.2 Transition-based: Shift reduce
3 possible actions:
- LeftArc: assign head-dependent relation between $s_1$ and $s_2$; pop $s_2$
- RightArc: assign head-dependent relation between $s_2$ and $s_1$; pop $s_1$
- Shift: move the next word of the buffer ($w_1$) onto the top of the stack
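A minimal sketch of this transition system, driven by a hard-coded action sequence instead of a trained oracle (arc labels omitted; the function and example are illustrative assumptions):

```python
def parse(words, actions):
    """words: list of tokens; actions: oracle output, e.g. from a classifier.
    Returns a list of (head, dependent) arcs over word indices."""
    stack, buffer, arcs = [], list(range(len(words))), []
    for action in actions:
        if action == "SHIFT":                      # move next word onto the stack
            stack.append(buffer.pop(0))
        elif action == "LEFTARC":                  # s1 is head of s2; pop s2
            s1, s2 = stack[-1], stack[-2]
            arcs.append((s1, s2))
            del stack[-2]
        elif action == "RIGHTARC":                 # s2 is head of s1; pop s1
            s1, s2 = stack[-1], stack[-2]
            arcs.append((s2, s1))
            stack.pop()
    return arcs

# "dogs bark": SHIFT, SHIFT, then LEFTARC makes "bark" the head of "dogs"
print(parse(["dogs", "bark"], ["SHIFT", "SHIFT", "LEFTARC"]))  # [(1, 0)]
```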
4.3 Graph-based
4.4 Comparison
5. Choosing a Parser
- Target representation: constituency or dependency?
- Efficiency
- Incrementality: parse the whole sentence at once, or obtain partial left-to-right analyses / expectations
- Retrainable system or not
Two theories of syntax:
- Context-free grammar
- Dependency grammar
1. Context-free grammar
Symbols:
- Terminals (t): words
- Non-terminals (NT): phrasal categories (e.g. S, NP, VP, PP) with S being the start symbol
Chomsky Normal Form: only has rules of the form $$ NT \rightarrow NT\ \ NT \quad \text{or} \quad NT \rightarrow t $$
2. Syntactic analysis
2.1 Global Ambiguity
Structural ambiguity
POS ambiguity:
Attachment ambiguity: Some sentences have structural ambiguity even without part-of-speech ambiguity
Depends on where different phrases attach in the tree
Different attachments have different meanings
I saw the man with the telescope
She ate the pizza on the floor
Good boys and girls get presents from Santa
Why we want parse trees:
- compositional semantics: the meaning of a constituent is a function of the meaning of its children
- rule-to-rule semantics: that function is determined by the rule which licenses the constituent
Huge problems:
- correctness problem: need to find the right structure to get the right meaning
- efficiency problem: searching all possible structures can be very slow; want to use parsing for large-scale language tasks (for example, as used to create Google’s “infoboxes”)
2.2 Local ambiguity
Local ambiguity is also a big problem: multiple analyses for parts of sentence.
- “The dog bit the child”: first three words could be NP (but aren’t)
- Building useless partial structures wastes time
- Avoiding useless computation is a major issue in parsing
3. Parsing
3.1 Normal Parsing
- Top-down: depth-first, breadth-first, best-first
- Bottom-up
- Mixed strategy (e.g. Earley algorithm)
3.1.1 Recursive Descent Parsing
Idea: top-down, depth-first parsing
e.g. Slide 11 pp. 23-24
3.1.2 Shift-reduce Parsing
Idea: bottom-up, depth-first parsing
e.g. Slide 11 pp. 27
3.1.3 CKY algorithm (Cocke, Kasami, Younger)
Idea: bottom-up, breadth-first parsing
Assumption: grammar in Chomsky Normal Form
Pseudocode:
for len = 1 to n:                        # number of words in constituent
    for i = 0 to (n-len):                # start position
        j = i + len                      # end position
        for k = (i+1) to (j-1):
            # process binary rules
            for each (A -> B C) where c[i,k] has B and c[k,j] has C:
                add A to c[i,j] with pointers to B and C
        # process unary rules (after the binary rules, once c[i,j] is filled)
        for each (A -> B) where c[i,j] has B:
            add A to c[i,j] with a pointer to B
Time complexity: $O(Gn^3)$
e.g. Slide 11 pp.5-11
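A runnable sketch of the recogniser above for a grammar in CNF (the grammar encoding and the toy grammar are my own assumptions; unary non-terminal rules and backpointers are omitted for brevity):

```python
from collections import defaultdict

def cky(words, lexical, binary):
    """lexical: set of (A, word) rules; binary: set of (A, B, C) rules meaning A -> B C."""
    n = len(words)
    chart = defaultdict(set)                     # chart[i, j] = labels spanning words[i:j]
    for i, w in enumerate(words):                # fill the diagonal from lexical rules
        chart[i, i + 1] |= {A for (A, t) in lexical if t == w}
    for length in range(2, n + 1):               # length of constituent
        for i in range(0, n - length + 1):       # start position
            j = i + length                       # end position
            for k in range(i + 1, j):            # split point
                for (A, B, C) in binary:
                    if B in chart[i, k] and C in chart[k, j]:
                        chart[i, j].add(A)       # a real parser also stores backpointers
    return chart

lexical = {("NP", "dogs"), ("V", "chase"), ("NP", "cats")}   # toy CNF grammar (made up)
binary = {("S", "NP", "VP"), ("VP", "V", "NP")}
print(cky(["dogs", "chase", "cats"], lexical, binary)[0, 3])  # {'S'}: the sentence is covered
```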
Problem:
- Ambiguity is only clear if we go on to reconstruct the parses using our backpointers
Pros & Cons:
Pros:
- much more efficient than depth-first
Cons:
- still may compute a lot of unnecessary partial parses
- requires converting the grammar to CNF
Can be extended to probabilistic CKY
3.1.4 Choose parsing algorithm
==Probabilities ???==
3.2 Statistical Parsing
3.2.1 Treebank grammars
Idea: annotate real sentences with parse trees
3.2.1.1 The Penn Treebank (Treebank PCFG)
PCFG (probabilistic context-free grammar) $$ NT \rightarrow \beta \ \text{ with probability } P(\beta \mid NT) $$
3.2.1.2 Create PCFG
Easiest way: MLE
- Count all occurrences of $NT \rightarrow \beta$ in treebank
- Divide by the count of all rules whose left-hand-side is $NT$ to get $P(\beta \mid NT)$
$$ P(NT \rightarrow C_1, C_2 \dots C_n \mid NT) = \frac{count(NT \rightarrow C_1,C_2 \dots C_n)}{count(NT)} $$
e.g. Slide 12 pp. 21-24
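A minimal sketch of this MLE, assuming treebank rule occurrences are given as (LHS, RHS) pairs (the data format and toy counts are illustrative assumptions):

```python
from collections import Counter

def estimate_pcfg(rules):
    """rules: list of (lhs, rhs) pairs read off treebank trees, e.g. ("NP", ("DT", "NN"))."""
    rule_counts = Counter(rules)
    lhs_counts = Counter(lhs for lhs, _ in rules)
    return {(lhs, rhs): count / lhs_counts[lhs]          # P(rhs | lhs)
            for (lhs, rhs), count in rule_counts.items()}

# toy treebank rule occurrences (made up)
rules = [("NP", ("DT", "NN")), ("NP", ("DT", "NN")), ("NP", ("NNP",)),
         ("VP", ("V", "NP"))]
probs = estimate_pcfg(rules)
print(probs[("NP", ("DT", "NN"))])   # 2/3
```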
3.2.2 Probability of a parse
$$ P(t) = \prod_{(NT \rightarrow \beta) \in t} P(\beta \mid NT) $$
3.2.3 Probabilistic CKY
- When we find an $NT$ spanning $(i,j)$, store its probability along with its label in cell $(i,j)$
- If we later find an $NT$ with the same span but higher probability, replace the probability and the backpointers for $NT$ in cell $(i,j)$
3.2.4 Other algorithms
- Inside algorithm
- Inside-outside algorithm
3.2.5 Best-first probabilistic parsing
Basic idea: use probabilities of subtrees to decide which ones to build up further
How to score constituents:
Why not raw probability: smaller constituents will almost always have higher probabilities
Possible solution:
- divide by the number of words (but not guaranteed to find the best parse first)
- A*
1. Annotation
Penn Treebank: 36 POS tags (excluding punctuation)
Factors in Annotation:
- source data (genre? size? licensing?)
- annotation scheme (complexity? guidelines?)
- annotators (expertise? training?)
- annotation software (graphical interface?)
- quality control procedures (multiple annotation, adjudication?)
2. Evaluation
2.1 Hypotheses
About existing linguistic objects:
- Is this text by Shakespeare or Marlowe?
About output of a language system:
- How well does this language model predict the data?
- How accurate is this segmenter / tagger / parser?
- Is this segmenter / tagger / parser better than that one?
About human beings:
- How reliable is this person’s annotation?
- To what extent do these two annotators agree? (IAA: inter-annotator agreement)
2.2 Gold Standard Evaluation
- Gold standards used both for training and for evaluation
- But testing must be done on unseen data
2.3 Tuning
- Lidstone (add-$\lambda$) smoothing
- Choosing features for text classification
Overfitting: if we tune on the test set, it is no longer a reliable proxy for new data.
Solution: hold out a second set for tuning, called a development set, and save the test set for the very end
2.4 Cross-validation
Use when the dataset is too small to have a nice train / test or train / dev / test split
$k$-fold cross-validation
Partition the data into $k$ pieces and treat them as mini held-out sets. Each fold is an experiment with a different held-out set, using the rest of the data for training
PS: If tuning the system via cross-validation, still important to have a separate blind test set.
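A minimal sketch of producing the $k$ folds (index bookkeeping only; function name and usage are illustrative assumptions):

```python
def k_fold_indices(n_items, k):
    """Yield (train_indices, heldout_indices) for each of the k folds."""
    indices = list(range(n_items))
    fold_size = n_items // k
    for fold in range(k):
        start = fold * fold_size
        end = (fold + 1) * fold_size if fold < k - 1 else n_items  # last fold takes the remainder
        heldout = indices[start:end]
        train = indices[:start] + indices[end:]
        yield train, heldout

for train, heldout in k_fold_indices(10, 5):
    print(len(train), len(heldout))   # 8 2 on every fold
```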
2.5 Measurement
$$ \text{precision} = \frac{TP}{TP+FP} \ \ \ \text{recall (true positive rate)} = \frac{TP}{TP+FN}\ \ \ F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision}+\text{recall}} $$
Upper bound: Turing test
- When using a human Gold Standard, check the agreement of humans against that standard.
Lower bound: performance of baseline model
Confusion matrix
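A minimal sketch computing the measures above from confusion-matrix counts (the toy counts are made up):

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # true positive rate
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# toy counts: 8 true positives, 2 false positives, 4 false negatives
print(precision_recall_f1(8, 2, 4))   # (0.8, 0.666..., 0.727...)
```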
Use Hidden Markov Models to do POS tagging
Notation:
Sequence of observations over time (sentence): $ O=o_1\dots o_T $
Set of states (tags): $Q=q^1\dots q^N$
Sequence of states over time: $Q = q_1 \dots q_T$
Vocabulary size: $V$
Time step: $t$, not a tag
Matrix of transition probabilities: $A$
- $a_{ij}$: the prob of transitioning from state $q^i$ to $q^j$
Matrix of output probabilities: $B$
- $b_i(o_t)$: the prob of emitting $o_t$ from state $q^i$
1. POS tagging
1.1 Introduction
Uses:
- First step towards syntactic analysis
- Often useful for semantic analysis
- Illustrate the use of HMMs
Depends on:
- The word to be labeled
- The labels of surrounding words
Tags: Penn Treebank POS tags
Difficulties:
- Ambiguity
- Sparse data
1.2 Probabilistic model for tagging (forward algorithm?)
Assumption:
- Each tag depends only on previous tag (bigram tag model)
- Words are independent given tags
Finite-state machine: sentences are generated by walking through states in a graph, each state represents a tag
Given:
- The transition and output probabilities
$$ P(O,Q) = \prod_{t=1}^T P(o_t \mid q_t) P(q_t \mid q_{t-1}) $$
2. HMM
2.1 Introduction
Purpose: find the best tag sequence for an untagged sentence
- Markov: Markov independence assumption (each tag / state only depends on fixed number of previous tags / states)
- Hidden: at test time we only see the words / emissions, the tags / states are hidden variables
Elements:
- a set of states (e.g. tags)
- a set of output symbols (e.g. words)
- initial state (e.g. beginning of the sentence)
- state transition probabilities (e.g. $P(t_i \mid t_{i-1})$)
- symbol emission probabilities (e.g. $P(w_i \mid t_i)$)
2.2 Model
$$ \underset{Q}{\operatorname{argmax}} P(Q \mid O) = \underset{Q}{\operatorname{argmax}} P(O \mid Q) P(Q) $$
- Using Bayes’ rule
- $P(Q) = \prod_t P(q_t \mid q_{t-1})$
- $P(O \mid Q) = \prod_t P(o_t \mid q_t)$
Joint probability: $$ P(O,Q\mid \lambda) = \prod_{t=1}^T P(o_t \mid q_t)\, P(q_t \mid q_{t-1}) = \prod_{t=1}^T b_{q_t}(o_t)\, a_{q_{t-1} q_t} $$
- $\lambda = (A,B)$
2.3 Transition and Output Probabilities (Matrix $A,B$)
- Transition matrix $A$: $a_{ij} = P(q^j \mid q^i)$
- Output matrix $B$: $b_i(o)=P(o \mid q^i)$ for output $o$
2.4 Searching: Viterbi algorithm
Decoding: finding the best tag sequence for a sentence is called decoding
Intuition: the best path of length $t$ ending in a given state must include the best path of length $t-1$ to the previous state
Find $\underset{Q}{\operatorname{argmax}} P(Q \mid O)$: $$ v(j,t) = \max_{i=1}^N v(i,t-1) \cdot a_{ij} \cdot b_j(o_t) $$
- Can also use negative log probabilities and take minimum over sum of costs
Chart: $N \times T$
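A minimal sketch of the Viterbi recursion over the $N \times T$ chart, in log space (the initial distribution $\pi$ and the toy parameters are assumptions for illustration):

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """obs: observation indices o_1..o_T; pi: initial state probabilities (N,);
    A[i, j] = P(q^j | q^i); B[i, o] = P(o | q^i). Returns the best state sequence."""
    N, T = A.shape[0], len(obs)
    v = np.zeros((N, T))                  # v[j, t] = log-prob of best path ending in state j at t
    back = np.zeros((N, T), dtype=int)    # backpointers into the previous column
    v[:, 0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        for j in range(N):
            scores = v[:, t - 1] + np.log(A[:, j]) + np.log(B[j, obs[t]])
            back[j, t] = int(np.argmax(scores))
            v[j, t] = scores[back[j, t]]
    states = [int(np.argmax(v[:, T - 1]))]      # best final state
    for t in range(T - 1, 0, -1):               # follow backpointers right to left
        states.append(int(back[states[-1], t]))
    return states[::-1]

# toy 2-state HMM with made-up parameters
pi = np.array([0.7, 0.3])
A = np.array([[0.8, 0.2], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi([0, 1, 1], pi, A, B))
```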
2.5 Training: Baum-Welch (Forward-Backward) algorithm
Compute the best parameters from an unannotated corpus
2.5.1 Expectation-maximisation
- Initialise parameters $\lambda^{(0)}$
- At each iteration $k$,
- E-step: compute expected counts using $\lambda^{(k-1)}$
- M-step: set $\lambda^{(k)}$ using MLE on the expected counts
- Repeat until $\lambda$ does not change
PS:
- May only find a local maximum; not guaranteed to find the global maximum
2.5.2 Baum-Welch
Idea: compute expected counts using forward probabilities and backward probabilities $$ \beta(j,t) = P(o_{t+1}, o_{t+2}, \dots,o_T \mid q_t = q^j, \lambda) $$ PS: EM is much more general and can be used for many latent variable models
2.6 Evaluation: Forward algorithm (sum instead of max): compute the likelihood
$$ \alpha(j,t) = \sum_{i=1}^N \alpha(i,t-1)\cdot a_{ij}\cdot b_j(o_t) $$
- Effectively a language model: it accounts not just for the single previous step but for all earlier steps
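A minimal sketch of the likelihood computation, replacing Viterbi's max with a sum (same toy parameters as in the Viterbi sketch above):

```python
import numpy as np

def forward_likelihood(obs, pi, A, B):
    """P(O | lambda): sum over all state sequences (cf. Viterbi, which takes the max)."""
    alpha = pi * B[:, obs[0]]            # alpha[j] = P(o_1, q_1 = j)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # sum over previous states, then emit o_t
    return float(alpha.sum())

# same toy 2-state HMM as in the Viterbi sketch
pi = np.array([0.7, 0.3])
A = np.array([[0.8, 0.2], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(forward_likelihood([0, 1, 1], pi, A, B))
```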
2.7 Three main problems
- Evaluation problem
- Evaluation problem answers the question: what is the probability that a particular sequence of symbols is produced by a particular model?
- For evaluation we use two algorithms: the forward algorithm or the backward algorithm (DO NOT confuse them with the forward-backward algorithm).
- Decoding problem
- Decoding problem answers the question: Given a sequence of symbols (your observations) and a model, what is the most likely sequence of states that produced the sequence.
- For decoding we use the Viterbi algorithm.
- Training problem
- Training problem answers the question: Given a model structure and a set of sequences, find the model that best fits the data.
- For this problem we can use the following 3 algorithms:
- MLE (maximum likelihood estimation)
- Viterbi training (DO NOT confuse with Viterbi decoding)
- Baum-Welch = forward-backward algorithm
Introduce two different methods to do text classification: Naive Bayes and Maximum Entropy
1. Introduction
Application:
- spam detection (binary: spam / not spam)
- sentiment analysis (binary or multiway)
- review (pos / neg or 1-5 stars)
- political argument (pro / con, or pro / con / neutral)
- topic classification (multiway: sport / finance / travel, etc.)
- categorise the author
- native language identification
- diagnosis of disease
- gender, dialect, educational background, political orientation
Why not N-gram:
- sequential relationships are largely irrelevant for many tasks (bag of words assumption)
- want to include other features (e.g. POS tags) that N-gram models do not include
Two methods:
- Naive Bayes
- Maximum Entropy (multinomial logistic regression)
2. Naive Bayes
2.1 Model
$$ \hat{c} = \underset{c\in C}{\operatorname{argmax}} P(c\mid d) = \underset{c\in C}{\operatorname{argmax}} \frac{P(d \mid c)P(c)}{P(d)} = \underset{c\in C}{\operatorname{argmax}}P(d \mid c)P(c) $$
Naive Bayes assumption: $P(d \mid c) = P(f_1, f_2,\dots ,f_n \mid c) \approx P(f_1 \mid c)P(f_2\mid c)\dots P(f_n \mid c)$ $$ \ \ \hat{P}(f_i \mid c) = \frac{\text{count}(f_i,c)+\alpha}{\sum_{f \in F}(\text{count}(f,c)+\alpha)} $$ PS: $\alpha$ is added once per feature in the denominator, so the total count in each class grows by $\left |F\right | \cdot \alpha$
- $F$: the set of possible features
- $\alpha$: smoothing parameter
- features not seen in training for a class get $\hat{P}(f_i \mid +) = \frac{\alpha}{68+\alpha \left |F\right |}$ (68 being the total feature count for class $+$ in the worked example)
$P(c)$: estimated with MLE $$ \hat{P}(c) = \frac{N_c}{N} $$
2.2 Step
Test document $d$
Calculate $$ P(+ \mid d) \propto P(+) \prod_{i=1}^n P(f_i \mid +) \ \ \ P(-\mid d) \propto P(-) \prod_{i=1}^n P(f_i \mid -) $$
Choose the one with larger value
2.3 Choose features
- Binary value for $f_i$
- Use a subset of $F$ (e.g. ignore stop words)
- Use more complex features (e.g. bigrams, etc.)
2.4 Deal with very small numbers (underflow): use costs
Find the lowest cost classification $$ \hat{c} = \underset{c\in C}{\operatorname{argmin}} (-\log P(c) + \sum_{i=1}^n - \log P(f_i \mid c)) $$
- Costs are negative log probabilities
- Can avoid underflow by turning multiplication to sum
- Look for the lowest cost overall
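A minimal sketch of training with add-$\alpha$ smoothing and classifying via negative-log-probability costs (the toy documents, function names, and defaults are illustrative assumptions):

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """docs: list of token lists; labels: parallel list of class labels."""
    classes = sorted(set(labels))
    vocab = {w for doc in docs for w in doc}
    prior = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for doc, c in zip(docs, labels):
        counts[c].update(doc)
    totals = {c: sum(counts[c].values()) for c in classes}

    def cost(doc, c):                    # negative log probability avoids underflow
        total = -math.log(prior[c])
        for w in doc:
            p = (counts[c][w] + alpha) / (totals[c] + alpha * len(vocab))
            total += -math.log(p)
        return total

    return lambda doc: min(classes, key=lambda c: cost(doc, c))   # lowest cost overall

classify = train_nb([["great", "film"], ["boring", "film"]], ["+", "-"])
print(classify(["great", "film"]))   # '+'
```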
2.5 Implementation
- Naive Bayes is a linear classifier since it uses a linear function of the input features
- As is logistic regression
2.6 Problems, Pros & Cons
Problem:
- might need domain-specific non-sentiment words (e.g. memory, CPU for computer product reviews)
- stopwords might be very useful features for some tasks
- probably better to use too many irrelevant features than not enough relevant features
Pros:
- Easy to implement
- Fast to train
- Do not need much training data compared with other methods (good for small datasets)
- Usually works reasonably well
- Can be a baseline method for any classification task
Cons:
- Assumption is naive!
- features are not usually independent given the class
- adding feature types often leads to even stronger correlations between features
- accuracy can sometimes be OK, but it will be highly overconfident in its decisions
3. Maximum Entropy Model
3.1 Model
$$ P(c \mid \vec{x}) = \frac{1}{Z}\exp\Big(\sum_i w_i f_i(\vec{x},c)\Big) \ \ \ \text{where } Z = \sum_{c'} \exp\Big(\sum_i w_i f_i(\vec{x},c')\Big) $$
- Multinomial logistic regression, model $P(c \mid d)$ directly instead of using Bayes’ rule
- Use vector to represent the observed data
- End up choosing the class $c$ for which $\sum_i w_i f_i(\vec{x},c)$ (i.e. $\vec{w} \cdot \vec{f}$) is highest
PS:
- Effectively the weights vary per class, since each feature has a different influence on each class
3.2 Training
Conditional maximum likelihood estimation (CMLE):
Given instances $x^{(1)} \dots x^{(N)}$ with labels $c^{(1)}\dots c^{(N)}$ $$ \vec{w} = \underset{\vec{w}}{\operatorname{argmax}} \sum_j \log P(c^{(j)}\mid x^{(j)}) $$ Avoid overfitting: regularisation
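A minimal sketch of the scoring and normalisation step (the feature functions and weights are made-up illustrations; real training would fit $\vec{w}$ by CMLE with a gradient-based optimiser):

```python
import math

def maxent_probs(x, classes, features, weights):
    """features: list of functions f_i(x, c) -> float; weights: parallel list w_i."""
    def score(c):
        return sum(w * f(x, c) for w, f in zip(weights, features))
    exp_scores = {c: math.exp(score(c)) for c in classes}
    Z = sum(exp_scores.values())                     # normalise over all classes
    return {c: s / Z for c, s in exp_scores.items()}

# toy binary sentiment example with two indicator features (made up)
features = [lambda x, c: 1.0 if "great" in x and c == "+" else 0.0,
            lambda x, c: 1.0 if "boring" in x and c == "-" else 0.0]
weights = [2.0, 1.5]
print(maxent_probs({"great", "film"}, ["+", "-"], features, weights))
```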
3.3 Pros & Cons
Pros:
- Better performance
- Do not need conditionally independent assumption
Cons:
- Time-consuming if there are a large number of classes and / or thousands of features to extract from each training example
3.4 Why is it called MaxEnt?
- The probabilistic model we are building should follow whatever constraints we impose on it
- But beyond these constraints we should follow Occam’s Razor, make the fewest possible assumptions
4. Summary
4.1 Differences
| | Naive Bayes | MaxEnt |
|---|---|---|
| Model | MLE (maximum likelihood estimation); Bayes’ rule | CMLE (conditional maximum likelihood estimation); multinomial logistic regression |
| Features | Directly observed | Vector representation; features are functions that depend on both observation $\vec{x}$ and class $c$ |
| Assumption | Conditional independence assumption | None |
4.2 Similarity
- Both are easily available in standard ML toolkits
- Both need to choose what features are good to use