1. Semantic Roles
1.1 Commonly used thematic roles
1.2 Issues with thematic roles
- No universally agreed-upon set of roles
- Items with the same role may not behave quite the same
To avoid these problems, define roles that are either:
- Specific to individual verbs only (PropBank)
- Specific to small groups of verbs (FrameNet)
2. Labelling
2.1 PropBank
Frame Roles:
- Arg0-PAG: loader
- Arg1-GOL: beast of burden
- Arg2-PPT: cargo
- Arg3-MNR: instrument
2.2 FrameNet
==Meanings are relativised to scenes!==
- Tries to capture relationships among word and phrase meanings by assigning them the same frame
- $\approx 1000$ frames represent scenarios
- Frames are explained with textual descriptions and linguistic examples
Resources:
- FrameNets for several languages
- Some data annotated with Frame elements from FrameNet
- SEMAFOR: a frame-semantic parser
Distributional hypothesis:
- Can infer meaning just by looking at the contexts a word occurs in
- Meaning IS the contexts a word occurs in
- either way, similar contexts imply similar meanings
1. Distributional semantics
1.1 Basic idea
Represent each word $w_i$ as a vector of its contexts
Distributional semantic models also called vector-space models
1.2 Context
1.2.1 Defining the context
- usually ignore stopwords
- usually use a large window around the target word
1.2.2 Weights
Mutual information: $$ \text{PMI}(x,y) = \log_2 \frac{P(x,y)}{P(x)P(y)} $$
- PMI tells us how much more / less likely the cooccurrence is than if the words were independent
- Problem:
- in practice, PMI is computed with counts, so it is over-sensitive to the chance co-occurrence of infrequent words
- Improvement: use positive PMI (PPMI)
- change all negative PMI values to 0 (because for infrequent words, not enough data to accurately determine negative PMI values)
- smoothing in PMI computation (see J & M)
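A minimal sketch of PPMI computed from a raw co-occurrence count matrix (the variable names and the toy counts are illustrative assumptions, not from the slides):

```python
import numpy as np

def ppmi(counts):
    """counts[i, j] = co-occurrence count of target word i with context word j."""
    total = counts.sum()
    p_xy = counts / total                      # joint probabilities P(x, y)
    p_x = p_xy.sum(axis=1, keepdims=True)      # marginal P(x) over target words
    p_y = p_xy.sum(axis=0, keepdims=True)      # marginal P(y) over context words
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_xy / (p_x * p_y))
    return np.maximum(pmi, 0)                  # PPMI: clip negative PMI values to 0

# toy example: 2 target words x 3 context words
counts = np.array([[10.0, 0.0, 3.0],
                   [ 2.0, 8.0, 1.0]])
print(ppmi(counts))
```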
1.3 Compositionality in vector space
Compositionality implies some operator $\oplus$ such that $$ \text{meaning}(w_1w_2) = \text{meaning}(w_1) \oplus \text{meaning}(w_2) $$
2. Measure similarity
- Euclidean distance
- Dot product
- Normalised dot product (the cosine of the angle between vectors)
- Other
- Jaccard measure
- Dice measure
- Jenson-Shannon divergence
- …
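A minimal sketch of the normalised dot product (cosine) measure from the list above, with made-up context-count vectors:

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two context vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# toy context-count vectors (illustrative only)
lemon  = np.array([5.0, 1.0, 0.0, 2.0])
orange = np.array([4.0, 0.0, 1.0, 3.0])
truth  = np.array([0.0, 6.0, 2.0, 0.0])

print(cosine(lemon, orange))   # high: similar contexts
print(cosine(lemon, truth))    # low: dissimilar contexts
```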
3. Evaluation
- Extrinsic: IR, QA, automatic essay marking, …
- Intrinsic
- relatedness judgement
- word association
3.1 Relatedness judgement
E.g. on a scale of 1-10, how related are the two words?
lemon-truth = 1
lemon-orange = 10
- Answers depend a lot on how the question is asked
3.2 Word association
Say the first word that comes to mind
Collect the data and provide probabilities of responses
3.3 Dimension reduction: learn a more compact space
- LSA (Latent Semantic Analysis)
- Neural network methods
Goal:
- a machine which can answer questions in natural language
- has access to a knowledge base (KB)
- has access to vast quantities of English text
1. Semantics
1.1 Lexical and Sentential
Lexical semantics: the meaning of individual words
Sentential semantics: how word meanings combine (e.g. who did that to whom; when, how, why…)
1.2 Problems
- Senses: words may have different meanings (need to disambiguate between them)
- Synonyms: words may have the same meaning
- Ontology (hyponym & hypernym): words can refer to a subset or superset of the concept referred to by another word (A is a B relationships)
- Similarity & gradation: words may be related in other ways
- Inference
PS: some of these problems can be solved with a good ontology (e.g. WordNet)
2. WordNet
WordNet is a hand-built resource containing 117,000 synsets
Relations:
- hyponym / hypernym (IS-A: chair-furniture)
- meronym (PART-WHOLE: leg-chair)
- antonym (OPPOSITES: good - bad)
3. Word sense ambiguity
Words can have multiple senses
A way to define senses: different sense = different translation
3.1 Semantics ambiguity types
- Polysemy: a word has related meanings
- Homonymy: a word has unrelated meanings
- Synonymy: different words can mean the same thing
3.2 Word sense disambiguation (WSD)
Task: Given a sense ambiguous word, find the sense in a given context
Classifier:
- Naive Bayes
- Decision lists
- Decision trees
3.2.1 Naive Bayes for WSD
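A sketch of the decision rule, i.e. standard Naive Bayes applied over a word's senses (the notation $S(w)$ for the sense inventory is mine, not from the slides): $$ \hat{s} = \underset{s \in S(w)}{\operatorname{argmax}}\ P(s) \prod_{i=1}^{n} P(f_i \mid s) $$ where $f_1 \dots f_n$ are the context features listed below.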
3.2.2 Features
- Directly neighbouring words
- Any content words in a 50 word window
- Syntactically related words
- Syntactic role in the sentence
- Topic of the text
- POS tag, surrounding POS tags
3.3 Evaluation
Extrinsic: test as part of IR, QA, or MT system
Intrinsic: evaluate classification accuracy or precision / recall against gold-standard senses
Baseline: choose the most frequent sense
3.4 Issues with WSD
- Not always clear how fine-grained the gold-standard should be
- Difficult / expensive to annotate corpora with fine-grained senses
- Classifier must be trained separately for each word
- hard to learn anything for infrequent or unseen words
- requires new annotations for each new word
- motivates unsupervised and semi-supervised methods
4. Semantic classes
Methods: define coarse-grained semantic categories like PERSON, LOCATION, ARTIFACT
- Named entity recognition (NER): recognising and classifying proper names in text; important for many applications
- Supersense tagging
1. Parser evaluation
Correct: a constituent is correct if a gold constituent spans the same sentence positions
Harsher measure: also require the constituent labels to match
- precision
- recall
- F-score (balances precision and recall): $$ \frac{2pr}{p+r} $$
2. Lexicalisation
2.1 Introduction
Problem: replacing one word with another with the same POS will never result in a different parsing decision, even though it should
Solution: adding the lexical head of the phrase
2.2 Pros & Cons
Pros:
- makes the grammar more specific
Cons:
- leads to huge grammar blowup and very sparse data
2.3 Improvement
- complex smoothing schemes (similar to N-gram interpolation / backoff)
- increasing emphasis on automatically learned subcategories
2.4 Lexicalised constituency parse
Direct dependency parsing:
Transition-based (shift-reduce)
Train a classifier as the oracle to predict next action (Shift, LeftArc, or RightArc), and proceed left-to-right through the sentence
$O(n)$ time
Graph-based
From the fully connected directed graph of all possible edges, choose the best ones that form a tree
Edge-factored models: a classifier assigns a nonnegative score to each possible edge; a maximum spanning tree algorithm finds the spanning tree with the highest total score
$O(n^2)$ time
Head Rules
3. Phrase Structure & Dependencies
In languages with free word order, phrase structure grammars don’t make as much sense
3.1 Pros and Cons
Pros:
- Sensible framework for free word order languages
- Identifies syntactic relations directly
- Dependency pairs / chains can make good features in classifiers, for information extraction
- Parsers can be very fast
Cons:
- The assumption of asymmetric binary relations is not always right (e.g. how to parse coordination like “dogs and cats”)
3.2 Edge labels
Labels: subject, direct object, determiner, adjective modifier, adverbial modifier
3.3 Dependency Paths
For information extraction tasks involving real-world relationships between entities, chains of dependencies can provide good features
Projectivity:
A sentence’s dependency parse is projective if every subtree occupies a contiguous span of the sentence
- i.e. the dependency parse can be drawn on top of the sentence without any crossing edges
Nonprojectivity:
- Nonprojectivity is rare in English, but quite common in many languages
4. Dependency Parsing
4.1 CKY
CKY can be adapted, but efficiency is not good: the obvious approach is $O(Gn^5)$; the Eisner algorithm can bring it down to $O(Gn^3)$
4.2 Transition-based: Shift reduce
3 possible actions:
- LeftArc: assign head-dependent relation between $s_1$ and $s_2$; pop $s_2$
- RightArc: assign head-dependent relation between $s_2$ and $s_1$; pop $s_1$
- Shift: move the next word of the buffer ($w_1$) onto the top of the stack
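A minimal sketch of this transition system, driven by a hard-coded action sequence instead of a trained oracle (arc labels omitted; the function and example are illustrative assumptions):

```python
def parse(words, actions):
    """words: list of tokens; actions: oracle output, e.g. from a classifier.
    Returns a list of (head, dependent) arcs over word indices."""
    stack, buffer, arcs = [], list(range(len(words))), []
    for action in actions:
        if action == "SHIFT":                      # move next word onto the stack
            stack.append(buffer.pop(0))
        elif action == "LEFTARC":                  # s1 is head of s2; pop s2
            s1, s2 = stack[-1], stack[-2]
            arcs.append((s1, s2))
            del stack[-2]
        elif action == "RIGHTARC":                 # s2 is head of s1; pop s1
            s1, s2 = stack[-1], stack[-2]
            arcs.append((s2, s1))
            stack.pop()
    return arcs

# "dogs bark": SHIFT, SHIFT, then LEFTARC makes "bark" the head of "dogs"
print(parse(["dogs", "bark"], ["SHIFT", "SHIFT", "LEFTARC"]))  # [(1, 0)]
```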
4.3 Graph-based
4.4 Comparison
5. Choosing a Parser
- Target representation: constituency or dependency?
- Efficiency
- Incrementality: parse the whole sentence at once, or obtain partial left-to-right analyses / expectations
- Retrainable system or not
Two theories of syntax:
- Context-free grammar
- Dependency grammar
1. Context-free grammar
Symbols:
- Terminals (t): words
- Non-terminals (NT): phrasal categories (e.g. S, NP, VP, PP) with S being the start symbol
Chomsky Normal Form: only has rules of the form $$ NT \rightarrow NT\ \ NT \quad \text{or} \quad NT \rightarrow t $$
2. Syntactic analysis
2.1 Global Ambiguity
Structural ambiguity
POS ambiguity:
Attachment ambiguity: Some sentences have structural ambiguity even without part-of-speech ambiguity
Depends on where different phrases attach in the tree
Different attachments have different meanings
I saw the man with the telescope
She ate the pizza on the floor
Good boys and girls get presents from Santa
Why we want parse trees:
- compositional semantics: the meaning of a constituent is a function of the meaning of its children
- rule-to-rule semantics: that function is determined by the rule which licenses the constituent
Huge problems:
- correctness problem: need to find the right structure to get the right meaning
- efficiency problem: searching all possible structures can be very slow; want to use parsing for large-scale language tasks (for example, as used to create Google’s “infoboxes”)
2.2 Local ambiguity
Local ambiguity is also a big problem: multiple analyses for parts of sentence.
- “The dog bit the child”: first three words could be NP (but aren’t)
- Building useless partial structures wastes time
- Avoiding useless computation is a major issue in parsing
3. Parsing
3.1 Normal Parsing
- Top-down: depth-first, breadth-first, best-first
- Bottom-up
- Mixed strategy (e.g. Earley algorithm)
3.1.1 Recursive Descent Parsing
Idea: top-down, depth-first parsing
e.g. Slide 11 pp. 23-24
3.1.2 Shift-reduce Parsing
Idea: bottom-up, depth-first parsing
e.g. Slide 11 pp. 27
3.1.3 CKY algorithm (Cocke, Kasami, Younger)
Idea: bottom-up, breadth-first parsing
Assumption: grammar in Chomsky Normal Form
Pseudocode:
for len = 1 to n:                        # number of words in constituent
    for i = 0 to (n-len):                # start position
        j = i + len                      # end position
        for k = (i+1) to (j-1):
            # process binary rules
            for each (A -> B C) where c[i,k] has B and c[k,j] has C:
                add A to c[i,j] with pointers to B and C
        # process unary rules (after the binary rules, once c[i,j] is filled)
        for each (A -> B) where c[i,j] has B:
            add A to c[i,j] with a pointer to B
Time complexity: $O(Gn^3)$
e.g. Slide 11 pp.5-11
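A runnable sketch of the recogniser above for a grammar in CNF (the grammar encoding and the toy grammar are my own assumptions; unary non-terminal rules and backpointers are omitted for brevity):

```python
from collections import defaultdict

def cky(words, lexical, binary):
    """lexical: set of (A, word) rules; binary: set of (A, B, C) rules meaning A -> B C."""
    n = len(words)
    chart = defaultdict(set)                     # chart[i, j] = labels spanning words[i:j]
    for i, w in enumerate(words):                # fill the diagonal from lexical rules
        chart[i, i + 1] |= {A for (A, t) in lexical if t == w}
    for length in range(2, n + 1):               # length of constituent
        for i in range(0, n - length + 1):       # start position
            j = i + length                       # end position
            for k in range(i + 1, j):            # split point
                for (A, B, C) in binary:
                    if B in chart[i, k] and C in chart[k, j]:
                        chart[i, j].add(A)       # a real parser also stores backpointers
    return chart

lexical = {("NP", "dogs"), ("V", "chase"), ("NP", "cats")}   # toy CNF grammar (made up)
binary = {("S", "NP", "VP"), ("VP", "V", "NP")}
print(cky(["dogs", "chase", "cats"], lexical, binary)[0, 3])  # {'S'}: the sentence is covered
```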
Problem:
- Ambiguity is only clear if we go on to reconstruct the parses using our backpointers
Pros & Cons:
Pros:
- much more efficient than depth-first
Cons:
- still may compute a lot of unnecessary partial parses
- requires converting the grammar to CNF
Can be extended to probabilistic CKY
3.1.4 Choose parsing algorithm
==Probabilities ???==
3.2 Statistical Parsing
3.2.1 Treebank grammars
Idea: annotate real sentences with parse trees
3.2.1.1 The Penn Treebank (Treebank PCFG)
PCFG (probabilistic context-free grammar) $$ NT \rightarrow \beta \ \text{ with probability } P(\beta \mid NT) $$
3.2.1.2 Create PCFG
Easiest way: MLE
- Count all occurrences of $NT \rightarrow \beta$ in treebank
- Divide by the count of all rules whose left-hand-side is $NT$ to get $P(\beta \mid NT)$
$$ P(NT \rightarrow C_1, C_2 \dots C_n \mid NT) = \frac{count(NT \rightarrow C_1,C_2 \dots C_n)}{count(NT)} $$
e.g. Slide 12 pp. 21-24
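A minimal sketch of this MLE, assuming treebank rule occurrences are given as (LHS, RHS) pairs (the data format and toy counts are illustrative assumptions):

```python
from collections import Counter

def estimate_pcfg(rules):
    """rules: list of (lhs, rhs) pairs read off treebank trees, e.g. ("NP", ("DT", "NN"))."""
    rule_counts = Counter(rules)
    lhs_counts = Counter(lhs for lhs, _ in rules)
    return {(lhs, rhs): count / lhs_counts[lhs]          # P(rhs | lhs)
            for (lhs, rhs), count in rule_counts.items()}

# toy treebank rule occurrences (made up)
rules = [("NP", ("DT", "NN")), ("NP", ("DT", "NN")), ("NP", ("NNP",)),
         ("VP", ("V", "NP"))]
probs = estimate_pcfg(rules)
print(probs[("NP", ("DT", "NN"))])   # 2/3
```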
3.2.2 Probability of a parse
$$ P(t) = \prod_{(NT \rightarrow \beta) \in t} P(\beta \mid NT) $$
3.2.3 Probabilistic CKY
- When we find an $NT$ spanning $(i,j)$, store its probability along with its label in cell $(i,j)$
- If we later find an $NT$ with the same span but higher probability, replace the probability and the backpointers for $NT$ in cell $(i,j)$
3.2.4 Other algorithms
- Inside algorithm
- Inside-outside algorithm
3.2.5 Best-first probabilistic parsing
Basic idea: use probabilities of subtrees to decide which ones to build up further
How to score constituents:
Why not raw probability: smaller constituents will almost always have higher probabilities
Possible solution:
- divide by the number of words (but not guaranteed to find the best parse first)
- A*
1. Annotation
Penn Treebank: 36 POS tags (excluding punctuation)
Factors in Annotation:
- source data (genre? size? licensing?)
- annotation scheme (complexity? guidelines?)
- annotators (expertise? training?)
- annotation software (graphical interface?)
- quality control procedures (multiple annotation, adjudication?)
2. Evaluation
2.1 Hypotheses
About existing linguistic objects:
- Is this text by Shakespeare or Marlowe?
About output of a language system:
- How well does this language model predict the data?
- How accurate is this segmenter / tagger / parser?
- Is this segmenter / tagger / parser better than that one?
About human beings:
- How reliable is this person’s annotation?
- To what extent do these two annotators agree? (IAA: inter-annotator agreement)
2.2 Gold Standard Evaluation
- Gold standards used both for training and for evaluation
- But testing must be done on unseen data
2.3 Tuning
- Lidstone (add-$\lambda$) smoothing
- Choosing features for text classification
Overfitting: if we tune on the test set, it is no longer a reliable proxy for new data.
Solution: hold out a second set for tuning, called a development set, and save the test set for the very end
2.4 Cross-validation
Use when the dataset is too small to have a nice train / test or train / dev / test split
$k$-fold cross-validation
Partition the data into $k$ pieces and treat them as mini held-out sets. Each fold is an experiment with a different held-out set, using the rest of the data for training
PS: If tuning the system via cross-validation, still important to have a separate blind test set.
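A minimal sketch of producing the $k$ folds (index bookkeeping only; function name and usage are illustrative assumptions):

```python
def k_fold_indices(n_items, k):
    """Yield (train_indices, heldout_indices) for each of the k folds."""
    indices = list(range(n_items))
    fold_size = n_items // k
    for fold in range(k):
        start = fold * fold_size
        end = (fold + 1) * fold_size if fold < k - 1 else n_items  # last fold takes the remainder
        heldout = indices[start:end]
        train = indices[:start] + indices[end:]
        yield train, heldout

for train, heldout in k_fold_indices(10, 5):
    print(len(train), len(heldout))   # 8 2 on every fold
```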
2.5 Measurement
$$ \text{precision} = \frac{TP}{TP+FP} \ \ \ \text{recall (true positive rate)} = \frac{TP}{TP+FN}\ \ \ F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision}+\text{recall}} $$
Upper bound: Turing test
- When using a human Gold Standard, check the agreement of humans against that standard.
Lower bound: performance of baseline model
Confusion matrix
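A minimal sketch computing the measures above from confusion-matrix counts (the toy counts are made up):

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # true positive rate
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# toy counts: 8 true positives, 2 false positives, 4 false negatives
print(precision_recall_f1(8, 2, 4))   # (0.8, 0.666..., 0.727...)
```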
Use Hidden Markov Models to do POS tagging
Notation:
Sequence of observations over time (sentence): $ O=o_1\dots o_T $
Set of states (tags): $Q=q^1\dots q^N$
Sequence of states over time: $Q = q_1 \dots q_T$
Vocabulary size: $V$
Time step: $t$, not a tag
Matrix of transition probabilities: $A$
- $a_{ij}$: the prob of transitioning from state $q^i$ to $q^j$
Matrix of output probabilities: $B$
- $b_i(o_t)$: the prob of emitting $o_t$ from state $q^i$
1. POS tagging
1.1 Introduction
Uses:
- First step towards syntactic analysis
- Often useful for semantic analysis
- Illustrate the use of HMMs
Depends on:
- The word to be labeled
- The labels of surrounding words
Tags: Penn Treebank POS tags
Difficulties:
- Ambiguity
- Sparse data
1.2 Probabilistic model for tagging (forward algorithm?)
Assumption:
- Each tag depends only on previous tag (bigram tag model)
- Words are independent given tags
Finite-state machine: sentences are generated by walking through states in a graph, each state represents a tag
Given:
- The transition and output probabilities
$$ P(O,Q) = \prod_{t=1}^T P(o_t \mid q_t) P(q_t \mid q_{t-1}) $$
2. HMM
2.1 Introduction
Purpose: find the best tag sequence for an untagged sentence
- Markov: Markov independence assumption (each tag / state only depends on fixed number of previous tags / states)
- Hidden: at test time we only see the words / emissions, the tags / states are hidden variables
Elements:
- a set of states (e.g. tags)
- a set of output symbols (e.g. words)
- initial state (e.g. beginning of the sentence)
- state transition probabilities (e.g. $P(t_i \mid t_{i-1})$)
- symbol emission probabilities (e.g. $P(w_i \mid t_i)$)
2.2 Model
$$ \underset{Q}{\operatorname{argmax}} P(Q \mid O) = \underset{Q}{\operatorname{argmax}} P(O \mid Q) P(Q) $$
- Using Bayes’ rule
- $P(Q) = \prod_t P(q_t \mid q_{t-1})$
- $P(O \mid Q) = \prod_t P(o_t \mid q_t)$
Joint probability: $$ P(O,Q\mid \lambda) = \prod_{t=1}^T P(o_t \mid q_t)\, P(q_t \mid q_{t-1}) = \prod_{t=1}^T b_{q_t}(o_t)\, a_{q_{t-1} q_t} $$
- $\lambda = (A,B)$
2.3 Transition and Output Probabilities (Matrix $A,B$)
- Transition matrix $A$: $a_{ij} = P(q^j \mid q^i)$
- Output matrix $B$: $b_i(o)=P(o \mid q^i)$ for output $o$
2.4 Searching: Viterbi algorithm
Decoding: finding the best tag sequence for a sentence is called decoding
Intuition: the best path of length $t$ ending in a given state must include the best path of length $t-1$ to the previous state
Find $\underset{Q}{\operatorname{argmax}} P(Q \mid O)$: $$ v(j,t) = \max_{i=1}^N v(i,t-1) \cdot a_{ij} \cdot b_j(o_t) $$
- Can also use negative log probabilities and take minimum over sum of costs
Chart: $N \times T$
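A minimal sketch of the Viterbi recursion over the $N \times T$ chart, in log space (the initial distribution $\pi$ and the toy parameters are assumptions for illustration):

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """obs: observation indices o_1..o_T; pi: initial state probabilities (N,);
    A[i, j] = P(q^j | q^i); B[i, o] = P(o | q^i). Returns the best state sequence."""
    N, T = A.shape[0], len(obs)
    v = np.zeros((N, T))                  # v[j, t] = log-prob of best path ending in state j at t
    back = np.zeros((N, T), dtype=int)    # backpointers into the previous column
    v[:, 0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        for j in range(N):
            scores = v[:, t - 1] + np.log(A[:, j]) + np.log(B[j, obs[t]])
            back[j, t] = int(np.argmax(scores))
            v[j, t] = scores[back[j, t]]
    states = [int(np.argmax(v[:, T - 1]))]      # best final state
    for t in range(T - 1, 0, -1):               # follow backpointers right to left
        states.append(int(back[states[-1], t]))
    return states[::-1]

# toy 2-state HMM with made-up parameters
pi = np.array([0.7, 0.3])
A = np.array([[0.8, 0.2], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi([0, 1, 1], pi, A, B))
```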
2.5 Training: Baum-Welch (Forward-Backward) algorithm
Compute the best parameters from an unannotated corpus
2.5.1 Expectation-maximisation
- Initialise parameters $\lambda^{(0)}$
- At each iteration $k$,
- E-step: compute expected counts using $\lambda^{(k-1)}$
- M-step: set $\lambda^{(k)}$ using MLE on the expected counts
- Repeat until $\lambda$ does not change
PS:
- May only find a local maximum; not guaranteed to find the global maximum
2.5.2 Baum-Welch
Idea: compute expected counts using forward probabilities and backward probabilities $$ \beta(j,t) = P(o_{t+1}, o_{t+2}, \dots,o_T \mid q_t = q^j, \lambda) $$ PS: EM is much more general and can be used for many latent variable models
2.6 Evaluation: Forward algorithm (sum instead of max): compute the likelihood
$$ \alpha(j,t) = \sum_{i=1}^N \alpha(i,t-1)\cdot a_{ij}\cdot b_j(o_t) $$
- Effectively a language model: it accounts not just for the single previous step but for all earlier steps
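A minimal sketch of the likelihood computation, replacing Viterbi's max with a sum (same toy parameters as in the Viterbi sketch above):

```python
import numpy as np

def forward_likelihood(obs, pi, A, B):
    """P(O | lambda): sum over all state sequences (cf. Viterbi, which takes the max)."""
    alpha = pi * B[:, obs[0]]            # alpha[j] = P(o_1, q_1 = j)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # sum over previous states, then emit o_t
    return float(alpha.sum())

# same toy 2-state HMM as in the Viterbi sketch
pi = np.array([0.7, 0.3])
A = np.array([[0.8, 0.2], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(forward_likelihood([0, 1, 1], pi, A, B))
```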
2.7 Three main problems
- Evaluation problem
- Evaluation problem answers the question: what is the probability that a particular sequence of symbols is produced by a particular model?
- For evaluation we use two algorithms: the forward algorithm or the backward algorithm (DO NOT confuse them with the forward-backward algorithm).
- Decoding problem
- Decoding problem answers the question: Given a sequence of symbols (your observations) and a model, what is the most likely sequence of states that produced the sequence.
- For decoding we use the Viterbi algorithm.
- Training problem
- Training problem answers the question: Given a model structure and a set of sequences, find the model that best fits the data.
- For this problem we can use the following 3 algorithms:
- MLE (maximum likelihood estimation)
- Viterbi training (DO NOT confuse with Viterbi decoding)
- Baum-Welch = forward-backward algorithm
Introduce two different methods to do text classification: Naive Bayes and Maximum Entropy
1. Introduction
Application:
- spam detection (binary: spam / not spam)
- sentiment analysis (binary or multiway)
- review (pos / neg or 1-5 stars)
- political argument (pro / con, or pro / con / neutral)
- topic classification (multiway: sport / finance / travel, etc.)
- categorise the author
- native language identification
- diagnosis of disease
- gender, dialect, educational background, political orientation
Why not N-gram:
- sequential relationships are largely irrelevant for many tasks (bag of words assumption)
- want to include other features (e.g. POS tags) that N-gram models do not include
Two methods:
- Naive Bayes
- Maximum Entropy (multinomial logistic regression)
2. Naive Bayes
2.1 Model
$$ \hat{c} = \underset{c\in C}{\operatorname{argmax}} P(c\mid d) = \underset{c\in C}{\operatorname{argmax}} \frac{P(d \mid c)P(c)}{P(d)} = \underset{c\in C}{\operatorname{argmax}}P(d \mid c)P(c) $$
Naive Bayes assumption: $P(d \mid c) = P(f_1, f_2,\dots ,f_n \mid c) \approx P(f_1 \mid c)P(f_2\mid c)\dots P(f_n \mid c)$ $$ \ \ \hat{P}(f_i \mid c) = \frac{\text{count}(f_i,c)+\alpha}{\sum_{f \in F}(\text{count}(f,c)+\alpha)} $$ PS: $\alpha$ is added once per feature in the denominator, so the total count in each class grows by $\left |F\right | \cdot \alpha$
- $F$: the set of possible features
- $\alpha$: smoothing parameter
- features not seen in training for a class get $\hat{P}(f_i \mid +) = \frac{\alpha}{68+\alpha \left |F\right |}$ (68 being the total feature count for class $+$ in the worked example)
$P(c)$: estimated with MLE $$ \hat{P}(c) = \frac{N_c}{N} $$
2.2 Step
Test document $d$
Calculate $$ P(+ \mid d) \propto P(+) \prod_{i=1}^n P(f_i \mid +) \ \ \ P(-\mid d) \propto P(-) \prod_{i=1}^n P(f_i \mid -) $$
Choose the one with larger value
2.3 Choose features
- Binary value for $f_i$
- Use a subset of $F$ (e.g. ignore stop words)
- Use more complex features (e.g. bigrams, etc.)
2.4 Deal with very small numbers (underflow): use costs
Find the lowest cost classification $$ \hat{c} = \underset{c\in C}{\operatorname{argmin}} (-\log P(c) + \sum_{i=1}^n - \log P(f_i \mid c)) $$
- Costs are negative log probabilities
- Can avoid underflow by turning multiplication to sum
- Look for the lowest cost overall
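A minimal sketch of training with add-$\alpha$ smoothing and classifying via negative-log-probability costs (the toy documents, function names, and defaults are illustrative assumptions):

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """docs: list of token lists; labels: parallel list of class labels."""
    classes = sorted(set(labels))
    vocab = {w for doc in docs for w in doc}
    prior = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for doc, c in zip(docs, labels):
        counts[c].update(doc)
    totals = {c: sum(counts[c].values()) for c in classes}

    def cost(doc, c):                    # negative log probability avoids underflow
        total = -math.log(prior[c])
        for w in doc:
            p = (counts[c][w] + alpha) / (totals[c] + alpha * len(vocab))
            total += -math.log(p)
        return total

    return lambda doc: min(classes, key=lambda c: cost(doc, c))   # lowest cost overall

classify = train_nb([["great", "film"], ["boring", "film"]], ["+", "-"])
print(classify(["great", "film"]))   # '+'
```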
2.5 Implementation
- Naive Bayes is a linear classifier since it uses a linear function of the input features
- As is logistic regression
2.6 Problems, Pros & Cons
Problem:
- might need domain-specific non-sentiment words (e.g. memory, CPU for computer product reviews)
- stopwords might be very useful features for some tasks
- probably better to use too many irrelevant features than not enough relevant features
Pros:
- Easy to implement
- Fast to train
- Do not need much training data compared with other methods (good for small datasets)
- Usually works reasonably well
- Can be a baseline method for any classification task
Cons:
- Assumption is naive!
- features are not usually independent given the class
- adding feature types often leads to even stronger correlations between features
- accuracy can sometimes be OK, but it will be highly overconfident in its decisions
3. Maximum Entropy Model
3.1 Model
$$ P(c \mid \vec{x}) = \frac{1}{Z}\exp\Big(\sum_i w_i f_i(\vec{x},c)\Big) \ \ \ \text{where } Z = \sum_{c'} \exp\Big(\sum_i w_i f_i(\vec{x},c')\Big) $$
- Multinomial logistic regression, model $P(c \mid d)$ directly instead of using Bayes’ rule
- Use vector to represent the observed data
- End up choosing the class $c$ for which $\sum_i w_i f_i(\vec{x},c)$ (i.e. $\vec{w} \cdot \vec{f}$) is highest
PS:
- Effectively the weights vary per class, since each feature has a different influence on each class
3.2 Training
Conditional maximum likelihood estimation (CMLE):
Given instances $x^{(1)} \dots x^{(N)}$ with labels $c^{(1)}\dots c^{(N)}$ $$ \vec{w} = \underset{\vec{w}}{\operatorname{argmax}} \sum_j \log P(c^{(j)}\mid x^{(j)}) $$ Avoid overfitting: regularisation
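A minimal sketch of the scoring and normalisation step (the feature functions and weights are made-up illustrations; real training would fit $\vec{w}$ by CMLE with a gradient-based optimiser):

```python
import math

def maxent_probs(x, classes, features, weights):
    """features: list of functions f_i(x, c) -> float; weights: parallel list w_i."""
    def score(c):
        return sum(w * f(x, c) for w, f in zip(weights, features))
    exp_scores = {c: math.exp(score(c)) for c in classes}
    Z = sum(exp_scores.values())                     # normalise over all classes
    return {c: s / Z for c, s in exp_scores.items()}

# toy binary sentiment example with two indicator features (made up)
features = [lambda x, c: 1.0 if "great" in x and c == "+" else 0.0,
            lambda x, c: 1.0 if "boring" in x and c == "-" else 0.0]
weights = [2.0, 1.5]
print(maxent_probs({"great", "film"}, ["+", "-"], features, weights))
```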
3.3 Pros & Cons
Pros:
- Better performance
- Do not need conditionally independent assumption
Cons:
- Time-consuming if there are a large number of classes and / or thousands of features to extract from each training example
3.4 Why is it called MaxEnt?
- The probabilistic model we are building should follow whatever constraints we impose on it
- But beyond these constraints we should follow Occam’s Razor, make the fewest possible assumptions
4. Summary
4.1 Differences
| | Naive Bayes | MaxEnt |
|---|---|---|
| Model | MLE (maximum likelihood estimation); Bayes’ rule | CMLE (conditional maximum likelihood estimation); multinomial logistic regression |
| Features | Directly observed | Vector representation; features are functions that depend on both observation $\vec{x}$ and class $c$ |
| Assumption | Conditional independence assumption | None |
4.2 Similarity
- Both are easily available in standard ML toolkits
- Both need to choose what features are good to use