1. Semantic Roles

1.1 Commonly used thematic roles

1.2 Issues with thematic roles

  • No universally agreed-upon set of roles
  • Items with the same role may not behave quite the same

To avoid these problems, define roles that are:

  • Specific to individual verbs only (PropBank)
  • Specific to small groups of verbs (FrameNet)

2. Semantic Role Labelling

2.1 PropBank

Frame Roles:

  • Arg0-PAG: loader
  • Arg1-GOL: beast of burden
  • Arg2-PPT: cargo
  • Arg3-MNR: instrument

2.2 FrameNet

==Meanings are relativised to scenes!==

  • Tries to capture relationships among word and phrase meanings by assigning them the same frame
  • $\approx 1000$ frames represent scenarios
  • Frames are explained with textual descriptions and linguistic examples

Resources:

  • FrameNets for several languages
  • Some data annotated with Frame elements from FrameNet
  • SEMAFOR: a frame-semantic parser

Distributional hypothesis:

  • Can infer meaning just by looking at the contexts a word occurs in
  • Meaning IS the contexts a word occurs in
  • Either way, similar contexts imply similar meanings

1. Distributional semantics

1.1 Basic idea

Represent each word $w_i$ as a vector of its contexts

Distributional semantic models are also called vector-space models

1.2 Context

1.2.1 Defining the context

  • usually ignore stopwords
  • usually use a large window around the target word

1.2.2 Weights

Mutual information: $$ \text{PMI}(x,y) = \log_2 \frac{P(x,y)}{P(x)P(y)} $$

  • PMI tells us how much more / less likely the cooccurrence is than if the words were independent
  • Problem:
    • in practice, PMI is computed with counts, so it is over-sensitive to the chance co-occurrence of infrequent words
  • Improvement: use positive PMI (PPMI)
    • change all negative PMI values to 0 (because for infrequent words, not enough data to accurately determine negative PMI values)
    • smoothing in PMI computation (see J & M)
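A minimal sketch of the PPMI weighting above, computed from raw co-occurrence counts; the `(word, context) -> count` dictionary format is just an illustrative choice:

```python
import math
from collections import Counter

def ppmi(cooc):
    """PPMI from a dict mapping (word, context) pairs to co-occurrence counts."""
    total = sum(cooc.values())
    word_count, ctx_count = Counter(), Counter()
    for (w, c), n in cooc.items():
        word_count[w] += n
        ctx_count[c] += n
    scores = {}
    for (w, c), n in cooc.items():
        pmi = math.log2((n / total) / ((word_count[w] / total) * (ctx_count[c] / total)))
        scores[(w, c)] = max(pmi, 0.0)   # PPMI: clip negative PMI values to 0
    return scores
```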

1.3 Compositionality in vector space

Compositionality implies some operator $\oplus$ such that $$ \text{meaning}(w_1w_2) = \text{meaning}(w_1) \oplus \text{meaning}(w_2) $$

2. Measuring similarity

  • Euclidean distance
  • Dot product
  • Normalised dot product (the cosine of the angle between vectors)
  • Other
    • Jaccard measure
    • Dice measure
    • Jensen-Shannon divergence
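For example, the normalised dot product can be computed directly from two context vectors; a minimal sketch, assuming the vectors are plain equal-length lists of weights:

```python
import math

def cosine(u, v):
    """Normalised dot product = cosine of the angle between two context vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```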

3. Evaluation

  • Extrinsic: IR, QA, automatic essay marking, …
  • Intrinsic
    • relatedness judgement
    • word association

3.1 Relatedness judgement

E.g. on a scale of 1-10, how related are the two words?

lemon-truth = 1

lemon-orange = 10

  • Answers depend a lot on how the question is asked

3.2 Word association

Say the first word that comes to mind

Collect the data and estimate the probability of each response

3.3 Dimension reduction: learn a more compact space

  • LSA (Latent Semantic Analysis)
  • Neural network methods

Goal:

  • a machine which can answer questions in natural language
  • has access to a knowledge base (KB)
  • has access to vast quantities of English text

1. Semantics

1.1 Lexical and Sentential

Lexical semantics: the meaning of individual words

Sentential semantics: how word meanings combine (e.g. who did that to whom; when, how, why…)

1.2 Problems

  • Senses: words may have different meanings (need to disambiguate between them)
  • Synonyms: words may have the same meaning
  • Ontology (hyponym & hypernym): words can refer to a subset or superset of the concept referred to by another word (A is a B relationships)
  • Similarity & gradation: words may be related in other ways
  • Inference

PS: some of these problems can be solved with a good ontology (e.g. WordNet)

2. WordNet

WordNet is a hand-built resource containing 117,000 synsets

Relations:

  • hyponym / hypernym (IS-A: chair-furniture)
  • meronym (PART-WHOLE: leg-chair)
  • antonym (OPPOSITES: good-bad)

3. Word sense ambiguity

Words can have multiple senses

A way to define senses: different sense = different translation

3.1 Semantic ambiguity types

  • Polysemy: a word has related meanings
  • Homonymy: a word has unrelated meanings
  • Synonymy: different words can mean the same thing

3.2 Word sense disambiguation (WSD)

Task: Given a sense-ambiguous word, find its sense in a given context

Classifier:

  • Naive Bayes
  • Decision lists
  • Decision trees

3.2.1 Naive Bayes for WSD

3.2.2 Features

  • Directly neighbouring words
  • Any content words in a 50 word window
  • Syntactically related words
  • Syntactic role in the sentence
  • Topic of the text
  • POS tag, surrounding POS tags
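A minimal sketch of turning some of the features above into a dictionary a Naive Bayes (or other) classifier could use; the window size and feature names are illustrative, not from the slides:

```python
def wsd_features(tokens, pos_tags, i, window=5):
    """Features for the ambiguous word at position i (names and window size are illustrative)."""
    feats = {
        "prev_word": tokens[i - 1] if i > 0 else "<s>",
        "next_word": tokens[i + 1] if i + 1 < len(tokens) else "</s>",
        "pos": pos_tags[i],
        "prev_pos": pos_tags[i - 1] if i > 0 else "<s>",
    }
    # bag-of-words features for words in a window around the target
    for w in tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]:
        feats["window_" + w.lower()] = True
    return feats
```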

3.3 Evaluation

Extrinsic: test as part of IR, QA, or MT system

Intrinsic: evaluate classification accuracy or precision / recall against gold-standard senses

Baseline: choose the most frequent sense

3.4 Issues with WSD

  • Not always clear how fine-grained the gold-standard should be
  • Difficult / expensive to annotate corpora with fine-grained senses
  • Classifier must be trained separately for each word
    • hard to learn anything for infrequent or unseen words
    • requires new annotations for each new word
    • motivates unsupervised and semi-supervised methods

4. Semantic classes

Methods: define coarse-grained semantic categories like PERSON, LOCATION, ARTIFACT

  • Named entity recognition (NER): recognising and classifying proper names in text; important for many applications
  • Supersense tagging

1. Parser evaluation

Correct: a predicted constituent is correct if a gold constituent spans the same sentence positions

Harsher measure: also require the constituent labels to match

  • precision

  • recall

  • F-score (balances precision and recall): $$ \frac{2pr}{p+r} $$

2. Lexicalisation

2.1 Introduction

Problem: replacing one word with another with the same POS will never result in a different parsing decision, even though it should

Solution: adding the lexical head of the phrase

2.2 Pros & Cons

Pros:

  • makes the grammar more specific

Cons:

  • leads to huge grammar blowup and very sparse data

2.3 Improvement

  • complex smoothing schemes (similar to N-gram interpolation / backoff)
  • increasing emphasis on automatically learned subcategories

2.4 Lexicalised constituency parse

Direct dependency parsing:

  • Transition-based (shift-reduce)

    Train a classifier as the oracle to predict the next action (Shift, LeftArc, or RightArc), proceeding left-to-right through the sentence

    $O(n)$ time

  • Graph-based

    From the fully connected directed graph of all possible edges, choose the best ones that form a tree

    Edge-factored models: a classifier assigns a nonnegative score to each possible edge; a maximum spanning tree algorithm finds the spanning tree with the highest total score

    $O(n^2)$ time

Head rules: used to determine the lexical head of each constituent (e.g. when converting constituency parses to dependencies)

3. Phrase Structure & Dependencies

In languages with free word order, phrase structure grammars don’t make as much sense

3.1 Pros and Cons

Pros:

  • Sensible framework for free word order languages
  • Identifies syntactic relations directly
  • Dependency pairs / chains can make good features in classifiers, for information extraction
  • Parsers can be very fast

Cons:

  • The assumption of asymmetric binary relations is not always right (e.g. coordination: "dogs and cats")

3.2 Edge labels

Labels: subject, direct object, determiner, adjective modifier, adverbial modifier

3.3 Dependency Paths

For information extraction tasks involving real-world relationships between entities, chains of dependencies can provide good features

Projectivity:

A sentence’s dependency parse is projective if every subtree occupies a contiguous span of the sentence

  • = The dependency parse can be drawn on top of the sentence without any crossing edges

Nonprojectivity:

  • Nonprojectivity is rare in English, but quite common in many languages

4. Dependency Parsing

4.1 CKY

CKY can be adapted, but efficiency is not good: the obvious approach is $O(Gn^5)$; the Eisner algorithm can bring it down to $O(Gn^3)$

4.2 Transition-based: Shift reduce

3 possible actions:

  • LeftArc: assign head-dependent relation between $s_1$ and $s_2$; pop $s_2$
  • RightArc: assign head-dependent relation between $s_2$ and $s_1$; pop $s_1$
  • Shift: move the next word $w_1$ from the buffer onto the top of the stack
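A minimal sketch of the transition loop these actions define, under arc-standard assumptions; `oracle` is a stand-in for the trained classifier, and the configuration format is illustrative:

```python
def shift_reduce_parse(words, oracle):
    """Arc-standard transition loop; `oracle` returns 'SHIFT', 'LEFTARC' or 'RIGHTARC'
    given the current configuration (stack, buffer, arcs)."""
    stack = ["ROOT"]
    buffer = list(words)                    # the next word w1 is buffer[0]
    arcs = []                               # collected (head, dependent) pairs
    while buffer or len(stack) > 1:
        action = oracle(stack, buffer, arcs)
        if action == "SHIFT" and buffer:
            stack.append(buffer.pop(0))             # put w1 on top of the stack
        elif action == "LEFTARC" and len(stack) > 1:
            arcs.append((stack[-1], stack[-2]))     # s1 is the head of s2
            del stack[-2]                           # pop s2
        elif action == "RIGHTARC" and len(stack) > 1:
            arcs.append((stack[-2], stack[-1]))     # s2 is the head of s1
            stack.pop()                             # pop s1
        else:
            break                                   # defensive stop on an illegal action
    return arcs
```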

4.3 Graph-based

4.4 Comparison

5. Choosing a Parser

  • Target representation: constituency or dependency?
  • Efficiency
  • Incrementality: parse the whole sentence at once, or obtain partial left-to-right analyses / expectations
  • Retrainable system or not

Two theories of syntax:

  • Context-free grammar
  • Dependency grammar

1. Context-free grammar

Symbols:

  • Terminals (t): words
  • Non-terminals (NT): phrasal categories (e.g. S, NP, VP, PP) with S being the start symbol

Chomsky Normal Form: only has rules of the form $$ NT \rightarrow NT \ \ NT \qquad \text{or} \qquad NT\rightarrow t $$

2. Syntactic analysis

2.1 Global Ambiguity

Structural ambiguity

POS ambiguity:

Attachment ambiguity: Some sentences have structural ambiguity even without part-of-speech ambiguity

  • Depends on where different phrases attach in the tree

  • Different attachments have different meanings

    I saw the man with the telescope

    She ate the pizza on the floor

    Good boys and girls get presents from Santa

Why we want parse trees:

  • compositional semantics: the meaning of a constituent is a function of the meaning of its children
  • rule-to-rule semantics: that function is determined by the rule which licenses the constituent

Huge problem:

  • correctness problem: need to find the right structure to get the right meaning
  • efficiency problem: searching all possible structures can be very slow; want to use parsing for large-scale language tasks (for example, as used to create Google’s “infoboxes”)

2.2 Local ambiguity

Local ambiguity is also a big problem: multiple analyses for parts of the sentence.

  • “The dog bit the child”: first three words could be NP (but aren’t)
  • Building useless partial structures wastes time
  • Avoiding useless computation is a major issue in parsing

3. Parsing

3.1 Normal Parsing

  • Top-down: depth-first, breadth-first, best-first
  • Bottom-up
  • Mixed strategy (e.g. Earley algorithm)

3.1.1 Recursive Descent Parsing

Idea: top-down, depth-first parsing

e.g. Slide 11 pp. 23-24

3.1.2 Shift-reduce Parsing

Idea: bottom-up, depth-first parsing

e.g. Slide 11 pp. 27

3.1.3 CKY algorithm (Cocke, Kasami, Younger)

Idea: bottom-up, breadth-first parsing

Assumption: grammar in Chomsky Normal Form

Pseudocode:

for len = 1 to n:                           # number of words in constituent
    for i = 0 to (n - len):                 # start position
        j = i + len                         # end position
        if len == 1:
            add word[i] to c[i,j]           # so the lexical (unary) rules below can apply
        for each (A -> B) where c[i,j] has B:
            # process unary rules (in CNF these are the lexical rules A -> word)
            add A to c[i,j] with a pointer to B
        for k = (i+1) to (j-1):
            # process binary rules
            for each (A -> B C) where c[i,k] has B and c[k,j] has C:
                add A to c[i,j] with pointers to B and C

Time complexity: $O(Gn^3)$

e.g. Slide 11 pp.5-11

Problem:

  • Ambiguity is only clear if we go on to reconstruct the parses using our backpointers

Pros & Cons:

Pros:

  • much more efficient than depth-first

Cons:

  • still may compute a lot of unnecessary partial parses
  • requires converting the grammar to CNF

Can be extended to probabilistic CKY

3.1.4 Choose parsing algorithm

==Probabilities ???==

3.2 Statistical Parsing

3.2.1 Treebank grammars

Idea: annotate real sentences with parse trees

3.2.1.1 The Penn Treebank (Treebank PCFG)

PCFG (probabilistic context-free grammar) $$ NT \rightarrow \beta \ \text{ with probability } P(\beta \mid NT) $$

3.2.1.2 Create PCFG

Easiest way: MLE

  • Count all occurrences of $NT \rightarrow \beta$ in treebank
  • Divide by the count of all rules whose left-hand-side is $NT$ to get $P(\beta \mid NT)$

$$ P(NT \rightarrow C_1\, C_2 \dots C_n \mid NT) = \frac{\text{count}(NT \rightarrow C_1\, C_2 \dots C_n)}{\text{count}(NT)} $$

e.g. Slide 12 pp. 21-24
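A minimal sketch of the MLE computation above, assuming the productions have already been read off the treebank trees as `(lhs, rhs)` tuples (an illustrative format):

```python
from collections import Counter, defaultdict

def estimate_pcfg(productions):
    """MLE rule probabilities: count(NT -> beta) / count(NT)."""
    rule_count = Counter(productions)                      # count(NT -> beta)
    lhs_count = Counter(lhs for lhs, _ in productions)     # count(NT)
    probs = defaultdict(dict)
    for (lhs, rhs), n in rule_count.items():
        probs[lhs][rhs] = n / lhs_count[lhs]               # P(beta | NT)
    return probs

# e.g. estimate_pcfg([("S", ("NP", "VP")), ("NP", ("DT", "NN")), ("NP", ("NNP",)), ...])
```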

3.2.2 Probability of a parse

$$ P(t) = \prod_{NT \rightarrow \beta \in t} P(NT \rightarrow \beta) $$

3.2.3 Probabilistic CKY

  • When we find an $NT$ spanning $(i,j)$, store its probability along with its label in cell $(i,j)$
  • If we later find an $NT$ with the same span but higher probability, replace the probability and the backpointers for $NT$ in cell $(i,j)$
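A minimal sketch of this Viterbi-style CKY, assuming a CNF grammar with no unary rules; the nested-dict formats for the lexicon and binary rules are illustrative:

```python
from collections import defaultdict

def pcky(words, lexicon, binary):
    """Probabilistic CKY for a CNF PCFG.
    lexicon: word -> {NT: prob};  binary: (B, C) -> {A: prob}  (illustrative formats)."""
    n = len(words)
    best = defaultdict(dict)   # best[i, j][NT] = prob of the best parse of words[i:j] rooted in NT
    back = {}                  # back[i, j, NT] = (k, B, C) backpointer
    for i, w in enumerate(words):
        best[i, i + 1] = dict(lexicon.get(w, {}))
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for B, pb in best[i, k].items():
                    for C, pc in best[k, j].items():
                        for A, pr in binary.get((B, C), {}).items():
                            p = pr * pb * pc
                            if p > best[i, j].get(A, 0.0):   # keep the higher-probability analysis
                                best[i, j][A] = p
                                back[i, j, A] = (k, B, C)
    return best, back
```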

3.2.4 Other algorithms

  • Inside algorithm
  • Inside-outside algorithm

3.2.5 Best-first probabilistic parsing

Basic idea: use probabilities of subtrees to decide which ones to build up further

How to score constituents:

  • Why not probability: smaller constituents will almost always have higher scores

  • Possible solution:

    • divide by the number of words (but not guaranteed to find the best parse first)
    • A*

1. Annotation

Penn Treebank: 36 POS tags (excluding punctuation)

Factors in Annotation:

  • source data (genre? size? licensing?)
  • annotation scheme (complexity? guidelines?)
  • annotators (expertise? training?)
  • annotation software (graphical interface?)
  • quality control procedures (multiple annotation, adjudication?)

2. Evaluation

2.1 Hypotheses

About existing linguistic objects:

  • Is this text by Shakespeare or Marlowe?

About output of a language system:

  • How well does this language model predict the data?
  • How accurate is this segmenter / tagger / parser?
    • Is this segmenter / tagger / parser better than that one?

About human beings:

  • How reliable is this person’s annotation?
  • To what extent do these two annotators agree? (IAA: inter-annotator agreement)

2.2 Gold Standard Evaluation

  • Gold standards used both for training and for evaluation
  • But testing must be done on unseen data

2.3 Tuning

  • Lidstone (add-$\lambda$) smoothing
  • Choosing features for text classification

Overfitting: the test set is no longer a reliable proxy for new data.

Solution: hold out a second set for tuning, called a development set, and save the test set for the very end

2.4 Cross-validation

Use when the dataset is too small to have a nice train / test or train / dev / test split

$k$-fold cross-validation

Partition the data into $k$ pieces and treat them as mini held-out sets. Each fold is an experiment with a different held-out set, using the rest of the data for training

PS: If tuning the system via cross-validation, still important to have a separate blind test set.
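A minimal sketch of producing the $k$ folds (a simple round-robin partition; shuffling or stratifying the data first is left out):

```python
def k_fold_splits(data, k=5):
    """Yield (train, heldout) pairs, each fold acting once as the held-out set."""
    folds = [data[i::k] for i in range(k)]      # round-robin partition into k pieces
    for i in range(k):
        heldout = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, heldout
```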

2.5 Measurement

$$ \text{precision} = \frac{TP}{TP+FP} \ \ \ \text{recall (true positive rate)} = \frac{TP}{TP+FN}\ \ \ F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision}+\text{recall}} $$

Upper bound: Turing test

  • When using a human Gold Standard, check the agreement of humans against that standard.

Lower bound: performance of baseline model

Confusion matrix

Use Hidden Markov Models to do POS tagging

Notation:

  • Sequence of observations over time (sentence): $ O=o_1\dots o_T $

  • Set of states (tags): $Q=q^1\dots q^N$

    Sequence of states over time: $Q = q_1 \dots q_T$

  • Vocabulary size: $V$

  • Time step: $t$, not a tag

  • Matrix of transition probabilities: $A$

    • $a_{ij}$: the prob of transitioning from state $q^i$ to $q^j$
  • Matrix of output probabilities: $B$

    • $b_i(o_t)$: the prob of emitting $o_t$ from state $q^i$

1. POS tagging

1.1 Introduction

Uses:

  • First step towards syntactic analysis
  • Often useful for semantic analysis
  • Illustrate the use of HMMs

Depends on:

  • The word to be labeled
  • The labels of surrounding words

Tags: Penn Treebank POS tags

Difficulties:

  • Ambiguity
  • Sparse data

1.2 Probabilistic model for tagging (forward algorithm?)

Assumption:

  • Each tag depends only on previous tag (bigram tag model)
  • Words are independent given tags

Finite-state machine: sentences are generated by walking through states in a graph, each state represents a tag

Given:

  • The transition and output probabilities

$$ P(O,Q) = \prod_{t=1}^T P(o_t \mid q_t) P(q_t \mid q_{t-1}) $$

2. HMM

2.1 Introduction

Purpose: find the best tag sequence for an untagged sentence

  • Markov: Markov independence assumption (each tag / state only depends on fixed number of previous tags / states)
  • Hidden: at test time we only see the words / emissions, the tags / states are hidden variables

Elements:

  • a set of states (e.g. tags)
  • a set of output symbols (e.g. words)
  • initial state (e.g. beginning of the sentence)
  • state transition probabilities (e.g. $P(t_i \mid t_{i-1})$)
  • symbol emission probabilities (e.g. $P(w_i \mid t_i)$)

2.2 Model

$$ \underset{Q}{\operatorname{argmax}} P(Q \mid O) = \underset{Q}{\operatorname{argmax}} P(O \mid Q) P(Q) $$

  • Using Bayes’ rule
  • $P(Q) = \prod_t P(q_t \mid q_{t-1})$
  • $P(O \mid Q) = \prod_t P(o_t \mid q_t)$

Joint probability: $$ P(O,Q\mid \lambda) = \prod_{t=1}^T P(o_t \mid q_t) P(q_t \mid q_{t-1}) = \prod_{t=1}^T b_{q_t}(o_t)\,a_{q_{t-1}q_t} $$

  • $\lambda = (A,B)$

2.3 Transition and Output Probabilities (Matrix $A,B$)

  • Transition matrix $A$: $a_{ij} = P(q^j \mid q^i)$
  • Output matrix $B$: $b_i(o)=P(o \mid q^i)$ for output $o$

2.4 Searching: Viterbi algorithm

Decoding: finding the best tag sequence for a sentence is called decoding

Intuition: the best path of length $t$ ending in state $Q$ must include the best path of length $t-1$ to the previous state

Find $\underset{Q}{\operatorname{argmax}} P(Q \mid O) $: $$ v(j,t) = \max_{i=1}^N v(i,t-1) \cdot a_{ij} \cdot b_j(o_t) $$

  • Can also use negative log probabilities and take minimum over sum of costs

Chart: $N \times T$
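A minimal sketch of the Viterbi recursion above, with probabilities stored as nested dicts (an illustrative format; a real tagger would work with log probabilities / costs):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state (tag) sequence for obs, plus its probability."""
    V = [{}]       # V[t][state] = probability of the best path of length t+1 ending in state
    back = [{}]
    for s in states:
        V[0][s] = start_p[s] * emit_p[s].get(obs[0], 0.0)
        back[0][s] = None
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda r: V[t - 1][r] * trans_p[r][s])
            V[t][s] = V[t - 1][prev] * trans_p[prev][s] * emit_p[s].get(obs[t], 0.0)
            back[t][s] = prev
    # follow backpointers from the best final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path)), V[-1][last]
```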

2.5 Training: Baum-Welch (Forward-Backward) algorithm

Compute the best parameters from an unannotated corpus

2.5.1 Expectation-maximisation

  • Initialise parameters $\lambda^{(0)}$
  • At each iteration $k$,
    • E-step: compute expected counts using $\lambda^{(k-1)}$
    • M-step: set $\lambda^{(k)}$ using MLE on the expected counts
  • Repeat until $\lambda$ does not change

PS:

  • Finds a local maximum; not guaranteed to find the global maximum

2.5.2 Baum-Welch

Idea: compute expected counts using forward probabilities and backward probabilities $$ \beta(j,t) = P(o_{t+1}, o_{t+2}, \dots,o_T \mid q_t = q^j, \lambda) $$

PS: EM is much more general and can be used for many latent variable models

2.6 Evaluation: Forward algorithm (sum instead of max): compute the likelihood

$$ \alpha(j,t) = \sum_{i=1}^N \alpha(i,t-1)\cdot a_{ij}\cdot b_j(o_t) $$

  • Effectively a language model: the likelihood accounts not just for the single previous step but for all earlier steps (it sums over all possible state sequences)
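A minimal sketch of the forward recursion, under the same illustrative dict format as the Viterbi sketch above:

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Likelihood P(O | lambda): like Viterbi, but summing over previous states instead of maximising."""
    alpha = [{s: start_p[s] * emit_p[s].get(obs[0], 0.0) for s in states}]
    for t in range(1, len(obs)):
        alpha.append({s: sum(alpha[t - 1][r] * trans_p[r][s] for r in states) * emit_p[s].get(obs[t], 0.0)
                      for s in states})
    return sum(alpha[-1][s] for s in states)
```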

2.7 Three main problems

  1. Evaluation problem
    • Evaluation problem answers the question: what is the probability that a particular sequence of symbols is produced by a particular model?
    • For evaluation we use two algorithms: the forward algorithm or the backward algorithm (DO NOT confuse them with the forward-backward algorithm).
  2. Decoding problem
    • Decoding problem answers the question: Given a sequence of symbols (your observations) and a model, what is the most likely sequence of states that produced the sequence.
    • For decoding we use the Viterbi algorithm.
  3. Training problem
    • Training problem answers the question: Given a model structure and a set of sequences, find the model that best fits the data.
    • For this problem we can use the following 3 algorithms:
      1. MLE (maximum likelihood estimation)
      2. Viterbi training (DO NOT confuse with Viterbi decoding)
      3. Baum Welch = forward-backward algorithm

Introduce two different methods to do text classification: Naive Bayes and Maximum Entropy

1. Introduction

Application:

  • spam detection (binary: spam / not spam)
  • sentiment analysis (binary or multiway)
    • review (pos / neg or 1-5 stars)
    • political argument (pro / con, or pro / con / neutral)
  • topic classification (multiway: sport / finance / travel, etc.)
  • categorise the author
    • native language identification
    • diagnosis of disease
    • gender, dialect, educational background, political orientation

Why not N-gram:

  • sequential relationships are largely irrelevant for many tasks (bag of words assumption)
  • want to include other features (e.g. POS tags) that N-gram models do not include

Two methods:

  • Naive Bayes
  • Maximum Entropy (multinomial logistic regression)

2. Naive Bayes

2.1 Model

$$ \hat{c} = \underset{c\in C}{\operatorname{argmax}} P(c\mid d) = \underset{c\in C}{\operatorname{argmax}} \frac{P(d \mid c)P(c)}{P(d)} = \underset{c\in C}{\operatorname{argmax}}P(d \mid c)P(c) $$

  • Naive Bayes assumption: $P(d \mid c) = P(f_1, f_2,\dots ,f_n \mid c) \approx P(f_1 \mid c)P(f_2\mid c)\dots P(f_n \mid c)$ $$ \ \ \hat{P}(f_i \mid c) = \frac{\text{count}(f_i,c)+\alpha}{\sum_{f \in F}(\text{count}(f,c)+\alpha)} $$ PS: $\alpha$ is added once per feature in the denominator, so the total count increases by $\left |F\right | \cdot \alpha$

    • $F$: the set of possible features
    • $\alpha$: smoothing parameter
    • all features not seen before have $\hat{P}(f_i \mid +) = \frac{\alpha}{68+\alpha \left |F\right |}$
  • $P(c)$: estimated with MLE $$ \hat{P}(c) = \frac{N_c}{N} $$

2.2 Step

  • Test document $d$

  • Calculate $$ P(+ \mid d) \propto P(+) \prod_{i=1}^n P(f_i \mid +) \ \ \ P(-\mid d) \propto P(-) \prod_{i=1}^n P(f_i \mid -) $$

  • Choose the one with larger value

2.3 Choose features

  • Binary value for $f_i$
  • Use subset of $\left | F \right |$ (e.g. ignore stop words)
  • Use more complex features (e.g. bigrams, etc.)

2.4 Deal with very small numbers (underflow): use costs

Find the lowest cost classification $$ \hat{c} = \underset{c\in C}{\operatorname{argmin}} (-\log P(c) + \sum_{i=1}^n - \log P(f_i \mid c)) $$

  • Costs are negative log probabilities
  • Can avoid underflow by turning multiplication to sum
  • Look for the lowest cost overall
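A minimal sketch of cost-based Naive Bayes classification with add-$\alpha$ smoothing; the count-dictionary format is illustrative:

```python
import math

def nb_classify(doc_feats, classes, prior, feat_count, alpha, feature_set):
    """Pick the lowest-cost class. prior[c] = P(c); feat_count[c][f] = count(f, c)."""
    def cost(c):
        denom = sum(feat_count[c].values()) + alpha * len(feature_set)   # alpha added for every feature
        total = -math.log(prior[c])
        for f in doc_feats:
            total += -math.log((feat_count[c].get(f, 0) + alpha) / denom)
        return total
    return min(classes, key=cost)
```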

2.5 Implementation

  • Naive Bayes is a linear classifier since it uses a linear function of the input features
  • As is logistic regression

2.6 Problems, Pros & Cons

Problem:

  • might need domain-specific non-sentiment words

    e.g. memory, CPU for computer product reviews

  • stopwords might be very useful features for some tasks

  • probably better to use too many irrelevant features than not enough relevant features

Pros:

  • Easy to implement
  • Fast to train
  • Do not need much training data compared with other methods (good for small datasets)
  • Usually works reasonably well
  • Can be a baseline method for any classification task

Cons:

  • Assumption is naive!
    • features are not usually independent given the class
    • adding feature types often leads to even stronger correlations between features
    • accuracy can sometimes be OK, but it will be highly overconfident in its decisions

3. Maximum Entropy Model

3.1 Model

$$ P(c \mid \vec{x}) = \frac{1}{Z}\exp(\sum_i w_i f_i(\vec{x},c)) \ \ \ \text{where } Z = \sum_{c'} \exp(\sum_iw_i f_i(\vec{x},c')) $$

  • Multinomial logistic regression, model $P(c \mid d)$ directly instead of using Bayes’ rule
  • Use vector to represent the observed data
  • End up choosing the class for which $\vec{w} \cdot \vec{f}$ is highest

PS:

  • The effective weights differ per class, since each feature has a different influence on each class
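A minimal sketch of the scoring above: compute $\sum_i w_i f_i(\vec{x},c)$ for every class and normalise with a softmax; the feature function returning a name-to-value dict is an illustrative choice:

```python
import math

def maxent_probs(x, classes, weights, features):
    """P(c | x) via softmax over the weighted feature sums."""
    scores = {c: sum(weights.get(name, 0.0) * value
                     for name, value in features(x, c).items())
              for c in classes}
    m = max(scores.values())                      # subtract the max for numerical stability
    exps = {c: math.exp(s - m) for c, s in scores.items()}
    Z = sum(exps.values())
    return {c: e / Z for c, e in exps.items()}
```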

3.2 Training

Conditional maximum likelihood estimation (CMLE):

Given instances $x^{(1)} \dots x^{(N)}$ with labels $c^{(1)}\dots c^{(N)}$: $$ \vec{w} = \underset{\vec{w}}{\operatorname{argmax}} \sum_j \log P(c^{(j)}\mid x^{(j)}) $$

Avoid overfitting: regularisation

3.3 Pros & Cons

Pros:

  • Better performance
  • Does not need the conditional independence assumption

Cons:

  • Time-consuming if there are a large number of classes and / or thousands of features to extract from each training example

3.4 Why is it called MaxEnt?

  • The probabilistic model we are building should follow whatever constraints we impose on it
  • But beyond these constraints, it should follow Occam’s Razor and make the fewest possible assumptions (i.e. have maximum entropy)

4. Summary

4.1 Differences

| | Naive Bayes | MaxEnt |
| --- | --- | --- |
| Model | MLE (maximum likelihood estimation) + Bayes’ rule | CMLE (conditional maximum likelihood estimation) = multinomial logistic regression |
| Features | Directly observed | Vector representation; features are functions that depend on both observation $\vec{x}$ and class $c$ |
| Assumption | Conditional independence assumption | None |

4.2 Similarity

  • Both are easily available in standard ML toolkits
  • Both need to choose what features are good to use