Vocabulary
: $V = \{ w_1, w_2, ..., w_N \}$

Query
: $q = q_1 q_2 ... q_m$, $q_i \in V$

Document
: $d_i = d_{i1} d_{i2} ... d_{im_i}$, $d_{ij} \in V$

Collection
: $C = \{ d_1, d_2, ..., d_M \}$

Set of relevant documents
: $R(q) \subseteq C$

Task
: Compute $R'(q)$, an approximation of $R(q)$
- Document Selection
  - Absolute relevance
  - Binary classification
- Document Ranking
  - $R'(q) = \{ d \in C \mid f(d, q) > \theta \}$ where $f(d, q)$ is a relevance measure function and $\theta$ is a cutoff
  - Relative relevance
  - Ranking is preferred as all relevant documents are not equally relevant
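A minimal sketch of the two strategies in Python, assuming a hypothetical real-valued scoring function `score(d, q)` (not defined in the notes):

```python
def select(collection, q, score, theta):
    """Document selection: keep every document whose score clears the cutoff."""
    return [d for d in collection if score(d, q) > theta]

def rank(collection, q, score):
    """Document ranking: order documents by decreasing relevance score
    and let the user decide where to stop reading."""
    return sorted(collection, key=lambda d: score(d, q), reverse=True)
```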
- Similarity-Based
  - VSM (Vector Space Model)
- Probabilistic
  - Language Model
  - Divergence-from-Randomness Model
- Probabilistic Inference
- Axiomatic Model
Term Frequency (TF)
: Denoted by $c(w, d)$, the frequency count of word $w$ in document $d$

Document length
: Denoted by $|d|$

Document Frequency (DF)
: Denoted by $df(w)$, the count of documents in which the word $w$ is present

Note: these metrics are measured after the initial preprocessing of the document/web-page, such as stemming, removal of stopwords, etc.
It uses vector representations of the query and the doc to determine their similarity. It assumes an N-dimensional space, where N is the size of the vocabulary.
Query
: $q = (x_1, x_2, ..., x_N)$ where $x_i \in \mathbb{R}$ is the query term weight

Doc
: $d = (y_1, y_2, ..., y_N)$ where $y_i \in \mathbb{R}$ is the doc term weight

- Bit Vector
  - $x_i, y_i \in \{ 0, 1 \}$
    - 1: word $w_i$ is present
    - 0: word $w_i$ is absent
- Dot Product
  - $f(q, d) = q \cdot d = \sum_{i=1}^{N}{x_iy_i}$
  - With bit vectors, $f(q, d)$ is basically equal to the number of distinct query words matched in $d$
  - An improvement is to weight by term frequency, i.e. $x_i = c(w_i, q)$ and $y_i = c(w_i, d)$
- Inverse Document Frequency (IDF)
  - $IDF(w) = log\frac{M+1}{k}$ where $M$ is the total number of docs in the collection and $k = df(w)$
  - $y_i = c(w_i, d) * IDF(w_i)$
  - The idea is to penalize frequently occurring terms such as the, a, about, which do not convey any special meaning about the document
- Where are we?
  $$f(q, d) = \sum_{i=1}^{N}{x_iy_i} = \sum_{w \in q \cap d}{c(w, q)c(w, d)log{\frac{M+1}{df(w)}}}$$
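A minimal sketch of this scoring function, assuming queries and documents are already preprocessed into token lists and `df` is a dictionary of document frequencies (names are my own, not from the notes):

```python
from collections import Counter
from math import log

def tfidf_score(query_tokens, doc_tokens, df, M):
    """f(q, d) = sum over matched words of c(w, q) * c(w, d) * log((M + 1) / df(w))."""
    cq, cd = Counter(query_tokens), Counter(doc_tokens)
    return sum(cq[w] * cd[w] * log((M + 1) / df[w])
               for w in cq if w in cd)
```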
- TF Transformation: BM25
  - The idea is to set an asymptotic upper limit on Term Frequency
  - $y=\frac{(k+1)x}{x + k}$ where $k$ is a constant and $x = c(w, d)$
  - $k = 0$ represents the special case of Bit Vectors
  - With a very large $k$, the function simulates $y=c(w,d)$
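A quick numerical check of the saturation behaviour, using an illustrative value $k = 1.2$ (not a value from the notes):

$$y(1) = \frac{2.2 \cdot 1}{1 + 1.2} = 1, \quad y(10) = \frac{2.2 \cdot 10}{10 + 1.2} \approx 1.96, \quad y(100) = \frac{2.2 \cdot 100}{100 + 1.2} \approx 2.17, \quad \lim_{x \to \infty}{\frac{(k+1)x}{x+k}} = k + 1 = 2.2$$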
- Document Length Normalization
  - The idea is to penalize long documents, as they have a higher chance of matching any query
- Pivoted Length Normalization VSM [Singhal et al. 96]
  $$f(q, d) = \sum_{w \in q \cap d}{c(w, q)\frac{ln(1+ln(1+c(w,d)))}{1-b+b\frac{|d|}{avdl}}log{\frac{M+1}{df(w)}}}$$
  $$b \in [0, 1]$$
- BM25/Okapi [Robertson & Walker 94]
  $$f(q, d) = \sum_{w \in q \cap d}{c(w, q)\frac{(k + 1)c(w, d)}{c(w, d) + k(1-b+b\frac{|d|}{avdl})}log{\frac{M+1}{df(w)}}}$$
  $$k \in [0, \infty)$$

Here, $avdl$ is the average document length in the collection.
- BM25F: a BM25 variant for structured documents with multiple fields (title, body, anchor text, etc.)
- BM25+: a BM25 variant that avoids over-penalizing very long documents
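A sketch of the BM25/Okapi ranking function above in Python; the default values of `k` and `b` and the dictionary-based inputs are assumptions for illustration:

```python
from collections import Counter
from math import log

def bm25_score(query_tokens, doc_tokens, df, M, avdl, k=1.2, b=0.75):
    """BM25/Okapi: TF saturation plus pivoted document length normalization."""
    cq, cd = Counter(query_tokens), Counter(doc_tokens)
    dl_norm = 1 - b + b * len(doc_tokens) / avdl        # pivoted |d| / avdl
    score = 0.0
    for w in cq:
        if w not in cd:
            continue
        tf = (k + 1) * cd[w] / (cd[w] + k * dl_norm)    # saturated term frequency
        idf = log((M + 1) / df[w])                      # IDF
        score += cq[w] * tf * idf
    return score
```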
- Transform everything to lowercase
- Remove stopwords

Stemming
: Mapping similar words to the same root form, e.g. computer, computation and computing should all map to compute
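A minimal preprocessing sketch along these lines; the stopword set is a tiny illustrative subset and `stem` is only a stand-in for a real stemmer such as Porter's:

```python
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in"}  # illustrative subset only

def stem(word):
    # Placeholder: a real system would use e.g. the Porter stemmer.
    for suffix in ("ation", "ing", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = text.lower().split()                 # lowercase + naive tokenization
    return [stem(t) for t in tokens if t not in STOPWORDS]
```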
- Converting documents to data structures that enable fast search
- Inverted index is the dominating method
  - Dictionary
    - Modest size
    - In-memory
  - Postings
    - Huge
    - Secondary memory
    - Compression is desirable
This tells us that the most frequent words (those at the lowest ranks), which have huge postings lists, may be dropped altogether, as they do not meaningfully contribute to the ranking.
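A minimal sketch of an in-memory inverted index, assuming documents are already preprocessed into token lists; in a real system the postings lists would live in secondary memory and be compressed:

```python
from collections import defaultdict, Counter

def build_inverted_index(docs):
    """docs: {doc_id: [token, ...]}.
    Returns a dictionary mapping term -> postings list of (doc_id, term_frequency)."""
    index = defaultdict(list)
    for doc_id, tokens in docs.items():
        for term, tf in Counter(tokens).items():
            index[term].append((doc_id, tf))
    return index

index = build_inverted_index({1: ["text", "retrieval"], 2: ["text", "mining"]})
# index["text"] == [(1, 1), (2, 1)]
```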
- Effectiveness/Accuracy
- Efficiency
- Usability
|            | Retrieved | Not Retrieved |
| ---------- | --------- | ------------- |
| Relevant   | a         | b             |
| Irrelevant | c         | d             |
Ideally, both precision $\frac{a}{a+c}$ and recall $\frac{a}{a+b}$ would be $1$; in practice there is a trade-off between the two.
|      | P   | R    |
| ---- | --- | ---- |
| D1+  | 1/1 | 1/10 |
| D2+  | 2/2 | 2/10 |
| D3-  |     |      |
| D4-  |     |      |
| D5+  | 3/5 | 3/10 |
| D6-  |     |      |
| D7-  |     |      |
| D8+  | 4/8 | 4/10 |
| D9-  |     |      |
| D10- |     |      |
The table above shows the P-R measures for a TR system's ranked list for one query. Relevant documents are marked with a + and irrelevant ones with a -; notice that only the entries with a relevant document contribute points to the PR curve. For the rest of the entries in the list (which goes on beyond rank 10), precision may be assumed to be 0.
Fig. PR Curve
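A small sketch that reproduces the precision/recall points in the table and computes average precision; the 0/1 relevance pattern follows the +/- marks above, and 10 total relevant documents is the same assumption used for recall:

```python
def pr_points(relevance, total_relevant):
    """Yield (precision, recall) at each rank where a relevant document appears."""
    hits = 0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            yield hits / rank, hits / total_relevant

relevance = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]          # D1..D10 from the table
points = list(pr_points(relevance, total_relevant=10))
# [(1.0, 0.1), (1.0, 0.2), (0.6, 0.3), (0.5, 0.4)]
avg_precision = sum(p for p, _ in points) / 10       # unretrieved relevant docs count as 0
```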
MAP (Mean Average Precision)
: Arithmetic mean of average precision over a set of queries.

gMAP (Geometric Mean Average Precision)
: Geometric mean of average precision over a set of queries.
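For reference, with average precisions $AP(q_1), ..., AP(q_n)$ over $n$ queries, these two measures are:

$$MAP = \frac{1}{n}\sum_{i=1}^{n}{AP(q_i)}, \qquad gMAP = \left(\prod_{i=1}^{n}{AP(q_i)}\right)^{\frac{1}{n}}$$

gMAP rewards improvements on poorly performing (low-AP) queries more than MAP does.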
|    | Gain | Cumulative Gain | Discounted Cumulative Gain |
| -- | ---- | --------------- | -------------------------- |
| D1 | 3    | 3               | 3                          |
| D2 | 2    | 3+2             | 3+2/log2                   |
| D3 | 1    | 3+2+1           | 3+2/log2+1/log3            |
| D4 | 1    | 3+2+1+1         | 3+2/log2+1/log3+1/log4     |
Assuming there are 9 documents rated 3 in the collection referred to by the table above, the ideal ranking would place those documents first; the DCG of this ideal ranking is what the DCG above is normalized by to obtain nDCG.
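A worked example under these assumptions (taking the logs in the table as base 2 and cutting off at rank 4):

$$DCG@4 = 3 + \frac{2}{\log_2 2} + \frac{1}{\log_2 3} + \frac{1}{\log_2 4} \approx 6.13$$
$$IDCG@4 = 3 + \frac{3}{\log_2 2} + \frac{3}{\log_2 3} + \frac{3}{\log_2 4} \approx 9.39$$
$$nDCG@4 = \frac{DCG@4}{IDCG@4} \approx 0.65$$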
| Query   | Sys A | Sys B | Sign Test | Wilcoxon |
| ------- | ----- | ----- | --------- | -------- |
| 1       | 0.02  | 0.76  | +         | +0.74    |
| 2       | 0.39  | 0.07  | -         | -0.32    |
| 3       | 0.16  | 0.37  | +         | +0.21    |
| Average | 0.19  | 0.4   | p=1.0     | p=0.63   |
We cannot afford to judge all documents, so we can combine the top-k documents returned by different retrieval strategies and judge only those. The rest can be given a default relevance value.
The query likelihood ranking function scores a document d by the probability of a user posing the query q, given that they wish to retrieve d, i.e. $p(q \mid d)$.
- Each word is generated independently
  - $p(w_1 w_2 ...) = p(w_1)p(w_2)...$
- The probabilities of a language model may be estimated from different contexts, such as general English text, a few computer science papers, or a food nutrition paper
The figure above illustrates how LMs may be used for word association, for example associating software with computer.
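Returning to the ranking function: a minimal sketch of the query likelihood under an unsmoothed (maximum-likelihood) unigram document language model, i.e. $p(w \mid d) = c(w, d)/|d|$. Any query word missing from the document drives the product to zero, which is what the smoothing discussed next addresses.

```python
from collections import Counter

def query_likelihood_ml(query_tokens, doc_tokens):
    """p(q|d) under an unsmoothed (maximum-likelihood) unigram language model."""
    cd, dlen = Counter(doc_tokens), len(doc_tokens)
    prob = 1.0
    for w in query_tokens:
        prob *= cd[w] / dlen          # zero if w never occurs in d
    return prob
```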
In a smoothed distribution, a word that does not appear in the document is still assigned a small non-zero probability, proportional to its probability under a collection (background) language model. Rewriting the query likelihood with the smoothed estimates gives a final form of the equation that enables efficient computation. The last term of that form may be ignored for ranking as it is independent of the document. Within the remaining sum over matched words, the probability of the seen word and the division by the background probability act as TF and IDF weighting respectively, while the term involving the smoothing coefficient acts as document length normalization.
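For reference, the decomposition described above is commonly written as follows (the symbols are mine, not copied from the notes' figure): with $p_{seen}(w \mid d)$ the smoothed probability of a word seen in $d$, $\alpha_d$ the smoothing coefficient, and $p(w \mid C)$ the collection language model,

$$log\,p(q \mid d) = \sum_{w \in q \cap d}{c(w, q)\,log\frac{p_{seen}(w \mid d)}{\alpha_d\, p(w \mid C)}} + |q|\,log\,\alpha_d + \sum_{w \in q}{c(w, q)\,log\,p(w \mid C)}$$

The last sum is independent of the document and can be dropped for ranking; inside the first sum, $p_{seen}(w \mid d)$ grows with the term count (TF) while dividing by $p(w \mid C)$ discounts common words (IDF), and $|q|\,log\,\alpha_d$ acts as document length normalization (e.g. with Dirichlet prior smoothing, $\alpha_d = \frac{\mu}{|d| + \mu}$).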
The feedback is used to update the query and get better results.
- Relevance
  - Explicit feedback from users
- Pseudo/Blind/Automatic
  - Assume the top-10 docs to be relevant
- Implicit
  - Track the user's activity, e.g. clicks
Here, the query vector is moved towards the centroid of the relevant documents and away from the centroid of the irrelevant documents. Over-fitting must be avoided by keeping a relatively high weight on the original query. Rocchio feedback may be used with relevance and pseudo feedback methods.
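A minimal sketch of Rocchio feedback over term-weight vectors represented as dicts; the values of `alpha`, `beta` and `gamma` are illustrative defaults, chosen so the original query keeps a high weight:

```python
def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """query: {term: weight}; relevant/irrelevant: lists of document vectors.
    Moves the query towards the centroid of relevant docs and away from
    the centroid of irrelevant docs."""
    terms = set(query)
    for d in relevant + irrelevant:
        terms |= set(d)

    def centroid(docs, term):
        return sum(d.get(term, 0.0) for d in docs) / len(docs) if docs else 0.0

    new_query = {}
    for t in terms:
        w = (alpha * query.get(t, 0.0)
             + beta * centroid(relevant, t)
             - gamma * centroid(irrelevant, t))
        new_query[t] = max(w, 0.0)    # negative weights are usually clipped to zero
    return new_query
```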
- A user randomly jumps to another page with prob. $\alpha$
- The user randomly picks a link to follow with prob. $(1 - \alpha)$
The figure above represents the links within a collection of 4 documents. The transition matrix for the same is given below.
We can iterate this matrix multiplication until convergence, setting the initial probability of every page to a uniform value.
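A minimal power-iteration sketch of this process; the damping value and the graph passed in are illustrative, not the 4-document example from the figure:

```python
def pagerank(links, alpha=0.15, iters=100):
    """links: {page: [pages it links to]}. Returns the stationary distribution of
    'follow a random link with prob. 1 - alpha, jump anywhere with prob. alpha'."""
    pages = list(links)
    n = len(pages)
    p = {page: 1.0 / n for page in pages}            # uniform starting vector
    for _ in range(iters):
        new_p = {page: alpha / n for page in pages}  # random-jump mass
        for page, outlinks in links.items():
            targets = outlinks if outlinks else pages  # dangling page: jump anywhere
            share = (1 - alpha) * p[page] / len(targets)
            for t in targets:
                new_p[t] += share
        p = new_p
    return p

scores = pagerank({"d1": ["d2", "d3"], "d2": ["d3"], "d3": ["d1"], "d4": ["d3"]})
```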
The idea is to give an authority and a hub score to each page. A page with many in-links from hubs is considered to be a page with vital information (an authority), whereas a page with many out-links to prominent authorities is called a hub.
The procedure is to iterate and normalize until convergence. For normalization, the authority and hub scores are rescaled after each iteration, for example so that their squared values sum to one.
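A compact sketch of the HITS iteration; normalizing so that the squared scores sum to one is one common convention and only an assumption here:

```python
def hits(links, iters=50):
    """links: {page: [pages it links to]}. Returns (authority, hub) score dicts."""
    pages = list(links)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iters):
        # authority score: sum of hub scores of pages linking in
        auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
        # hub score: sum of authority scores of pages linked out to
        hub = {p: sum(auth[t] for t in links[p]) for p in pages}
        # normalize so squared scores sum to 1
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for p in scores:
                scores[p] /= norm
    return auth, hub
```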
So far, we have seen many algorithms, whose scores may be combined as features in simple machine-learning models such as logistic regression.
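A sketch of such a combination with scikit-learn's logistic regression; the feature values (BM25 score, PageRank) and relevance labels below are made up purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [BM25 score, PageRank] for one (query, document) pair; labels: 1 = relevant.
X = np.array([[12.3, 0.04], [8.1, 0.01], [2.4, 0.07], [0.9, 0.02]])
y = np.array([1, 1, 0, 0])

model = LogisticRegression().fit(X, y)
relevance_prob = model.predict_proba(X)[:, 1]   # use as the combined ranking score
```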