Next, web search engines were developed, building on classical information retrieval.
Information retrieval focuses on finding documents from a trusted set.
1.2. Challenges for web search
Who to trust given the massive variety of information sources?
Trustworthy pages may point to each other.
What is the best answer to a keyword search (given a context)?
Pages using the keyword in the same contexts tend to point to each other.
Not all pages are equally important.
Web pages/sites on the Internet can be represented as a graph:
Web pages/sites are nodes on the graph.
Links from one page to another represent the incoming/outgoing connections between nodes.
We can compute the importance of nodes in this graph based on the distribution/intensity of these links.
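As a minimal sketch (with hypothetical page names), pages and links can be stored as a plain adjacency list, from which quantities such as each node's out-degree follow directly:

```python
# Hypothetical pages as nodes; each entry lists a page's outgoing links.
links = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
}

# Out-degree of each node: how many links leave it.
out_degree = {page: len(targets) for page, targets in links.items()}
print(out_degree)
```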
2. PageRank
2.1. Initial formulation
Link as votes:
A page (node) is more important if it has more links.
Incoming or outgoing?
Think of incoming links as votes.
Are all incoming links equal?
Links from important pages count more.
graph TD
B(("B(38.4)")):::bWide --> C(("C(34.3)")):::cWide
C --> B
D(("D(3.9)")):::dWide --> A(("A(3.3)")):::aWide
D --> B
E(("E(8.1)")):::eWide --> D
E --> B
E --> F(("F(3.9)")):::dWide
F --> B
F --> E
G(("G(1.6)")):::wide5 --> B
H(("H(1.6)")):::wide5 --> C
I(("I(1.6)")):::wide5 --> B
I --> E
K(("K(1.6)")):::wide5 --> B
classDef aWide padding:10px
classDef bWide padding:38px
classDef cWide padding:34px
classDef dWide padding:10px
classDef eWide padding:15px
classDef wide5 padding: 5px
style A fill:#FFA500
style B fill:#00FFFF
style C fill:#8B8000
style D fill:#f198b3
style E fill:#ADEBB3
style F fill:#f198b3
Each link’s vote is proportional to the importance of its source page.
If page $i$ with importance $r_i$ has $d_i$ outgoing links, then each outgoing link carries $r_i/d_i$ votes.
The importance of page $j$ is the sum of the votes on its incoming links:
\[r_j = \sum\limits_{i \rightarrow j} \frac{r_i}{d_i}\]
where $d_i$ is the out-degree (the number of outgoing links) of node $i$.
From the above link graph, we seem to have three equations and three unknowns.
In reality, after some algebraic simplification, we only have two independent equations.
Therefore, there is no unique solution at this point.
We need to impose one additional constraint to force uniqueness:
$r_{y} + r_{a} + r_{m} = 1$
Solving this system of equations using Gaussian elimination, we have:
$r_{y} = 2/5$
$r_{a} = 2/5$
$r_{m} = 1/5$
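As a sketch, this small system can also be solved numerically. The link graph itself is not reproduced here, so the matrix below assumes the classic three-page example ($y$ links to itself and to $a$; $a$ links to $y$ and $m$; $m$ links to $a$), which is consistent with the solution above:

```python
import numpy as np

# Assumed column-stochastic matrix for pages (y, a, m):
# entry [j, i] = 1/d_i if page i links to page j.
M = np.array([
    [0.5, 0.5, 0.0],   # votes flowing into y
    [0.5, 0.0, 1.0],   # votes flowing into a
    [0.0, 0.5, 0.0],   # votes flowing into m
])

# Solve (M - I) r = 0 together with the constraint r_y + r_a + r_m = 1.
A = np.vstack([M - np.eye(3), np.ones((1, 3))])
b = np.array([0.0, 0.0, 0.0, 1.0])
r, *_ = np.linalg.lstsq(A, b, rcond=None)
print(r)  # -> approximately [0.4, 0.4, 0.2], i.e. 2/5, 2/5, 1/5
```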
Does not scale to Internet-size!
2.3. Matrix formulation
Using the coefficients from the above flow equations, we can set up a stochastic adjacency matrix M
Matrix M is of size N × N, where N is the number of nodes in the graph (think: the number of web pages on the Internet).
Let page i have $d_{i}$ outgoing links.
If there is an outgoing link from page i to another page j,
then $M_{ji} = 1/d_{i}$
else $M_{ji} = 0$
Stochastic: sum of all values in a column of M is equal to 1.
Adjacency: non-zero value indicates the availability of a link.
Rank vector $r$: Each element in $r$ represents the ranking value of one page.
The order of pages in the rank vector should be the same as the order of pages in rows and columns of M.
Combining the stochastic adjacency matrix M (coefficient matrix) and the rank vector $r$, we can formulate a general equation for calculating the rank $r_j$ of page $j$ with regard to all pages:
\[r_j = \sum\limits_{i=0}^{N-1} M_{ji}r_{i}\]
This equation can be rewritten using the matrix form:
\[r = M \cdot r\]

Visualization
Suppose page i has importance $r_{i}$ and has outgoing links to three other pages, including page j.
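The matrix form $r = M \cdot r$ suggests computing $r$ iteratively by repeated multiplication (power iteration). A minimal NumPy sketch, using an assumed small column-stochastic matrix:

```python
import numpy as np

# Assumed column-stochastic matrix: M[j, i] = 1/d_i if page i links to page j.
M = np.array([
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 1.0],
    [0.0, 0.5, 0.0],
])
assert np.allclose(M.sum(axis=0), 1.0)  # stochastic: each column sums to 1

r = np.full(3, 1 / 3)        # start from a uniform rank vector
for _ in range(500):         # power iteration: r <- M r
    r_next = M @ r
    if np.abs(r_next - r).sum() < 1e-12:
        break                # stop once the ranks no longer change
    r = r_next
print(r)
```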
- Iterating a fixed number of times

```python linenums="1"
%%time
for i in range(10):
    votes = ranks.join(links).flatMap(calculateVotes)
    ranks = votes.reduceByKey(lambda x, y: x + y)
print(ranks.take(3))
```
- Iterating with a stopping condition
```python linenums="1"
%%time
total_diff = 1  # total absolute change in ranks between iterations
while total_diff > 0.01:
    old_ranks = ranks
    votes = ranks.join(links).flatMap(calculateVotes)
    ranks = votes.reduceByKey(lambda x, y: x + y)
    errors = old_ranks.join(ranks).mapValues(lambda v: abs(v[0] - v[1]))
    total_diff = errors.values().sum()
    print(total_diff)
print(ranks.take(3))
```
PageRank only assumes a one-dimensional notion of importance.
HITS: hyperlink-induced topic search
Two flavors of importance:
Authorities: provide information about a topic
Hubs: show where to find out more information about a topic
Example: A typical department at a university maintains a Web page listing all the courses offered by the department, with links to a page for each course.
If you want to know about a certain course, you need the page for that course (an authority).
If you want to find out what courses the department is offering, you need the page with the course list first (a hub).
5.2. Formalizing Hubbiness and Authority
For a collection of pages (enumerated):
Two vectors: $h$ and $a$
The $i^{th}$ component of $h$: the hubbiness of the $i^{th}$ page
The $i^{th}$ component of $a$ gives the authority of the same page.
To compute hubbiness and authority:
Add the authority of successors to estimate hubbiness
Add hubbiness of predecessors to estimate authority.
Avoid unlimited growth through scaling:
Either scale the values of the vectors $h$ and $a$ so that the largest component is 1,
or scale them so that the sum of all components is 1.
Define the link matrix of the Web, $L$, assuming there are n pages on the Web:
$L$ is an n-by-n matrix
$L_{ij} = 1$ if there is a link from page i to page j
$L_{ij} = 0$ if not.
$L^T$ is the transpose of $L$
$L^T_{ij} = 1$ if there is a link from page j to page i
$L^T_{ij} = 0$ if not
5.3. Example
Given a 5-node (5 pages) link graph as follows:
flowchart LR
A((A)) --> B((B))
A --> C((C))
A ---> D((D))
B ---> A
B ---> D
C --> E((E))
D --> B
D --> C
From Section 5.2, for a collection of pages (enumerated):
Two vectors: $h$ and $a$
The $i^{th}$ component of $h$: the hubbiness of the $i^{th}$ page
The $i^{th}$ component of $a$ gives the authority of the same page.
The hubbiness of a page is proportional to the sum of the authorities of its successors.
Explanation: If a hub contains pages of high value (high authority), then the value of that hub (hubbiness) will be high.
Given $\lambda$ is an unknown scaling factor:
\[h=\lambda La\]
The authority of a page is proportional to the sum of the hubbiness of its predecessors.
Explanation: If a page is cited by multiple hubs with high value (high hubbiness), then that page must be of high quality (high authority).
Given $\mu$ is an unknown scaling factor:
\[a=\mu L^{T}h\]
Putting the two equations together:
\[\begin{align} h=\lambda La \\ a=\mu L^{T}h \end{align}\]
Omitting $\lambda$ and $\mu$, you can see the same self-reinforcing structure emerge for $h$ and $a$, similar to how the rank vector $r$ was formulated in PageRank: substituting one equation into the other gives $h = \lambda\mu LL^{T}h$ and $a = \lambda\mu L^{T}La$.
In this case, $h$ and $a$ alternate their roles, with one being used to calculate the other.
5.5. Computational implementation
Construct $L$
Generate $L^T$
Generate $h$ as a vector of all 1’s.
Compute $a = L^{T}h$ then scale so that the largest component is 1
Compute $h = La$ then scale so that the largest component is 1
Repeat the previous two steps until differences of successive values of $h$ and $a$ are small enough.
We demonstrate this procedure using the example from Section 5.3:
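As a minimal NumPy sketch of the same procedure, $L$ can be built from the graph in Section 5.3 (nodes ordered A through E):

```python
import numpy as np

# Link matrix L for the 5-node graph of Section 5.3 (rows/columns: A, B, C, D, E).
# L[i, j] = 1 if page i links to page j.
L = np.array([
    [0, 1, 1, 1, 0],   # A -> B, C, D
    [1, 0, 0, 1, 0],   # B -> A, D
    [0, 0, 0, 0, 1],   # C -> E
    [0, 1, 1, 0, 0],   # D -> B, C
    [0, 0, 0, 0, 0],   # E has no outgoing links
], dtype=float)

h = np.ones(5)                 # start with h as a vector of all 1's
for _ in range(100):
    a = L.T @ h                # authority: sum of predecessors' hubbiness
    a /= a.max()               # scale so the largest component is 1
    h_next = L @ a             # hubbiness: sum of successors' authority
    h_next /= h_next.max()     # scale so the largest component is 1
    if np.abs(h_next - h).max() < 1e-12:
        break                  # stop when successive values barely change
    h = h_next
print("h =", h)
print("a =", a)
```

Note that E ends up with hubbiness 0 (it has no successors), and B and C receive identical authority because they share the same predecessors, A and D.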