Link Analysis

Link Analysis

1. History

1.1. Web page organization in the past


2. PageRank

2.1. Initial formulation

graph TD
    B(("B(38.4)")):::bWide --> C(("C(34.3)")):::cWide
    C --> B
    D(("D(3.9)")):::dWide --> A(("A(3.3)")):::aWide
    D --> B
    E(("E(8.1)")):::eWide --> D
    E --> B
    E --> F(("F(3.9)")):::dWide
    F --> B
    F --> E
    G(("1.6")):::wide5 --> B
    H(("1.6")):::wide5 --> C
    I(("1.6")):::wide5 --> B
    I --> E
    K(("1.6")):::wide5 --> B
        
    classDef aWide padding:10px
    classDef bWide padding:38px
    classDef cWide padding:34px
    classDef aWide padding:10px
    classDef eWide padding:15px
    classDef wide5 padding: 5px

    style A fill:#FFA500
    style B fill:#00FFFF
    style C fill:#8B8000
    style D fill:#f198b3
    style E fill:#ADEBB3
    style F fill:#f198b3
graph TD
    k((k)) -->|$$r_k/4$$| j((j))
    i((i)) -->|$$r_i/3$$| j((j))
    s1((" ")) --> i
    i --> s2((" "))
    i --> s3((" "))
    s4((" ")) --> k
    k --> s5((" "))
    k --> s6((" "))
    k --> s7((" "))
    j -->|$$r_j/3$$| s8((" "))
    j -->|$$r_j/3$$| s9((" "))
    j -->|$$r_j/3$$| s10((" "))
    s8 --> s81((" "))
    s8 --> s82((" "))
    s8 --> s83((" "))
    s9 --> s91((" "))
    s9 --> s92((" "))
    s9 --> s93((" "))
    s10 --> s101((" "))
    s10 --> s102((" "))
    s10 --> s103((" "))
    style j fill:#FF0000
    style i fill:#ADEBB3
    style k fill:#ADEBB3
    style s1 fill:#FFFFFF00, stroke:#FFFFFF00;
    style s2 fill:#FFFFFF00, stroke:#FFFFFF00;
    style s3 fill:#FFFFFF00, stroke:#FFFFFF00;
    style s4 fill:#FFFFFF00, stroke:#FFFFFF00;
    style s5 fill:#FFFFFF00, stroke:#FFFFFF00;
    style s6 fill:#FFFFFF00, stroke:#FFFFFF00;
    style s7 fill:#FFFFFF00, stroke:#FFFFFF00;
    style s81 fill:#FFFFFF00, stroke:#FFFFFF00;
    style s82 fill:#FFFFFF00, stroke:#FFFFFF00;
    style s83 fill:#FFFFFF00, stroke:#FFFFFF00;
    style s91 fill:#FFFFFF00, stroke:#FFFFFF00;
    style s92 fill:#FFFFFF00, stroke:#FFFFFF00;
    style s93 fill:#FFFFFF00, stroke:#FFFFFF00;
    style s101 fill:#FFFFFF00, stroke:#FFFFFF00;
    style s102 fill:#FFFFFF00, stroke:#FFFFFF00;
    style s103 fill:#FFFFFF00, stroke:#FFFFFF00;

\[r_j = \frac{r_i}{3} + \frac{r_k}{4}\]

2.2. The flow model

flowchart TD
  Y((y)) -->|y/2| Y((y));
  Y -->|y/2| A((a));
  A -->|a/2| Y;
  A -->|a/2| M((m));
  M -->|m| A;
General equation for rank calculation
  • The above example demonstrates the idea of a general equation for calculating rank $r_{j}$ for page j:
\[r_j = \sum\limits_{i\rightarrow j}^{n} \frac{r_i}{d_i}\]
  • $d_i$ is the number of out-degree (outgoing links) of node $i$

2.3. Matrix formulation

\[r_j = \sum\limits_{i=0}^{N-1} M_{ij}r_{j}\] \[r = M \cdot r\]
Visualization
  • Suppose page i has importance $r_{i}$ and has outgoing links to three other pages, including page j.

2.4. Example of matrix formulation

\[\begin{align} \left \{ \begin{aligned} r_{y} &= r_{y}/2 + r_{a}/2\\ r_{a} &= r_{y}/2 + r_{m}\\ r_{m} &= r_{a}/2 \end{aligned} \right. \end{align}\] \[\begin{align} \left \{ \begin{aligned} r_{y} &= r_{y}\times 1/2 + r_{a}\times 1/2 + r_{m}\times 0\\ r_{a} &= r_{y}\times 1/2 + r_{a}\times 0 \ \ \ \ + r_{m}\\ r_{m} &= r_{y}\times 0 \ \ \ \ + r_{a}\times 1/2 + r_{m}\times 0 \end{aligned} \right. \end{align}\] \[\left[ \begin{array}{ccc} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{array} \right]\] \[\left[ \begin{array}{c} y \\ a \\ m \end{array} \right]\] \[\left[ \begin{array}{c} y \\ a \\ m \end{array} \right] = \left[ \begin{array}{ccc} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{array} \right] \left[ \begin{array}{c} y \\ a \\ m \end{array} \right]\]

2.5. Power iteration

2.6. Example of power iteration

Iteration 0
  • Initial rank vector r at iteration 0: $r^{(0)} = [1/3,1/3,1/3]^{T}$
\[\left[ \begin{array}{c} y \\ a \\ m \end{array} \right] = \left[ \begin{array}{c} 1/3 \\ 1/3 \\ 1/3 \end{array} \right]\]
Iteration 1 \[\left[ \begin{array}{c} y \\ a \\ m \end{array} \right] = \left[ \begin{array}{ccc} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{array} \right] \left[ \begin{array}{c} 1/3 \\ 1/3 \\ 1/3 \end{array} \right] = \left[ \begin{array}{c} 1/3 \\ 3/6 \\ 1/6 \end{array} \right]\]
Iteration 2 \[\left[ \begin{array}{c} y \\ a \\ m \end{array} \right] = \left[ \begin{array}{ccc} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{array} \right] \left[ \begin{array}{c} 1/3 \\ 3/6 \\ 1/6 \end{array} \right] = \left[ \begin{array}{c} 5/12 \\ 1/3 \\ 3/12 \end{array} \right]\]
Iteration 3 \[\left[ \begin{array}{c} y \\ a \\ m \end{array} \right] = \left[ \begin{array}{ccc} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{array} \right] \left[ \begin{array}{c} 5/12 \\ 1/3 \\ 3/12 \end{array} \right] = \left[ \begin{array}{c} 9/24 \\ 11/24 \\ 1/6 \end{array} \right]\]

Final iteration \[\left[ \begin{array}{c} y \\ a \\ m \end{array} \right] = \left[ \begin{array}{ccc} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{array} \right] \left[ \begin{array}{c} ... \\ ... \\ ... \end{array} \right] = \left[ \begin{array}{c} 6/15 \\ 6/15 \\ 3/15 \end{array} \right]\]

3. PageRank: the Google formulation

3.1. Scenarios

\[r_j = \sum\limits_{i=0}^{N-1} M_{ij}r_{j}\]

3.2. Does it converge?

graph LR
  A((a)) --> B((b));
  B -->A;
Solution \[\left[ \begin{array}{c} r_a \\ r_b \end{array} \right] = \begin{array}{cccc} 1 & 0 & 1 & 0\\ 0 & 1 & 0 & 1 \end{array}\]

3.3. Does it converge to what we want?

graph LR
  A((a)) --> B((b));
Solution \[\left[ \begin{array}{c} r_a \\ r_b \end{array} \right] = \begin{array}{cccc} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0 \end{array}\]

3.4. The solution: random teleport

flowchart LR
  subgraph ORIGINAL
      direction TB
      Y1((y)) --> Y1((y));
      Y1 --> A1((a));
      A1 --> Y1;
      A1 --> M1((m));
  end
  subgraph RANDOM_TELEPORT
      direction TB
      Y2((y)) --> Y2((y));
      Y2 --> A2((a));
      A2 --> Y2;
      A2 --> M2((m));
      M2 .-> STOP[ ]
  end
  ORIGINAL --> RANDOM_TELEPORT

  style STOP  fill:#FFFFFF00, stroke:#FFFFFF00;
\[\left[ \begin{array}{ccc} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 0 \end{array} \right]\]

to

\[\left[ \begin{array}{ccc} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ \textbf{1/3} & \textbf{1/3} & \textbf{1/3} \end{array} \right]\] \[r = \beta M \cdot r + \left[ \frac{1 - \beta}{N}\right]_{N}\]

where $\left[ \frac{1 - \beta}{N}\right]_{N}$ is a vector with all $N$ entries have the same value $\frac{1-\beta}{N}$.


4. Hands-on: Page Rank in Spark

4.1. Small example data

```python linenums=”1” %%file small_graph.dat y y y a a y a m m a

1
2
3
4
5
6
- Load and view the data

```python linenums="1"
graph_data = sc.textFile("small_graph.dat")
graph_data.take(5)

```python linenums=”1” links = graph_data.map(lambda line: (line.split(“ “)[0], line.split(“ “)[1])) print(links.take(10)) links = links.groupByKey() print(links.take(10)) links = links.mapValues(list) print(links.take(10))

1
2
3
4
5
6
7
8
- Setup initial weight


```python linenums="1"
N = links.count()
ranks = links.map(lambda line: (line[0], 1/N))
ranks.take(3)

```python linenums=”1” def calculateVotes(t): res = [] for item in t[1][1]: count = len(t[1][1]) res.append((item, t[1][0] / count)) return res ############################### calculateVotes((‘y’, (0.3333333333333333, [‘y’, ‘a’])))

1
2
3
4
5
6
- Vote distribution

```python linenums="1"
votes = ranks.join(links)
votes.take(10)

```python linenums=”1” votes = votes.flatMap(calculateVotes) print(votes.take(10))

1
2
3
4
5
6
- Vote tally

```python linenums="1"
ranks = votes.reduceByKey(lambda x,y: x + y)
ranks.take(10)

```python linenums=”1” %%time for i in range(10): votes = ranks.join(links).flatMap(calculateVotes) ranks = votes.reduceByKey(lambda x,y: x + y) print(ranks.take(3))

1
2
3
4
5
6
7
8
9
10
11
12
13
14
- Iterating with a stopping condition

```python linenums="1"
%%time
sum = 1
while sum > 0.01:
    old_ranks = ranks
    votes = ranks.join(links).flatMap(calculateVotes)
    ranks = votes.reduceByKey(lambda x,y: x + y)
    errors = old_ranks.join(ranks).mapValues(lambda v: abs(v[0] - v[1]))
    sum = errors.values().sum()
    print(sum)
    print(ranks.take(3))

4.2. Hollins dataset


5. Hub and Authority

5.1. Intuition

5.2. Formalizing Hubbiness and Authority

5.3. Example

flowchart LR
    A((A)) --> B((B))
    A --> C((C))
    A ---> D((D))
    B ---> A
    B ---> D
    C --> E((E))
    D --> B
    D --> C
\[L = \left[ \begin{array}{ccccc} 0 & 1 & 1 & 1 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{array} \right] \ \ \ \ \ \ \ \ \ L^T = \left[ \begin{array}{ccccc} 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \end{array} \right]\]

5.4. Formal equation

\[\begin{align} h=\lambda La \\ a=\mu L^{T}h \end{align}\]

5.5. Computation implementation

Iteration 0
  • Initialize $h$:
\[h= \left[ \begin{array}{c} 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{array} \right]\]
  • Compute $a = L^{T}h$
\[a = \left[ \begin{array}{ccccc} 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \end{array} \right] \left[ \begin{array}{c} 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{array} \right] = \left[ \begin{array}{c} 1 \\ 2 \\ 2 \\ 2 \\ 1 \end{array} \right]\]
  • Scale so that the largest component of $a$ is 1
    • Since max value in $a$ is 2, we divide all values in $a$ by 2.
\[a= \left[ \begin{array}{c} 1/2 \\ 1 \\ 1 \\ 1 \\ 1/2 \end{array} \right]\]
  • Compute $h = La$
\[h = \left[ \begin{array}{ccccc} 0 & 1 & 1 & 1 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{array} \right] \left[ \begin{array}{c} 1/2 \\ 1 \\ 1 \\ 1 \\ 1/2 \end{array} \right] = \left[ \begin{array}{c} 3 \\ 3/2 \\ 1/2 \\ 2 \\ 0 \end{array} \right]\]
  • Scale so that the largest component of $h$ is 1
    • Since max value in $h$ is 3, we divide all values in $h$ by 3.
\[h= \left[ \begin{array}{c} 1 \\ 1/2 \\ 1/6 \\ 2/3 \\ 0 \end{array} \right]\]
Iteration 1
  • This is $h$ value from previous iteration:
\[h= \left[ \begin{array}{c} 1 \\ 1/2 \\ 1/6 \\ 2/3 \\ 0 \end{array} \right]\]
  • Compute $a = L^{T}h$
\[a = \left[ \begin{array}{ccccc} 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \end{array} \right] \left[ \begin{array}{c} 1 \\ 1/2 \\ 1/6 \\ 2/3 \\ 0 \end{array} \right] = \left[ \begin{array}{c} 1/2 \\ 5/3 \\ 5/3 \\ 3/2 \\ 1/6 \end{array} \right]\]
  • Scale so that the largest component of $a$ is 1
    • Since max value in $a$ is $5/3$, we divide all values in $a$ by $5/3$.
\[a= \left[ \begin{array}{c} 3/10 \\ 1 \\ 1 \\ 9/10 \\ 1/10 \end{array} \right]\]
  • Compute $h = La$
\[h = \left[ \begin{array}{ccccc} 0 & 1 & 1 & 1 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{array} \right] \left[ \begin{array}{c} 3/10 \\ 1 \\ 1 \\ 9/10 \\ 1/10 \end{array} \right] = \left[ \begin{array}{c} 29/10 \\ 6/5 \\ 1/10 \\ 2 \\ 0 \end{array} \right]\]
  • Scale so that the largest component of $h$ is 1
    • Since max value in $h$ is $29/10$, we divide all values in $h$ by $29/10$.
\[a= \left[ \begin{array}{c} 1 \\ 12/29 \\ 1/29 \\ 20/29 \\ 0 \end{array} \right]\]
\[\begin{align} h = [1,0.3583,0,0.7165,0] \\ a = [0.2087,1,1,0.7913,0] \end{align}\]

6. Hands-on: HITS

6.1. Example data

python linenums="1" %%file small_web.dat A B A D A C B A B D C E D C D B

6.2. Redo Hollins data