Link Analysis

Link Analysis

1. History

1.1. Web page organization in the past


2. PageRank

2.1. Initial formulation

graph TD
    B(("B(38.4)")):::bWide --> C(("C(34.3)")):::cWide
    C --> B
    D(("D(3.9)")):::dWide --> A(("A(3.3)")):::aWide
    D --> B
    E(("E(8.1)")):::eWide --> D
    E --> B
    E --> F(("F(3.9)")):::dWide
    F --> B
    F --> E
    G(("1.6")):::wide5 --> B
    H(("1.6")):::wide5 --> C
    I(("1.6")):::wide5 --> B
    I --> E
    K(("1.6")):::wide5 --> B
        
    classDef aWide padding:10px
    classDef bWide padding:38px
    classDef cWide padding:34px
    classDef aWide padding:10px
    classDef eWide padding:15px
    classDef wide5 padding: 5px

    style A fill:#FFA500
    style B fill:#00FFFF
    style C fill:#8B8000
    style D fill:#f198b3
    style E fill:#ADEBB3
    style F fill:#f198b3
graph TD
    k((k)) -->|$$r_k/4$$| j((j))
    i((i)) -->|$$r_i/3$$| j((j))
    s1((" ")) --> i
    i --> s2((" "))
    i --> s3((" "))
    s4((" ")) --> k
    k --> s5((" "))
    k --> s6((" "))
    k --> s7((" "))
    j -->|$$r_j/3$$| s8((" "))
    j -->|$$r_j/3$$| s9((" "))
    j -->|$$r_j/3$$| s10((" "))
    s8 --> s81((" "))
    s8 --> s82((" "))
    s8 --> s83((" "))
    s9 --> s91((" "))
    s9 --> s92((" "))
    s9 --> s93((" "))
    s10 --> s101((" "))
    s10 --> s102((" "))
    s10 --> s103((" "))
    style j fill:#FF0000
    style i fill:#ADEBB3
    style k fill:#ADEBB3
    style s1 fill:#FFFFFF00, stroke:#FFFFFF00;
    style s2 fill:#FFFFFF00, stroke:#FFFFFF00;
    style s3 fill:#FFFFFF00, stroke:#FFFFFF00;
    style s4 fill:#FFFFFF00, stroke:#FFFFFF00;
    style s5 fill:#FFFFFF00, stroke:#FFFFFF00;
    style s6 fill:#FFFFFF00, stroke:#FFFFFF00;
    style s7 fill:#FFFFFF00, stroke:#FFFFFF00;
    style s81 fill:#FFFFFF00, stroke:#FFFFFF00;
    style s82 fill:#FFFFFF00, stroke:#FFFFFF00;
    style s83 fill:#FFFFFF00, stroke:#FFFFFF00;
    style s91 fill:#FFFFFF00, stroke:#FFFFFF00;
    style s92 fill:#FFFFFF00, stroke:#FFFFFF00;
    style s93 fill:#FFFFFF00, stroke:#FFFFFF00;
    style s101 fill:#FFFFFF00, stroke:#FFFFFF00;
    style s102 fill:#FFFFFF00, stroke:#FFFFFF00;
    style s103 fill:#FFFFFF00, stroke:#FFFFFF00;

\[r_j = \frac{r_i}{3} + \frac{r_k}{4}\]

2.2. The flow model

flowchart TD
  Y((y)) -->|y/2| Y((y));
  Y -->|y/2| A((a));
  A -->|a/2| Y;
  A -->|a/2| M((m));
  M -->|m| A;

???tip “General equation for rank calculation” - The above example demonstrates the idea of a general equation for calculating rank $r_{j}$ for page j:

1
2
3
$$r_j = \sum\limits_{i\rightarrow j}^{n} \frac{r_i}{d_i}$$

- $d_i$ is the number of out-degree (outgoing links) of node $i$

2.3. Matrix formulation

\[r_j = \sum\limits_{i=0}^{N-1} M_{ij}r_{j}\] \[r = M \cdot r\]

???tip “Visualization” - Suppose page i has importance $r_{i}$ and has outgoing links to three other pages, including page j.

1
![](fig/05-pagerank/09.png)

2.4. Example of matrix formulation

\[\begin{align} \left \{ \begin{aligned} r_{y} &= r_{y}/2 + r_{a}/2\\ r_{a} &= r_{y}/2 + r_{m}\\ r_{m} &= r_{a}/2 \end{aligned} \right. \end{align}\] \[\begin{align} \left \{ \begin{aligned} r_{y} &= r_{y}\times 1/2 + r_{a}\times 1/2 + r_{m}\times 0\\ r_{a} &= r_{y}\times 1/2 + r_{a}\times 0 \ \ \ \ + r_{m}\\ r_{m} &= r_{y}\times 0 \ \ \ \ + r_{a}\times 1/2 + r_{m}\times 0 \end{aligned} \right. \end{align}\] \[\left[ \begin{array}{ccc} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{array} \right]\] \[\left[ \begin{array}{c} y \\ a \\ m \end{array} \right]\] \[\left[ \begin{array}{c} y \\ a \\ m \end{array} \right] = \left[ \begin{array}{ccc} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{array} \right] \left[ \begin{array}{c} y \\ a \\ m \end{array} \right]\]

2.5. Power iteration

2.6. Example of power iteration

???example “Iteration 0” - Initial rank vector r at iteration 0: $r^{(0)} = [1/3,1/3,1/3]^{T}$

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
$$
\left[
\begin{array}{c}
y \\
a \\
m
\end{array}
\right]
=
\left[
\begin{array}{c}
1/3 \\
1/3 \\
1/3
\end{array}
\right]
$$

???example “Iteration 1”

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
$$
\left[
\begin{array}{c}
y \\
a \\
m
\end{array}
\right]
=
\left[
\begin{array}{ccc}
1/2 & 1/2 & 0 \\
1/2 & 0 & 1 \\
0 & 1/2 & 0
\end{array}
\right]
\left[
\begin{array}{c}
1/3 \\
1/3 \\
1/3
\end{array}
\right]
=
\left[
\begin{array}{c}
1/3 \\
3/6 \\
1/6
\end{array}
\right]
$$

???example “Iteration 2”

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
$$
\left[
\begin{array}{c}
y \\
a \\
m
\end{array}
\right]
=
\left[
\begin{array}{ccc}
1/2 & 1/2 & 0 \\
1/2 & 0 & 1 \\
0 & 1/2 & 0
\end{array}
\right]
\left[
\begin{array}{c}
1/3 \\
3/6 \\
1/6
\end{array}
\right]
=
\left[
\begin{array}{c}
5/12 \\
1/3 \\
3/12
\end{array}
\right]
$$

???example “Iteration 3”

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
$$
\left[
\begin{array}{c}
y \\
a \\
m
\end{array}
\right]
=
\left[
\begin{array}{ccc}
1/2 & 1/2 & 0 \\
1/2 & 0 & 1 \\
0 & 1/2 & 0
\end{array}
\right]
\left[
\begin{array}{c}
5/12 \\
1/3 \\
3/12
\end{array}
\right]
=
\left[
\begin{array}{c}
9/24 \\
11/24 \\
1/6
\end{array}
\right]
$$

???example “Final iteration”

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
$$
\left[
\begin{array}{c}
y \\
a \\
m
\end{array}
\right]
=
\left[
\begin{array}{ccc}
1/2 & 1/2 & 0 \\
1/2 & 0 & 1 \\
0 & 1/2 & 0
\end{array}
\right]
\left[
\begin{array}{c}
... \\
... \\
...
\end{array}
\right]
=
\left[
\begin{array}{c}
6/15 \\
6/15 \\
3/15
\end{array}
\right]
$$

3. PageRank: the Google formulation

3.1. Scenarios

\[r_j = \sum\limits_{i=0}^{N-1} M_{ij}r_{j}\]

3.2. Does it converge?

graph LR
  A((a)) --> B((b));
  B -->A;

???failure “Solution”

1
2
3
4
5
6
7
8
9
10
11
12
13
$$
\left[
\begin{array}{c}
r_a \\
r_b
\end{array}
\right]
=
\begin{array}{cccc}
1 & 0 & 1 & 0\\
0 & 1 & 0 & 1
\end{array}
$$

3.3. Does it converge to what we want?

graph LR
  A((a)) --> B((b));

???failure “Solution”

1
2
3
4
5
6
7
8
9
10
11
12
13
$$
\left[
\begin{array}{c}
r_a \\
r_b
\end{array}
\right]
=
\begin{array}{cccc}
1 & 0 & 0 & 0\\
0 & 1 & 0 & 0
\end{array}
$$

3.4. The solution: random teleport

flowchart LR
  subgraph ORIGINAL
      direction TB
      Y1((y)) --> Y1((y));
      Y1 --> A1((a));
      A1 --> Y1;
      A1 --> M1((m));
  end
  subgraph RANDOM_TELEPORT
      direction TB
      Y2((y)) --> Y2((y));
      Y2 --> A2((a));
      A2 --> Y2;
      A2 --> M2((m));
      M2 .-> STOP[ ]
  end
  ORIGINAL --> RANDOM_TELEPORT

  style STOP  fill:#FFFFFF00, stroke:#FFFFFF00;
\[\left[ \begin{array}{ccc} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 0 \end{array} \right]\]

to

\[\left[ \begin{array}{ccc} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ \textbf{1/3} & \textbf{1/3} & \textbf{1/3} \end{array} \right]\] \[r = \beta M \cdot r + \left[ \frac{1 - \beta}{N}\right]_{N}\]

where $\left[ \frac{1 - \beta}{N}\right]_{N}$ is a vector with all $N$ entries have the same value $\frac{1-\beta}{N}$.


4. Hands-on: Page Rank in Spark

4.1. Small example data

```python linenums=”1” %%file small_graph.dat y y y a a y a m m a

1
2
3
4
5
6
- Load and view the data

```python linenums="1"
graph_data = sc.textFile("small_graph.dat")
graph_data.take(5)

```python linenums=”1” links = graph_data.map(lambda line: (line.split(“ “)[0], line.split(“ “)[1])) print(links.take(10)) links = links.groupByKey() print(links.take(10)) links = links.mapValues(list) print(links.take(10))

1
2
3
4
5
6
7
8
- Setup initial weight


```python linenums="1"
N = links.count()
ranks = links.map(lambda line: (line[0], 1/N))
ranks.take(3)

```python linenums=”1” def calculateVotes(t): res = [] for item in t[1][1]: count = len(t[1][1]) res.append((item, t[1][0] / count)) return res ############################### calculateVotes((‘y’, (0.3333333333333333, [‘y’, ‘a’])))

1
2
3
4
5
6
- Vote distribution

```python linenums="1"
votes = ranks.join(links)
votes.take(10)

```python linenums=”1” votes = votes.flatMap(calculateVotes) print(votes.take(10))

1
2
3
4
5
6
- Vote tally

```python linenums="1"
ranks = votes.reduceByKey(lambda x,y: x + y)
ranks.take(10)

```python linenums=”1” %%time for i in range(10): votes = ranks.join(links).flatMap(calculateVotes) ranks = votes.reduceByKey(lambda x,y: x + y) print(ranks.take(3))

1
2
3
4
5
6
7
8
9
10
11
12
13
14
- Iterating with a stopping condition

```python linenums="1"
%%time
sum = 1
while sum > 0.01:
    old_ranks = ranks
    votes = ranks.join(links).flatMap(calculateVotes)
    ranks = votes.reduceByKey(lambda x,y: x + y)
    errors = old_ranks.join(ranks).mapValues(lambda v: abs(v[0] - v[1]))
    sum = errors.values().sum()
    print(sum)
    print(ranks.take(3))

4.2. Hollins dataset


5. Hub and Authority

5.1. Intuition

5.2. Formalizing Hubbiness and Authority

5.3. Example

flowchart LR
    A((A)) --> B((B))
    A --> C((C))
    A ---> D((D))
    B ---> A
    B ---> D
    C --> E((E))
    D --> B
    D --> C
\[L = \left[ \begin{array}{ccccc} 0 & 1 & 1 & 1 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{array} \right] \ \ \ \ \ \ \ \ \ L^T = \left[ \begin{array}{ccccc} 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \end{array} \right]\]

5.4. Formal equation

\[\begin{align} h=\lambda La \\ a=\mu L^{T}h \end{align}\]

5.5. Computation implementation

???example “Iteration 0” - Initialize $h$:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
$$
h=
\left[
\begin{array}{c}
1 \\
1 \\
1 \\
1 \\
1 
\end{array}
\right]
$$

- Compute $a = L^{T}h$

$$
a = 
\left[
\begin{array}{ccccc}
0 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 & 0 \\
1 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0
\end{array}
\right]
\left[
\begin{array}{c}
1 \\
1 \\
1 \\
1 \\
1 
\end{array}
\right]
=
\left[
\begin{array}{c}
1 \\
2 \\
2 \\
2 \\
1 
\end{array}
\right]
$$

- Scale so that the largest component of $a$ is 1
    - Since max value in $a$ is 2, we divide all values in $a$ by 2.

$$
a=
\left[
\begin{array}{c}
1/2 \\
1 \\
1 \\
1 \\
1/2 
\end{array}
\right]
$$

- Compute $h = La$ 

$$
h = 
\left[
\begin{array}{ccccc}
0 & 1 & 1 & 1 & 0 \\
1 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 \\
0 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0
\end{array}
\right]
\left[
\begin{array}{c}
1/2 \\
1 \\
1 \\
1 \\
1/2
\end{array}
\right]
=
\left[
\begin{array}{c}
3 \\
3/2 \\
1/2 \\
2 \\
0 
\end{array}
\right]
$$

- Scale so that the largest component of $h$ is 1
    - Since max value in $h$ is 3, we divide all values in $h$ by 3.

$$
h=
\left[
\begin{array}{c}
1 \\
1/2 \\
1/6 \\
2/3 \\
0 
\end{array}
\right]
$$

???example “Iteration 1” - This is $h$ value from previous iteration:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
$$
h=
\left[
\begin{array}{c}
1 \\
1/2 \\
1/6 \\
2/3 \\
0 
\end{array}
\right]
$$

- Compute $a = L^{T}h$

$$
a = 
\left[
\begin{array}{ccccc}
0 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 & 0 \\
1 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0
\end{array}
\right]
\left[
\begin{array}{c}
1 \\
1/2 \\
1/6 \\
2/3 \\
0 
\end{array}
\right]
=
\left[
\begin{array}{c}
1/2 \\
5/3 \\
5/3 \\
3/2 \\
1/6 
\end{array}
\right]
$$

- Scale so that the largest component of $a$ is 1
    - Since max value in $a$ is $5/3$, we divide all values in $a$ by $5/3$.

$$
a=
\left[
\begin{array}{c}
3/10 \\
1 \\
1 \\
9/10 \\
1/10 
\end{array}
\right]
$$

- Compute $h = La$ 

$$
h = 
\left[
\begin{array}{ccccc}
0 & 1 & 1 & 1 & 0 \\
1 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 \\
0 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0
\end{array}
\right]
\left[
\begin{array}{c}
3/10 \\
1 \\
1 \\
9/10 \\
1/10
\end{array}
\right]
=
\left[
\begin{array}{c}
29/10 \\
6/5 \\
1/10 \\
2 \\
0 
\end{array}
\right]
$$

- Scale so that the largest component of $h$ is 1
    - Since max value in $h$ is $29/10$, we divide all values in $h$ by $29/10$.

$$
a=
\left[
\begin{array}{c}
1 \\
12/29 \\
1/29 \\
20/29 \\
0 
\end{array}
\right]
$$
\[\begin{align} h = [1,0.3583,0,0.7165,0] \\ a = [0.2087,1,1,0.7913,0] \end{align}\]

6. Hands-on: HITS

6.1. Example data

python linenums="1" %%file small_web.dat A B A D A C B A B D C E D C D B

6.2. Redo Hollins data