Distributed Machine Learning with Spark

Application

Example: Spam filtering

        viagra  learning  the  dating  nigeria   spam?
  X1      1        0       1     0       0       $y_1$ = +1
  X2      0        1       1     0       0       $y_2$ = -1
  X3      0        0       0     0       1       $y_3$ = +1
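
As a concrete sketch in plain NumPy (the weight vector below is hand-picked for illustration, not learned from data), the table can be encoded as a feature matrix with one row per email:

```python
import numpy as np

# Feature columns: viagra, learning, the, dating, nigeria
X = np.array([
    [1, 0, 1, 0, 0],   # X1
    [0, 1, 1, 0, 0],   # X2
    [0, 0, 0, 0, 1],   # X3
])
y = np.array([1, -1, 1])  # +1 = spam, -1 = ham

# Hypothetical weights: "spammy" words get positive weight.
w = np.array([4.0, -2.0, 0.0, 1.0, 3.0])
print(np.sign(X @ w))  # -> [ 1. -1.  1.], matching y
```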

Linear models for classification

Overview
\[f(x) = \begin{cases} +1 & \text{if } w_{1}x_{1}+w_{2}x_{2}+...+w_{d}x_{d} \ge \theta \\ -1 & \text{otherwise} \end{cases}\]
  • Each vector $X_j$ contains real-valued features.
  • The goal is to find a vector W = ($w_1$, $w_2$, …, $w_d$), where each $w_j$ is a real number, such that:
    • The labeled points are cleanly separated by the line (a hyperplane in $d$ dimensions):
\[w_{1}x_{1}+w_{2}x_{2}+...+w_{d}x_{d} = \theta\]
  • (Figure: dots mark spam, minuses mark ham.)
Linear classifiers
  • Each feature $i$ has a weight $w_i$.
  • Prediction is based on the weighted sum $f(x) = \sum_{i} w_{i}x_{i}$:
  • If f(x) is:
    • Positive: predict +1
    • Negative: predict -1
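
A minimal sketch of this decision rule; the weights and threshold below are illustrative assumptions, not values from the slides:

```python
import numpy as np

def predict(w: np.ndarray, theta: float, x: np.ndarray) -> int:
    # f(x) = +1 if the weighted sum w.x reaches the threshold, else -1
    return 1 if np.dot(w, x) >= theta else -1

# Illustrative weights for (viagra, learning, the, dating, nigeria)
w = np.array([4.0, -2.0, 0.0, 1.0, 3.0])
print(predict(w, theta=0.5, x=np.array([1, 0, 1, 0, 0])))  # -> 1 (spam)
print(predict(w, theta=0.5, x=np.array([0, 1, 1, 0, 0])))  # -> -1 (ham)
```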

Support Vector Machine

Overview
  • Originally developed by Vapnik and collaborators as a linear classifier.
  • It can be extended to non-linear classification by mapping inputs into high-dimensional feature spaces.
  • Problem statement:
    • We want to separate + from - using a line.
    • Training examples: $\{(x_i, y_i)\}_{i=1}^{n}$
    • Each example $i$: $x_i$ is a feature vector in $\mathbb{R}^d$ with label $y_i \in \{-1, +1\}$
    • Inner product: $w \cdot x = \sum_{j=1}^{d} w_{j}x_{j}$
  • Which is the best linear separator defined by w?
Support Vector Machine: largest margin
  • Distance from the separating line corresponds to the confidence of the prediction.
  • For example, we are more sure about the class of A and B than of C.
  • Margin definition: the margin $\gamma$ is the distance from the separating hyperplane to the closest training point.
  • Maximizing the margin when choosing w is well supported by intuition, theory, and practice.
  • A math question: how do we write this definition as an equation?
Support Vector Machine: what is the margin?
  • (Slide adapted from the textbook.)
  • Notation:
    • $\gamma$ is the distance from point A to the linear separator L: $d(A, L) = |AH|$
    • If we pick an arbitrary point M on line L, then $d(A, L)$ is the projection of AM onto the vector w.
    • Projection: $d(A, L) = \frac{|AM \cdot w|}{\|w\|}$
    • If we assume the Euclidean norm of w, $\|w\|$, equals one, this gives the result on the slide: $d(A, L) = |AM \cdot w|$.
  • In other words, maximizing the margin is directly related to how w is chosen.
  • For the $i$-th data point (with $\|w\| = 1$), the signed distance is:
\[\gamma_{i} = y_{i}\,(w \cdot x_{i} - \theta)\]

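To make the projection argument concrete, here is a small NumPy sketch (with made-up 2-D data) computing the distance of each point to the separator $w \cdot x = \theta$ and the resulting margin:

```python
import numpy as np

def distance(w, theta, X):
    # Distance of each row of X to the hyperplane w.x = theta:
    # |w.x - theta| / ||w||, as in the projection argument above.
    return np.abs(X @ w - theta) / np.linalg.norm(w)

def margin(w, theta, X, y):
    # Smallest signed distance y_i (w.x_i - theta) / ||w||;
    # positive only if every point lies on its correct side.
    return np.min(y * (X @ w - theta)) / np.linalg.norm(w)

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, theta = np.array([1.0, 1.0]), 0.0

print(distance(w, theta, X))   # per-point distances
print(margin(w, theta, X, y))  # the margin gamma
```
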
Some more math …
  • After some more mathematical manipulation:
  • Everything reduces to an optimization problem over w:
\[\min_{w,\theta}\ \frac{1}{2}\|w\|^{2} \quad \text{subject to } y_{i}\,(w \cdot x_{i} - \theta) \ge 1 \text{ for all } i\]
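
As one way to see this optimization in action, the constrained minimization can be handed to an off-the-shelf convex solver; cvxpy and the toy data are illustrative choices, not part of the original material:

```python
import cvxpy as cp
import numpy as np

# Toy separable data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

w = cp.Variable(2)
theta = cp.Variable()

# min (1/2)||w||^2  s.t.  y_i (w.x_i - theta) >= 1 for all i
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w - theta) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, theta.value)
```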
SVM: Non-linearly separable data
  • For each data point:
    • If the margin is greater than 1, there is no penalty (don’t care).
    • If the margin is less than 1, pay a penalty that grows linearly with the violation.
  • Introducing slack variables $\xi_{i}$:
\[\min_{w,\theta,\xi}\ \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{n}\xi_{i} \quad \text{subject to } y_{i}\,(w \cdot x_{i} - \theta) \ge 1 - \xi_{i},\ \ \xi_{i} \ge 0\]
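
Tying this back to Spark: the soft-margin linear SVM is what Spark ML's LinearSVC trains (hinge loss with L2 regularization). A minimal PySpark sketch on the toy spam table, noting that Spark ML expects 0/1 labels rather than ±1, and that regParam stands in (roughly inversely) for the penalty C:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LinearSVC
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("svm-sketch").getOrCreate()

# Toy spam data from the earlier table; ham is encoded as 0.0.
train = spark.createDataFrame([
    (1.0, Vectors.dense([1, 0, 1, 0, 0])),   # X1: spam
    (0.0, Vectors.dense([0, 1, 1, 0, 0])),   # X2: ham
    (1.0, Vectors.dense([0, 0, 0, 0, 1])),   # X3: spam
], ["label", "features"])

# regParam is the L2 penalty weight (roughly the inverse of C above).
model = LinearSVC(maxIter=50, regParam=0.1).fit(train)
print(model.coefficients, model.intercept)
```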