
Perceptron Learning Algorithm

Training single-layer perceptrons with iterative weight updates

The Perceptron Learning Algorithm is an iterative method for training a single-layer perceptron to correctly classify linearly separable data.

The Learning Rule

The core of the perceptron algorithm is its learning rule, which updates weights based on classification errors:

w_i^{(new)} = w_i^{(old)} + \eta \cdot (y_{target} - y_{predicted}) \cdot x_i

Where:

  • w_i is the weight connecting to the i-th input
  • \eta (eta) is the learning rate
  • y_{target} is the true label
  • y_{predicted} is the perceptron's output
  • x_i is the i-th input feature
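
As a minimal sketch (assuming labels in {-1, +1} and a sign activation, so y_{target} - y_{predicted} is either 0 or ±2), one application of the rule in NumPy might look like:

```python
import numpy as np

def perceptron_update(w, x, y_target, y_predicted, eta=1.0):
    """One step of the rule: w_i <- w_i + eta * (y_target - y_predicted) * x_i."""
    return w + eta * (y_target - y_predicted) * x

w = np.array([0.5, -0.2])
x = np.array([1.0, 2.0])
# A misclassified positive example: target +1, predicted -1, so the step adds 2 * eta * x
w_new = perceptron_update(w, x, y_target=1, y_predicted=-1)
```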

Algorithm Steps

  1. Initialize the weights w to small random values and the bias b to zero
  2. For each training sample (x^{(i)}, y^{(i)}):
    • Compute the predicted output:
    \hat{y} = \text{activation}(w \cdot x + b)
    • Update the weights if there is a misclassification:
    w = w + \eta \cdot (y - \hat{y}) \cdot x
    • Update the bias:
    b = b + \eta \cdot (y - \hat{y})
  3. Repeat until all samples are correctly classified or max epochs reached
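
The steps above can be sketched as a small NumPy routine (the dataset and names are illustrative; labels are assumed to be in {-1, +1}):

```python
import numpy as np

def sign(z):
    return 1 if z >= 0 else -1

def train_perceptron(X, y, eta=1.0, max_epochs=100):
    """Train a perceptron with the update rule above; labels y in {-1, +1}."""
    w = np.zeros(X.shape[1])  # small random values also work
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            y_hat = sign(w @ xi + b)
            if y_hat != yi:
                w += eta * (yi - y_hat) * xi
                b += eta * (yi - y_hat)
                errors += 1
        if errors == 0:  # all samples correctly classified
            break
    return w, b

# Linearly separable toy data (AND-like labels), so convergence is guaranteed
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1, -1, -1, 1])
w, b = train_perceptron(X, y)
```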

Geometric Interpretation

Weight Vector as Decision Boundary

The weight vector w defines a hyperplane that separates the input space into two regions:

w \cdot x + b = 0

The weight vector w is perpendicular (normal) to this decision boundary. Any point on the boundary satisfies this equation.

Angle and Classification

The classification decision depends on the angle between the weight vector w and the input vector x:

\text{sign}(w \cdot x) = \text{sign}(\|w\| \cdot \|x\| \cdot \cos\theta)

Where \theta is the angle between w and x.

  • If \theta < 90^\circ (acute angle), then \cos\theta > 0, so w \cdot x > 0, and the perceptron outputs +1
  • If \theta > 90^\circ (obtuse angle), then \cos\theta < 0, so w \cdot x < 0, and the perceptron outputs -1
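
A quick numerical check of this identity, with an assumed weight vector along the first axis:

```python
import numpy as np

w = np.array([1.0, 0.0])
x_acute = np.array([1.0, 1.0])    # 45 degrees to w
x_obtuse = np.array([-1.0, 1.0])  # 135 degrees to w

for x in (x_acute, x_obtuse):
    cos_theta = (w @ x) / (np.linalg.norm(w) * np.linalg.norm(x))
    # sign(w . x) agrees with the sign of cos(theta)
    assert np.sign(w @ x) == np.sign(cos_theta)
```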

Geometric View of Weight Updates

When the perceptron makes an error, the weight update rotates the weight vector toward the correctly classified region:

  • Positive misclassification (y = +1, \hat{y} = -1): The input x makes an angle > 90^\circ with w. We add \eta \cdot x to w, rotating w toward x.

  • Negative misclassification (y = -1, \hat{y} = +1): The input x makes an angle < 90^\circ with w. We subtract \eta \cdot x from w, rotating w away from x.

This geometric intuition explains why the algorithm converges: each update reduces the angle between w and misclassified positive examples and increases the angle for misclassified negative examples.
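
The rotation can be verified numerically; the vectors below are assumptions chosen for illustration:

```python
import numpy as np

def angle_deg(u, v):
    """Angle between vectors u and v, in degrees."""
    cos = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

w = np.array([1.0, 0.0])
x = np.array([-1.0, 2.0])   # a positive example (y = +1) that w currently misclassifies

before = angle_deg(w, x)    # obtuse: w . x < 0
w_new = w + 2 * 1.0 * x     # update with eta = 1 and (y - y_hat) = 2
after = angle_deg(w_new, x)
assert after < before       # w has rotated toward x
```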

Convergence Theorem

The Perceptron Convergence Theorem (also known as the Novikoff theorem) guarantees that for linearly separable data, the perceptron algorithm will converge to a solution in a finite number of updates.

Key Conditions

For convergence to be guaranteed:

  1. Linear Separability: There exists a separating hyperplane with margin \gamma > 0
  2. Bounded Inputs: All training samples satisfy \|x_i\| \leq R
  3. Existence of a Solution: There exists a weight vector w^* such that:
y_i \cdot (w^* \cdot x_i) \geq \gamma \quad \text{for all } i

Upper Bound on Iterations

The number of updates (mistakes) before convergence is bounded by:

M \leq \left(\frac{R}{\gamma}\right)^2

Where:

  • M is the maximum number of weight updates
  • R = \max_i \|x_i\| is the maximum norm of the input vectors
  • \gamma is the margin of the separating hyperplane

This bound shows that:

  • Larger margins (larger \gamma) lead to faster convergence
  • Larger input norms (larger R) lead to slower convergence
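
For a concrete feel, the bound can be evaluated on a toy separable dataset with a known unit-norm separator (all values below are assumptions for illustration):

```python
import numpy as np

X = np.array([[2.0, 1.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1, 1, -1, -1])
w_star = np.array([1.0, 1.0]) / np.sqrt(2)  # unit-norm separating direction

R = max(np.linalg.norm(xi) for xi in X)                  # max input norm
gamma = min(yi * (w_star @ xi) for xi, yi in zip(X, y))  # margin under w_star
bound = (R / gamma) ** 2                                 # mistake bound M <= (R / gamma)^2
```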

Geometric Interpretation of Convergence

From a geometric perspective:

  • The weight vector w progressively rotates toward the ideal separator w^*
  • Each update increases the alignment between w and w^*
  • The convergence rate depends on how "difficult" the dataset is (related to R/\gamma)

Rate of Convergence

With learning rate \eta:

  • Larger \eta means larger weight updates per mistake
  • However, very large \eta may cause overshooting
  • Standard practice: \eta = 1 (input normalization can be applied instead)

Limitations

  • Only converges on linearly separable data
  • May oscillate on non-separable data
  • The specific solution found depends on the order of training samples and initial weights
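
The oscillation on non-separable data is easy to reproduce on XOR, which has no separating hyperplane (a hedged sketch; labels in {-1, +1}, sign(0) taken as +1):

```python
import numpy as np

def sign(z):
    return 1 if z >= 0 else -1

# XOR with {-1, +1} labels: not linearly separable
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1, 1, 1, -1])

w, b = np.zeros(2), 0.0
for epoch in range(1000):
    errors = 0
    for xi, yi in zip(X, y):
        y_hat = sign(w @ xi + b)
        if y_hat != yi:
            w += (yi - y_hat) * xi
            b += (yi - y_hat)
            errors += 1
    if errors == 0:
        break
# An error-free epoch would mean one fixed (w, b) classifies all of XOR: impossible
assert errors > 0
```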
