
Perceptron Learning Algorithm

Training single-layer perceptrons with iterative weight updates

The Perceptron Learning Algorithm is an iterative method for training a single-layer perceptron to correctly classify linearly separable data.

The Learning Rule

The core of the perceptron algorithm is its learning rule, which updates weights based on classification errors:

w_i^{(new)} = w_i^{(old)} + \eta \cdot (y_{target} - y_{predicted}) \cdot x_i

Where:

  • w_i is the weight connecting to the i-th input
  • \eta (eta) is the learning rate
  • y_{target} is the true label
  • y_{predicted} is the perceptron's output
  • x_i is the i-th input feature
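
As a minimal sketch (assuming labels in {-1, +1} and a sign activation, so y_{target} - y_{predicted} is either 0 or ±2), one application of the rule in NumPy might look like:

```python
import numpy as np

def perceptron_update(w, x, y_target, y_predicted, eta=1.0):
    """One step of the rule: w_i <- w_i + eta * (y_target - y_predicted) * x_i."""
    return w + eta * (y_target - y_predicted) * x

w = np.array([0.5, -0.2])
x = np.array([1.0, 2.0])
# A misclassified positive example: target +1, predicted -1, so the step adds 2 * eta * x
w_new = perceptron_update(w, x, y_target=1, y_predicted=-1)
```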

Algorithm Steps

  1. Initialize the weights w to small random values and the bias b to zero
  2. For each training sample (x^{(i)}, y^{(i)}):
    • Compute the predicted output:
    \hat{y} = \text{activation}(w \cdot x + b)
    • Update the weights if there is a misclassification:
    w = w + \eta \cdot (y - \hat{y}) \cdot x
    • Update the bias:
    b = b + \eta \cdot (y - \hat{y})
  3. Repeat until all samples are correctly classified or max epochs reached
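
The steps above can be sketched as a small NumPy routine (the dataset and names are illustrative; labels are assumed to be in {-1, +1}):

```python
import numpy as np

def sign(z):
    return 1 if z >= 0 else -1

def train_perceptron(X, y, eta=1.0, max_epochs=100):
    """Train a perceptron with the update rule above; labels y in {-1, +1}."""
    w = np.zeros(X.shape[1])  # small random values also work
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            y_hat = sign(w @ xi + b)
            if y_hat != yi:
                w += eta * (yi - y_hat) * xi
                b += eta * (yi - y_hat)
                errors += 1
        if errors == 0:  # all samples correctly classified
            break
    return w, b

# Linearly separable toy data (AND-like labels), so convergence is guaranteed
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1, -1, -1, 1])
w, b = train_perceptron(X, y)
```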

Geometric Interpretation

Weight Vector as Decision Boundary

The weight vector w defines a hyperplane that separates the input space into two regions:

w \cdot x + b = 0

The weight vector w is perpendicular (normal) to this decision boundary. Any point on the boundary satisfies this equation.

Angle and Classification

The classification decision depends on the angle between the weight vector w and the input vector x:

\text{sign}(w \cdot x) = \text{sign}(\|w\| \cdot \|x\| \cdot \cos\theta)

Where \theta is the angle between w and x.

  • If \theta < 90^\circ (acute angle), then \cos\theta > 0, so w \cdot x > 0, and the perceptron outputs +1
  • If \theta > 90^\circ (obtuse angle), then \cos\theta < 0, so w \cdot x < 0, and the perceptron outputs -1
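
A quick numerical check of this identity, with an assumed weight vector along the first axis:

```python
import numpy as np

w = np.array([1.0, 0.0])
x_acute = np.array([1.0, 1.0])    # 45 degrees to w
x_obtuse = np.array([-1.0, 1.0])  # 135 degrees to w

for x in (x_acute, x_obtuse):
    cos_theta = (w @ x) / (np.linalg.norm(w) * np.linalg.norm(x))
    # sign(w . x) agrees with the sign of cos(theta)
    assert np.sign(w @ x) == np.sign(cos_theta)
```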

Geometric View of Weight Updates

When the perceptron makes an error, the weight update rotates the weight vector toward the correctly classified region:

  • Positive misclassification (y = +1, \hat{y} = -1): The input x makes an angle > 90^\circ with w. We add \eta \cdot x to w, rotating w toward x.

  • Negative misclassification (y = -1, \hat{y} = +1): The input x makes an angle < 90^\circ with w. We subtract \eta \cdot x from w, rotating w away from x.

This geometric intuition explains why the algorithm converges: each update reduces the angle between w and misclassified positive examples and increases the angle for misclassified negative examples.
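
The rotation can be verified numerically; the vectors below are assumptions chosen for illustration:

```python
import numpy as np

def angle_deg(u, v):
    """Angle between vectors u and v, in degrees."""
    cos = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

w = np.array([1.0, 0.0])
x = np.array([-1.0, 2.0])   # a positive example (y = +1) that w currently misclassifies

before = angle_deg(w, x)    # obtuse: w . x < 0
w_new = w + 2 * 1.0 * x     # update with eta = 1 and (y - y_hat) = 2
after = angle_deg(w_new, x)
assert after < before       # w has rotated toward x
```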

Convergence Theorem

The Perceptron Convergence Theorem (also known as the Novikoff theorem) guarantees that for linearly separable data, the perceptron algorithm will converge to a solution in a finite number of updates.

Key Conditions

For convergence to be guaranteed:

  1. Linear Separability: There exists a separating hyperplane with margin \gamma > 0
  2. Bounded Inputs: All training samples satisfy \|x_i\| \leq R
  3. Existence of a Solution: There exists a weight vector w^* such that:
y_i \cdot (w^* \cdot x_i) \geq \gamma \quad \text{for all } i

Upper Bound on Iterations

The number of updates (mistakes) before convergence is bounded by:

M \leq \left(\frac{R}{\gamma}\right)^2

Where:

  • M is the maximum number of weight updates
  • R = \max_i \|x_i\| is the maximum norm of the input vectors
  • \gamma is the margin of the separating hyperplane

This bound shows that:

  • Larger margins (larger \gamma) lead to faster convergence
  • Larger input norms (larger R) lead to slower convergence
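
For a concrete feel, the bound can be evaluated on a toy separable dataset with a known unit-norm separator (all values below are assumptions for illustration):

```python
import numpy as np

X = np.array([[2.0, 1.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1, 1, -1, -1])
w_star = np.array([1.0, 1.0]) / np.sqrt(2)  # unit-norm separating direction

R = max(np.linalg.norm(xi) for xi in X)                  # max input norm
gamma = min(yi * (w_star @ xi) for xi, yi in zip(X, y))  # margin under w_star
bound = (R / gamma) ** 2                                 # mistake bound M <= (R / gamma)^2
```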

Geometric Interpretation of Convergence

From a geometric perspective:

  • The weight vector w progressively rotates toward the ideal separator w^*
  • Each update increases the alignment between w and w^*
  • The convergence rate depends on how "difficult" the dataset is (related to R/\gamma)

Rate of Convergence

With learning rate \eta:

  • Larger \eta means larger weight updates per mistake
  • However, very large \eta may cause overshooting
  • Standard practice: \eta = 1 (input normalization can be applied instead)

Limitations

  • Only converges on linearly separable data
  • May oscillate on non-separable data
  • The specific solution found depends on the order of training samples and initial weights
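
The oscillation on non-separable data is easy to reproduce on XOR, which has no separating hyperplane (a hedged sketch; labels in {-1, +1}, sign(0) taken as +1):

```python
import numpy as np

def sign(z):
    return 1 if z >= 0 else -1

# XOR with {-1, +1} labels: not linearly separable
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1, 1, 1, -1])

w, b = np.zeros(2), 0.0
for epoch in range(1000):
    errors = 0
    for xi, yi in zip(X, y):
        y_hat = sign(w @ xi + b)
        if y_hat != yi:
            w += (yi - y_hat) * xi
            b += (yi - y_hat)
            errors += 1
    if errors == 0:
        break
# An error-free epoch would mean one fixed (w, b) classifies all of XOR: impossible
assert errors > 0
```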
