Perceptron Learning Algorithm
Training single-layer perceptrons with iterative weight updates
The Perceptron Learning Algorithm is an iterative method for training a single-layer perceptron to correctly classify linearly separable data.
The Learning Rule
The core of the perceptron algorithm is its learning rule, which updates weights based on classification errors:

$$w_i \leftarrow w_i + \eta \, (y - \hat{y}) \, x_i$$

Where:
- $w_i$ is the weight connected to the $i$-th input
- $\eta$ (eta) is the learning rate
- $y$ is the true label
- $\hat{y}$ is the perceptron's output
- $x_i$ is the $i$-th input feature
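As a concrete illustration, here is a minimal sketch of a single update step in Python with NumPy; the weight, input, and learning-rate values are made-up examples, not taken from the text:

```python
import numpy as np

# Made-up example values: 2 inputs, learning rate 0.1
w = np.array([0.5, -0.3])   # current weights w_i
b = 0.0                     # current bias
eta = 0.1                   # learning rate (eta)
x = np.array([1.0, 2.0])    # input features x_i
y = 1                       # true label (+1 or -1)

y_hat = 1 if np.dot(w, x) + b >= 0 else -1   # perceptron output
# Apply the learning rule: w_i <- w_i + eta * (y - y_hat) * x_i
w = w + eta * (y - y_hat) * x
b = b + eta * (y - y_hat)
print(w, b)   # here y_hat = -1, so the weights move toward x
```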
Algorithm Steps
- Initialize weights to small random values, and bias to zero
- For each training sample $(\mathbf{x}, y)$:
- Compute the predicted output: $\hat{y} = \operatorname{sign}(\mathbf{w} \cdot \mathbf{x} + b)$
- Update weights if there's a misclassification: $w_i \leftarrow w_i + \eta \, (y - \hat{y}) \, x_i$
- Update bias: $b \leftarrow b + \eta \, (y - \hat{y})$
- Repeat until all samples are correctly classified or max epochs reached
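A minimal NumPy sketch of these steps is shown below. The function and parameter names (`train_perceptron`, `max_epochs`) and the toy dataset are illustrative choices, and labels are assumed to be in {-1, +1}:

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=100):
    """Train a single-layer perceptron on labels in {-1, +1}."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=X.shape[1])  # small random initial weights
    b = 0.0                                      # bias starts at zero
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            y_hat = 1 if np.dot(w, xi) + b >= 0 else -1   # predicted output
            if y_hat != yi:                               # misclassification
                w += eta * (yi - y_hat) * xi              # weight update
                b += eta * (yi - y_hat)                   # bias update
                errors += 1
        if errors == 0:          # all samples correctly classified
            break
    return w, b

# Toy linearly separable data (illustrative)
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = train_perceptron(X, y)
print(w, b)
```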
Geometric Interpretation
Weight Vector as Decision Boundary
The weight vector $\mathbf{w}$ defines a hyperplane that separates the input space into two regions:

$$\mathbf{w} \cdot \mathbf{x} + b = 0$$

The weight vector $\mathbf{w}$ is perpendicular (normal) to this decision boundary. Any point $\mathbf{x}$ on the boundary satisfies this equation.
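As a worked example (made-up values, not from the text), take $\mathbf{w} = (2, 1)$ and $b = -4$ in two dimensions:

```latex
% Illustrative 2D example: w = (2, 1), b = -4
\mathbf{w} \cdot \mathbf{x} + b = 2x_1 + x_2 - 4 = 0
\quad\Longrightarrow\quad x_2 = 4 - 2x_1
% Two boundary points, e.g. (0, 4) and (2, 0), differ by (2, -4),
% and (2, 1) \cdot (2, -4) = 4 - 4 = 0, so w is perpendicular to the line.
```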
Angle and Classification
The classification decision depends on the angle $\theta$ between the weight vector $\mathbf{w}$ and the input vector $\mathbf{x}$:

$$\mathbf{w} \cdot \mathbf{x} = \|\mathbf{w}\| \, \|\mathbf{x}\| \cos\theta$$

Where $\theta$ is the angle between $\mathbf{w}$ and $\mathbf{x}$.
- If $\theta < 90^\circ$ (acute angle), then $\cos\theta > 0$, so $\mathbf{w} \cdot \mathbf{x} > 0$, and the perceptron outputs +1
- If $\theta > 90^\circ$ (obtuse angle), then $\cos\theta < 0$, so $\mathbf{w} \cdot \mathbf{x} < 0$, and the perceptron outputs -1
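A small NumPy check of this relationship; the two vectors are arbitrary illustrative values and the bias is taken as zero:

```python
import numpy as np

w = np.array([1.0, 2.0])           # weight vector (illustrative)
x = np.array([3.0, 1.0])           # input vector (illustrative)

dot = np.dot(w, x)
cos_theta = dot / (np.linalg.norm(w) * np.linalg.norm(x))
theta_deg = np.degrees(np.arccos(cos_theta))

# The sign of w . x matches the sign of cos(theta)
print(f"w.x = {dot:.2f}, theta = {theta_deg:.1f} deg, output = {1 if dot > 0 else -1}")
```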
Geometric View of Weight Updates
When the perceptron makes an error, the weight update rotates the weight vector toward the correctly classified region:
- Positive misclassification ($y = +1$, $\hat{y} = -1$): The input $\mathbf{x}$ has an obtuse angle with $\mathbf{w}$. We add $\eta \, (y - \hat{y}) \, \mathbf{x} = 2\eta \, \mathbf{x}$ to $\mathbf{w}$, rotating $\mathbf{w}$ toward $\mathbf{x}$.
- Negative misclassification ($y = -1$, $\hat{y} = +1$): The input $\mathbf{x}$ has an acute angle with $\mathbf{w}$. We subtract $2\eta \, \mathbf{x}$ from $\mathbf{w}$, rotating $\mathbf{w}$ away from $\mathbf{x}$.

This geometric intuition explains why the algorithm converges: each update reduces the angle between $\mathbf{w}$ and misclassified positive examples while increasing the angle to misclassified negative examples.
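The following sketch (illustrative values, labels in {-1, +1}, $\eta = 1$, bias omitted) checks that an update on a misclassified positive example does reduce the angle between $\mathbf{w}$ and $\mathbf{x}$:

```python
import numpy as np

def angle_deg(a, b):
    """Angle between two vectors, in degrees."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

w = np.array([1.0, -1.0])          # current weights (illustrative)
x = np.array([0.5, 2.0])           # positive example, currently misclassified
y, eta = 1, 1.0
y_hat = 1 if np.dot(w, x) >= 0 else -1   # evaluates to -1 here: a mistake

before = angle_deg(w, x)
w = w + eta * (y - y_hat) * x            # adds 2*eta*x, rotating w toward x
after = angle_deg(w, x)
print(f"angle before: {before:.1f} deg, after: {after:.1f} deg")
```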
Convergence Theorem
The Perceptron Convergence Theorem (also known as the Novikoff theorem) guarantees that for linearly separable data, the perceptron algorithm will converge to a solution in a finite number of updates.
Key Conditions
For convergence to be guaranteed:
- Linear Separability: There exists a separating hyperplane with margin $\gamma > 0$
- Bounded Inputs: All training samples satisfy $\|\mathbf{x}_i\| \le R$
- Existence of Solution: There exists a unit weight vector $\mathbf{w}^*$ (with $\|\mathbf{w}^*\| = 1$) such that $y_i \, (\mathbf{w}^* \cdot \mathbf{x}_i) \ge \gamma$ for all training samples $i$
Upper Bound on Iterations
The number of updates (mistakes) before convergence is bounded by:

$$k \le \frac{R^2}{\gamma^2}$$

Where:
- $k$ is the maximum number of weight updates
- $R$ is the maximum norm of the input vectors ($R = \max_i \|\mathbf{x}_i\|$)
- $\gamma$ is the margin of the separating hyperplane
This bound shows that:
- Larger margins (larger $\gamma$) lead to faster convergence
- Larger input norms (larger $R$) lead to slower convergence
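A hedged sketch of how the bound can be evaluated on a toy dataset. All values are illustrative, `w_star` is a unit-norm separator chosen by hand, and the bias is assumed to be zero so that the classic bound applies directly:

```python
import numpy as np

# Toy linearly separable data (illustrative), labels in {-1, +1}
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

# Hand-chosen unit-norm separator w* (assumed, not derived from the text)
w_star = np.array([1.0, 1.0]) / np.sqrt(2.0)

R = max(np.linalg.norm(xi) for xi in X)                        # largest input norm
gamma = min(yi * np.dot(w_star, xi) for xi, yi in zip(X, y))   # margin of w*

print(f"R = {R:.3f}, gamma = {gamma:.3f}")
print(f"mistake bound R^2 / gamma^2 = {(R / gamma) ** 2:.1f}")
```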
Geometric Interpretation of Convergence
From a geometric perspective:
- The weight vector $\mathbf{w}$ progressively rotates toward the ideal separator $\mathbf{w}^*$
- Each update increases the alignment between $\mathbf{w}$ and $\mathbf{w}^*$
- The convergence rate depends on how "difficult" the dataset is (related to the ratio $R / \gamma$)
Rate of Convergence
With learning rate $\eta$:
- Larger $\eta$ means larger weight updates per mistake
- However, very large $\eta$ may cause overshooting
- Standard practice: $\eta = 1$ (normalization can be applied instead)
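One way to see why a fixed $\eta$ is usually sufficient: when the weights and bias start at zero, changing $\eta$ only rescales $\mathbf{w}$ and $b$ by the same factor and leaves every prediction unchanged. A small sketch under that zero-initialization assumption (illustrative data and function name):

```python
import numpy as np

def train_zero_init(X, y, eta):
    """Perceptron loop starting from zero weights/bias (labels in {-1, +1})."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(100):
        for xi, yi in zip(X, y):
            y_hat = 1 if np.dot(w, xi) + b >= 0 else -1
            if y_hat != yi:
                w += eta * (yi - y_hat) * xi
                b += eta * (yi - y_hat)
    return w, b

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

w1, b1 = train_zero_init(X, y, eta=0.1)
w2, b2 = train_zero_init(X, y, eta=10.0)
# The two solutions differ only by a scale factor, so the decision
# boundaries (and every prediction) are identical.
print(w1 / np.linalg.norm(w1), b1 / np.linalg.norm(w1))
print(w2 / np.linalg.norm(w2), b2 / np.linalg.norm(w2))
```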
Limitations
- Only converges on linearly separable data
- May oscillate on non-separable data
- The specific solution found depends on the order of training samples and initial weights
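As a quick illustration of the first two limitations, a perceptron run on XOR-style data (a standard non-separable example; the loop mirrors the sketches above) never reaches an error-free epoch:

```python
import numpy as np

# XOR-style data: no single line can separate the two classes
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1, 1, 1, -1])

w, b, eta = np.zeros(2), 0.0, 1.0
for epoch in range(20):
    errors = 0
    for xi, yi in zip(X, y):
        y_hat = 1 if np.dot(w, xi) + b >= 0 else -1
        if y_hat != yi:
            w += eta * (yi - y_hat) * xi
            b += eta * (yi - y_hat)
            errors += 1
    print(f"epoch {epoch}: {errors} mistakes")   # never reaches 0
```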