Neural Networks from Scratch
Neurons
The Motivation
The world isn't perfect, so your data won't be perfect. It will contain noise, contradictions, and ambiguous data points. For mathematical functions, this is a worst-case scenario. Functions are deterministic: every input always maps to a single output. Unleashed on real-world data, they become brittle and break. They are too pure for our flawed, sinful world.
Luckily, humanity (created or discovered?) the concept of approximation, which thrives in these imperfect scenarios.
What is it?
A neuron (in ML) is basically a function approximator. It takes in inputs, processes them (using weights, biases, and activation functions), and generates an output that estimates some pattern or relationship.
Architecture
A neuron is made up of three components:
Inputs
Inputs are numerical values that represent features or data (often the outputs of other neurons). In our case, the inputs will be basic coordinates from a 2D plot (for a classification problem), but they can represent any numerical data (e.g. pixels from an image, financial data, or genome sequencing data).
Weights
Weights are applied to the inputs through multiplication ($w_i x_i$) and determine each input's importance. Weights are learnable parameters that are updated throughout training.
Bias
The bias is an additional parameter that is unique to each neuron. Like the weights, it is learnable and changes during training as the model learns. The job of the bias term is to shift the activation function left or right, increasing flexibility.
Mathematically, the neuron multiplies each input by its weight, sums the results, and adds its bias:

$$z = \sum_{i=1}^{n} w_i x_i + b$$
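As a minimal sketch of this computation, the weighted sum plus bias is a single dot product in NumPy. The function name `neuron_output` and the sample values are illustrative assumptions, not part of the original example.

```python
import numpy as np

def neuron_output(x, w, b):
    """Pre-activation output of one neuron: z = w . x + b."""
    return np.dot(w, x) + b

# Hypothetical inputs, weights, and bias chosen only for illustration
x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
b = 0.1

z = neuron_output(x, w, b)
print(z)  # 0.5 * 1.0 + (-0.25) * 2.0 + 0.1
```

The same function works for any number of inputs, as long as `x` and `w` have the same length.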
Activation Function
Now, this neuron output is fine, but without some sort of non-linear transformation, it is just a glorified linear regression. If we stacked only linear functions across multiple layers, the result would still be a linear function. Without non-linearity, no matter how deep the network is, it can't model anything more complex than a single-layer linear model.
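The collapse of stacked linear layers is easy to check numerically. The sketch below assumes two small layers with random weights (the shapes and seed are arbitrary choices) and shows that feeding an input through both gives the same result as one combined linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical linear layers with no activation in between
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)

x = rng.normal(size=2)

# Passing x through both layers in sequence
stacked = W2 @ (W1 @ x + b1) + b2

# The equivalent single linear layer
W = W2 @ W1
b = W2 @ b1 + b2
collapsed = W @ x + b

print(np.allclose(stacked, collapsed))  # True
```

Algebraically, $W_2(W_1 x + b_1) + b_2 = (W_2 W_1)x + (W_2 b_1 + b_2)$, which is again of the form $Wx + b$.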
Applying a non-linear activation function $f$ to a neuron's output can generally be defined mathematically as:

$$a = f(z) = f\left(\sum_{i=1}^{n} w_i x_i + b\right)$$
Example: The AND Gate Problem
We'll demonstrate how a neuron works with a basic classification problem using four data points representing two classes:
- Class A: $(0, 0)$, $(0, 1)$, $(1, 0)$
- Class B: $(1, 1)$
The solution looks pretty trivial, and it actually is. It can be solved with a single neuron which creates a linear decision boundary that perfectly separates the two classes.
Activation Function
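We'll use the Heaviside step function for binary classification, defined as:

$$H(z) = \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{if } z \geq 0 \end{cases}$$

The Heaviside step function returns $0$ (Class A) if the value is negative and $1$ (Class B) otherwise.
Note that $H'(z)$ is undefined at $z = 0$ (and zero everywhere else), which breaks backpropagation in practice. (That's safe here, since we set the parameters by hand instead of training.)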
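A minimal sketch of the step function in NumPy, assuming the convention that an input of exactly zero maps to 1 (the function name `heaviside` is my own label for this example):

```python
import numpy as np

def heaviside(z):
    """Heaviside step: 0 for negative inputs, 1 otherwise (including z = 0)."""
    return np.where(np.asarray(z) >= 0, 1, 0)

outputs = heaviside([-2.0, -0.1, 0.0, 0.1, 3.0])
print(outputs)  # [0 0 1 1 1]
```

NumPy also ships `np.heaviside(x1, x2)`, where the second argument sets the value returned at zero.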
Solution
Our inputs are coordinates, so our single neuron will take in two values. We need to find weights and bias that satisfy the conditions:
Point | Output | Classification |
---|---|---|
$(0, 0)$ | $z < 0$ | A |
$(0, 1)$ | $z < 0$ | A |
$(1, 0)$ | $z < 0$ | A |
$(1, 1)$ | $z \geq 0$ | B |
Usually these parameters are found through training, but for the sake of the example I'm eyeballing them:
Let's verify.
:
:
:
:
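The same check can be scripted. The sketch below assumes the illustrative parameters $w_1 = w_2 = 1$ and $b = -1.5$, which are an assumption made for this example and not necessarily the values eyeballed above. It also assumes the AND gate labeling, where only $(1, 1)$ belongs to Class B.

```python
import numpy as np

# Assumed parameters for illustration: any (w1, w2, b) with
# b < 0, w1 + b < 0, w2 + b < 0 and w1 + w2 + b >= 0 would also work
w = np.array([1.0, 1.0])
b = -1.5

points = [(0, 0), (0, 1), (1, 0), (1, 1)]
expected = np.array([0, 0, 0, 1])    # 0 = Class A, 1 = Class B

z = np.array(points) @ w + b         # weighted sum plus bias per point
predicted = np.where(z >= 0, 1, 0)   # Heaviside step

for point, z_i, out in zip(points, z, predicted):
    label = "B" if out == 1 else "A"
    print(f"{point}: z = {z_i:+.1f} -> Class {label}")

print(np.array_equal(predicted, expected))  # True
```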
Nice! If we substitute the parameters back into the neuron's equation and set the weighted sum plus bias to zero, we can solve for the decision boundary:
This boundary gives us simple classification rules:
- Class A:
- Class B:
These classification rules allow us to generalize across all points.
Point | Calculation | Output | Classification |
---|---|---|---|
A | |||
B | |||
Limitations
We purposely used an easily separable AND gate dataset that was solvable with a single neuron. But what if our dataset was more complex? Let's analyze two cases where our simple architecture fails:
These datasets are not linearly separable. In order to generate more complex decision boundaries, we need hidden layers with more neurons and non-linear, differentiable activation functions like Sigmoid or ReLU.
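To see what "not linearly separable" means in practice, the sketch below uses XOR as an assumed stand-in dataset (a classic non-separable case, not necessarily one of the two discussed here). It tries every combination of weights and bias on a coarse grid and counts how many single-neuron configurations reproduce AND versus XOR. A grid search is only an illustration and not a proof, but the contrast is telling.

```python
import itertools
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = {
    "AND": np.array([0, 0, 0, 1]),
    "XOR": np.array([0, 1, 1, 0]),
}

grid = np.linspace(-2.0, 2.0, 21)  # candidate values for w1, w2 and b
solutions = {name: 0 for name in targets}

for w1, w2, b in itertools.product(grid, repeat=3):
    z = X @ np.array([w1, w2]) + b
    predicted = np.where(z >= 0, 1, 0)  # Heaviside step
    for name, y in targets.items():
        if np.array_equal(predicted, y):
            solutions[name] += 1

print(solutions)  # many AND solutions, zero XOR solutions
```

No choice of two weights and a bias can work for XOR: it would need $b < 0$, $w_1 + b \geq 0$, and $w_2 + b \geq 0$, which together force $w_1 + w_2 + b > 0$, contradicting the requirement for $(1, 1)$.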
A Neural Network
While individual neurons have limited capabilities, people discovered that stacking them in layers and combining multiple layers creates systems that are capable of learning more complex, non-linear patterns.
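As a small illustration of what stacking buys us, here is a hand-wired two-layer network that computes XOR, a function no single neuron with a linear boundary can represent. The weights are hand-picked assumptions for this sketch (one hidden neuron acts as OR, the other as NAND, and the output neuron as AND), and it uses a step activation so the arithmetic is easy to follow by hand.

```python
import numpy as np

def step(z):
    """Heaviside step: 0 for negative inputs, 1 otherwise."""
    return np.where(z >= 0, 1, 0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: column 0 acts as OR, column 1 as NAND (hand-picked weights)
W_hidden = np.array([[1.0, -1.0],
                     [1.0, -1.0]])
b_hidden = np.array([-0.5, 1.5])

# Output neuron: AND of the two hidden outputs
w_out = np.array([1.0, 1.0])
b_out = -1.5

hidden = step(X @ W_hidden + b_hidden)
xor = step(hidden @ w_out + b_out)
print(xor)  # [0 1 1 0]
```

A trained network would find its own weights and would use differentiable activations, but the principle is the same: the hidden layer re-maps the inputs into a space where a single linear boundary is enough.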
More Activation Functions
Each layer can be assigned its own activation function, giving the network even more flexibility. Let's look at a few popular ones.
ReLU (Rectified Linear Unit)
ReLU is a good activation to use for hidden layers because its gradient is exactly 1 for positive inputs, which makes it less likely to produce vanishing gradients. It's also cheaper to compute than sigmoid or tanh, since it needs no exponentials.
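It's defined and derived as follows:

$$\text{ReLU}(z) = \max(0, z)$$

$$\text{ReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{otherwise} \end{cases}$$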
```python
import numpy as np


def relu(z):
    """
    ReLU activation function
    Args:
        z: Input tensor
    Returns:
        Tensor after applying ReLU
    """
    return np.maximum(0, z)


def relu_prime(z):
    """
    Derivative of ReLU
    Args:
        z: Input tensor
    Returns:
        Gradient tensor (1 where z > 0, else 0)
    """
    return (z > 0).astype(float)
```
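To make the vanishing-gradient point concrete, here is a rough sketch under strong simplifying assumptions: a chain of 10 layers, a pre-activation of 1.0 at every layer, and only the activation derivatives multiplied together (weights are ignored). The depth and the value 1.0 are arbitrary choices for illustration.

```python
import numpy as np

def relu_prime(z):
    return (z > 0).astype(float)

def sigmoid_prime(z):
    s = 1 / (1 + np.exp(-z))
    return s * (1 - s)

depth = 10                # assumed number of stacked layers
z = np.full(depth, 1.0)   # assumed pre-activation of 1.0 at every layer

relu_chain = np.prod(relu_prime(z))        # product of ten 1s
sigmoid_chain = np.prod(sigmoid_prime(z))  # product of ten values below 0.25

print(relu_chain, sigmoid_chain)
```

The ReLU product stays at 1, while the sigmoid product shrinks below one millionth. The flip side is that ReLU's gradient is 0 for negative inputs, so a neuron whose pre-activation stays negative stops learning.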
Sigmoid (Logistic Function)
Sigmoid is great for binary classification problems where outputs need to be between $0$ and $1$.
It's defined and derived as follows:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

$$\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$$
```python
def sigmoid(z):
    """
    Sigmoid activation function
    Args:
        z: Input tensor
    Returns:
        Tensor after applying sigmoid
    """
    return 1 / (1 + np.exp(-z))


def sigmoid_prime(z):
    """
    Derivative of sigmoid
    Args:
        z: Input tensor
    Returns:
        Gradient tensor
    """
    s = sigmoid(z)
    return s * (1 - s)
```
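A quick way to gain confidence in a hand-written derivative is to compare it against a finite-difference estimate. The sketch below restates the two sigmoid functions so it runs on its own. The step size `h` and the test grid are arbitrary assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

z = np.linspace(-4.0, 4.0, 9)
h = 1e-5  # small step for the central difference

numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
analytic = sigmoid_prime(z)

max_error = np.max(np.abs(numeric - analytic))
print(max_error)  # should be tiny compared with the gradients themselves
```

The same pattern works for any of the activation derivatives in this section.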
Tanh (Hyperbolic Tangent)
Tanh should be used when layers need zero-centered output (popular in RNNs) and you need stronger gradients than sigmoid provides.
It's defined and derived as follows:

$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$

$$\tanh'(z) = 1 - \tanh^2(z)$$
```python
def tanh(z):
    """
    Tanh activation function
    Args:
        z: Input tensor
    Returns:
        Tensor after applying tanh
    """
    return np.tanh(z)


def tanh_prime(z):
    """
    Derivative of tanh
    Args:
        z: Input tensor
    Returns:
        Gradient tensor
    """
    return 1 - np.tanh(z)**2
```
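The two claims about tanh (zero-centered output and stronger gradients than sigmoid) can be checked numerically. This is a small sketch over an assumed grid of inputs from -5 to 5.

```python
import numpy as np

z = np.linspace(-5.0, 5.0, 1001)  # symmetric grid around z = 0

sigmoid = 1 / (1 + np.exp(-z))
tanh = np.tanh(z)

# Output ranges: sigmoid stays in (0, 1), tanh in (-1, 1)
print(sigmoid.min(), sigmoid.max())
print(tanh.min(), tanh.max())

# Peak gradients, both reached around z = 0
sigmoid_peak = np.max(sigmoid * (1 - sigmoid))  # about 0.25
tanh_peak = np.max(1 - tanh**2)                 # about 1.0
print(sigmoid_peak, tanh_peak)
```

The peak gradient of tanh is four times that of sigmoid, and its outputs are balanced around zero.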
The Loss Function
The loss function quantifies how far our network's predictions are from the true targets. There are many types of loss functions, but for a binary classification model, a standard choice is Binary Cross-Entropy (BCE). Logarithms blow up for predictions of exactly $0$ or $1$, so predictions are clipped to $[\epsilon, 1 - \epsilon]$ with a small constant $\epsilon$ to prevent numerical instability.
It's defined and derived as follows:

$$L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$$

$$\frac{\partial L}{\partial p_i} = \frac{p_i - y_i}{N \, p_i (1 - p_i)}$$

Where:
- $N$ is the number of samples
- $y_i$ is the true label ($0$ or $1$)
- $p_i$ is the predicted probability (between $0$ and $1$)
```python
def loss(p, Y):
    """
    Binary Cross-Entropy loss function
    Args:
        p: Predicted probabilities (0-1)
        Y: True labels (0 or 1)
    Returns:
        Computed loss value
    """
    eps = 1e-12  # Small constant for numerical stability
    p = np.clip(p, eps, 1 - eps)  # Clip values to avoid log(0)
    return -np.mean(Y * np.log(p) + (1 - Y) * np.log(1 - p))


def loss_prime(p, Y):
    """
    Derivative of Binary Cross-Entropy loss
    Args:
        p: Predicted probabilities (0-1)
        Y: True labels (0 or 1)
    Returns:
        Gradient of the loss
    """
    eps = 1e-12  # Small constant for numerical stability
    p = np.clip(p, eps, 1 - eps)  # Clip values to avoid division by 0
    # Normalize by the number of elements so the scale matches np.mean in loss()
    return (p - Y) / (p * (1 - p)) / np.size(Y)
```
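As a sanity check, the analytic gradient can be compared against a finite-difference estimate, and the effect of clipping can be seen on a confidently wrong prediction. The sketch restates compact versions of the two functions so it runs on its own. The sample predictions and labels are made up for illustration.

```python
import numpy as np

def loss(p, Y):
    eps = 1e-12
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(Y * np.log(p) + (1 - Y) * np.log(1 - p))

def loss_prime(p, Y):
    eps = 1e-12
    p = np.clip(p, eps, 1 - eps)
    return (p - Y) / (p * (1 - p)) / np.size(Y)

# Hypothetical predictions and labels
p = np.array([0.9, 0.2, 0.6])
Y = np.array([1.0, 0.0, 1.0])

# Central-difference estimate of dL/dp_i, one component at a time
h = 1e-6
numeric = np.zeros_like(p)
for i in range(p.size):
    step = np.zeros_like(p)
    step[i] = h
    numeric[i] = (loss(p + step, Y) - loss(p - step, Y)) / (2 * h)

print(np.allclose(numeric, loss_prime(p, Y), atol=1e-6))  # True

# Clipping keeps the loss finite even for a confidently wrong prediction
worst_case = loss(np.array([0.0]), np.array([1.0]))
print(worst_case)  # -log(1e-12), about 27.6
```

When the output layer is a sigmoid, multiplying this gradient by $\sigma'(z) = p(1 - p)$ cancels the denominator, leaving $\partial L / \partial z = (p - y) / N$.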
Network Training
Let's manually go through an iteration of training to understand the mathematics behind learning.
Network Architecture
- Input Layer: Single node ()
- Hidden Layer: 1 neuron with ReLU activation
  - Weight () =
  - Bias () =
- Output Layer: 1 neuron with Sigmoid activation
  - Weight () =
  - Bias () =
- Target: (binary classification)
- Learning Rate:
Forward Propagation
Hidden Layer Calculation:
Output Layer Calculation:
Loss Calculation (BCE):
Our network predicts a probability for the positive class, with a loss of .
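The forward pass can be sketched in a few lines of code. All numbers and variable names below (`x`, `y`, `w1`, `b1`, `w2`, `b2`) are assumed values picked for illustration and are not necessarily the ones used in this walkthrough.

```python
import numpy as np

# Assumed values for illustration only
x, y = 2.0, 1.0        # input and target
w1, b1 = 0.5, 0.1      # hidden layer (ReLU)
w2, b2 = -0.3, 0.2     # output layer (sigmoid)

# Hidden layer
z1 = w1 * x + b1       # weighted input plus bias
a1 = max(0.0, z1)      # ReLU

# Output layer
z2 = w2 * a1 + b2
p = 1 / (1 + np.exp(-z2))  # sigmoid, predicted probability of the positive class

# Binary cross-entropy for a single sample
bce = -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(f"z1={z1:.4f} a1={a1:.4f} z2={z2:.4f} p={p:.4f} loss={bce:.4f}")
```

With these assumed values the prediction lands a little below 0.5 for a target of 1, so there is a meaningful loss for the backward pass to reduce.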
Backward Propagation
Unfortunately the fun part is now over. In order to update the parameters of our network based on the accuracy of our predicted output, we have to traverse backwards through the network, calculating all the gradients as we go.
Output Layer Gradient Calculation:
Hidden Layer Gradient Calculation:
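Here is a sketch of the backward pass for a network of this shape (one ReLU hidden neuron feeding one sigmoid output neuron, trained with BCE). The values and names are assumptions for illustration. The sketch relies on the standard simplification that the sigmoid and BCE derivatives combine into $\partial L / \partial z_2 = p - y$, and it verifies one gradient against a finite-difference estimate.

```python
import numpy as np

# Assumed values for illustration only
x, y = 2.0, 1.0
w1, b1 = 0.5, 0.1      # hidden layer (ReLU)
w2, b2 = -0.3, 0.2     # output layer (sigmoid)

def forward(w1, b1, w2, b2):
    z1 = w1 * x + b1
    a1 = max(0.0, z1)
    z2 = w2 * a1 + b2
    p = 1 / (1 + np.exp(-z2))
    bce = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return z1, a1, p, bce

z1, a1, p, bce = forward(w1, b1, w2, b2)

# Output layer gradients (chain rule through BCE and sigmoid)
dL_dz2 = p - y
dL_dw2 = dL_dz2 * a1
dL_db2 = dL_dz2

# Hidden layer gradients (chain rule through the output weight and ReLU)
dL_da1 = dL_dz2 * w2
dL_dz1 = dL_da1 * (1.0 if z1 > 0 else 0.0)
dL_dw1 = dL_dz1 * x
dL_db1 = dL_dz1

# Finite-difference check on the hidden weight
h = 1e-6
numeric_dw1 = (forward(w1 + h, b1, w2, b2)[3]
               - forward(w1 - h, b1, w2, b2)[3]) / (2 * h)
print(dL_dw1, numeric_dw1)  # the two values should agree closely
```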
Parameter Updates
Now that the painful part is over, we can update the network parameters using the gradients found during backpropagation.
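Written out, each update is a plain gradient-descent step. The symbols here are assumed names, since any consistent labels would do: $w_1, b_1$ for the hidden layer, $w_2, b_2$ for the output layer, $\eta$ for the learning rate, and $L$ for the loss.

```latex
\begin{aligned}
w_1 &\leftarrow w_1 - \eta \, \frac{\partial L}{\partial w_1}, &
b_1 &\leftarrow b_1 - \eta \, \frac{\partial L}{\partial b_1}, \\
w_2 &\leftarrow w_2 - \eta \, \frac{\partial L}{\partial w_2}, &
b_2 &\leftarrow b_2 - \eta \, \frac{\partial L}{\partial b_2}.
\end{aligned}
```

Each parameter moves against its gradient, so a positive gradient lowers the parameter and a negative gradient raises it.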
We’ve completed one training iteration (forward pass, backward pass, and parameter update). Training repeats this cycle for hundreds or thousands of iterations, with the gradients nudging the weights and biases to reduce the loss until it converges to a minimum (as close to 0 as possible).
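Putting the three steps in a loop gives a minimal training sketch for a single sample. The starting parameters, learning rate, and iteration count are assumptions chosen for illustration, not values from this walkthrough.

```python
import numpy as np

# Assumed values for illustration only
x, y = 2.0, 1.0
w1, b1, w2, b2 = 0.5, 0.1, -0.3, 0.2
learning_rate = 0.1
losses = []

for iteration in range(200):
    # Forward pass
    z1 = w1 * x + b1
    a1 = max(0.0, z1)                  # ReLU
    z2 = w2 * a1 + b2
    p = 1 / (1 + np.exp(-z2))          # sigmoid
    losses.append(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

    # Backward pass
    dz2 = p - y                        # BCE and sigmoid combined
    dw2, db2 = dz2 * a1, dz2
    dz1 = dz2 * w2 * (1.0 if z1 > 0 else 0.0)
    dw1, db1 = dz1 * x, dz1

    # Parameter update (gradient descent)
    w1 -= learning_rate * dw1
    b1 -= learning_rate * db1
    w2 -= learning_rate * dw2
    b2 -= learning_rate * db2

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

With a single training sample the loss keeps shrinking toward 0. On a real dataset it would level off at whatever minimum the data and architecture allow.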