Neural Networks from Scratch

27 May 2025 · 1198 words · 6 minutes read

Neurons

The Motivation

The world isn't perfect, so your data won't be perfect. There will be noise, as well as contradictory and ambiguous data points. For mathematical functions, this is a worst-case scenario. Functions are deterministic in that every input always has a single output. When unleashed on real-world data, they become brittle and break. They are too pure for our flawed, sinful world.

Luckily, humanity created (or discovered?) the concept of approximation, which thrives in these imperfect scenarios.

What is it?

A neuron (in ML) is basically a function approximator. It takes in inputs, processes them (using weights, biases, and activation functions), and generates an output that estimates some pattern or relationship.

Architecture

A single neuron processes weighted inputs, adds a bias, and applies an activation function.

A neuron is made up of three components:

Inputs $(x_1, x_2, \dots, x_n)$

Inputs are numerical values that represent some features or data (usually from other neurons). In our case, our inputs will be basic $(x, y)$ coordinates from a 2D plot (for a classification problem), but they can represent any numerical data (e.g. pixels from an image, financial data, genome sequencing data, etc.)

Weights $(w_1, w_2, \dots, w_n)$

Weights are applied to the inputs through multiplication ($w_i \cdot x_i$) and determine each input's importance. Weights are learnable parameters that constantly change during training.

Bias $(b)$

The bias is an additional learnable parameter that is unique to each neuron. Like the weights, it changes during training as the model learns. The job of the bias term is to shift the activation function left or right, increasing flexibility.

Mathematically, we can represent the computation of these three components as follows, by adding the neuron's bias to the incoming weighted sum:

$$ z = \sum_{i=1}^n w_i x_i + b = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b $$

Activation Function $(\sigma)$

Now, this neuron output $z$ is fine, but without some sort of non-linear transformation, $z$ is just a glorified linear regression. If we stacked only linear functions across multiple layers, the result would still be just a linear function. Without some sort of non-linearity, no matter how deep the network is, it can't model anything more complex than a single-layer linear model.
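To see why, consider stacking two purely linear layers on an input $x$ (a quick derivation with generic weight matrices $W_1, W_2$ and biases $b_1, b_2$):

$$ W_2(W_1 x + b_1) + b_2 = (W_2 W_1)x + (W_2 b_1 + b_2) $$

The result is just another linear map, with effective weights $W_2 W_1$ and bias $W_2 b_1 + b_2$, so the extra layer adds no expressive power.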

Applying a non-linear transformation to a neuron's output can generally be defined mathematically as:

$$ y = \sigma(z) $$
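Putting the pieces together, a single neuron is only a few lines of NumPy. This is just an illustrative sketch; the helper name neuron_forward and the toy numbers are mine, not part of any standard API:

import numpy as np

def neuron_forward(x, w, b, activation):
    """
    Compute a single neuron's output
    Args:
        x: Input vector
        w: Weight vector (one weight per input)
        b: Bias (scalar)
        activation: Non-linear function applied to z
    Returns:
        The neuron's output y = activation(w . x + b)
    """
    z = np.dot(w, x) + b  # weighted sum plus bias
    return activation(z)  # non-linear transformation

# Toy example: two inputs, arbitrary parameters, tanh as the activation
y = neuron_forward(x=np.array([0.5, -1.0]),
                   w=np.array([0.3, 0.7]),
                   b=0.1,
                   activation=np.tanh)
print(y)  # a single scalar output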

Example: The AND Gate Problem

We'll demonstrate how a neuron works with a basic classification problem using four data points representing two classes:

  • Class A: $(0,0)$, $(1,0)$, $(0,1)$
  • Class B: $(1,1)$

The solution looks pretty trivial, and it actually is. It can be solved with a single neuron which creates a linear decision boundary that perfectly separates the two classes.

Activation Function

We'll use the Heaviside step function for binary classification defined as:

$$ \sigma(x) = \begin{cases} 1 & \text{if } x \geq 0 \\ 0 & \text{if } x < 0 \end{cases} $$

The Heaviside step function returns $0$ (Class A) if the value is negative and $1$ (Class B) for all else.
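As a sketch in code (the helper name heaviside is just for illustration; NumPy also provides np.heaviside, which additionally takes the value to return at exactly zero):

import numpy as np

def heaviside(z):
    """
    Heaviside step activation
    Args:
        z: Input value or array
    Returns:
        1 where z >= 0, otherwise 0
    """
    return np.where(z >= 0, 1, 0)

print(heaviside(np.array([-1.5, -0.5, 0.0, 0.5])))  # [0 0 1 1]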

Solution

Our inputs are $(x, y)$ coordinates, so our single neuron will take in two values. We need to find weights and bias that satisfy the conditions:

| Point | Output | Classification |
| --- | --- | --- |
| $(0,0)$ | $0$ | A |
| $(1,0)$ | $0$ | A |
| $(0,1)$ | $0$ | A |
| $(1,1)$ | $1$ | B |

Usually these parameters are found through training, but for the sake of the example I'm eyeballing them:

  • $w_1 = 1$
  • $w_2 = 1$
  • $b = -1.5$

Let's verify.

$(0,0)$:

$$ z = (1)(0) + (1)(0) - 1.5 = -1.5 \implies y = \sigma(-1.5) = 0 $$

$(1,0)$:

$$ z = (1)(1) + (1)(0) - 1.5 = -0.5 \implies y = \sigma(-0.5) = 0 $$

$(0,1)$:

$$ z = (1)(0) + (1)(1) - 1.5 = -0.5 \implies y = \sigma(-0.5) = 0 $$

$(1,1)$:

$$ z = (1)(1) + (1)(1) - 1.5 = 0.5 \implies y = \sigma(0.5) = 1 $$

Nice! If we substitute the parameters back into the neuron's equation, we can solve for the decision boundary function:

$$ z = w_1 x_1 + w_2 x_2 + b $$
$$ 0 = 1 \cdot x_1 + 1 \cdot x_2 - 1.5 $$
$$ x_1 + x_2 = 1.5 $$
A linear boundary separating Class A from Class B.

This boundary gives us simple classification rules:

  • Class A: $x_1 + x_2 < 1.5 \rightarrow 0$
  • Class B: $x_1 + x_2 \geq 1.5 \rightarrow 1$

These classification rules allow us to generalize across all points.

| Point | Calculation | Output | Classification |
| --- | --- | --- | --- |
| $(0.5, 0)$ | $0.5 + 0 = 0.5 < 1.5$ | $0$ | A |
| $(1, 0.6)$ | $1 + 0.6 = 1.6 \geq 1.5$ | $1$ | B |
| $\dots$ | $\dots$ | $\dots$ | $\dots$ |
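As a quick sanity check, here's a small NumPy sketch of this single neuron with the hand-picked parameters; the classify helper is my own naming, but it should reproduce the tables above:

import numpy as np

w = np.array([1.0, 1.0])  # hand-picked weights
b = -1.5                  # hand-picked bias

def classify(point):
    """
    Single neuron with a Heaviside step activation
    Args:
        point: An (x1, x2) coordinate
    Returns:
        0 for Class A, 1 for Class B
    """
    z = np.dot(w, np.array(point)) + b
    return 1 if z >= 0 else 0

for point in [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0), (1, 0.6)]:
    print(point, "->", classify(point))
# (0, 0) -> 0, (1, 0) -> 0, (0, 1) -> 0, (1, 1) -> 1, (0.5, 0) -> 0, (1, 0.6) -> 1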

Limitations

We purposely used an easily separable AND gate dataset that was solvable with a single neuron. But what if our dataset was more complex? Let's analyze two cases where our simple architecture fails:

Multiple linear boundaries are needed to separate the classes.
A non-linear boundary is needed to separate the classes.

These datasets are not linearly separable. In order to generate more complex decision boundaries, we need hidden layers with more neurons and non-linear, differentiable activation functions like Sigmoid or ReLU.

A Neural Network

While individual neurons have limited capabilities, stacking them in layers and combining multiple layers creates systems that are capable of learning more complex, non-linear patterns.

A neural network made up of layers of neurons.

More Activation Functions

Each layer can be assigned a unique activation function, giving the network even more flexibility. Let's look at a couple of popular ones.

ReLU (Rectified Linear Unit)

ReLU is a good activation to use for hidden layers as it is less likely to produce vanishing gradients. It's also more efficient to compute compared to other activation functions.

The ReLU function returns 0 for negative inputs and the input value for positive inputs

It's defined and derived as follows:

$$ \sigma(z) = \max(0, z) $$
$$ \sigma'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{otherwise} \end{cases} $$
import numpy as np

def relu(z):
    """
    ReLU activation function
    Args:
        z: Input tensor
    Returns:
        Tensor after applying ReLU
    """
    return np.maximum(0, z)

def relu_prime(z):
    """
    Derivative of ReLU
    Args:
        z: Input tensor
    Returns:
        Gradient tensor
    """
    return (z > 0).astype(float)

Sigmoid (Logistic Function)

Sigmoid is great for binary classification problems where outputs need to be between $0$ and $1$.

The sigmoid function maps any real number to a value between 0 and 1

It's defined and derived as follows:

$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$
$$ \sigma'(z) = \sigma(z)(1 - \sigma(z)) $$
def sigmoid(z):
    """
    Sigmoid activation function
    Args:
        z: Input tensor
    Returns:
        Tensor after applying sigmoid
    """
    return 1 / (1 + np.exp(-z))
 
def sigmoid_prime(z):
    """
    Derivative of sigmoid
    Args:
        z: Input tensor
    Returns:
        Gradient tensor
    """
    s = sigmoid(z)
    return s * (1 - s)

$\tanh(z)$ (Hyperbolic Tangent)

$\tanh(z)$ should be used when layers need some sort of zero-centered output (popular in RNNs) and you need stronger gradients than sigmoid provides.

The tanh function maps any real number to a value between -1 and 1

It's defined and derived as follows:

$$ \sigma(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} $$
$$ \sigma'(z) = 1 - \tanh^{2}(z) $$
def tanh(z):
    """
    Tanh activation function
    Args:
        z: Input tensor
    Returns:
        Tensor after applying tanh
    """
    return np.tanh(z)
 
def tanh_prime(z):
    """
    Derivative of tanh
    Args:
        z: Input tensor
    Returns:
        Gradient tensor
    """
    return 1 - np.tanh(z)**2

The Loss Function

The loss function quantifies how far our network's predictions are from the true targets. There are many types of loss functions, but for our binary classification model, the best choice is Binary Cross-Entropy (BCE). Logarithms can act funny for exact values of $0$ and $1$, so a small constant needs to be added to prevent numerical instability.

It's defined and derived as follows:

$$ \mathcal{L}(y, \hat{y}) = -\frac{1}{N}\sum_{i=1}^{N} \left[ y_i \cdot \log(\hat{y}_i) + (1 - y_i) \cdot \log(1 - \hat{y}_i) \right] $$
$$ \frac{\partial \mathcal{L}}{\partial \hat{y}} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})} $$

Where:

  • $N$ is the number of samples
  • $y_i$ is the true label ($0$ or $1$)
  • $\hat{y}_i$ is the predicted probability (between $0$ and $1$)
def loss(p, Y):
    """
    Binary Cross-Entropy loss function
    Args:
        p: Predicted probabilities (0-1)
        Y: True labels (0 or 1)
    Returns:
        Computed loss value
    """
    eps = 1e-12  # Small constant for numerical stability
    p = np.clip(p, eps, 1 - eps)  # Clip values to avoid log(0)
    return -np.mean(Y * np.log(p) + (1 - Y) * np.log(1 - p))
 
def loss_prime(p, Y):
    """
    Derivative of Binary Cross-Entropy loss
    Args:
        p: Predicted probabilities (0-1)
        Y: True labels (0 or 1)
    Returns:
        Gradient of the loss
    """
    eps = 1e-12  # Small constant for numerical stability
    p = np.clip(p, eps, 1 - eps)  # Clip values to avoid division by 0
    return (p - Y) / (p * (1 - p)) / len(Y)  # Normalized by batch size

Network Training

Let's manually go through an iteration of training to understand the mathematics behind learning.

Network Architecture

  • Input Layer: Single node ($x = 0.5$)
  • Hidden Layer: 1 neuron with ReLU activation
    • Weight ($w_1$) $= 0.8$
    • Bias ($b_1$) $= 0.2$
  • Output Layer: 1 neuron with Sigmoid activation
    • Weight ($w_2$) $= -1.0$
    • Bias ($b_2$) $= 0.5$
  • Target: $y = 1$ (binary classification)
  • Learning Rate: $\eta = 0.1$

Forward Propagation

Hidden Layer Calculation:

$$ z_1 = w_1 \cdot x + b_1 = 0.8 \times 0.5 + 0.2 = 0.6 $$
$$ h_1 = \text{ReLU}(z_1) = \max(0, 0.6) = 0.6 $$

Output Layer Calculation:

$$ z_2 = w_2 \cdot h_1 + b_2 = -1.0 \times 0.6 + 0.5 = -0.1 $$
$$ y_1 = \sigma(z_2) = \frac{1}{1 + e^{-(-0.1)}} \approx 0.475 \quad \text{(prediction)} $$

Loss Calculation (BCE):

$$ \mathcal{L} = -\left[ y\log(y_1) + (1 - y)\log(1 - y_1) \right] \approx -\left[ 1 \cdot \log(0.475) + 0 \right] \approx 0.744 $$

Our network predicts a $47.5\%$ probability for the positive class, with a loss of $0.744$.
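Here's a tiny NumPy sketch of this forward pass, just to double-check the numbers above (the variable names simply mirror the notation in the equations):

import numpy as np

x, y = 0.5, 1.0        # input and target
w1, b1 = 0.8, 0.2      # hidden layer parameters
w2, b2 = -1.0, 0.5     # output layer parameters

z1 = w1 * x + b1               # 0.6
h1 = max(0.0, z1)              # ReLU -> 0.6
z2 = w2 * h1 + b2              # -0.1
y1 = 1 / (1 + np.exp(-z2))     # sigmoid -> ~0.475

loss = -(y * np.log(y1) + (1 - y) * np.log(1 - y1))  # BCE -> ~0.744
print(y1, loss)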

Backward Propagation

Unfortunately, the fun part is now over. In order to update the parameters of our network based on the accuracy of our predicted output, we have to traverse backwards through the network, calculating all the gradients as we go.

Output Layer Gradient Calculation:

$$ \frac{\partial\mathcal{L}}{\partial y_1} = -\frac{1}{y_1} \approx -2.105 $$
$$ \frac{\partial\mathcal{L}}{\partial z_2} = \frac{\partial\mathcal{L}}{\partial y_1} \cdot y_1(1 - y_1) \approx -2.105 \times 0.475 \times 0.525 \approx -0.524 $$
$$ \frac{\partial\mathcal{L}}{\partial w_2} = \frac{\partial\mathcal{L}}{\partial z_2} \cdot h_1 \approx -0.524 \times 0.6 \approx -0.314 $$
$$ \frac{\partial\mathcal{L}}{\partial b_2} = \frac{\partial\mathcal{L}}{\partial z_2} \approx -0.524 $$

Hidden Layer Gradient Calculation:

$$ \frac{\partial\mathcal{L}}{\partial h_1} = \frac{\partial\mathcal{L}}{\partial z_2} \cdot w_2 \approx -0.524 \times -1.0 = 0.524 $$
$$ \frac{\partial\mathcal{L}}{\partial z_1} = \frac{\partial\mathcal{L}}{\partial h_1} \cdot \mathbb{I}(z_1 > 0) = 0.524 \times 1 = 0.524 $$
$$ \frac{\partial\mathcal{L}}{\partial w_1} = \frac{\partial\mathcal{L}}{\partial z_1} \cdot x \approx 0.524 \times 0.5 \approx 0.262 $$
$$ \frac{\partial\mathcal{L}}{\partial b_1} = \frac{\partial\mathcal{L}}{\partial z_1} \approx 0.524 $$

Parameter Updates

Now that the painful part is over, we can update the network parameters using the values found during backpropagation.

$$ w_1 \leftarrow w_1 - \eta \cdot \frac{\partial\mathcal{L}}{\partial w_1} \approx 0.8 - 0.1 \times 0.262 \approx 0.774 $$
$$ b_1 \leftarrow b_1 - \eta \cdot \frac{\partial\mathcal{L}}{\partial b_1} \approx 0.2 - 0.1 \times 0.524 \approx 0.148 $$
$$ w_2 \leftarrow w_2 - \eta \cdot \frac{\partial\mathcal{L}}{\partial w_2} \approx -1.0 - 0.1 \times (-0.314) \approx -0.969 $$
$$ b_2 \leftarrow b_2 - \eta \cdot \frac{\partial\mathcal{L}}{\partial b_2} \approx 0.5 - 0.1 \times (-0.524) \approx 0.552 $$
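And a matching sketch of the backward pass and the update step; it re-runs the forward pass so it's self-contained, and the gradient variables just mirror the derivatives above:

import numpy as np

x, y, eta = 0.5, 1.0, 0.1
w1, b1, w2, b2 = 0.8, 0.2, -1.0, 0.5

# Forward pass (same as before)
z1 = w1 * x + b1
h1 = max(0.0, z1)
z2 = w2 * h1 + b2
y1 = 1 / (1 + np.exp(-z2))

# Backward pass: chain rule, output layer first
dL_dy1 = -(y / y1) + (1 - y) / (1 - y1)     # BCE derivative, ~-2.105 for y = 1
dL_dz2 = dL_dy1 * y1 * (1 - y1)             # through sigmoid, ~-0.524
dL_dw2 = dL_dz2 * h1                        # ~-0.314
dL_db2 = dL_dz2                             # ~-0.524

dL_dh1 = dL_dz2 * w2                        # ~0.524
dL_dz1 = dL_dh1 * (1.0 if z1 > 0 else 0.0)  # through ReLU
dL_dw1 = dL_dz1 * x                         # ~0.262
dL_db1 = dL_dz1                             # ~0.524

# Gradient descent step
w1 -= eta * dL_dw1   # ~0.774
b1 -= eta * dL_db1   # ~0.148
w2 -= eta * dL_dw2   # ~-0.969
b2 -= eta * dL_db2   # ~0.552
print(w1, b1, w2, b2)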

We’ve completed one training iteration (forward pass, backward pass, and parameter update). The network adjusts its weights and biases using gradients to minimize the loss function over hundreds or thousands of iterations until the loss converges to a minimum (as close to 0 as possible).

Additional Work

Neural Networks from Scratch (Jupyter Notebook)