Gradient Descent - Yousef's Notes

#Overview

Gradient descent is an optimization algorithm that minimizes a function by iteratively moving in the direction of steepest descent (negative gradient).

#Basic Principle

To minimize a loss function $J(\theta)$, update parameters:

$$ \theta_{t+1} = \theta_t - \alpha \nabla J(\theta_t) $$

Where:

$\theta$ = model parameters (weights)
$\alpha$ = learning rate (step size)
$\nabla J(\theta)$ = gradient of loss with respect to parameters

#Standard Gradient Descent (Batch)

#Pseudocode

Algorithm BatchGradientDescent(model, X, y, loss_fn, alpha, epochs)
Input:
    - model: model with parameters θ
    - X: training data (N samples × D features)
    - y: ground truth labels (N samples)
    - loss_fn: loss function J(θ; X, y)
    - alpha: learning rate
    - epochs: number of iterations

Output:
    - trained model with optimized parameters

1. FOR epoch = 1 TO epochs DO:
2.
3.     // Forward pass: compute predictions
4.     y_pred ← model.FORWARD(X)
5.
6.     // Compute loss
7.     loss ← loss_fn(y_pred, y)
8.
9.     // Backward pass: compute gradients
10.    gradients ← COMPUTE_GRADIENT(loss, model.parameters)
11.
12.    // Update parameters
13.    FOR each parameter θ IN model.parameters DO:
14.        θ ← θ - alpha × gradients[θ]
15.    END FOR
16.
17.    // Optional: monitor progress
18.    IF epoch % 10 == 0 THEN:
19.        PRINT("Epoch:", epoch, "Loss:", loss)
20.    END IF
21.
22. END FOR

23. RETURN model

#Stochastic Gradient Descent (SGD)

Uses one sample at a time (or small batches):

Algorithm StochasticGradientDescent(model, X, y, loss_fn, alpha, epochs)
Input:
    - model: model with parameters θ
    - X, y: training data
    - loss_fn: loss function
    - alpha: learning rate
    - epochs: number of passes through data

Output:
    - trained model

1. N ← LENGTH(X)

2. FOR epoch = 1 TO epochs DO:
3.
4.     // Shuffle data
5.     indices ← RANDOM_PERMUTATION(N)
6.
7.     FOR i = 1 TO N DO:           // One sample at a time
8.
9.         idx ← indices[i]
10.        x_i ← X[idx]
11.        y_i ← y[idx]
12.
13.        // Forward pass (single sample)
14.        y_pred ← model.FORWARD(x_i)
15.
16.        // Compute loss for this sample
17.        loss ← loss_fn(y_pred, y_i)
18.
19.        // Backward pass
20.        gradients ← COMPUTE_GRADIENT(loss, model.parameters)
21.
22.        // Update parameters
23.        FOR each parameter θ IN model.parameters DO:
24.            θ ← θ - alpha × gradients[θ]
25.        END FOR
26.
27.    END FOR
28.
29. END FOR

30. RETURN model

#Mini-Batch Gradient Descent

Most commonly used - balances speed and stability:

Algorithm MiniBatchGradientDescent(model, X, y, loss_fn, alpha,
                                    epochs, batch_size)
Input:
    - model: model with parameters θ
    - X, y: training data (N samples)
    - loss_fn: loss function
    - alpha: learning rate
    - epochs: number of passes through data
    - batch_size: number of samples per batch

Output:
    - trained model

1. N ← LENGTH(X)
2. num_batches ← CEIL(N / batch_size)

3. FOR epoch = 1 TO epochs DO:
4.
5.     // Shuffle data
6.     indices ← RANDOM_PERMUTATION(N)
7.
8.     FOR batch = 0 TO num_batches-1 DO:
9.
10.        // Get batch indices
11.        start ← batch × batch_size
12.        end ← MIN(start + batch_size, N)
13.        batch_indices ← indices[start:end]
14.
15.        X_batch ← X[batch_indices]
16.        y_batch ← y[batch_indices]
17.
18.        // Forward pass (batch)
19.        y_pred ← model.FORWARD(X_batch)
20.
21.        // Compute average loss for batch
22.        loss ← MEAN(loss_fn(y_pred, y_batch))
23.
24.        // Zero gradients (important!)
25.        ZERO_GRADIENTS(model.parameters)
26.
27.        // Backward pass
28.        gradients ← COMPUTE_GRADIENT(loss, model.parameters)
29.
30.        // Update parameters
31.        FOR each parameter θ IN model.parameters DO:
32.            θ ← θ - alpha × gradients[θ]
33.        END FOR
34.
35.    END FOR
36.
37. END FOR

38. RETURN model

#Gradient Descent with Momentum

Adds “inertia” to help escape local minima and accelerate convergence:

Algorithm SGDWithMomentum(model, X, y, loss_fn, alpha, epochs,
                          batch_size, beta)
Input:
    - beta: momentum coefficient (typically 0.9)

Output:
    - trained model

1. N ← LENGTH(X)
2.
3. // Initialize velocity for each parameter
4. velocity ← {}
5. FOR each parameter θ IN model.parameters DO:
6.     velocity[θ] ← ZEROS(SHAPE(θ))
7. END FOR

8. FOR epoch = 1 TO epochs DO:
9.
10.    // Shuffle and create batches
11.    indices ← RANDOM_PERMUTATION(N)
12.
13.    FOR each batch IN CREATE_BATCHES(indices, batch_size) DO:
14.
15.        X_batch ← X[batch]
16.        y_batch ← y[batch]
17.
18.        // Forward and backward pass
19.        y_pred ← model.FORWARD(X_batch)
20.        loss ← MEAN(loss_fn(y_pred, y_batch))
21.        gradients ← COMPUTE_GRADIENT(loss, model.parameters)
22.
23.        // Update with momentum
24.        FOR each parameter θ IN model.parameters DO:
25.            // Update velocity
26.            velocity[θ] ← beta × velocity[θ] + gradients[θ]
27.
28.            // Update parameter
29.            θ ← θ - alpha × velocity[θ]
30.        END FOR
31.
32.    END FOR
33.
34. END FOR

35. RETURN model

#Adam Optimizer (Adaptive Moment Estimation)

Combines momentum with adaptive learning rates:

Algorithm AdamOptimizer(model, X, y, loss_fn, alpha, epochs,
                       batch_size, beta1, beta2, epsilon)
Input:
    - alpha: learning rate (typically 0.001)
    - beta1: first moment decay (typically 0.9)
    - beta2: second moment decay (typically 0.999)
    - epsilon: small constant for numerical stability (typically 1e-8)

Output:
    - trained model

1. N ← LENGTH(X)
2. t ← 0    // Time step

3. // Initialize first and second moment estimates
4. m ← {}   // First moment (mean of gradients)
5. v ← {}   // Second moment (uncentered variance)
6. FOR each parameter θ IN model.parameters DO:
7.     m[θ] ← ZEROS(SHAPE(θ))
8.     v[θ] ← ZEROS(SHAPE(θ))
9. END FOR

10. FOR epoch = 1 TO epochs DO:
11.
12.    indices ← RANDOM_PERMUTATION(N)
13.
14.    FOR each batch IN CREATE_BATCHES(indices, batch_size) DO:
15.
16.        t ← t + 1
17.
18.        X_batch ← X[batch]
19.        y_batch ← y[batch]
20.
21.        // Forward and backward pass
22.        y_pred ← model.FORWARD(X_batch)
23.        loss ← MEAN(loss_fn(y_pred, y_batch))
24.        gradients ← COMPUTE_GRADIENT(loss, model.parameters)
25.
26.        // Update each parameter
27.        FOR each parameter θ IN model.parameters DO:
28.            g ← gradients[θ]
29.
30.            // Update biased first moment estimate
31.            m[θ] ← beta1 × m[θ] + (1 - beta1) × g
32.
33.            // Update biased second raw moment estimate
34.            v[θ] ← beta2 × v[θ] + (1 - beta2) × g²
35.
36.            // Compute bias-corrected estimates
37.            m_hat ← m[θ] / (1 - beta1^t)
38.            v_hat ← v[θ] / (1 - beta2^t)
39.
40.            // Update parameter
41.            θ ← θ - alpha × m_hat / (SQRT(v_hat) + epsilon)
42.        END FOR
43.
44.    END FOR
45.
46. END FOR

47. RETURN model

#Learning Rate Scheduling

Algorithm StepDecay(initial_lr, epoch, drop_factor, epochs_drop)
    lr ← initial_lr × drop_factor^FLOOR(epoch / epochs_drop)
    RETURN lr

Algorithm ExponentialDecay(initial_lr, epoch, decay_rate)
    lr ← initial_lr × EXP(-decay_rate × epoch)
    RETURN lr

Algorithm CosineAnnealing(initial_lr, epoch, max_epochs)
    lr ← initial_lr × (1 + COS(π × epoch / max_epochs)) / 2
    RETURN lr

#Comparison of Optimizers

Optimizer	Update Rule	Pros	Cons
SGD	$\theta -= \alpha \nabla J$	Simple, generalizes well	Slow, gets stuck
SGD + Momentum	$v = \beta v + \nabla J$ $\theta -= \alpha v$	Faster convergence	Extra hyperparameter
RMSprop	$\theta -= \alpha \frac{\nabla J}{\sqrt{v} + \epsilon}$	Good for non-stationary	Can diverge
Adam	Adaptive + momentum	Fast, works well	May not generalize as well

#Python Implementation (PyTorch)

import torch
import torch.nn as nn
import torch.optim as optim

# Define model, loss, and optimizer
model = MyModel()
criterion = nn.CrossEntropyLoss()

# Choose optimizer
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# or
optimizer = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))

# Training loop
for epoch in range(num_epochs):
    for batch_x, batch_y in dataloader:
        # Forward pass
        outputs = model(batch_x)
        loss = criterion(outputs, batch_y)

        # Backward pass
        optimizer.zero_grad()    # Clear gradients
        loss.backward()          # Compute gradients
        optimizer.step()         # Update parameters