- Simplest type of artificial neural network: the single-/multi-layer perceptron.
- Use: classification and regression tasks for structured data and simple features.
- Data flow: one direction (input → hidden layers → output); no loops or cycles.
- Structure: input layer, N hidden layers, output layer.
- Activation function (see the sketch after this list):
- Rectified linear unit (ReLU) for hidden layers
- Sigmoid for binary classification
- Softmax for multi-class classification.
- Loss function:
- MSE
- cross-entropy loss
- Learning: gradient descent + the chain rule of calculus (backpropagation) to compute the weight updates.
- Pros: simple implementation.
- Cons: not efficient for image data, needs large datasets for deep architectures, can suffer from vanishing gradients without careful design.
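A minimal NumPy sketch of these activations and losses; the helper names are mine for illustration, not from any specific library:

```python
import numpy as np

def relu(z):                      # hidden layers
    return np.maximum(0, z)

def sigmoid(z):                   # output for binary classification
    return 1 / (1 + np.exp(-z))

def softmax(z):                   # output for multi-class classification
    e = np.exp(z - np.max(z))     # subtract the max for numerical stability
    return e / e.sum()

def mse(y, y_hat):                # regression loss
    return np.mean((y - y_hat) ** 2)

def cross_entropy(y, y_hat, eps=1e-12):   # classification loss, y one-hot
    return -np.sum(y * np.log(np.clip(y_hat, eps, 1.0)))

z = np.array([-1.0, 0.5, 2.0])
print(relu(z), sigmoid(z), softmax(z))
```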
#Execution Model
- Features are encoded, and each feature is mapped to an input neuron ($X_n$).
- Each input neuron is connected to every neuron of the following layer through a weight ($W_n$).
- Each receiving neuron performs two operations:
- Weighted sum: $Z=WX+b$
- Activation function: e.g. $f(z)=\max(0,z)$
- This repeats for each neuron of each layer until the output layer outputs prediction $\hat{Y}$.
- Usually, hidden and output layers have different activation functions.
- After the model outputs $\hat{Y}$, we compare it to the true value $Y$ using a Loss Function:
- Binary Classification: Binary cross-entropy
- Multi-class classification: categorical cross-entropy
- Regression: mean squared error (MSE)
- We adjust the weights (backpropagation) using gradient descent; see the sketch after this list:
- Compute how much each weight contributed to the error.
- Adjust weights to reduce the error.
- Repeat for many iterations (epochs) until the model improves.
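A toy NumPy sketch of one training loop for a single hidden layer: forward pass (weighted sums + activations), binary cross-entropy loss, backpropagation via the chain rule, and a gradient-descent weight update. The data, layer sizes, and learning rate are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))              # batch of 4 records, 3 features
Y = np.array([[1.], [0.], [1.], [0.]])   # binary labels

W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)   # hidden layer: 5 neurons
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)   # output layer: 1 neuron

lr = 0.1
for epoch in range(100):
    # Forward pass
    Z1 = X @ W1 + b1
    A1 = np.maximum(0, Z1)               # ReLU on the hidden layer
    Z2 = A1 @ W2 + b2
    Y_hat = 1 / (1 + np.exp(-Z2))        # sigmoid on the output layer

    # Binary cross-entropy loss
    loss = -np.mean(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))

    # Backpropagation (chain rule): how much each weight contributed to the error
    dZ2 = (Y_hat - Y) / len(X)
    dW2, db2 = A1.T @ dZ2, dZ2.sum(axis=0)
    dZ1 = (dZ2 @ W2.T) * (Z1 > 0)
    dW1, db1 = X.T @ dZ1, dZ1.sum(axis=0)

    # Gradient descent: adjust weights to reduce the error
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1
```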
#Examples
- Look at how an FNN processes textual input.
- Sentiment Analysis: classify the sentiment of a sentence as positive or negative
- Text Classification: Categorize sentences into topics (e.g. ML vs Science)
- How is input passed to the Neural Network?
- Encoding, e.g. One-Hot Encoding
- Mapping the encoded features to an input layer with as many neurons as there are features
- Dataset
- “I love machine learning”
- “Deep learning is very difficult”
- “Learning architectures is useful”
- Bag-of-Words encoder: 3 records, each with 10 features (one per unique word in the vocabulary); see the sketch below
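A sketch of this encoding with scikit-learn's `CountVectorizer`. The `token_pattern` override keeps one-letter words like "I"; the column order follows the vectorizer's alphabetical vocabulary, so it may differ from the example vectors used below:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "I love machine learning",
    "Deep learning is very difficult",
    "Learning architectures is useful",
]

# Override the default token pattern so one-letter words like "I" are kept
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the 10 unique words (vocabulary)
print(X.toarray())                         # 3 records x 10 features
```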
- We use the hyperparameter batch size to decide how many records to pass to the NN at each training iteration.
Batch Size = 1
- We pass a single input record, e.g., $X = [1, 0, 0, 0, 1, 1, 1, 0, 0, 0]$.
- We choose a weight initialization method (another hyperparameter) and get a weight vector, e.g., $W = [-0.26, -0.23, -0.25, -0.41, -1.16, 0.42, 0.36, -0.68, -0.19, -0.33]$.
- We choose a bias initialization method (yep, another hyperparameter) and get a bias vector, one bias value for each neuron of a layer.
- We compute the weighted sum ($Z = WX + b$): only the features equal to 1 contribute, so $Z = -0.26 - 1.16 + 0.42 + 0.36 + b = -0.64 + b$ (see the sketch below).
Why bias? With the bias, the decision boundaries can be anywhere in the input space, not just through the origin, which makes the model more flexible.
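A quick NumPy check of that single-record weighted sum; the bias value 0.1 is an arbitrary example:

```python
import numpy as np

X = np.array([1, 0, 0, 0, 1, 1, 1, 0, 0, 0])
W = np.array([-0.26, -0.23, -0.25, -0.41, -1.16, 0.42, 0.36, -0.68, -0.19, -0.33])
b = 0.1                    # example bias for a single neuron

Z = W @ X + b              # weighted sum: -0.64 + 0.1 = -0.54
A = max(0.0, Z)            # ReLU activation -> 0.0
print(Z, A)
```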
Batch Size > 1, e.g. 3
- We pass three input records to the NN as a matrix instead of as a vector.
- Assuming a hidden layer with four neurons, we have a 10×4 weight matrix $W$ (one column of weights per neuron).
- Compute the weighted sums for the hidden layer: $Z = XW + b$, a 3×4 matrix with one row per record and one column per neuron; see the sketch after this list.
- This is why we need GPUs/TPUs: matrix multiplications can be efficiently parallelized across thousands of cores (e.g., the RTX 4090 has 16,384 CUDA cores).
- Batch sizes vary depending on hardware, dataset size, and model type:
- CPU: 16-32; GPU: 64-128; server-grade GPU: 256-1024+; TPU: 512-4096
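A sketch of the batched computation in NumPy; the binary input rows and the 10×4 weight matrix are randomly generated stand-ins for the bag-of-words records and the initialized weights:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(3, 10)).astype(float)  # 3 records x 10 binary features
W = rng.normal(scale=0.5, size=(10, 4))             # weights: 10 inputs -> 4 hidden neurons
b = np.zeros(4)                                     # one bias per hidden neuron

Z = X @ W + b            # weighted sums, shape (3, 4): one row per record, one column per neuron
A = np.maximum(0, Z)     # ReLU activations of the hidden layer
print(Z.shape, A.shape)  # (3, 4) (3, 4)
```

The whole batch reduces to a single matrix multiplication, which is exactly the kind of operation GPUs/TPUs parallelize well.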
#Implementation
```python
import tensorflow as tf
import matplotlib.pyplot as plt
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification

# Generate a synthetic dataset (binary classification)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Split dataset into training (70%), validation (15%), and testing (15%) sets
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)

# Standardize the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # Fit on training data and transform
X_val = scaler.transform(X_val)          # Transform validation data using same scaler
X_test = scaler.transform(X_test)        # Transform test data using same scaler
```
- Standardization puts all features on a comparable scale.
- Prevents features with larger ranges from dominating those with smaller ranges.
- Without it, the NN over-weights large-range features and can overfit to them.
- Very different feature scales can slow down or even break NN training.
- It also helps avoid exploding/vanishing activations in deeper layers.
- Note: the scaler is fit on the training set only, then applied to the validation and test sets.
#Hyperparameters
- Model hyperparameters (illustrated in the Keras sketch after this list):
- Number of layers
- Number of neurons
- Activation function
- Compile hyperparameters:
- Optimizer
- Loss function
- Metric of training goodness
- Regularization hyperparameters:
- Dynamically adjust the number of epochs (early stopping)
- Metric to observe
- Tuning examples:
- Number of neurons per layer
- Different activation functions
- Training hyperparameters:
- Number of epochs
- Batch size
- Validation data
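A hedged sketch of how these hyperparameters map onto Keras code, continuing from the preprocessed X_train/X_val/X_test above; the layer sizes, optimizer, epochs, and batch size are illustrative choices, not necessarily the exact configuration behind the reported results:

```python
# Model hyperparameters: number of layers, neurons per layer, activation functions
model = Sequential([
    Input(shape=(X_train.shape[1],)),
    Dense(32, activation="relu"),      # hidden layer 1
    Dense(16, activation="relu"),      # hidden layer 2
    Dense(1, activation="sigmoid"),    # output layer for binary classification
])

# Compile hyperparameters: optimizer, loss function, metric of training goodness
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Training hyperparameters: number of epochs, batch size, validation data
history = model.fit(
    X_train, y_train,
    epochs=50,
    batch_size=32,
    validation_data=(X_val, y_val),
    verbose=0,
)

# Evaluate on the held-out test set
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"Accuracy on Test Set: {test_acc:.4f}, Loss on Test Set: {test_loss:.4f}")
```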
- Measuring whether the model is overfitting (see the plotting sketch after this list):
- Training vs. Validation Accuracy
- If the validation accuracy follows the training accuracy, the model is learning correctly.
- If the validation accuracy diverges, the model may be overfitting.
- Training vs. Validation Loss
- If the validation loss is decreasing, the model generalizes well.
- If the validation loss increases while the training loss decreases, the model is overfitting.
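A sketch of those diagnostic curves, using the `history` object returned by the fit sketch above and the `matplotlib` import from the implementation section:

```python
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Training vs. validation accuracy
ax1.plot(history.history["accuracy"], label="train")
ax1.plot(history.history["val_accuracy"], label="validation")
ax1.set_xlabel("epoch"); ax1.set_ylabel("accuracy"); ax1.legend()

# Training vs. validation loss
ax2.plot(history.history["loss"], label="train")
ax2.plot(history.history["val_loss"], label="validation")
ax2.set_xlabel("epoch"); ax2.set_ylabel("loss"); ax2.legend()

plt.tight_layout()
plt.show()
```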
- Final training results:
- Accuracy on Test Set: 0.8667
- Loss on Test Set: 0.3260
- PyTorch and Keras offer multiple optimization methods.
- E.g., EarlyStopping implements a callback for the model.fit method (see the sketch below).
- The callback fires at every epoch, reporting the value of a user-defined (hyperparameter) metric.
- Here, we monitor 'loss' and stop when it does not decrease for 5 consecutive epochs.
- This saves training time: training stops as soon as the monitored metric stops improving.
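A sketch of that setup with Keras's EarlyStopping callback, monitoring 'loss' with patience=5 as described above; the fit call mirrors the earlier training sketch, and the model would be rebuilt beforehand for a fair comparison:

```python
early_stop = EarlyStopping(monitor="loss", patience=5)

history = model.fit(
    X_train, y_train,
    epochs=50,                 # upper bound; early stopping may end training sooner
    batch_size=32,
    validation_data=(X_val, y_val),
    callbacks=[early_stop],    # fires at every epoch and checks the monitored metric
    verbose=0,
)

print("Stopped after", len(history.history["loss"]), "epochs")
```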
- Final training results with 15 epochs instead of 50:
- Final Accuracy on Test Set: 0.8533 vs. 0.8667
- Final Loss on Test Set: 0.3459 vs. 0.3260