Introduction to Neural Networks
Model Evaluation
Splitting the Data
Once a neural network is trained, we need a way to evaluate its performance on unseen data. This helps us understand whether the model has truly learned useful patterns or has simply memorized the training data. To achieve this, we split the dataset into two parts:
- Training Set: this portion of the data is used to train the neural network, allowing it to adjust weights and biases through backpropagation;
- Test Set: after training, the model is evaluated on this separate dataset to measure how well it generalizes to new, unseen examples.
A typical split is 80% training / 20% testing, though this can vary depending on the size and complexity of the dataset.
Train/test splitting is typically performed using the train_test_split() function from the sklearn.model_selection module:
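A minimal sketch of such a split, assuming the features and labels are stored in arrays X and y (the toy data and the random_state value below are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with 4 features each and binary labels
X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)

# Hold out 10% of the samples as the test set;
# random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42
)

print(X_train.shape, X_test.shape)  # (90, 4) (10, 4)
```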
The test_size parameter specifies the proportion of the dataset to be used as the test set. For example, setting test_size=0.1 means that 10% of the data will be used for testing, while the remaining 90% will be used for training.
If a model performs well on the training data but poorly on the test data, it may be overfitting, meaning it has memorized the training set rather than learning generalizable patterns. The goal is a model that performs well on both sets, which indicates it has learned patterns that generalize to new data.
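One simple way to check for this in practice is to compare accuracy on the training set with accuracy on the test set. A minimal sketch, reusing the split from above and assuming model is any already-trained classifier with a predict() method (the name model is illustrative):

```python
from sklearn.metrics import accuracy_score

# Compare performance on data the model has seen vs. unseen data
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))

print(f"Train accuracy: {train_acc:.2f}")
print(f"Test accuracy:  {test_acc:.2f}")
# A train accuracy far above the test accuracy suggests overfitting
```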
Once the model is trained, we need to quantify its performance using metrics. The choice of metric depends on the specific classification task.
Classification Metrics
For classification problems, several key metrics can be used to evaluate the model's predictions:
- accuracy;
- precision;
- recall;
- F1-score.
Since a perceptron performs binary classification, creating a confusion matrix will help you understand these metrics better.
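As a sketch, scikit-learn's confusion_matrix() builds this table from the true labels and the model's predictions (the label arrays below are illustrative):

```python
from sklearn.metrics import confusion_matrix

# True labels and predicted labels for a small binary example
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
# [[3 1]
#  [1 3]]
```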
Accuracy measures the proportion of correctly classified samples out of the total. If a model correctly classifies 90 out of 100 images, its accuracy is 90%.
While accuracy is useful, it may not always provide a full picture—especially for imbalanced datasets. For example, in a dataset where 95% of samples belong to one class, a model could achieve 95% accuracy just by always predicting the majority class—without actually learning anything useful. In such cases, precision, recall, or the F1-score might be more informative.
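A small illustration of this trap, using made-up labels where 95% of samples are negative and a "model" that always predicts the majority class:

```python
from sklearn.metrics import accuracy_score

# 95 negative samples and 5 positive samples
y_true = [0] * 95 + [1] * 5

# A "model" that always predicts the majority class
y_pred = [0] * 100

# Accuracy looks impressive even though no positive case was found
print(accuracy_score(y_true, y_pred))  # 0.95
```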
Precision is the percentage of correctly predicted positive cases out of all predicted positives. This metric is particularly useful when false positives are costly, such as in spam detection or fraud detection.
Recall (sensitivity) measures how many of the actual positive cases the model correctly identifies. A high recall is essential in scenarios where false negatives must be minimized, such as medical diagnoses.
F1-score is the harmonic mean of precision and recall, providing a balanced measure when both false positives and false negatives are important. This is useful when the dataset is imbalanced, meaning one class appears significantly more than the other.
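All of these metrics are available in sklearn.metrics. A minimal sketch, reusing the illustrative labels from the confusion-matrix example above:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Precision: TP / (TP + FP) = 3 / 4 = 0.75
print(precision_score(y_true, y_pred))

# Recall: TP / (TP + FN) = 3 / 4 = 0.75
print(recall_score(y_true, y_pred))

# F1-score: harmonic mean of precision and recall = 0.75
print(f1_score(y_true, y_pred))
```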