
PVL 11

⬅️ [Test Format and Hints](<./Test Format and Hints.md>) | ⬆️ [PVL Summaries](<./README.md>) | [PVL 10](<./PVL 10.md>) ➡️

(Note: The instructor explicitly stated that you will not be tested on calculating derivatives/differentiation, the biological analogies of neurons, or the recent deep learning highlights like GANs, RL, and Bayesian nets. Those have been omitted to focus strictly on testable core concepts).

1. Types of Recognition Problems in High-Level Computer Vision

Visual recognition is the problem of designing models to query visual data. Because appearance, lighting, scale, occlusion, and viewpoint all vary, high-level computer vision is essentially a search for intra-class invariance: representations that stay stable across these variations within a class.
* **Classification:** Determines whether an image belongs to a specific category (e.g., "Is this a building image?").
* **Detection:** Finds whether an object of a certain class is present in an image and identifies its location, typically using a bounding box in space or spacetime.
* **Semantic Segmentation:** Assigns a class label to every single pixel in an image (e.g., labeling pixels as 'clock', 'car', or 'person').
* **Attribute Estimation:** Estimates specific semantic or geometric properties, such as transparency, distance to a surface, or incline.
* **Instance Recognition & Tracking:** Identifies specific instances of a class (e.g., "Person 1" vs. "Person 2", or a specific building like the Marshall Field Building) and tracks them over time in video.
* **Activity/Event Recognition:** Understands the collective action or interaction occurring in the scene, such as people crossing a street.

2. Linear Classifiers

A linear classifier maps vectorized image data to class scores using a weight matrix.
* **The Formulation:** The score vector is computed as $S = Wx$, where $x$ is the vectorized image and $W$ is the weight matrix.
* **Kernels/Templates:** Each row of the weight matrix $W$ represents a template or "kernel" for a specific class. The operation projects the image onto the class basis; for example, a "sky" kernel would have weights that strongly match blue pixels at the top of an image.
* **Decision Rule:** The classifier chooses the class with the highest score: $t^* = \text{argmax}_t\, S_t$.
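A minimal NumPy sketch of this scoring and decision rule (the weight values, class count, and image size below are made up for illustration):

```python
import numpy as np

# Hypothetical setup: 3 classes, images flattened to 4 pixels.
# Each row of W is a template for one class.
W = np.array([
    [0.2, -0.1, 0.0, 0.5],   # template for class 0 (e.g., "sky")
    [-0.3, 0.4, 0.1, 0.0],   # template for class 1
    [0.1, 0.1, -0.2, 0.3],   # template for class 2
])
x = np.array([0.9, 0.2, 0.4, 0.8])  # vectorized image

S = W @ x                    # class scores, S = Wx
t_star = int(np.argmax(S))   # decision rule: pick the highest-scoring class
```

Here each entry of `S` is the dot product of the image with one class template, so a well-matched template yields a high score.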

3. Loss Functions and Regularization

To learn the optimal weights ($W$), the model requires a loss function to evaluate performance.
* **Hinge Loss:** A standard loss function that compares the correct class score against the incorrect class scores. For a single sample: $L_i = \sum_{j \neq y_i} \max(0, S_j - S_{y_i} + \Delta)$.
    * **$\Delta$ (Delta):** A constant margin that forces separation between the correct label's score and the incorrect labels' scores. A larger delta makes the learning problem harder.
* **L2 Regularization:** Added to the total loss over the dataset to penalize large weights. It takes the form $\lambda \sum_{a,b} W_{ab}^2$.
    * **Purpose:** It constantly pulls the weights toward zero, which encourages generalization and avoids overtraining.
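These two terms can be sketched directly from the formulas above (the scores and $\lambda$ value are made up for illustration):

```python
import numpy as np

def hinge_loss(S, y, delta=1.0):
    """Multiclass hinge loss for one sample's score vector S with true label y."""
    margins = np.maximum(0.0, S - S[y] + delta)
    margins[y] = 0.0  # the correct class does not contribute to the sum
    return margins.sum()

def l2_penalty(W, lam=0.1):
    """L2 regularization term: lambda times the sum of squared weights."""
    return lam * np.sum(W ** 2)

S = np.array([2.0, 1.0, 3.5])   # hypothetical class scores
loss = hinge_loss(S, y=2)       # correct class already wins every margin
```

When the correct class beats every other score by more than $\Delta$, the hinge loss is exactly zero, which is why it stops pushing once the margin is satisfied.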

4. Optimization: Gradient Descent

Because direct analytical solutions are intractable, models iteratively descend the loss surface to find good weights.
* **Update Rule:** $\theta_{new} = \theta_{old} - \alpha \frac{\partial L}{\partial \theta}$, where $\alpha$ is the learning rate.
* **Challenges:** If the learning rate is too high, the model can overshoot the minimum. The loss surface is highly complex, so the model can get stuck in local minima.
* **Batching:** Because datasets are too large to fit into GPU memory, gradient descent is computed over subsets of the data.
    * **Stochastic Gradient Descent (SGD):** Computes the gradient using just one sample at a time.
    * **Mini-batch Gradient Descent:** Computes the gradient over a small subset of the data (e.g., 10, 16, or 100 samples) before updating the weights.
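The update rule and mini-batching above can be sketched on a toy problem; everything here (the one-parameter model, the data, the learning rate, the batch size) is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical task: fit y = 2x with a single weight theta via squared loss.
X = rng.normal(size=100)
Y = 2.0 * X

theta = 0.0
alpha = 0.1        # learning rate
batch_size = 10    # mini-batch size

for epoch in range(50):
    order = rng.permutation(len(X))            # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx], Y[idx]
        # Gradient of the mean squared loss L = mean((theta*x - y)^2) w.r.t. theta.
        grad = np.mean(2.0 * (theta * xb - yb) * xb)
        theta -= alpha * grad                  # theta_new = theta_old - alpha * dL/dtheta
```

Each update sees only `batch_size` samples, so the gradient is a noisy estimate of the full-dataset gradient, but the weight still converges to the true value.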

5. Deep Learning Core Concepts

Linear classifiers fail when the boundaries between classes (like cats, dogs, and birds) are complex and non-linear. Deep learning solves this by using a composition of many parameterized functions.
* **Forward Process (Propagation):** Passing data through the composed functions to compute activations and scores.
* **Backward Process (Backpropagation):** Applying the chain rule to compute the gradients of the loss with respect to the parameters at every step; these gradients are then used to update the weights.
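As a minimal illustration of the two passes (the network, the weights, and the numbers are all made up, and the derivatives here are only shown to make the chain rule concrete, not as exam material):

```python
# Tiny network: scalar input -> linear -> ReLU -> linear -> squared loss.
x, y = 1.5, 2.0      # hypothetical input and target
w1, w2 = 0.5, -1.0   # hypothetical weights

# Forward process: compute activations, the score, and the loss.
a = w1 * x               # first linear layer
h = max(0.0, a)          # ReLU activation
s = w2 * h               # second linear layer (the score)
L = (s - y) ** 2         # squared loss

# Backward process: apply the chain rule from the loss back to each weight.
dL_ds = 2.0 * (s - y)
dL_dw2 = dL_ds * h
dL_dh = dL_ds * w2
dL_da = dL_dh * (1.0 if a > 0 else 0.0)  # ReLU passes gradient only where a > 0
dL_dw1 = dL_da * x
```

Note how each backward line reuses a quantity cached during the forward pass; that reuse is the whole point of backpropagation.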

6. Common Neural Network Layers

Modern deep networks use a mix of specific functions or "layers".
* **Fully Connected Layers:** Layers in which every node is connected to every node in the subsequent layer. These are usually placed at the end of a network, once the representation has been condensed, as placing them early would result in an impossibly large weight matrix.
* **Convolutional Layers:** Utilize smaller kernels mapped over the whole image. They share weights, giving the network shift invariance.
* **Non-linear Activations:**
    * **Sigmoid:** $f(s) = \frac{1}{1 + e^{-s}}$.
    * **Hyperbolic Tangent (tanh):** Another standard non-linear function.
    * **ReLU (Rectified Linear Unit):** $f(s) = \max(0, s)$. Empirically shown to perform better than sigmoid and tanh in modern networks.
* **Pooling Layers:** Layers that downsample representations. The most common is Max Pooling, which simply outputs the maximum value of its inputs.
* **Dropout:** A regularization technique used to avoid overtraining, in which neurons are randomly "killed" (set to zero) during training. This forces the remaining network to spread out its knowledge, effectively acting as an ensemble of networks.
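The activation, pooling, and dropout operations above can be sketched in a few lines of NumPy (the 1-D pooling helper and the input values are simplifications for illustration; real pooling is 2-D, and this dropout omits the usual rescaling by $1/(1-p)$):

```python
import numpy as np

def sigmoid(s):
    """Sigmoid activation: f(s) = 1 / (1 + e^(-s))."""
    return 1.0 / (1.0 + np.exp(-s))

def relu(s):
    """ReLU activation: f(s) = max(0, s)."""
    return np.maximum(0.0, s)

def max_pool_1d(v, size=2):
    """Downsample by taking the max of each non-overlapping window."""
    return v.reshape(-1, size).max(axis=1)

def dropout(v, p, rng):
    """Randomly zero activations during training (rescaling omitted for simplicity)."""
    mask = (rng.random(v.shape) >= p).astype(v.dtype)
    return v * mask

v = np.array([1.0, -2.0, 3.0, 0.5])
pooled = max_pool_1d(relu(v))   # ReLU, then downsample by 2
```

Chaining `relu` and `max_pool_1d` like this mirrors the typical conv–activation–pool ordering inside a network.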

7. Famous Network Architectures

* **LeNet-5 (Yann LeCun):** The most famous early convolutional network, typically used for digit recognition (MNIST 28x28 images). It stacks Convolution $\rightarrow$ Max Pooling $\rightarrow$ Convolution $\rightarrow$ Max Pooling, followed by fully connected layers and a ReLU (originally a sigmoid).
* **AlexNet:** The network that re-popularized CNNs. It features five convolutional layers followed by three fully connected layers, utilizing Max Pooling and ReLU.
* **VGG16 / VGG19:** Follows a strict structural theme of sequential convolutional layers and max pooling to gradually condense spatial scale before applying fully connected layers.
* **ResNet:** Introduced a major architectural shift: each block's input is propagated forward alongside the layered computation (a skip connection). Instead of learning the entire mapping, each block learns only the residual change to add to its input. It generally outperforms VGG models.
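The ResNet idea reduces to one line: the block's output is its input plus a learned residual. A minimal sketch, with a made-up linear function standing in for the block's learned layers:

```python
import numpy as np

def residual_block(x, f):
    """Skip connection: output = input + learned residual f(x)."""
    return x + f(x)

# Hypothetical residual function: a small linear transform.
W = np.array([[0.1, 0.0],
              [0.0, 0.1]])
x = np.array([1.0, 2.0])
out = residual_block(x, lambda v: W @ v)
```

If the residual function outputs zeros, the block is exactly the identity, which is what makes very deep stacks of such blocks easy to optimize.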

The instructor specifically highlights two topics that you do not need to remember or worry about for upcoming quizzes or exams:

  • Differentiation: You will not be asked to perform differentiation on a quiz or exam. The instructor only demonstrated the differentiation of the loss function to show what the gradients look like, but explicitly told students not to worry about doing it themselves.
  • Recent deep learning highlights: You are not required to memorize or study the recent advancements in deep learning discussed at the end of the lecture (such as Generative Adversarial Networks, Bayesian deep nets, or deep reinforcement learning). The instructor noted that you will not be asked about these recent highlights, as they were only included to give you a sense of what is currently happening in the field.

⬅️ [Test Format and Hints](<./Test Format and Hints.md>) | ⬆️ [PVL Summaries](<./README.md>) | [PVL 10](<./PVL 10.md>) ➡️