PVL 07

⬅️ [PVL 08](<./PVL 08.md>) | ⬆️ [PVL Summaries](<./README.md>) | [PVL 06](<./PVL 06.md>) ➡️

Images as Graphs Aid Sheet

1. Motivation: Why Images as Graphs?

Limitations of "Images as Points": Operating solely on local patches (like 3x3 pixel windows) lacks an understanding of spatial continuity. Using simple point representations can incorrectly group visually similar but physically disconnected regions, such as parts of a background.
Challenges of Image Segmentation:
No Canonical Shape: Objects in the real world (e.g., horses, coffee mugs) have too many degrees of freedom and lack a simple, parameterizable form (unlike lines or circles that can be detected via Hough transforms).
Appearance Overlap: Foreground objects often share similar appearance or gray levels with background regions, making local patch-based descriptors prone to errors.
Applications: Semantic segmentation (e.g., distinguishing roads, pedestrians, and cars for autonomous driving), medical imaging (segmenting brain tissue or bones), and rotoscoping (delineating object boundaries for video editing).

2. Graph Fundamentals & Representations

Graph Definition: A graph $G = (V, E)$ consists of a finite set of vertices/nodes $V$ and a binary relation $E$ representing the edges.
Images to Graphs: Pixels serve as the nodes in the graph. Similarity measures, like $e^{-\text{distance}}$ (e.g., RGB difference), act as the weights on the edges.
Connectivity:
2D images typically use 4-neighbor or 8-neighbor connectivity.
3D volumes or video typically use 6-neighbor connectivity (to connect frames forward and backward in time).
Graph Terminology:
Path: A sequence of connected nodes from vertex $u$ to vertex $u'$.
Clique: A complete subgraph where every pair of nodes is directly connected by an edge. The "order" of a clique is the number of nodes it contains.
Data Structures for Graphs:
Adjacency List: Stores a list of neighbors for each vertex.
- Storage Complexity: $O(V + E)$.
- Lookup Complexity: $O(V)$ in the worst case, as it may require a linear search through a vertex's list of connected nodes.
Adjacency Matrix: A $V \times V$ matrix where entry $(i, j)$ is 1 if an edge exists and 0 otherwise.
- Storage Complexity: $O(V^2)$.
- Lookup Complexity: $O(1)$ constant time.

3. Early Image Segmentation Models

Piecewise Constant (Potts Model): Assumes that regions have a constant intensity. The energy function balances a goodness-of-fit term (how well pixels match a region) against a boundary smoothness term.
Piecewise Smooth (Mumford-Shah Model): A more flexible model that assumes the image function varies smoothly within a region but changes rapidly across boundaries. The energy function minimizes the $L_2$ norm on the gradient (to encourage smooth changes within regions) alongside a boundary smoothness penalty. This model often produces "cartoon-like" segmented images.

4. Graph-Based Segmentation Algorithms

Superpixels

Superpixels are clusters of pixels grouped by simple metrics like color similarity or small size, acting as a pre-processing stage to make complex algorithms faster. They are a strict segmentation of the nodes where their union forms the full image and their pairwise intersection is null.

Graph-Based Merging (Felzenszwalb & Huttenlocher)

A bottom-up, agglomerative technique that optimizes a global grouping metric.
Internal Difference $Int(R)$: The largest edge weight in a region's minimum spanning tree.
Difference between Regions $Dif(R_1, R_2)$: The minimum weight edge connecting the two regions.
Merge Criteria: Merges adjacent regions if their difference is smaller than their minimum internal difference: $Dif(R_1, R_2) < \min(Int(R_1) + \tau(R_1), Int(R_2) + \tau(R_2))$, where $\tau$ is a heuristic penalty based on region size.
Complexity: Extremely fast, running in $O(N \log N)$ time.

Normalized Cuts (Ncut) (Shi & Malik)

A top-down method that separates groups connected by weak affinities.
The Min-Cut Problem: A standard minimum cut seeks to split the graph by finding the smallest sum of cut edges. However, this is heavily biased toward isolating a single, outer node because that simply cuts the fewest edges.
The Ncut Solution: Normalizes the cut penalty by the volume (association) of the regions to encourage balanced segmentation sizes.
$Ncut(A, B) = \frac{cut(A, B)}{assoc(A, V)} + \frac{cut(A, B)}{assoc(B, V)}$.
Computation: Solving Ncut exactly is NP-complete, so it is relaxed into a generalized eigenvalue problem (SVD) and solved using spectral methods.

Segmentation by Weighted Aggregation (SWA)

An $O(N)$ hierarchical, multi-scale technique inspired by algebraic multigrid approaches.
It computes a "region saliency" metric (a per-region version of the normalized cut measure) to identify groups with high internal affinity and low boundary affinity. Nodes are iteratively aggregated into coarser layers to produce a bottom-up segmentation.

Graph Cuts and Energy-Based Methods

Formulates binary object segmentation as a discrete labeling problem (e.g., Foreground vs. Background) via a Markov Random Field (MRF).
Energy Function: $E(f) = \sum E_r(i,j) + E_b(i,j)$.
Region Term ($E_r$): The cost of assigning a pixel to a specific label, based on consistency with region statistics (like color histograms or Gaussian mixture models).
Boundary Term ($E_b$): Measures inconsistency between neighbors, penalizing boundaries that do not align with visible image edges.
Optimization: Solved by finding the minimum-cut / maximum-flow on a graph connecting source ($S$) and sink ($T$) nodes. The user can guide this process with interactive "seeds".
Directed Edges: Edge weights can be made directed (asymmetric) to prefer transitions from light to dark (or vice versa), which is highly useful in medical imaging.

5. Other Classical Segmentation Techniques

Watershed: Interprets the gradient magnitude image as a topological landscape. The algorithm "floods" catchment basins from local minima, forming boundaries where different evolving basins meet. Prone to over-segmentation unless initialized with user-provided seeds or local morphology constraints.
Mean Shift / Mode Finding: A non-parametric method that models pixel features (e.g., Luv* color and position) as a probability density function. It smoothes the distribution and searches for local peaks (modes) to identify clusters without imposing a specific parametric shape.

While the provided materials do not explicitly mention a formal "test" or "exam," the speaker does provide clear, direct guidance on what students should remember and what they will not be evaluated on:

The Most Important Concept to Remember: The speaker explicitly states the "most important thing for me to have you remember" and to "take home with you after the semester" is the underlying analogy about the remarkable capability of human visual perception. By demonstrating that humans can perfectly identify complex actions (like a parent picking up a child or an animal) just from looking at a few moving dots in a space-time video, the speaker emphasizes the power of structured visual representations, joking that human vision is so effective that "no one walked into a wall on the way here today".
Specific Material You Will Not Be Asked About: When discussing the iterative algorithm called "segmentation by weighted aggregation," the speaker explicitly relieves students from needing to memorize its mechanics. The speaker notes that the algorithm is very "bookkeeping oriented and it's not worth going through this and we won't ask you about it," though interested students are encouraged to review the slides on their own.

⬅️ [PVL 08](<./PVL 08.md>) | ⬆️ [PVL Summaries](<./README.md>) | [PVL 06](<./PVL 06.md>) ➡️