PVL 09
3D Computer Vision and Motion
1. 3D Geometric Primitives & Transformations
- Planes: Can be represented by a unit normal vector (direction) and a distance $d$ to the origin.
- Lines in 3D: Can be parameterized by $\lambda$ as $r(\lambda) = (1-\lambda)P + \lambda Q$, spanning between two points $P$ and $Q$.
- 3D Transformations & Degrees of Freedom (DOF):
- Translation: 3 DOF.
- Rigid/Euclidean: 6 DOF (3 for translation, 3 for axis-angle rotation).
- Similarity: 7 DOF (rigid plus 1 uniform scale).
- Affine: 12 DOF (arbitrary $3\times3$ linear part plus 3 translation).
- Projective (Homography in $\mathbb{P}^3$): 15 DOF (4x4 matrix, defined up to scale).
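As a minimal sketch of the 6-DOF rigid case, the rotation can be built from an axis-angle pair via Rodrigues' formula and combined with a translation into a 4x4 homogeneous transform (function names here are illustrative):

```python
import numpy as np

def rodrigues(axis, theta):
    """Rotation matrix from a unit axis and angle (Rodrigues' formula)."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])   # cross-product matrix of the axis
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def rigid_transform(axis, theta, t):
    """4x4 homogeneous rigid transform: 3 rotation DOF + 3 translation DOF."""
    T = np.eye(4)
    T[:3, :3] = rodrigues(axis, theta)
    T[:3, 3] = t
    return T

# Rotate 90 degrees about z, then translate by (1, 0, 0):
T = rigid_transform([0, 0, 1], np.pi / 2, [1, 0, 0])
p = T @ np.array([1.0, 0.0, 0.0, 1.0])   # point (1,0,0) maps to (1,1,0)
```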
- Spherical Linear Interpolation (Slerp): Used for animating rotations via quaternions; interpolates along the arc between two unit quaternions at constant angular velocity, based on the total angle between them.
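A minimal Slerp sketch over unit quaternions (the $(w, x, y, z)$ convention and the near-parallel threshold are implementation choices, not from the lecture):

```python
import numpy as np

def slerp(q0, q1, u):
    """Spherical linear interpolation between unit quaternions for u in [0, 1]."""
    q0 = q0 / np.linalg.norm(q0)
    q1 = q1 / np.linalg.norm(q1)
    dot = np.dot(q0, q1)
    if dot < 0.0:              # flip one quaternion to take the shorter arc
        q1, dot = -q1, -dot
    if dot > 0.9995:           # nearly parallel: fall back to normalized lerp
        q = q0 + u * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)     # total angle between the two quaternions
    return (np.sin((1 - u) * theta) * q0 + np.sin(u * theta) * q1) / np.sin(theta)

# Halfway between identity and a 90-degree rotation about z (w, x, y, z):
q_id = np.array([1.0, 0.0, 0.0, 0.0])
q_z90 = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
q_mid = slerp(q_id, q_z90, 0.5)   # a 45-degree rotation about z
```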
2. Camera Projection Models
- Central Perspective (Pinhole) Projection: Models real cameras where light passes through an aperture to an image plane.
- Fundamental Equations: $x' = f \frac{x}{z}$ and $y' = f \frac{y}{z}$.
- Properties: Highly non-linear because it involves division by depth ($z$). Distant objects appear smaller, points project to points, lines project to lines, and parallel lines meet.
- Matrix Form: In homogeneous coordinates, depth is absorbed into the homogeneous scale and discarded by the projection, making depth unrecoverable from a single 2D image.
- Orthographic Projection: Simply drops the $z$ component ($x = [I_{2\times2}|0] p$). It is a coarse approximation useful for telephoto lenses or when depth variation is shallow relative to distance. Does not involve per-pixel division.
- Weak Perspective (Scaled Orthography): Assumes all points are at a constant depth $Z_0$. Projects to a fronto-parallel plane, applying a uniform scale $s$ ($x = [sI_{2\times2}|0] p$). Useful when depth variations are dwarfed by absolute distance, like in satellite imagery.
- Para-perspective: Points are projected parallel to the line of sight to the object center onto a local reference plane, followed by scaling. Keeps parallel lines parallel.
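A minimal sketch contrasting perspective, orthographic, and weak-perspective projection (the focal length and test points are chosen purely for illustration):

```python
import numpy as np

f = 1.0  # focal length (assumed for illustration)

def perspective(p):
    """Central perspective: divide by depth z (non-linear)."""
    x, y, z = p
    return np.array([f * x / z, f * y / z])

def orthographic(p):
    """Orthographic: simply drop the z component."""
    return np.array(p[:2])

def weak_perspective(p, z0):
    """Weak perspective: one uniform scale s = f / Z0 for a reference depth Z0."""
    return (f / z0) * np.array(p[:2])

p_near = np.array([1.0, 1.0, 2.0])
p_far = np.array([1.0, 1.0, 10.0])
# Under perspective, the distant point projects closer to the image center:
x_near = perspective(p_near)   # (0.5, 0.5)
x_far = perspective(p_far)     # (0.1, 0.1)
```

Note that orthographic projection maps both points to the same image location, which is why it is only reasonable when depth variation is shallow relative to distance.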
3. Camera Parameters & The Camera Matrix
The total camera matrix $P = K[R|t]$ is a $3\times4$ matrix projecting 3D world coordinates to 2D image coordinates.
Extrinsic Parameters (6 DOF): Represents the camera's pose in the 3D world. Consists of a $3\times3$ Rotation matrix $R$ (3 DOF) and a $3\times1$ Translation vector $t$ (3 DOF). It maps 3D world coordinates into the camera's 3D reference frame.
Intrinsic Parameters ($K$ Matrix, 5 DOF): Maps points from the 3D camera frame to 2D pixel coordinates. Modeled as an upper-triangular matrix. The 5 parameters are:
1. Focal length $f_x$ (or $\alpha$).
2. Focal length $f_y$ (or $\beta$).
3. Principal point/optical center $c_x$ (or $u_0$).
4. Principal point/optical center $c_y$ (or $v_0$).
5. Skew $s$ (derived from the angle $\theta$ between the sensor axes).
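The full pipeline above can be sketched as follows; the numeric values chosen for $K$, $R$, and $t$ are illustrative only:

```python
import numpy as np

# Intrinsics K (5 DOF): fx, fy, skew s, principal point (cx, cy).
fx, fy, s, cx, cy = 800.0, 800.0, 0.0, 320.0, 240.0
K = np.array([[fx,   s, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])   # upper-triangular

# Extrinsics (6 DOF): here an identity rotation and a small translation.
R = np.eye(3)
t = np.array([0.0, 0.0, 5.0])

# Full 3x4 camera matrix P = K [R | t].
P = K @ np.hstack([R, t[:, None]])

def project(P, X):
    """Project a 3D world point to pixel coordinates via homogeneous division."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

u, v = project(P, np.array([0.0, 0.0, 0.0]))  # world origin -> principal point
```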
4. Camera Calibration
The goal of calibration is to estimate the intrinsic and extrinsic parameters.
3D Rig Method (Direct Linear Transform): Uses a highly calibrated 3D geometric rig with known 3D point locations.
* Minimum points required: 6 points. Each point provides 2 equations (x and y locations), yielding 12 equations to solve for the 11 unknown parameters (6 extrinsic + 5 intrinsic).
* Solved by setting up a homogeneous linear system and using Singular Value Decomposition (SVD).
* Degenerate case: Fails if all calibration points lie on a single 2D plane.
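A hedged sketch of the DLT setup on synthetic data (the ground-truth camera and the non-coplanar rig points are made up for illustration):

```python
import numpy as np

def dlt(X, x):
    """Direct Linear Transform: estimate the 3x4 camera matrix from n >= 6
    3D-2D correspondences. The 3D points must not all lie on one plane."""
    A = []
    for (u, v), Xw in zip(x, X):
        Xh = list(Xw) + [1.0]
        A.append([0.0] * 4 + [-c for c in Xh] + [v * c for c in Xh])
        A.append(Xh + [0.0] * 4 + [-u * c for c in Xh])
    # Homogeneous system A p = 0: the solution is the right singular vector
    # for the smallest singular value, defined only up to scale.
    _, _, Vt = np.linalg.svd(np.asarray(A))
    return Vt[-1].reshape(3, 4)

# Synthetic ground truth: a simple camera and 7 non-coplanar rig points.
P_true = np.array([[500.0, 0.0, 320.0, 100.0],
                   [0.0, 500.0, 240.0, 200.0],
                   [0.0, 0.0, 1.0, 2.0]])
X = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1],
              [1, 1, 0], [1, 0, 1], [0, 1, 1]], dtype=float)
xh = (P_true @ np.hstack([X, np.ones((len(X), 1))]).T).T
x = xh[:, :2] / xh[:, 2:3]

P_est = dlt(X, x)
P_est *= P_true[2, 3] / P_est[2, 3]   # fix the arbitrary overall scale
```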
Zhang's Multiplane Method: Uses a simple 2D printed checkerboard imaged from multiple angles.
* By assuming the checkerboard plane lies at $Z=0$, the $3\times4$ projection matrix simplifies to a $3\times3$ Homography ($H$) mapping points on the physical plane to the image plane.
* Each homography has 8 DOF. Since extrinsics use 6 DOF, each image provides 2 geometric constraints on the intrinsic parameters.
* Constraints: Derived using the orthonormal properties of rotation matrices (columns $r_1, r_2$ are orthogonal so $r_1^T r_2 = 0$, and have unit length so $r_1^T r_1 = r_2^T r_2 = 1$).
* Requires a final non-linear refinement step (e.g., Levenberg-Marquardt) to optimize the SVD solution and handle lens distortion.
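The two per-image constraints can be verified numerically: with $B = K^{-T}K^{-1}$, the first two homography columns satisfy $h_1^T B h_2 = 0$ and $h_1^T B h_1 = h_2^T B h_2$. A sketch with assumed intrinsics and pose:

```python
import numpy as np

# Illustrative intrinsics and an illustrative rotation/translation.
K = np.array([[700.0, 0.0, 300.0],
              [0.0, 650.0, 250.0],
              [0.0, 0.0, 1.0]])
theta = 0.3
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])  # rotation about y
t = np.array([0.1, -0.2, 3.0])

# For a plane at Z = 0, the projection reduces to H = K [r1 r2 t].
H = K @ np.column_stack([R[:, 0], R[:, 1], t])

# Zhang's two constraints per image, using B = K^{-T} K^{-1}:
B = np.linalg.inv(K).T @ np.linalg.inv(K)
h1, h2 = H[:, 0], H[:, 1]
c1 = h1 @ B @ h2                  # r1^T r2 = 0   ->  h1^T B h2 = 0
c2 = h1 @ B @ h1 - h2 @ B @ h2    # |r1| = |r2|   ->  h1^T B h1 = h2^T B h2
```

In the actual method $B$ is the unknown: each image contributes these two linear equations on the entries of $B$, and $K$ is recovered from the estimated $B$ afterwards.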
5. Lens Distortion
- Lenses introduce non-linear radial distortion, commonly manifesting as "pincushion" or "barrel" (fisheye-like) distortion, which bends straight lines.
- It is modeled with polynomial functions of the radial distance from the image center, e.g. $\hat{x} = x(1 + \kappa_1 r^2 + \kappa_2 r^4)$ with $r^2 = x^2 + y^2$.
- Because the distortion coefficients interact non-linearly with the other camera parameters, they must be estimated with non-linear optimization algorithms such as gradient descent or Newton's method.
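A sketch of the widely used radial polynomial model with two coefficients $k_1, k_2$ (this parameterization is an assumption here; the lecture's exact notation may differ):

```python
import numpy as np

def radial_distort(x, y, k1, k2):
    """Apply polynomial radial distortion to normalized image coordinates.
    Positive coefficients push points outward (pincushion);
    negative coefficients pull them inward (barrel)."""
    r2 = x**2 + y**2                       # squared distance from image center
    scale = 1.0 + k1 * r2 + k2 * r2**2
    return x * scale, y * scale

# A point away from the center is displaced along its radius:
xd, yd = radial_distort(0.5, 0.8, k1=0.1, k2=0.01)
```

Because the displacement grows with $r$, straight lines that do not pass through the center are bent, which is exactly the visual signature of lens distortion.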
6. Motion and Optical Flow
- Motion Field vs. Optical Flow: The motion field is the actual 3D motion in the real world, while optical flow is the 2D apparent motion field projected onto the image plane. They are not always identical due to lighting changes, lack of texture, or non-rigid deformations.
- Velocity Projection: By taking the derivative of the perspective projection equation, the 2D image velocity $v_x$ is a function of the 3D velocity $V_x$, the 3D point $X$, and inversely proportional to depth $Z$ ($v_x = f \frac{Z V_x - V_z X}{Z^2}$).
- Depth Cue (Motion Parallax): The length of the flow vectors is inversely proportional to depth; closer objects appear to move faster than distant objects.
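The velocity-projection relation can be checked numerically against a finite difference of the perspective projection; the point and velocity values below are arbitrary:

```python
import numpy as np

f = 1.0
X, Y, Z = 2.0, 1.0, 4.0          # a 3D point (illustrative values)
Vx, Vy, Vz = 0.5, 0.0, -1.0      # its 3D velocity

# Analytic image velocity from differentiating x' = f X / Z:
vx = f * (Z * Vx - Vz * X) / Z**2

# Numerical check: project at t and t + dt, take the finite difference.
dt = 1e-6
x0 = f * X / Z
x1 = f * (X + Vx * dt) / (Z + Vz * dt)
vx_numeric = (x1 - x0) / dt
```

The $1/Z^2$ factor is what produces motion parallax: halving the depth quadruples the image velocity for the same 3D motion.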
- Optical Flow Assumptions:
- Brightness Constancy: The projection of a point looks the same (same intensity) across frames.
- Small Motion: Motion between frames is relatively small.
- Spatial Coherence: Neighboring pixels belong to the same surface and undergo similar motion.
- The Aperture Problem: Given a straight edge, you can only perceive the motion parallel to the gradient; the component of flow perpendicular to the image gradient is fundamentally unrecoverable from a local window.
- Lucas-Kanade Method: Overcomes the aperture problem by evaluating a patch/window around a pixel, creating an overdetermined least-squares problem based on the structure tensor (identical to the math used in Harris corner detection). Flow can only be fully recovered in areas with gradients varying in multiple directions (e.g., corners).
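A minimal single-window Lucas-Kanade sketch on a synthetic translating pattern (the image function, flow values, and window size are arbitrary choices for illustration):

```python
import numpy as np

# Synthetic textured image with gradients in several directions,
# translated by a known flow (u, v) = (0.3, -0.2) between frames.
def img(x, y, t=0.0, u=0.3, v=-0.2):
    xs, ys = x - u * t, y - v * t
    return np.sin(0.5 * xs) + np.cos(0.4 * ys) + 0.1 * xs * ys

ys, xs = np.mgrid[0:20, 0:20].astype(float)
I0 = img(xs, ys, t=0.0)
I1 = img(xs, ys, t=1.0)

# Spatial gradients via central differences; temporal derivative by differencing.
Iy, Ix = np.gradient(I0)
It = I1 - I0

# Lucas-Kanade on one window: least-squares solve of [Ix Iy] [u v]^T = -It,
# i.e. the structure-tensor normal equations (A^T A) v = -A^T b.
w = (slice(5, 15), slice(5, 15))                 # a 10x10 patch
A = np.stack([Ix[w].ravel(), Iy[w].ravel()], axis=1)
b = It[w].ravel()
flow, *_ = np.linalg.lstsq(A, -b, rcond=None)    # estimated (u, v)
```

If the patch contained a single straight edge, $A^TA$ would be rank-deficient and the system would only constrain the flow component along the gradient, which is the aperture problem in matrix form.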
7. Exam & Quiz Highlights
The speaker highlights several specific concepts to remember and mentions topics that relate to exams or quizzes:
- Fundamental Equations of Perspective Projection: The speaker emphasizes, "If you remember anything from today, remember those two equations". These dictate that a projected point on the image plane is calculated as $x' = f \cdot x / z$ and $y' = f \cdot y / z$.
- Camera Calibration Problem: The speaker notes that if they were to put the 3D grid camera calibration problem on the final exam, students should know how to set it up and solve it using "least squares magic".
- Zhang's Multiplane Calibration Method: When introducing this method (which uses a single planar pattern, like a checkerboard, imaged multiple times to calibrate intrinsic parameters), the speaker explicitly states, "You are responsible for the method".
- Diagnosing Motion Fields: The speaker mentions that they could have given a quiz on looking at different motion fields and diagnosing what is physically happening to the camera in the scene, such as zooming in, zooming out, or strafing.