Question 4, Example 1
Problem 4: Integration and System Design (25 pts)
You have been hired to design the visual perception system for a new autonomous delivery robot operating on the North Campus Diag. The robot is equipped with only a single, uncalibrated digital camera. Its objective is to locate a specific student (who is wearing a highly textured Michigan engineering jacket), separate them from a crowded background, and navigate to stop exactly 1.5 meters in front of them to hand off a package.
Provide a comprehensive, end-to-end computer vision pipeline to accomplish this task. Your response should reflect a conceptual understanding of the entire course. There are many possible valid solutions, but you must clearly integrate at least three distinct conceptual paradigms covered in the course (e.g., Images as Functions, Features/Stitching, Images as Points, Images as Graphs, 3D Vision, or Deep Learning).
In your response, you must explicitly propose solutions for:
1. How you will uniquely identify and track the student in the crowd.
2. How you will cleanly segment the student from the background.
3. How you will estimate the exact distance to the student to stop at 1.5 meters, despite having only a single camera.
Answer / Example Rubric:
Note to Grader: Because this question is highly open-ended and admits many valid answers, students should be awarded full credit for any logically sound pipeline that demonstrates deep conceptual understanding of the course and integrates at least three major paradigms.
Below is one example of a strong solution:
1. Identifying and Tracking the Student (Local Features / Images as Functions):
Because the background is crowded, the student should propose using local image features, which are robust to clutter and partial occlusion. The pipeline could use a Harris corner detector, which analyzes the eigenvalues of the structure tensor to find points with strong gradients in multiple orientations (corners), thereby avoiding the aperture problem. To track the highly textured jacket robustly, the pipeline would extract SIFT descriptors, which achieve scale invariance through Difference-of-Gaussians (DoG) scale-space keypoint detection and rotation invariance through orientation-normalized gradient histograms, and match these features across consecutive frames. To reject false matches caused by the moving crowd, the system should use RANSAC to estimate the geometric transformation between frames and keep only the inlier matches. (Alternatively, students could legitimately propose a deep learning approach, such as a convolutional neural network like ResNet or AlexNet, to perform instance recognition and bounding-box tracking.)
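As a concrete illustration for the grader (not required from students), the matching-plus-RANSAC step might look like the following minimal sketch. It assumes OpenCV is available; the file names for the stored jacket reference image and the current camera frame are hypothetical placeholders.

```python
# Sketch: locate the jacket in the current frame via SIFT matching + RANSAC.
import cv2
import numpy as np

reference = cv2.imread("jacket_reference.png", cv2.IMREAD_GRAYSCALE)  # placeholder
frame = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)                 # placeholder

sift = cv2.SIFT_create()
kp_ref, des_ref = sift.detectAndCompute(reference, None)
kp_frame, des_frame = sift.detectAndCompute(frame, None)

# Match descriptors and keep only matches that pass Lowe's ratio test.
matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = matcher.knnMatch(des_ref, des_frame, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]

# RANSAC keeps only matches consistent with a single geometric transformation,
# rejecting spurious correspondences from the moving crowd.
src = np.float32([kp_ref[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([kp_frame[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, ransacReprojThreshold=5.0)
print(f"{int(inlier_mask.sum())} inlier matches locate the jacket in this frame")
```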
2. Segmenting the Student (Images as Graphs):
To accurately isolate the student from the background crowd, the pipeline should treat the image as a graph in which pixels are nodes and pairwise similarities (e.g., differences in RGB values) define edge weights. The student could propose a max-flow/min-cut formulation for a two-class (foreground/background) segmentation. They would define a unary energy term based on the known visual features of the jacket (rewarding foreground labels on jacket-like pixels) and a pairwise smoothness term that penalizes placing a boundary between pixels with similar intensities, enforcing spatial continuity. The graph is augmented with a source node (foreground) and a sink node (background), and the segmentation is computed by repeatedly finding augmenting paths in the residual graph until no augmenting path remains; by the max-flow min-cut theorem, the resulting cut is a minimum cut and hence an optimal labeling under this energy.
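For illustration only: OpenCV's GrabCut routine is one readily available stand-in for this construction, since it iterates a graph cut with color-model unary terms and contrast-sensitive pairwise terms. The sketch below assumes a hypothetical bounding box handed over from the tracking step and a placeholder image file.

```python
# Sketch: foreground/background segmentation with an iterated graph cut (GrabCut).
import cv2
import numpy as np

frame = cv2.imread("frame.png")        # placeholder image
box = (100, 50, 200, 400)              # (x, y, w, h) from the tracker (hypothetical)

mask = np.zeros(frame.shape[:2], dtype=np.uint8)
bgd_model = np.zeros((1, 65), dtype=np.float64)  # color-model parameters, filled by GrabCut
fgd_model = np.zeros((1, 65), dtype=np.float64)

# Unary terms come from color models fit inside/outside the box; pairwise terms
# penalize cutting between similar pixels. Min-cut is solved for 5 iterations.
cv2.grabCut(frame, mask, box, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)

# Pixels labeled definite or probable foreground form the student's silhouette.
student_mask = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype(np.uint8)
student_only = frame * student_mask[:, :, None]
```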
3. Distance Estimation (3D Vision):
Because the robot lacks a calibrated stereo pair, it cannot rely on disparity to compute depth. The student must correctly identify the scale ambiguity inherent in single-view perspective projection: depth is unrecoverable from one image alone because a large object far away can project to exactly the same image as a small object nearby. To stop exactly 1.5 meters away, the student must explain that the robot needs prior information to fix the scale. They should propose using the known physical dimensions of the specific Michigan jacket (e.g., the exact size of a logo, or a standard object the student is holding) to recover the absolute scale of the reconstruction and therefore the real-world distance.
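As a worked illustration of how a known size fixes the scale: under the pinhole model, an object of true height H meters that spans h pixels in the image lies at depth Z = f * H / h, where f is the focal length in pixels. The sketch below assumes f has been measured once in advance (an assumption beyond the problem statement, since the camera is otherwise uncalibrated); all numeric values are hypothetical placeholders.

```python
# Sketch: pinhole-model distance from a known object size.
LOGO_HEIGHT_M = 0.10        # known physical height of the jacket logo (assumption)
FOCAL_LENGTH_PX = 1400.0    # focal length in pixels (assumed one-time calibration)

def distance_to_logo(logo_height_px: float) -> float:
    """Estimate depth Z = f * H / h to the logo plane from its pixel height."""
    return FOCAL_LENGTH_PX * LOGO_HEIGHT_M / logo_height_px

# Example: a logo spanning about 93 pixels implies the robot is roughly 1.5 m away,
# so the controller can command a stop.
print(f"{distance_to_logo(93.0):.2f} m")
```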