Unit-2
Topic 1- Perspective
Perspective in computer vision refers to the way objects and scenes are perceived in
images or videos, taking into account the principles of perspective projection and depth
perception.
Perspective in computer vision is fundamental for making sense of the 3D world from
2D images and videos.
Perspective is crucial for various computer vision tasks, such as object recognition, 3D
reconstruction, and scene understanding.
Here are some key aspects of perspective in computer vision:
Perspective Projection: Perspective projection is the mathematical model used to simulate
how a 3D scene is projected onto a 2D image or sensor plane. It takes into account the
properties of cameras, including focal length, sensor size, and the position of the camera
relative to the scene. This projection results in objects appearing smaller as they move farther
from the camera and creates the illusion of depth.
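To make the perspective divide concrete, here is a minimal sketch in Python (NumPy only); the focal length, principal point, and 3D points are made-up illustrative values, not parameters of any real camera.

import numpy as np

# Assumed pinhole parameters (illustrative values only).
f = 800.0               # focal length in pixels
cx, cy = 320.0, 240.0   # principal point in pixels

# Two points with the same (X, Y) offset from the optical axis but at different depths.
points_3d = np.array([[1.0, 0.5, 2.0],
                      [1.0, 0.5, 4.0]])   # (X, Y, Z) in camera coordinates

for X, Y, Z in points_3d:
    u = f * X / Z + cx   # perspective divide: the offset from the principal
    v = f * Y / Z + cy   # point shrinks as the depth Z grows
    print(f"depth {Z} m -> pixel ({u:.1f}, {v:.1f})")

The farther point lands closer to the principal point, which is exactly the objects-appear-smaller-with-distance effect described above.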
Depth Perception: Perspective in computer vision enables the estimation of depth or distance
information from 2D images. This can be done through techniques such as stereo vision (using
two or more cameras to calculate depth by triangulation), monocular depth estimation (using a
single camera and various cues like texture, shading, and motion), and LiDAR (Light Detection
and Ranging) sensors, which directly measure distances to objects in the scene.
Camera Calibration: To accurately interpret perspective in computer vision, it's essential to
calibrate the camera. Camera calibration involves determining camera parameters like the
intrinsic matrix (focal length and optical center) and extrinsic matrix (position and orientation
of the camera). This information is used to transform 2D image coordinates into 3D world
coordinates and vice versa.
Vanishing Points: In images with linear perspective, parallel lines appear to converge at one
or more points on the horizon. These are called vanishing points. Detecting vanishing points
can be useful for understanding the orientation and layout of objects in the scene.
Homography: A homography is a 3x3 matrix that represents a projective transformation
between two images or scenes. It's used to map points from one perspective to another, which
is valuable for tasks like image stitching and panorama creation.
Pose Estimation: Perspective information is used to estimate the pose (position and
orientation) of objects or the camera itself in a 3D environment. Pose estimation is crucial for
augmented reality, robotics, and object tracking applications.
Depth Maps: In depth sensing, perspective information is often used to create depth maps,
which represent the distances of objects from the camera. Depth maps can be used for various
applications, including object recognition, scene understanding, and obstacle avoidance.
Camera Pose Estimation: Determining the camera's position and orientation (its pose) is
essential for applications like simultaneous localization and mapping (SLAM), where a mobile
robot or device needs to navigate and build a map of its surroundings.
Topic 2 - Binocular stereopsis
Binocular stereopsis is a fundamental concept in computer vision and 3D perception, inspired
by how the human visual system uses two eyes to perceive depth and create a three-dimensional
representation of the world. It involves the computation of depth information from the disparity
(difference in the apparent position) between corresponding points in the left and right images
captured by two cameras, mimicking the human left and right eyes.
Here's how binocular stereopsis works in computer vision:
Image Capture: Two cameras (or image sensors) capture images of the same scene from
slightly different viewpoints, just like our two eyes capture slightly different views of the world.
Feature Matching: In each of the left and right images, computer vision algorithms identify corresponding features or points, typically by finding the same pattern or object in both images. Common choices for detecting and describing such features include SIFT (Scale-Invariant Feature Transform) and SURF (Speeded-Up Robust Features).
Disparity Calculation: After identifying corresponding points, the disparity is computed for
each pair of corresponding points. Disparity represents how much an object's position differs
between the left and right images. Larger disparities correspond to objects that are closer to the
cameras, while smaller disparities correspond to objects that are farther away.
Depth Estimation: Using the calculated disparities, the known baseline B (the distance between the two camera centres), and the focal length f, computer vision systems can estimate the depth of objects in the scene by simple triangulation: Z = f * B / d, so depth is inversely proportional to the disparity d.
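A minimal sketch of this step, assuming an already rectified and calibrated stereo pair in hypothetical files left.png and right.png, and made-up values for the focal length and baseline; OpenCV's block matcher is used here only as one simple disparity estimator.

import cv2
import numpy as np

# Hypothetical rectified stereo pair (file names are placeholders).
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Block-matching disparity map (parameters are illustrative, not tuned).
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right).astype(np.float32) / 16.0  # StereoBM returns fixed-point disparities

# Triangulation: depth is inversely proportional to disparity (Z = f * B / d).
f_px = 700.0        # assumed focal length in pixels
baseline_m = 0.12   # assumed baseline in metres
valid = disparity > 0
depth = np.zeros_like(disparity)
depth[valid] = f_px * baseline_m / disparity[valid]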
3D Reconstruction: By combining the depth information from multiple points in the scene, a
3D representation of the objects in the scene can be constructed. This 3D reconstruction can be
used for various applications, such as object detection, tracking, scene understanding, and
more.
Binocular stereopsis has a wide range of applications in computer vision, robotics, augmented
reality, and virtual reality. It allows machines to perceive the world in three dimensions, which
is crucial for tasks like object detection, depth sensing, and navigation.
Topic 3- Camera geometry and Epipolar geometry
Camera geometry and epipolar geometry are fundamental concepts in computer vision that are
used to understand and model the relationship between multiple cameras and their images.
These concepts play a crucial role in tasks such as stereo vision, structure-from-motion, and
3D reconstruction.
Camera Geometry:
Pinhole Camera Model: In computer vision, the pinhole camera model is often used
to represent how cameras capture images. In this model, each camera is represented as
a simple geometric pinhole through which light rays pass to create an image on the
camera's image plane. This model provides a way to relate 3D world points to 2D image
points.
Intrinsic Parameters: These parameters characterize the internal properties of a
camera and include focal length (f), principal point (c_x, c_y), and distortion
coefficients (radial and tangential distortion). These parameters are crucial for
projecting 3D points into 2D image coordinates.
Extrinsic Parameters: Extrinsic parameters describe the camera's position and
orientation in the world coordinate system. They typically consist of the rotation matrix
(R) and translation vector (t) that transform 3D world points into the camera's
coordinate system.
Projection Matrix: The projection matrix (P) combines intrinsic and extrinsic
parameters and is used to project 3D points into 2D image coordinates. It is often
represented as P = K[R|t], where K is the camera's intrinsic matrix.
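As a sketch of the P = K[R|t] composition, the snippet below builds a projection matrix from an assumed intrinsic matrix and an assumed camera pose and projects a single world point; all numbers are illustrative.

import numpy as np

# Assumed intrinsic matrix K (focal lengths and principal point in pixels).
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Assumed extrinsics: no rotation, camera centre at (0.5, 0, 0), so t = -R @ C.
R = np.eye(3)
t = np.array([[-0.5], [0.0], [0.0]])

P = K @ np.hstack([R, t])                 # 3x4 projection matrix P = K[R|t]

X_world = np.array([1.0, 0.2, 4.0, 1.0])  # homogeneous 3D world point
x = P @ X_world
u, v = x[0] / x[2], x[1] / x[2]           # perspective divide to pixel coordinates
print(u, v)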
Epipolar Geometry:
Epipolar geometry deals with the geometric relationship between two cameras and their
corresponding images. Here's how it works:
Epipolar Lines: When you have two cameras capturing images of the same scene, any point
in one image corresponds to a line in the other image. These lines are called epipolar lines. The
epipolar lines are essential because they constrain the possible locations of corresponding
points in the second image, reducing the search space for feature matching.
Epipole: The epipole is the image of the other camera's centre, i.e. the point where the line joining the two camera centres (the baseline) pierces the image plane. All epipolar lines in that image pass through the epipole.
Epipolar Constraint: A 3D point and the two camera centres define a single epipolar plane; the images of that point must therefore lie on the epipolar lines where this plane cuts the two image planes. This constraint reduces the search for a correspondence from the whole second image to a single line.
Essential Matrix: The essential matrix (E) encodes the epipolar geometry between two
cameras. It relates the corresponding points in two images while taking into account the
calibration of the cameras. E = [t]_x R, where [t]_x is the skew-symmetric matrix of the
translation vector t.
Fundamental Matrix: The fundamental matrix (F) plays the same role as the essential matrix but works directly in pixel coordinates, without requiring knowledge of the camera calibration. It relates corresponding points in the two images projectively via the constraint x'^T F x = 0, and it is linked to the essential matrix by F = K'^(-T) E K^(-1), where K and K' are the intrinsic matrices of the two cameras.
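In practice F is usually estimated directly from point correspondences rather than from E. The sketch below fabricates exact correspondences by projecting random 3D points through two assumed cameras (in a real system they would come from feature matching) and then recovers F robustly with RANSAC; every parameter value is illustrative.

import cv2
import numpy as np

# Assumed shared intrinsics and an assumed relative pose between the two views.
K = np.array([[700.0, 0.0, 320.0], [0.0, 700.0, 240.0], [0.0, 0.0, 1.0]])
R = cv2.Rodrigues(np.array([[0.0], [0.1], [0.0]]))[0]   # small rotation about Y
t = np.array([[-0.2], [0.0], [0.0]])                    # small horizontal baseline
X = np.random.uniform([-1, -1, 4], [1, 1, 8], (50, 3))  # random 3D points

def project(P, X):
    x = (P @ np.hstack([X, np.ones((len(X), 1))]).T).T
    return x[:, :2] / x[:, 2:]

pts1 = project(K @ np.hstack([np.eye(3), np.zeros((3, 1))]), X)  # view 1
pts2 = project(K @ np.hstack([R, t]), X)                         # view 2

F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC)

# Epipolar constraint: x'^T F x should be (near) zero for every correspondence.
x1 = np.append(pts1[0], 1.0)
x2 = np.append(pts2[0], 1.0)
print(x2 @ F @ x1)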
Topic 4- Homography
In computer vision, a homography, also known as a projective transformation or perspective transformation, is a 3x3 mapping used to relate points in one image to corresponding points in another image of the same scene, assuming a planar (flat) surface.
Homographies are a powerful tool for mapping points between images of planar scenes. They are essential in applications that involve multiple images, such as image stitching and panorama creation, object tracking, and augmented reality.
Here are the key aspects of homography in computer vision:
Planar Scene Assumption: The fundamental assumption behind homographies is that the
scene being captured is planar. This means that objects in the scene lie on a flat surface, such
as a wall, a floor, or a tabletop.
Mapping Points: A homography maps points from one image (source image) to corresponding
points in another image (destination image). These corresponding points should lie on the same
planar surface.
Eight Degrees of Freedom: A homography is a 3x3 matrix defined only up to scale, so it has eight degrees of freedom. It can therefore be determined from a minimum of four point correspondences (each correspondence contributes two equations); using more points yields a more accurate, over-determined estimate.
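As a concrete illustration of the four-point minimum, the sketch below computes the exact homography that maps four assumed corner positions of a planar object (for example, a photographed document) onto an axis-aligned rectangle; all coordinates are made up.

import cv2
import numpy as np

# Four assumed corners of the planar object in the source image, and where
# they should land in the destination image (a 400x300 rectangle).
src = np.float32([[120, 80], [520, 95], [540, 400], [100, 380]])
dst = np.float32([[0, 0], [400, 0], [400, 300], [0, 300]])

H = cv2.getPerspectiveTransform(src, dst)   # exact 3x3 solution from 4 correspondences
# warped = cv2.warpPerspective(image, H, (400, 300))   # 'image' assumed loaded elsewhere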
Homography Estimation: Estimating a homography typically involves solving a system of
linear equations based on the coordinates of corresponding points in both images. Techniques
like Direct Linear Transform (DLT) or RANSAC (Random Sample Consensus) are commonly
used for homography estimation. RANSAC is especially robust in the presence of outliers or
incorrect correspondences.
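When many, possibly noisy, correspondences are available, RANSAC-based estimation is the usual route. The sketch below assumes two overlapping images of a roughly planar scene in hypothetical files img1.jpg and img2.jpg; SIFT features with a ratio-test matcher are one common, but not the only, choice for producing the correspondences.

import cv2
import numpy as np

img1 = cv2.imread("img1.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder file names
img2 = cv2.imread("img2.jpg", cv2.IMREAD_GRAYSCALE)

# Detect and match local features between the two images.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)
matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]  # Lowe's ratio test

src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

# RANSAC rejects mismatched pairs while fitting the 3x3 homography.
H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

# Warp img1 into img2's frame, e.g. as the first step of image stitching.
warped = cv2.warpPerspective(img1, H, (img2.shape[1], img2.shape[0]))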
Use Cases:
Image Stitching: Homographies are used to align and stitch together multiple images to create
a panoramic view.
Object Tracking: In object tracking, homographies can be used to update the position and
orientation of a tracked object as it moves in the scene.
Augmented Reality: Homographies are employed to overlay virtual objects onto the real
world in augmented reality applications.
Limitations:
Homographies are suitable for planar scenes but do not accurately model scenes with significant depth variation or non-planar surfaces. In such cases a single homography cannot map one image onto the other, and the full epipolar geometry (fundamental or essential matrix) or an explicit 3D reconstruction is required instead.
Topic 5- Rectification
Rectification refers to the process of transforming and aligning images in a way that
simplifies further analysis or processing.
This alignment typically involves correcting distortions, such as perspective
distortions, and ensuring that specific features or structures in the images are aligned.
Rectification is commonly used in tasks like stereo vision, object recognition, and
image stitching.
Here are some key aspects of rectification:
Stereo Vision: In stereo vision, rectification is often used to ensure that corresponding
points in two or more images captured by different cameras are aligned along the same
scanlines (rows). This alignment simplifies the process of finding disparities and depth
information in stereo pairs.
Perspective Correction: Rectification can correct perspective distortions caused by
the camera's position and orientation relative to the scene. It makes lines that are
parallel in the 3D world appear parallel in the 2D image, simplifying subsequent
processing steps.
Epipolar Geometry: In stereo vision, rectification is essential for ensuring that the epipolar lines (the lines in each image along which a point's match in the other image must lie) are horizontal, which reduces the search for matching points to a single image row.
Image Stitching: In panoramic image stitching, rectification is used to align images
taken from different viewpoints so that they can be seamlessly combined to create a
wide-angle or panoramic view.
Homography Transformation: Rectification is often achieved by applying a
homography transformation to the image. The homography matrix is calculated based
on known correspondences between points in the image. The transformation warps the
image to achieve the desired alignment.
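For the stereo case, OpenCV bundles this warping into standard routines. A minimal sketch is given below; the intrinsics, distortion coefficients, relative pose, and image size are all assumed placeholder values that would normally come from stereo calibration.

import cv2
import numpy as np

# Assumed calibration results (illustrative values only).
K1 = K2 = np.array([[700.0, 0.0, 320.0], [0.0, 700.0, 240.0], [0.0, 0.0, 1.0]])
d1 = d2 = np.zeros(5)                    # pretend the lenses are distortion-free
R = np.eye(3)                            # relative rotation between the cameras
T = np.array([[-0.12], [0.0], [0.0]])    # 12 cm baseline along X
image_size = (640, 480)

# Rectifying rotations R1, R2 and new projection matrices P1, P2.
R1, R2, P1, P2, Q, roi1, roi2 = cv2.stereoRectify(K1, d1, K2, d2, image_size, R, T)

# Per-camera warping maps; remapping each image puts corresponding points on
# the same row, i.e. makes the epipolar lines horizontal.
map1x, map1y = cv2.initUndistortRectifyMap(K1, d1, R1, P1, image_size, cv2.CV_32FC1)
map2x, map2y = cv2.initUndistortRectifyMap(K2, d2, R2, P2, image_size, cv2.CV_32FC1)
# left_rect = cv2.remap(left_img, map1x, map1y, cv2.INTER_LINEAR)
# right_rect = cv2.remap(right_img, map2x, map2y, cv2.INTER_LINEAR)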
Rectification Methods:
There are various methods for rectification, including:
Homography-Based Rectification: This method estimates a homography matrix that
maps points in the distorted image to their corrected positions.
Piecewise Planar Rectification: For scenes with non-planar surfaces, a piecewise
planar rectification approach can be used, where different regions of the image are
rectified separately.
Polynomial Rectification: In some cases, polynomial transformations are applied to
correct distortions.
Applications:
Rectification is widely used in computer vision applications, including 3D
reconstruction, object tracking, and more, where accurate alignment and distortion
correction are crucial for reliable results.
Challenges:
Rectification may introduce some challenges, especially when dealing with complex
scenes, non-planar surfaces, or scenes with significant depth variations. In such cases,
more advanced techniques may be necessary to achieve accurate rectification.
Topic 6- Distributed Ledger Technology (DLT)
Distributed Ledger Technology (DLT) is a technology that has primarily been
associated with blockchain and cryptocurrency, but it has potential applications in
various fields, including computer vision. Computer vision involves the use of
algorithms and technology to enable computers to interpret and understand visual
information from the world, typically using images or videos as input.
Here are some ways DLT can be related to computer vision:
Data Provenance and Integrity: DLT, particularly blockchain, can be used to ensure
the authenticity and integrity of the data used in computer vision tasks. In computer
vision applications, the quality and trustworthiness of the data are crucial. By storing
image metadata, data source information, and processing history on a blockchain, it
becomes difficult to tamper with or manipulate the data. This can enhance the
trustworthiness of the data used for training machine learning models in computer
vision.
Supply Chain and Object Tracking: DLT can be used in conjunction with computer
vision for supply chain and logistics applications. Computer vision can track and
identify objects in real-time, and DLT can provide a decentralized and immutable
ledger to record the history and movement of these objects throughout the supply chain.
This combination can enhance transparency and traceability, reducing fraud and
ensuring product authenticity.
Decentralized Image Storage: DLT can be used to create decentralized image storage
systems. In such systems, images can be stored across a distributed network of nodes,
and their metadata and access control can be managed through smart contracts. This
can provide more secure and censorship-resistant storage for computer vision
applications that rely on large image datasets.
Data Sharing and Collaboration: Computer vision often involves collaboration
among different entities or organizations. DLT can facilitate secure and controlled data
sharing between these parties. Smart contracts can be used to define the terms of data
sharing agreements, and DLT ensures that data access and usage are recorded and
auditable.
Decentralized AI Models: DLT can be used to distribute and manage machine
learning models used in computer vision tasks. This can enable more efficient and
decentralized AI applications, where models can be trained and improved
collaboratively by multiple participants while ensuring the transparency and fairness
of model updates.
Intellectual Property and Licensing: In computer vision, there are concerns related
to intellectual property and licensing of image and video content. DLT can be used to
manage and enforce licensing agreements automatically. When someone uses an image
or video, smart contracts can ensure that the content creator receives the appropriate
compensation.
Data Monetization: DLT can provide a transparent and secure way for individuals or
organizations to monetize their computer vision data. Data owners can specify who has
access to their data and under what conditions, and smart contracts can automate the
payment process when the data is used.
Topic 7- RANSAC
RANSAC, which stands for "Random Sample Consensus," is a widely used algorithm in
computer vision and image processing for solving problems related to robust parameter
estimation from a set of data points contaminated with outliers. It was originally introduced by
Martin Fischler and Robert Bolles in 1981 for solving problems like fitting models to noisy
data or robustly estimating transformation parameters between images.
Here's how RANSAC works in the context of computer vision:
1-Problem Description: Let's say you have a set of data points, and you want to estimate a
model (e.g., a line, a plane, a transformation matrix, etc.) that best fits these points. However,
some of the data points may be outliers, which means they do not conform to the underlying
model.
2-Algorithm Steps:
Initialization: Choose a randomly selected subset of data points (minimum required to
estimate the model) and fit a model to this subset. For example, if you're fitting a line,
select two random points and fit a line through them.
Inlier Selection: Calculate the error or distance between the model and all other data
points. Data points that are within a certain threshold (inliers) are considered to support
the model, while those outside the threshold (outliers) are considered to be noise.
Model Evaluation: Check if the number of inliers (supporting data points) for the
current model is above a certain threshold. If it is, the model is considered good enough.
Iteration: Repeat the above steps for a specified number of iterations or until a
termination condition is met. Each iteration produces a model and a corresponding set
of inliers.
Model Selection: Choose the model that has the largest number of inliers as the best
estimate of the underlying model.
3-Termination: The algorithm may terminate after a fixed number of iterations or when a
sufficient number of inliers are found, depending on the application and desired accuracy.
4-Refinement: Optionally, you can refine the estimated model using all the inliers to improve
its accuracy.
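The sketch below is a bare-bones RANSAC for fitting a 2-D line, written from scratch to mirror the steps above; the synthetic data, inlier threshold, and iteration count are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: points near the line y = 2x + 1, with 20% gross outliers.
x = rng.uniform(0, 10, 100)
y = 2 * x + 1 + rng.normal(0, 0.2, 100)
y[:20] += rng.uniform(-20, 20, 20)
points = np.column_stack([x, y])

threshold, n_iters = 0.5, 200
best_inliers, best_model = None, None

for _ in range(n_iters):
    # Initialization: fit a candidate line to a minimal random sample (2 points).
    i, j = rng.choice(len(points), size=2, replace=False)
    (x1, y1), (x2, y2) = points[i], points[j]
    if x1 == x2:
        continue                             # skip degenerate (vertical) samples
    a = (y2 - y1) / (x2 - x1)                # slope
    b = y1 - a * x1                          # intercept

    # Inlier selection: points whose residual is below the threshold.
    residuals = np.abs(points[:, 1] - (a * points[:, 0] + b))
    inliers = residuals < threshold

    # Model evaluation and selection: keep the model with the most support.
    if best_inliers is None or inliers.sum() > best_inliers.sum():
        best_inliers, best_model = inliers, (a, b)

# Refinement: least-squares fit using only the inliers of the best model.
a_ref, b_ref = np.polyfit(points[best_inliers, 0], points[best_inliers, 1], 1)
print("RANSAC line: y = %.2f x + %.2f" % (a_ref, b_ref))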
Applications of RANSAC
Image Registration: Estimating the transformation (rotation, translation, scaling)
between two images, even when there are outliers or occlusions.
Feature Matching: Matching features in different images despite variations in lighting,
viewpoint, or partial occlusions.
Homography Estimation: Estimating the homography matrix for planar object
detection and image stitching.
Line and Shape Fitting: Fitting lines, circles, or other geometric shapes to noisy data.
3D Reconstruction: Estimating the pose of a camera and the positions of 3D points
from a set of 2D image points.
Topic 8- 3-D reconstruction framework
A 3-D reconstruction framework is a set of methods, algorithms, and tools used to create a
three-dimensional representation of an object or scene from two-dimensional images or other
data sources. This technology is widely used in various fields, including computer vision,
computer graphics, medical imaging, archaeology, robotics, and more. The goal is to create a
3D model that accurately represents the shape, appearance, and structure of the real-world
object or environment.
Here are the key components and steps involved in a typical 3-D reconstruction framework:
Data Acquisition: Collect 2D images or sensor data of the object or scene from
multiple viewpoints. This can be done using cameras, LiDAR (Light Detection and
Ranging), depth sensors, or other data sources.
Camera Calibration (if applicable): If cameras are used, calibrate them to obtain
accurate intrinsic and extrinsic parameters. This step is crucial for accurate 3D
reconstruction.
Feature Extraction: Identify and extract distinctive features or keypoints from the 2D
images or data. These features serve as reference points for matching and aligning
images.
Matching and Correspondence: Match corresponding features between pairs of
images. This step helps establish relationships between different viewpoints.
Pose Estimation: Determine the camera poses (positions and orientations) for each
image relative to a common coordinate system. This can be done using techniques like
structure-from-motion (SfM) or visual odometry.
Depth Estimation: Calculate depth information for each pixel or feature point in the
images. Depth estimation can be achieved using stereo vision, depth sensors, or other
depth sensing methods.
Point Cloud Generation: Combine the 2D images and depth information to create a
3D point cloud representation of the scene. Each point in the cloud corresponds to a 3D
position in the real world.
Surface Reconstruction: Create a surface mesh or voxel representation from the point
cloud data. There are various algorithms for this step, including Marching Cubes,
Poisson reconstruction, and Delaunay triangulation.
Texture Mapping (optional): If available, project texture information from the
original images onto the reconstructed 3D surface to add color and appearance details.
Refinement and Post-processing: Refine the 3D model by addressing issues like
noise, outliers, and holes. Post-processing may also include smoothing and
simplification of the 3D mesh.
Visualization and Analysis: Visualize the reconstructed 3D model and analyze it for
various applications, such as 3D printing, augmented reality, virtual reality, simulation,
or scientific research.
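A heavily simplified two-view version of this pipeline is sketched below. It assumes two overlapping photographs in hypothetical files view1.jpg and view2.jpg and an assumed intrinsic matrix K; a real framework would add bundle adjustment, dense depth estimation, and surface reconstruction on top of this sparse result.

import cv2
import numpy as np

K = np.array([[700.0, 0.0, 320.0], [0.0, 700.0, 240.0], [0.0, 0.0, 1.0]])  # assumed intrinsics

img1 = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder file names
img2 = cv2.imread("view2.jpg", cv2.IMREAD_GRAYSCALE)

# Feature extraction and matching.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)
good = [m for m, n in cv2.BFMatcher().knnMatch(des1, des2, k=2)
        if m.distance < 0.75 * n.distance]
pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

# Pose estimation: relative pose from the essential matrix.
E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
_, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)

# Depth and point cloud generation: triangulate the matches into sparse 3D points.
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R, t])
X_h = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)   # 4xN homogeneous points
point_cloud = (X_h[:3] / X_h[3]).T                    # Nx3 Euclidean points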
Topic 9- Auto-calibration
Auto-calibration in computer vision and image processing refers to the process of automatically
determining and adjusting the parameters of a camera or imaging system without human
intervention. This is a crucial step in various computer vision and image processing
applications, as it ensures that the images captured by the camera accurately represent the real-
world scene.
Here are some key aspects and methods related to auto-calibration in computer vision and
image processing:
Camera Parameters: Auto-calibration typically involves estimating or refining the
camera's intrinsic and extrinsic parameters.
Intrinsic Parameters: These include parameters such as the focal length, principal
point, and lens distortion coefficients. Techniques like Zhang's camera calibration
algorithm and Bouguet's calibration toolbox are commonly used for intrinsic parameter
estimation.
Extrinsic Parameters: These parameters define the camera's position and orientation
in the world coordinate system. They are often estimated through techniques like
Structure from Motion (SfM) or simultaneous localization and mapping (SLAM).
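A minimal sketch of intrinsic calibration in the spirit of Zhang's method, using OpenCV and assuming a set of checkerboard photos matching the hypothetical pattern calib_*.jpg, with a 9x6 grid of inner corners and 25 mm squares; all of these specifics are assumptions.

import glob
import cv2
import numpy as np

pattern = (9, 6)   # assumed inner-corner grid of the checkerboard
square = 0.025     # assumed square size in metres

# Planar 3D coordinates of the board corners (the board defines the Z = 0 plane).
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square

obj_points, img_points = [], []
for path in glob.glob("calib_*.jpg"):              # placeholder file pattern
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Estimate the intrinsic matrix K and distortion coefficients from all views.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("reprojection RMS:", rms)
print("K =", K, sep="\n")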
Auto-Focus: In some applications, auto-calibration can also involve adjusting the focus
of the camera lens automatically to achieve sharp images. This is often done using
contrast-based or phase-based auto-focus algorithms.
Auto-Exposure: Auto-calibration can include adjusting the exposure settings of the
camera to ensure that images are correctly exposed. This is especially important when
dealing with varying lighting conditions.
Auto-White Balance: Automatic white balance correction ensures that colors in the
image are accurately represented under different lighting conditions. Auto-calibration
may involve adjusting the white balance settings to achieve this.
Marker-Based Calibration: In some cases, calibration can be performed using known
markers or patterns placed in the scene. These markers can help the system estimate
camera parameters more accurately.
Self-Calibration: Self-calibration methods aim to estimate camera parameters without
the need for any external calibration objects or markers. This is often achieved through
the analysis of the scene geometry or image correspondences over multiple views.
Online Calibration: In real-time applications, auto-calibration may need to be
performed continuously as the camera or scene conditions change. Online calibration
methods adapt to these changes in real-time.
Machine Learning-Based Calibration: Machine learning techniques, such as deep
neural networks, can also be employed for auto-calibration tasks. These methods can
learn to estimate camera parameters or correct distortions directly from images.
Multi-Camera Systems: In scenarios involving multiple cameras (e.g., stereo or multi-
view setups), auto-calibration may involve estimating the relative pose and calibration
parameters between the cameras.