Autonomous Vehicle/Video Geometry
Direct Linear Transform (DLT)
Naranjito
2025. 4. 21. 15:53
- Direct Linear Transform (DLT)
- A linear method to compute P from image–object point correspondences.
- What is P? The camera projection matrix (this post also calls it the DLT matrix)
- Purpose : Project 3D → 2D with camera geometry (pose + intrinsics)

The equation shows how a 3D point \(\mathbf{X}\) is projected to a 2D image point \(\mathbf{x}\):
\(\mathbf{x} = K R [I_3 \mid -\mathbf{X}_O] \mathbf{X}\)
or more compactly:
\(\mathbf{x} = P\mathbf{X}\)
\(\mathbf{X}\in\mathbb{R}^4\) (homogeneous) : A 3D point in world coordinates (with an extra 1 appended: \([X,Y,Z,1]^\mathrm{T}\))
\(\mathbf{x}\in\mathbb{R}^3\) (homogeneous) : A 2D point in the image plane (projected pixel coordinate)
K : Intrinsic matrix (contains camera parameters like focal length, principal point)
R : Rotation matrix (camera orientation in world coordinates)
\(\mathbf{X}_{O}\) : Camera center (i.e., camera position in world coordinates)
\([I_{3}\mid-\mathbf{X}_{O}]\) : Translation part of the extrinsics (3×4), shifts world coordinates so that the camera center becomes the origin
\(\mathbf{P}=KR \begin{bmatrix} I_{3}\mid-\mathbf{X}_{O} \end{bmatrix}\) : The full projection matrix (3×4), combining intrinsic and extrinsic parameters
・ \([I_{3}\mid-\mathbf{X}_{O}]\) : Taken on its own (without R), this is a special case of the extrinsic matrix when I assume:
- The world coordinate axes are aligned with the camera axes (i.e., no rotation)
- The only difference is the camera’s position in the world \(\mathbf{X}_{O}\), so:
- \(I_{3}\) : Identity matrix → no rotation between world and camera
- \(-\mathbf{X}_{O}\) : Negative camera center → shifts the world to align with the camera at origin
- In the general case, the form usually seen is \([R \mid -RC]\), where C is the camera center in world coordinates (C is \(\mathbf{X}_{O}\) here) and −RC translates the world origin into the camera frame. Multiplying out \(R[I_{3}\mid-\mathbf{X}_{O}]=[R\mid-R\mathbf{X}_{O}]\) shows that both forms agree.
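The projection described above can be sketched in a few lines of NumPy. This is only an illustration: the function names `projection_matrix` and `project`, and the values chosen for K, R and \(\mathbf{X}_{O}\), are assumptions made for the example and do not come from a real camera.

```python
import numpy as np

def projection_matrix(K, R, X_O):
    """Build P = K R [I_3 | -X_O], a 3x4 matrix."""
    translate = np.hstack([np.eye(3), -X_O.reshape(3, 1)])  # [I_3 | -X_O]
    return K @ R @ translate

def project(P, X_world):
    """Project a 3D world point to Euclidean pixel coordinates (x, y)."""
    X_h = np.append(X_world, 1.0)   # homogeneous point (X, Y, Z, 1)
    u, v, w = P @ X_h               # homogeneous image point
    return np.array([u / w, v / w])

# Hypothetical camera: focal length 800 px, principal point (320, 240), no skew.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                       # camera axes aligned with world axes
X_O = np.array([0.0, 0.0, -5.0])    # camera sits 5 units behind the world origin

P = projection_matrix(K, R, X_O)
pixel = project(P, np.array([1.0, 0.5, 5.0]))
print(pixel)  # → [400. 280.]
```

With R set to the identity, this is exactly the special case \([I_{3}\mid-\mathbf{X}_{O}]\); replacing R with any rotation matrix gives the general form \([R\mid-R\mathbf{X}_{O}]\).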
How does the Direct Linear Transform (DLT) map a 3D point X in world coordinates to a 2D point x in the image plane through matrix multiplication?
\(\mathbf{x}=\underbrace{\mathbf{K}}_{\mathrm{Intrinsic}}\cdot\underbrace{\mathbf{R}}_{\mathrm{Rotation}}\cdot\underbrace{[\mathbf{I}_3\mid-\mathbf{X}_O]}_{\text{Translation to camera frame}}\cdot\underbrace{\mathbf{X}}_{\text{Homogeneous world point}}\)
| Symbol | Meaning | Size |
| --- | --- | --- |
| \(\mathbf{X}\) | Homogeneous 3D point (X, Y, Z, 1)^T | 4×1 |
| \([\mathbf{I}_3 \mid -\mathbf{X}_O]\) | Translates world point to camera origin (I3 : identity matrix, 3×3; XO : camera center, 3×1) | 3×4 |
| R | Rotates world frame to align with camera frame | 3×3 |
| K | Intrinsic camera matrix (focal length, center, skew) | 3×3 |
| x | Homogeneous 2D point in image (e.g., pixels) | 3×1 |
- Unified Projection Matrix
\(\mathbf{P}=\underbrace{\mathbf{K}}_{\mathrm{Intrinsic}}\cdot\underbrace{\mathbf{R}}_{\mathrm{Rotation}}\cdot\underbrace{[\mathbf{I}_3\mid-\mathbf{X}_O]}_{\text{Translation to camera frame}}\)
So, \(\mathbf{x}=\mathbf{P}\cdot\mathbf{X}\)
- P
This is exactly the camera projection matrix, which this post also calls the Direct Linear Transform (DLT) matrix. DLT itself names the linear method used to estimate P.
- Why is P called a Direct Linear Transform? Because :
- It linearly transforms a 3D point in homogeneous world coordinates \(\mathbf{X}\in\mathbb{R}^4\) into a 2D image point \(\mathbf{x}\in\mathbb{R}^3\), up to scale : \(\mathbf{x}=\mathbf{P}\cdot\mathbf{X}\)
- This transformation is a single matrix multiplication, which is why it’s “direct”.
- It encapsulates both:
・ Extrinsics: camera pose, i.e. position \(\mathbf{X}_O\) (equivalently the translation \(\mathbf{t}=-R\mathbf{X}_O\)) and orientation R
・ Intrinsics: internal camera parameters (focal length, principal point, skew) in K
- Compute the 5 Intrinsic Parameters (K) and 6 Extrinsic Parameters ([R|t]), 11 in total
\(K= \begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}\), \(\begin{bmatrix} \mathbf{R} & \mathbf{t} \end{bmatrix}= \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{bmatrix}\), here the rotation matrix must be orthonormal, in other words \(\mathbf{R}^{\top}\mathbf{R}=\mathbf{I}\). Its 9 entries therefore carry only 3 degrees of freedom, and together with the 3 entries of t this gives the 6 extrinsic parameters.
- How many points are needed?
Step 1: Understand the Camera Matrix \(\mathbf{P}= \begin{bmatrix} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \end{bmatrix}\)
Step 2 : Multiply \(\mathbf{P}\cdot\mathbf{X}\) : \(\mathbf{P} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}= \begin{bmatrix} u \\ v \\ w \end{bmatrix}\)
Then : \(\begin{gathered} u=p_{11}X+p_{12}Y+p_{13}Z+p_{14} \\ v=p_{21}X+p_{22}Y+p_{23}Z+p_{24} \\ w=p_{31}X+p_{32}Y+p_{33}Z+p_{34} \end{gathered}\)
Step 3 : Go from Homogeneous to Euclidean Coordinates
\(\large x=\frac{u}{w},\quad y=\frac{v}{w}\)
So finally :
\(\begin{aligned} x & =\frac{p_{11}X+p_{12}Y+p_{13}Z+p_{14}}{p_{31}X+p_{32}Y+p_{33}Z+p_{34}} \\ y & =\frac{p_{21}X+p_{22}Y+p_{23}Z+p_{24}}{p_{31}X+p_{32}Y+p_{33}Z+p_{34}} \end{aligned}\)
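Each 3D-2D point correspondence provides 2 equations. P has 12 entries, but it is only defined up to scale, so there are 11 independent unknowns. For an uncalibrated camera I therefore need at least 6 point correspondences (6×2=12 equations, enough for the 11 unknowns), since 5 points give only 10 equations. The points must not all lie on one plane, otherwise the system is degenerate.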
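The two fractions for x and y are not linear in the entries of P, but multiplying each by the common denominator \(w=p_{31}X+p_{32}Y+p_{33}Z+p_{34}\) and moving everything to one side makes them linear. This worked step is a sketch of how the homogeneous system used later in this post arises:

\(\begin{aligned} -(p_{11}X+p_{12}Y+p_{13}Z+p_{14})+x\,(p_{31}X+p_{32}Y+p_{33}Z+p_{34}) & =0 \\ -(p_{21}X+p_{22}Y+p_{23}Z+p_{24})+y\,(p_{31}X+p_{32}Y+p_{33}Z+p_{34}) & =0 \end{aligned}\)

Both equations are linear and homogeneous in the twelve \(p_{jk}\), so multiplying all entries of P by the same non-zero factor still satisfies them.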
- Decomposition of P
\(\mathbf{P}=\underbrace{\mathbf{K}}_{\mathrm{Intrinsic}}\cdot\underbrace{\mathbf{R}}_{\mathrm{Rotation}}\cdot\underbrace{[\mathbf{I}_3\mid-\mathbf{X}_O]}_{\text{Translation to camera frame}}\) = \([\mathbf{KR}\mid-\mathbf{KRX}_{O}]=[\mathbf{H}\mid\mathbf{h}]\)
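H=KR (a 3×3 matrix).
h=\(-\mathbf{KRX}_{O}\) (a 3×1 vector, the last column of P).
Therefore, \(\mathbf{X}_{O}=-\mathbf{R}^{\top}\mathbf{K}^{-1}\mathbf{h}\), using \(\mathbf{R}^{-1}=\mathbf{R}^{\top}\)
or, written directly with the blocks of P, \(\mathbf{X}_{O}=-\mathbf{H}^{-1}\mathbf{h}\)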
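As a quick numerical check of the camera-center formula, here is a minimal NumPy sketch. The helper name `camera_center` and the camera values (K, a 30° rotation about the z-axis, the position `X_O_true`) are hypothetical choices for illustration only.

```python
import numpy as np

def camera_center(P):
    """Recover X_O from P = [H | h] using X_O = -H^{-1} h."""
    H, h = P[:, :3], P[:, 3]
    # Solving H y = h gives y = H^{-1} h without forming the inverse explicitly.
    return -np.linalg.solve(H, h)

# Hypothetical camera used only to generate a test matrix P.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
a = np.deg2rad(30.0)
R = np.array([[np.cos(a), -np.sin(a), 0.0],
              [np.sin(a), np.cos(a), 0.0],
              [0.0, 0.0, 1.0]])
X_O_true = np.array([2.0, -1.0, 0.5])

P = K @ R @ np.hstack([np.eye(3), -X_O_true.reshape(3, 1)])
X_O = camera_center(P)
print(np.allclose(X_O, X_O_true))
```

Because H and h are scaled by the same factor when P is scaled, the recovered center does not depend on the unknown scale of P.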
- DLT in a Nutshell
We want to solve : \(\mathbf{M}\cdot\mathbf{p}=\mathbf{0}\)
Where : M is a \(2I\times12\) matrix
- Why \(2I\times12\) ?
Each point correspondence \((\mathbf{X}_i,\mathbf{x}_i)\) gives two rows of M, because we extract two equations:
・ One equation from the projection into x
・ One equation from the projection into y
So:
・ For I correspondences → 2I equations
・ Each equation is linear in 12 unknowns → 12 columns
Therefore, \(\mathbf{M}\in\mathbb{R}^{2I\times12}\)
\(\mathbf{p}=\mathrm{vec}(\mathbf{P}^{\top})\in\mathbb{R}^{12}\)
- vec() is the vectorization operator: it takes a matrix and stacks its columns into a single column vector. Applied to \(\mathbf{P}^{\top}\), it stacks the rows of P, which is the ordering that the rows of M assume.
- \(\mathbf{P}\in\mathbb{R}^{3\times4}\) : 3 rows and 4 columns = 12 elements total
- \(\mathbf{P}= \begin{bmatrix} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \end{bmatrix}\), then : \(\mathrm{vec}(\mathbf{P}^{\top})=\mathbf{p}= \begin{bmatrix} p_{11} \\ p_{12} \\ p_{13} \\ p_{14} \\ p_{21} \\ p_{22} \\ p_{23} \\ p_{24} \\ p_{31} \\ p_{32} \\ p_{33} \\ p_{34} \end{bmatrix}\in\mathbb{R}^{12\times1}\)
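The order in which the 12 entries are stacked matters, because the rows of M must use the same order. A small NumPy sketch contrasting column stacking with row stacking follows; the matrix filled with the numbers 1 to 12 is a made-up stand-in for P, chosen so that the entry values reveal the ordering.

```python
import numpy as np

# Stand-in for P, filled row by row: p11=1, p12=2, p13=3, p14=4, p21=5, ...
P = np.arange(1, 13, dtype=float).reshape(3, 4)

vec_P = P.flatten(order="F")    # column stacking: p11, p21, p31, p12, ...
vec_Pt = P.flatten(order="C")   # row stacking:    p11, p12, p13, p14, p21, ...

print(vec_P[:4])   # → [1. 5. 9. 2.]
print(vec_Pt[:4])  # → [1. 2. 3. 4.]

# Row stacking of P is the same as column stacking of its transpose.
print(np.array_equal(vec_Pt, P.T.flatten(order="F")))  # → True
```

Row stacking is the ordering that matches rows of the form \((-X_i,-Y_i,-Z_i,-1,0,0,0,0,x_iX_i,\dots)\): the first four slots multiply the first row of P, the next four the second row, and the last four the third row.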
Each 3D point \((X_i,Y_i,Z_i)\) and corresponding image point \((x_i,y_i)\) gives 2 equations (for x and y).
1. Construct two rows per point
For each point i, construct:
\(\mathbf{a}_{x_i}^\top=(-X_i,-Y_i,-Z_i,-1,0,0,0,0,x_iX_i,x_iY_i,x_iZ_i,x_i)\)
\(\mathbf{a}_{y_i}^\top=(0,0,0,0,-X_i,-Y_i,-Z_i,-1,y_iX_i,y_iY_i,y_iZ_i,y_i)\)
2. Stack them to form M : \(\mathbf{M}= \begin{bmatrix} \mathbf{a}_{x_1}^\top \\ \mathbf{a}_{y_1}^\top \\ \vdots \\ \mathbf{a}_{x_I}^\top \\ \mathbf{a}_{y_I}^\top \end{bmatrix}\in\mathbb{R}^{2I\times12}\)
- Final Form
\(\mathbf{M}\cdot\mathbf{p}=\mathbf{0}\quad\mathrm{where~}\mathbf{p}=\mathrm{vec}(\mathbf{P}^{\top})\in\mathbb{R}^{12}\) (the rows of P stacked into one vector)
To solve for p, build M from at least 6 point correspondences (12 equations) and compute its SVD: p is the right singular vector belonging to the smallest singular value. This minimizes \(\|\mathbf{M}\mathbf{p}\|\) subject to \(\|\mathbf{p}\|=1\), which rules out the trivial solution p = 0.
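The whole procedure can be sketched with NumPy as follows. It is a minimal illustration on noise-free synthetic data, not a production implementation: the function names `build_M`, `dlt` and `reproject`, the camera values and the random test points are all assumptions made for this example.

```python
import numpy as np

def build_M(X_world, x_img):
    """Stack two rows per correspondence into the 2I x 12 matrix M."""
    rows = []
    for (X, Y, Z), (x, y) in zip(X_world, x_img):
        rows.append([-X, -Y, -Z, -1, 0, 0, 0, 0, x * X, x * Y, x * Z, x])
        rows.append([0, 0, 0, 0, -X, -Y, -Z, -1, y * X, y * Y, y * Z, y])
    return np.array(rows, dtype=float)

def dlt(X_world, x_img):
    """Estimate P (up to scale) from at least 6 correspondences."""
    M = build_M(X_world, x_img)
    _, _, Vt = np.linalg.svd(M)
    p = Vt[-1]               # right singular vector of the smallest singular value
    return p.reshape(3, 4)   # p lists the entries of P row by row

def reproject(P, X_world):
    """Project 3D points with P and return Euclidean image coordinates."""
    X_h = np.hstack([X_world, np.ones((len(X_world), 1))])
    uvw = X_h @ P.T
    return uvw[:, :2] / uvw[:, 2:3]

# Hypothetical ground-truth camera, used only to synthesize image points.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
X_O = np.array([0.0, 0.0, -5.0])
P_true = K @ np.eye(3) @ np.hstack([np.eye(3), -X_O.reshape(3, 1)])

rng = np.random.default_rng(0)
X_world = rng.uniform(-1.0, 1.0, size=(8, 3))   # 8 random, non-coplanar points
x_img = reproject(P_true, X_world)

M = build_M(X_world, x_img)
P_est = dlt(X_world, x_img)
max_err = np.abs(reproject(P_est, X_world) - x_img).max()
print(M.shape, max_err)   # M is 16 x 12; the error should be tiny for exact data
```

The estimate `P_est` equals `P_true` only up to a scale factor (possibly negative), so the check compares reprojected image points rather than matrix entries. With noisy measurements the smallest singular value is no longer zero and the result is a least-squares estimate.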