
Direct Linear Transform (DLT)

Naranjito 2025. 4. 21. 15:53
  • Direct Linear Transform (DLT)

 

- A linear method to compute P from image–object point correspondences.

- What is P? The DLT matrix, i.e., the camera projection matrix

- Purpose : Project 3D → 2D with camera geometry (pose + intrinsics)


The equation shows how a 3D point 𝑋 is projected to a 2D image point 𝑥:
\(\mathbf{x} = K R [I_3 \mid -\mathbf{X}_O] \mathbf{X}\)

or more compactly:
𝑥=𝑃𝑋


\(\mathbf{X}\in\mathbb{R}^4\) (homogeneous) : A 3D point in world coordinates, with an extra 1 appended: \([X,Y,Z,1]^\mathrm{T}\)

\(\mathbf{x}\in\mathbb{R}^3\) (homogeneous) : A 2D point in the image plane (projected pixel coordinate)

K : Intrinsic matrix (contains camera parameters like focal length, principal point)

R : Rotation matrix (camera orientation in world coordinates)

\(\mathbf{X}_{O}\) : Camera center (i.e., camera position in world coordinates)

\([I_{3}\mid-\mathbf{X}_{O}]\) : Translation part of the extrinsic matrix; shifts world coordinates so that the camera center becomes the origin

\(\mathbf{P}=KR \begin{bmatrix} I_{3}\mid-\mathbf{X}_{O} \end{bmatrix}\) : The full projection matrix (3×4), combining intrinsic and extrinsic parameters



・ \(I_{3}\mid-\mathbf{X}_{O}\) : This is a special case of the extrinsic matrix when I assume:

- The world coordinate axes are aligned with the camera axes (i.e., no rotation)

- The only difference between the two frames is the camera's position in the world, \(\mathbf{X}_{O}\), so:

- \(I_{3}\) : Identity matrix → no rotation between world and camera

- \(-\mathbf{X}_{O}\) : Negative camera center → shifts the world to align with the camera at origin

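- But in the general case, I'll see this form instead: [𝑅∣−𝑅𝐶], where C is the camera center in world coordinates (the same point as \(\mathbf{X}_{O}\)) and −RC is the translation that moves the world origin into the camera frame. Multiplying out \(R[I_{3}\mid-\mathbf{X}_{O}]\) gives exactly \([R\mid-R\mathbf{X}_{O}]\).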
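To make the link between the two forms concrete, here is a minimal NumPy sketch. The rotation angle and camera center are hypothetical values picked only for illustration; they are not from the text above.

```python
import numpy as np

# Hypothetical example values: a 30 degree rotation about the z-axis
# and a camera center X_O given in world coordinates.
theta = np.deg2rad(30.0)
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
X_O = np.array([1.0, 2.0, 3.0])

# Factored form: R [I_3 | -X_O]
factored = R @ np.hstack([np.eye(3), -X_O.reshape(3, 1)])

# Expanded form: [R | -R X_O]
expanded = np.hstack([R, (-R @ X_O).reshape(3, 1)])

# Both forms are the same 3x4 matrix.
same = np.allclose(factored, expanded)

# The camera center itself lands on the origin of the camera frame.
centre_in_camera = expanded @ np.append(X_O, 1.0)
```

Here `same` comes out true, and `centre_in_camera` is the zero vector, which is what "shifts the world to align with the camera at origin" means numerically.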

  • How does the Direct Linear Transform (DLT) map a 3D point X in world coordinates to a 2D point x in the image plane through matrix multiplication?

 

\(\mathbf{x}=\underbrace{\mathbf{K}}_{\mathrm{Intrinsic}}\cdot\underbrace{\mathbf{R}}_{\mathrm{Rotation}}\cdot\underbrace{[\mathbf{I}_3\mid-\mathbf{X}_O]}_{\text{Translation to camera frame}}\cdot\underbrace{\mathbf{X}}_{\text{Homogeneous world point}}\)

 

X : Homogeneous 3D world point (X, Y, Z, 1)^T, 4×1
[I3 | −XO] : Translates the world point to the camera origin (I3 : identity matrix, 3×3; XO : camera center, 3×1), 3×4
R : Rotates the world frame to align with the camera frame, 3×3
K : Intrinsic camera matrix (focal length, center, skew), 3×3
x : Homogeneous 2D point in the image (e.g., pixels), 3×1

  • Unified Projection Matrix

 

\(\mathbf{P}=\underbrace{\mathbf{K}}_{\mathrm{Intrinsic}}\cdot\underbrace{\mathbf{R}}_{\mathrm{Rotation}}\cdot\underbrace{[\mathbf{I}_3\mid-\mathbf{X}_O]}_{\text{Translation to camera frame}}\)

So, x=P ・ X
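As a rough sketch of what "unified" means here, the snippet below builds P once and checks that a single multiplication gives the same result as applying translation, rotation, and intrinsics one after another. All numeric values are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical intrinsics and pose, for illustration only.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
angle = np.deg2rad(10.0)                      # rotation about the y-axis
R = np.array([[ np.cos(angle), 0.0, np.sin(angle)],
              [ 0.0,           1.0, 0.0          ],
              [-np.sin(angle), 0.0, np.cos(angle)]])
X_O = np.array([0.0, 0.0, -5.0])              # camera center in world coordinates

# P = K R [I_3 | -X_O], a single 3x4 matrix
P = K @ R @ np.hstack([np.eye(3), -X_O.reshape(3, 1)])

X = np.array([1.0, 0.5, 5.0, 1.0])            # homogeneous world point

x_direct = P @ X                              # one matrix multiplication
x_steps = K @ (R @ (X[:3] - X_O))             # translate, rotate, apply K
```

Both `x_direct` and `x_steps` are the same homogeneous image point (u, v, w).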

  • P
P is exactly what is called the Direct Linear Transform (DLT) matrix, or the camera projection matrix.

 


  • Why is P called a Direct Linear Transform? Because :

 

- It linearly transforms a 3D point in homogeneous world coordinates \(\mathbf{X}\in\mathbb{R}^4\) into a homogeneous 2D image point \(\mathbf{x}\in\mathbb{R}^3\), up to scale : x = P ⋅ X

- This transformation is represented as a single matrix multiplication — that’s why it’s “direct”.

- It encapsulates both:

・ Extrinsics: camera pose — position t and orientation R

・ Intrinsics: internal camera parameters (focal length, principal point, skew) in K

  • Compute the 5 Intrinsic Parameters(K) and 6 Extrinsic Parameters([R|t])

 

\(K= \begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}\), \(\begin{bmatrix} \mathbf{R} & \mathbf{t} \end{bmatrix}= \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{bmatrix}\). Here the rotation matrix must be orthonormal, in other words, \(\mathbf{R}^{\top}\mathbf{R}=\mathbf{I}\), so its 9 entries carry only 3 degrees of freedom; together with the 3 components of t this gives the 6 extrinsic parameters.
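The parameter count can be checked with a small sketch. The helper names `intrinsic_matrix` and `rotation_from_angles` are hypothetical, and building R from three axis angles is just one possible convention, assumed here for illustration.

```python
import numpy as np

def intrinsic_matrix(fx, fy, s, cx, cy):
    """Build K from its 5 parameters (focal lengths, skew, principal point)."""
    return np.array([[fx,  s,   cx],
                     [0.0, fy,  cy],
                     [0.0, 0.0, 1.0]])

def rotation_from_angles(ax, ay, az):
    """R = Rz @ Ry @ Rx from three angles in radians (one convention of many)."""
    cx, sx = np.cos(ax), np.sin(ax)
    cy, sy = np.cos(ay), np.sin(ay)
    cz, sz = np.cos(az), np.sin(az)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

K = intrinsic_matrix(800.0, 820.0, 0.5, 320.0, 240.0)   # 5 numbers
R = rotation_from_angles(0.1, -0.2, 0.3)                # 3 numbers
t = np.array([0.5, -1.0, 2.0])                          # 3 numbers
Rt = np.hstack([R, t.reshape(3, 1)])                    # 3x4 matrix [R | t]

# Orthonormality and proper orientation of R.
is_orthonormal = np.allclose(R.T @ R, np.eye(3))
is_proper = np.isclose(np.linalg.det(R), 1.0)
```

So 5 + 3 + 3 = 11 numbers fully describe K and [R|t], even though [R|t] has 12 entries.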

  • How many points are needed?

 

Step 1: Understand the Camera Matrix \(\mathbf{P}= \begin{bmatrix} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \end{bmatrix}\)

Step 2 : Multiply 𝑃⋅𝑋 : \(\mathbf{P} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}= \begin{bmatrix} u \\ v \\ w \end{bmatrix}\)

Then : \(\begin{gathered} u=p_{11}X+p_{12}Y+p_{13}Z+p_{14} \\ v=p_{21}X+p_{22}Y+p_{23}Z+p_{24} \\ w=p_{31}X+p_{32}Y+p_{33}Z+p_{34} \end{gathered}\)

Step 3 : Go from Homogeneous to Euclidean Coordinates

\(\large x=\frac{u}{w},\quad y=\frac{v}{w}\)

So finally :

\(\begin{aligned} x & =\frac{p_{11}X+p_{12}Y+p_{13}Z+p_{14}}{p_{31}X+p_{32}Y+p_{33}Z+p_{34}} \\ y & =\frac{p_{21}X+p_{22}Y+p_{23}Z+p_{24}}{p_{31}X+p_{32}Y+p_{33}Z+p_{34}} \end{aligned}\)
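Steps 2 and 3 can be sketched for several points at once. The function name `project_points` and the numeric P are assumptions for illustration; P here corresponds to a camera with focal length 800, principal point (320, 240), no rotation, and center at (0, 0, −5). Points with w = 0 are not handled.

```python
import numpy as np

def project_points(P, points_3d):
    """Project N Euclidean 3D points (N, 3) to pixels (N, 2) with a 3x4 matrix P."""
    points_3d = np.asarray(points_3d, dtype=float)
    ones = np.ones((points_3d.shape[0], 1))
    X_h = np.hstack([points_3d, ones])     # (N, 4) homogeneous points
    uvw = X_h @ P.T                        # each row is (u, v, w)
    return uvw[:, :2] / uvw[:, 2:3]        # x = u / w, y = v / w

# Hypothetical projection matrix (see the assumptions above).
P = np.array([[800.0,   0.0, 320.0, 1600.0],
              [  0.0, 800.0, 240.0, 1200.0],
              [  0.0,   0.0,   1.0,    5.0]])

points = np.array([[1.0, 0.5, 5.0],
                   [0.0, 0.0, 1.0]])
pixels = project_points(P, points)
# First point:  u = 4000, v = 2800, w = 10  ->  (400, 280)
# Second point: u = 1920, v = 1440, w = 6   ->  (320, 240)
```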

 

Each 3D-2D point correspondence provides 2 equations. P has 12 entries but is only defined up to scale, so it has 11 degrees of freedom. 5 correspondences give only 10 equations, so for an uncalibrated camera I need at least 6 point correspondences (6×2=12 equations).

  • Decomposition of P

 

\(\mathbf{P}=\underbrace{\mathbf{K}}_{\mathrm{Intrinsic}}\cdot\underbrace{\mathbf{R}}_{\mathrm{Rotation}}\cdot\underbrace{[\mathbf{I}_3\mid-\mathbf{X}_O]}_{\text{Translation to camera frame}}\) = \([\mathbf{KR}\mid-\mathbf{KRX}_{O}]=[\mathbf{H}\mid\mathbf{h}]\)

H=KR (a 3×3 matrix).

h=\(-\mathbf{KRX}_{O}\) (a 3×1 vector).

Therefore, \(\mathbf{X}_{O}=-\mathbf{R}^{\top}\mathbf{K}^{-1}\mathbf{h}\)

Since \(\mathbf{H}^{-1}=\mathbf{R}^{\top}\mathbf{K}^{-1}\), this is the same as \(\mathbf{X}_{O}=-\mathbf{H}^{-1}\mathbf{h}\)
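A possible way to carry out the decomposition numerically is sketched below. The camera center follows the formula above. Splitting H into K and R through a QR factorization of H⁻¹ is an addition that the text above does not spell out, so treat that part as one assumed approach. The function name `decompose_projection` and all test values are hypothetical.

```python
import numpy as np

def decompose_projection(P):
    """Split P = [H | h] into K, R and the camera center X_O."""
    H, h = P[:, :3], P[:, 3]
    X_O = -np.linalg.solve(H, h)           # X_O = -H^{-1} h

    # P is only defined up to scale; choose the sign with det(H) > 0
    # so that the recovered R is a proper rotation.
    if np.linalg.det(H) < 0:
        H = -H

    # H^{-1} = R^T K^{-1} is (orthogonal) x (upper triangular): a QR factorization.
    Q, U = np.linalg.qr(np.linalg.inv(H))
    D = np.diag(np.sign(np.diag(U)))       # flip signs so that diag(U) > 0
    Q, U = Q @ D, D @ U                    # product Q @ U is unchanged
    R = Q.T
    K = np.linalg.inv(U)
    K = K / K[2, 2]                        # remove the free scale: K[2, 2] = 1
    return K, R, X_O

# Hypothetical ground truth, used only to check the round trip.
K_true = np.array([[800.0, 0.5, 320.0], [0.0, 820.0, 240.0], [0.0, 0.0, 1.0]])
a = np.deg2rad(30.0)
R_true = np.array([[np.cos(a), -np.sin(a), 0.0],
                   [np.sin(a),  np.cos(a), 0.0],
                   [0.0,        0.0,       1.0]])
X_O_true = np.array([1.0, 2.0, -3.0])
P = -2.5 * K_true @ R_true @ np.hstack([np.eye(3), -X_O_true.reshape(3, 1)])

K_est, R_est, X_O_est = decompose_projection(P)
```

The arbitrary factor −2.5 on P is there on purpose: the recovered K, R, and X_O should not depend on the overall scale or sign of P.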

  • DLT in a Nutshell

 

We want to solve : 𝑀 ⋅ 𝑝 = 0


Where : M is a 2𝐼 × 12 matrix, 𝐼 is the number of point correspondences, and 𝑝 holds the 12 entries of P

- Why 2 𝐼 × 12 ?

Each point correspondence \(\mathbf{X}_i,\mathbf{x}_i\) gives two rows of 𝑀, because we extract two equations:

・ One equation from the projection into 𝑥

・ One equation from the projection into 𝑦

So:

・ For 𝐼 correspondences → 2 𝐼 equations

・ Each equation is linear in 12 unknowns → 12 columns

Therefore, \(\mathbf{M}\in\mathbb{R}^{2I\times12}\)

\(\mathbf{p}=\mathrm{vec}(\mathbf{P}^{\top})\in\mathbb{R}^{12}\)

- vec() is the vectorization operator: it takes a matrix and stacks its columns into a single column vector. Applying it to \(\mathbf{P}^{\top}\) therefore stacks the rows of P, which is the ordering that the coefficient rows of M assume.

- \(\mathbf{P}\in\mathbb{R}^{3\times4}\) : 3 rows and 4 columns = 12 elements total

- \(\mathbf{P}= \begin{bmatrix} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \end{bmatrix}\), then : \(\mathrm{vec}(\mathbf{P}^{\top})=\mathbf{p}= \begin{bmatrix} p_{11} \\ p_{12} \\ p_{13} \\ p_{14} \\ p_{21} \\ p_{22} \\ p_{23} \\ p_{24} \\ p_{31} \\ p_{32} \\ p_{33} \\ p_{34} \end{bmatrix}\in\mathbb{R}^{12\times1}\)
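Stacking order is easy to mix up in code, so here is a small NumPy sketch that keeps column stacking and row stacking apart. The entries of P are hypothetical and chosen so that each value encodes its own index (12 stands for p12, and so on).

```python
import numpy as np

# Hypothetical P whose entries encode their indices.
P = np.array([[11, 12, 13, 14],
              [21, 22, 23, 24],
              [31, 32, 33, 34]])

# Column stacking: p11, p21, p31, p12, ...
column_stacked = P.flatten(order="F")

# Row stacking: p11, p12, p13, p14, p21, ...
row_stacked = P.reshape(-1)

# Row stacking P is the same as column stacking its transpose.
same_as_transpose = np.array_equal(row_stacked, P.T.flatten(order="F"))

# Going back from a row-stacked 12-vector to the 3x4 matrix.
P_back = row_stacked.reshape(3, 4)
```

The coefficient rows of M used in this post (first four entries for the first row of P, next four for the second row, last four for the third) multiply the row-stacked ordering, so `reshape(3, 4)` is the matching way to turn a solved 𝑝 back into P.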


Each 3D point \((X_i,Y_i,Z_i)\) and corresponding image point \((x_i,y_i)\) gives 2 equations (for x and y).

1. Construct two rows per point

For each point 𝑖 , construct:

\(\mathbf{a}_{x_i}^\top=(-X_i,-Y_i,-Z_i,-1,0,0,0,0,x_iX_i,x_iY_i,x_iZ_i,x_i)\)

\(\mathbf{a}_{y_i}^\top=(0,0,0,0,-X_i,-Y_i,-Z_i,-1,y_iX_i,y_iY_i,y_iZ_i,y_i)\)

2. Stack them to form 𝑀 : \(\mathbf{M}= \begin{bmatrix} \mathbf{a}_{x_1}^\top \\ \mathbf{a}_{y_1}^\top \\ \vdots \\ \mathbf{a}_{x_I}^\top \\ \mathbf{a}_{y_I}^\top \end{bmatrix}\in\mathbb{R}^{2I\times12}\)
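The two steps above can be sketched as follows. The function name `build_design_matrix`, the projection matrix, and the six points are assumptions for illustration. The check at the end confirms that a P which generated the image points satisfies M ⋅ p = 0 when p holds the rows of P.

```python
import numpy as np

def build_design_matrix(points_3d, points_2d):
    """Stack a_x^T and a_y^T for every correspondence into M (2I x 12)."""
    rows = []
    for (X, Y, Z), (x, y) in zip(points_3d, points_2d):
        rows.append([-X, -Y, -Z, -1, 0, 0, 0, 0, x * X, x * Y, x * Z, x])
        rows.append([0, 0, 0, 0, -X, -Y, -Z, -1, y * X, y * Y, y * Z, y])
    return np.array(rows, dtype=float)

# Hypothetical camera and six non-coplanar world points.
P = np.array([[800.0,   0.0, 320.0, 1600.0],
              [  0.0, 800.0, 240.0, 1200.0],
              [  0.0,   0.0,   1.0,    5.0]])
points_3d = np.array([[0.0, 0.0, 0.0],
                      [1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0],
                      [1.0, 1.0, 2.0],
                      [2.0, 1.0, 1.0]])

# Synthesize the image points by projecting with P.
uvw = np.hstack([points_3d, np.ones((6, 1))]) @ P.T
points_2d = uvw[:, :2] / uvw[:, 2:3]

M = build_design_matrix(points_3d, points_2d)
residual = M @ P.reshape(-1)       # p = rows of P stacked
```

With noise-free points, `residual` is zero up to floating-point error.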


  • Final Form

 

\(\mathbf{M}\cdot\mathbf{p}=\mathbf{0}\quad\mathrm{where~}\mathbf{p}=\mathrm{vec}(\mathbf{P}^{\top})\in\mathbb{R}^{12}\) (the rows of P stacked)

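To solve for 𝑝, you'd repeat this for at least 6 point correspondences (12 equations) and solve using SVD: 𝑝 is the right singular vector of 𝑀 that belongs to the smallest singular value. The 3D points must not all lie in one plane, otherwise the solution is not unique.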
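Putting the pieces together, a minimal end-to-end sketch might look like this. It assumes noise-free correspondences, skips the coordinate normalization that practical implementations usually add, and uses a hypothetical function name `dlt` and synthetic test data.

```python
import numpy as np

def dlt(points_3d, points_2d):
    """Estimate P (3x4, up to scale) from I >= 6 point correspondences."""
    rows = []
    for (X, Y, Z), (x, y) in zip(points_3d, points_2d):
        rows.append([-X, -Y, -Z, -1, 0, 0, 0, 0, x * X, x * Y, x * Z, x])
        rows.append([0, 0, 0, 0, -X, -Y, -Z, -1, y * X, y * Y, y * Z, y])
    M = np.array(rows, dtype=float)

    # The right singular vector of the smallest singular value minimizes
    # ||M p|| subject to ||p|| = 1.
    _, _, Vt = np.linalg.svd(M)
    p = Vt[-1]
    return p.reshape(3, 4)             # p holds the rows of P

# Synthetic check with a hypothetical camera and 8 random world points.
rng = np.random.default_rng(0)
P_true = np.array([[800.0,   0.0, 320.0, 1600.0],
                   [  0.0, 800.0, 240.0, 1200.0],
                   [  0.0,   0.0,   1.0,    5.0]])
points_3d = rng.uniform(-1.0, 1.0, size=(8, 3))
uvw = np.hstack([points_3d, np.ones((8, 1))]) @ P_true.T
points_2d = uvw[:, :2] / uvw[:, 2:3]

P_est = dlt(points_3d, points_2d)
P_est = P_est * (P_true[2, 3] / P_est[2, 3])   # fix the arbitrary scale

# Reprojection error of the estimate, in pixels.
uvw_est = np.hstack([points_3d, np.ones((8, 1))]) @ P_est.T
reprojection_error = np.abs(uvw_est[:, :2] / uvw_est[:, 2:3] - points_2d).max()
```

The SVD returns 𝑝 with unit length, so the estimate is rescaled before comparing it with `P_true`; the scale carries no information about the camera.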