Deep Learning/Object Detection

Faster RCNN

Naranjito 2024. 1. 30. 18:27

 

  • Faster RCNN

 

 

1. CNN(VGG16) : Input the image to the ConvNet and get the feature map

2. RPN : Extract region proposals through anchor boxes

3. RoI Projection : Get the RoI feature map through RoI pooling

4. Fast R-CNN : Feed the RoI feature map to Fast R-CNN; the RPN and Fast R-CNN are trained by alternating training


  • Anchor box

 

 

A method for capturing objects of various sizes: the same concept as a bounding box, but with predefined, differing scales and aspect ratios.

It predefines a total of nine different anchor boxes from three scales ([128, 256, 512]) and three aspect ratios ([1:1, 1:2, 2:1]).
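The nine anchor shapes can be derived in a few lines. This is a minimal sketch (the function name `anchor_shapes` is mine, not from any library): for each scale s and ratio r = w/h, the width and height are chosen so the area stays s².

```python
import numpy as np

# Sketch: derive the 9 anchor (w, h) pairs from 3 scales and 3 aspect
# ratios, keeping the area w * h equal to scale^2 for every ratio.
def anchor_shapes(scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    shapes = []
    for s in scales:
        for r in ratios:           # r = w / h
            h = s / np.sqrt(r)     # from w * h = s^2 with w = r * h
            w = r * h
            shapes.append((w, h))
    return np.array(shapes)

shapes = anchor_shapes()
print(shapes.shape)  # (9, 2)
```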


  • Predefined Anchor box
For the 1:2 aspect ratio (w = ½h), the box dimensions follow from the area constraint:

$$w \times h = s^2, \quad w = \frac{1}{2} h \;\Rightarrow\; \frac{1}{2} h^2 = s^2, \quad h = \sqrt{2s^2}, \quad w = \frac{\sqrt{2s^2}}{2}$$

 

w = width
h = height
s = scale

 

The anchor boxes are created at the center of each grid cell in the original image. The anchor positions are fixed by the sub-sampling ratio applied to the original image, and the nine predefined anchor boxes are created around each anchor.

 

In the figure above, the size of the original image is 600x800 and the sub-sampling ratio is 1/16. In this case, the number of anchors is 1900 (= 600/16 x 800/16), so a total of 17100 (= 1900 x 9) anchor boxes are produced.
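The counts above can be checked with simple arithmetic. One assumption in this sketch: the grid sizes are rounded up (ceil), which is how 600/16 yields the 38x50 = 1900 grid the post states.

```python
import math

# Worked check of the anchor counts: a 600x800 image with sub-sampling
# ratio 1/16 gives a 38x50 grid of anchors (sizes rounded up), with 9
# anchor boxes generated per anchor.
def anchor_counts(height, width, sub_sampling=16, boxes_per_anchor=9):
    grid_h = math.ceil(height / sub_sampling)
    grid_w = math.ceil(width / sub_sampling)
    anchors = grid_h * grid_w
    return anchors, anchors * boxes_per_anchor

anchors, boxes = anchor_counts(600, 800)
print(anchors, boxes)  # 1900 17100
```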

 

Using this method, nine times more bounding boxes are created than with conventional fixed-size bounding boxes, making it possible to capture objects of a much wider variety of sizes.


  • RPN(Region Proposal Network)

 

 

A network that extracts region proposals from the original image.

When anchor boxes are created from the original image, a large number of region proposals are generated.

The RPN outputs a class score and bounding box coefficients for each region proposal. The class score only classifies whether or not an object is contained.


 

Worked example for the image above.

 

1) VGG16 : Get the feature map (8x8) after applying the sub-sampling ratio (1/100) to the original image (800x800); the number of channels is 512. 8x8x512

2) Convolution : 3x3 Conv; convolution operates on the feature map from (1), with padding so as to keep the original feature map size.

3) Get the Feature Map : 8x8x512 

4) Convolution for Class Score : 1x1 Conv; convolution operates on the feature map from (3). 8x8x2x9

5) Convolution for Bounding Box Coefficients : 1x1 Conv; convolution operates on the feature map from (3). 8x8x4x9
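Steps 4) and 5) can be sketched at the shape level in plain numpy (a real implementation would use a DL framework; `conv1x1` is my illustrative name). A 1x1 convolution is just a per-pixel matrix multiply over the channel axis, so the spatial size 8x8 is preserved while the channels change.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shape-level sketch of the RPN head: a 1x1 convolution over an
# (H, W, C) feature map is a matmul over the channel axis, leaving
# the spatial dimensions untouched.
def conv1x1(feature_map, out_channels):
    in_channels = feature_map.shape[-1]
    weights = rng.standard_normal((in_channels, out_channels)) * 0.01
    return feature_map @ weights              # (H, W, out_channels)

feature_map = rng.standard_normal((8, 8, 512))   # output of step 3)
cls_scores = conv1x1(feature_map, 2 * 9)         # step 4): 8x8x(2x9)
bbox_coeffs = conv1x1(feature_map, 4 * 9)        # step 5): 8x8x(4x9)
print(cls_scores.shape, bbox_coeffs.shape)
```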

Result of RPN(Region Proposal Network)


  • Classifier
$$p = \begin{cases} 1 & \text{if } IoU > 0.7 \\ -1 & \text{if } IoU < 0.3 \\ 0 & \text{otherwise} \end{cases}$$

 

1 : Positive, there is an object

-1 : Negative, there is no object

0 : neglected (not used in training)
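The labeling rule above can be sketched as follows (function names `iou` and `label_anchor` are mine): each anchor is compared against its best-matching ground-truth box, and the IoU thresholds 0.7 and 0.3 decide positive, negative, or neglected.

```python
# Sketch of the anchor labeling rule: IoU against the best-matching
# ground-truth box decides positive (1), negative (-1), or neglected (0).
# Boxes are (x1, y1, x2, y2).
def iou(box_a, box_b):
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def label_anchor(anchor, gt_boxes, hi=0.7, lo=0.3):
    best = max(iou(anchor, gt) for gt in gt_boxes)
    if best > hi:
        return 1       # positive: contains an object
    if best < lo:
        return -1      # negative: background
    return 0           # neglected during training

gt = [(0, 0, 100, 100)]
print(label_anchor((0, 0, 100, 100), gt))      # 1  (IoU = 1.0)
print(label_anchor((200, 200, 300, 300), gt))  # -1 (IoU = 0.0)
print(label_anchor((0, 0, 100, 50), gt))       # 0  (IoU = 0.5)
```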


  • Bounding Box Regression

 

- Bounding box regression t

 

It uses 4 coordinate values (t); t is a vector:

$$t_x = (x - x_a)/w_a \qquad t_y = (y - y_a)/h_a \qquad t_w = \log(w/w_a) \qquad t_h = \log(h/h_a)$$

 

 

t_x , t_y : center coordinates of the box
t_w , t_h : width and height of the box
x , y , w , h : predicted box
x_a , y_a , w_a , h_a : anchor box
- Ground-truth vector t*

 

$$t^*_x = (x^* - x_a)/w_a \qquad t^*_y = (y^* - y_a)/h_a \qquad t^*_w = \log(w^*/w_a) \qquad t^*_h = \log(h^*/h_a)$$

 

x* , y* , w* , h* : ground-truth box
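The parameterization above is invertible, which is what lets the network regress offsets rather than raw coordinates. A minimal sketch (boxes as (cx, cy, w, h); `encode`/`decode` are my names):

```python
import numpy as np

# Sketch of the bounding box parameterization: encode turns a box into
# the offset vector t relative to an anchor; decode inverts it.
def encode(box, anchor):
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return np.array([(x - xa) / wa, (y - ya) / ha,
                     np.log(w / wa), np.log(h / ha)])

def decode(t, anchor):
    xa, ya, wa, ha = anchor
    return np.array([t[0] * wa + xa, t[1] * ha + ya,
                     wa * np.exp(t[2]), ha * np.exp(t[3])])

anchor = (50.0, 50.0, 128.0, 128.0)
box = (60.0, 40.0, 100.0, 150.0)
t = encode(box, anchor)
print(np.allclose(decode(t, anchor), box))  # True
```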
  • Multi-task loss
$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p^*_i) + \lambda \frac{1}{N_{reg}} \sum_i p^*_i L_{reg}(t_i, t^*_i)$$

 

i : index of an anchor in the mini-batch

p_i : predicted probability that anchor i contains an object

p*_i : ground-truth label; 1 if the anchor is positive, 0 if it is negative

t_i : parameterized coordinates (coefficients) of the predicted bounding box

t*_i : parameterized coordinates of the ground-truth box

L_cls : classification loss (log loss)

L_reg : bounding box regression loss; it is active only for positive anchor boxes (only when there is an object), since there is no need to estimate a bounding box for negatives, i.e. the background

N_cls : mini-batch size (set to 256 in the paper)

N_reg : number of anchor locations (set to 2400 in the paper)

λ : balancing parameter (default=10)
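The multi-task loss can be sketched numerically (assumptions: binary log loss for L_cls, smooth L1 for L_reg; `rpn_loss` and `smooth_l1` are my names):

```python
import numpy as np

# Numeric sketch of the multi-task loss: log loss for classification,
# smooth L1 for regression, the latter gated by p* so only positive
# anchors contribute.
def smooth_l1(x):
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)

def rpn_loss(p, p_star, t, t_star, n_cls=256, n_reg=2400, lam=10.0):
    eps = 1e-7
    l_cls = -(p_star * np.log(p + eps) + (1 - p_star) * np.log(1 - p + eps))
    l_reg = smooth_l1(t - t_star).sum(axis=1) * p_star  # positives only
    return l_cls.sum() / n_cls + lam * l_reg.sum() / n_reg

p = np.array([0.9, 0.2])        # predicted object probabilities
p_star = np.array([1.0, 0.0])   # ground-truth labels
t = np.zeros((2, 4)); t_star = np.zeros((2, 4))
loss = rpn_loss(p, p_star, t, t_star)
print(loss > 0)  # True
```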
  • Training Faster R-CNN

 

 

1) Feature extraction by pre-trained VGG16

 

- Input : 800x800x3 sized image

- Process : feature extraction by pre-trained VGG16, sub-sampling ratio is 1/16

- Output : 50x50x512 sized feature map

 

2) Generate Anchors by Anchor generation layer

 

Before extracting region proposals, anchor boxes must first be created for the original image.

- Input : 800x800x3 sized image

- Process : generate anchors

- Output : 22500(=50x50x9) anchor boxes

 

3) Class scores and Bounding box regressor by RPN

 

- Input : 50x50x512 sized feature map

- Process : Region proposal by RPN

- Output : class scores(50x50x2x9 sized feature map) and bounding box regressors(50x50x4x9 sized feature map)

 

4) Select anchors for training Fast R-CNN

 

- Input : top-N ranked anchor boxes (after applying non-maximum suppression to remove redundant, overlapping boxes), ground truth boxes (positive if IoU is over 0.5, negative if IoU is between 0.1 and 0.5)

- Process : select region proposals for training Fast R-CNN

- Output : positive/negative samples with target regression coefficients

 

5) Max pooling by RoI pooling

 

- Input : 50x50x512 sized feature map, positive/negative samples with target regression coefficients

- Process : RoI pooling

- Output : 7x7x512 sized feature map
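Step 5) can be sketched as follows (a simplified `roi_pool` of my own, not torchvision's: it divides an RoI on the feature map into a 7x7 grid of roughly equal bins and max-pools each bin):

```python
import numpy as np

# Sketch of RoI pooling: split the RoI into a 7x7 grid of bins and take
# the channel-wise max of each bin, so any RoI size maps to 7x7x512.
def roi_pool(feature_map, roi, output_size=7):
    x1, y1, x2, y2 = roi                        # RoI in feature-map coords
    ys = np.linspace(y1, y2, output_size + 1).astype(int)
    xs = np.linspace(x1, x2, output_size + 1).astype(int)
    c = feature_map.shape[-1]
    out = np.zeros((output_size, output_size, c))
    for i in range(output_size):
        for j in range(output_size):
            bin_ = feature_map[ys[i]:max(ys[i + 1], ys[i] + 1),
                               xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[i, j] = bin_.max(axis=(0, 1))   # max pool within the bin
    return out

fm = np.random.default_rng(0).standard_normal((50, 50, 512))
pooled = roi_pool(fm, (5, 5, 40, 40))
print(pooled.shape)  # (7, 7, 512)
```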

 

6) Train Fast R-CNN by Multi-task loss

 

- Input : 7x7x512 sized feature map

- Process 

    - feature extraction by fc layer

    - classification by Classifier

    - bounding box regression by Bounding box regressor

    - Train Fast R-CNN by Multi-task loss

- Output : loss (log loss + smooth L1 loss)
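The shapes in steps 1)-6) above chain together as a quick end-to-end arithmetic check (all sizes taken from the text):

```python
# End-to-end shape check of training steps 1)-6).
image = (800, 800, 3)
sub_sampling = 16
fm_h, fm_w = image[0] // sub_sampling, image[1] // sub_sampling  # step 1)
anchors = fm_h * fm_w * 9                                        # step 2)
cls_shape = (fm_h, fm_w, 2 * 9)    # step 3) class scores
reg_shape = (fm_h, fm_w, 4 * 9)    # step 3) box regressors
pooled_shape = (7, 7, 512)         # step 5) after RoI pooling
print((fm_h, fm_w), anchors, cls_shape, reg_shape, pooled_shape)
# (50, 50) 22500 (50, 50, 18) (50, 50, 36) (7, 7, 512)
```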

 

https://herbwood.tistory.com/10

https://bkshin.tistory.com/entry/%EB%85%BC%EB%AC%B8-%EB%A6%AC%EB%B7%B0-Faster-R-CNN-%ED%86%BA%EC%95%84%EB%B3%B4%EA%B8%B0

https://incredible.ai/deep-learning/2018/03/17/Faster-R-CNN/#training-rpn

https://towardsdatascience.com/understanding-and-implementing-faster-r-cnn-a-step-by-step-guide-11acfff216b0
