[ML Paper] R2CNN-Rotational Region CNN for Orientation Robust Scene Text Detection

숄구-ml 2022. 5. 23. 16:01

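R2CNN is a network that detects objects by predicting inclined bounding boxes (axis-aligned boxes rotated to an arbitrary orientation) on top of axis-aligned bounding boxes (AABBs: boxes whose face normals all coincide with the coordinate axes). Because it is built on Faster R-CNN, the region proposal network (RPN) first proposes axis-aligned bboxes, and the ROI pooling stage then extracts features at several fixed sizes and concatenates them. Finally, inclined non-maximum suppression is applied to obtain the rotated boxes. (paper link)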

Region Proposal Network

The RPN stage proposes axis-aligned bboxes.
Where Faster R-CNN uses the anchor scales [8, 16, 32], R2CNN adds one smaller scale to detect small text, giving [4, 8, 16, 32].
The RPN loss is constructed in the same way as the RPN loss in Faster R-CNN.
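To make the effect of the extra scale concrete, here is a minimal sketch that lists the anchor sizes for both scale sets. The base size of 16 and the aspect ratios (0.5, 1, 2) are assumed Faster R-CNN style defaults, not values taken from this post, and `anchor_sizes` is a hypothetical helper name.

```python
def anchor_sizes(base_size=16, scales=(4, 8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Return a (width, height) pair for every ratio/scale combination.

    base_size and ratios are assumptions (Faster R-CNN style defaults).
    """
    sizes = []
    for ratio in ratios:                      # ratio = height / width
        for scale in scales:
            area = (base_size * scale) ** 2   # anchor area in pixels
            width = (area / ratio) ** 0.5
            height = width * ratio
            sizes.append((round(width), round(height)))
    return sizes


r2cnn_anchors = anchor_sizes()                          # scales [4, 8, 16, 32]
faster_rcnn_anchors = anchor_sizes(scales=(8, 16, 32))  # scales [8, 16, 32]
print(len(r2cnn_anchors), len(faster_rcnn_anchors))
print(sorted(set(r2cnn_anchors) - set(faster_rcnn_anchors)))
```

Under these assumptions the added scale contributes three small anchors (one per aspect ratio), which is what lets the RPN match small text.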

ROI Pooling

The proposals from the RPN stage go through ROI pooling and are pooled to the fixed sizes [7x7, 11x3, 3x11]. The paper uses scene text detection as its test case, and because text is so often much wider than it is tall (or much taller than it is wide), the 11x3 and 3x11 sizes were added.
Because fully connected layers follow, proposals of different sizes have to be brought to a fixed size, which is why the ROI pooling stage is needed.
The pooled features are concatenated and fed into the fully connected layers, which then output the objectness (text / non-text) scores, the axis-aligned bbox coordinates, and the inclined bbox coordinates.
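The multi-size pooling and concatenation can be sketched in plain Python as follows. This is only an illustration on a single-channel feature map stored as a list of lists; reading each pooled size as (height, width) and the integer bin-boundary rule are assumptions, and `roi_max_pool` and `multi_size_roi_features` are hypothetical helper names.

```python
POOL_SIZES = ((7, 7), (11, 3), (3, 11))  # assumed to mean (height, width)


def roi_max_pool(feature, roi, out_h, out_w):
    """Max-pool the region roi = (x0, y0, x1, y1), end-exclusive, into out_h * out_w values."""
    x0, y0, x1, y1 = roi
    roi_h, roi_w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        # Floor for the bin start, ceil for the bin end; every bin holds >= 1 cell.
        ys = y0 + (i * roi_h) // out_h
        ye = max(ys + 1, y0 + -(-((i + 1) * roi_h) // out_h))
        for j in range(out_w):
            xs = x0 + (j * roi_w) // out_w
            xe = max(xs + 1, x0 + -(-((j + 1) * roi_w) // out_w))
            pooled.append(max(feature[y][x]
                              for y in range(ys, ye) for x in range(xs, xe)))
    return pooled


def multi_size_roi_features(feature, roi):
    """Pool one ROI at every size and concatenate the flattened results."""
    features = []
    for out_h, out_w in POOL_SIZES:
        features.extend(roi_max_pool(feature, roi, out_h, out_w))
    return features


# Toy 20 x 30 feature map whose value encodes its position.
feature = [[y * 30 + x for x in range(30)] for y in range(20)]
vec = multi_size_roi_features(feature, roi=(2, 3, 26, 15))
print(len(vec))  # 7*7 + 11*3 + 3*11 values for one channel
```

Whatever the proposal's shape, the concatenated vector has the same length, which is the property the fully connected layers rely on.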

Inclined non-maximum suppression

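Normal NMS removes, among overlapping boxes, those whose IoU with a higher-scoring box reaches a given threshold.
If this is applied as is to inclined boxes, with the IoU computed on axis-aligned boxes, correctly detected objects can be lost, as in the image above. Inclined non-maximum suppression is therefore applied separately, so that the right boxes are kept.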
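The difference can be sketched with a small polygon-based NMS. The code below is an illustrative stand-in, not the paper's implementation: it computes the IoU of convex quadrilaterals with Sutherland-Hodgman clipping, and the threshold of 0.3, the toy boxes, and all function names are assumptions made for the example.

```python
def signed_area(poly):
    """Shoelace formula; positive for counter-clockwise vertex order."""
    n = len(poly)
    return sum(poly[i][0] * poly[(i + 1) % n][1] - poly[(i + 1) % n][0] * poly[i][1]
               for i in range(n)) / 2.0


def clip_convex(subject, clip):
    """Sutherland-Hodgman intersection of two convex polygons given as (x, y) lists."""
    output = list(subject)
    clip = list(clip) if signed_area(clip) >= 0 else list(clip)[::-1]
    for i in range(len(clip)):
        a, b = clip[i], clip[(i + 1) % len(clip)]

        def side(p):
            # >= 0 when p is on the inner (left) side of the directed edge a -> b.
            return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

        polygon, output = output, []
        for j in range(len(polygon)):
            p, q = polygon[j], polygon[(j + 1) % len(polygon)]
            sp, sq = side(p), side(q)
            if sp >= 0:
                output.append(p)      # p lies inside this clip edge
            if sp * sq < 0:           # edge p -> q crosses the clip edge
                t = sp / (sp - sq)
                output.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
    return output


def iou(p, q):
    inter_poly = clip_convex(p, q)
    inter = abs(signed_area(inter_poly)) if len(inter_poly) >= 3 else 0.0
    union = abs(signed_area(p)) + abs(signed_area(q)) - inter
    return inter / union if union > 0 else 0.0


def nms(polys, scores, iou_threshold=0.3):
    """Greedy NMS: keep a box unless it overlaps an already kept, higher-scoring box."""
    order = sorted(range(len(polys)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(polys[i], polys[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


def aabb(poly):
    """Axis-aligned bounding box of a polygon, returned as a 4-point polygon."""
    xs, ys = [x for x, _ in poly], [y for _, y in poly]
    return [(min(xs), min(ys)), (max(xs), min(ys)), (max(xs), max(ys)), (min(xs), max(ys))]


# Two parallel 45-degree text lines plus a near-duplicate of the first one.
line_a = [(0, 1), (1, 0), (11, 10), (10, 11)]
line_b = [(3, -2), (4, -3), (14, 7), (13, 8)]
dup_a = [(0.2, 1.2), (1.2, 0.2), (11.2, 10.2), (10.2, 11.2)]
polys, scores = [line_a, line_b, dup_a], [0.9, 0.8, 0.7]

keep_inclined = nms(polys, scores)                # IoU on the inclined boxes
keep_axis = nms([aabb(p) for p in polys], scores)  # IoU on axis-aligned boxes
print(keep_inclined, keep_axis)
```

In this toy case the two slanted lines do not overlap at all as polygons, yet their axis-aligned boxes overlap enough for normal NMS to drop the second line; the inclined variant keeps both and still removes the near-duplicate.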

Loss Function

Lcls(p, t): the text / non-text classification loss (log loss), where t is the class label and p is the probability computed by the softmax function.
Lreg: the axis-aligned and inclined boxes both use the same loss function (smooth L1 loss). For the axis-aligned bbox, however, the regression is over (center x, center y, width, height), whereas for the inclined bbox it is over (x of the top-left corner, y of the top-left corner, x of the second corner in clockwise order, y of the second corner in clockwise order, height).
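As a rough sketch of how these terms combine for a single proposal: the weights `lambda1` and `lambda2`, their default of 1.0, and the convention that the regression terms count only for text proposals (as in Fast R-CNN style multi-task losses) are assumptions here, and `r2cnn_loss` is a hypothetical name.

```python
import math


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def r2cnn_loss(p_text, is_text, v, v_star, u, u_star, lambda1=1.0, lambda2=1.0):
    """Multi-task loss for one proposal.

    p_text  : predicted probability of the text class (after softmax)
    is_text : 1 for a text proposal, 0 for background
    v, v_star : predicted / target (x, y, w, h) of the axis-aligned box
    u, u_star : predicted / target (x1, y1, x2, y2, h) of the inclined box
    lambda1, lambda2 : balancing weights (placeholder values)
    """
    p_true = p_text if is_text else 1.0 - p_text
    l_cls = -math.log(p_true)                                   # log loss
    l_axis = sum(smooth_l1(a - b) for a, b in zip(v, v_star))
    l_incl = sum(smooth_l1(a - b) for a, b in zip(u, u_star))
    # Regression only contributes for text proposals (is_text == 1).
    return l_cls + is_text * (lambda1 * l_axis + lambda2 * l_incl)


loss = r2cnn_loss(0.9, 1,
                  v=(0.1, 0.0, 0.0, 0.0), v_star=(0.0, 0.0, 0.0, 0.0),
                  u=(0.0, 0.0, 0.0, 0.0, 2.0), u_star=(0.0, 0.0, 0.0, 0.0, 0.0))
print(loss)
```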

Training

For data augmentation during training, the images were rotated by the angles (-90, -75, -60, -45, -30, -15, 0, 15, 30, 45, 60, 75, 90) degrees.
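When an image is rotated, its ground-truth quadrilaterals have to be rotated with it. A minimal sketch of that step follows; rotating about the image center and the sign convention of the angle are assumptions, and the image itself would need the same transform from an image library, which is not shown.

```python
import math

ANGLES = (-90, -75, -60, -45, -30, -15, 0, 15, 30, 45, 60, 75, 90)


def rotate_points(points, angle_deg, center):
    """Rotate (x, y) points around center by angle_deg.

    Positive angles are counter-clockwise in a y-up frame; with image
    coordinates (y pointing down) the same matrix looks clockwise.
    """
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    cx, cy = center
    rotated = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        rotated.append((cx + dx * cos_t - dy * sin_t,
                        cy + dx * sin_t + dy * cos_t))
    return rotated


# Hypothetical text box in a 1280 x 720 image, rotated by every angle.
text_quad = [(600, 340), (680, 340), (680, 380), (600, 380)]
augmented = {angle: rotate_points(text_quad, angle, center=(640, 360))
             for angle in ANGLES}
print(augmented[90])
```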
The image's shortest side is set to 720 and its longest side to 1280, because the training and testing images in ICDAR 2015 [21] have the size (width: 1280, height: 720).
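One common way to apply such a setting is the Faster R-CNN style rule: scale the shorter side to 720 unless that would push the longer side past 1280. Whether R2CNN resizes in exactly this way is an assumption, and `resize_scale` is a hypothetical helper.

```python
def resize_scale(width, height, short_side=720, long_side=1280):
    """Scale factor that brings the short side to short_side, capped by long_side."""
    scale = short_side / min(width, height)
    if scale * max(width, height) > long_side:
        scale = long_side / max(width, height)
    return scale


print(resize_scale(1280, 720))   # an ICDAR 2015 sized image needs no resizing
print(resize_scale(2560, 1440))  # a larger image with the same aspect ratio is halved
```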
