Focal Loss and the Man Behind It: RetinaNet


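Anyone working in CV will have heard of Focal Loss; when positive and negative samples are imbalanced, it is often the first thing people try. Fewer people, however, know RetinaNet, the one-stage object detector proposed in the same paper. That is understandable: the paper itself says RetinaNet brings no major innovation at the model level and that its accuracy comes mainly from Focal Loss. Still, as the representative one-stage detector of the R-CNN family, I think RetinaNet is worth studying. It deepens your understanding of the R-CNN family and makes later models easier to learn, since many of them (FCOS and YOLACT, for example) borrow from RetinaNet. This article introduces Focal Loss and RetinaNet and also gives some concrete code.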

Focal Loss

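Class imbalance is one of the major difficulties in training object detectors (the survey Imbalance Problems in Object Detection: A Review is recommended reading). The most severe form is the imbalance between positive and negative samples: an image usually contains few objects, while most current detectors sample densely at every location of a fully convolutional feature map, and this holds for anchor-based and anchor-free methods alike. In a two-stage model such as Faster R-CNN, the first-stage RPN filters out a large share of the negatives, so the second-stage detection head only has to process a small number of proposals. The head additionally samples positives and negatives at a fixed ratio (for example 1:3) or uses OHEM (online hard example mining) to reduce the imbalance further. The detection part of a one-stage method has to handle a huge number of candidate locations directly, the vast majority of them negative. SSD's strategy is hard negative mining: from the large pool of negatives it keeps the top-k with the highest loss so that the positive-to-negative ratio is 1:3. RPN is in essence also a one-stage detector, and its training strategy is sampling as well: a fixed number N of samples (256 in RPN) is drawn from each image, positives and negatives are sampled randomly and separately with up to N/2 positives, and negatives fill the remaining slots when there are not enough positives. The implementation is very simple: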

import torch


def subsample_labels(labels, num_samples, positive_fraction, bg_label):
    """
    Return `num_samples` (or fewer, if not enough found)
    random samples from `labels` which is a mixture of positives & negatives.
    It will try to return as many positives as possible without
    exceeding `positive_fraction * num_samples`, and then try to
    fill the remaining slots with negatives.

    Args:
        labels (Tensor): (N, ) label vector with values:
            * -1: ignore
            * bg_label: background ("negative") class
            * otherwise: one or more foreground ("positive") classes
        num_samples (int): The total number of labels with value >= 0 to return.
            Values that are not sampled will be filled with -1 (ignore).
        positive_fraction (float): The number of subsampled labels with values > 0
            is `min(num_positives, int(positive_fraction * num_samples))`. The number
            of negatives sampled is `min(num_negatives, num_samples - num_positives_sampled)`.
            In other words, if there are not enough positives, the sample is filled with
            negatives. If there are also not enough negatives, then as many elements are
            sampled as is possible.
        bg_label (int): label index of background ("negative") class.

    Returns:
        pos_idx, neg_idx (Tensor):
            1D vector of indices. The total length of both is `num_samples` or fewer.
    """
    # indices of foreground (neither ignored nor background) and background labels
    positive = torch.nonzero((labels != -1) & (labels != bg_label), as_tuple=True)[0]
    negative = torch.nonzero(labels == bg_label, as_tuple=True)[0]

    num_pos = int(num_samples * positive_fraction)
    # protect against not enough positive examples
    num_pos = min(positive.numel(), num_pos)
    num_neg = num_samples - num_pos
    # protect against not enough negative examples
    num_neg = min(negative.numel(), num_neg)

    # randomly select positive and negative examples
    perm1 = torch.randperm(positive.numel(), device=positive.device)[:num_pos]
    perm2 = torch.randperm(negative.numel(), device=negative.device)[:num_neg]

    pos_idx = positive[perm1]
    neg_idx = negative[perm2]
    return pos_idx, neg_idx

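Unlike sampling, Focal Loss tackles the imbalance from a different angle: it rescales the cross-entropy loss dynamically according to confidence. As the confidence in the correct class grows, the weight on the loss decays towards 0, so the training loss focuses on hard examples while the large number of easy examples contribute very little. Focal Loss (FL) is introduced here for binary classification, where the most common loss is cross entropy (CE), defined as:

$$\mathrm{CE}(p, y) = \begin{cases} -\log(p) & \text{if } y = 1 \\ -\log(1 - p) & \text{otherwise} \end{cases}$$

Here $y \in \{\pm 1\}$ is the ground-truth label (1 for a positive, -1 for a negative) and $p \in [0, 1]$ is the probability the model assigns to the positive class. We can further define:

$$p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise} \end{cases}$$

so that CE can be written compactly as:

$$\mathrm{CE}(p, y) = \mathrm{CE}(p_t) = -\log(p_t)$$

More generally, positives can be given a weight $\alpha \in [0, 1]$ and negatives a weight $1 - \alpha$. With $\alpha_t$ defined analogously to $p_t$, the loss becomes:

$$\mathrm{CE}(p_t) = -\alpha_t \log(p_t)$$

In Figure 1 the blue curve is CE. If samples with a large $p_t$ are regarded as easy examples (the figure in the paper marks $p_t > 0.5$ as well-classified), the curve shows that the loss of these easy examples is still far from negligible. Because they make up a very large share of all samples, their summed loss drowns out the loss of the hard examples. This is the problem with using CE loss to train object detectors.

To fix this, FL adds a modulating factor $(1 - p_t)^\gamma$ to CE, where $\gamma \ge 0$ is a tunable focusing parameter. FL is defined as:

$$\mathrm{FL}(p_t) = -(1 - p_t)^\gamma \log(p_t)$$

Figure 1 plots the FL curves for several values of $\gamma \in [0, 5]$. When $p_t$ is small the sample is misclassified, the modulating factor is close to 1 and the loss is unaffected. As $p_t$ approaches 1 the modulating factor approaches 0, so the loss of easy samples that are already classified correctly is greatly reduced. With $\gamma = 0$, FL is equivalent to CE, and the paper finds that $\gamma = 2$ works best. In that case a sample with $p_t = 0.9$ has a CE loss 100 times its FL, since $(1 - 0.9)^2 = 0.01$. FL therefore lowers the loss of easy examples far more than CE does and makes training concentrate on hard examples. Adding the class weight, FL becomes:

$$\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)$$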
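
To make the down-weighting concrete, here is a minimal sketch in plain Python that evaluates CE and FL for a few values of $p_t$. The helper names `ce` and `fl` are illustrative and do not come from the paper or from any library.

```python
import math


def ce(p_t: float) -> float:
    """Cross entropy of a sample whose true-class probability is p_t."""
    return -math.log(p_t)


def fl(p_t: float, gamma: float = 2.0, alpha_t: float = 1.0) -> float:
    """Focal loss: CE scaled by the modulating factor (1 - p_t) ** gamma."""
    return alpha_t * (1.0 - p_t) ** gamma * ce(p_t)


# Compare the two losses from a hard example (p_t = 0.1) to an easy one (p_t = 0.99).
for p_t in (0.1, 0.5, 0.9, 0.99):
    ratio = ce(p_t) / fl(p_t)
    print(f"p_t={p_t:<4}  CE={ce(p_t):.4f}  FL={fl(p_t):.6f}  CE/FL={ratio:.1f}")
```

With $\gamma = 2$ the ratio CE/FL is $1 / (1 - p_t)^2$, so it is about 1.2 for the hard example at $p_t = 0.1$ and 100 at $p_t = 0.9$.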

FL is also very simple to implement. Here is Facebook's official implementation:

import torch
import torch.nn.functional as F


def sigmoid_focal_loss(
    inputs: torch.Tensor,
    targets: torch.Tensor,
    alpha: float = -1,
    gamma: float = 2,
    reduction: str = "none",
) -> torch.Tensor:
    """
    Loss used in RetinaNet for dense detection: https://arxiv.org/abs/1708.02002.

    Args:
        inputs: A float tensor of arbitrary shape.
            The predictions for each example.
        targets: A float tensor with the same shape as inputs. Stores the binary
            classification label for each element in inputs
            (0 for the negative class and 1 for the positive class).
        alpha: (optional) Weighting factor in range (0,1) to balance
            positive vs negative examples. Default = -1 (no weighting).
        gamma: Exponent of the modulating factor (1 - p_t) to
            balance easy vs hard examples.
        reduction: 'none' | 'mean' | 'sum'
            'none': No reduction will be applied to the output.
            'mean': The output will be averaged.
            'sum': The output will be summed.

    Returns:
        Loss tensor with the reduction option applied.
    """
    p = torch.sigmoid(inputs)
    # per-element CE computed from logits for numerical stability
    ce_loss = F.binary_cross_entropy_with_logits(inputs, targets, reduction="none")
    # p_t is the probability assigned to the true class
    p_t = p * targets + (1 - p) * (1 - targets)
    loss = ce_loss * ((1 - p_t) ** gamma)

    if alpha >= 0:
        # alpha for positives, (1 - alpha) for negatives
        alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
        loss = alpha_t * loss

    if reduction == "mean":
        loss = loss.mean()
    elif reduction == "sum":
        loss = loss.sum()

    return loss

RetinaNet

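RetinaNet can be viewed as a multi-class upgrade of RPN. Like RPN, RetinaNet uses an FPN backbone and a similar anchor mechanism; after all, both belong to the R-CNN line of work. The overall architecture is shown in Figure 2: an FPN backbone plus a detection part, which consists of a classification branch and a box regression branch.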

Backbone

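RetinaNet's backbone is a ResNet-based FPN. FPN adds a top-down pathway and lateral connections to the original CNN, as shown in Figure 3d. Figure 3a builds the feature pyramid from an image pyramid, which is very inefficient; Figure 3b uses only the last feature map; Figure 3c takes features from different levels of the CNN, which is the idea behind SSD. FPN goes one step further by adding a top-down pathway and fusing features from different levels through lateral connections.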

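FPN's lateral connection is shown in Figure 4. The higher-level feature is upsampled by 2x (with simple nearest-neighbor interpolation), and the lower-level feature passes through a 1x1 convolution that reduces its channel dimension. The two features then have identical shapes (h, w and c all match) and are added directly. A 3x3 convolution follows to reduce the adverse effects of upsampling.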
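
The upsample-and-add step can be sketched in plain Python on a single-channel map. This is a simplified illustration only: the 1x1 and 3x3 convolutions are left out, and the helper names `upsample_nearest_2x` and `merge` are made up for this example.

```python
def upsample_nearest_2x(feat):
    """Nearest-neighbor 2x upsampling of a 2D map given as a list of rows."""
    out = []
    for row in feat:
        wide = [v for v in row for _ in range(2)]  # repeat every column twice
        out.append(wide)
        out.append(list(wide))                     # repeat every row twice
    return out


def merge(top, lateral):
    """Element-wise sum of the upsampled top-down map and the lateral map."""
    up = upsample_nearest_2x(top)
    return [[u + l for u, l in zip(up_row, lat_row)]
            for up_row, lat_row in zip(up, lateral)]


top = [[1, 2],
       [3, 4]]                           # coarse, higher-level map (2x2)
lateral = [[10] * 4 for _ in range(4)]   # finer map after the 1x1 conv (4x4)
merged = merge(top, lateral)
for row in merged:
    print(row)
```

Each value of the coarse map is spread over a 2x2 block before the addition, which is why the two maps must differ in resolution by exactly a factor of 2.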

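The original ResNet has 4 stages, whose output features are denoted $C_2, C_3, C_4, C_5$, with strides of $4, 8, 16, 32$ relative to the input image. FPN is built starting from $C_5$: a 1x1 convolution first produces a new feature with C channels (FPN uses 256, and every FPN level has the same number of channels), and new features at the different levels are then generated top-down. They are denoted $P_2, P_3, P_4, P_5$ and correspond one-to-one to the ResNet features. In addition, $P_5$ is downsampled with stride 2 (implemented as a 1x1 max pooling with stride 2) to obtain one more feature, $P_6$. FPN therefore ends up with 5 levels, $P_2$ to $P_6$, with strides $4, 8, 16, 32, 64$ and C channels each. Faster R-CNN uses exactly this FPN structure, but RetinaNet changes it slightly. First, it uses only $C_3, C_4, C_5$ from ResNet, so FPN yields $P_3, P_4, P_5$; in effect $P_2$ is dropped. $P_2$ has a stride of 4, so the feature map is large and dropping it saves computation. This is necessary because, as discussed later, RetinaNet has more anchors and a heavier detection head than RPN. Second, two new features $P_6$ and $P_7$ are added: $P_6$ is obtained by applying a 3x3 convolution with stride 2 to $C_5$, and $P_7$ by applying ReLU and a 3x3 convolution with stride 2 to $P_6$. RetinaNet's backbone thus also outputs 5 levels, $P_3$ to $P_7$, with strides $8, 16, 32, 64, 128$. As an aside, the FCOS backbone also uses $P_3$ to $P_7$, which can be seen as borrowed from RetinaNet, whereas YOLOv3's backbone is an FPN based on DarkNet-53 that extracts features at 3 levels, with strides $8, 16, 32$.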
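
As a quick illustration of the level and stride bookkeeping, the sketch below lists the strides of the two pyramids and the resulting feature-map sizes. The 640x640 input and the ceil-division rule are assumptions made for this example, and the function names are hypothetical.

```python
def level_strides(levels):
    """Stride of pyramid level P_l relative to the input image is 2 ** l."""
    return {f"P{l}": 2 ** l for l in levels}


def feature_sizes(image_size, levels):
    """Side length of each level for a square input (ceil division, assumed)."""
    return {name: -(-image_size // stride)
            for name, stride in level_strides(levels).items()}


faster_rcnn_fpn = level_strides(range(2, 7))  # P2..P6
retinanet_fpn = level_strides(range(3, 8))    # P3..P7
sizes = feature_sizes(640, range(3, 8))       # illustrative 640x640 input

print(faster_rcnn_fpn)
print(retinanet_fpn)
print(sizes)
```

Dropping $P_2$ removes the largest map (160x160 for this input), which is where most of the saved computation comes from.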

Anchor

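RetinaNet's anchors are similar to RPN's. RPN takes $P_2$ to $P_6$ as input and places anchors of a single scale at each position of each level, with areas of $32^2, 64^2, 128^2, 256^2, 512^2$ respectively, but in 3 aspect ratios ($1{:}2$, $1{:}1$, $2{:}1$). RetinaNet takes $P_3$ to $P_7$ as input with the same base setting, and additionally uses 3 anchor sizes at each position by scaling the base size by $2^0, 2^{1/3}, 2^{2/3}$, so each position has A = 9 anchors. Across all levels the smallest anchor size is 32 and the largest is 813. During training RetinaNet uses the same anchor matching strategy as RPN, a two-threshold strategy based on IoU. For each anchor, compute its IoU with every GT box and take the maximum. If the maximum exceeds the upper threshold, the anchor is a positive and is responsible for predicting the GT with which it has the largest IoU. If it is below the lower threshold, the anchor is a negative. If it lies between the two thresholds, the anchor is ignored and takes no part in training. A GT may therefore be matched to several anchors, but it can also happen that the largest IoU between a GT and any anchor is below the upper threshold. Even though the threshold condition is not met, that GT should still be matched to the anchor with which it has the highest IoU. RPN uses (0.3, 0.7) as its lower and upper thresholds, while RetinaNet uses (0.4, 0.5). The implementation is as follows: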

import torch
from detectron2.structures import pairwise_iou

# `targets_per_image` and `anchors_per_image` come from the surrounding training code
# compute IoUs between GT and anchors [M, N]
match_quality_matrix = pairwise_iou(targets_per_image.gt_boxes, anchors_per_image)

BELOW_LOW_THRESHOLD = -1
BETWEEN_THRESHOLDS = -2
low_threshold = 0.4
high_threshold = 0.5

# match_quality_matrix is M (gt) x N (predicted)
# Max over gt elements (dim 0) to find best gt candidate for each prediction
matched_vals, matches = match_quality_matrix.max(dim=0)
all_matches = matches.clone()  # for allow_low_quality_matches

# Assign candidate matches with low quality to negative (unassigned) values
below_low_threshold = matched_vals < low_threshold
between_thresholds = (matched_vals >= low_threshold) & (matched_vals < high_threshold)
matches[below_low_threshold] = BELOW_LOW_THRESHOLD
matches[between_thresholds] = BETWEEN_THRESHOLDS

# For each gt, find the prediction with which it has highest quality
highest_quality_foreach_gt, _ = match_quality_matrix.max(dim=1)
# Find highest quality match available, even if it is low, including ties
gt_pred_pairs_of_highest_quality = torch.nonzero(
    match_quality_matrix == highest_quality_foreach_gt[:, None]
)
# Restore the original GT index for those anchors so every GT keeps a match
pred_inds_to_update = gt_pred_pairs_of_highest_quality[:, 1]
matches[pred_inds_to_update] = all_matches[pred_inds_to_update]

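The result of matching is an N-dimensional vector `matches` (N is the number of anchors) whose values are the index of the GT matched to each anchor, so the corresponding label and box can be looked up when computing the loss. A value of -1 marks a negative and a value of -2 marks an anchor to be ignored.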
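
Returning to the anchor configuration described before the matching code, the numbers quoted there (9 anchors per position, sizes from 32 to 813) can be checked with a short sketch. The variable names and the height/width convention for the aspect ratio are assumptions of this example, not taken from any library.

```python
import math

base_sizes = {"P3": 32, "P4": 64, "P5": 128, "P6": 256, "P7": 512}
octave_scales = [2 ** (i / 3) for i in range(3)]  # 2^0, 2^(1/3), 2^(2/3)
aspect_ratios = [0.5, 1.0, 2.0]                   # height / width (assumed convention)


def anchors_for_level(base):
    """Return (w, h) of the anchors placed at each position of one level."""
    shapes = []
    for scale in octave_scales:
        area = (base * scale) ** 2
        for ratio in aspect_ratios:
            w = math.sqrt(area / ratio)  # keep the area fixed while changing the ratio
            h = w * ratio
            shapes.append((w, h))
    return shapes


all_sizes = [base * s for base in base_sizes.values() for s in octave_scales]
print(len(anchors_for_level(32)), round(min(all_sizes)), round(max(all_sizes)))
```

The largest size is $512 \cdot 2^{2/3} \approx 812.7$, which rounds to the 813 mentioned above.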

In addition, the encoding and decoding scheme between anchors and GT boxes is exactly the same as in Faster R-CNN.
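
For reference, the Faster R-CNN parameterization encodes a GT box relative to an anchor as center offsets normalized by the anchor size plus log-scale width and height ratios. The sketch below is a minimal plain-Python version with hypothetical function names; it assumes no per-coordinate weighting or clamping, which real implementations may add.

```python
import math


def _center_form(box):
    """Convert (x1, y1, x2, y2) to (cx, cy, w, h)."""
    w, h = box[2] - box[0], box[3] - box[1]
    return box[0] + 0.5 * w, box[1] + 0.5 * h, w, h


def encode(anchor, gt):
    """Regression targets (dx, dy, dw, dh) that map `anchor` onto `gt`."""
    ax, ay, aw, ah = _center_form(anchor)
    gx, gy, gw, gh = _center_form(gt)
    return ((gx - ax) / aw, (gy - ay) / ah, math.log(gw / aw), math.log(gh / ah))


def decode(anchor, deltas):
    """Apply predicted deltas to `anchor` and return (x1, y1, x2, y2)."""
    ax, ay, aw, ah = _center_form(anchor)
    dx, dy, dw, dh = deltas
    cx, cy = ax + dx * aw, ay + dy * ah
    w, h = aw * math.exp(dw), ah * math.exp(dh)
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)


anchor = (0.0, 0.0, 32.0, 32.0)
gt = (4.0, 6.0, 44.0, 30.0)
deltas = encode(anchor, gt)
restored = decode(anchor, deltas)
print(deltas)
print(restored)
```

Encoding is used to build regression targets during training, and decoding turns the predicted offsets back into boxes at inference time.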

Detection module

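The detection module consists of a classification branch and a box regression branch. The classification branch predicts the class probabilities (over $K$ classes) for each of the $A$ anchors at every position, and the box regression branch predicts the offset between each anchor at every position and its GT. The classification branch is made of 4 3x3 convolutions (ReLU activation, 256 channels) followed by a final 3x3 convolution with $KA$ output channels; a sigmoid then gives the probability of each class for each anchor. For RetinaNet, each position therefore amounts to $KA$ binary classification problems. The box regression branch is similar, except that its final output has $4A$ channels, which also shows that RetinaNet's box regression is class-agnostic. The detection module shares its parameters across all FPN levels, as RPN does, but RetinaNet's detection module is multi-class and deeper.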
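
To get a feel for the output sizes, the sketch below counts head channels and predictions. The class count of 80 (COCO) and the 640x640 input, for which $P_3$ to $P_7$ have side lengths 80, 40, 20, 10 and 5, are illustrative assumptions.

```python
K = 80  # number of classes (COCO is assumed for illustration)
A = 9   # anchors per position

cls_channels = K * A   # output channels of the classification branch
box_channels = 4 * A   # output channels of the box regression branch

# Side lengths of P3..P7 for an assumed 640x640 input (strides 8..128)
level_sizes = [80, 40, 20, 10, 5]
num_anchors = sum(s * s for s in level_sizes) * A
num_binary_problems = num_anchors * K

print(cls_channels, box_channels)
print(num_anchors, num_binary_problems)
```

Tens of thousands of anchors against a handful of objects per image is exactly the imbalance that focal loss is designed to handle.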

Model initialization

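The backbone is of course a ResNet pre-trained on ImageNet, and the other newly added convolution layers use ordinary initialization. One extra point to note is that the bias of the last convolution layer in the classification branch is initialized to:

$$b = -\log\left(\frac{1 - \pi}{\pi}\right)$$

This amounts to setting a prior probability $\pi$ that each anchor is predicted as a positive at the start of training; the paper uses $\pi = 0.01$. With this strategy alone, a ResNet-50 RetinaNet reaches an AP of 30.2 on COCO. It works because most anchors are negatives, and the prior greatly lowers the loss of the negatives at the start of training, which makes training easier; without it, RetinaNet's loss easily becomes NaN. The same strategy is studied in detail in another paper, Is Sampling Heuristics Necessary in Training Deep Object Detectors?, which shows that with a few small improvements a good RetinaNet can be trained without sampling and without focal loss.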
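
The effect of the prior can be checked numerically. The minimal sketch below (helper names are illustrative) computes the bias for $\pi = 0.01$ and compares the initial CE loss of a negative anchor with and without it, assuming the incoming activation is close to zero at the start of training.

```python
import math


def prior_bias(pi: float) -> float:
    """Bias b such that sigmoid(b) == pi."""
    return -math.log((1.0 - pi) / pi)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


b = prior_bias(0.01)

# Initial CE loss of a negative anchor, -log(1 - p), for each initialization
ce_neg_zero_bias = -math.log(1.0 - sigmoid(0.0))  # bias 0 -> p = 0.5
ce_neg_prior = -math.log(1.0 - sigmoid(b))        # prior bias -> p = 0.01

print(f"b = {b:.4f}, sigmoid(b) = {sigmoid(b):.4f}")
print(f"CE per negative: {ce_neg_zero_bias:.4f} (zero bias) vs {ce_neg_prior:.4f} (prior)")
```

With tens of thousands of negative anchors per image, lowering the per-anchor loss from about 0.69 to about 0.01 keeps the negatives from dominating the first training iterations.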

Training and inference

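As in Faster R-CNN, RetinaNet's box regression loss is smooth L1, but the classification loss is focal loss, for which the paper's best parameters are $\gamma = 2$ and $\alpha = 0.25$. The classification loss is the sum of the focal loss over all anchors, divided by the number of anchors assigned to a positive class. The paper also compares FL experimentally with OHEM and with the 1:3 OHEM used in SSD, and finds that models trained with FL perform better.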
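
The normalization of the classification loss can be sketched as follows. This is a simplified single-class version in plain Python: ignored anchors are assumed to have been removed already, and the `max(1, ...)` guard against images with no positives is an assumption of this example rather than something stated in the text.

```python
import math


def focal_term(p: float, target: int, alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Alpha-balanced focal loss of one anchor; target is 1 (positive) or 0 (negative)."""
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def classification_loss(probs, targets) -> float:
    """Sum the focal loss over all anchors, then divide by the number of positives."""
    total = sum(focal_term(p, t) for p, t in zip(probs, targets))
    num_pos = sum(targets)
    return total / max(1, num_pos)  # guard for images without positives (assumed)


probs = [0.9, 0.1, 0.1]  # predicted positive-class probabilities
targets = [1, 0, 0]      # one positive anchor, two negative anchors
loss = classification_loss(probs, targets)
print(f"{loss:.6f}")
```

Dividing by the number of positives rather than by the number of all anchors keeps the loss scale stable, since the many easy negatives contribute almost nothing to the sum.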

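At inference time, the top 1K detections of each level are taken first and then filtered with a score threshold of 0.05 to remove the negatives. By this point the number of detections is greatly reduced, so decoding only the boxes that remain, rather than every detection the model predicts, speeds up inference. Finally, the detections from all levels are concatenated and overlapping boxes are removed by NMS with an IoU threshold of 0.5 to give the final result. The code is as follows: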

# Snippet from the model's per-image inference method: `self`, `box_cls`,
# `box_delta`, `anchors` and `image_size` come from the surrounding code.
boxes_all = []
scores_all = []
class_idxs_all = []

# Iterate over every feature level
for box_cls_i, box_reg_i, anchors_i in zip(box_cls, box_delta, anchors):
    # (HxWxAxK,)
    box_cls_i = box_cls_i.flatten().sigmoid_()

    # Keep top k top scoring indices only.
    num_topk = min(self.topk_candidates, box_reg_i.size(0))
    # torch.sort is actually faster than .topk (at least on GPUs)
    predicted_prob, topk_idxs = box_cls_i.sort(descending=True)
    predicted_prob = predicted_prob[:num_topk]
    topk_idxs = topk_idxs[:num_topk]

    # filter out the proposals with low confidence score
    keep_idxs = predicted_prob > self.score_threshold
    predicted_prob = predicted_prob[keep_idxs]
    topk_idxs = topk_idxs[keep_idxs]

    anchor_idxs = topk_idxs // self.num_classes
    classes_idxs = topk_idxs % self.num_classes

    box_reg_i = box_reg_i[anchor_idxs]
    anchors_i = anchors_i[anchor_idxs]
    # predict boxes
    predicted_boxes = self.box2box_transform.apply_deltas(box_reg_i, anchors_i.tensor)

    boxes_all.append(predicted_boxes)
    scores_all.append(predicted_prob)
    class_idxs_all.append(classes_idxs)

boxes_all, scores_all, class_idxs_all = [
    cat(x) for x in [boxes_all, scores_all, class_idxs_all]
]
keep = batched_nms(boxes_all, scores_all, class_idxs_all, self.nms_threshold)
keep = keep[: self.max_detections_per_image]

result = Instances(image_size)
result.pred_boxes = Boxes(boxes_all[keep])
result.scores = scores_all[keep]
result.pred_classes = class_idxs_all[keep]

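Note that because $K$ binary classifiers are used, a given anchor at a given position may end up producing several detections that have different classes but the same box.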
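
The toy example below shows how this happens when the flattened score index is split into an anchor index and a class index. The scores are made-up numbers, with 3 classes and 2 anchors chosen purely for illustration.

```python
num_classes = 3  # K, illustrative

# scores[a][k]: sigmoid score of anchor a for class k (made-up values)
scores = [
    [0.02, 0.10, 0.01],
    [0.70, 0.01, 0.60],  # anchor 1 is confident about two classes
]

# Flatten to shape (A * K,), sort by score and apply the score threshold
flat = [s for row in scores for s in row]
order = sorted(range(len(flat)), key=lambda i: flat[i], reverse=True)
keep = [i for i in order if flat[i] > 0.05]

# Recover (anchor index, class index, score) from each flat index
detections = [(i // num_classes, i % num_classes, flat[i]) for i in keep]
print(detections)
# [(1, 0, 0.7), (1, 2, 0.6), (0, 1, 0.1)]
```

Anchor 1 appears twice with different classes. Because the NMS step runs per class, both detections survive and share the same decoded box.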

Comparison with other models

Compared with SSD and YOLOv2, RetinaNet performs better. The comparison is shown in the figure below:

To conclude, here is how RetinaNet compares with other models of the same kind:

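Compared with RPN: as noted earlier, RetinaNet can be viewed as a multi-class upgrade of RPN. The backbone and FPN settings are basically the same, except that RPN is trained with a simple sampling method while RetinaNet uses FL;
Compared with SSD: SSD also uses multi-scale features, but RetinaNet uses FPN. SSD's anchors are similar to Faster R-CNN's, with slight differences in anchor size and ratio. In addition, SSD is trained with 1:3 OHEM and uses a softmax loss;
Compared with YOLOv3: YOLOv3's backbone is an FPN-like structure based on DarkNet-53 with only 3 levels, though overall it is close to RetinaNet's backbone. YOLOv3's anchors are generated by k-means, its matching strategy is based on both center and IoU, and its training loss is a plain sigmoid (binary cross-entropy) loss.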

After this comparison, it turns out that anchor-based one-stage detectors do not differ all that much from one another.