YOLO Translation

You Only Look Once

Abstract
We present YOLO, a new approach to object detection.
Prior work on object detection repurposes classifiers to perform detection. (Note: earlier approaches all used classifiers to carry out detection.) Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. (Note: this paper casts detection as a single regression from the image to bounding boxes and their associated class probabilities.) A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. (Note: in one forward pass, a single network predicts the bounding boxes and class probabilities.) Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance. (Note: because the whole detection pipeline is one network, it can be optimized end-to-end on detection performance.)
Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background. (Note: compared with earlier methods, YOLO has larger localization error but fewer background false positives.) Finally, YOLO learns very general representations of objects. It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.

2. Unified Detection
We unify the separate components of object detection into a single neural network. (Note: the separate components of object detection are folded into one network.) Our network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means our network reasons globally about the full image and all the objects in the image. (Note: the network uses the whole image to predict each bounding box, and predicts all boxes for all classes at once; in other words, it reasons globally about the full image and every object in it.)
The YOLO design enables end-to-end training and real-time speeds while maintaining high average precision.
Our system divides the input image into an S × S grid. (Note: the input image is first divided into an S × S grid.) If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object. (Note: whichever cell contains the object's center is responsible for predicting that object.)
Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the box is that it predicts. (Note: each cell predicts B bounding boxes and a confidence score per box; the score reflects both how confident the model is that the box contains an object and how accurate it thinks the predicted box is.) Formally we define confidence as Pr(Object) * IOU_pred^truth. If no object exists in that cell, the confidence scores should be zero. Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth. (Note: if there is no object, the confidence should be zero; otherwise it should equal the IOU with the ground truth.)
Each bounding box consists of 5 predictions: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell. The width and height are predicted relative to the whole image. Finally the confidence prediction represents the IOU between the predicted box and any ground truth box. (Note: each bounding box carries 5 values: x and y are the box center relative to its grid cell, w and h are the width and height relative to the whole image, and the confidence is the IOU with the ground truth box.)
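To make the confidence target concrete, here is a minimal IOU computation between two boxes given as (x_min, y_min, x_max, y_max); the box format and function name are this note's own illustration, not code from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Empty intersection when the boxes do not overlap.
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# For a box responsible for an object, the confidence target is
# Pr(Object) * IOU = 1 * iou(predicted_box, ground_truth_box).
```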
Each grid cell also predicts C conditional class probabilities, Pr(Class_i | Object). These probabilities are conditioned on the grid cell containing an object. We only predict one set of class probabilities per grid cell, regardless of the number of boxes B. (Note: each cell predicts C class probabilities, conditioned on the cell containing an object, and only one set of class probabilities per cell no matter how many boxes B it predicts.)
At test time we multiply the conditional class probabilities and the individual box confidence predictions,

Pr(Class_i | Object) * Pr(Object) * IOU_pred^truth = Pr(Class_i) * IOU_pred^truth,

which gives us class-specific confidence scores for each box. (Note: at test time, multiplying the conditional class probability by the box confidence score gives, for each box, a class-specific confidence score.) These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.
For evaluating YOLO on PASCAL VOC, we use S = 7, B = 2. PASCAL VOC has 20 labelled classes so C = 20. Our final prediction is a 7 × 7 × 30 tensor.
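A sketch of how the 7 × 7 × 30 output could be split into boxes, confidences, and class probabilities, and how the class-specific scores are formed. The per-cell layout assumed here ([x, y, w, h, conf] for each of the B boxes, then the C class probabilities) and the array names are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

S, B, C = 7, 2, 20
pred = np.random.rand(S, S, B * 5 + C)   # stand-in for the network output (7 x 7 x 30)

# Assumed layout per cell: [x, y, w, h, conf] * B, then C class probabilities.
boxes = pred[..., :B * 5].reshape(S, S, B, 5)   # (7, 7, 2, 5)
box_conf = boxes[..., 4]                        # Pr(Object) * IOU, one per box
class_prob = pred[..., B * 5:]                  # Pr(Class_i | Object), one set per cell

# Class-specific confidence: Pr(Class_i | Object) * Pr(Object) * IOU = Pr(Class_i) * IOU
class_scores = box_conf[..., None] * class_prob[:, :, None, :]   # (7, 7, 2, 20)
```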

2.1. Network Design
We implement this model as a convolutional neural network and evaluate it on the PASCAL VOC detection dataset [9]. (Note: the model is built as a convolutional network and evaluated on PASCAL VOC.) The initial convolutional layers of the network extract features from the image while the fully connected layers predict the output probabilities and coordinates. (Note: the early convolutional layers extract image features; the fully connected layers predict the output probabilities and coordinates.)
Our network architecture is inspired by the GoogLeNet model for image classification [34]. Our network has 24 convolutional layers followed by 2 fully connected layers. (Note: 24 convolutional layers followed by 2 fully connected layers.) Instead of the inception modules used by GoogLeNet, we simply use 1 × 1 reduction layers followed by 3 × 3 convolutional layers, similar to Lin et al. [22]. (Note: inception modules are not used; they are replaced by 1 × 1 reduction layers followed by 3 × 3 convolutions.) The full network is shown in Figure 3.
Figure 3: The Architecture. Our detection network has 24 convolutional layers followed by 2 fully connected layers. Alternating 1 × 1 convolutional layers reduce the feature space from preceding layers. We pretrain the convolutional layers on the ImageNet classification task at half the resolution (224 × 224 input image) and then double the resolution for detection. (Note: pretraining is done at half resolution, then the resolution is doubled for detection.)
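A sketch of the repeated pattern described above, a 1 × 1 "reduction" convolution followed by a 3 × 3 convolution, written with PyTorch modules. The channel counts and the helper name are illustrative assumptions; this is not the paper's published code.

```python
import torch.nn as nn

def reduction_block(in_ch, mid_ch, out_ch):
    """1x1 reduction convolution followed by a 3x3 convolution,
    each followed by the leaky ReLU used throughout the network."""
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size=1),
        nn.LeakyReLU(0.1),
        nn.Conv2d(mid_ch, out_ch, kernel_size=3, padding=1),
        nn.LeakyReLU(0.1),
    )

# Example: squeeze 512 channels down to 256 with a 1x1 layer, then back to 512 with a 3x3 layer.
block = reduction_block(512, 256, 512)
```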
We also train a fast version of YOLO designed to push the boundaries of fast object detection. Fast YOLO uses a neural network with fewer convolutional layers (9 instead of 24) and fewer filters in those layers. Other than the size of the network, all training and testing parameters are the same between YOLO and Fast YOLO. (Note: Fast YOLO uses only 9 convolutional layers and fewer filters; everything else is identical.)
The final output of our network is the 7 × 7 × 30 tensor of predictions. (Note: the final output is a 7 × 7 × 30 tensor.)

2.2. Training
We pretrain our convolutional layers on the ImageNet 1000-class competition dataset [30]. (Note: the convolutional layers are pretrained on the 1000-class ImageNet competition data.) For pretraining we use the first 20 convolutional layers from Figure 3 followed by an average-pooling layer and a fully connected layer. (Note: pretraining uses the first 20 convolutional layers plus an average-pooling layer and a fully connected layer.) We train this network for approximately a week and achieve a single crop top-5 accuracy of 88% on the ImageNet 2012 validation set, comparable to the GoogLeNet models in Caffe's Model Zoo [24]. (Note: about a week of training reaches 88% top-5 accuracy.) We use the Darknet framework for all training and inference [26]. (Note: the 88% figure is only a comparison against GoogLeNet; the network itself is trained and run in the Darknet framework, not with GoogLeNet or Caffe.)
We then convert the model to perform detection. Ren et al. show that adding both convolutional and connected layers to pretrained networks can improve performance [29]. Following their example, we add four convolutional layers and two fully connected layers with randomly initialized weights. (Note: four convolutional layers and two fully connected layers with randomly initialized weights are added.) Detection often requires fine-grained visual information so we increase the input resolution of the network from 224 × 224 to 448 × 448. (Note: because detection needs finer visual information, the input resolution is raised from 224 × 224 to 448 × 448.)
Our final layer predicts both class probabilities and bounding box coordinates. We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1. We parametrize the bounding box x and y coordinates to be offsets of a particular grid cell location so they are also bounded between 0 and 1. (Note: the final layer predicts class probabilities and box coordinates; w and h are normalized by the image width and height, and x and y are expressed as offsets of the box center from its grid cell, so all four values fall between 0 and 1.)
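One way to read this parametrization: (x, y) are the offsets of the box center from the top-left corner of its grid cell, measured in units of the cell size, while w and h are the box size divided by the image size. A minimal sketch of that encoding follows; all names and the exact function shape are this note's own assumptions.

```python
def encode_box(cx, cy, bw, bh, img_w, img_h, S=7):
    """Encode a ground-truth box (center cx, cy and size bw, bh in pixels)
    into YOLO-style targets, all falling in [0, 1]."""
    cell_w, cell_h = img_w / S, img_h / S
    col = min(int(cx / cell_w), S - 1)   # grid cell that contains the center
    row = min(int(cy / cell_h), S - 1)
    x = cx / cell_w - col                # center offset within the cell, in [0, 1)
    y = cy / cell_h - row
    w = bw / img_w                       # size normalized by the whole image
    h = bh / img_h
    return row, col, (x, y, w, h)
```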
We use a linear activation function for the final layer and all other layers use the following leaky rectified linear activation: (Note: the last layer uses a linear activation; every other layer uses the leaky ReLU below.)
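Written out, with the negative slope of 0.1 used in the original paper, the activation is:

$$
\phi(x) =
\begin{cases}
x, & \text{if } x > 0 \\
0.1\,x, & \text{otherwise}
\end{cases}
$$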
We optimize for sum-squared error in the output of our model. (Note: the model output is optimized with sum-squared error.) We use sum-squared error because it is easy to optimize, however it does not perfectly align with our goal of maximizing average precision. (Note: sum-squared error is chosen only because it is easy to optimize, not because it matches average precision.) It weights localization error equally with classification error which may not be ideal. (Note: it also weights localization error and classification error equally, which is not ideal.)
Also, in every image many grid cells do not contain any object. This pushes the "confidence" scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on. (Note: most cells contain no object, so the many no-object confidence terms dominate the summed gradient and push all confidences toward zero, drowning out the few cells that do contain objects; this imbalance can destabilize training and make it diverge early.)
To remedy this, we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don't contain objects. We use two parameters, λcoord and λnoobj, to accomplish this. We set λcoord = 5 and λnoobj = 0.5. (Note: the imbalance is corrected by re-weighting the loss terms.)
Sum-squared error also equally weights errors in large boxes and small boxes. (Note: weighting large and small boxes equally is also a problem; a large box is far less sensitive to a small deviation than a small box is.) Our error metric should reflect that small deviations in large boxes matter less than in small boxes. To partially address this we predict the square root of the bounding box width and height instead of the width and height directly. (Note: this is partially addressed by predicting the square root of the width and height.)
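A quick worked example of why the square root helps (the numbers are made up purely for illustration): the same absolute width error of 0.02 on a large box (w = 0.5) and on a small box (w = 0.05) gives

$$
(\sqrt{0.52}-\sqrt{0.5})^2 \approx 2.0\times10^{-4}
\qquad\text{vs.}\qquad
(\sqrt{0.07}-\sqrt{0.05})^2 \approx 1.7\times10^{-3},
$$

so the small box is penalized roughly eight times more heavily for the same deviation, whereas plain squared error on w and h would penalize both equally.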
YOLO predicts multiple bounding boxes per grid cell. (Note: each cell predicts several boxes.) At training time we only want one bounding box predictor to be responsible for each object. We assign one predictor to be "responsible" for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall. (Note: only one box should be responsible for each object, so the predictor with the highest current IOU against the ground truth is chosen; this specializes the predictors, and each one becomes better at certain sizes, aspect ratios, or classes, improving overall recall.)
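A minimal sketch of the "responsible predictor" rule, reusing the `iou` helper defined earlier; this is a paraphrase of the rule for illustration, not the paper's training code.

```python
def responsible_box(predicted_boxes, gt_box):
    """Among the B boxes predicted by the cell containing the object,
    pick the one with the highest IOU against the ground truth."""
    ious = [iou(b, gt_box) for b in predicted_boxes]
    j = max(range(len(ious)), key=lambda k: ious[k])
    return j, ious[j]   # index of the responsible predictor and its IOU (confidence target)
```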
During training we optimize the following, multi-part loss function:
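Written out, following the original paper, the loss is:

$$
\begin{aligned}
& \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
&+ \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left(C_i-\hat{C}_i\right)^2
+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left(C_i-\hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c\,\in\,\text{classes}} \left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}
$$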

where 1_i^obj denotes if an object appears in cell i (Note: whether cell i is responsible for an object) and 1_ij^obj denotes that the jth bounding box predictor in cell i is "responsible" for that prediction. (Note: whether bounding box j in cell i is responsible for the object.)
Note that the loss function only penalizes classification error if an object is present in that grid cell (hence the conditional class probability discussed earlier). It also only penalizes bounding box coordinate error if that predictor is "responsible" for the ground truth box (i.e. has the highest IOU of any predictor in that grid cell). (Note: the loss therefore punishes only classification error in cells that contain an object, and only coordinate error for the box predictor responsible for the object.)
We train the network for about 135 epochs on the training and validation data sets from PASCAL VOC 2007 and 2012. When testing on 2012 we also include the VOC 2007 test data for training. Throughout training we use a batch size of 64, a momentum of 0.9 and a decay of 0.0005. (Note: the training hyperparameters.)
Our learning rate schedule is as follows: For the first epochs we slowly raise the learning rate from 10^-3 to 10^-2. If we start at a high learning rate our model often diverges due to unstable gradients. (Note: starting directly at a high learning rate makes the model diverge; early in training the weights are still essentially random, so gradients are large and noisy, and a high learning rate amplifies them into unstable updates, hence the slow warm-up.) We continue training with 10^-2 for 75 epochs, then 10^-3 for 30 epochs, and finally 10^-4 for 30 epochs.
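The schedule, spelled out as a small per-epoch lookup. The warm-up length is an assumed value, since the text above only says "the first epochs"; the rest follows the stated 75/30/30 split.

```python
def learning_rate(epoch, warmup_epochs=5):
    """Piecewise schedule described above: warm up from 1e-3 to 1e-2,
    then 75 epochs at 1e-2, 30 at 1e-3, and 30 at 1e-4.
    warmup_epochs is an assumption; the exact warm-up length is not given."""
    if epoch < warmup_epochs:                     # linear warm-up
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup_epochs
    if epoch < warmup_epochs + 75:
        return 1e-2
    if epoch < warmup_epochs + 75 + 30:
        return 1e-3
    return 1e-4
```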
To avoid overfitting we use dropout and extensive data augmentation. (Note: dropout and data augmentation are used against overfitting.) A dropout layer with rate = 0.5 after the first connected layer prevents co-adaptation between layers [18]. For data augmentation we introduce random scaling and translations of up to 20% of the original image size. We also randomly adjust the exposure and saturation of the image by up to a factor of 1.5 in the HSV color space. (Note: the augmentation is random scaling and translation of up to 20% of the image size, plus random exposure and saturation adjustment by up to a factor of 1.5 in HSV.)
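A sketch of the HSV exposure/saturation jitter described above, using OpenCV. The sampling range (uniform between 1/1.5 and 1.5) and the clipping are this note's guesses at a reasonable implementation, not details taken from the paper.

```python
import cv2
import numpy as np

def jitter_hsv(img_bgr, max_factor=1.5):
    """Randomly scale saturation and value (exposure) by a factor in [1/max_factor, max_factor]."""
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    s_scale = np.random.uniform(1.0 / max_factor, max_factor)
    v_scale = np.random.uniform(1.0 / max_factor, max_factor)
    hsv[..., 1] = np.clip(hsv[..., 1] * s_scale, 0, 255)   # saturation
    hsv[..., 2] = np.clip(hsv[..., 2] * v_scale, 0, 255)   # value / exposure
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)
```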

2.3. Inference
Just like in training, predicting detections for a test image only requires one network evaluation. On PASCAL VOC the network predicts 98 bounding boxes per image and class probabilities for each box. YOLO is extremely fast at test time since it only requires a single network evaluation, unlike classifier-based methods. (Note: at test time each image yields 98 bounding boxes (7 × 7 cells × 2 boxes), each with its class probabilities, from a single network evaluation.)
The grid design enforces spatial diversity in the bounding box predictions. Often it is clear which grid cell an object falls into and the network only predicts one box for each object. (Note: usually it is clear which cell an object falls into, and only one box is predicted per object.) However, some large objects or objects near the border of multiple cells can be well localized by multiple cells. (Note: large objects, or objects near the border of several cells, can be localized by multiple cells; non-maximal suppression is then used to merge the duplicates.) Non-maximal suppression can be used to fix these multiple detections. While not critical to performance as it is for R-CNN or DPM, non-maximal suppression adds 2-3% in mAP.
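A minimal per-class non-maximal suppression sketch, reusing the `iou` helper from earlier. The 0.5 overlap threshold is a common default chosen here for illustration, not a value quoted in the paper.

```python
def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes, dropping any box that overlaps a kept box too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[i], boxes[best]) < iou_threshold]
    return keep
```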

2.4. Limitations of YOLO
YOLO imposes strong spatial constraints on bounding box predictions since each grid cell only predicts two boxes and can only have one class. This spatial constraint limits the number of nearby objects that our model can predict. Our model struggles with small objects that appear in groups, such as flocks of birds. (Note: because each cell predicts only two boxes and one class, nearby objects are hard to predict, especially small objects that appear in groups.)
Since our model learns to predict bounding boxes from data, it struggles to generalize to objects in new or unusual aspect ratios or configurations. (Note: the model has trouble generalizing to objects with unusual aspect ratios or configurations it has not seen in training.) Our model also uses relatively coarse features for predicting bounding boxes since our architecture has multiple downsampling layers from the input image.
Finally, while we train on a loss function that approximates detection performance, our loss function treats errors the same in small bounding boxes versus large bounding boxes. A small error in a large box is generally benign but a small error in a small box has a much greater effect on IOU. Our main source of error is incorrect localizations. (Note: the loss treats errors in large and small boxes the same, yet the same small error hurts the IOU of a small box far more; the main source of error is incorrect localization.)

Reposted from blog.csdn.net/gaotihong/article/details/82465691