Yolopdf

The document introduces YOLO (You Only Look Once), a novel approach to real-time object detection that frames the task as a regression problem, allowing a single neural network to predict bounding boxes and class probabilities directly from full images. YOLO achieves remarkable speed, processing images at 45 frames per second, and a faster version, Fast YOLO, at 155 frames per second, while maintaining high accuracy compared to traditional methods. Despite its advantages, YOLO faces challenges in accurately localizing small objects and generalizing to unusual configurations.

Uploaded by

Mc Swathi
arXiv:1506.02640v5 [cs.CV] 9 May 2016

You Only Look Once: Unified, Real-Time Object Detection

Joseph Redmon∗, Santosh Divvala∗†, Ross Girshick¶, Ali Farhadi∗†
University of Washington∗, Allen Institute for AI†, Facebook AI Research¶
http://pjreddie.com/yolo/

Abstract

We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.

Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background. Finally, YOLO learns very general representations of objects. It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.

1. Introduction

Humans glance at an image and instantly know what objects are in the image, where they are, and how they interact. The human visual system is fast and accurate, allowing us to perform complex tasks like driving with little conscious thought. Fast, accurate algorithms for object detection would allow computers to drive cars without specialized sensors, enable assistive devices to convey real-time scene information to human users, and unlock the potential for general purpose, responsive robotic systems.
Current detection systems repurpose classifiers to perform detection. To detect an object, these systems take a classifier for that object and evaluate it at various locations and scales in a test image. Systems like deformable parts models (DPM) use a sliding window approach where the classifier is run at evenly spaced locations over the entire image [10].

More recent approaches like R-CNN use region proposal methods to first generate potential bounding boxes in an image and then run a classifier on these proposed boxes. After classification, post-processing is used to refine the bounding boxes, eliminate duplicate detections, and rescore the boxes based on other objects in the scene [13]. These complex pipelines are slow and hard to optimize because each individual component must be trained separately.

Figure 1: The YOLO Detection System. Processing images with YOLO is simple and straightforward. Our system (1) resizes the input image to 448 × 448, (2) runs a single convolutional network on the image, and (3) thresholds the resulting detections by the model's confidence.

We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.

YOLO is refreshingly simple: see Figure 1. A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance. This unified model has several benefits over traditional methods of object detection.

First, YOLO is extremely fast. Since we frame detection as a regression problem we don't need a complex pipeline.
We simply run our neural network on a new image at test time to predict detections. Our base network runs at 45 frames per second with no batch processing on a Titan X GPU and a fast version runs at more than 150 fps. This means we can process streaming video in real-time with less than 25 milliseconds of latency. Furthermore, YOLO achieves more than twice the mean average precision of other real-time systems. For a demo of our system running in real-time on a webcam please see our project webpage: http://pjreddie.com/yolo/.

Second, YOLO reasons globally about the image when making predictions. Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time so it implicitly encodes contextual information about classes as well as their appearance. Fast R-CNN, a top detection method [14], mistakes background patches in an image for objects because it can't see the larger context. YOLO makes less than half the number of background errors compared to Fast R-CNN.

Third, YOLO learns generalizable representations of objects. When trained on natural images and tested on artwork, YOLO outperforms top detection methods like DPM and R-CNN by a wide margin. Since YOLO is highly generalizable it is less likely to break down when applied to new domains or unexpected inputs.

YOLO still lags behind state-of-the-art detection systems in accuracy. While it can quickly identify objects in images it struggles to precisely localize some objects, especially small ones. We examine these tradeoffs further in our experiments.

All of our training and testing code is open source. A variety of pretrained models are also available to download.

2. Unified Detection

We unify the separate components of object detection into a single neural network. Our network uses features from the entire image to predict each bounding box.
It also predicts all bounding boxes across all classes for an image simultaneously. This means our network reasons globally about the full image and all the objects in the image. The YOLO design enables end-to-end training and real-time speeds while maintaining high average precision.

Our system divides the input image into an S × S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.

Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the predicted box is. Formally we define confidence as Pr(Object) ∗ IOU^truth_pred. If no object exists in that cell, the confidence scores should be zero. Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.

Each bounding box consists of 5 predictions: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell. The width and height are predicted relative to the whole image. Finally the confidence prediction represents the IOU between the predicted box and any ground truth box.

Each grid cell also predicts C conditional class probabilities, Pr(Class_i | Object). These probabilities are conditioned on the grid cell containing an object. We only predict one set of class probabilities per grid cell, regardless of the number of boxes B.

At test time we multiply the conditional class probabilities and the individual box confidence predictions,

$$
\Pr(\text{Class}_i \mid \text{Object}) \cdot \Pr(\text{Object}) \cdot \text{IOU}^{\text{truth}}_{\text{pred}} = \Pr(\text{Class}_i) \cdot \text{IOU}^{\text{truth}}_{\text{pred}} \tag{1}
$$

which gives us class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.

Figure 2: The Model. Our system models detection as a regression problem.
It divides the image into an S × S grid and for each grid cell predicts B bounding boxes, confidence for those boxes, and C class probabilities. These predictions are encoded as an S × S × (B ∗ 5 + C) tensor.

For evaluating YOLO on PASCAL VOC, we use S = 7, B = 2. PASCAL VOC has 20 labelled classes so C = 20. Our final prediction is a 7 × 7 × 30 tensor.

2.1. Network Design

We implement this model as a convolutional neural network and evaluate it on the PASCAL VOC detection dataset [9]. The initial convolutional layers of the network extract features from the image while the fully connected layers predict the output probabilities and coordinates.

Our network architecture is inspired by the GoogLeNet model for image classification [34]. Our network has 24 convolutional layers followed by 2 fully connected layers. Instead of the inception modules used by GoogLeNet, we simply use 1 × 1 reduction layers followed by 3 × 3 convolutional layers, similar to Lin et al. [22]. The full network is shown in Figure 3.

We also train a fast version of YOLO designed to push the boundaries of fast object detection. Fast YOLO uses a neural network with fewer convolutional layers (9 instead of 24) and fewer filters in those layers. Other than the size of the network, all training and testing parameters are the same between YOLO and Fast YOLO.

Figure 3: The Architecture. Our detection network has 24 convolutional layers followed by 2 fully connected layers. Alternating 1 × 1 convolutional layers reduce the features space from preceding layers. We pretrain the convolutional layers on the ImageNet classification task at half the resolution (224 × 224 input image) and then double the resolution for detection.

The final output of our network is the 7 × 7 × 30 tensor of predictions.

2.2. Training

We pretrain our convolutional layers on the ImageNet 1000-class competition dataset [30].
For pretraining we use the first 20 convolutional layers from Figure 3 followed by an average-pooling layer and a fully connected layer. We train this network for approximately a week and achieve a single crop top-5 accuracy of 88% on the ImageNet 2012 validation set, comparable to the GoogLeNet models in Caffe's Model Zoo [24]. We use the Darknet framework for all training and inference [26].

We then convert the model to perform detection. Ren et al. show that adding both convolutional and connected layers to pretrained networks can improve performance [29]. Following their example, we add four convolutional layers and two fully connected layers with randomly initialized weights. Detection often requires fine-grained visual information so we increase the input resolution of the network from 224 × 224 to 448 × 448.

Our final layer predicts both class probabilities and bounding box coordinates. We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1. We parametrize the bounding box x and y coordinates to be offsets of a particular grid cell location so they are also bounded between 0 and 1.

We use a linear activation function for the final layer and all other layers use the following leaky rectified linear activation:

$$
\phi(x) = \begin{cases} x, & \text{if } x > 0 \\ 0.1x, & \text{otherwise} \end{cases} \tag{2}
$$

We optimize for sum-squared error in the output of our model. We use sum-squared error because it is easy to optimize, however it does not perfectly align with our goal of maximizing average precision. It weights localization error equally with classification error which may not be ideal. Also, in every image many grid cells do not contain any object. This pushes the “confidence” scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on.
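The box parametrization described above (width and height normalized by the image size, center expressed as an offset within its grid cell) can be illustrated with a minimal sketch. The function names `encode_box` and `decode_box`, the pixel-based input convention, and the clamping of boxes that sit exactly on the image border are assumptions made for illustration; they are not taken from the Darknet code.

```python
def encode_box(cx, cy, w, h, img_w, img_h, S=7):
    """Map a ground-truth box (center and size in pixels) to grid-cell targets.

    Returns (row, col, x, y, w_norm, h_norm): the grid cell that contains the
    box center, the center offset inside that cell, and the size relative to
    the whole image. All four box values lie in [0, 1].
    """
    # Express the center in grid units, e.g. 3.5 means "middle of column 3".
    gx = cx / img_w * S
    gy = cy / img_h * S
    # Clamp so a center exactly on the right/bottom border stays in the grid.
    col = min(int(gx), S - 1)
    row = min(int(gy), S - 1)
    # Offsets of the center relative to the bounds of its grid cell.
    x = gx - col
    y = gy - row
    return row, col, x, y, w / img_w, h / img_h


def decode_box(row, col, x, y, w_norm, h_norm, img_w, img_h, S=7):
    """Invert encode_box: recover the pixel-space center and size."""
    cx = (col + x) / S * img_w
    cy = (row + y) / S * img_h
    return cx, cy, w_norm * img_w, h_norm * img_h
```

For a 448 × 448 image and S = 7, each cell covers 64 × 64 pixels, so a box centered at pixel (224, 112) lands in row 1, column 3 with offsets (0.5, 0.75).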
To remedy this, we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don't contain objects. We use two parameters, $\lambda_{\text{coord}}$ and $\lambda_{\text{noobj}}$, to accomplish this. We set $\lambda_{\text{coord}} = 5$ and $\lambda_{\text{noobj}} = .5$.

Sum-squared error also equally weights errors in large boxes and small boxes. Our error metric should reflect that small deviations in large boxes matter less than in small boxes. To partially address this we predict the square root of the bounding box width and height instead of the width and height directly.

YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object. We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.

During training we optimize the following, multi-part loss function:

$$
\begin{aligned}
&\lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \,\in\, \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
\tag{3}
$$

where $\mathbb{1}_{i}^{\text{obj}}$ denotes if object appears in cell $i$ and $\mathbb{1}_{ij}^{\text{obj}}$ denotes that the $j$th bounding box predictor in cell $i$ is “responsible” for that prediction.

Note that the loss function only penalizes classification error if an object is present in that grid cell (hence the conditional class probability discussed earlier). It also only penalizes bounding box coordinate error if that predictor is “responsible” for the ground truth box (i.e. has the highest IOU of any predictor in that grid cell).

We train the network for about 135 epochs on the training and validation data sets from PASCAL VOC 2007 and 2012. When testing on 2012 we also include the VOC 2007 test data for training.
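As a rough illustration of how the multi-part loss described above fits together, here is a minimal NumPy sketch. It is not the Darknet implementation. The array layout, the function name `yolo_loss`, the precomputed `ious` argument, and the choice to treat every non-responsible predictor as a "no object" predictor with a confidence target of 0 are all assumptions made for this example. Predicted widths and heights are assumed to be non-negative.

```python
import numpy as np


def yolo_loss(pred_boxes, pred_cls, true_box, true_cls, obj_mask, ious,
              lambda_coord=5.0, lambda_noobj=0.5):
    """Sum-squared multi-part detection loss (illustrative sketch).

    pred_boxes: (S, S, B, 5) predicted x, y, w, h, confidence per predictor
    pred_cls:   (S, S, C)    predicted conditional class probabilities
    true_box:   (S, S, 4)    ground-truth x, y, w, h for cells with an object
    true_cls:   (S, S, C)    one-hot ground-truth class per cell
    obj_mask:   (S, S)       True where an object center falls in the cell
    ious:       (S, S, B)    IOU of each predicted box with the ground truth
    """
    obj_mask = obj_mask.astype(bool)
    S1, S2, B, _ = pred_boxes.shape

    # In each object cell, the predictor with the highest IOU is "responsible".
    resp = np.zeros((S1, S2, B), dtype=bool)
    rows, cols = np.nonzero(obj_mask)
    resp[rows, cols, ious.argmax(axis=-1)[rows, cols]] = True

    # Coordinate terms: centers directly, sizes through a square root.
    xy_err = ((pred_boxes[..., 0:2] - true_box[:, :, None, 0:2]) ** 2).sum(-1)
    wh_err = ((np.sqrt(pred_boxes[..., 2:4])
               - np.sqrt(true_box[:, :, None, 2:4])) ** 2).sum(-1)
    coord_loss = lambda_coord * (xy_err + wh_err)[resp].sum()

    # Confidence terms: target is the IOU for responsible predictors,
    # and 0 for all others (assumption, down-weighted by lambda_noobj).
    conf = pred_boxes[..., 4]
    conf_obj_loss = ((conf - ious) ** 2)[resp].sum()
    conf_noobj_loss = lambda_noobj * (conf ** 2)[~resp].sum()

    # Classification term: only for cells that contain an object.
    cls_loss = ((pred_cls - true_cls) ** 2).sum(-1)[obj_mask].sum()

    return coord_loss + conf_obj_loss + conf_noobj_loss + cls_loss
```

With S = 7, B = 2, C = 20, an all-zero prediction, and a single object in one cell with box (0.5, 0.75, 0.25, 0.25), the coordinate term contributes 5 × (0.8125 + 0.5) and the class term contributes 1, which shows how strongly localization is weighted relative to classification.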
Throughout training we use a batch size of 64, a momentum of 0.9 and a decay of 0.0005.

Our learning rate schedule is as follows: For the first epochs we slowly raise the learning rate from 10^-3 to 10^-2. If we start at a high learning rate our model often diverges due to unstable gradients. We continue training with 10^-2 for 75 epochs, then 10^-3 for 30 epochs, and finally 10^-4 for 30 epochs.

To avoid overfitting we use dropout and extensive data augmentation. A dropout layer with rate = .5 after the first connected layer prevents co-adaptation between layers [18]. For data augmentation we introduce random scaling and translations of up to 20% of the original image size. We also randomly adjust the exposure and saturation of the image by up to a factor of 1.5 in the HSV color space.

2.3. Inference

Just like in training, predicting detections for a test image only requires one network evaluation. On PASCAL VOC the network predicts 98 bounding boxes per image and class probabilities for each box. YOLO is extremely fast at test time since it only requires a single network evaluation, unlike classifier-based methods.

The grid design enforces spatial diversity in the bounding box predictions. Often it is clear which grid cell an object falls into and the network only predicts one box for each object. However, some large objects or objects near the border of multiple cells can be well localized by multiple cells. Non-maximal suppression can be used to fix these multiple detections. While not critical to performance as it is for R-CNN or DPM, non-maximal suppression adds 2-3% in mAP.

2.4. Limitations of YOLO

YOLO imposes strong spatial constraints on bounding
box predictions since each grid cell only predicts two boxes and can only have one class. This spatial constraint limits the number of nearby objects that our model can predict. Our model struggles with small objects that appear in groups, such as flocks of birds.

Since our model learns to predict bounding boxes from data, it struggles to generalize to objects in new or unusual aspect ratios or configurations. Our model also uses relatively coarse features for predicting bounding boxes since our architecture has multiple downsampling layers from the input image.

Finally, while we train on a loss function that approximates detection performance, our loss function treats errors the same in small bounding boxes versus large bounding boxes. A small error in a large box is generally benign but a small error in a small box has a much greater effect on IOU. Our main source of error is incorrect localizations.

3. Comparison to Other Detection Systems

Object detection is a core problem in computer vision. Detection pipelines generally start by extracting a set of robust features from input images (Haar [25], SIFT [23], HOG [4], convolutional features [6]). Then, classifiers [36, 21, 13, 10] or localizers [1, 32] are used to identify objects in the feature space. These classifiers or localizers are run either in sliding window fashion over the whole image or on some subset of regions in the image [35, 15, 39]. We compare the YOLO detection system to several top detection frameworks, highlighting key similarities and differences.

Deformable parts models. Deformable parts models (DPM) use a sliding window approach to object detection [10]. DPM uses a disjoint pipeline to extract static features, classify regions, predict bounding boxes for high scoring regions, etc. Our system replaces all of these disparate parts with a single convolutional neural network.
The network performs feature extraction, bounding box prediction, non-maximal suppression, and contextual reasoning all concurrently. Instead of static features, the network trains the features in-line and optimizes them for the detection task. Our unified architecture leads to a faster, more accurate model than DPM.

R-CNN. R-CNN and its variants use region proposals instead of sliding windows to find objects in images. Selective Search [35] generates potential bounding boxes, a convolutional network extracts features, an SVM scores the boxes, a linear model adjusts the bounding boxes, and non-max suppression eliminates duplicate detections. Each stage of this complex pipeline must be precisely tuned independently and the resulting system is very slow, taking more than 40 seconds per image at test time [14].

YOLO shares some similarities with R-CNN. Each grid cell proposes potential bounding boxes and scores those boxes using convolutional features. However, our system puts spatial constraints on the grid cell proposals which helps mitigate multiple detections of the same object. Our system also proposes far fewer bounding boxes, only 98 per image compared to about 2000 from Selective Search. Finally, our system combines these individual components into a single, jointly optimized model.

Other Fast Detectors. Fast and Faster R-CNN focus on speeding up the R-CNN framework by sharing computation and using neural networks to propose regions instead of Selective Search [14] [28]. While they offer speed and accuracy improvements over R-CNN, both still fall short of real-time performance.

Many research efforts focus on speeding up the DPM pipeline [31] [38] [5]. They speed up HOG computation, use cascades, and push computation to GPUs. However, only 30Hz DPM [31] actually runs in real-time.

Instead of trying to optimize individual components of a large detection pipeline, YOLO throws out the pipeline entirely and is fast by design.
Detectors for single classes like faces or people can be highly optimized since they have to deal with much less variation [37]. YOLO is a general purpose detector that learns to detect a variety of objects simultaneously.

Deep MultiBox. Unlike R-CNN, Szegedy et al. train a convolutional neural network to predict regions of interest [8] instead of using Selective Search. MultiBox can also perform single object detection by replacing the confidence prediction with a single class prediction. However, MultiBox cannot perform general object detection and is still just a piece in a larger detection pipeline, requiring further image patch classification. Both YOLO and MultiBox use a convolutional network to predict bounding boxes in an image but YOLO is a complete detection system.

OverFeat. Sermanet et al. train a convolutional neural network to perform localization and adapt that localizer to perform detection [32]. OverFeat efficiently performs sliding window detection but it is still a disjoint system. OverFeat optimizes for localization, not detection performance. Like DPM, the localizer only sees local information when making a prediction. OverFeat cannot reason about global context and thus requires significant post-processing to produce coherent detections.

MultiGrasp. Our work is similar in design to work on grasp detection by Redmon et al. [27]. Our grid approach to bounding box prediction is based on the MultiGrasp system for regression to grasps. However, grasp detection is a much simpler task than object detection. MultiGrasp only needs to predict a single graspable region for an image containing one object. It doesn't have to estimate the size, location, or boundaries of the object or predict its class, only find a region suitable for grasping. YOLO predicts both bounding boxes and class probabilities for multiple objects of multiple classes in an image.
4. Experiments

First we compare YOLO with other real-time detection systems on PASCAL VOC 2007. To understand the differences between YOLO and R-CNN variants we explore the errors on VOC 2007 made by YOLO and Fast R-CNN, one of the highest performing versions of R-CNN [14]. Based on the different error profiles we show that YOLO can be used to rescore Fast R-CNN detections and reduce the errors from background false positives, giving a significant performance boost. We also present VOC 2012 results and compare mAP to current state-of-the-art methods. Finally, we show that YOLO generalizes to new domains better than other detectors on two artwork datasets.

4.1. Comparison to Other Real-Time Systems

Many research efforts in object detection focus on making standard detection pipelines fast [5] [38] [31] [14] [17] [28]. However, only Sadeghi et al. actually produce a detection system that runs in real-time (30 frames per second or better) [31]. We compare YOLO to their GPU implementation of DPM which runs either at 30Hz or 100Hz. While the other efforts don't reach the real-time milestone we also compare their relative mAP and speed to examine the accuracy-performance tradeoffs available in object detection systems.

Fast YOLO is the fastest object detection method on PASCAL; as far as we know, it is the fastest extant object detector. With 52.7% mAP, it is more than twice as accurate as prior work on real-time detection. YOLO pushes mAP to 63.4% while still maintaining real-time performance.

We also train YOLO using VGG-16. This model is more accurate but also significantly slower than YOLO. It is useful for comparison to other detection systems that rely on VGG-16 but since it is slower than real-time the rest of the paper focuses on our faster models.

Fastest DPM effectively speeds up DPM without sacrificing much mAP but it still misses real-time performance by a factor of 2 [38].
It also is limited by DPM's relatively low accuracy on detection compared to neural network approaches.

R-CNN minus R replaces Selective Search with static bounding box proposals [20]. While it is much faster than R-CNN, it still falls short of real-time and takes a significant accuracy hit from not having good proposals.

Fast R-CNN speeds up the classification stage of R-CNN but it still relies on selective search which can take around 2 seconds per image to generate bounding box proposals. Thus it has high mAP but at 0.5 fps it is still far from real-time.

The recent Faster R-CNN replaces selective search with a neural network to propose bounding boxes, similar to Szegedy et al. [8]. In our tests, their most accurate model achieves 7 fps while a smaller, less accurate one runs at 18 fps. The VGG-16 version of Faster R-CNN is 10 mAP higher but is also 6 times slower than YOLO. The Zeiler-Fergus Faster R-CNN is only 2.5 times slower than YOLO but is also less accurate.

Real-Time Detectors        Train        mAP    FPS
100Hz DPM [31]             2007         16.0   100
30Hz DPM [31]              2007         26.1   30
Fast YOLO                  2007+2012    52.7   155
YOLO                       2007+2012    63.4   45

Less Than Real-Time        Train        mAP    FPS
Fastest DPM [38]           2007         30.4   15
R-CNN Minus R [20]         2007         53.5   6
Fast R-CNN [14]            2007+2012    70.0   0.5
Faster R-CNN VGG-16 [28]   2007+2012    73.2   7
Faster R-CNN ZF [28]       2007+2012    62.1   18
YOLO VGG-16                2007+2012    66.4   21

Table 1: Real-Time Systems on PASCAL VOC 2007. Comparing the performance and speed of fast detectors. Fast YOLO is the fastest detector on record for PASCAL VOC detection and is still twice as accurate as any other real-time detector. YOLO is 10 mAP more accurate than the fast version while still well above real-time in speed.

4.2. VOC 2007 Error Analysis

To further examine the differences between YOLO and state-of-the-art detectors, we look at a detailed breakdown of results on VOC 2007. We compare YOLO to Fast R-CNN since Fast R-CNN is one of the highest performing detectors on PASCAL and its detections are publicly available.
We use the methodology and tools of Hoiem et al. [19]. For each category at test time we look at the top N predictions for that category. Each prediction is either correct or it is classified based on the type of error:

• Correct: correct class and IOU > .5
• Localization: correct class, .1 < IOU < .5
• Similar: class is similar, IOU > .1
• Other: class is wrong, IOU > .1
• Background: IOU < .1 for any object

Figure 4 shows the breakdown of each error type averaged across all 20 classes.

Figure 4: Error Analysis: Fast R-CNN vs. YOLO. These charts show the percentage of localization and background errors in the top N detections for various categories (N = # objects in that category).

YOLO struggles to localize objects correctly. Localization errors account for more of YOLO's errors than all other sources combined. Fast R-CNN makes much fewer localization errors but far more background errors. 13.6% of its top detections are false positives that don't contain any objects. Fast R-CNN is almost 3x more likely to predict background detections than YOLO.

4.3. Combining Fast R-CNN and YOLO

YOLO makes far fewer background mistakes than Fast R-CNN. By using YOLO to eliminate background detections from Fast R-CNN we get a significant boost in performance. For every bounding box that R-CNN predicts we check to see if YOLO predicts a similar box. If it does, we give that prediction a boost based on the probability predicted by YOLO and the overlap between the two boxes.

The best Fast R-CNN model achieves a mAP of 71.8% on the VOC 2007 test set. When combined with YOLO, its mAP increases by 3.2% to 75.0%. We also tried combining the top Fast R-CNN model with several other versions of Fast R-CNN. Those ensembles produced small increases in mAP between .3 and .6%, see Table 2 for details.

                          mAP    Combined   Gain
Fast R-CNN                71.8   -          -
Fast R-CNN (2007 data)    66.9   72.4       .6
Fast R-CNN (VGG-M)        59.2   72.4       .6
Fast R-CNN (CaffeNet)     57.1   72.1       .3
YOLO                      63.4   75.0       3.2

Table 2: Model combination experiments on VOC 2007. We examine the effect of combining various models with the best version of Fast R-CNN. Other versions of Fast R-CNN provide only a small benefit while YOLO provides a significant performance boost.

Table 3: PASCAL VOC 2012 Leaderboard. YOLO compared with the full comp4 (outside data allowed) public leaderboard as of November 6th, 2015. Mean average precision and per-class average precision are shown for a variety of detection methods. YOLO is the only real-time detector. Fast R-CNN + YOLO is the fourth highest scoring method, with a 2.3% boost over Fast R-CNN.

The boost from YOLO is not simply a byproduct of model ensembling since there is little benefit from combining different versions of Fast R-CNN. Rather, it is precisely because YOLO makes different kinds of mistakes at test time that it is so effective at boosting Fast R-CNN's performance.

Unfortunately, this combination doesn't benefit from the speed of YOLO since we run each model separately and then combine the results. However, since YOLO is so fast it doesn't add any significant computational time compared to Fast R-CNN.

4.4. VOC 2012 Results

On the VOC 2012 test set, YOLO scores 57.9% mAP. This is lower than the current state of the art, closer to the original R-CNN using VGG-16, see Table 3. Our system struggles with small objects compared to its closest competitors. On categories like bottle, sheep, and tv/monitor YOLO scores 8-10% lower than R-CNN or Feature Edit.
However, on other categories like cat and train YOLO achieves higher performance.

Our combined Fast R-CNN + YOLO model is one of the highest performing detection methods. Fast R-CNN gets a 2.3% improvement from the combination with YOLO, boosting it 5 spots up on the public leaderboard.

4.5. Generalizability: Person Detection in Artwork

Academic datasets for object detection draw the training and testing data from the same distribution. In real-world applications it is hard to predict all possible use cases, and the test data can diverge from what the system has seen before [3]. We compare YOLO to other detection systems on the Picasso Dataset [12] and the People-Art Dataset [3], two datasets for testing person detection on artwork.

Figure 5 shows comparative performance between YOLO and other detection methods. For reference, we give VOC 2007 detection AP on person where all models are trained only on VOC 2007 data. On Picasso, models are trained on VOC 2012, while on People-Art they are trained on VOC 2010.

R-CNN has high AP on VOC 2007. However, R-CNN drops off considerably when applied to artwork. R-CNN uses Selective Search for bounding box proposals, which is tuned for natural images. The classifier step in R-CNN only sees small regions and needs good proposals.

DPM maintains its AP well when applied to artwork. Prior work theorizes that DPM performs well because it has strong spatial models of the shape and layout of objects. Though DPM doesn't degrade as much as R-CNN, it starts from a lower AP.

YOLO has good performance on VOC 2007, and its AP degrades less than other methods when applied to artwork. Like DPM, YOLO models the size and shape of objects, as well as relationships between objects and where objects commonly appear. Artwork and natural images are very different on a pixel level, but they are similar in terms of the size and shape of objects, so YOLO can still predict good bounding boxes and detections.

5. Real-Time Detection In The Wild

YOLO is a fast, accurate object detector, making it ideal for computer vision applications. We connect YOLO to a webcam and verify that it maintains real-time performance, including the time to fetch images from the camera and display the detections.

Figure 5: Generalization results on Picasso and People-Art datasets.
(a) Picasso Dataset precision-recall curves (axes: Recall, Precision).
(b) Quantitative results on the VOC 2007, Picasso, and People-Art Datasets. The Picasso Dataset evaluates on both AP and best F1 score.

Method       | VOC 2007 AP | Picasso AP | Picasso Best F1 | People-Art AP
YOLO         | 59.2        | 53.3       | 0.590           | 45
R-CNN        | 54.2        | 10.4       | 0.226           | 26
DPM          | 43.2        | 37.8       | 0.458           | 32
Poselets [2] | 36.5        | 17.8       | 0.271           | -
D&T [4]      | -           | 1.9        | 0.051           | -

Figure 6: Qualitative Results. YOLO running on sample artwork and natural images from the internet. It is mostly accurate, although it does think one person is an airplane.

The resulting system is interactive and engaging. While YOLO processes images individually, when attached to a webcam it functions like a tracking system, detecting objects as they move around and change in appearance. A demo of the system and the source code can be found on our project website: http://pjreddie.com/yolo/

6. Conclusion

We introduce YOLO, a unified model for object detection. Our model is simple to construct and can be trained directly on full images. Unlike classifier-based approaches, YOLO is trained on a loss function that directly corresponds to detection performance, and the entire model is trained jointly.

Fast YOLO is the fastest general-purpose object detector in the literature, and YOLO pushes the state of the art in real-time object detection. YOLO also generalizes well to new domains, making it ideal for applications that rely on fast, robust object detection.

Acknowledgements: This work is partially supported by ONR N00014-13-1-0720, NSF IIS-1338054, and The Allen Distinguished Investigator Award.

References

[1] M. B. Blaschko and C.
H. Lampert. Learning to localize objects with structured output regression. In Computer Vision-ECCV 2008, pages 2-15. Springer, 2008.
[2] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3d human pose annotations. In International Conference on Computer Vision (ICCV), 2009.
[3] H. Cai, Q. Wu, T. Corradi, and P. Hall. The cross-depiction problem: Computer vision algorithms for recognising objects in artwork and in photographs. arXiv preprint arXiv:1505.00110, 2015.
[4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886-893. IEEE, 2005.
[5] T. Dean, M. Ruzon, M. Segal, J. Shlens, S. Vijayanarasimhan, J. Yagnik, et al. Fast, accurate detection of 100,000 object classes on a single machine. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 1814-1821. IEEE, 2013.
[6] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531, 2013.
[7] J. Dong, Q. Chen, S. Yan, and A. Yuille. Towards unified object detection and semantic segmentation. In Computer Vision-ECCV 2014, pages 299-314. Springer, 2014.
[8] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 2155-2162. IEEE, 2014.
[9] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98-136, Jan. 2015.
[10] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627-1645, 2010.
[11] S. Gidaris and N. Komodakis. Object detection via a multi-region & semantic segmentation-aware CNN model. CoRR, abs/1505.01749, 2015.
[12] S. Ginosar, D. Haas, T. Brown, and J. Malik. Detecting people in cubist art. In Computer Vision-ECCV 2014 Workshops, pages 101-116. Springer, 2014.
[13] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 580-587. IEEE, 2014.
[14] R. B. Girshick. Fast R-CNN. CoRR, abs/1504.08083, 2015.
[15] S. Gould, T. Gao, and D. Koller. Region-based segmentation and object detection. In Advances in neural information processing systems, pages 655-663, 2009.
[16] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In Computer Vision-ECCV 2014, pages 297-312. Springer, 2014.
[17] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. arXiv preprint arXiv:1406.4729, 2014.
[18] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[19] D. Hoiem, Y. Chodpathumwan, and Q. Dai. Diagnosing error in object detectors. In Computer Vision-ECCV 2012, pages 340-353. Springer, 2012.
[20] K. Lenc and A. Vedaldi. R-CNN minus R. arXiv preprint arXiv:1506.06981, 2015.
[21] R. Lienhart and J. Maydt. An extended set of haar-like features for rapid object detection. In Image Processing. 2002. Proceedings. 2002 International Conference on, volume 1, pages I-900. IEEE, 2002.
[22] M. Lin, Q. Chen, and S. Yan. Network in network. CoRR, abs/1312.4400, 2013.
[23] D. G. Lowe. Object recognition from local scale-invariant features. In Computer vision, 1999.
The proceedings of the seventh IEEE international conference on, volume 2, pages 1150-1157. IEEE, 1999.
[24] D. Mishkin. Models accuracy on imagenet 2012 val. https://github.com/BVLC/caffe/wiki/Models-accuracy-on-ImageNet-2012-val. Accessed: 2015-10-2.
[25] C. P. Papageorgiou, M. Oren, and T. Poggio. A general framework for object detection. In Computer vision, 1998. sixth international conference on, pages 555-562. IEEE, 1998.
[26] J. Redmon. Darknet: Open source neural networks in C. http://pjreddie.com/darknet/, 2013-2016.
[27] J. Redmon and A. Angelova. Real-time grasp detection using convolutional neural networks. CoRR, abs/1412.3128, 2014.
[28] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497, 2015.
[29] S. Ren, K. He, R. B. Girshick, X. Zhang, and J. Sun. Object detection networks on convolutional feature maps. CoRR, abs/1504.06066, 2015.
[30] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 2015.
[31] M. A. Sadeghi and D. Forsyth. 30hz object detection with DPM v5. In Computer Vision-ECCV 2014, pages 65-79. Springer, 2014.
[32] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. CoRR, abs/1312.6229, 2013.
[33] Z. Shen and X. Xue. Do more dropouts in pool5 feature maps for better object detection. arXiv preprint arXiv:1409.6911, 2014.
[34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
[35] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W.
Smeulders. Selective search for object recognition. International journal of computer vision, 104(2):154-171, 2013.
[36] P. Viola and M. Jones. Robust real-time object detection. International Journal of Computer Vision, 4:34-47, 2001.
[37] P. Viola and M. J. Jones. Robust real-time face detection. International journal of computer vision, 57(2):137-154, 2004.
[38] J. Yan, Z. Lei, L. Wen, and S. Z. Li. The fastest deformable part model for object detection. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 2497-2504. IEEE, 2014.
[39] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In Computer Vision-ECCV 2014, pages 391-405. Springer, 2014.
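
Supplementary note on the best F1 metric. Figure 5(b) reports a best F1 score on the Picasso Dataset alongside AP. One common reading of that metric is the maximum F1 value over the operating points of a precision-recall curve. The sketch below illustrates that reading only; it is not taken from the paper's evaluation code. The function names `f1_score` and `best_f1` and the points in `toy_curve` are hypothetical and chosen purely for illustration.

```python
from typing import Iterable, Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def best_f1(curve: Iterable[Tuple[float, float]]) -> float:
    """Return the highest F1 over (precision, recall) operating points.

    An empty curve yields 0.0 rather than raising an error.
    """
    return max((f1_score(p, r) for p, r in curve), default=0.0)


# Hypothetical operating points, NOT values from the paper.
toy_curve = [(1.0, 0.1), (0.8, 0.4), (0.6, 0.6), (0.3, 0.9)]
toy_best = best_f1(toy_curve)
```

Because F1 is a harmonic mean, it rewards operating points where precision and recall are balanced, which is why a single best F1 value can complement AP when comparing detectors on a small dataset.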
