Motivation and context

YOLO is a popular object detection algorithm that I want to experiment with and learn to use for both research and industrial applications. Computer vision is part of my domain and background, and not having tried out YOLO feels so wrong to me. Oddly, this reddit post spurred me to begin learning about it: "Help with YOLOv8: Incorrect Labels Displayed in Detection Boxes"

The key difference between YOLO and other object detectors is that it ==frames object detection as a regression problem rather than a classification problem==. Is there an advantage to this? I guess regression networks are simpler than classification networks, and if YOLO uses the former, that makes it lighter and faster than other models

  • YOLO performs a single pass over the image and predicts the bounding boxes and their associated probabilities (confidence %), making it quite efficient and fast
  • A single NN is used, which enables users to optimize it end-to-end. No more going down the NN rabbithole thanks to the YOLO model's simplicity
  • YOLO does a great job with generalization and does not throw false positives often
  • YOLO is compared to DPM (Deformable Parts Model) and R-CNN

  • R-CNN uses region proposals to generate bounding boxes on an image. Then, a classifier is run on each proposed region. After classification, post-processing is employed to eliminate duplicates (this is probably where NMS (Non-Max Suppression) is used?) and refine the results. It's a slow and complex pipeline, and one has to train the individual components separately. YOLO unifies everything into a single pipeline, making it easy to train
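Since NMS keeps coming up as the duplicate-elimination step, here is a minimal numpy sketch of greedy non-max suppression (my own toy illustration, not the paper's exact procedure; the `(x1, y1, x2, y2)` box format and the 0.5 overlap threshold are assumptions):

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    order = np.argsort(scores)[::-1]   # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        if rest.size == 0:
            break
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps < iou_thresh]  # suppress heavy overlaps
    return keep
```

Two heavily overlapping boxes collapse to the higher-scoring one, while a distant box survives untouched.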

  • [?] What is mAP? First guess: it is a mean parameter used to compare the performance of a DL model. mAP is the mean Average Precision: the average precision (area under the precision-recall curve) is computed per class, then averaged over all classes
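To make that definition concrete, a toy sketch of (non-interpolated) average precision and mAP, assuming detections have already been matched to ground truths; the IoU-threshold matching step itself is omitted, and the function names are my own:

```python
import numpy as np

def average_precision(tp_flags, n_gt):
    """AP for one class.

    tp_flags: detections sorted by descending confidence; 1 means it matched
    a ground-truth box, 0 means false positive. n_gt: number of ground truths.
    """
    tp_flags = np.asarray(tp_flags, dtype=float)
    tp = np.cumsum(tp_flags)                 # running true positives
    fp = np.cumsum(1 - tp_flags)             # running false positives
    precision = tp / (tp + fp)
    # Non-interpolated AP: average the precision at each recall step
    return float(np.sum(precision * tp_flags) / n_gt)

def mean_average_precision(per_class_tp, per_class_gt):
    """mAP: the mean of per-class APs."""
    return float(np.mean([average_precision(t, g)
                          for t, g in zip(per_class_tp, per_class_gt)]))
```

So a detector is rewarded for ranking its true positives ahead of its false positives, class by class.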

YOLO sees the entire image at once rather than focusing on localized parts of it. This allows it to use the surrounding context of objects in the image to reduce background errors. However, it lags behind other detectors in accuracy and localization

  • The image is divided into an S x S grid (S = number of rows and columns). This is the residual blocks process
    • Each cell in the grid predicts its bounding boxes and their associated confidence scores. After the single pass, duplicate and low-confidence boxes are filtered out (via NMS) to reveal the final BBs. The confidence target for a BB is P(Object) * IOU (Intersection Over Union) between the predicted box and the ground truth, making YOLO a supervised learning algorithm
    • A cell can only identify a single class, no matter how many boxes it predicts. Each BB consists of 5 predictions
      • (x, y) - The coordinates of the center of the BB w.r.t the bounds of the cell
      • (w, h) - The width and height of the BB w.r.t the whole image
      • c - The confidence, i.e. P(Object) * IOU (which reduces to the IOU when an object is present)
    • In the end, each grid cell produces a class-specific confidence score for each box (P(Class) * IOU)
    • The predictions are encoded as an S * S * (B * 5 + C) tensor
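The S * S * (B * 5 + C) encoding can be made concrete with a small sketch. Using the paper's PASCAL VOC values (S = 7, B = 2, C = 20), this decodes one grid cell's vector into its boxes and class-specific scores; the exact layout (box predictions first, class probabilities last) is a common implementation convention, assumed here for illustration:

```python
import numpy as np

S, B, C = 7, 2, 20                     # paper's PASCAL VOC setup
preds = np.random.rand(S, S, B * 5 + C)  # stand-in for the network's output

def decode_cell(cell, B, C):
    """Split one grid cell's vector into B boxes and class-specific scores."""
    boxes = cell[:B * 5].reshape(B, 5)   # each row: x, y, w, h, confidence
    class_probs = cell[B * 5:]           # P(Class_i | Object), length C
    conf = boxes[:, 4]                   # P(Object) * IOU, one per box
    # class-specific confidence per box: P(Class_i) * IOU
    class_scores = conf[:, None] * class_probs[None, :]   # shape (B, C)
    return boxes[:, :4], class_scores

xywh, scores = decode_cell(preds[3, 3], B, C)   # decode the center cell
```

Running this over all S * S cells and then applying NMS over the class scores is the whole inference path — one forward pass, pure regression.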

YOLO uses a CNN to extract features and fully connected layers to perform the predictions. Its architecture is inspired by GoogLeNet

  • [?] What is an inception module? GoogLeNet uses these but YOLO does not

YOLO (the original one) has a few limitations though

  • As it imposes strong spatial constraints on the bounding box predictions (each cell predicts only B boxes and one class), it struggles to detect small objects that appear in groups, such as flocks of birds
  • Incorrect localizations, which are its main source of error

Note to-do

  • Compile the pros and cons into a neat table
  • Write in my own words. Generate personal knowledge and useful thoughts
  • Codify the algorithm into an intuitive flowchart for visual understanding
  • If required, split the note into separate notes of different but connected topics to encourage atomicity

External links