How to persistently track multiple moving things from frame-by-frame object detection inputs

The following blog post gives insights on how we built node-moving-things-tracker, a simple algorithm that runs on top of the output of any object detection algorithm (such as YOLO, Detectron, OpenCV face recognition, etc.) to track objects persistently across multiple frames.

It was built for the needs of two projects: the Open Traffic Cam project and the Beat the Traffic game.

The goal of this post is to provide some context on how the tracker works, so that you are able to fork it and improve it for your use case.

A. Our problem

YOLO (or any other object detection algorithm) gives us a list of detections for each frame, but doesn’t assign a unique identifier to those detections. This means that on the next frame you do not know whether this red car is the same one:

This is our problem.

What we were looking for was to enrich the YOLO detections with a unique id for each object, so that it can be tracked across the scene.
YOLO input

This is what we get from YOLO.
Desired output
This is what we need.
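
To make this more concrete, here is a simplified sketch of the difference. The field names are illustrative, not necessarily the exact format produced by YOLO or by node-moving-things-tracker:

// Raw detections for one frame: no identity, just boxes and classes
[
  { "x": 250, "y": 110, "w": 80, "h": 60, "name": "car", "confidence": 0.92 },
  { "x": 540, "y": 130, "w": 75, "h": 55, "name": "car", "confidence": 0.87 }
]

// Desired tracker output: the same detections enriched with a persistent id
[
  { "x": 250, "y": 110, "w": 80, "h": 60, "name": "car", "confidence": 0.92, "id": 1 },
  { "x": 540, "y": 130, "w": 75, "h": 55, "name": "car", "confidence": 0.87, "id": 2 }
]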

B. First “intuitive” algorithm

We first implemented an intuitive algorithm (inspired by a post from Daniel Shiffman) which works this way:

currentlyTrackedObjects = []
For each frame:
  // 1. Try to match the currently tracked objects with the detections of this frame
  For each detection:
    matchedTrackedObject = doesMatchExistingTrackedObject(detection, currentlyTrackedObjects)
    // if it matches, update the trackedObject position
    if matchedTrackedObject:
      matchedTrackedObject.update(detection)
  // 2. Assign unmatched detections to new tracked objects
  For each unmatched detection:
    currentlyTrackedObjects.add(new TrackedObject(detection))
  // 3. Clean up tracked objects that did not match any detection
  For each trackedObject in currentlyTrackedObjects:
    if isUnmatched(trackedObject):
      trackedObject.remove()

We immediately saw that one of the challenges would be to define the doesMatchExistingTrackedObject() function that compares two detections: how do we determine whether they are tracking the same object?

This led us to write a distance() function, which compares the positions of two detections (the current detection and a candidate on the next frame) and determines their relative distance. If they are considered close enough, we can match them. (We used the center of the bounding box to compute this distance.)

And the distance() function looks like this

function distance(item1, item2) {
  // compute the euclidean distance between the two bbox centers
  const euclideanDistance = computeEuclideanDistance(item1, item2);
  if (euclideanDistance > DISTANCE_LIMIT) {
    // too far apart: do not match
  } else {
    // close enough: potential match
  }
}
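
For reference, here is a minimal sketch of what computeEuclideanDistance() could look like, assuming each item carries the center coordinates (x, y) of its bounding box (this helper is not shown in the original code, so treat it as an illustration):

function computeEuclideanDistance(item1, item2) {
  // item1 and item2 are assumed to hold the center (x, y) of their bounding box
  const dx = item1.x - item2.x;
  const dy = item1.y - item2.y;
  return Math.sqrt(dx * dx + dy * dy);
}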

This early implementation was already pretty good, correctly matching ~80% of the detections, but it still had lots of re-assignments (when we lose track of an object and assign it a new id even though it is the same object).

At that point we had some ideas on how to improve it:

  • By keeping unmatched items in memory for a few frames instead of removing them immediately (sometimes the detection algorithm misses the object for a few frames)
  • By predicting the position on the next frame with a velocity vector
  • By improving the distance function

Keep unmatched objects in memory

We first integrated the idea of keeping the unmatched items in memory for a few frames, which simply means waiting a few frames before removing them from the tracked items.

  if isUnmatched(trackedObject):
      trackedObject.unmatchedThisFrame()

And the unmatchedThisFrame() function looks like this

function unmatchedThisFrame() {
  this.nbUnmatchedFrames++;
  if (this.nbUnmatchedFrames > 5) {
    // effectively delete the item
    this.remove();
  }
}

This made the tracker more resilient to missing detections from YOLO and avoided some reassignments, but to make it even more effective we also needed to predict the next position.

Predict position by computing the velocity vector

The idea behind this is to predict the next position of a tracked object when it is missing from the next frame: the object is moved to its “theoretical” position, so it is more likely to be re-matched on a following frame.

if isUnmatched(trackedObject):
   trackedObject.predictNextPosition()
   trackedObject.unmatchedThisFrame()

And the predictNextPosition() function looks like this

function predictNextPosition() {
  // move the tracked object to its predicted position for the next frame
  this.x = this.x + this.velocityVector.dx;
  this.y = this.y + this.velocityVector.dy;
}
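
The post does not show how velocityVector is computed; one simple way, assuming we keep the last known positions of the tracked object, is to average the frame-to-frame displacement over the last few frames (the names below are illustrative, not the project's actual code):

function computeVelocityVector(positionHistory) {
  // positionHistory: array of {x, y} for the last N frames, oldest first
  if (positionHistory.length < 2) {
    return { dx: 0, dy: 0 };
  }
  let dx = 0;
  let dy = 0;
  for (let i = 1; i < positionHistory.length; i++) {
    dx += positionHistory[i].x - positionHistory[i - 1].x;
    dy += positionHistory[i].y - positionHistory[i - 1].y;
  }
  // average displacement per frame
  const nbSteps = positionHistory.length - 1;
  return { dx: dx / nbSteps, dy: dy / nbSteps };
}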

Improving the distance function

At this point we didn’t have much of a clue about how to improve it further, so we felt the need to review some computer vision literature about tracking to make sure we were not missing any good ideas.

C. Literature review

We found some useful papers for our problem statement. Check them out here:

1. General things about tracking

We've identified two types of tracking:

  • single object tracking: you choose one thing in the video and track it across all the frames
  • multiple object tracking (MOT): track multiple objects across all frames

And for each of them, there are two possibilities:

  • Non-real-time tracking: algorithms that run on an existing video slower than real time (i.e. they take more than one frame's duration to compute the next state of the tracker)
  • Real-time tracking: tracking algorithms that can run in real time (necessary for self-driving cars, robots ...)

We noticed that the real-time algorithms use only the detection inputs (produced by YOLO for example), whereas the non-real-time trackers also use information from the image frames to get more data for the tracking. Also, most of the published papers and projects focus on single object tracking rather than MOT.

Historically there were almost no algorithms working on the detection output only, as detections weren't as good or as fast as what recent neural-network-based detectors such as YOLO now produce, so trackers needed to get more data from the image frames to do their job. But this is changing, and tracking algorithms are getting simpler and faster as detections get better. This approach is also called "tracking by detection".

Quote from the Deep SORT paper:

“Due to recent progress in object detection, tracking-by-detection has become the leading paradigm in multiple object tracking.”

That said, there may be tracking approaches using image data that lead to better results than pure "tracking by detection", but we haven’t looked into them as they would be more complex to run and would likely not run in real time: for example color-based tracking, particle tracking ...

2. Our problem set

For each of our applications, we had different needs:

  • For the Open Traffic Cam project: MOT + real-time tracking
  • For the Beat the Traffic game: MOT + non-real-time tracking (we precompute the tracking data beforehand)

We decided to focus on real-time tracking only, since if it works in real time it will also work on a pre-recorded video ... (we kept in mind that for the game app we still have the option of using non-real-time tracking algorithms).

Also, our problem is simplified because we mainly track cars and we have a fixed camera viewpoint.

3. Benchmarking existing solutions

A challenge exists for researchers to compare their tracking algorithms, and it is specifically designed for multiple object tracking: https://motchallenge.net/

There are plenty of algorithms, but we benchmarked two of them that met the following criteria:

  • run at more than 25 FPS
  • have an open source implementation in Python / C++

3.a SORT : Simple Online and Realtime Tracking

Example video: https://motchallenge.net/movies/ETH-Linthescher-SORT.mp4

SORT: Simple Online and Realtime Tracking: https://motchallenge.net/tracker/SORT is based on this paper: http://arxiv.org/abs/1602.00763

Code & Perfs

Python implementation, license GPL-3.0: https://github.com/abewley/sort
Fast but not super fast: 100 FPS on a 2012 MacBook Pro, and it may eat some of the resources needed on the Jetson for the Open Traffic Cam project.

How does it work?

The basis of the algorithm is the same as ours: it compares a frame with the next using dimensions like the position and the size of the bbox, and it computes a velocity vector. It does have some novelties compared to our approach:

  • It uses Kalman filters to compute the velocity vector: the Kalman filter essentially does some math to smooth the velocity/direction computation by comparing the predicted state with the real detection given by YOLO (and I think it also smooths out the size of the predicted bounding boxes).

  • It uses an assignment cost matrix computed as the intersection-over-union (IOU) distance between each detection and all predicted bounding boxes from the existing targets (which puts all the dimensions in a normalized matrix). The best match is then computed using the Hungarian algorithm, an efficient way to solve this kind of assignment problem (see the sketch after this list).

  • It also takes into account the score of the detections (how confident YOLO is about each detection), which we didn't use. It could be interesting to see if this helps: the tracker could choose between two close detections based on that score.
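
To illustrate the cost-matrix idea, here is a rough sketch of how such a matrix could be built from the detections and the predicted boxes. iou() and solveAssignment() are hypothetical helpers standing in for an IOU function and a Hungarian-algorithm implementation; this is not the actual SORT code:

// build the cost matrix: one row per tracked object, one column per detection
function buildCostMatrix(predictedBoxes, detections, iou) {
  return predictedBoxes.map((predicted) =>
    detections.map((detection) => 1 - iou(predicted, detection))
  );
}

// a Hungarian-algorithm solver would then find the assignment minimizing the total cost:
// const assignments = solveAssignment(buildCostMatrix(predictedBoxes, detections, iou));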

Limitations

We also found some limitations during the exploration of this approach:

  • It does not handle re-entering: if the tracker loses track of something (generally because YOLO misses that detection for a few frames), it will give the object a new id when it gets it back, which is bad for us because in the game it means that the masking is lost …
  • The velocity computation is not based on several frames: we found out that, with our algorithm, it was better to compute the velocity based on the average of a few previous frames.

How does it perform on our use case of tracking cars?

Out of the box, not that great. The main problem is a high number of identity switches (as it does not handle re-entering). But it does perform better in some cases where our tracker loses track.

Also, and this is true for all the trackers of the MOT benchmark, they are optimized for people, not cars. We didn't try it on people as we haven't shot footage of people yet, but we can hope that it performs much better than our algorithm in that case.

Takeaway from the SORT algorithm

Even if we were a bit disappointed with the raw results after playing with it a bit (changing some parameters), we could take away some ideas that would help improve our algorithm, like integrating Kalman filters to make better predictions.

We also started to notice that the algorithms from the MOT challenge are optimized for people, and might not work as well for our use case of tracking cars.

NOTE: some improvements have been made on SORT with https://github.com/nwojke/deep_sort , which uses deep learning with a model trained on pedestrians to handle occlusion / re-entering scenarios. We didn't test it, but it claims to also run in real time.

3.b IOU Tracker:

IOU Tracker: https://motchallenge.net/tracker/IOU17 , based on this paper: http://elvera.nue.tu-berlin.de/files/1517Bochinski2017.pdf

Code & Perfs

How does it work?

It's incredibly simple: it works by comparing the overlapping areas between detections in consecutive frames.

This is how IOU-Tracker works

It does this by computing the intersection over union of the areas:

And it does just that (no prediction, no velocity vector computation...)
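
For reference, here is a minimal sketch of an intersection-over-union computation on axis-aligned boxes described by their top-left corner, width, and height (this is the standard formula, not the IOU tracker's actual code, and the box format is an assumption):

function iou(boxA, boxB) {
  // boxes are { x, y, w, h } with (x, y) the top-left corner
  const xLeft = Math.max(boxA.x, boxB.x);
  const yTop = Math.max(boxA.y, boxB.y);
  const xRight = Math.min(boxA.x + boxA.w, boxB.x + boxB.w);
  const yBottom = Math.min(boxA.y + boxA.h, boxB.y + boxB.h);
  if (xRight <= xLeft || yBottom <= yTop) {
    return 0; // no overlap at all
  }
  const intersection = (xRight - xLeft) * (yBottom - yTop);
  const union = boxA.w * boxA.h + boxB.w * boxB.h - intersection;
  return intersection / union;
}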

Limitations

  • It doesn't do any prediction, so if YOLO loses the object for some frames, the tracker will lose it as well and will track it again under a new id.
  • It won't perform well on lower frame rate detections, which is understandable: without predictions, the overlapping areas at lower frame rates can be non-existent.

How does it perform on our use case of tracking cars?

Surprisingly great! Out of the box it may not be as good as our current tracker in some cases, because YOLO misses detections quite often and that triggers lots of re-assignments with this tracker ... but it has a huge potential for improvement if we add prediction and re-entering handling.

Takeaways from the IOU tracker

A quote from the paper summarizes it well:

“simple tracking methods like the IOU tracker can lead to better results than complex approaches based on decades of research”

Based on what we learned, we revisited our distance() function: it could be improved by using this overlapping-area comparison.

D. Finalizing implementation

Based on the previous learnings, we simply replaced our euclidean-distance-based distance() function with the IOU-based one from the IOU paper. This led to lots of improvements and made the tracker much more reliable.
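
In practice the change can be as small as swapping the matching criterion. Here is a sketch under the same assumptions as the iou() helper shown earlier, with IOU_LIMIT being an illustrative threshold rather than the project's actual constant:

function distance(item1, item2) {
  // the higher the overlap, the "closer" the two detections are
  const overlap = iou(item1, item2);
  if (overlap < IOU_LIMIT) {
    // not enough overlap: do not match
    return false;
  }
  // enough overlap: potential match
  return true;
}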

We also found out that with some objects like trucks, YOLO sometimes makes double detections for the same object. We fixed that by not necessarily assigning all the detections to tracked items, which avoided some id re-assignments as well.

E. Limitation and ideas for improvement

One limitation of node-moving-things-tracker is that it was mostly tested on tracking cars, so it could be over-optimized for this use case and perform badly on others. It was also only tested with a fixed camera viewpoint.

To improve it further, it could be a good idea to work on the prediction framework by integrating Kalman filters, and also to make use of the detection confidence given by YOLO, which isn’t used at the moment.

Another path for improvement would be to use some machine learning techniques to learn from a vast variety of traffic footage; the only time-consuming part would be producing a large enough training set.

F. Conclusion

We didn’t take the time to improve the tracker further as it was good enough for our use case, but there is plenty of room to do so.

In this post we simply wanted to share our learning process on tracking things, and we hope it gives you enough context to be able to fork the project and customize it for your use case.