To view the full demo video, please click here.
Object recognition, the discovery and labeling of objects in an image by a computer, has gone from a difficult research problem to a practical technology in only five years. Google’s TensorFlow Object Detection API is a leading tool for developers who want to build their own object recognition systems. In industry, however, object recognition is only the first step toward information discovery, and the next steps depend on the application. For example, after finding pipes in an image, where are the defects, and how severe are they? Or after finding cars in video frames, how do we track each car from frame to frame?
In this post, we dissect a popular object recognition application: automatic license plate recognition (ALPR). The task is to detect cars in a video, follow each car’s location with a bounding box (a “track”), and annotate each car with its license plate number when the plate is visible. In computer vision, these three tasks are examples of “object recognition”, “video tracking”, and “optical character recognition” (OCR) respectively. We decompose the task into a sequence of steps, repeated for every frame:
1. use a track’s state (velocity and position of its sides) from the previous frame to predict its new state in the current frame,
2. use object recognition to find the tracked car in the current frame (and discover new cars not in the previous frame),
3. marry the results of steps 1 and 2 via a Kalman filter to update the state of the track in the current frame,
4. search the track, cropped from the current frame, for license plate text,
5. identify the characters in the text to label the car by its license plate number.
Let us discuss each of these steps. The first step is simple: we estimate a track’s position in the current frame as its previous position plus “velocity × time”, and we take its velocity to be the same as in the previous frame. Because the tracked car may accelerate, and because the track’s previous state itself has error, this “prediction” is not perfect.
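As a minimal sketch of this prediction step, assuming a constant-velocity model with time measured in frames: the names `SideState`, `predict_side`, and `predict_track` and the example numbers are hypothetical, not taken from the pipeline described here.

```python
from dataclasses import dataclass


@dataclass
class SideState:
    """State of one side of a bounding box (hypothetical structure)."""
    position: float  # pixel coordinate of the side
    velocity: float  # pixels per frame


def predict_side(state: SideState, dt: float = 1.0) -> SideState:
    """Constant-velocity prediction: position advances, velocity carries over."""
    return SideState(
        position=state.position + state.velocity * dt,
        velocity=state.velocity,
    )


def predict_track(track: dict, dt: float = 1.0) -> dict:
    """Predict every side of a track independently."""
    return {side: predict_side(state, dt) for side, state in track.items()}


# Made-up track state from the previous frame.
track = {
    "left": SideState(100.0, 4.0),
    "top": SideState(50.0, -1.0),
    "right": SideState(180.0, 5.0),
    "bottom": SideState(110.0, -0.5),
}
predicted = predict_track(track)
print(predicted["left"].position, predicted["right"].position)
```

Each side is advanced on its own, so the box can both move and change size between frames.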
The second step has two sub-tasks. First, we find all cars in the current frame. For this object recognition task, we used the single-shot MobileNet detector ssd_mobilenet_v1_coco_2017_11_06_2017 from the TensorFlow detection model zoo, trained on the Microsoft COCO dataset. Then we pair cars with tracks via the Hungarian algorithm, which maximizes the total overlap of bounding boxes over all pairs. If a car is not paired with a track, it is assigned a new track. If a track is not paired with a car, the track persists unless it has gone unpaired for, say, 20 consecutive frames. The bounding box location of the car is the “measurement” of its track’s position, and it is not perfect because our object recognition model does not return perfect results.
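The pairing can be illustrated with a small sketch. It makes two assumptions that are not from the pipeline itself: "overlap" is measured as intersection over union (IoU), and pairs below a hypothetical `min_iou` threshold are rejected. To stay dependency-free it also swaps the Hungarian algorithm for an exhaustive search over assignments, which finds the same optimum but is only practical for a handful of boxes; in practice a solver such as SciPy's `linear_sum_assignment` would take its place.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def pair_tracks(tracks, detections, min_iou=0.3):
    """Return (pairs, unpaired_tracks, unpaired_detections) as index lists.

    Brute-force stand-in for the Hungarian algorithm: tries every
    one-to-one assignment and keeps the one with the largest total IoU.
    """
    n_t, n_d = len(tracks), len(detections)
    if n_t <= n_d:
        candidates = (list(enumerate(p)) for p in permutations(range(n_d), n_t))
    else:
        candidates = ([(t, d) for d, t in enumerate(p)]
                      for p in permutations(range(n_t), n_d))
    best_score, best_pairs = -1.0, []
    for cand in candidates:
        score = sum(iou(tracks[t], detections[d]) for t, d in cand)
        if score > best_score:
            best_score, best_pairs = score, cand
    # Reject assigned pairs that barely overlap (threshold is an assumption).
    pairs = sorted((t, d) for t, d in best_pairs
                   if iou(tracks[t], detections[d]) >= min_iou)
    paired_t = {t for t, _ in pairs}
    paired_d = {d for _, d in pairs}
    unpaired_tracks = [t for t in range(n_t) if t not in paired_t]
    unpaired_detections = [d for d in range(n_d) if d not in paired_d]
    return pairs, unpaired_tracks, unpaired_detections


# Made-up boxes: two existing tracks, three detected cars.
tracks = [(0, 0, 10, 10), (100, 100, 120, 120)]
detections = [(101, 99, 121, 119), (1, 1, 11, 11), (300, 300, 320, 320)]
pairs, lost, new = pair_tracks(tracks, detections)
print(pairs, lost, new)
```

Detections left in `new` would start new tracks, and tracks left in `lost` would persist until their unpaired-frame limit runs out.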
The third step merges these predictions (step 1) and measurements (step 2) to update each track’s state in the current frame. Both the prediction and the measurement carry error, so the merged estimate does too. Using the Kalman filter algorithm to do the merge minimizes the variance of this random error.
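To make the merge concrete, here is a minimal sketch of a Kalman filter for a single side of a bounding box, assuming a constant-velocity model in which only the position is measured. The class name `Kalman1D` and all noise values (`q_pos`, `q_vel`, `r`, and the initial covariance) are illustrative assumptions, not the settings used for the demo.

```python
from dataclasses import dataclass


@dataclass
class Kalman1D:
    """Position/velocity Kalman filter for one box side (hypothetical)."""
    pos: float
    vel: float = 0.0
    # Entries of the 2x2 state covariance matrix P (assumed initial values).
    p00: float = 100.0
    p01: float = 0.0
    p10: float = 0.0
    p11: float = 100.0

    def predict(self, dt=1.0, q_pos=1.0, q_vel=1.0):
        """Step 1: advance the state and grow its uncertainty."""
        self.pos += self.vel * dt
        # P <- F P F^T + Q with F = [[1, dt], [0, 1]], Q = diag(q_pos, q_vel)
        p00 = self.p00 + dt * (self.p01 + self.p10) + dt * dt * self.p11 + q_pos
        p01 = self.p01 + dt * self.p11
        p10 = self.p10 + dt * self.p11
        p11 = self.p11 + q_vel
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11

    def update(self, z, r=25.0):
        """Step 3: blend the prediction with a position measurement z."""
        residual = z - self.pos
        s = self.p00 + r                      # innovation variance
        k0, k1 = self.p00 / s, self.p10 / s   # Kalman gain
        self.pos += k0 * residual
        self.vel += k1 * residual
        # P <- (I - K H) P with H = [1, 0]
        p00 = (1 - k0) * self.p00
        p01 = (1 - k0) * self.p01
        p10 = self.p10 - k1 * self.p00
        p11 = self.p11 - k1 * self.p01
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11


# Made-up measurements of a side moving 5 pixels per frame.
measurements = [100.0 + 5.0 * k for k in range(30)]
kf = Kalman1D(pos=measurements[0])
for z in measurements[1:]:
    kf.predict()
    kf.update(z)
print(round(kf.pos, 2), round(kf.vel, 2))
```

The filter starts with zero velocity and learns the motion from the measurements alone. The gain weights each measurement by how uncertain the prediction is relative to the detector.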
The fourth step uses the fast contour-finding algorithm of Suzuki and Abe, implemented as findContours in OpenCV, to find the best candidate license plate in each track. The best candidate is the longest collection of closed contours whose bounding boxes are adjacent, similarly-oriented, and have height, width, and aspect ratios within certain bounds. (Thanks to Chris Dahms for this idea.)
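The selection rule can be sketched on bounding boxes alone. The sketch below assumes each box is an `(x, y, w, h)` tuple, the format returned by OpenCV's `boundingRect` for a contour, and it simplifies "similarly-oriented" to "similar height and vertical position". The function names and every threshold are hypothetical placeholders, not the bounds used for the demo.

```python
def plausible_char(box, min_h=8, max_h=80, min_aspect=0.2, max_aspect=1.0):
    """Keep boxes whose height and width/height ratio could be a character."""
    x, y, w, h = box
    return min_h <= h <= max_h and w > 0 and min_aspect <= w / h <= max_aspect


def adjacent(a, b, max_gap_ratio=1.0, max_height_diff=0.25, max_y_shift=0.3):
    """True if b sits just right of a with similar height and vertical position."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    gap = bx - (ax + aw)  # horizontal space between the two boxes
    tallest = max(ah, bh)
    return (
        -0.2 * aw <= gap <= max_gap_ratio * max(aw, bw)
        and abs(ah - bh) <= max_height_diff * tallest
        and abs(ay - by) <= max_y_shift * tallest
    )


def best_plate_candidate(boxes):
    """Return the longest left-to-right chain of adjacent character-like boxes."""
    chars = sorted((b for b in boxes if plausible_char(b)), key=lambda b: b[0])
    best = []
    for i, start in enumerate(chars):
        chain = [start]
        for box in chars[i + 1:]:
            if adjacent(chain[-1], box):
                chain.append(box)
        if len(chain) > len(best):
            best = chain
    return best


# Made-up boxes: six plate characters plus three distractors.
boxes = [(100 + 12 * i, 200, 10, 20) for i in range(6)]
boxes += [(50, 40, 200, 150), (300, 10, 12, 22), (130, 60, 9, 18)]
plate = best_plate_candidate(boxes)
print(len(plate), plate[0], plate[-1])
```

The union of the boxes in the winning chain gives the region to crop as the candidate license plate.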
The fifth step uses OCR to read the characters of each license plate found in the fourth step. We compared two OCR tools: Tesseract and Google Cloud Vision API. Tesseract performed poorly because the resolution of the cropped license plate images was too low. Google Cloud Vision API did better, and it performed best when it processed the entire cropped image of a car rather than the cropped license plate alone. This is what we used to create the above video demonstration.
While developing this ALPR pipeline, we encountered the following challenges:
1. Recognition of cars was, at times, inconsistent over frames.
2. There were some instances of false detection of cars.
3. Tracks could jump between cars.
4. License plates had low resolution and were therefore difficult to read.
5. Non-license-plate text could be mistakenly discovered (for example, “Division” in the demo video).
We met the first four of these challenges as follows:
1. The pre-trained single-shot MobileNet detector we used finds occluded cars well but misses cars in plain sight surprisingly often. To alleviate this defect, we combined images of cars in COCO, many occluded, with non-occluded images of cars in the Stanford Cars dataset. Then we retrained our car recognition model using this augmented car dataset via transfer learning. Our new model still misses some cars in plain sight, although in fewer frames. To not lose those cars, we allowed their tracks to persist unpaired over a set number of consecutive frames so they could reattach to the car if re-discovered.
2. The pre-trained model we used recognized 80 different object classes, yet we want to identify cars alone. To improve this capability, we retrained this model (see item 1 immediately above) to recognize two classes only: car versus not-car. This reduced false-positive incidents.
3. We reduced track jumps by shrinking the number of consecutive frames through which an unpaired track could live.
4. Making many API requests and displaying the most recurrent result improved license plate read accuracy against occasional occlusion and low resolution. The one-minute demo video submitted 724 API requests, at a cost of $1.50 per 1,000 API requests. At this rate, it is worthwhile to develop a custom OCR system next.
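The "most recurrent result" idea in the last item, and the cost arithmetic behind it, can be sketched in a few lines. The plate strings below are made-up stand-ins for OCR responses collected over a track's lifetime, and `most_recurrent_read` is a hypothetical helper name. The request count and price are the figures quoted above.

```python
from collections import Counter


def most_recurrent_read(reads):
    """Return the most frequent non-empty OCR result, or None if there is none."""
    counts = Counter(r for r in reads if r)
    return counts.most_common(1)[0][0] if counts else None


# Hypothetical per-frame OCR results for one track; "" means nothing was read.
reads = ["7ABC123", "7ABC123", "7A8C123", "", "7ABC123", "7ABC12"]
label = most_recurrent_read(reads)

# Cost of the demo at the quoted price.
COST_PER_1000_REQUESTS = 1.50   # USD, from the text
REQUESTS_PER_MINUTE = 724       # requests submitted for the one-minute demo
cost_per_minute = REQUESTS_PER_MINUTE * COST_PER_1000_REQUESTS / 1000
cost_per_hour = cost_per_minute * 60

print(label, round(cost_per_minute, 3), round(cost_per_hour, 2))
```

At roughly a dollar per minute of video, the per-request pricing adds up quickly for continuous footage, which is the motivation for a custom OCR system.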
The Kalman filter dates back to the late 1950s, and it was used in navigation software on board the Apollo spacecraft that first carried man to the moon. When it is combined with new object recognition technology, powerful applications come within reach.