Researchers Yuting Yang, Camillo Taylor and Daniel Lee have developed a system to turn surveillance cameras into traffic counters.
Traffic information can be collected from existing inexpensive roadside cameras, but extracting it often entails manual work or costly commercial software. Against this background, the Delaware Valley Regional Planning Commission (DVRPC) was looking for an efficient and user-friendly solution to extract traffic information from videos captured at road intersections.
We proposed a method that tracks and counts vehicles in the video and uses the tracking information to compute a 3D model of the vehicles, visualising the 2D road situation in 3D. The 3D model can provide feedback to the tracking and counting model for future research.
The proposed method aims to solve two difficulties in tracking and counting from video. Firstly, unlike normal highway traffic, vehicles approaching an intersection can go straight ahead or turn in either direction, which makes the system's task much less constrained and predictable. The second is the perspective deformation in the image, caused by the low position of the camera and lens distortion, which makes it difficult to recognise true relative distances on the road and to reconstruct the 3D model.
To deal with the variety of movement, the system was trained to predict a vehicle's movement in the next few frames from its current position and velocity. This system, known as a classifier, then categorises each vehicle according to whether it is predicted to go straight, turn left or turn right, based on the assumption that vehicles use the dedicated lanes for turning or going straight. This classification makes it easier to separate adjacent vehicles heading in different directions, and helps improve the performance of tracking and counting.
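As an illustration only (not the authors' actual implementation), a movement classifier of this kind could be sketched in Python with a simple nearest-neighbour model trained on hand-labelled position/velocity samples; all values below are placeholders:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hand-labelled training data: each row is (x, y, vx, vy) for one tracked
# vehicle, with a label 0 = straight, 1 = left turn, 2 = right turn.
X_train = np.array([
    [120, 340,  0.0, -4.2],   # centre (straight-ahead) lane
    [ 80, 350, -0.5, -3.8],   # left-turn lane
    [160, 345,  0.6, -4.0],   # right-turn lane
])
y_train = np.array([0, 1, 2])

clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X_train, y_train)

# Predict the manoeuvre for a newly tracked vehicle from its current state.
new_state = np.array([[118, 300, 0.1, -4.1]])
print(clf.predict(new_state))  # -> [0]: expected to go straight
```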
As no camera information was provided with the video, solving the perspective deformation problem required manually calibrating the camera. This means calibrating the principal point position in the x and y directions and the radial distortion parameter, then adjusting these parameters until the parallel lines in the image intersect at the same point (the vanishing point). The parallel lines are drawn according to the road markings. Once the camera is calibrated and the intersection of the parallel lines is found, projecting a nadir view – that is, the equivalent of viewing the scene from directly above – becomes a simple geometric problem. The nadir view in Figure 1 is not a perfect restoration of the real world, as the height of the vehicles is not taken into account, but it serves to calculate same-scale distances across the image as a threshold for vehicle tracking. Since a vehicle's height is low compared with the camera's height, this approximation was good enough for the initial stage, leaving the vehicle's height to be considered in the 3D reconstruction.
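A minimal sketch of this rectification step, assuming OpenCV: hand-tuned distortion and principal-point values (placeholders below) undistort the frame, and a homography computed from four road-plane points maps it into the nadir view:

```python
import cv2
import numpy as np

# Illustrative values only: the radial distortion and principal point would
# be tuned by hand until the drawn lane lines meet at one vanishing point.
K = np.array([[800.0,   0.0, 640.0],   # focal length and principal point (px)
              [  0.0, 800.0, 360.0],
              [  0.0,   0.0,   1.0]])
dist = np.array([-0.25, 0.0, 0.0, 0.0])  # first radial distortion term only

frame = cv2.imread("intersection.png")   # placeholder file name
undistorted = cv2.undistort(frame, K, dist)

# Four points on the road plane in the image, and where they should land in
# the nadir (top-down) view; the homography between them defines the warp.
src = np.float32([[420, 700], [860, 700], [300, 380], [980, 380]])
dst = np.float32([[400, 900], [880, 900], [400, 100], [880, 100]])
H = cv2.getPerspectiveTransform(src, dst)
nadir = cv2.warpPerspective(undistorted, H, (1280, 1000))
```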
For tracking and counting, we established a feature-based tracking system that uses a Kanade-Lucas-Tomasi (KLT) tracker to follow 'feature points' detected in the image. Each feature point's relative distance is then rectified according to the nadir view, and the points are grouped into individual vehicles according to properties such as proximity, distance, colour, speed and direction. The 3D reconstruction stage is then added.
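The KLT step could look roughly like this in OpenCV (file name and parameters are placeholders; the grouping stage, not shown, would then cluster the rectified tracks into vehicles):

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("intersection.mp4")  # placeholder file name
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

# Detect corner-like feature points, then track them from frame to frame.
pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                              qualityLevel=0.01, minDistance=7)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    new_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
    # Keep only the points that were successfully tracked into this frame.
    pts = new_pts[status.flatten() == 1].reshape(-1, 1, 2)
    prev_gray = gray
```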
In effect, the vehicle's 3D position is approximated by the 3D positions of the feature points that represent it, calculated on the assumption that the vehicle is a rigid body. This means the distance between each pair of points on the vehicle does not vary, and the orientation between the two therefore follows the direction of the vehicle. The actual data is acquired by solving an optimisation problem in which the variables are the 3D positions of the feature points and the objective is to minimise the reprojection error between the points' observed positions in the image and their estimated positions. In this system, the optimisation is solved directly using MATLAB's built-in function lsqnonlin.
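A sketch of that optimisation in Python, using scipy.optimize.least_squares as a stand-in for MATLAB's lsqnonlin (the poses, feature tracks and initial guess below are placeholder values, not real data):

```python
import numpy as np
from scipy.optimize import least_squares

K = np.array([[800.0,   0.0, 640.0],   # placeholder camera intrinsics
              [  0.0, 800.0, 360.0],
              [  0.0,   0.0,   1.0]])

def project(points3d, R, t):
    # Project 3D points through pose (R, t) and intrinsics K onto the image.
    cam = points3d @ R.T + t
    uv = cam @ K.T
    return uv[:, :2] / uv[:, 2:3]

def residuals(flat_points, poses, observations):
    # Reprojection error: estimated image positions minus the tracked ones.
    pts = flat_points.reshape(-1, 3)
    errs = [project(pts, R, t) - obs
            for (R, t), obs in zip(poses, observations)]
    return np.concatenate(errs).ravel()

# The vehicle's rigid motion re-expressed as per-frame camera poses, plus
# the feature tracks observed in each frame (one point here, for brevity).
poses = [(np.eye(3), np.zeros(3)),
         (np.eye(3), np.array([-0.5, 0.0, 0.0]))]
observations = [np.array([[630.0, 350.0]]),
                np.array([[610.0, 350.0]])]

x0 = np.array([0.0, 0.0, 8.0])  # initial guess for the point's 3D position
sol = least_squares(residuals, x0, args=(poses, observations))
print(sol.x.reshape(-1, 3))     # recovered position, up to overall scale
```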
One issue with the structure-from-motion method is that it can only solve for the point positions up to a scale factor – it cannot differentiate between a 1m cube 1m from the camera and a 2m cube 2m from the camera. To overcome this we needed to assign a factor for the set of points. By assuming that the vehicle is on the ground rather than flying, this can be done by scaling the result until its lowest point reaches height zero (that is, the wheels are touching the ground).
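The scale fix itself is a one-liner once a ground convention is chosen; a sketch, where the camera-at-origin, z-up convention is our illustrative choice:

```python
import numpy as np

def fix_scale(points3d, camera_height):
    """Rescale an up-to-scale reconstruction so its lowest point touches the
    road. Illustrative convention: camera at the origin with z pointing up,
    so the ground plane sits at z = -camera_height."""
    depth_below_camera = -points3d[:, 2].min()  # unscaled drop to lowest point
    return points3d * (camera_height / depth_below_camera)
```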
Since we calculate the 3D positions of only a limited set of feature points, the vehicle is represented as a bounding box around them, moving in the calculated direction.
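For instance (an illustrative sketch, not the authors' code), the box can be taken as the axis-aligned extent of the points in a frame rotated to the vehicle's heading:

```python
import numpy as np

def vehicle_box(points3d, heading):
    # Build a frame whose x axis points along the vehicle's heading (a unit
    # vector in the ground plane, z up by the same illustrative convention).
    x_axis = heading / np.linalg.norm(heading)
    z_axis = np.array([0.0, 0.0, 1.0])
    y_axis = np.cross(z_axis, x_axis)
    R = np.stack([x_axis, y_axis, z_axis])  # rows map world -> vehicle frame
    local = points3d @ R.T
    return local.min(axis=0), local.max(axis=0)  # opposite box corners
```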
This method has been tested on six different 10-second video clips covering different trajectories, and the cumulative accuracy in counting the vehicles is shown in Table 1. To differentiate the contribution of each part of the method to the overall performance, some subsystems were removed and the results compared with the original findings. Table 1 shows the results of the original method in the blue column; the red column shows results without camera calibration; and the green column shows results where the grouping algorithm has been replaced with a strategy that simply groups nearby points together.
In the original method, over-counting occurs mostly in extreme cases, such as a large truck with relatively few feature points, which the system sometimes mistakenly interprets as two or more vehicles. Missed vehicles are mostly due to occlusion (blocked or overlapping vehicles).
The results were not badly affected by removing the camera calibration, as this is mostly needed for the 3D reconstruction. However, without calibration it can be difficult to determine a distance threshold for grouping nearby feature points, as the ratio between pixel and real-world distance varies across the image. Because the grouping technique was able to differentiate nearby vehicles travelling in different directions (straight ahead/turn left/turn right), the distance threshold was set at a high level, since feature points some distance from each other could still belong to the same vehicle. Setting a high threshold reduces the number of over-counted vehicles but slightly increases the number of missed vehicles, because the large threshold causes the system to recognise several vehicles as one.
Removing the grouping technique had a far greater impact on the correct counting of vehicles. The increase in over-counted vehicles was caused by the lack of an algorithm to combine small groups of feature points. While the increase in missed vehicles was only small, the actual result is worse in the per-frame demonstration: without the grouping algorithm, adjacent vehicles travelling in different directions could not be differentiated until they were physically separated, which is less efficient.
As this technique only requires video from the cameras, with the computing work carried out at the control centre, we believe this system could be added to the feeds from other existing cameras with a minimum of manual setup. However, as the speed of the system is constrained by the 3D reconstruction process, the system can currently only work off-line, on 'snapshots'. Manual camera calibration may not be needed when the user has the camera's precise location and height.
The training process will be needed at each new intersection because, in order to predict which direction a vehicle will move, we need to first give the classifier some manually labelled ground-truth values. This enables it to 'learn' the correlation between a vehicle's current speed and position and its likely trajectory.
Currently the 3D reconstruction is used only to show how the road situation would look in 3D, not to assist the counting system. We plan to feed the 3D vehicle reconstruction back into the counting system: information such as whether a vehicle has a reasonable size and position will help indicate whether the count is reliable. With a known camera position, the 3D model can also be used to predict occlusion and devise a way to detect masked vehicles.
Having shown that the proposed method can count vehicles and compute their 3D positions using video from cameras installed at road intersections after initial training, we plan to feed the 3D model back into the counting system to increase accuracy.
ABOUT THE AUTHORS:
Yuting Yang is a graduate of the University of Pennsylvania; Camillo Taylor and Daniel Lee are professors at the university.