1. Introduction
Nowadays, mobile robots play an important role in automated facilities such as warehouses and ports. They have proved their advantages by operating automatically without human control. However, they must still overcome certain challenges to improve their accuracy and performance. In particular, when a mobile robot plans a path to move cargo between the shelves of a warehouse, it must determine both the direction in which it should travel and the distance parameters that define its position. In this case, a distance estimation system that is both accurate and fast can reduce the processing time required when choosing the robot's path. This performance enhancement helps overcome the limitations of mobile robots, especially when they operate without human control.
In robotics, accurate perception of the environment is crucial for enabling autonomous navigation and interaction with the surroundings. Mobile robots must be able to sense distances to obstacles or objects for safe and efficient operation. Various sensors such as LIDAR, ultrasonic sensors, and stereo vision systems are commonly used, each with trade-offs in cost, complexity, and computational demand. While LIDAR provides high accuracy, it comes with significant cost and power requirements, making it less suitable for low-cost applications (Karthika et al., 2020). Stereo vision, which uses two cameras to estimate depth through triangulation, offers a balance between accuracy and affordability but requires complex calibration and additional computational resources (Liu et al., 2012). Monocular camera-based laser rangefinders offer an economical alternative to expensive laser scanning sensors while providing reliable distance data for mobile robots (Zhang et al., 2013).
Vision-based distance estimation using a single camera has become a promising alternative because of its simplicity and low cost. Mono cameras, however, lack the inherent depth information provided by stereo setups, making distance estimation more challenging. Various methods have been proposed to address this limitation, including depth-from-motion, where camera movement is used to infer depth (Griffin et al., 2021), and size-based estimation, where the apparent size of known objects is used to calculate distance. One application of mono camera calibration is UAV control (Skov et al., 2021), which combines feature detection on a vertical concrete surface with a camera-based distance estimator, enabling a UAV to autonomously track and approach a user-defined target within a limited margin of error. Another work applies a vision-based method to 3D mapping to improve the safety of autonomous driving in container terminals (Vinh et al., 2023).
Chessboard calibration has been widely applied in computer vision to increase accuracy when determining a camera’s intrinsic parameters, and these parameters can then be used to calculate distance using perspective geometry (Xu et al., 2012). This method, which uses images of a chessboard pattern captured at different angles, performs well with mono camera systems and provides a practical, highly reliable solution for distance estimation in mobile robots. Several works have shown that mono camera-based systems can achieve trustworthy results in controlled environments, particularly when calibrated properly (Kuramoto, 2018). However, difficulties remain in extending these methods to more complex environments where real-time performance is required.
This paper addresses the problem of distance estimation using a mono camera by applying chessboard-based calibration to find the intrinsic camera parameters. These parameters are then used in a perspective geometry framework to estimate the distance between the camera and objects in its field of view. The proposed method provides a cost-effective, lightweight alternative to stereo vision systems, with the added advantage of being easier to implement and integrate into existing robotics platforms. This paper also demonstrates that the system provides competitive accuracy with minimal error in controlled environments, making it suitable for a wide range of robotics applications.
The remainder of this paper is organized as follows: Section 2 reviews related work on distance estimation from a mono camera. Section 3 details the proposed approach, including camera calibration and distance estimation. Section 4 presents experimental results from several tests, while Section 5 concludes with a discussion and potential future work.
2. Literature Review
Distance estimation systems are crucial for autonomous navigation in mobile robots. Various methods have been developed in recent years, from laser-based solutions to vision-only approaches. This section provides an overview of the research relevant to this area.
Among the well-known approaches for distance estimation in robotic systems, laser rangefinders are among the highest-performance methods. By analyzing emitted laser beams and their reflections, LIDAR achieves high accuracy in distance measurement. Although these sensors provide great precision, they come at a high cost, making them unsuitable for budget-conscious applications (Muzal et al., 2021). Researchers have investigated alternative methods for tackling this challenge by integrating monocular cameras with laser pointers. For instance, a system for measuring and reconstructing targets using four lasers and a visual camera has been proposed to achieve high-accuracy geometry (Wang et al., 2016). Despite its effectiveness, motion vibration and computational errors affect the system’s performance.
Besides laser-based solutions, camera calibration approaches have been employed extensively to estimate distance in monocular camera setups. Chessboard calibration has been widely used to determine camera intrinsic parameters and allow precise perspective projections (Escalera et al., 2010). Various studies have also applied chessboard calibration to estimate distances by calculating the displacement of known reference objects in the image. For example, Xu et al. (2017) proposed a novel visual measurement method using a single camera to estimate the 3D positions of objects on the floor, leveraging extrinsic camera parameters and a chessboard pattern for calibration and achieving higher accuracy than traditional estimation methods. However, these methods often struggle with lens distortion, which introduces errors at longer distances.
For depth estimation, several types of methods rely on vision-only approaches. One such method is depth-from-motion, which uses a camera’s relative motion to measure depth. Following that approach, Zhuang et al. (1994) use a Kalman filter to improve predictions and morphological filtering to lower noise and increase accuracy. This method computes depth maps from monocular image sequences by combining direct depth estimation with optical flow techniques. Although promising, such methods typically require a sequence of images and extensive calculations, making real-time execution on mobile robots difficult. Another approach combines probabilistic geometry with local object detection, capturing the dependencies among objects, surface orientations, and the camera viewpoint, which allows highly accurate object and distance estimation (Hoiem et al., 2008). However, this approach has drawbacks: its performance is limited in difficult environmental conditions, and it requires known object sizes or detailed knowledge of the scene.
3. Proposed Methodology
This section outlines the process used to estimate the distance and coordinates of the camera mounted on the mobile robot relative to known objects in the camera’s field of view. The process is divided into three main stages: camera calibration using a chessboard pattern, calculating the pixel-to-real dimension conversion, and estimating distance in the X, Y, and Z coordinates. These stages are detailed as follows.
3.1 Camera Calibration using a Chessboard Pattern
Camera calibration is an important step in finding the intrinsic parameters of a camera, such as the focal length, lens distortion, and optical center, to enhance the accuracy of distance and coordinate measurements. This research applies a chessboard calibration method, a widely used approach due to its simplicity and accuracy.
A classical challenge in computer vision is three-dimensional (3D) reconstruction, which involves extracting 3D structural information from two-dimensional (2D) images of a scene (Forsyth and Ponce, 2015). Since real-world cameras are complex devices, photogrammetry techniques are employed to model the relationship between the measurements captured by the camera’s image sensor and the actual 3D world. In the widely used pinhole camera model, the connection between world coordinates X and image (pixel) coordinates x is established through a perspective transformation, given by Eq. (1).
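Although Eq. (1) itself is not reproduced in this text, in the standard pinhole model this perspective transformation is commonly written, with K the intrinsic camera matrix and [R | t] the extrinsic rotation and translation (both described below), as a reconstruction rather than the authors' exact notation:

s·x = K [R | t]·X,  with x ∈ P^2 (image coordinates), X ∈ P^3 (world coordinates), and s an arbitrary scale factor  (1)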
where P^n denotes the projective space of dimension n.
Multiplane calibration is a method of camera auto-calibration that enables the computation of a camera's parameters from two or more views of a flat, planar surface. The foundational work in this area was pioneered by Zhang (2000). The author's technique calibrates cameras by solving a homogeneous linear system that encapsulates the homographic relationships between several perspective views of the same plane. This multiview approach has gained popularity due to its practical simplicity: it is easier to capture multiple views of a flat surface, such as a chessboard, than to construct the precise 3D calibration rig required for Direct Linear Transformation (DLT) calibration. The figures below illustrate a practical example of multiplane camera calibration using multiple views of a chessboard.
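As a concrete illustration of this multiplane procedure, the following sketch uses OpenCV's chessboard detection and calibration routines. It is a minimal sketch under stated assumptions: the inner-corner grid (9 x 6), the square size, and the image folder are placeholders for illustration, not values taken from this work.

import glob
import cv2
import numpy as np

PATTERN = (9, 6)          # inner corners per row and column (assumed)
SQUARE_SIZE_CM = 2.5      # physical size of one chessboard square (assumed)

# 3D coordinates of the chessboard corners in the board plane (Z = 0)
objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE_SIZE_CM

obj_points, img_points = [], []
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 0.001)

for fname in glob.glob("calibration_views/*.jpg"):   # assumed folder of chessboard views
    gray = cv2.cvtColor(cv2.imread(fname), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, PATTERN, None)
    if found:
        corners = cv2.cornerSubPix(gray, corners, (11, 11), (-1, -1), criteria)
        obj_points.append(objp)
        img_points.append(corners)

# Solve for the camera matrix K and the distortion coefficients (k1, k2, p1, p2, k3)
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("Camera matrix K:\n", K)
print("Distortion coefficients:", dist.ravel())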
Some pinhole cameras introduce considerable distortion into images, with two primary types being radial and tangential. Radial distortion causes straight lines to appear curved, with the effect becoming more pronounced as points move farther from the center of the image. For example, within an image, two edges of a chessboard are marked with red lines; the actual border of the chessboard does not align with the red lines, illustrating the radial distortion. The expected straight lines bulge outward, highlighting the curvature caused by the distortion. The radial distortion can be calculated as follows:
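Eq. (2) and Eq. (3) are not reproduced in this text; the standard radial distortion model, which they presumably follow, is

x_distorted = x (1 + k1 r^2 + k2 r^4 + k3 r^6)  (2)
y_distorted = y (1 + k1 r^2 + k2 r^4 + k3 r^6)  (3)

where (x, y) are the undistorted normalized image coordinates and r^2 = x^2 + y^2.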
When the camera lens is not perfectly parallel to the imaging plane, tangential distortion occurs. This misalignment causes certain areas in the image to appear closer or farther away than expected. Tangential distortion typically results in a slight shift or tilt in the image, making objects appear distorted along the edges. The amount of tangential distortion can be mathematically expressed by Eq. (4) and Eq. (5):
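Eq. (4) and Eq. (5) are likewise not reproduced here; the standard tangential distortion model, which they presumably follow, is

x_distorted = x + [2 p1 x y + p2 (r^2 + 2 x^2)]  (4)
y_distorted = y + [p1 (r^2 + 2 y^2) + 2 p2 x y]  (5)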
where p1 and p2 are tangential distortion coefficients, and r is the radial distance from the center of the image. These formulas account for the deviation caused by the misalignment between the lens and the imaging plane. In short, correcting the distortions in a captured image requires determining five distortion coefficients, which are typically represented as:
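In the convention commonly used by calibration toolkits such as OpenCV, which this representation presumably follows, the coefficients are collected in a single vector:

distortion coefficients = (k1, k2, p1, p2, k3)  (6)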
where:
● k1 and k2: radial distortion coefficients that account for the bulging effect in the image.
● p1 and p2: tangential distortion coefficients, which handle the shift due to lens misalignment.
● k3: an additional radial distortion coefficient that further refines the correction, especially for higher-order distortions.
These coefficients transform the distorted image into its undistorted form, allowing for more accurate measurements and 3D reconstructions of the camera’s images.
Intrinsic parameters are specific to a camera and describe its internal characteristics. These include the focal lengths (fx, fy) and the optical center (cx, cy). The focal lengths determine how the camera converges light onto the image sensor, while the optical center indicates where the principal axis intersects the image plane. These parameters are combined into a camera matrix, which is used together with the distortion coefficients to correct lens distortion and increase the accuracy of mapping 3D world coordinates to 2D image coordinates. The camera matrix is unique to a particular camera, so once calculated, it can be applied to all images taken with that camera, eliminating the need to repeat the calibration process for future photos. The camera matrix K is represented as a 3x3 matrix in Eq. (7):
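Eq. (7) is not reproduced in this text; in its standard form the camera matrix reads

K = | fx  0   cx |
    | 0   fy  cy |
    | 0   0   1  |  (7)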
where:
● fx and fy are the focal lengths in the x and y directions, respectively.
● cx and cy are the optical center coordinates, also known as the principal point.
● The last row maintains the matrix format for homogeneous coordinates.
Extrinsic parameters define the position and orientation of the camera relative to the world coordinate system.
3.2 Calculating Pixel to Real Dimension Conversion
To convert pixel dimensions in the captured image into real-world units (centimeters), it was necessary to establish a connection between these two scales. For this process, a label or chessboard pattern with known physical dimensions, wlab in width and hlab in height, was used as a reference object. The camera captured an image of this chessboard pattern (label), and its dimensions in pixels, denoted as Lx (width in pixels) and Ly (height in pixels), were measured from the image.
Using this reference chessboard (label), shown in Fig 2, the pixel-to-real dimension conversion factors for both the x and y directions were computed. The conversion factor for the x direction was calculated from Eq. (8):
Similarly, the conversion factor for the y direction was computed as Eq. (9):
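Eq. (8) and Eq. (9) are not reproduced here; consistent with the definitions above, they presumably take the form (the symbols sx and sy are introduced only for readability)

sx = wlab / Lx  (8)
sy = hlab / Ly  (9)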
These conversion factors represented the physical size of one pixel in centimeters in each direction and were used to transform pixel dimensions into centimeters. This conversion was important for accurately estimating distances in the real world.
3.3 Estimating Distance in Z, X, and Y Coordinates
Camera position estimation is built on the triangular similarity principle and the equations for converting pixels to real-world dimensions. This section presents the concept and equations used to estimate the distances along the z, x, and y coordinates, along with the steps to calculate the camera’s position relative to the detected object (chessboard pattern or label).
The focal lengths fx and fy are calculated using the triangular similarity mentioned above, presented in Eq. (10) and Eq. (11), which relate the real-world dimensions of an object (in this case, a chessboard or label) to its dimensions in the camera’s field of view:
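The equations themselves are not reproduced in this text; assuming the standard similar-triangles relation of the pinhole model, they are presumably of the form

dximg / dxobj = fx / R  (10)
dyimg / dyobj = fy / R  (11)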
where:
● dximg: the dimension of the object in pixels along the x-axis.
● dxobj: the known real-world x dimension of the object.
● dyimg: the dimension of the object in pixels along the y-axis.
● dyobj: the known real-world y dimension of the object.
● R: the known distance from the camera to the object, measured once.
Rearranging these equations, fx and fy can be solved using Eq. (12) and Eq. (13):
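Under the presumed form of Eq. (10) and Eq. (11), the rearranged expressions are

fx = (dximg × R) / dxobj  (12)
fy = (dyimg × R) / dyobj  (13)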
Once the focal lengths fx and fy are determined, the distance to the chessboard or label, R, can be estimated from the chessboard’s dimensions in pixels using Eq. (14):
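Consistent with the relations above, and with dximg now taken from a newly captured image, the distance presumably follows as

R = (dxobj × fx) / dximg  (14)

with an analogous expression available from the y dimension.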
Once the distance R is calculated, the camera's position is estimated along the x and y axes relative to the detected object.
The differences in pixels between the center of the object and the center of the image, for both the x and y axes, are determined by Eq. (15) and Eq. (16):
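The equations are not reproduced in this text; a plausible form, with the symbols Δx and Δy introduced only for readability and the image center taken as half the image width and height, is

Δx = cxobj − (image width) / 2  (15)
Δy = cyobj − (image height) / 2  (16)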
where:
● cxobj: the x-coordinate of the object’s center, in pixels.
● cyobj: the y-coordinate of the object’s center, in pixels.
Eq. (17) and Eq. (18) convert these pixel differences into real-world distances using triangular similarity:
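These presumably take the similar-triangles form

dx = Δx × R / fx  (17)
dy = Δy × R / fy  (18)

so that a pixel offset at distance R maps to a real-world offset in centimeters.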
The distance along the z-axis, dz, represents the object's distance from the camera along the optical axis. In this case, dz also corresponds to the z position of the camera mounted on the mobile robot, which can be estimated using Eq. (19):
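Eq. (19) is not reproduced in this text; a form consistent with the definitions below and with Eq. (14) is

dz = (fx × wlab) / Lx  (19)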
where:
● Lx is the known dimension of the object in pixels.
● fx is the focal length along the x-axis of the camera, derived using the triangular similarity principle.
Eq. (20) calculates the camera’s position based on the real-world differences for the x-coordinate:
Similarly, Eq. (21) calculates the camera position for the y-coordinate:
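Eq. (20) and Eq. (21) are not reproduced in this text. Presumably they express the camera's x and y positions directly in terms of the real-world offsets of Eq. (17) and Eq. (18), for example x = dx and y = dy relative to the detected object (possibly with a sign convention); this particular form is an assumption for illustration, as the exact expressions are not shown.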
These computations determine the camera’s position relative to the object in real-world coordinates. Knowing the spatial relationship between the camera and the surrounding objects makes more precise 3D object detection and tracking possible.
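To make the pipeline of this subsection concrete, the following sketch combines the presumed forms of Eq. (15) to Eq. (19); the function name, argument names, and the numeric values in the example call are placeholders for illustration, not values taken from this work.

def estimate_camera_position(label_px_w, label_cx_px, label_cy_px,
                             img_w, img_h, fx, fy, wlab):
    # label_px_w          : detected label width in pixels (Lx)
    # label_cx/cy_px      : label centre in pixel coordinates (cxobj, cyobj)
    # img_w, img_h        : image resolution in pixels
    # fx, fy              : focal lengths in pixels, obtained from calibration
    # wlab                : real label width in centimeters

    # Depth along the optical axis, presumed Eq. (19): dz = fx * wlab / Lx
    dz = fx * wlab / label_px_w

    # Pixel offsets of the label centre from the image centre, presumed Eq. (15)-(16)
    dx_px = label_cx_px - img_w / 2.0
    dy_px = label_cy_px - img_h / 2.0

    # Convert pixel offsets to real-world offsets at distance dz, presumed Eq. (17)-(18)
    dx = dx_px * dz / fx
    dy = dy_px * dz / fy
    return dx, dy, dz

# Example call with placeholder values (pixels and centimeters)
print(estimate_camera_position(120, 640, 500, 1080, 1080, 900.0, 900.0, 10.0))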
4. Experiment Results
This experiment focuses on the accuracy of the proposed method for estimating the position of the camera and the chessboard in 3D space. To perform the experiment, a coordinate system with known real-world positions for the camera and chessboard was set up as shown in Fig 3. The camera was placed at various positions, and the real-world coordinates of the camera and the objects were recorded. Using the proposed method, the estimated positions of the camera (including its distance to the object) were then calculated from the captured images and compared with the actual measurements. The camera's resolution is 5 megapixels, and all images captured by it have the following characteristics:
- 1080 x 1080 pixels (width x height).
- 192 dpi horizontal and vertical resolution.
The camera’s estimated position was derived using the triangular similarity method described in the methodology section. The pixel-to-centimeter conversion was applied based on the known dimension of the label. The distance in the X, Y, and Z coordinates was estimated for each camera position using the derived focal length and the known real dimensions of the objects. The error between the real and estimated positions was calculated as follows:
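The exact formula is not reproduced in this text; given the per-axis values reported below, it is presumably the absolute difference between the estimated and measured coordinates along each axis, for example

error_x = | x_estimated − x_measured |

with analogous expressions for the y and z axes, and percentage errors taken relative to the measured values.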
The experimental results are summarized in Table 1, which shows the measured and estimated positions of the camera at various locations. These positions have different x and z values but the same y value. Since the mobile robots do not change their y position, the y value was kept the same for all cases in this experiment.
The average error in the Z-coordinate was 6.5894 cm, while the errors in the X and Y coordinates ranged from 3.6312 cm to 9.5887 cm. As seen from the data, the method provided relatively accurate estimates for positions where the camera was closer to the chessboard; however, the error increased slightly when the camera was positioned at greater distances.
From Fig 4 and the analysis results from Table 1, the estimated positions closely follow the real positions, indicating the proposed method's overall accuracy. The lines connecting the real and estimated points visually represent the error magnitude for each case.
● X-axis: the estimation errors for the X-coordinates are relatively small, and the estimated positions generally remain within a few centimeters of the real positions. The trend line for the X-axis is consistent across the different cases.
● Y-axis: Unlike the X-axis, the estimated values deviate more from the real (measured) values. These deviations are most noticeable when the camera is far from the chessboard. However, this method is intended for mobile robots moving on the floor, whose y-position rarely changes in practice; therefore, this error does not have a strong impact on the mobile robot's position estimation.
● Z-axis (depth): As expected from the numerical analysis in Table 1, the errors on the Z-axis are larger than those on the X and Y axes, especially when the camera is farther from the chessboard. However, the overall position trend along the Z-axis remains within an acceptable range.
The analysis of the test cases above showed that the difference between the estimated and measured positions along the Z-axis was generally larger than in the X and Y coordinates. One reason is that depth (Z-axis) estimation strongly depends on small variations in pixel dimensions, which can be affected by camera distortion and image resolution. For instance, when the camera moves far away from the chessboard/label, even small changes in pixel dimensions result in larger deviations in the Z-coordinate calculation. This result is consistent with previous studies that underline the difficulty of accurate depth estimation from a 2D image.
In contrast, the X and Y coordinate errors were more consistent and comparatively minimal across various camera positions. This can be attributed to the fact that these coordinates rely heavily on the difference between the chessboard's center point and the camera's center point in the image. This calculation is less sensitive to minute pixel changes.
Fig. 5 illustrates all the test cases under the 20% brightness condition and serves to visually compare the camera position estimation errors under different lighting conditions. By presenting this figure alongside the numerical data, the purpose is to highlight the impact of reduced brightness on estimation accuracy. The comparison allows a clearer understanding of how changes in lighting can influence the performance of the estimation model across the X, Y, and Z coordinates.
The results of the camera position estimation evaluation reveal that the average percentage errors for the X, Y, and Z coordinates differ between normal lighting conditions (Table 1) and reduced brightness (20%, Table 2). In Table 1, the average errors across 16 tests are 4.68% for the X-axis, 27.81% for the Y-axis, and 1.71% for the Z-axis. Under the 20% brightness condition, as shown in Table 2, the errors across 16 tests are slightly different, with 4.75% for the X-axis, a reduced error of 25.01% for the Y-axis, and 1.12% for the Z-axis.
These results suggest that reducing brightness to 20% had minimal impact on the X-axis estimation accuracy but improved the Y and Z-axis estimations. The significant decrease in error for the Y-axis indicates that lower brightness helped enhance the model's accuracy in estimating positions along this axis. Similarly, the reduced error in Z-axis estimation suggests improved precision in depth estimation under lower brightness. However, since the X-axis error did not show considerable change, it implies that brightness reduction had a limited effect on this coordinate's estimation accuracy. Overall, this analysis demonstrates that brightness conditions can influence camera position estimation accuracy, particularly along the Y and Z axes.
The experiment illustrates the effectiveness of the suggested approach for estimating the camera’s position in 3D spaces. The comparatively higher Z-coordinate error indicates that greater improvement could be useful, especially regarding reducing camera distortion and enhancing image quality for a more accurate depth estimate.
5. Conclusions
This research proposed a method for calculating a camera’s position in 3D space using chessboard calibration and pixel-to-real unit conversion equations. The experiments demonstrated the proposed approach's effectiveness in accurately estimating distance in the x, y, and z coordinates, focusing on analyzing the errors between the measured and estimated positions.
In general, the triangular similarity principle and the pixel-to-world dimension conversion proved to be reliable methods for estimating the position of a camera mounted on a mobile robot. Though minor, the errors observed in the experiment indicate the challenges of translating 2D pixel measurements into accurate 3D world coordinates. Despite these difficulties, the approach proved its accuracy in several scenarios and can be applied to various practical tasks.
In future work, depth estimation should be improved using more advanced approaches, such as multi-point calibration techniques, to minimize estimation errors. Furthermore, algorithms for updating and optimizing errors in real time could enhance the system's efficiency. These improvements would extend this low-cost approach to a wider range of scenarios with higher accuracy and performance.