Abstract
For self-driving vehicles, detecting lane lines in changeable scenarios is a fundamental yet challenging task. The rise of deep learning in recent years has contributed to the thriving of autonomous driving. However, existing deep learning-based lane detection methods place high demands on the computing environment, which restricts their applicability. This paper proposes an improved attention deep neural network (DNN), a lightweight semantic segmentation architecture designed for efficient computation under low memory, which contains two branches working at different resolutions. The proposed network integrates fine details captured by local pixel interactions at high resolution into global contexts at low resolution, computing dense feature maps for the prediction task. Based on the characteristics of the features at each resolution, different attention mechanisms are adopted to guide the network to exploit the model parameters effectively. The proposed network achieves results comparable with state-of-the-art methods on two popular lane detection benchmarks (TuSimple and CULane) while running faster, at 259 frames per second (FPS) on the CULane dataset with only 1.57 M model parameters. This study provides a practical and meaningful reference for the application of lane detection on memory-constrained devices.
1. Introduction
Lane detection is a research hotspot in autonomous driving and a key technology of Advanced Driver Assistance Systems (ADAS) [1]. A forward-looking camera mounted behind the front windshield captures the surrounding driving environment, and a vision-based processing algorithm embedded in the vehicle detects lane lines from the captured video clips. The results of lane detection are then applied to subsequent tasks including lane keeping [2], trajectory planning, and behavior prediction. Therefore, lane detection is an integral part of automatic driving and ADAS.
To extract lane markings from video clips, current studies have mainly investigated two routes: traditional vision methods and deep learning methods. Traditional vision-based lane detection algorithms mainly include two steps: extraction of hand-crafted features and fitting of geometrical curves. Researchers distinguish lanes from prior characteristics including lane shape, color features, and edge texture information [3], and then use polynomial curves or splines to fit lane boundary models to the distinctive features. These traditional vision-based methods can obtain satisfactory performance under the condition of clear lane markings and an unobstructed view. However, lane detection scenarios are subject to changes in illumination, weather conditions, traffic congestion, and other factors, which vary during driving and pose great challenges to lane estimation. These challenges are inevitable and can hamper detection accuracy, especially for traditional methods. Thus, a more robust method is needed to meet the high precision requirements of lane detection in challenging environments.
Deep learning, especially the convolutional neural network (CNN) [4], has rich feature representation capabilities, which greatly boosts the performance of computer vision (CV) systems [5]. Recent studies on lane detection have shown that deep learning-based methods can deliver strikingly better results than traditional methods that depend on hand-crafted cues, so deep learning methods have become the preferred option in this field. Compared with traditional algorithms that rely heavily on vision cues in specific environments, deep learning-based methods improve the network's scene perception ability by continuously optimizing neural network parameters, attaining higher robustness and applicability. Consequently, lane detection methods based on semantic segmentation and instance segmentation [6–9] have attracted extensive attention in recent years. However, some of these recent approaches directly adopt a classic segmentation network or its variants to segment lane markings. The results, although very encouraging, appear coarse in detail. The primary reasons come from two aspects: (1) in terms of accuracy, the segmentation of the boundary is less precise, especially when the lines are distant or occluded; (2) in terms of computing resources, excessive computation and parameter overheads lead to large memory usage and deficient real-time performance, which limits the practicability of the algorithm. In this paper, our motivation is to design an improved neural network architecture that compensates for the aforementioned dilemmas and strikes a trade-off between high accuracy and low memory consumption. To achieve this goal, we design a network framework with a two-branched structure. In this framework, we first utilize a lightweight downsampling module to squeeze the spatial dimension of the input feature map and forward the result into two branches. One branch, named global context embedding (GCE), focuses on capturing global information that can be used to deduce heavily occluded and blurred lane markings. The other branch, explicit boundary regression (EBR), exploits a spatial attention mechanism (SAM) to aggregate boundary information at different locations, and a supervisory signal derived from the label's edge information is attached at the end of EBR. In particular, SAM is integrated into the EBR module for better boundary regression, and a channel attention mechanism (CAM) is placed behind the output of the GCE encoder, so that channels containing target objects are assigned higher weights. Inspired by the effectiveness of the MobileNet [10] series, we replace regular convolution layers with bottleneck units when deepening the network, which reduces the number of network parameters.
The main contributions of this study can be summarized as follows. (1) This paper proposes a lightweight DNN framework that simultaneously addresses precision and memory overhead issues, which makes it better suited for the lane detection task. (2) This paper designs an EBR module with SAM and auxiliary edge supervision to reinforce the consistency of semantic boundaries. The corresponding experimental results indicate that these modules significantly improve the precision of lane boundaries. (3) The experimental part elaborates on the details of the ablation study and compares the segmentation results on the TuSimple and CULane datasets with other methods. The results show that the proposed model attains faster inference with competitive performance compared with the state of the art, and the algorithm is robust enough to handle lane detection in dark, dazzling, and blurred environments.
The remainder of this paper is organized as follows. Section 2 reviews previous research on lane detection. Section 3 introduces the proposed method. Section 4 demonstrates the experimental performance of the proposed method. Section 5 concludes this article.
2. Related Works
The present work builds heavily on prior efforts in the areas of lane detection and attention mechanisms.
2.1. Lane Detection
Before the rise of deep learning, lane detection methods generally focused on feature extraction, model fitting, and lane tracking. Researchers used the color, boundary, and texture features of the road to realize lane detection and tracking. We summarize a general block diagram of the traditional lane detection methods in Figure 1. For more algorithm types and implementation details of image preprocessing, region of interest (ROI) selection, lane modeling, lane detection, and lane tracking in traditional methods, please refer to [11, 12].

Methods based on deep learning have gradually become dominant in vision tasks owing to their powerful representation capabilities. In order to handle the complex situations in lane detection, [6] designed a hybrid deep architecture combining an RNN with a CNN: consecutive frames are fed into the CNN for feature extraction, and the RNN then learns from the extracted features to predict lanes. The authors of [7] design a network named RS-Lane, which adds split attention and self-attention distillation on the basis of LaneNet to increase the reasoning ability and robustness of the method; this work can detect lane lines without a limit on their number. In [13], the authors improve the you only look once (YOLO) object detector and detect yellow lane lines by means of object detection. Pan et al. [8] proposed a spatial CNN (SCNN) that aggregates the features of each pixel through slice-by-slice convolution within a layer, resulting in top-1 performance on the CVPR'17 TuSimple benchmark. However, to exploit spatial information sufficiently, this method gathers both horizontal and vertical information by shifting the sliced features recurrently, which is conceivably time-consuming. Tu et al. [9] employ the same spatial information utilization strategy as SCNN but simplify the algorithm by changing the information shifting strides. Qin et al. [14] realize fast lane detection by gridding the image and searching the lane grids row by row and column by column. Even though this method benefits from the fast speed delivered by grid downsampling, it cannot avoid the reduced accuracy caused by the low-resolution sparse feature map. In [15], the authors proposed PolyLaneNet, which converts the lane estimation task into regressing polynomials that represent each lane in the input image. This approach achieves high real-time efficiency; however, its accuracy relies on the positions of the lane starting points, so performance drops significantly when lanes suffer severe occlusion. A typical framework of deep learning-based lane detection is shown in Figure 2.

2.2. Attention Mechanism
The attention mechanism originated from research in natural language processing (NLP) and has recently been applied to the field of CV. This mechanism selectively focuses on important features and suppresses irrelevant ones, thus improving the performance of DNNs. SENet [16] proposes a "Squeeze-and-Excitation" block to model interdependencies between channels. By recalibrating the channelwise feature responses, the network produces significant performance improvements at negligible overheads. Different from SENet, the approaches of [5, 17] both exploit semantic interdependencies in the spatial and channel dimensions. In order to make the best use of the two complementary attention outputs, [5] propagates the channelwise output to the spatialwise attention submodule in a sequential arrangement, while [17] arranges the two modules in parallel and sums the outputs to further improve the feature representation. In this work, we utilize channel attention and spatial attention in different branches to enhance the representation power of the network; the channel attention and spatial attention are similar to those of DANet [17] and CBAM [5], respectively. In [18], a self-attention distillation (SAD) approach is proposed to improve the representation learning of CNN-based lane detection models, which is a flexible plug-and-play module. For more deep learning-based lane detection works, please refer to [19].
Closer to our work, [20] introduces a channel attention module and a self-attention module in parallel to obtain global contexts and channel dependencies of feature maps. However, the successive dilated convolutions introduced in the subsampling process and the large matrix multiplications under the self-attention mechanism incur expensive computation and memory burdens. In our network, we utilize a lightweight downsampling module to gather low-stage feature maps and exploit spatial and channel attention within a simple yet efficient architecture.
3. Methods
In this section, we first present the general architecture of our model and then elaborate on the inner blocks used to capture global context information and spatial edge information. Finally, we demonstrate how the information is aggregated to further enhance the feature representation.
3.1. Architecture Design
Figure 3 is a pictorial description of the proposed network architecture. The architecture consists of four components split across two branches. The upper branch encodes increasingly abstract feature representations with long-range context in the deeper encoder outputs, while the lower branch preserves spatial details in low-level, high-resolution feature maps. To take advantage of both branches, elementwise summation is employed to fuse their features. The following subsections describe the details of each module.
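For orientation, the following is a minimal, self-contained PyTorch sketch of this two-branch layout. The channel widths, layer counts, and placeholder convolution stacks are illustrative only; the actual branches are the GCE and EBR modules described in Sections 3.2–3.5.

```python
import torch
import torch.nn as nn

class TwoBranchSketch(nn.Module):
    """Toy stand-in for Figure 3: a shared downsampling stem feeds a low-resolution
    context branch and a high-resolution detail branch, fused by elementwise summation."""
    def __init__(self, num_classes=5):
        super().__init__()
        self.stem = nn.Sequential(                      # lightweight downsampling (Section 3.2)
            nn.Conv2d(3, 32, 7, stride=2, padding=3), nn.BatchNorm2d(32), nn.ReLU(inplace=True))
        self.context_branch = nn.Sequential(            # stand-in for GCE (Section 3.3)
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, 2, stride=2))    # back to the stem resolution
        self.detail_branch = nn.Sequential(             # stand-in for EBR (Section 3.4)
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True))
        self.classifier = nn.Conv2d(32, num_classes, 1)

    def forward(self, x):
        s = self.stem(x)
        fused = self.context_branch(s) + self.detail_branch(s)  # elementwise summation fusion
        return self.classifier(fused)

# y = TwoBranchSketch()(torch.randn(1, 3, 256, 512))  # -> (1, 5, 128, 256)
```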

3.2. Lightweight Downsampling
In the lightweight downsampling module, we consider different strategies to subsample the original input. A large kernel (kernel size = 7) is adopted in the first subsampling step to enhance the dense connections between feature maps and per-pixel classifiers, strengthening the robustness to local disturbances and allowing the classifiers to handle varying distortions from cameras. We emphasize that subsampling with a larger-kernel filter does not impose an additional parameter burden on the network and even reduces the parameter count to some extent. The bottleneck unit [10] consists of two point convolutions and a group convolution with stride 2, which further subsamples the input and generates abstract feature maps with a compact channel dimension. Furthermore, we adopt the bottleneck with a residual shortcut as a substitute for conventional convolution to reduce network overhead, as shown in Figure 4. These two feature extraction strategies are combined in the lightweight downsampling module to realize the aggregation of deep abstract features while reducing the number of parameters.
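As a hedged illustration, the sketch below implements a MobileNet-style bottleneck (pointwise expansion, grouped 3×3 convolution, pointwise projection, with a residual shortcut when the shape is preserved) and chains it after a large-kernel strided convolution. The channel widths and expansion factor are assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Pointwise expand -> grouped (depthwise) 3x3 -> pointwise project.
    Stride 2 subsamples; with stride 1 and matching channels a residual shortcut is used."""
    def __init__(self, in_ch, out_ch, stride=1, expand=4):
        super().__init__()
        mid = in_ch * expand
        self.use_residual = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, stride=stride, padding=1, groups=mid, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch))

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out

class LightweightDownsampling(nn.Module):
    """Large-kernel strided convolution followed by a strided bottleneck (Section 3.2 sketch)."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(3, 32, 7, stride=2, padding=3, bias=False),
                                  nn.BatchNorm2d(32), nn.ReLU(inplace=True))
        self.down = Bottleneck(32, 64, stride=2)

    def forward(self, x):
        return self.down(self.stem(x))
```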

3.3. Global Context Embedding
The GCE module involves two parts laid out sequentially to encode long-range contexts. The first part consists of a strided convolution and dilated convolutions, which enable high-stage features to obtain richer global information. The other part is the introduction of CAM. Since the convolutional operation extracts informative features by blending cross-channel information with simple summation, the importance of different channels is ignored. However, we expect the network to selectively emphasize channels that contain lane line semantic features and restrain irrelevant ones. Channel attention provides a means of recalibrating channelwise feature responses according to interchannel correlations, which increases the network's sensitivity to crucial and informative features. CAM calculates a specific weight for each channel of the feature map, so that the channels that cover the target features receive more attention. Therefore, the precision of the network can be significantly improved while only slightly increasing computation, which is consistent with the theme of our proposed method.
Concretely, the second strategy in Section 3.2 is first introduced to subsample the input; then dilated convolutions are stacked to expand the receptive field of the filters. Next, CAM is appended to aggregate intraclass consistency and enhance robustness to local disturbances; the structure is illustrated in Figure 5(a). We describe the operation of CAM in detail below. Suppose that $A \in \mathbb{R}^{C \times H \times W}$ is the output of the previous layer. Firstly, we reshape it to $\mathbb{R}^{C \times N}$, where $N = H \times W$ represents the total number of pixels. Then, we conduct a matrix multiplication between $A$ and its transpose $A^{T}$.

Next, the resulting matrix is forwarded into a softmax layer to normalize the interchannel dependencies between any two channel maps:

$$x_{ji} = \frac{\exp(A_i \cdot A_j)}{\sum_{i=1}^{C} \exp(A_i \cdot A_j)},$$

where $x_{ji}$ denotes the influence exerted by channel $i$ on channel $j$. Finally, we multiply the channel attention map with the corresponding channels and perform a summation with the original channels to obtain the final feature maps as follows:

$$E_j = \beta \sum_{i=1}^{C} (x_{ji} A_i) + A_j,$$

where $\beta$ is a learnable parameter which is initialized as 0 and gradually learns to assign more weight.
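A compact PyTorch sketch of this DANet-style channel attention, consistent with the reconstruction above (the exact normalization details of the authors' implementation are assumed):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: inter-channel affinities from A·Aᵀ reweight the channels,
    scaled by a learnable beta initialized to 0 and added back to the input."""
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, a):                        # a: (B, C, H, W)
        b, c, h, w = a.size()
        flat = a.view(b, c, -1)                  # (B, C, N), N = H * W
        energy = torch.bmm(flat, flat.transpose(1, 2))   # (B, C, C) channel affinities
        attn = self.softmax(energy)              # normalize dependencies per channel
        out = torch.bmm(attn, flat).view(b, c, h, w)
        return self.beta * out + a               # residual summation with the input
```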
3.4. Explicit Boundary Regression
The lane lines in the distance are inconspicuous and incomplete due to the influence of lighting and occlusion. If we merely perform ordinary convolution operations on the high-resolution input to generate local feature maps, the discriminability of these indistinct lane targets will be overwhelmed by other salient objects, resulting in misclassification and missed detections in the semantic segmentation. To remedy this issue, we first apply SAM to enhance the feature discrimination ability; it redistributes weights based on the interspatial relationship of features in a global view, as shown in Figure 5(b). We improve the SAM proposed in [5], which reallocates the weight of each pixel in the feature map and assigns higher weights to pixels containing target cues, thus improving the representation of indistinct boundary features. Given the spatially refined features, we introduce an edge supervision to guide the network to learn the boundary characteristics. This supervision acts as an auxiliary boundary segmentation task, enabling the network to achieve EBR. Next, we expound the process of SAM and edge supervision.
As illustrated in Figure 5(b), given a local feature $F \in \mathbb{R}^{C \times H \times W}$, we first feed it into point convolution, max-pooling, and average-pooling operations along the channel axis to generate feature descriptors $F_{p}$, $F_{max}$, and $F_{avg}$, respectively, where $F_{p}, F_{max}, F_{avg} \in \mathbb{R}^{1 \times H \times W}$. Then we concatenate the above three descriptors and forward them into a convolution layer with a large kernel size of 7. After that, we apply a sigmoid layer to calculate the 2D spatial attention map $M_{s} \in \mathbb{R}^{1 \times H \times W}$. Finally, we perform an elementwise multiplication between $M_{s}$ and the feature map $F$ to obtain the final output $F'$ as follows:

$$M_{s} = \sigma\left(f^{7 \times 7}\left(\left[f^{1 \times 1}(F); \mathrm{MaxPool}(F); \mathrm{AvgPool}(F)\right]\right)\right), \qquad F' = M_{s} \otimes F,$$

where $f^{1 \times 1}$ and $f^{7 \times 7}$ represent convolutional operations with kernel sizes of 1 and 7, respectively; $\sigma$ is the sigmoid function; and $\otimes$ denotes elementwise multiplication.
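The following PyTorch sketch follows the reconstruction above (a CBAM-like spatial attention extended with a pointwise-convolution descriptor); the descriptor names and single-output-channel choices are assumptions.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Three single-channel descriptors (1x1 conv, channelwise max, channelwise mean)
    are concatenated, filtered by a 7x7 conv, and squashed by a sigmoid into a
    spatial weight map that rescales the input feature."""
    def __init__(self, in_ch):
        super().__init__()
        self.point = nn.Conv2d(in_ch, 1, kernel_size=1)
        self.conv7 = nn.Conv2d(3, 1, kernel_size=7, padding=3)

    def forward(self, f):                                  # f: (B, C, H, W)
        d_point = self.point(f)                            # (B, 1, H, W)
        d_max, _ = f.max(dim=1, keepdim=True)              # (B, 1, H, W)
        d_avg = f.mean(dim=1, keepdim=True)                # (B, 1, H, W)
        m = torch.sigmoid(self.conv7(torch.cat([d_point, d_max, d_avg], dim=1)))
        return f * m                                       # elementwise reweighting
```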
To enhance the continuity and discriminability of the lane boundary, we employ the Sobel edge extraction operator to filter the semantic labels and exploit the result as a supervision signal. It is worth noting that the lane lines occupy only a small proportion of the labels, and the imbalance between background and foreground is detrimental to segmentation performance. Thus, we employ the focal loss [21] as an auxiliary loss function to supervise the output of SAM:

$$L_{edge} = -\sum_{i=1}^{N} (1 - p_i)^{\gamma} \log(p_i),$$

where $p_i$ means the probability of class $i$, $i \in [1, N]$, $N$ is the maximal number of labels, and $\gamma$ is a modulating factor which is set to 2 here.
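A minimal sketch of both pieces, assuming a binary lane-versus-background edge target and a sigmoid edge head; the exact Sobel thresholding and the multi-class form of the focal loss used by the authors are not specified, so this is illustrative only.

```python
import torch
import torch.nn.functional as F

def sobel_edge_targets(label_mask):
    """Derive a binary edge map from an integer label mask via Sobel filtering."""
    m = (label_mask > 0).float().unsqueeze(1)               # (B, 1, H, W) lane vs. background
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=label_mask.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)                                  # Sobel kernel for the other direction
    grad = F.conv2d(m, kx, padding=1).abs() + F.conv2d(m, ky, padding=1).abs()
    return (grad > 0).float().squeeze(1)                     # (B, H, W) edge supervision targets

def focal_loss(edge_logits, edge_targets, gamma=2.0):
    """Binary focal loss on the SAM output (gamma = 2 as in the paper)."""
    p = torch.sigmoid(edge_logits)
    pt = torch.where(edge_targets > 0.5, p, 1 - p)           # probability of the true class
    return (-(1 - pt) ** gamma * torch.log(pt.clamp(min=1e-6))).mean()
```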
3.5. Integration and Classification
To embed rich semantic information into the low-level features, we apply a transpose convolution with stride 2 to upsample the outputs of GCE and integrate them with the outputs of EBR by summation. From the perspective of feature fusion, using EBR also helps bridge the gap between low-level and high-level features, as emphasized in [22]. Two bottleneck layers with residual connections then tightly aggregate the features from the two branches. Finally, we convolve the fusion result to generate the final prediction maps.
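A sketch of this fusion head is shown below. The channel counts are assumptions, the stride-2 transpose convolution presumes the GCE output is exactly half the EBR resolution, and plain convolutions stand in for the two residual bottleneck units.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Upsample the GCE output, sum with the EBR output, refine, and classify."""
    def __init__(self, gce_ch=128, ebr_ch=64, num_classes=5):
        super().__init__()
        self.up = nn.ConvTranspose2d(gce_ch, ebr_ch, kernel_size=2, stride=2)
        self.refine = nn.Sequential(                        # stand-in for two residual bottlenecks
            nn.Conv2d(ebr_ch, ebr_ch, 3, padding=1), nn.BatchNorm2d(ebr_ch), nn.ReLU(inplace=True),
            nn.Conv2d(ebr_ch, ebr_ch, 3, padding=1), nn.BatchNorm2d(ebr_ch), nn.ReLU(inplace=True))
        self.classifier = nn.Conv2d(ebr_ch, num_classes, kernel_size=1)

    def forward(self, gce_feat, ebr_feat):
        fused = self.up(gce_feat) + ebr_feat                # elementwise summation fusion
        return self.classifier(self.refine(fused) + fused)  # residual-style refinement
```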
4. Experiment
We evaluate our network on two widely used lane detection benchmark datasets: TuSimple benchmark dataset [23] and CULane dataset [8]. We first introduce the datasets and report implementation details, then perform a series of ablation experiments, and present comparison results with other state-of-the-art approaches.
4.1. Datasets and Implementation Details
TuSimple: This is a well-known traffic lane detection benchmark in which up to five lane markings are annotated per image; it involves 3626 images for training and 2782 images for testing. Because no validation set is given, we randomly split 368 images from the training set as a validation set, which is used to prevent overfitting and to validate the model during training. CULane: The CULane dataset is a large-scale, challenging lane detection dataset that contains more than 55 hours of video from which 133235 frames are extracted, involving 88880 frames for training, 9675 frames for validation, and 34680 frames for testing. This dataset covers urban, rural, and highway conditions and consists of normal scenarios as well as challenging ones such as night scenes.
4.1.1. Implementation Details
Training: Our experiments are executed on the PyTorch deep learning framework. Because training at the full input resolution is computationally expensive, we first resize the original images to a lower resolution for TuSimple and for CULane. We then train the network using stochastic gradient descent (SGD) with momentum 0.9 and weight decay 0.0001; the batch size is set to 8. Influenced by the success of DeepLab [24] in semantic segmentation, we employ a similar "poly" learning rate policy, where the initial learning rate is multiplied by $(1 - \frac{iter}{max\_iter})^{power}$. We adopt the cross-entropy loss to measure the similarity between the prediction mask and the ground truth. Besides, a weighted focal loss serves as an auxiliary loss function for precise boundary regression:

$$L = L_{seg} + w \cdot L_{edge},$$

where $L_{seg}$ represents the segmentation loss calculated by cross-entropy, $L_{edge}$ is the boundary regression loss computed by the focal loss, and $w$ is a weight factor. Training is terminated when the number of epochs exceeds 100, and the model with the lowest validation loss is selected as the final testing model. Metrics: The evaluation metrics used to quantify the segmentation performance consist of mean intersection-over-union (mIoU), accuracy (Acc), false positives (FP), false negatives (FN), true positives (TP), and F1-measure (F1). With respect to speed, we adopt FPS.
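A small sketch of the schedule and combined loss described above. The power exponent of 0.9 is DeepLab's common default and is assumed here rather than stated in the paper; the binary focal term mirrors the sketch in Section 3.4.

```python
import torch
import torch.nn.functional as F

def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """'Poly' policy: base_lr * (1 - iter / max_iter) ** power."""
    return base_lr * (1 - cur_iter / max_iter) ** power

def total_loss(seg_logits, seg_labels, edge_logits, edge_targets, w=0.2, gamma=2.0):
    """Cross-entropy segmentation loss plus w-weighted focal edge loss (w = 0.2, Section 4.2.2)."""
    l_seg = F.cross_entropy(seg_logits, seg_labels)        # seg_labels: (B, H, W) class indices
    p = torch.sigmoid(edge_logits)                         # binary focal loss on the edge head
    pt = torch.where(edge_targets > 0.5, p, 1 - p)
    l_edge = (-(1 - pt) ** gamma * torch.log(pt.clamp(min=1e-6))).mean()
    return l_seg + w * l_edge
```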
The Acc metric defined in the TuSimple benchmark is shown below:

$$Accuracy = \frac{\sum_{clip} C_{clip}}{\sum_{clip} S_{clip}},$$

where $C_{clip}$ denotes the number of points detected correctly and $S_{clip}$ represents the total number of ground-truth points in the clip.
The F1 metric defined on the CULane dataset is shown below:

$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}, \qquad Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN}.$$
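For concreteness, a tiny helper computing this metric from lane-level counts (the counts themselves come from CULane's IoU-based lane matching, which is not reproduced here):

```python
def culane_f1(tp, fp, fn):
    """F1 from matched (TP), spurious (FP), and missed (FN) lane counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Example with made-up counts: culane_f1(tp=900, fp=100, fn=150) ~= 0.878
```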
4.2. Ablation Study
In this section, we decompose our network step by step to investigate the effect of each component of the proposed method. In the following experiments, we compare the inner structures of the proposed architecture on the TuSimple dataset and use the framework without the attention modules and edge supervision as the baseline for the ablation experiments.
4.2.1. Attention Mechanism
We use CAM in the GCE module to capture long-range dependencies for better lane marking inference, and SAM is used to break the situation in which all positions are weighted identically, which is complementary to the channel attention. To verify the performance of these components, we explore four different combinations in Table 1. It can be seen from the table that integrating SAM and CAM significantly improves the segmentation accuracy of the network, especially CAM. SAM is placed on the high-resolution feature map branch to optimize local position details by increasing the weights of pixels containing lane features, thus improving detection accuracy over a small range. CAM is integrated into the high-semantic feature map branch to optimize the global classification accuracy of the network by enhancing the weights of channels containing lane cues, resulting in an obvious boost in detection accuracy. These two submodules are arranged in parallel on different branches, so the dependencies modeled by the two attention modules do not affect each other. Therefore, as seen in the last row of Table 1, the precision of the combined model is further improved.
4.2.2. Boundary Regression
In order to better identify and locate boundary features, we employ label-based edge detection results as the supervision signal to strengthen the network's representation ability, attaching the edge supervision to the output of SAM. This improves the continuity of the lane boundary and contributes explicitly to the regression of lane lines. We first quantitatively test the balance between the two losses to analyze the correlation between the weight of boundary regression and the final segmentation results; values of {0.05, 0.1, 0.2, 0.5, 0.75} are used to weight the focal loss. It is apparent from Figure 6 that, under the same settings, mIoU reaches its peak of 70.60% with w of 0.2. The initial climbing trend of the line chart shows that moderate edge supervision helps the network efficiently extract lane lines, while excessive reliance leads to a rapid decline in network performance. Therefore, we set the turning point as the final weight factor.

We visualize the effect of the auxiliary edge supervision in Figure 7. The first and second columns show the original image and the ground truth, respectively. The last two columns exhibit the detection results without and with edge supervision. From the experimental results, the fourth column obviously outperforms the third column in terms of refined edge details.

4.3. Performance Evaluation on TuSimple
We carry out comparative experiments on TuSimple against other state-of-the-art methods to further evaluate the performance of our approach. The compared methods include RESA [9], Ultra-Fast [14], PolyLaneNet [15], SAD [18], FastDraw [25], and LaneNet [26]. The quantitative results are reported in Table 2. The experimental results show that the proposed method outperforms LaneNet and PolyLaneNet by 1.73% and 1.75%, respectively, and achieves performance comparable to Ultra-Fast. PolyLaneNet obtains lane detection results by predicting the polynomial parameters of the lane lines, but a slight deviation in the parameters leads to a significant decrease in the accuracy of the detected lanes. RESA achieves excellent performance by recurrently convolving sliced feature maps vertically and horizontally to gather global information for each pixel; this introduces a large number of convolution operations, which significantly increases the number of parameters and the inference time of the network. Figure 8 displays an intuitive comparison of the results. Because TuSimple's evaluation metric allows the predicted points to deviate from the true points within a certain threshold, our quantitative accuracy is only 1.75% higher than that of PolyLaneNet; however, it is apparent from the qualitative figure that our results are more consistent with the real lines and the deviation is smaller. In order to verify the lightweight structure and real-time performance of our model, further comparative experiments were carried out on inference time and parameter counts. We run the network 100 times in a loop to measure the runtime of a single frame; the last two columns in Table 2 provide the results. The experimental results show that our network is 2.7×, 5.5×, and 24.4× faster than FastDraw, ENet-SAD, and RESA, respectively. In addition, the number of model parameters is only 1/16 and 1/39 of those of RESA and Res18-UFast. From these comparisons, although RESA and ENet-SAD achieve better accuracy, our method provides a more practical and comprehensive choice considering the high real-time requirements of practical application scenarios and the limited memory of embedded devices.

On the basis of the above ablation and comparative experiments, we also test the robustness of the algorithm. In the course of driving, the illumination of the road surface often changes when the road is sheltered by shadows or when passing under a viaduct. Furthermore, spurious marks on the road surface and significant changes in road color will also interfere with the lane detection results. Therefore, the lane detection algorithm needs to be robust enough to deal with these situations. In Figure 9, we first display multiple road scenes with obvious challenges in road lighting, surface color, and spurious dirt traces. Then, the lane line clustering and polynomial fitting results are superimposed on the original images to provide a more intuitive view, displayed in the following rows. The experimental results show that the proposed algorithm can accurately identify the lanes, including in scenarios with curves and road slopes.

4.4. Performance Evaluation on CULane
To verify the effectiveness and generality of the proposed method, we conduct comparative experiments on the CULane dataset. Several recent lane detection methods, including ENet-SAD [18], SCNN [8], FastDraw [25], Res18-Ultra [14], and Res18-VP [27], are used for comparison. Table 3 presents the results. Compared with the TuSimple dataset, the CULane dataset has a larger and richer training set and can better exercise the generalization ability of the model in different scenarios. It can be seen from Table 3 that the proposed method is comparable with Res18-VP and Res18-Ultra in the F1 metric and has outstanding parameter efficiency, which indicates that our method strikes a balance among accuracy, speed, and computational burden. Compared with ENet-SAD, our model has 0.59 M more parameters; considering that both models are lightweight structures with fewer than 2 M parameters in total, the practical influence of this 0.59 M difference is not significant. In terms of speed, the detection efficiency of the proposed network is more than 5 times that of ENet-SAD; this significant gap in real-time performance is even more evident in practical applications and can meet the detection speed requirements of various situations. We then select challenging scenarios from the test set to intuitively verify the effectiveness of the proposed method. Figure 10 presents the visualizations.

5. Conclusion
In this paper, we propose an efficient lane detection method based on a lightweight attention DNN, which is tailored for the real-time lane detection task. Our method can effectively capture global context information to segment occluded lane lines and classify each pixel. Meanwhile, it retains high-resolution dense features to better infer inconspicuous lane boundaries. In order to generate semantically precise prediction maps and refine segmentation results along lane boundaries, we further incorporate attention and edge supervision mechanisms into the network. We evaluate the effectiveness and generality of our proposed method on the TuSimple and CULane datasets. Although the results show that state-of-the-art methods can obtain slightly higher accuracy, their model parameters and computational overheads far exceed those of our network. Thus, our proposed method strikes a balance between accuracy and computational cost. Extensive experiments demonstrate that our network attains robust and effective performance under challenging scenarios, which provides a reference for the implementation of lane detection on embedded devices. In the future, we will continue to explore how to improve the accuracy of the model while maintaining the balance between accuracy and efficiency. In addition, we will deploy our algorithm on an embedded vehicle platform to guide subsequent obstacle avoidance and planning tasks.
Data Availability
The datasets are available from http://github.com/TuSimple/tusimple-benchmark and https://xingangpan.github.io/projects/CULane.html.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was partially supported by the Jiangsu Provincial Agricultural Science and Technology Independent Innovation Fund Project (no. CX(21)2025), National Natural Science Foundation of China (no. 61873064), and Suzhou Science and Technology Plan Project (no. SNG2020039).