A genetic programming approach to explore the crash severity on multi-lane roads
Introduction
Crashes on high-speed (speed limit greater than 45 mph), multi-lane arterial corridors (more than one lane in each direction of travel) with partially limited access account for a significant proportion of traffic fatalities. In the state of Florida, crashes on high-speed (speed limit equal to or greater than 45 mph), multi-lane arterial corridors with partially limited access account for 45.36% (NHTSAT, 2007) of the total number of fatalities related to speeding. Changing traffic conditions and environmental settings make highway safety and traffic operations a perennial field of concern. Numerous state-of-the-art methods to improve the safety of roadways are available to the practicing engineer; hence the challenge is not only to identify which method suits best but also to explore what new insight can be added to the existing body of knowledge.
In the field of transportation safety, it is important not only to identify the contributing factors but also to understand their contribution to the problem at hand. Innovative methodologies are being adopted to understand this contribution and to assess the safety situation better. Since the data used in this study are observational (i.e. collected outside the purview of a designed experiment), an information discovery approach has to be adopted. Pande and Abdel-Aty (2008), in their work on association rules, point out that data mining techniques remain underutilized for the analysis of crashes. The underutilization is especially noteworthy since most studies use observational data collected outside the purview of an experimental design. Apart from using a new methodology, the authors have also approached the roadway elements in a more unified way. In this study the corridors have been treated in their entirety, i.e. the segments and intersections have been put together. The corridors have then been clustered into four groups based on corridor length. Abdel-Aty and Wang (2006) have shown a spatial correlation between the crash patterns of successive signalized intersections, which may be attributed to the characteristics of the segments joining them. In Florida, all crashes occurring within 250 ft of the center of an intersection are categorized as intersection-related crashes. Recently, Das et al. (2008) showed that proximity alone is not the best way to assign crashes. Wang et al. (2008) used frequency modeling for crashes with fixed as well as varying influence distances and found a different set of significant factors. These recent studies justify treating the corridor as a whole rather than breaking it into segments and intersections.
In the present study the authors set up a classification problem for injury occurrence as well as for the severity of crashes. In a typical classification problem the algorithm develops a set of rules which, when followed, lead to a particular category of the target variable. For example, in crash severity analysis, when the binary target variable represents severe/non-severe crashes, the classification rule developed will lead to either severe or non-severe crashes. The variables that enter the rule are significant, and their directionality is critical for understanding the contribution of each variable to the analysis. Abdel-Aty and Pande (2006) used classification trees and neural networks to detect the relationship between real-time freeway traffic conditions and rear-end crashes. However, according to Deschaine and Francone (2004), genetic programming (GP) performs better than classification trees in terms of lower error rates and also outperforms neural networks in regression analysis. GP is a heuristic search technique that iteratively evolves better programs, which could either be the best solutions or lead to the best solutions. GP, an evolutionary computation technique, is based on genetic algorithms (GA). In GA, the optimum solution is sought using well-established mechanisms of evolutionary biology. In a recent work by Makkeasorn et al. (2006) in the field of water resources management, soil moisture estimation models were developed with the Discipulus™ Linear Genetic Programming (LGP) software and were applied to soil moisture distribution analysis. The work shows that LGP, a type of GP, helps in the development of excellent nonlinear multivariate regression models. The work also compared the LGP model independently with linear regression and nonlinear regression models, and the LGP model was found to be the best for the data.
The linear regression model overestimated the soil moisture, while the nonlinear regression models tended to underestimate it. According to Chang and Chen (2000), the regression models generated by GP are also independent of any model structure. The use of GA in transportation is not new. GA has been used widely in traffic signal system optimization and network optimization (Park et al., 2000, Ceylan and Bell, 2004, Teklu et al., 2007). The use of GA or GP in transportation safety studies is relatively new, and hence the authors intend to test the method and observe its potential. A set of roadway geometric variables was chosen to understand the classification of injury/non-injury crashes as well as severe/non-severe crashes. The authors also use Discipulus™ for the classification problem. Aspects of the software critical to the study at hand are discussed in Section 3 of the paper.
The focus of this study is to evolve the best possible classification rules using the LGP methodology. The best program developed by LGP is essentially a set of line-by-line instructions. When the instructions are read from top to bottom, they lay out a classification rule. The use of LGP to detect the classification rule is an improvement over the other existing methodologies. In LGP, the heuristic search for the best program proceeds by evolving numerous programs. The process terminates when no further improvement in classification, i.e. no further decrease in misclassification, is observed. The details of the selection process are given in Section 3.
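The idea of a program as line-by-line instructions that together form a classification rule can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the Discipulus™ implementation: the instruction format, the two-register machine, the 0.5 decision threshold, and the function names are hypothetical, and a simple mutation-only hill climb stands in for the population-based evolutionary search. Only the stopping idea (stop when no further decrease in misclassification is observed) follows the text above.

```python
import random

# Hypothetical instruction format: (dest_register, op, src_a, src_b).
# Sources index a combined bank: registers first, then read-only inputs.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}
N_REGISTERS = 2


def run_program(program, inputs):
    """Execute the instructions top to bottom; register 0 holds the output."""
    registers = [0.0] * N_REGISTERS
    for dest, op, a, b in program:
        bank = registers + list(inputs)
        registers[dest] = OPS[op](bank[a], bank[b])
    return registers[0]


def classify(program, inputs, threshold=0.5):
    """Map the numeric output to a binary class (assumed: 1 = severe)."""
    return 1 if run_program(program, inputs) >= threshold else 0


def error_count(program, rows, labels):
    """Number of misclassified rows."""
    return sum(classify(program, x) != y for x, y in zip(rows, labels))


def random_instruction(n_inputs, rng):
    bank_size = N_REGISTERS + n_inputs
    return (rng.randrange(N_REGISTERS), rng.choice(sorted(OPS)),
            rng.randrange(bank_size), rng.randrange(bank_size))


def evolve(rows, labels, length=4, patience=200, seed=0):
    """Mutation-only hill climb (a stand-in for the evolutionary search).

    Stops after `patience` consecutive mutations without a decrease
    in misclassification, or when the error reaches zero.
    """
    rng = random.Random(seed)
    n_inputs = len(rows[0])
    best = [random_instruction(n_inputs, rng) for _ in range(length)]
    best_err = error_count(best, rows, labels)
    stale = 0
    while stale < patience and best_err > 0:
        child = list(best)
        child[rng.randrange(length)] = random_instruction(n_inputs, rng)
        err = error_count(child, rows, labels)
        if err < best_err:
            best, best_err, stale = child, err, 0
        else:
            stale += 1
    return best, best_err
```

A hand-written one-line program such as `[(0, "+", 2, 3)]` adds the two inputs into register 0, so `classify` returns 1 whenever their sum reaches the threshold. Reading an evolved program in the same way, one instruction at a time, is what yields the classification rule.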
The following section deals with the details of the data preparation, which includes the creation of the dummy variables. The approach to modeling and the setup of the dependent variable are also discussed there. The section after that deals with the modeling methodology, i.e. it explains GP in more detail and describes the setup of the classification problem. The disadvantages of GA that led to the development of GP are also discussed in that section. The results and analysis section focuses primarily on the significant variables and their relationships, as discovered by the best evolved programs, and on their interpretation as relevant to roadway safety.
Section snippets
Study area and available data
The crash data available were from the Crash Analysis and Reporting (CAR) system of the Florida Department of Transportation (FDOT). The Roadway Characteristics and Inventory (RCI) data were also made available through FDOT. The data used are for the years 2004 through 2006 for all the state roads of Florida. The datasets contain information on traffic, roadway geometry, and traffic crashes. The datasets were merged and the parameters were modified to suit the genetic programming methodology
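The introduction notes that data preparation includes the creation of dummy variables. As a rough illustration of that step only, the sketch below turns one categorical field into 0/1 indicator columns. The record layout, the field name `median_type`, and its levels are invented for this example and are not taken from the CAR or RCI datasets.

```python
def make_dummies(records, column):
    """Replace a categorical column with 0/1 indicator columns, one per level."""
    levels = sorted({rec[column] for rec in records})
    encoded = []
    for rec in records:
        # Copy every field except the categorical one being encoded.
        row = {k: v for k, v in rec.items() if k != column}
        for level in levels:
            row[f"{column}_{level}"] = 1 if rec[column] == level else 0
        encoded.append(row)
    return encoded


# Hypothetical crash records; field names and levels are illustrative only.
crashes = [
    {"crash_id": 1, "median_type": "raised"},
    {"crash_id": 2, "median_type": "painted"},
    {"crash_id": 3, "median_type": "raised"},
]
encoded = make_dummies(crashes, "median_type")
```

Each record then carries one indicator per level, so a numeric method can use the category without imposing an artificial ordering on it.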
Problems in genetic algorithm
GP, which is a class of evolutionary algorithms, has its roots in GA. A GA moves from one population to a new population through the process of evolution. For a detailed review of conventional GA, readers are directed to the classical works by Holland (1975) and Goldberg (1989). For more advanced readers: in GA the representation is typically of fixed length ‘l’, with an alphabet of size ‘k’. In the search space of a fixed length
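As a small aside on the fixed-length representation mentioned above: a string of length l over an alphabet of size k can take k^l distinct values, which is the size of the space a GA searches. The sketch below checks this by brute-force enumeration; the function names are hypothetical.

```python
from itertools import product


def search_space_size(alphabet_size, length):
    """Number of distinct fixed-length strings: k ** l."""
    return alphabet_size ** length


def enumerate_strings(alphabet, length):
    """List every fixed-length string over the alphabet (small cases only)."""
    return ["".join(chars) for chars in product(alphabet, repeat=length)]


# Binary alphabet (k = 2), length l = 3.
all_strings = enumerate_strings("01", 3)
```

The count grows exponentially with l, which is why enumeration is only feasible for very short strings and a heuristic search is used instead.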
Analysis and results
Each of the best programs chosen for the analysis at hand is a set of effective instructions which lead to the final classification rule. Typically, for the classification problem, the “Class 1 Hit Rate”, “Class 0 Hit Rate” and the “Weighted Hit Rate (WHR)” are provided for each of the best programs. Once the criterion is chosen, the set of effective instructions (after the removal of introns) forms the classification rule for that particular program. In the present study the WHR has been used as
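To make the hit-rate terminology concrete, the sketch below computes per-class hit rates as the fraction of cases in each class that are classified correctly. The weighted combination is an assumption: this excerpt does not define how the WHR is computed, so a simple convex combination with a hypothetical `weight_class1` parameter is used purely for illustration.

```python
def hit_rates(actual, predicted, weight_class1=0.5):
    """Return (class 1 hit rate, class 0 hit rate, weighted hit rate).

    The weighting scheme is an illustrative assumption, not the
    definition used by the software or the study.
    """
    n1 = sum(1 for a in actual if a == 1)
    n0 = len(actual) - n1
    if n1 == 0 or n0 == 0:
        raise ValueError("both classes must be present")
    hits1 = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    hits0 = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    class1 = hits1 / n1
    class0 = hits0 / n0
    weighted = weight_class1 * class1 + (1 - weight_class1) * class0
    return class1, class0, weighted


# Hypothetical labels: three class 1 cases and two class 0 cases.
c1, c0, whr = hit_rates([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
```

Reporting the two class-specific rates separately matters when one class is much rarer than the other, since overall accuracy can then hide poor performance on the rare class.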
Conclusions
As stressed earlier in the paper, classification is critical to our understanding of the variables of significance and their contribution to the safety problem at hand. In the present study the authors have set up a classification problem for the injury as well as severity of crashes. Typically in a classification problem the algorithm develops a set of rules which when followed leads to a particular category of the target variable. For example, in crash severity analysis when the binary target
References (32)
- et al. Exploring the overall and specific crash severity levels at signalized intersections. Accident Analysis and Prevention (2005)
- et al. Passenger car collision fatalities—with special emphasis on collisions with heavy vehicles. Accident Analysis and Prevention (2008)
- et al. Traffic signal timing optimisation based on genetic algorithm approach, including drivers’ routing. Transportation Research Part B: Methodological (2004)
- et al. Underreporting in traffic accident data, bias in parameters and the structure of injury severity models. Accident Analysis and Prevention (2008)
- et al. Comprehensive analysis of relationship between real-time traffic surveillance data and rear-end crashes on freeways. Transportation Research Record (2006)
- et al. Crash estimation at signalized intersections along corridors: analyzing spatial effect and identifying significant factors. Transportation Research Record (2006)
- Aspects of road design and trucks from the analysis of crashes
- et al. Linear Genetic Programming (2007)
- et al. Classification and Regression Trees (1984)
- et al. Prediction of PCDDs/PCDFs emissions from municipal incinerators by genetic programming and neural networking modeling. Waste Management and Research (2000)
- Urban arterial crash characteristics related with proximity to intersections and injury severity. Transportation Research Record
- The influence of reduced friction on head injury metrics in helmeted head impacts. Traffic Injury Prevention
- Effects of rural highway median treatments and access. Transportation Research Record
- Genetic Algorithms in Search, Optimization and Machine Learning
- Adaptation in Natural and Artificial Systems