A genetic programming approach to explore the crash severity on multi-lane roads

https://doi.org/10.1016/j.aap.2009.09.021

Abstract

This study aims to understand the relationship of geometric and environmental factors with injury-related crashes as well as with severe crashes through the development of classification models. The Linear Genetic Programming (LGP) method is used to achieve these objectives. LGP is based on the traditional genetic algorithm, except that it evolves computer programs. The methodology differs from traditional non-parametric methods such as classification and regression trees, which develop only one model, with fixed criteria, for any given dataset. LGP, on the other hand, not only evolves numerous models through the concept of biological evolution, using the evolutionary operators of crossover and mutation, but also allows the investigator to choose the best models, developed over various runs, based on classification rates. Discipulus™ software was used to evolve the models. The results show that vision obstruction is a leading factor for severe crashes. A percentage of trucks, even if small, makes crashes more likely to involve injury. ‘Lawn and curb’ medians are found to be safe for angle/turning movement crashes. Dry surface conditions and good pavement conditions decrease the severity of crashes, as do wider shoulders and sidewalks. Interaction terms among variables, such as on-street parking with a higher posted speed limit, have been found to make injuries more probable.

Introduction

Crashes on high-speed (speed limit greater than 45 mph), multi-lane arterial corridors (more than one lane in each direction of travel) with partially limited access account for a significant proportion of traffic fatalities. In the state of Florida, crashes on high-speed (speed limit equal to or greater than 45 mph), multi-lane arterial corridors with partially limited access account for 45.36% (NHTSAT, 2007) of the total number of fatalities related to speeding. Changing traffic conditions and environmental settings make highway safety and traffic operations a perennial field of concern. Numerous state-of-the-art methods to improve roadway safety are available to the practicing engineer; hence the challenge is not only to identify which methods suit best but also to explore what new insight could be added to the existing body of knowledge.

In the field of transportation safety, it is important not only to identify the contributing factors but also to understand their contribution to the problem at hand. Innovative methodologies are being adopted to understand this contribution and to better assess the safety situation. Since the data used in this study are observational (i.e. collected outside the purview of a designed experiment), an information discovery approach has to be adopted. Pande and Abdel-Aty (2008), in their work on association rules, point out that data mining techniques remain underutilized for the analysis of crashes. The underutilization is especially noteworthy since most studies use observational data collected outside the purview of an experimental design. Apart from using a new methodology, the authors have also approached the roadway elements in a more unified way. In this study the corridors have been treated in their entirety, i.e. with the segments and intersections put together. They have then been clustered into four groups based on corridor length. Abdel-Aty and Wang (2006) have shown a spatial correlation between crash patterns of successive signalized intersections, which may be attributed to the characteristics of the segments joining them. In Florida, all crashes occurring within 250 ft of the center of an intersection are categorized as intersection-related crashes. Recently, Das et al. (2008) showed that proximity alone is not the best way to assign crashes. Wang et al. (2008) used frequency modeling for crashes with fixed as well as varying influence distances and found different sets of significant factors. These recent studies justify treating the corridor as a whole rather than breaking it into segments and intersections.

In the present study the authors set up a classification problem for injury occurrence as well as the severity of crashes. In a typical classification problem the algorithm develops a set of rules which, when followed, leads to a particular category of the target variable. For example, in crash severity analysis, when the binary target variable represents severe/non-severe crashes, the classification rule developed will lead to either severe crashes or non-severe crashes. The variables that enter the rule are significant, and their directionality is critical for understanding the contribution of each variable in the analysis. Abdel-Aty and Pande (2006) used classification trees and neural networks to detect the relationship between real-time freeway traffic conditions and rear-end crashes. However, according to Deschaine and Francone (2004), genetic programming (GP) performs better than classification trees in terms of lower error rates and also outperforms neural networks in regression analysis. GP is a heuristic search technique that iteratively evolves better programs, which could either be the best solutions or lead to the best solutions. GP, an innovative evolutionary computation method, is based on genetic algorithms (GA). In GA, the optimum solution is reached by using well-established techniques of evolutionary biology. In a recent work by Makkeasorn et al. (2006) in the field of water resources management, soil moisture estimation models were developed using the Discipulus™ Linear Genetic Programming (LGP) software and were applied to soil moisture distribution analysis. The work shows that LGP, a type of GP, helps in the development of excellent nonlinear multivariate regression models. The work also compared the LGP model with linear regression and nonlinear regression models independently, and the LGP model was found to be the best for the data. The linear regression model overestimated the soil moisture while the nonlinear regression models tended to underestimate it. According to Chang and Chen (2000), the regression models generated by GP are also independent of any model structure. The use of GA in transportation is not new. GAs have been used widely in traffic signal system optimization and network optimization (Park et al., 2000, Ceylan and Bell, 2004, Teklu et al., 2007). The use of GA or GP in transportation safety studies is relatively new, and hence the authors intend to test the method and observe its potential. A set of roadway geometric variables was chosen to understand the classification of injury/non-injury crashes as well as severe/non-severe crashes. The authors also use Discipulus™ for the classification problem. Aspects of the software critical to the study at hand will be discussed in Section 3 of the paper.

The focus of this study is to evolve the best possible classification rules using the LGP methodology. The best program developed by LGP is essentially a set of line-by-line instructions. When the instructions are read from top to bottom, they lay out a classification rule. The authors regard the use of LGP to detect the classification rule as an improvement over existing methodologies. In LGP, the heuristic search for the best program proceeds by evolving numerous programs. The process terminates when no further improvement in classification, or decrease in misclassification, is observed. The details of the selection process are given in Section 3.
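To make the idea of a program as "line-by-line instructions" concrete, the following is a minimal sketch of how a linear program over registers can be read top to bottom to yield a binary class. It is not the Discipulus™ instruction set or any model evolved in the study: the instruction format, the operator set, the 0.5 threshold, and the two illustrative inputs (a vision-obstruction flag and a dry-surface flag) are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# One instruction: (destination register, operator, source register a, source register b).
# This 4-tuple layout is an assumption for illustration, not the Discipulus format.
Instruction = Tuple[int, str, int, int]

OPS: Dict[str, Callable[[float, float], float]] = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    # Protected division: return 1.0 on a zero divisor (a common GP convention, assumed here).
    "/": lambda a, b: a / b if b != 0 else 1.0,
}


def run_program(program: List[Instruction], inputs: Sequence[float], n_work: int = 2) -> float:
    """Execute the instructions top to bottom and return the value left in register 0."""
    # Working registers come first (initialised to 0), followed by the input values.
    regs = [0.0] * n_work + [float(x) for x in inputs]
    for dest, op, a, b in program:
        regs[dest] = OPS[op](regs[a], regs[b])
    return regs[0]


def classify(program: List[Instruction], inputs: Sequence[float], threshold: float = 0.5) -> int:
    """Map the program output to class 1 or class 0 using an assumed threshold."""
    return 1 if run_program(program, inputs) >= threshold else 0


# Hypothetical two-line program. Register 2 holds a vision-obstruction flag (0/1)
# and register 3 holds a dry-surface flag (0/1); both are illustrative inputs only.
PROGRAM: List[Instruction] = [
    (1, "-", 2, 3),  # r1 = obstruction - dry_surface
    (0, "+", 1, 2),  # r0 = r1 + obstruction
]

if __name__ == "__main__":
    for obstruction, dry in [(1, 0), (0, 1), (1, 1), (0, 0)]:
        label = classify(PROGRAM, [obstruction, dry])
        print(f"obstruction={obstruction} dry={dry} -> class {label}")
```

Because each instruction only reads and writes registers, the sign and position of a variable in the instruction list can be read off directly, which is what allows the directionality of a variable to be interpreted from an evolved program.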

The following section deals with the details of the data preparation, including the creation of the dummy variables. It also discusses the approach to modeling and the set-up of the dependent variable. The section after that deals with the modeling methodology, i.e. it explains GP in more detail and describes the set-up of the classification problem. The disadvantages of GA which led to the development of GP are also discussed in that section. The results and analysis section focuses primarily on the significant variables and their relationships, as discovered by the best evolved programs, and on their interpretation relevant to roadway safety.

Section snippets

Study area and available data

The crash data were obtained from the Crash Analysis and Reporting (CAR) system of the Florida Department of Transportation (FDOT). The Roadway Characteristics and Inventory (RCI) data were also made available through FDOT. The data used are for the years 2004 through 2006 for all the state roads of Florida. The datasets contain information on traffic, roadway geometry, and traffic crashes. The datasets were merged and the parameters were modified to suit the genetic programming methodology

Problems in genetic algorithm

GP, which is a class of evolutionary algorithms, has its roots in the GA. GA is a method of moving from one population to a new population through the process of evolution. For a detailed review of conventional GA, readers are directed to the classical works by Holland (1975) and Goldberg (1989). For more advanced readers: in GA the representation is typically a fixed-length representation of length ‘l’ over an alphabet of size ‘k’. In the search space of a fixed length
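As a small illustration of the fixed-length representation described above, the sketch below encodes an individual as a list of ‘l’ symbols drawn from an alphabet of size ‘k’, so that there are k^l distinct individuals, and applies textbook one-point crossover and per-gene mutation. The function names, the choice of one-point crossover, and the mutation rate are assumptions for illustration and are not taken from the paper.

```python
import random
from typing import List, Tuple

Individual = List[int]  # each gene is an integer symbol in range(k)


def search_space_size(k: int, l: int) -> int:
    """Number of distinct fixed-length strings of length l over an alphabet of size k."""
    return k ** l


def random_individual(k: int, l: int, rng: random.Random) -> Individual:
    """Draw a random string of length l with symbols from range(k)."""
    return [rng.randrange(k) for _ in range(l)]


def one_point_crossover(p1: Individual, p2: Individual,
                        rng: random.Random) -> Tuple[Individual, Individual]:
    """Swap the tails of two equal-length parents at a random cut point."""
    if len(p1) != len(p2) or len(p1) < 2:
        raise ValueError("parents must have the same length of at least 2")
    point = rng.randrange(1, len(p1))  # cut strictly inside the string
    return p1[:point] + p2[point:], p2[:point] + p1[point:]


def mutate(ind: Individual, k: int, rate: float, rng: random.Random) -> Individual:
    """Resample each gene from the alphabet with probability `rate`."""
    return [rng.randrange(k) if rng.random() < rate else gene for gene in ind]


if __name__ == "__main__":
    rng = random.Random(0)
    k, l = 2, 10
    a, b = random_individual(k, l, rng), random_individual(k, l, rng)
    c, d = one_point_crossover(a, b, rng)
    print("search space size:", search_space_size(k, l))
    print("parents :", a, b)
    print("children:", c, d)
    print("mutated :", mutate(c, k, 0.1, rng))
```

Note that both operators preserve the string length, so every offspring stays inside the same fixed-length search space.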

Analysis and results

Each of the best programs chosen for the analysis at hand is a set of effective instructions which leads to the final classification rule. Typically, for the classification problem, the “Class 1 Hit Rate”, “Class 0 Hit Rate” and the “Weighted Hit Rate (WHR)” for each of the best programs are provided. Once the criterion is chosen, the set of effective instructions (after the removal of introns) forms the classification rule for that particular program. In the present study the WHR has been used as
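The hit rates named above can be illustrated with a short sketch. The per-class hit rates are computed here as the fraction of observations of each class that the program classifies correctly. The snippet does not define the WHR, so the weighted combination below, with a user-supplied weight `w1` defaulting to an equal split, is purely an assumption and may differ from the definition used by Discipulus™ or by the authors.

```python
from typing import Sequence, Tuple


def hit_rates(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (class 1 hit rate, class 0 hit rate) for binary labels in {0, 1}."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    hits = {0: 0, 1: 0}
    totals = {0: 0, 1: 0}
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    if totals[0] == 0 or totals[1] == 0:
        raise ValueError("both classes must be present to compute hit rates")
    return hits[1] / totals[1], hits[0] / totals[0]


def weighted_hit_rate(class1_rate: float, class0_rate: float, w1: float = 0.5) -> float:
    """Assumed definition: convex combination of the two class hit rates."""
    if not 0.0 <= w1 <= 1.0:
        raise ValueError("w1 must lie in [0, 1]")
    return w1 * class1_rate + (1.0 - w1) * class0_rate


if __name__ == "__main__":
    # Made-up labels for illustration only (1 = severe, 0 = non-severe).
    y_true = [1, 1, 1, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 0, 0, 0, 1, 1]
    c1, c0 = hit_rates(y_true, y_pred)
    print(f"class 1 hit rate: {c1:.3f}")
    print(f"class 0 hit rate: {c0:.3f}")
    print(f"weighted hit rate (w1=0.5): {weighted_hit_rate(c1, c0):.3f}")
```

Reporting the two class rates separately matters when one class is rare, since a program could otherwise score well overall while missing most of the rarer class.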

Conclusions

As stressed earlier in the paper, classification is critical to our understanding of the variables of significance and their contribution to the safety problem at hand. In the present study the authors have set up a classification problem for the injury as well as severity of crashes. Typically in a classification problem the algorithm develops a set of rules which when followed leads to a particular category of the target variable. For example, in crash severity analysis when the binary target

References (32)

  • A. Das et al. Urban arterial crash characteristics related with proximity to intersections and injury severity. Transportation Research Record (2008)
  • Deschaine, L.M., Francone, F.D., 2004. White paper: comparison of Discipulus (Linear Genetic Programming software with...
  • J.D. Finan et al. The influence of reduced friction on head injury metrics in helmeted head impacts. Traffic Injury Prevention (2008)
  • J.L. Gettis et al. Effects of rural highway median treatments and access. Transportation Research Record (2005)
  • D.E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning (1989)
  • J.H. Holland. Adaptation in Natural and Artificial Systems (1975)