Article

Estimation of COVID-19 Epidemiology Curve of the United States Using Genetic Programming Algorithm

1
Faculty of Engineering, University of Rijeka, Vukovarska 58, 51000 Rijeka, Croatia
2
Clinical Hospital Centre, Rijeka, Krešimirova ul. 42, 51000 Rijeka, Croatia
3
Faculty of Engineering, University of Kragujevac, Sestre Janjić, 34000 Kragujevac, Serbia
4
Bioengineering Research and Development Centre (BioIRC), Prvoslava Stojanovića 6, 34000 Kragujevac, Serbia
5
Faculty of Medicine, University of Rijeka, Ul. Braće Branchetta 20/1, 51000, Rijeka, Croatia
6
Faculty of Dental Medicine, University of Rijeka, Kresimirova 40/42, 51000 Rijeka, Croatia
*
Author to whom correspondence should be addressed.
Int. J. Environ. Res. Public Health 2021, 18(3), 959; https://doi.org/10.3390/ijerph18030959
Submission received: 14 December 2020 / Revised: 19 January 2021 / Accepted: 20 January 2021 / Published: 22 January 2021

Abstract

Estimation of the epidemiology curve for the COVID-19 pandemic can be a very computationally challenging task. Thus far, there have been some implementations of artificial intelligence (AI) methods applied to develop an epidemiology curve for a specific country. However, most applied AI methods generated models that are almost impossible to translate into a mathematical equation. In this paper, the AI method called the genetic programming (GP) algorithm is utilized to develop a symbolic expression (mathematical equation) which can be used for the estimation of the epidemiology curve for the entire U.S. with high accuracy. The GP algorithm is applied to the publicly available dataset that contains the number of confirmed, deceased and recovered patients for each U.S. state to obtain the symbolic expression for the estimation of the number of patients in each of these groups. The dataset consists of the latitude and longitude of the central location for each state and the number of patients in each of the goal groups for each day in the period of 22 January 2020–3 December 2020. The obtained symbolic expressions for each state are summed up to obtain symbolic expressions for estimation of each of the patient groups (confirmed, deceased and recovered). These symbolic expressions are combined to obtain the symbolic expression for the estimation of the epidemiology curve for the entire U.S. The obtained symbolic expressions for the estimation of the number of confirmed, deceased and recovered patients for each state achieved $R^2$ scores in the ranges 0.9406–0.9992, 0.9404–0.9998 and 0.9797–0.99955, respectively. These equations are summed up to formulate symbolic expressions for the estimation of the number of confirmed, deceased and recovered patients for the entire U.S. with achieved $R^2$ scores of 0.9992, 0.9997 and 0.9996, respectively. Using these symbolic expressions, the equation for the estimation of the epidemiology curve for the entire U.S. is formulated, which achieved an $R^2$ score of 0.9933. The investigation showed that the GP algorithm can produce symbolic expressions for the estimation of the number of confirmed, recovered and deceased patients as well as the epidemiology curve not only for the states but for the entire U.S. with very high accuracy.

1. Introduction

According to [1], coronavirus disease 2019 (COVID-19) is a respiratory and vascular disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The outbreak of COVID-19 can be traced back to December 2019 in Wuhan, Hubei province (China), although Apolone et al. [2] stated that the virus may have been actively spreading much earlier in Italy. COVID-19 is mainly transmitted from an infected to an uninfected person through coughing, sneezing, singing, talking or breathing. A new infection [3,4] occurs when virus-containing particles (respiratory droplets or aerosols) get into the mouth, nose or eyes of uninfected people who are in the vicinity of the infected person.
The COVID-19 symptoms are variable, but generally include fever and cough [5,6,7]. However, people infected with COVID-19 may have different symptoms, and these symptoms may change over time. Some patients have a high fever, cough and fatigue, while others have a low fever at the beginning and develop difficulty breathing weeks later. The symptoms of COVID-19 [6,8] may be non-specific, such as fever and dry cough. The symptoms of COVID-19 can manifest 1–14 days after exposure to the virus [9]. The standard method of testing for COVID-19 is real-time reverse transcription-polymerase chain reaction (rRT-PCR) [10,11]. The rRT-PCR test is typically done using respiratory samples which are obtained using a nasopharyngeal swab. However, in some patients, a nasal swab or sputum sample may also be used.
Since the outbreak began, researchers from various fields have been extensively investigating this disease. Today, there are numerous research studies in which artificial intelligence (AI) was applied for the development of an epidemiological model of COVID-19. Most methods utilize machine learning (ML) algorithms on collected datasets. Most research focused on the application of neural networks, either applying regression methods on the datasets or analyzing the collected data in terms of time series, but other machine learning methods have been applied as well.
Several machine learning approaches have been used to model COVID-19 spread. One of the earliest papers [12] in this field utilized a Multi-Layer Perceptron (MLP) on a publicly available dataset [13] to estimate the number of confirmed, deceased and recovered patients on a global scale. In [14], the authors investigated the impact of COVID-19 on the price movement of crude oil and three U.S. stock indexes: DJI, S&P 500 and NASDAQ Composite. In this investigation, the system consists of the stationary wavelet transform (SWT) and bidirectional long short-term memory (BDLSTM) networks to predict the commodity and stock prices. In [15], the authors developed a modified stacked auto-encoder for modeling the transmission dynamics of the COVID-19 epidemic in China. The data for this investigation were collected from WHO for the period from 11 January 2020 to 27 February 2020. Using this model, the authors forecast the cumulative number of confirmed COVID-19 patients across China from 20 January 2020 to 20 April 2020. Using multiple-step forecasting, the estimated average errors of 6–10-day step forecasting were in the range from 0.73% to 2.27%. A combination of multiple machine learning approaches, including autoregressive integrated moving average (ARIMA), cubist regression (CUBIST), random forest (RF), ridge regression (RIDGE), support vector regression (SVR) and stacking-ensemble learning, has been used [16] for time series forecasting one, three and six days ahead in ten Brazilian states with a high daily incidence. The results show that the models can generate accurate forecasts, achieving errors in the ranges of 0.87–3.51%, 1.02–5.62% and 0.95–6.90% for one, three and six days ahead, respectively. The XGBoost classifier has been used [17] on 485 blood samples from infected patients in the region of Wuhan, China, to identify crucial predictive biomarkers of disease mortality. The utilized method predicted the mortality of individual patients more than 10 days in advance with an accuracy of 90%.
Deep learning with an LSTM network is used in [18] on the publicly available dataset provided by Johns Hopkins University and the Canadian health authority to forecast the COVID-19 outbreak in Canada. The results of the conducted investigation predict the possible ending point of this outbreak around June 2020. The hybrid wavelet-autoregressive integrated moving average model and regression tree are used in [19] to forecast the number of daily confirmed patients for Canada, France, India, South Korea and the UK.
In this paper, the AI method, GP algorithm, is utilized, since this algorithm offers a possibility of creating mathematical expression from the given data which provides the best correlation between input and output data. Over the years, GP has been implemented in various fields such as curve fitting, data modeling and symbolic regression [20,21,22,23]; image and signal processing [24,25,26,27]; financial trading, time series prediction and economic modeling [28,29,30,31]; and industrial process control [32,33,34,35]. However, GP has also been implemented in medicine-based tasks. In [36], the authors applied GP to oral cancer prognosis. The dataset used in GP contained only 31 patients with feature selection of smoking, drinking, tobacco chewing, histological differentiation of SCC and oncogene p63. In this analysis, the authors achieved average scores of 83.87% accuracy and AUC score of 0.8341 for the classification task. In [37], the authors used GP and ANN to compare the performance on six medical classification problems, and these are breast cancer (benign or malignant), diabetes (positive or negative), gene (intron–exon, exon–intron or no boundary in DNA sequence), heart (diameter of a heart vessel is reduced by more than 50% or not), horse (horse with colic will die, survive or must be killed) and thyroid (thyroid hyperfunction, hypofunction or normal function). The results show that GP performs comparably to ANNs in classification problems. The authors of [38] developed prediction models for confirmed patients (CC) and death patients (DC) across the three most affected states Maharashtra, Gujarat and Delhi as well as the whole of India based on GP. The results show that the proposed models are highly reliable for short time series prediction of COVID-19 patients in India.
As seen from the literature overview, the implementation of AI is usually based on ANN, while implementations of the GP algorithm are rare. The benefit of the GP algorithm over ANN is that training yields a symbolic expression, which can be used and manipulated in further analysis. An ANN, in general, provides a trained architecture that is almost impossible to transform into a mathematical expression due to the large number of interconnected neurons.
The global trend of COVID-19 has been on the exponential rise, with more outbreaks happening through time. This research focuses on the U.S. for two reasons. The first reason is the quality and quantity of the data available, which are highly precise in comparison to the data availability of other countries. As mentioned further in the paper, U.S. data are collected at many levels and the high number of tests performed allows data to be highly precise. This is true for the entire period of the data collection. The second reason is the extremely high number of cases exhibited in the U.S., with both recovery and death rates being relatively high, allowing for enough data to model these separate goals. After research of COVID-19, investigation of the application of various AI methods in COVID-19 spread and the literature overview of the GP algorithm, the following questions arise:
  • Is it possible to utilize a GP algorithm to obtain the symbolic expression for each U.S. state, based on the latitude and longitude of the central location of that state and the number of days since the outbreak began, for the estimation of the number of confirmed, deceased and recovered patients with high accuracy?
  • Based on the obtained symbolic expressions for the estimation of the number of confirmed, deceased and recovered patients for each U.S. state, is it possible to formulate the symbolic expressions for the estimation of the number of confirmed, deceased and recovered patients for the entire U.S. with high accuracy?
  • Is it possible to utilize the three symbolic expressions for the estimation of the number of confirmed, deceased and recovered patients for the entire U.S. to formulate a symbolic expression for the estimation of the epidemiology curve for the entire U.S. with high accuracy?

2. Materials and Methods

In this section, the publicly available dataset which was used in this paper is described. Then, the GP algorithm used to obtain the symbolic expressions for confirmed, deceased and recovered patients of each U.S. state is described. Based on these symbolic expressions, the methodology of determining the epidemiology curve for the entire U.S. is explained.

2.1. Dataset Description

As mentioned above, the publicly available dataset [13] was used to obtain symbolic expressions for confirmed, deceased and recovered patients using the GP algorithm. This dataset contains the number of confirmed, deceased and recovered patients for certain locations, for each day since the COVID-19 outbreak started. The locations in this dataset are defined with two parameters: latitude and longitude. Each state is defined with one location in terms of latitude and longitude. In total, 50 states were considered in these analyses. The federal district (District of Columbia) as well as inhabited territories such as American Samoa, Guam, Northern Mariana Islands, Puerto Rico and the U.S. Virgin Islands were omitted from this investigation due to lack of data. In this study, the period used was from 22 January 2020 to 3 December 2020. The dataset is divided into three groups: confirmed, deceased and recovered patients. The dataset for confirmed/deceased/recovered patients for each state consists of the central location (latitude and longitude) of the state and the number of patients for each day since the outbreak started. The geographical locations and number of confirmed/deceased/recovered patients for each state are shown in Figure 1.
As shown in Figure 1, the blue dots indicate the latitude and longitude used for each state as part of the input data to obtain the symbolic expressions for the estimation of confirmed/deceased/recovered patients. Figure 1 shows that the highest numbers of confirmed patients are in Texas, California and Florida, followed by Illinois and New York. The highest numbers of deceased patients are in North Carolina, Utah, Colorado, Georgia and New Mexico. The highest numbers of recovered patients are in the states that have the highest numbers of confirmed patients, namely Texas, California and Florida.
The initial form of the dataset was a time series from 22 January 2020 to 3 December 2020. For each state, the central location of that state is given (latitude and longitude) as well as the number of confirmed, deceased and recovered patients in the above-mentioned period. It should be noted that, in this investigation, only U.S. states were considered, while the federal district (District of Columbia) as well as inhabited territories (American Samoa, Guam, Northern Mariana Islands, Puerto Rico and the U.S. Virgin Islands) were omitted, because the number of confirmed/deceased/recovered patients is missing for some dates in these areas. The data for each state were transformed from the time-series form into regression form. Thus, for the central location of each state, there are 317 instances (one per day in the period) used to train and test the symbolic expression obtained by the GP algorithm. Each of the 317 instances of the regression data consisted of three input variables, namely the latitude and longitude of the state central location and the number of days since the outbreak began, while the output variable was the number of confirmed/deceased/recovered patients for a specific date. The last ten instances of the confirmed number of patients dataset for the state of Alabama are shown in Table 1.
As shown in Table 1, the latitude and longitude of the central location for Alabama, which represent the first two input variables, are constant, while the only changing input variable is the number of days since the outbreak began. The output variable in this case represents the number of confirmed patients for each day since the outbreak began. For each state and each patient group (confirmed/deceased/recovered), the 317 instances were randomly shuffled and divided into training and testing datasets in the ratio of 80:20. The training dataset (80%, or 254 instances) was used in GP to obtain the symbolic expression, while the remaining 20% (63 instances) were used to test the symbolic expression and calculate the $R^2$ score. In each GP execution, the entire dataset for the state was reshuffled before the split to prevent overfitting.
The estimation accuracy of each obtained symbolic expression, in terms of the $R^2$ score, was evaluated on both the training and the testing dataset. If the $R^2$ score was below 0.99 on both the training and the testing dataset, the execution was repeated by reshuffling the dataset and dividing it in the same ratio. This procedure was repeated up to 30 times; if the $R^2$ score in both training and testing was still below 0.99, the GP algorithm moved on to the dataset for the next state. After all symbolic expressions were obtained for the estimation of the number of confirmed/deceased/recovered cases, the obtained symbolic expressions were analyzed. For the states whose symbolic expressions did not reach the 0.99 $R^2$ score, the expression with the highest achieved $R^2$ score was chosen. This procedure was used to avoid overfitting.
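The transformation from time-series to regression form and the shuffled 80:20 split can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the zero-based day index, the coordinates and the placeholder counts are all assumptions.

```python
import numpy as np

def make_regression_data(lat, lon, daily_counts):
    """Turn one state's daily series into rows [latitude, longitude, day] -> count."""
    counts = np.asarray(daily_counts, dtype=float)
    days = np.arange(len(counts), dtype=float)  # assumption: day index starts at 0
    X = np.column_stack([np.full(len(counts), lat),
                         np.full(len(counts), lon),
                         days])
    return X, counts

def shuffle_split(X, y, train_fraction=0.8, seed=0):
    """Randomly shuffle the instances, then split them into train and test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    n_train = int(round(train_fraction * len(y)))
    train, test = order[:n_train], order[n_train:]
    return X[train], y[train], X[test], y[test]

# Illustrative coordinates and a placeholder series of 317 values (not real data).
X, y = make_regression_data(32.3, -86.9, np.arange(317))
X_train, y_train, X_test, y_test = shuffle_split(X, y)
print(X_train.shape, X_test.shape)  # (254, 3) (63, 3)
```

With 317 daily instances, rounding 80% gives the 254 training and 63 testing instances stated in the text.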
These input attributes were selected even though they are not directly relevant to the spread of COVID-19, in order to investigate whether these three input variables (latitude, longitude and day) could be used to obtain the symbolic expressions for the estimation of the number of confirmed/deceased/recovered patients for each state. Another reason these variables were chosen is that they were already provided in the dataset. The authors chose the U.S. for this investigation because the number of confirmed/deceased/recovered patients is well documented for each state. While the inclusion of other variables such as population density, climate, travel and migration would contribute to the process of obtaining symbolic expressions, we wanted to investigate whether it is possible to obtain symbolic expressions for the estimation of the number of confirmed, deceased and recovered cases for each state using only latitude, longitude and the number of days since the outbreak began.

2.2. Genetic Programming Algorithm

Genetic Programming (GP) algorithm can be described as the combination of machine learning (ML) methods and Evolutionary Algorithms (EA). As in the case of most supervised ML models, the dataset must be divided into training and testing portions. The training dataset is used in GP to obtain symbolic expression, which correlates input values with the output. The testing dataset is used to test the obtained symbolic expressions. This portion of the dataset is unseen by the symbolic expression and GP for that matter.
The benefit of utilizing the GP in comparison to other machine learning methods is the shape of models generated by it, formulated as equations that transform the set of inputs to the output goal. Because these equations utilize basic mathematical functions, they can, after only minor modifications, be utilized in any software supporting them. This is important in multi-discipline goals such as the one explored here because epidemiological or medical staff may not have the equipment and software necessary to utilize the models generated by a neural network. Even among the engineering staff, different versions of libraries and unfamiliarity with the programming language used in the model creation can cause issues during model interpretation. Simple, language-agnostic, mathematical equations can easily be shared and implemented in new or existing software.
The GP algorithm starts the execution by creating the initial population which is then propagated throughout the predefined number of generations. In each generation, the population members compete to become parents of the next generation usually using tournament selection. After the selection of the best population members using the aforementioned selection procedure, the genes between two or more population members are exchanged using crossover operation, or genes of population members are randomly selected and changed in mutation operation.
To obtain the best symbolic expressions for the estimation of confirmed/deceased/recovered patients for each state using the GP algorithm, the input–output variables and GP parameters must be defined. In the case of confirmed/deceased/recovered patients for each state, the input and output variable representation is shown in Table 2.
As shown in Table 2, the input variables in the GP algorithm for the development of the symbolic expression for the estimation of confirmed/deceased/recovered patients for each state are latitude ($X_0$), longitude ($X_1$) and the day ($X_2$) since the outbreak began. The output variable in the symbolic expression for the estimation of confirmed/deceased/recovered patients is the number of confirmed/deceased/recovered patients, respectively.
The initial set of solutions in the GP is named the initial population and is typically randomly generated. It can either be generated using the full-size tree (up to the maximal defined value), called a full method, or a grow method, which can stop before a maximal size is reached. The combination of the two, called ramped half-and-half, was used in the presented research. Nodes of the generated trees are randomly selected from the available ones in the geneset.
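A minimal sketch of ramped half-and-half initialization is given below, assuming trees are stored as nested tuples. The function set and the three input terminals follow the text; the omission of random constants, the stop probability of the grow method and all function names are simplifying assumptions, and GP libraries differ in these details.

```python
import random

# Function set named in the text (name -> number of arguments).
FUNCTIONS = {"add": 2, "sub": 2, "mul": 2, "div": 2, "sqrt": 1,
             "max": 2, "min": 2, "abs": 1, "log": 1}
TERMINALS = ["X0", "X1", "X2"]  # latitude, longitude, day; constants omitted here

def random_tree(depth, method, rng, stop_prob=0.3):
    """Build a tree as a nested tuple (function, child, ...) or a terminal string."""
    # "full" only stops at depth 0; "grow" may also stop early (assumed probability).
    if depth == 0 or (method == "grow" and rng.random() < stop_prob):
        return rng.choice(TERMINALS)
    name = rng.choice(list(FUNCTIONS))
    children = tuple(random_tree(depth - 1, method, rng, stop_prob)
                     for _ in range(FUNCTIONS[name]))
    return (name,) + children

def tree_depth(tree):
    """Depth of a tree; a lone terminal has depth 0."""
    if isinstance(tree, str):
        return 0
    return 1 + max(tree_depth(child) for child in tree[1:])

def ramped_half_and_half(pop_size, min_depth, max_depth, seed=0):
    """Cycle through the depth range, alternating full and grow on each pass."""
    rng = random.Random(seed)
    n_depths = max_depth - min_depth + 1
    population = []
    for i in range(pop_size):
        depth = min_depth + i % n_depths
        method = "full" if (i // n_depths) % 2 == 0 else "grow"
        population.append(random_tree(depth, method, rng))
    return population

population = ramped_half_and_half(pop_size=12, min_depth=2, max_depth=4)
print(population[0])
```

Mixing both methods over a range of depths gives an initial population with varied tree shapes and sizes.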
The continuity of the generated equations is guaranteed by the fact that GP uses modified versions of certain mathematical operations. For example, the square root $\sqrt{x}$ is implemented as the square root of the absolute value, $\sqrt{|x|}$, to avoid calculating square roots of negative values. Another common discontinuity is division by zero. In the case of division by zero, or by near-zero values, a protected division is used, which returns the value of 1.0 [39,40,41]. These modified operations vary between GP implementations and must be taken into account when the generated models are implemented.
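Anyone re-implementing a generated expression has to reproduce these protected operations. A minimal sketch is shown below; the near-zero threshold `EPS` is an assumed value, since the text notes that the exact definitions vary between GP implementations.

```python
import numpy as np

EPS = 1e-3  # assumed near-zero threshold; check the GP library actually used

def protected_sqrt(x):
    """Square root of the absolute value, so negative inputs are allowed."""
    return np.sqrt(np.abs(x))

def protected_div(a, b):
    """Ordinary division, except that a (near-)zero denominator yields 1.0."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    ok = np.abs(b) > EPS
    safe_b = np.where(ok, b, 1.0)  # avoids dividing by zero inside np.where
    return np.where(ok, a / safe_b, 1.0)

print(protected_sqrt(-9.0))     # 3.0
print(protected_div(4.0, 2.0))  # 2.0
print(protected_div(4.0, 0.0))  # 1.0
```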

Fitness Function

After population initialization and in each generation, the population members must be evaluated to determine how well they perform before performing the selection. This task is achieved with fitness measure which is a primary mechanism for giving a high-level statement of the problem’s requirements to the GP system.
The GP syntax trees are interpreted by executing the nodes in the tree in an order that guarantees that no node is executed before the values of its arguments are known. This is achieved by traversing the tree recursively starting from the root node, and evaluating each node after the values of its children are known. In this paper, the mean absolute error (MAE) is utilized as the fitness function for the evaluation of population members, which can be written in the following form:
$$\mathrm{MAE} = \frac{\sum_{i=1}^{n} |y_i - x_i|}{n},$$
where $y_i$, $x_i$ and $n$ represent the prediction, true value and number of instances, respectively. After each GP algorithm execution, the symbolic expression is obtained on the training portion of the dataset. This symbolic expression is then evaluated on the testing portion of the dataset with the coefficient of determination ($R^2$). The $R^2$ metric of each symbolic expression is calculated using the equation, which can be written in the following form:
$$R^2 = 1 - \frac{S_{RESIDUAL}}{S_{TOTAL}} = 1 - \frac{\sum_{i=0}^{m} (y_i - \hat{y}_i)^2}{\sum_{i=0}^{m} \left( y_i - \frac{1}{m} \sum_{i=0}^{m} y_i \right)^2}.$$
The $R^2$ metric compares two sets of values: the real data $y$ and the data obtained by the model $\hat{y}$. It measures the proportion of the variance in the real data $y$ that is explained by the model output $\hat{y}$. The $R^2$ score typically ranges from 0 to 1, where 1 indicates that the model output matches the real data exactly, while 0 indicates that none of the variance in the real data is explained by the model.
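Both metrics are short to compute. The sketch below follows the two definitions above, using the ordinary sample mean in the denominator of $R^2$; the function names and the toy values are illustrative only.

```python
import numpy as np

def mean_absolute_error(y_pred, y_true):
    """MAE: mean of the absolute differences between prediction and true value."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return float(np.mean(np.abs(y_pred - y_true)))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - S_RESIDUAL / S_TOTAL."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    s_residual = np.sum((y_true - y_pred) ** 2)
    s_total = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - s_residual / s_total)

y_true = [1.0, 2.0, 3.0, 4.0]
print(mean_absolute_error([2.0, 2.0, 3.0, 5.0], y_true))  # 0.5
print(r2_score(y_true, y_true))                           # 1.0
print(r2_score(y_true, [2.5, 2.5, 2.5, 2.5]))             # 0.0
```

The last call shows the zero case: a model that always predicts the mean of the real data explains none of its variance.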
The improvement of solutions is achieved using evolutionary computation operations:
  • Crossover takes two selected solutions and combines them into a new child solution (influenced by the crossover coefficient).
  • Mutation randomly modifies an existing solution from the previous generation and copies it to the current generation (influenced by the subtree, hoist and point mutation coefficients).
  • Reproduction copies solutions from the previous to the current generation without modification (influenced by the maximal samples coefficient).
The solutions to be used in the above operations are determined using tournament selection, in which a randomly drawn subset of the population competes and the member with the best fitness is selected. The process of selection and modification using EC operations is performed until the termination criteria are reached, the termination criteria being either the selected number of generations being reached or the fitness value falling below a pre-specified threshold. The final hyperparameter of the GP which needs to be described is the parsimony coefficient. Solutions tend to grow through the generations. Sometimes, this results in a better solution, but, in other cases, the growth generates larger equations without significant gains in the fitness function. The larger solutions are of higher computational complexity, slowing both the training process and the later use of the generated models. To combat this, the fitness of a solution may be penalized depending on its size. The strength of this penalty is set by the parsimony coefficient; larger values penalize large solutions more.
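Tournament selection combined with a size penalty can be sketched as follows. The additive form of the penalty (error plus parsimony coefficient times expression size) is an assumption about how the penalty is applied, and the tuple layout and names are hypothetical.

```python
import random

def penalized_fitness(mae, size, parsimony):
    """Lower is better: raw error plus a penalty proportional to expression size."""
    return mae + parsimony * size

def tournament_select(population, tournament_size, parsimony, rng):
    """Draw random contenders and return the one with the best penalized fitness.

    population: list of (mae, size, expression) tuples.
    """
    contenders = rng.sample(population, tournament_size)
    return min(contenders,
               key=lambda ind: penalized_fitness(ind[0], ind[1], parsimony))

population = [(10.0, 5, "small"), (9.5, 200, "bloated"), (12.0, 3, "tiny")]
rng = random.Random(0)
print(tournament_select(population, 3, parsimony=0.0, rng=rng)[2])   # bloated
print(tournament_select(population, 3, parsimony=0.01, rng=rng)[2])  # small
```

Without a penalty the largest expression wins on raw error alone; with a parsimony coefficient of 0.01 its size costs more than its small accuracy advantage.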
The utilized values of hyperparameters are given in Table 3. The hyperparameters are randomly selected from the given ranges and training is performed. If the desired $R^2$ value is not reached once the GP training is complete, new hyperparameters are randomly selected and the process is repeated. In addition to hyperparameter values, GP uses a function set of mathematical functions to insert into symbolic expressions: addition, subtraction, multiplication, division, square root, maximum and minimum of values, absolute value and natural logarithm.
As the table shows, not all of the goals utilize the same ranges. Initial hyperparameter ranges for all goals were equal. During the research, certain hyperparameter values for certain goals were increased to obtain higher quality solutions, in the case that the smaller range did not provide the necessary performance. Initially, smaller ranges were preferred, due to shorter training times, but were increased if the models did not achieve the required error.
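The random selection of hyperparameters with retries can be sketched as a simple loop. The ranges below are hypothetical placeholders rather than the values of Table 3, `train_and_score` stands in for one complete GP run that returns a single $R^2$ score, and the limit of 30 attempts and the 0.99 target are taken from the dataset description; the paper checks both training and testing scores, which this sketch collapses into one number.

```python
import random

def sample_params(ranges, rng):
    """Draw one value per hyperparameter: integer bounds give ints, others floats."""
    params = {}
    for name, (low, high) in ranges.items():
        if isinstance(low, int) and isinstance(high, int):
            params[name] = rng.randint(low, high)
        else:
            params[name] = rng.uniform(low, high)
    return params

def random_search(train_and_score, ranges, target_r2=0.99, max_attempts=30, seed=0):
    """Retry with fresh random hyperparameters until the target R^2 is reached."""
    rng = random.Random(seed)
    best = None
    for attempt in range(1, max_attempts + 1):
        params = sample_params(ranges, rng)
        r2 = train_and_score(params)
        if best is None or r2 > best["r2"]:
            best = {"r2": r2, "params": params, "attempt": attempt}
        if r2 >= target_r2:
            break
    return best  # the highest-scoring attempt if the target was never reached

# Hypothetical ranges, for illustration only.
RANGES = {"population_size": (100, 1000), "generations": (100, 200),
          "crossover": (0.9, 0.95), "parsimony": (1e-5, 1e-3)}

fake_scores = iter([0.5, 0.995, 0.9])  # stand-in for real GP runs
print(random_search(lambda params: next(fake_scores), RANGES)["attempt"])  # 2
```

Keeping the best attempt mirrors the fallback described for states that never reached the target score.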
In this investigation, the computer hardware used consisted of an Intel Core i5-4570 CPU with a base clock of 3.20 GHz and 8 GB of DDR3 RAM. The time required to obtain the symbolic expression for the estimation of the number of confirmed/deceased/recovered patients for each state was approximately 4/3/5 min, respectively. If each GP algorithm execution had produced a symbolic expression with a high $R^2$ score, the total number of GP algorithm executions would have been 150, i.e., 150 symbolic expressions, and the total (ideal) time to obtain them would have been 600 min (10 h). However, this was not the case, since some symbolic expressions required more than 10 GP algorithm executions to reach high accuracy. Thus, obtaining all 150 symbolic expressions (for each state, three symbolic expressions for the estimation of confirmed, deceased and recovered patients) took approximately two working days.
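The ideal-case figures follow directly from the number of states, the three patient groups and the approximate time per expression:

```latex
\begin{align*}
  50 \text{ states} \times 3 \text{ groups} &= 150 \text{ symbolic expressions}, \\
  50 \times (4 + 3 + 5)\ \text{min} &= 600\ \text{min} = 10\ \text{h}.
\end{align*}
```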

2.3. Epidemiology Curve

To define the epidemiology curve for a specific area, the symbolic expressions for the estimated number of confirmed, deceased and recovered patients must first be obtained. After these symbolic expressions are obtained, the epidemiology curve can be calculated using the following expression:
$$y = y_{confirmed} - y_{deceased} - y_{recovered},$$
where $y_{confirmed}$ represents the total estimated number of confirmed patients, $y_{deceased}$ represents the total estimated number of deceased patients and $y_{recovered}$ represents the total estimated number of recovered patients. Thus, the estimated epidemiology trend for a specific country can be calculated by subtracting the deceased and recovered patients from the confirmed patients. In this case, the total estimated number of confirmed patients is calculated as the sum of all symbolic expressions obtained for the 50 U.S. states and is written in the following form:
$$y_{confirmed} = \sum_{i=1}^{N} y_{C_i}, \quad N = 50,$$
where $y_{C_i}$ represents the estimated number of confirmed patients for state $i$. The total estimated number of deceased patients for the entire U.S. is obtained in the same way using the following expression:
$$y_{deceased} = \sum_{i=1}^{N} y_{D_i}, \quad N = 50,$$
where $y_{D_i}$ represents the estimated number of deceased patients for state $i$, which is obtained using the symbolic expression for that state. The same procedure is used to obtain the total estimated number of recovered patients in the entire U.S. using the following mathematical expression:
$$y_{recovered} = \sum_{i=1}^{N} y_{R_i}, \quad N = 50,$$
where $y_{R_i}$ represents the estimated number of recovered patients for state $i$.
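The aggregation can be sketched in a few lines. Because latitude and longitude are constant for a given state, each state expression is treated here as a function of the day only; the two toy states and their linear formulas are invented stand-ins, not expressions from the paper.

```python
import numpy as np

GROUPS = ("confirmed", "deceased", "recovered")

def national_estimates(state_models, days):
    """Sum the per-state estimates for each group, then form the epidemiology curve.

    state_models: one dict per state mapping each group name to a function of day.
    """
    days = np.asarray(days, dtype=float)
    totals = {group: sum(model[group](days) for model in state_models)
              for group in GROUPS}
    totals["curve"] = totals["confirmed"] - totals["deceased"] - totals["recovered"]
    return totals

# Two invented states with simple linear "expressions" (illustration only).
toy_states = [
    {"confirmed": lambda d: 10.0 * d, "deceased": lambda d: 1.0 * d,
     "recovered": lambda d: 4.0 * d},
    {"confirmed": lambda d: 5.0 * d, "deceased": lambda d: 0.5 * d,
     "recovered": lambda d: 2.0 * d},
]
result = national_estimates(toy_states, [0, 1, 2])
print(result["curve"])  # [ 0.   7.5 15. ]
```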

3. Results and Discussion

In this section, the results are presented and discussed. First, the symbolic expressions for the numbers of confirmed, deceased and recovered patients are presented, followed by the symbolic expression for the estimation of the epidemiology curve for the entire U.S.

3.1. Symbolic Expression for Estimation of the Number of Confirmed Patients for Each State and the Entire U.S.

The procedure of obtaining the symbolic expression for the estimation of confirmed patients for each state was performed on the dataset [13], which was split into training and testing portions in the ratio 80:20. This means that 80% of the dataset was used for training, or in other words for obtaining the symbolic expressions, while 20% of the dataset was used to obtain the $R^2$ score using Equation (2). In each iteration of the GP algorithm, the GP parameters were randomly selected from the pre-specified range given in Table 3. The majority of the obtained symbolic expressions used for the estimation of confirmed patients for each state are too large to be presented in this paper. Instead, examples of obtained symbolic expressions are shown for the estimation of confirmed patients for Maryland and Virginia, which achieved accuracy on the testing dataset of 0.99334 and 0.9915, respectively.
y C o n f M a r y l a n d = X 0 X 2 | X 2 | min ( X 0 X 2 | X 2 | , | X 1 | ( min ( X 2 2 14547.7 , X 2 | X 1 | , | X 1 | 2 , ( X 2 2 14547.7 ) log ( X 2 | X 2 | 14547.7 ) ) + X 2 | X 2 | 14547.7 ) 1 2 )
y C o n f V i r g i n i a = X 2 max ( ( X 2 X 1 X 0 ) max ( 59018.9 , 0.000021 max ( 1 , X 0 ) X 2 min ( X 0 , X 2 ) max ( 66744.2 X 0 , X 1 X 0 , X 2 ) ) , min ( X 0 , X 1 X 0 + X 1 + X 2 ) ) ( X 2 min ( X 2 , max ( X 2 max ( X 2 , min ( 59018.9 , X 1 X 0 , X 2 , min ( X 0 , X 2 ) ) ) , min ( max ( X 1 X 0 , X 1 ( X 2 X 0 log ( X 2 ) ) max ( 0.000042 X 0 , X 2 X 1 ) ) , X 2 max ( X 2 , min ( 59018.9 , X 2 ) ) + X 0 ) ) ) ) 1 2
These two symbolic expressions of confirmed patients for Maryland and Virginia were chosen due to their simplicity. In these equations, $X_0$ represents latitude, $X_1$ represents longitude and $X_2$ is the number of days elapsed since the start of the dataset. The GP parameters used for obtaining the symbolic expressions of confirmed patients for each state, together with the achieved results, are given in Appendix A (Table A1).
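The 80:20 evaluation described above can be sketched as follows. This is a minimal illustration under two assumptions that the text does not confirm: that the split is a random shuffle, and that Equation (2) is the usual coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$. The function names are illustrative.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold out a fraction of them for testing."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy data: (day, patient count) pairs for 317 days; not taken from the dataset.
rows = [(day, 3.0 * day) for day in range(317)]
train, test = train_test_split(rows)
print(len(train), len(test))  # prints 254 63
```

A fitted expression would then be scored by calling `r2_score` on the held-out counts and the corresponding predictions.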
As shown in Table A1, the population size tended towards the lower values in the range, with most of the solutions using population sizes smaller than 1000. Most solutions used more than 150 generations, up to a maximum of 194 out of the 200 available. For the tree depth, the values for all state models were around the middle of the available range. The crossover coefficient tended towards the lower side of the range, with some of the solutions even using the minimum available value of 0.9; the mutation coefficients likewise tended towards the lower side of their ranges. Constant ranges are large, indicating the lack of information contained within the dataset inputs. Parsimony coefficients tend towards the higher side of the available range, indicating that the models ran into bloating issues that needed to be curbed. All achieved $R^2$ scores are above 0.9, the lowest being 0.94 for Vermont. The distribution of the achieved $R^2$ scores across states is shown in Figure 2a.
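The random selection of GP parameters can be pictured as a plain random search. The sketch below is illustrative only. Apart from the crossover minimum of 0.9 and the generation maximum of 200 mentioned above, the ranges are hypothetical placeholders for those in Table 3. The `evaluate` callback is also hypothetical and stands for "train a GP with these parameters and return the test $R^2$".

```python
import random

# Hypothetical ranges; the ranges actually used are those listed in Table 3.
PARAM_RANGES = {
    "population_size": (100, 2000),   # integer range (assumed)
    "generations": (100, 200),        # integer range, upper bound from the text
    "crossover": (0.9, 0.95),         # lower bound from the text
    "parsimony": (1e-5, 1e-3),        # float range (assumed)
}


def draw_parameters(rng):
    """Draw one parameter set uniformly from PARAM_RANGES."""
    params = {}
    for name, (low, high) in PARAM_RANGES.items():
        if isinstance(low, int) and isinstance(high, int):
            params[name] = rng.randint(low, high)
        else:
            params[name] = rng.uniform(low, high)
    return params


def random_search(evaluate, n_iter, seed=0):
    """Keep the parameter set with the highest score returned by evaluate."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = draw_parameters(rng)
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

With a real GP implementation, `evaluate` would fit on the training portion and score on the testing portion, so the search retains the parameter set that generalizes best.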
As shown in Figure 2a, the obtained symbolic expressions for estimation of confirmed patients for each state have very high accuracy. The highest accuracy (higher than 0.999) was achieved with the symbolic expressions for New York, North Dakota, Oregon, South Carolina, Virginia, West Virginia and Wisconsin. To compare the accuracy of the obtained symbolic expressions for confirmed patients with the real data, all symbolic expressions for estimation of confirmed patients for each state are summed up using Equation (4). The accuracy of the summed symbolic expression for the estimation of confirmed patients for the entire U.S. is calculated using Equation (2). In Figure 2b, the summed symbolic expression for estimation of confirmed patients is compared to the real data for the entire U.S.
As indicated in Figure 2b, the symbolic expression for the estimation of the number of confirmed patients achieved an $R^2$ score of 0.9992. This score is graphically validated by comparison to the real data. The increase in the number of confirmed cases in the observed period can be best described through the political and social events that occurred in that period. In Figure 2b, it can be noticed that in the first 60 days since the outbreak started there is almost negligible growth in the number of confirmed patients when compared to the interval from 60 to 317 days. According to MD [42], the virus had been circulating undetected at least since January 2020, and possibly as early as November 2019. The first case of COVID-19 in the U.S. was reported on 21 January 2020. From 21 January to 23 February 2020, 14 cases of COVID-19 were reported in six states. The outbreak appeared contained through February 2020, although the CDC warned the American public for the first time on 25 February 2020 (35th day since the outbreak began) to prepare for a local outbreak [43]. In the last week of February, several large events contributed to the further spreading of COVID-19 in Louisiana, Massachusetts and Georgia [43]. On 12 March 2020, the number of confirmed cases in the U.S. exceeded 1000 [44]. According to Liptak [45], the White House on 16 March advised the general population to avoid gatherings of more than 10 people, and on 19 March the State Department [46] advised U.S. citizens to avoid all international travel. According to Khazan [47], by the middle of March, all 50 states were able to perform tests with a doctor’s approval, but the number of available test kits remained limited, which means that the true number of people infected was much higher than reported.
In this period, federal and state agencies began taking urgent steps to prepare for a surge of hospital patients, establishing additional places for patients in case hospitals became overwhelmed, and military personnel and volunteers were called up to help construct the emergency facilities. Although the government in this period responded by preventing rallies and providing testing facilities and hospital capacities for a growing number of confirmed patients, the general population began to protest against government-imposed lockdowns. The first protest was in Michigan on 15 April 2020 (85th day since the outbreak began), where an estimated 3000 people took part [48]. Following the protest in Michigan, anti-lockdown protests were held in every state, with the number of protesters ranging from hundreds to thousands. Additionally, there were 450 major protests (Black Lives Matter) which were held in cities and towns across the U.S. due to racially charged events [49]. These protests, which occurred in the period between 90 and 150 days since the outbreak began, certainly had a major influence on the rapid virus spread as well as the growth in the number of confirmed patients. In the period between 100 and 250 days since the outbreak in the U.S. started, more than 150 health professionals sent a letter to the federal government in which they requested a lockdown of 6–8 weeks. They believed that this would restore the country by 1 October 2020 [50,51,52]. Had the government imposed the lockdown, this could have prevented the rapid growth of the number of confirmed patients; the government’s disregard of the multiple demands of health professionals therefore also contributed to the rapid spread of the virus. An additional contributing factor to the rapid spread of the virus was the motorcycle rally in South Dakota, which more than 400,000 people attended [53]. The massive gathering resulted in more than 300 confirmed patients from 20 states [53,54].
In the last period, from 250 to 317 days since the outbreak began, the presidential election campaign contributed to the additional spreading of the virus. It is reported that the aftermath of the presidential election campaigning increased the number of confirmed patients by 35% [55]. At the end of the investigated period, all previously mentioned political and social events had some influence on the virus spread, which resulted in more than 14,000,000 confirmed patients by 317 days since the outbreak began.

3.2. Symbolic Expression for Estimation of the Number of Deceased Patients for Each State and the Entire U.S.

The procedure for obtaining a symbolic expression for the estimation of deceased patients for each state is the same as for the symbolic expressions obtained for the estimation of confirmed patients for each state. Most of the obtained expressions are too large to be presented in this paper. Instead, the examples of the two smallest symbolic expressions are shown for the estimation of the number of deceased patients for Hawaii and Idaho, which achieved accuracy on a testing dataset of 0.9905 and 0.9906, respectively.
y d e c e a s e d H a w a i i = max ( X 0 + 2 X 2 354.196 , log ( X 2 X 0 ) ( ( X 2 2 X 0 ) log ( X 2 X 0 ) X 0 + X 2 ) ) .
y d e c e a s e d I d a h o = 0.0000132271 X 1 X 2 2 2 X 2 X 0 X 1 X 2 min X 1 X 0 , X 2 X 1 X 0 X 0 .
Equations (9) and (10) represent two symbolic expressions for the estimation of the number of deceased patients for Hawaii and Idaho. $X_0$ and $X_1$ represent the latitude and longitude of the central location of each state, respectively. $X_2$ represents the number of days elapsed since the date at which the COVID-19 outbreak started (22 January 2020). The GP parameters used to obtain the best symbolic expression of deceased patients for each state are given in Table A2, together with the achieved $R^2$ score.
The population sizes tended towards the higher end of the range, although selected values are spread across the whole range. Many models converged with the number of generations near the lower end of available values. Most solutions used the higher end of the available values for the number of tournament entries. Many initial tree size values tended towards having six as the lower bound. Similar to the previously observed case, the values for crossover and mutation probabilities tend towards the lower end of the range. The value of the maximum number of samples was selected across the entirety of the range. The range of constants used is also large for the best-selected solutions. Parsimony coefficient values tend to be around the middle of the range for most cases. The achieved $R^2$ scores of the obtained symbolic expressions for the estimation of deceased patients for the 50 states are in a range between 0.9404 and 0.9998. The lowest $R^2$ score was achieved in the case of Washington, while the highest $R^2$ score was achieved in the case of Florida.
The achieved $R^2$ score for each state is shown in Figure 3a.
As shown in Figure 3a, all symbolic expressions for estimation of deceased patients obtained for each state are estimating the number of deceased patients with high accuracy. The symbolic expressions that achieved accuracy higher than 0.999 are for Arkansas, California, Florida, Illinois and Missouri. To compare the accuracy of obtained symbolic expressions for deceased patients with the real data, all symbolic expressions for each state are summed up using Equation (5). The accuracy of the summed symbolic expression for estimation of deceased patients for the entire U.S. is calculated using Equation (2). In Figure 3b, the summed symbolic expression for deceased patients estimation is compared to the real data for the entire U.S.
As shown in Figure 3b, the summed symbolic expression for the estimation of the number of deceased patients for the entire U.S. estimates the number of deceased patients with high accuracy when compared to the real data (data from the dataset). The achieved $R^2$ accuracy of the symbolic expression for the estimation of the number of deceased patients for the entire U.S. is equal to 0.9997.
As shown in Figure 3b, the number of deceased patients in the first 50 days since the outbreak began is very small. The first recorded deceased patient in the U.S. was in California on 2 February 2020 [56]. On 11 April 2020, the number of deceased patients in the U.S. became the highest in the world, reaching 20,000 [57]. The anti-lockdown and Black Lives Matter protests which occurred between April and July in various states contributed substantially to the increase of confirmed and deceased patients. By 27 May, 100,000 patients had died from COVID-19 [58]. On 22 September, the number of deceased patients passed 200,000 [59]. The presidential election campaign also had some influence on the number of deceased patients, since the number surpassed the value of 250,000. According to Woolf et al. [59], COVID-19 has become deadlier than heart disease and cancer. The mortality rates from COVID-19 pose a threat to different age groups. The investigation showed that COVID-19 had become the third leading cause of death for patients aged 45–84 years and the second leading cause of death for those aged 85 years or older. For example, between 1 October (254th day) and 3 December (317th day) the number of COVID-19 deaths tripled, from 826 to 2430 deceased per day. However, it should be mentioned that, on 21 April (91st day), at the height of the spring surge, the number of COVID-19 deaths was 2856. The record-breaking number of deceased patients in the last 63 days in Figure 3b indicates that lethality may increase further as transmission increases with holiday travel and gatherings.

3.3. Symbolic Expression for Estimation of the Number of Recovered Patients for Each State and the Entire U.S.

The procedure of obtaining symbolic expressions for the estimation of the number of recovered patients for each state is the same as the procedure used to obtain the symbolic expression for the estimation of the number of the confirmed and deceased patients for each state. Most of the symbolic expressions used to estimate the number of recovered patients for a particular state are too large to be presented in this article. Instead, two simple symbolic expressions are given, which are obtained for estimation of the numbers of the recovered patients for Washington and Wyoming that achieved accuracies on the testing dataset of 0.9929 and 0.9966, respectively.
y R e c o v e r e d W a s h i n g t o n = 0.0000508903 X 1 X 2 2 log ( | X 1 | ) min ( X 2 , X 2 X 0 ) min ( X 2 3 / 2 X 0 , X 2 X 0 ) .
y R e c o v e r e d W y o m i n g = X 2 X 2 | X 1 | log ( X 2 ) | 2 X 2 X 1 | | X 2 X 1 | X 0 | X 0 | + X 2 | 2 X 2 | | X 2 X 1 | X 0 | X 0 | .
As above, in Equations (11) and (12), $X_0$ and $X_1$ are the latitude and longitude of central state locations, respectively, while $X_2$ is the number of days since the COVID-19 outbreak started (22 January 2020). As in the previous cases, the individual results may be found in the Appendix (Table A3), which gives the GP parameters used to obtain the best symbolic expression for the estimation of the number of recovered patients for each state together with the achieved $R^2$ score.
The GP parameters shown in Table A3, which were used to obtain the symbolic expressions for the estimation of recovered patients in each state, were randomly selected from the pre-specified ranges shown in Table 3.
Most population sizes were close to the middle of the available range. The same is true for the selected number of generations, while the tournament size tended towards the higher end of the available range. It can be noticed that the values of the operation coefficients, as in the previously observed cases, tend towards the lower end of the range. The selected maximum number of samples tended towards the higher end of the range. The values of parsimony coefficients were equally distributed across the range. Values of constant ranges are large, as they were in both previous cases.
As shown in Table A3, all symbolic expressions obtained for estimation of recovered patients for each state achieved very high accuracy, which was measured in terms of $R^2$ score. The $R^2$ score is in a range from 0.9797 to 0.99955. The lowest accuracy was achieved in the case of Vermont, while the highest accuracy was achieved in the case of Minnesota. In Figure 4a, the $R^2$ accuracy is shown on a U.S. map.
As shown in Figure 4a and Table A3, all symbolic expressions achieved very high accuracies. The symbolic expressions for the estimation of the number of recovered patients that achieved $R^2$ accuracies higher than 0.999 are those obtained for Idaho, Minnesota, Ohio, Tennessee and Texas. To compare the estimation of recovered patients for the entire U.S. with the real data, the symbolic expression for the entire U.S. was obtained as the sum of 50 symbolic expressions (one for each state) using Equation (6). The results obtained with the symbolic expression for the entire U.S. and the real data are used in Equation (2) to obtain the $R^2$ value, which measures the accuracy of the symbolic expression for estimation of the number of recovered patients for the entire U.S. In Figure 4b, the estimated number of recovered patients obtained with the symbolic expression for the entire U.S. is compared with the real data.
As shown in Figure 4b, the number of recovered patients in the entire U.S. is exponentially growing. In the first 100 days since the outbreak began (22 January 2020), the number of recovered patients was very small when compared to the remaining 217 days. After 200 days, the number of recovered patients reached almost 2,500,000. It is interesting to notice that, in the last 20 days, the number of recovered patients is growing much faster than ever before, since the number of recovered patients in those 20 days is almost 1,000,000. The symbolic expression for the estimation of recovered patients for the entire U.S. estimates the number of recovered patients with high accuracy when compared to the real data (data from the dataset). Another indicator of the accuracy of the symbolic expression is the $R^2$ value of 0.9996 calculated using Equation (2).
At the beginning of the virus outbreak in the U.S., during the first 80 days, the number of recovered patients followed a trend similar to that of the number of confirmed and deceased patients. As the number of confirmed patients increased due to various large gatherings such as protests and other manifestations, the number of recovered patients also increased. However, it should be mentioned that recovery from the disease was different for different age groups. Woolf et al. [59] showed that the mortality rate is higher for age groups above 35 and that, for the age group of 85 years or older, COVID-19 is the second leading cause of death in the U.S. Younger age groups are more likely to recover from the disease than the older ones.

Symbolic Expression for Estimation of Epidemiology Curve for the Entire U.S.

At this point, all the required components for defining the symbolic expression for epidemiology curve estimation are defined. The symbolic expression for the estimation of the number of confirmed patients for the entire U.S. is defined as the summation of 50 symbolic expressions. Each symbolic expression is the expression for the estimation of the number of confirmed patients for a specific U.S. state. The same procedure is applied to obtain the symbolic expressions for the estimation of the number of deceased and recovered patients for the entire U.S. With these three equations, the epidemiology curve for the entire U.S. can be calculated using Equation (3). The obtained symbolic expression for epidemiology curve estimation is used with the real data to calculate its accuracy in terms of the $R^2$ metric using Equation (2). The achieved $R^2$ value of the symbolic expression for estimation of the epidemiology curve is equal to 0.9933. In Figure 5, the estimated epidemiology curve is compared with the real data.
As shown in Figure 5, the general trend of infected (active) patients is still rising. For the first 250 days since the outbreak began (22 January 2020), the number of infected patients is slowly rising. Unfortunately, in the last 80 days, the number of infected patients is rapidly increasing, which means that the number of confirmed patients is growing much faster than the number of recovered patients. The obtained symbolic expression for estimation of the epidemiology curve follows the trend of the real data with small deviations. The most noticeable deviation from the real data can be seen in the last 20 days, where the estimated number of infected patients is smaller than that obtained from the real data.
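How the three national estimates combine into one curve can be illustrated with a short sketch. Equation (3) is not reproduced in this section, so the code assumes the common definition of active cases, namely confirmed minus recovered minus deceased. The numbers are toy values, not data from the dataset.

```python
def active_cases(confirmed, recovered, deceased):
    """Assumed form of the epidemiology curve: confirmed - recovered - deceased."""
    return [c - r - d for c, r, d in zip(confirmed, recovered, deceased)]


def relative_deviation(estimated, real):
    """Signed deviation of each estimate from the real value, as a fraction."""
    return [(e - x) / x for e, x in zip(estimated, real)]


# Toy daily totals, for illustration only.
real_curve = active_cases([100, 200], [10, 50], [1, 5])
estimated_curve = active_cases([98, 190], [10, 50], [1, 5])

print(real_curve)       # prints [89, 145]
print(estimated_curve)  # prints [87, 135]
print(relative_deviation(estimated_curve, real_curve))
```

Under this assumption, an underestimate in the confirmed count carries over directly into the curve, and it is larger in relative terms because the active count is smaller than the confirmed count.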

3.4. Sensitivity Analysis

The variance-based sensitivity analysis method [60] (Sobol indices) was used to estimate the effect of each model input parameter (latitude, longitude and number of days since the outbreak began) on the output parameter (number of confirmed/deceased/recovered patients for the entire U.S.). This method was used to calculate the first-order, second-order and total-effect indices using the Python package SALib [61]. The first-order indices indicate the fraction of the output variance that can be attributed to varying each parameter individually. The first-order indices sum to at most 1, and the sum equals 1 only when there is no interaction between parameters. The total-effect indices measure the total variance contribution of a given parameter, including its interaction effects, and as a result their sum is greater than 1 whenever interactions are present. If the total-effect indices are substantially larger than the first-order indices, then there are likely higher-order interactions occurring. The second-order indices measure the variance caused by the interaction of two parameters. Saltelli’s sampling scheme was used to draw samples from uniform distributions for each variable, which resulted in 80,000 parameter sets. The values obtained from the sensitivity analysis of the estimated confirmed/deceased/recovered cases for the entire U.S. are given in Table 4, Table 5 and Table 6.
As shown in Table 4, the latitude ($X_0$) and the number of days since the outbreak began ($X_2$) exhibit first-order sensitivities, but the longitude ($X_1$) has a very small first-order sensitivity. The total-effect indices for all three parameters are substantially higher than the first-order indices, which indicates that higher-order interactions occur. The second-order indices indicate that the strongest interaction, equal to 0.384535, exists between latitude ($X_0$) and number of days ($X_2$).
Table 5 shows that, as in the case of the number of confirmed patients, the latitude ($X_0$) and the number of days since the outbreak began ($X_2$) exhibit first-order sensitivities, while the first-order sensitivity for longitude ($X_1$) is minimal. Higher-order interactions do not occur, as the total-effect values are not substantially higher than the first-order indices.
For the number of recovered patients (Table 6), the first-order index is substantial only for the number of days ($X_2$), while those of the other two parameters are very small. In this case, however, the first-order index for longitude ($X_1$) is much higher than that for latitude ($X_0$). The total-effect indices are not substantially higher than the first-order indices, so no higher-order interactions occur.
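To make the meaning of the first-order and total-effect indices concrete, the sketch below estimates them with a hand-written Monte Carlo estimator that uses only the Python standard library. It is not the SALib workflow used in the study. It draws plain pseudo-random samples rather than a Saltelli sequence, omits the second-order indices, and uses a hypothetical `toy_model` with hypothetical input bounds. For reference, SALib's Saltelli sampler produces N(2D + 2) rows when second-order indices are requested, so 80,000 parameter sets with D = 3 inputs would correspond to a base sample of N = 10,000. That value of N is inferred here and is not stated in the text.

```python
import random


def sobol_indices(model, bounds, n=10_000, seed=0):
    """Monte Carlo estimate of first-order and total-effect Sobol indices."""
    rng = random.Random(seed)

    def draw():
        return [rng.uniform(low, high) for low, high in bounds]

    # Two independent sample matrices A and B, each with n rows.
    a_rows = [draw() for _ in range(n)]
    b_rows = [draw() for _ in range(n)]
    f_a = [model(x) for x in a_rows]
    f_b = [model(x) for x in b_rows]
    mean = (sum(f_a) + sum(f_b)) / (2 * n)
    var = sum((y - mean) ** 2 for y in f_a + f_b) / (2 * n)

    first, total = [], []
    for i in range(len(bounds)):
        # Rows of A with column i taken from B.
        f_ab = [model(a[:i] + [b[i]] + a[i + 1:]) for a, b in zip(a_rows, b_rows)]
        # First-order effect (output centred to reduce estimator noise).
        v_i = sum((yb - mean) * (yab - ya)
                  for ya, yb, yab in zip(f_a, f_b, f_ab)) / n
        # Total effect (Jansen estimator).
        vt_i = sum((ya - yab) ** 2 for ya, yab in zip(f_a, f_ab)) / (2 * n)
        first.append(v_i / var)
        total.append(vt_i / var)
    return first, total


def toy_model(x):
    """Hypothetical model with a latitude-day interaction and a weak longitude term."""
    lat, lon, day = x
    return lat * day + 0.1 * lon


if __name__ == "__main__":
    bounds = [(25.0, 49.0), (-125.0, -67.0), (0.0, 317.0)]  # hypothetical ranges
    s1, st = sobol_indices(toy_model, bounds, n=5_000)
    for name, a, b in zip(("latitude", "longitude", "day"), s1, st):
        print(f"{name}: S1={a:.3f} ST={b:.3f}")
```

For a purely additive model the first-order and total-effect indices coincide. A product term such as `lat * day` pushes the total-effect indices of both inputs above their first-order values, which is the pattern reported for the confirmed-patients expression.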

3.5. Discussion

From the conducted investigation, it can be noticed that each symbolic expression obtained for each U.S. state estimates the number of confirmed patients in that state with high accuracy. As shown in Figure 2a and Table A1, the achieved $R^2$ scores range from 0.9406 to 0.9992. To obtain the symbolic expression used for the estimation of the number of confirmed patients for the entire U.S., all 50 symbolic expressions for the estimation of the number of confirmed patients for each state were summed up using Equation (4). This symbolic expression achieved a high accuracy of 0.9992 in the estimation of the number of confirmed patients for the entire U.S., and the comparison with the real data is shown in Figure 2b. Initially, the number of confirmed patients in the U.S. for the first 60 days since the outbreak started was negligible. However, the initial government response to inform the general public was made in mid-March of 2020, although mass gatherings in the first 60 days did occur and the CDC informed the general public at the end of February of a potential local outbreak. During Days 60–250, the number of confirmed patients grew from nearly 0 to above 7,000,000. In this period, the general public protests against lockdown measures in almost every state and the Black Lives Matter protests were the main contributing factors to the increased virus spreading. In the last period of Days 250–317, the number of confirmed patients in the U.S. doubled from 7,000,000 to 14,000,000. The main contributing factors to the virus spread at that time were presidential rallies and mass gatherings in South Dakota.
The symbolic expression for the estimation of the number of confirmed patients for the entire U.S., which was obtained as the summation of the symbolic expressions for each state, follows the real data with small deviations in the ranges 120–170 and 300–317 days. The first interval can be divided into two sub-intervals: in the first (120–150 days) the symbolic expression underestimated the number of confirmed patients, while in the second (150–170 days) it overestimated the number of confirmed patients. In the second interval (300–317 days), the symbolic expression underestimated the number of confirmed patients.
The same procedure was adopted for the estimation of the number of deceased and recovered patients. First, the symbolic expressions for the estimation of the number of deceased and recovered patients for each state were obtained, and then all 50 equations were summed up to obtain symbolic expressions for the estimation of the number of deceased or recovered patients. As shown in Figure 3a and Table A2, the $R^2$ scores achieved with the symbolic expressions for the estimation of the number of deceased patients for each state range from 0.9404 to 0.99984. The obtained symbolic expressions for each state were used in Equation (5) to obtain the symbolic expression for the estimation of the number of deceased patients for the entire U.S., which achieved an accuracy of 0.9997. As shown in Figure 3b, the curve of deceased patients has a trend similar to that of the confirmed patients curve. In the first 60 days since the outbreak, the number of deceased patients is extremely small. In the interval from 60 to 150 days since the outbreak began, the number of deceased patients rapidly increased to almost 125,000. This increase can be attributed to the protests that occurred in this period, which contributed to the virus spreading, as well as to the high mortality rate in the age groups above 35 years, according to Woolf et al. [59]. After 150 days, the number of deceased patients has an almost linear trend, and after 300 days since the outbreak began it is near 250,000. In the interval between 300 and 317 days since the outbreak began, the number of deceased patients again grows rapidly. As in the case of the number of confirmed patients, the presidential election campaign did have some influence on spreading the virus among those who attended these gatherings.
The symbolic expression obtained for the estimation of the number of deceased patients follows the trend of the real data with small oscillations in the interval between 60 and 100 days. In the interval between 300 and 317 days since the outbreak began, the symbolic expression underestimated the number of deceased patients by 3%. From this comparison and the achieved $R^2$ score in the estimation of the number of deceased patients for the entire U.S., it can be concluded that the symbolic expression estimates the number of deceased patients with high accuracy.
The $R^2$ score achieved with the symbolic expressions for the estimation of the number of recovered patients for each state, as shown in Figure 4a and Table A3, was in the range from 0.9797 to 0.99955. The symbolic expressions for each state were summed up using Equation (6) to obtain the symbolic expression for the estimation of the number of recovered patients for the entire U.S., and this symbolic expression achieved an accuracy of 0.9996. As shown in Figure 4b, for the first 60 days since the outbreak began, there were very few recovered patients. After 60 days, the number of recovered patients almost exponentially increased, reaching 810,000 recovered patients after 317 days. The symbolic expression for the estimation of the number of recovered patients very accurately follows the real data. As in the case of the number of confirmed and deceased patients, the only deviation from the real data can be noticed at the end, where the symbolic expression slightly underestimated the number of recovered patients. The number of recovered patients, in general, is high when compared to the number of deceased patients. However, this number could have been higher if the government, at the time the outbreak in the U.S. started, had enforced the rigorous restrictions which were advised by the CDC and healthcare professionals several times in the investigated period.
The symbolic expressions for the estimation of the number of confirmed, deceased and recovered patients were all used to obtain the symbolic expression for the estimation of the epidemiology curve. From the obtained solution, which is graphically represented in Figure 5, after 30 days since the outbreak began, there is a small increase in the number of infected (active) patients. From 30 to 130 days since the outbreak began, there is an increase in the number of infected patients to almost 110,000. In the range from 150 to 250 days since the outbreak began, there is an increase in the number of infected patients to approximately 250,000. After 250 days, the number of infected patients had a huge increase, which means that the number of confirmed patients rapidly increased with a smaller increase in the number of deceased and recovered patients. The symbolic expression for the estimation of the number of infected patients, when compared to the real data, follows the trend of infected patients with small oscillations in the range from 100 to 150 days, where the symbolic expression slightly underestimated the number of infected patients. In the range from 150 to 180 days, the symbolic expression overestimated the number of infected patients. In the range from 180 to 300 days, there are some small oscillations in the estimation of the number of infected patients. The most noticeable difference is in the range from 300 to 317 days since the outbreak began, where the symbolic expression underestimated the number of infected patients. This underestimation arises from the underestimations made by the symbolic expressions obtained for the estimation of the number of confirmed, deceased and recovered patients.
Based on the conducted investigation, it can be noticed that the latitude and longitude of the central location, and the number of days since the outbreak began can be used as the input variables to obtain symbolic expressions for the estimation of the number of confirmed, deceased and recovered patients for each state. However, it should be mentioned that lower accuracies achieved with symbolic expressions for the estimation of the number of confirmed, deceased and recovered patients will have some influence when they are summed up to estimate the number of confirmed, deceased and recovered patients for the entire U.S. and finally the epidemiology curve. The underestimations made by symbolic expressions for confirmed, deceased and recovered patients in the interval from 300 to 317 days are noticed when the estimated epidemiology curve is compared with the epidemiology curve obtained from the real data.
The failure of the general public and the government officials to take necessary steps to prevent viral transmission made the entire U.S. vulnerable. The lack of necessary steps allowed COVID-19 to become one of the leading causes of death for those aged 35 years or older. The development and implementation of the vaccine offer some prospects, but this solution will not come soon enough to avoid an increased number of COVID-19-related hospitalizations and death.
From the sensitivity analysis performed on the symbolic expression for the estimation of the number of confirmed cases for the entire U.S., it can be concluded that the latitude ($X_0$) and the number of days since the outbreak began ($X_2$) contribute strongly to the model variance, while the longitude ($X_1$) accounts for only 0.047% of the model variance. The second-order indices confirm that the latitude and the number of days since the outbreak began together make a high contribution to the model variance. The sensitivity analysis performed on the symbolic expressions for the estimation of the number of deceased and recovered cases for the entire U.S. showed that the latitude and the number of days since the outbreak began exhibit first-order sensitivities, while the first-order sensitivity of the longitude is much smaller than that of the other two parameters. Higher-order interactions are not indicated, because the total-effect indices are not substantially higher than the first-order indices.
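The comparison of first-order and total-effect indices can be illustrated with a minimal variance-based sensitivity sketch. It is not the analysis pipeline of this study (the reference list cites SALib for that purpose). It uses only the Python standard library, a hypothetical additive toy model in place of the obtained symbolic expressions, inputs assumed uniform on [0, 1], and the commonly used Saltelli first-order and Jansen total-effect estimators.

```python
import random
from statistics import fmean


def toy_model(x):
    # Hypothetical additive model; x = [x0, x1, x2], each in [0, 1].
    # x1 is given a deliberately small weight.
    return x[0] + 0.05 * x[1] + x[2]


def sobol_indices(f, k, n, seed=0):
    """Estimate first-order and total-effect indices for k inputs ~ U(0, 1)."""
    rng = random.Random(seed)
    A = [[rng.random() for _ in range(k)] for _ in range(n)]
    B = [[rng.random() for _ in range(k)] for _ in range(n)]
    fA = [f(a) for a in A]
    fB = [f(b) for b in B]
    mean = fmean(fA + fB)
    var = fmean((y - mean) ** 2 for y in fA + fB)
    first, total = [], []
    for i in range(k):
        # AB_i: rows of A with column i taken from B.
        fAB = [f(a[:i] + [b[i]] + a[i + 1:]) for a, b in zip(A, B)]
        # Saltelli-type first-order estimator (f(B) centered to reduce noise).
        v_i = fmean((yb - mean) * (yab - ya)
                    for ya, yb, yab in zip(fA, fB, fAB))
        # Jansen total-effect estimator.
        vt_i = 0.5 * fmean((ya - yab) ** 2 for ya, yab in zip(fA, fAB))
        first.append(v_i / var)
        total.append(vt_i / var)
    return first, total


S1, ST = sobol_indices(toy_model, k=3, n=20000)
```

For an additive model such as this one, each total-effect index is expected to match its first-order index up to sampling noise, which is the pattern the text interprets as an absence of substantial higher-order interactions.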

4. Conclusions

In this study, the GP algorithm was applied to a publicly available dataset to obtain symbolic expressions for the estimation of the number of confirmed, deceased and recovered patients for each state. These state-level symbolic expressions were summed to obtain symbolic expressions for the estimation of the number of confirmed, deceased and recovered patients for the entire U.S. The resulting equations were then used to obtain the equation for the estimation of the epidemiology curve, which estimates the real epidemiology curve with high accuracy. From the investigations conducted, the following conclusions can be drawn:
  • The GP algorithm can be utilized to obtain symbolic expressions for each U.S. state that estimate the number of confirmed/deceased/recovered patients for that state, using the latitude and longitude of the state's central location and the number of days since the outbreak began as input variables.
  • The obtained symbolic expressions for the estimation of the number of confirmed/deceased/recovered patients for each state can be summed to obtain the symbolic expression for the estimation of the number of confirmed/deceased/recovered patients for the entire U.S. with high accuracy.
  • Symbolic expressions for the estimation of the number of confirmed, deceased and recovered patients of the entire U.S. can be combined to obtain the symbolic expression for the estimation of the epidemiology curve with very high accuracy.
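The accuracy referred to in these conclusions is reported as an $R^2$ score. As a reminder of what that figure measures, a minimal sketch is given here, assuming the standard definition of the coefficient of determination; the numbers are hypothetical and are not data from this study.

```python
def r2_score(actual, estimated):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - e) ** 2 for a, e in zip(actual, estimated))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical patient counts and estimates (illustration only).
actual = [10.0, 20.0, 30.0, 40.0]
estimated = [12.0, 18.0, 33.0, 39.0]
score = r2_score(actual, estimated)
```

A score of 1 corresponds to a perfect match between estimated and actual values, so the values reported in Appendix A indicate how closely each state-level expression follows the real data.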

Author Contributions

Conceptualization, N.A., I.L., S.B.Š., T.Š., A.P., T.Ć., N.F. and Z.C.; methodology, N.A., I.L. and S.B.Š.; software, N.A.; validation, N.A., I.L., T.Š., Z.J., A.B. and Z.C.; formal analysis, A.P., T.Ć., N.F. and Z.C.; investigation, I.L., S.B.Š., A.B. and T.Š.; resources, T.Ć. and Z.C.; data curation, T.Š. and N.F.; writing—original draft preparation, N.A., I.L., A.B. and S.B.Š.; writing—review and editing, N.A., T.Š., A.P., A.B., Z.J., T.Ć., N.F. and Z.C.; visualization, Z.J. and N.A.; supervision, N.F. and Z.C.; project administration, N.F. and Z.C.; and funding acquisition, T.Ć., N.F. and Z.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially funded by the Central European Initiative under the grant “Use of Regressive Artificial Intelligence (AI) and Machine Learning (ML) Methods in Modelling of COVID-19 Spread” (305.6019-20).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The obtained hyperparameters and the transformed dataset used in the research are available at https://github.com/RitehAIandRobot/CovidUSA-RegressionData.

Acknowledgments

This research was (partly) supported by the CEEPUS network CIII-HR-0108, European Regional Development Fund under the grant KK.01.1.1.01.0009 (DATACROSS), project CEKOM under the grant KK.01.2.2.03.0004, CEI project “COVIDAi” (305.6019-20) and University of Rijeka scientific grant uniri-tehnic-18-275-1447.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Tables of Results and Hyperparameters for Each State

Appendix A.1. Table of Results and Hyperparameters for Confirmed Patients

Table A1. GP parameters used to obtain symbolic expressions of confirmed patients for each state with achieved $R^2$ score.
Federal State
GP Parameters (Population Size, Number of Generations, Tournament Selection, Tree Depth, Crossover Coefficient, Subtree Mutation Coefficient, Point Mutation Coefficient, Hoist Mutation Coefficient, Maximum Samples, Constant Range, Parsimony Coefficient)
$R^2$ Score
Alabama21117449(5, 11)0.920.00860.03770.03480.740.97(−70,101.14, 63,475.94)0.3370.9946
Alaska77116077(5, 11)0.90.00810.03140.0050.250.99(−74,741.62, 45,222.72)0.9130.9977
Arizona88314770(4, 12)0.910.02090.04010.02830.230.96(−81,270.3, 72,219.87)0.9280.9988
Arkansas96011928(6, 9)0.910.0550.00230.01110.360.97(−10,062.42, 73,692.85)1.6550.9973
California46116537(4, 12)0.910.0110.00240.06760.280.92(−37,240.04, 78,697.54)0.5070.9974
Colorado46010745(5, 12)0.930.00190.0170.04880.130.97(−56,475.89, 14,750.88)0.9130.9951
Connecticut90119549(5, 8)0.90.00050.01240.06480.240.9(−28,905.51, 10,867.32)1.1290.9972
Delaware51115685(4, 12)0.90.03190.00290.00340.590.98(−92,946.76, 90,765.21)1.1530.9982
Florida93015593(5, 11)0.910.0140.03650.04120.170.97(−74,763.22, 17,556.85)1.6310.9975
Georgia71018044(4, 10)0.910.02430.02490.0190.790.93(−11,923.53, 43,023.78)0.7810.9990
Hawaii63718839(4, 9)0.910.06220.01770.00980.720.99(−84,943.77, 58,918.57)0.4060.9978
Idaho82218541(5, 12)0.90.00430.03870.04140.960.91(−50,998.8, 83,710.65)1.4410.9939
Illinois82017453(6, 8)0.910.01010.00010.00250.40.97(−68,319.97, 77,849.63)0.3340.9949
Indiana96619422(4, 11)0.910.03030.04150.01270.080.94(−73,395.89, 37,738.69)1.5120.9966
Iowa51611632(6, 8)0.910.00080.04770.00370.390.91(−28,074.55, 95,030.32)0.3320.9961
Kansas27717444(6, 9)0.90.02930.01090.04110.670.91(−82,436.08, 59,439.65)1.6810.9979
Kentucky86711197(3, 12)0.930.00460.03310.02930.950.94(−30,555.51, 41,645.64)1.2020.9979
Louisiana28417491(3, 9)0.960.02190.01080.00620.870.99(−95,819.22, 60,443.52)1.9530.9924
Maine80413386(4, 9)0.90.0210.01020.00250.020.99(−37,413.71, 42,394.4)1.4170.9931
Maryland50714525(3, 12)0.950.02980.00230.01790.380.98(−15,303.86, 17,295.29)0.5290.9933
Massachusetts32218398(5, 10)0.90.00580.05680.01280.840.97(−31,625.58, 98,448.88)1.4950.9948
Michigan82710050(6, 10)0.930.01670.00230.04510.780.97(−52,357.23, 60,834.5)1.0810.9905
Minnesota68113520(4, 7)0.920.01680.02060.03330.350.92(−94,313.64, 64,670.51)1.4110.9957
Mississippi32419985(6, 7)0.90.00260.06040.01220.890.91(−47,173.58, 27,677.71)1.8350.9946
Missouri294110100(3, 9)0.920.00470.06190.01560.610.99(−48,330.54, 74,375.8)0.840.9959
Montana55419185(3, 7)0.910.01140.01010.06770.690.91(−82,122.76, 43,188.31)0.9020.9950
Nebraska94811848(3, 10)0.90.02060.0330.04030.80.91(−74,393.69, 72,222.09)1.6120.9966
Nevada73610250(3, 8)0.910.03730.00290.04910.270.91(−44,470.01, 81,992.42)1.0860.9938
New Hampshire98717245(6, 7)0.910.00140.0510.02010.510.97(−86,428.52, 85,172.59)0.5380.9950
New Jersey91312051(6, 7)0.920.04630.00860.01470.490.98(−98,447.98, 75,736.98)0.6650.9976
New Mexico54918571(6, 8)0.920.0380.0040.00860.931(−37,868.32, 70,577.25)0.5250.9987
New York72018079(3, 12)0.90.0050.02710.040710.97(−37,026.3, 56,065.38)1.4310.9992
North Carolina90516661(3, 9)0.930.04910.00070.00860.850.97(−97,323.84, 12,889.53)0.5970.9977
North Dakota35216837(6, 7)0.90.03790.03160.02490.640.94(−71,924.48, 96,395.39)0.3220.9991
Ohio82710164(4, 12)0.920.00290.02780.03080.90.96(−71,388.35, 16,610.45)1.5570.9986
Oklahoma97512265(6, 9)0.920.00110.01140.04690.520.9(−92,019.71, 79,120.08)1.1440.9980
Oregon38216996(4, 9)0.90.0210.06610.00890.720.91(−17,084.62, 57,894.87)1.0550.9992
Pennsylvania81213028(6, 7)0.910.00090.02760.03240.250.95(−15,561.47, 37,619.51)1.3390.9939
Rhode Island80011879(5, 8)0.910.01110.04060.0090.690.93(−82,805.13, 96,728.37)1.0290.9969
South Carolina44411448(6, 9)0.920.00410.03090.00030.340.95(−43,420.85, 57,720.71)0.6180.9990
South Dakota35314351(5, 10)0.930.01930.01150.03450.560.91(−51,233.69, 53,021.87)0.390.9958
Tennessee74611823(4, 11)0.920.0490.01340.00390.660.96(−89,010.11, 63,768.4)1.4380.9963
Texas93312427(5, 10)0.940.01870.0240.00880.040.91(−94,093.37, 12,518.37)1.1080.9960
Utah27314256(6, 8)0.910.00470.00920.03380.80.96(−74,941.61, 80,716.41)0.8520.9975
Vermont44013651(5, 9)0.970.00240.010.01260.590.9(−17,580.55, 22,768.17)1.7750.9406
Virginia52913476(6, 7)0.920.040.03030.00940.610.97(−52,056.14, 24,414.66)1.0220.9992
Washington41411591(4, 9)0.910.0110.00340.02160.580.91(−87,760.78, 65,016.13)1.620.9934
West Virginia36619423(6, 8)0.930.03610.00280.0080.540.9(−37,953.52, 50,863.41)0.8690.9992
Wisconsin30912832(4, 8)0.920.0470.00720.01180.640.95(−45,415.23, 20,678.45)1.8920.9992
Wyoming56516745(4, 11)0.90.03170.00210.02460.170.92(−28,663.07, 33,417.43)0.690.9912

Appendix A.2. Table of Results and Hyperparameters for Deceased Patients

Table A2. GP parameters used to obtain symbolic expressions of deceased patients for each state with achieved $R^2$ score.
State
GP Parameters (Population Size, Number of Generations, Tournament Selection, Tree Depth, Crossover Coefficient, Subtree Mutation Coefficient, Point Mutation Coefficient, Hoist Mutation Coefficient, Maximum Samples, Constant Range, Parsimony Coefficient)
$R^2$ Score
Alabama1560100197(4, 11)0.930.04580.01190.00430.820.96(−59,911.8, 74,719.3)0.1280.9951
Alaska1807193109(3, 7)0.910.05090.02520.00430.230.92(−53,669.79, 47,613.97)0.010.9911
Arizona1485101174(5, 8)0.910.03850.00910.01260.680.99(−44,062.11, 11,466.03)0.1510.9963
Arkansas1630102196(4, 11)0.910.00270.02410.00110.080.91(−81,559.68, 83,104.24)0.0530.9992
California1924117172(3, 8)0.920.01790.03420.029210.92(−96,904.52, 79,877.88)0.1620.9992
Colorado1178179175(5, 12)0.910.0270.02370.01840.20.93(−83,617.77, 25,231.08)0.0390.9968
Connecticut1858103133(6, 8)0.960.00250.01490.00340.510.96(−46,501.8, 88,545.84)0.0250.9976
Delaware1128157151(6, 11)0.90.01850.04930.00050.30.98(−33,858.02, 50,208.99)0.1970.9907
Florida1969129153(6, 8)0.910.0030.02850.05240.940.95(−55,101.46, 69,724.53)0.0680.9998
Georgia1170106139(3, 12)0.90.00050.01010.00170.340.98(−13,935.76, 26,427.37)0.0490.9977
Hawaii1296170138(5, 12)0.90.01290.02220.00850.130.98(−28,616.39, 56,164.19)0.0350.9906
Idaho1083130106(3, 7)0.930.02990.00280.00150.860.95(−67,748.18, 45,591.95)0.1380.9906
Illinois1700123198(3, 11)0.920.030.01730.02750.811(−49,294.38, 21,593.25)0.1610.9990
Indiana1051134102(3, 9)0.910.02130.04140.01940.560.97(−19,518.96, 33,483.42)0.1020.9940
Iowa1667111193(3, 7)0.940.00970.00290.01440.280.97(−98,321.14, 41,757.4)0.0880.9937
Kansas1393108183(6, 12)0.940.00190.02620.02930.70.95(−31,926.29, 67,467.88)0.0240.9985
Kentucky1829131168(3, 10)0.920.06580.00610.00790.20.95(−26,869.89, 16,036.16)0.1950.9950
Louisiana1016114130(6, 12)0.910.00240.01910.06280.80.91(−16,806.27, 71,585.89)0.1350.9941
Maine1869145148(6, 7)0.920.00780.02350.04640.040.98(−64,099.38, 46,028.69)0.0540.9862
Maryland1529172103(3, 11)0.920.03520.01280.02690.30.92(−49,603.7, 42,351.06)0.1560.9982
Massachusetts1487127102(4, 10)0.930.01840.04280.00780.920.97(−64,176.94, 12,262.06)0.1790.9989
Michigan1386129179(4, 8)0.910.00330.00510.00880.140.97(−34,562.94, 11,241.83)0.1110.9950
Minnesota1336181185(6, 9)0.910.04320.01370.030.530.91(−16,875.9, 67,235.11)0.1610.9953
Mississippi1480152188(3, 7)0.930.00530.01050.00290.980.92(−67,189.16, 11,351.11)0.0980.9977
Missouri1173138177(6, 9)0.920.02870.02550.00820.990.94(−35,034.14, 45,295.23)0.0640.9993
Montana1846117137(6, 9)0.970.00180.00580.00760.050.99(−14,649.73, 78,214.13)0.1280.9956
Nebraska1936113174(6, 9)0.910.00530.03170.03940.90.96(−89,634.26, 23,529.16)0.0980.9928
Nevada1626120172(4, 10)0.920.02590.00260.00110.710.94(−99,440.0, 84,386.68)0.0570.9966
New Hampshire1052124164(6, 7)0.90.02890.01320.0080.680.94(−74,774.46, 60,048.45)0.0760.9945
New Jersey1555105100(6, 7)0.910.01270.06240.01330.720.99(−56,699.21, 11,954.36)0.0330.9952
New Mexico1113151162(4, 12)0.910.00830.01340.0710.480.91(−99,066.51, 38,651.75)0.060.9795
New York1127175123(3, 7)0.910.01050.00010.05770.710.94(−89,053.56, 96,401.32)0.1410.9980
North Carolina1417115198(4, 11)0.930.02880.01540.01770.670.99(−39,717.96, 70,703.02)0.0580.9972
North Dakota1195141194(4, 11)0.90.03010.02970.0170.680.91(−46,157.53, 54,882.75)0.1220.9907
Ohio1701154155(3, 8)0.90.01430.06350.00540.131(−29,436.69, 66,004.0)0.1590.9976
Oklahoma1248102121(6, 11)0.940.00790.00370.03710.510.99(−19,528.3, 62,852.77)0.0530.9934
Oregon1950166134(3, 8)0.90.00680.02480.0620.610.94(−11,840.52, 71,714.33)0.0880.9960
Pennsylvania1766149125(4, 9)0.920.00650.05330.00070.390.9(−37,853.26, 79,323.8)0.160.9984
Rhode Island1313100177(4, 9)0.910.03510.00880.03030.581(−91,193.12, 47,731.69)0.1710.9904
South Carolina1951135129(6, 7)0.910.03770.01890.01170.60.91(−36,718.97, 96,189.2)0.110.9982
South Dakota1986117149(3, 7)0.920.06380.00140.00780.070.99(−18,339.16, 69,218.67)0.0750.9972
Tennessee1807135113(4, 7)0.960.00410.01950.0010.070.97(−40,942.66, 76,218.33)0.1310.9989
Texas1611169163(3, 12)0.910.01640.04590.00590.550.96(−30,718.01, 95,378.4)0.1970.9986
Utah1106123144(3, 10)0.970.00580.00170.02410.680.94(−88,621.48, 18,401.85)0.1750.9927
Vermont1482175180(6, 9)0.920.01080.02950.04020.030.92(−95,752.51, 86,013.68)0.1160.9925
Virginia1153196192(6, 9)0.90.03230.03720.02840.470.95(−81,831.35, 83,625.37)0.1060.9970
Washington1596124125(5, 10)0.90.0010.0260.06090.010.96(−78,735.68, 72,604.06)0.1440.9404
West Virginia1309165139(6, 7)0.910.05430.00560.0030.790.91(−20,826.25, 46,189.87)0.0420.9976
Wisconsin1423144108(4, 9)0.940.01760.01580.01510.30.94(−88,808.96, 72,727.52)0.0130.9957
Wyoming1763107185(4, 11)0.930.04490.00440.00480.420.93(−29,862.23, 55,353.33)0.1360.9918

Appendix A.3. Table of Results and Hyperparameters for Recovered Patients

Table A3. GP parameters used to obtain symbolic expressions of recovered patients for each state with achieved $R^2$ score.
State
GP Parameters (Population Size, Number of Generations, Tournament Selection, Tree Depth, Crossover Coefficient, Subtree Mutation Coefficient, Point Mutation Coefficient, Hoist Mutation Coefficient, Maximum Samples, Constant Range, Parsimony Coefficient)
$R^2$ Score
Alabama1692194101(5, 11)0.940.00840.0120.01860.330.94(−40,687.16, 62,255.12)1.1750.997328
Alaska1819173101(4, 10)0.920.00530.0120.00610.240.91(−96,425.9, 52,000.72)0.2290.992049
Arizona1987109178(4, 7)0.910.03080.0120.04910.910.93(−31,479.48, 24,361.18)1.8470.995554
Arkansas1370171187(3, 10)0.910.05570.0040.01280.780.98(−39,980.95, 93,794.62)1.4260.997685
California1596167145(5, 10)0.930.00620.02090.0270.890.96(−75,968.18, 36,861.41)0.6520.99833
Colorado1742170164(5, 7)0.910.03990.00060.04870.940.98(−30,246.91, 92,280.0)0.1190.998557
Connecticut1777184102(6, 8)0.920.00930.01310.03070.040.92(−34,426.08, 15,684.83)1.2770.998242
Delaware1963196171(6, 7)0.960.00470.01370.00540.220.92(−15,778.35, 39,877.09)0.10.99423
Florida1957101144(6, 12)0.930.00510.00510.02780.740.91(−44,281.68, 36,978.67)0.3950.997043
Georgia1892119149(4, 9)0.90.04820.01050.0150.780.94(−64,708.07, 93,322.46)0.8580.998556
Hawaii1041138177(6, 10)0.90.01330.04530.01470.950.98(−43,162.04, 20,225.95)0.3480.983491
Idaho1967117194(3, 10)0.940.00710.04840.00580.680.98(−84,158.66, 36,695.1)0.250.999529
Illinois1155114109(3, 11)0.940.01140.00120.00270.921(−79,669.27, 92,735.77)1.1120.996273
Indiana1572109184(4, 11)0.920.01210.01950.02540.110.96(−62,380.4, 28,314.4)1.6910.996928
Iowa1363144196(6, 12)0.910.01380.04890.01860.780.98(−81,536.43, 29,347.32)0.2070.998156
Kansas1750144182(5, 9)0.910.02970.03040.02130.950.93(−58,666.97, 75,144.45)1.6780.998437
Kentucky1293183154(6, 10)0.90.02230.02750.01620.830.96(−84,243.44, 82,923.58)1.40.99247
Louisiana1597138198(4, 12)0.930.02240.02740.00240.750.91(−65,660.45, 51,980.82)1.0690.993177
Maine1848110102(4, 10)0.910.05680.02390.00860.030.96(−94,938.64, 41,857.95)0.2210.997116
Maryland1983177197(4, 7)0.90.07080.00810.01740.980.91(−33,498.91, 75,399.0)1.3590.996912
Massachusetts1923123117(5, 11)0.920.00460.00130.06250.410.94(−53,547.7, 53,068.29)0.5990.991613
Michigan1957121149(4, 7)0.920.01510.00480.04270.270.99(−70,498.48, 76,208.22)1.6020.990563
Minnesota1474164161(4, 7)0.950.00850.0040.03620.660.93(−89,378.38, 12,518.38)1.270.999551
Mississippi1878157180(4, 10)0.910.00780.03430.02430.360.99(−69,080.55, 41,650.59)0.1480.998907
Missouri1524114189(6, 11)0.960.00550.00830.01780.710.91(−58,679.61, 18,090.86)1.7030.997492
Montana1737171184(6, 7)0.930.02480.0040.02530.510.94(−29,350.61, 26,050.27)1.8210.991147
Nebraska1736136151(4, 9)0.90.01460.03730.04270.950.93(−88,950.04, 33,806.79)0.7760.991462
Nevada1178146110(3, 7)0.910.04770.00260.03470.10.95(−43,965.03, 72,125.86)1.3270.992045
New Hampshire1678136159(5, 7)0.920.02140.01160.00220.890.91(−28,363.16, 48,743.6)0.6480.993274
New Jersey1193114200(6, 8)0.950.00970.00310.01520.840.91(−38,954.87, 50,274.61)0.6740.997067
New Mexico1642181199(4, 10)0.970.01030.00080.0160.430.94(−41,177.27, 26,347.72)0.3910.997407
New York1779152165(5, 12)0.90.01540.00630.05520.270.94(−82,261.02, 85,378.68)1.560.997782
North Carolina1406130143(5, 11)0.930.05130.00620.00540.690.98(−28,595.52, 66,166.6)0.2210.996477
North Dakota1573164191(5, 10)0.910.04280.02220.01640.870.96(−87,840.07, 16,782.34)0.8970.996419
Ohio1451141117(3, 12)0.910.0390.02570.00990.560.95(−13,223.56, 83,042.99)1.0630.999314
Oklahoma1331188113(6, 8)0.910.00170.02140.05030.270.96(−48,918.28, 92,436.69)0.4840.996927
Oregon1500109115(4, 9)0.90.06790.01090.00880.70.99(−18,150.22, 21,135.98)0.5390.99823
Pennsylvania1650114121(3, 9)0.930.01250.0160.03590.771(−49,036.36, 57,956.7)1.7850.99771
Rhode Island1975180196(6, 9)0.920.0250.00760.00720.940.94(−20,950.37, 87,459.91)0.4590.983645
South Carolina1239176198(3, 9)0.910.00210.050.02060.850.97(−14,444.19, 92,390.17)1.3810.99792
South Dakota1470166121(5, 7)0.920.00280.00230.06030.830.94(−88,393.46, 72,432.92)0.9290.998192
Tennessee1783131123(4, 7)0.930.0050.01710.04740.860.96(−58,354.06, 88,482.19)1.8210.99929
Texas1047133125(3, 9)0.90.05010.00330.03750.840.94(−42,111.51, 86,733.0)1.0570.999476
Utah1253181191(5, 10)0.930.02130.02080.01630.120.92(−42,893.54, 61,858.96)0.3390.995521
Vermont1029177183(3, 8)0.910.01790.01570.05720.460.97(−85,678.92, 59,965.71)0.3240.979788
Virginia1650153151(6, 12)0.940.00240.02270.02710.990.9(−23,993.54, 41,056.55)1.7360.996561
Washington1766134149(5, 10)0.910.0070.01350.02050.050.97(−47,952.8, 78,162.68)1.5490.992944
West Virginia1542175198(3, 12)0.920.03850.00840.02540.70.94(−81,647.2, 33,925.68)1.2410.99826
Wisconsin1884158124(3, 7)0.920.00780.01580.03490.20.92(−58,100.96, 16,423.25)1.5480.998998
Wyoming1778117105(4, 8)0.950.0080.00540.00560.260.92(−45,555.91, 47,028.4)1.9790.996617

References

  1. COVID-19 and vascular disease. EBioMedicine 2020, 58, 102966.
  2. Apolone, G.; Montomoli, E.; Manenti, A.; Boeri, M.; Sabia, F.; Hyseni, I.; Mazzini, L.; Martinuzzi, D.; Cantone, L.; Milanese, G.; et al. Unexpected detection of SARS-CoV-2 antibodies in the prepandemic period in Italy. Tumori J. 2020, 0300891620974755.
  3. Coronavirus Disease (COVID-19): How Is It Transmitted? World Health Organization. Available online: https://www.who.int/news-room/q-a-detail/coronavirus-disease-covid-19-how-is-it-transmitted (accessed on 12 December 2020).
  4. Transmission of COVID-19. European Centre for Disease Prevention and Control. 2020. Available online: https://www.ecdc.europa.eu/en/covid-19/latest-evidence/transmission (accessed on 12 December 2020).
  5. Grant, M.C.; Geoghegan, L.; Arbyn, M.; Mohammed, Z.; McGuinness, L.; Clarke, E.L.; Wade, R. The Prevalence of Symptoms in 24,410 Adults Infected by the Novel Coronavirus (SARS-CoV-2; COVID-19): A Systematic Review and Meta-Analysis of 148 Studies from 9 Countries. 2020. Available online: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3582819 (accessed on 12 December 2020).
  6. Symptoms of Coronavirus. Centers for Disease Control and Prevention. Available online: https://www.cdc.gov/coronavirus/2019-ncov/symptoms-testing/symptoms.html (accessed on 12 December 2020).
  7. Lorencin, I.; Baressi Šegota, S.; Anđelić, N.; Blagojević, A.; Šušteršić, T.; Protić, A.; Arsenijević, M.; Ćabov, T.; Filipović, N.; Car, Z. Automatic Evaluation of the Lung Condition of COVID-19 Patients Using X-ray Images and Convolutional Neural Networks. J. Pers. Med. 2021, 11, 28.
  8. Coronavirus. Available online: https://www.who.int/health-topics/coronavirus (accessed on 12 December 2020).
  9. Q & A on COVID-19: Basic Facts. European Centre for Disease Prevention and Control. 2020. Available online: https://www.ecdc.europa.eu/en/covid-19/facts/questions-answers-basic-facts (accessed on 12 December 2020).
  10. Long, C.; Xu, H.; Shen, Q.; Zhang, X.; Fan, B.; Wang, C.; Zeng, B.; Li, Z.; Li, X.; Li, H. Diagnosis of the Coronavirus disease (COVID-19): rRT-PCR or CT? Eur. J. Radiol. 2020, 126, 108961.
  11. Zhang, J.J.; Cao, Y.Y.; Dong, X.; Wang, B.C.; Liao, M.Y.; Lin, J.; Yan, Y.Q.; Akdis, C.A.; Gao, Y.D. Distinct characteristics of COVID-19 patients with initial rRT-PCR-positive and rRT-PCR-negative results for SARS-CoV-2. Allergy 2020.
  12. Car, Z.; Baressi Šegota, S.; Anđelić, N.; Lorencin, I.; Mrzljak, V. Modeling the Spread of COVID-19 Infection Using a Multilayer Perceptron. Comput. Math. Methods Med. 2020, 2020.
  13. Dong, E.; Du, H.; Gardner, L. An interactive web-based dashboard to track COVID-19 in real time. Lancet Infect. Dis. 2020, 20, 533–534.
  14. Štifanić, D.; Musulin, J.; Miočević, A.; Baressi Šegota, S.; Šubić, R.; Car, Z. Impact of COVID-19 on Forecasting Stock Prices: An Integration of Stationary Wavelet Transform and Bidirectional Long Short-Term Memory. Complexity 2020.
  15. Hu, Z.; Ge, Q.; Jin, L.; Xiong, M. Artificial intelligence forecasting of COVID-19 in China. arXiv 2020, arXiv:2002.07112.
  16. Ribeiro, M.H.D.M.; da Silva, R.G.; Mariani, V.C.; dos Santos Coelho, L. Short-term forecasting COVID-19 cumulative confirmed cases: Perspectives for Brazil. Chaos Solitons Fractals 2020, 135, 109853.
  17. Yan, L.; Zhang, H.T.; Goncalves, J.; Xiao, Y.; Wang, M.; Guo, Y.; Sun, C.; Tang, X.; Jing, L.; Zhang, M.; et al. An interpretable mortality prediction model for COVID-19 patients. Nat. Mach. Intell. 2020, 2, 283–288.
  18. Chimmula, V.K.R.; Zhang, L. Time series forecasting of COVID-19 transmission in Canada using LSTM networks. Chaos Solitons Fractals 2020, 135, 109864.
  19. Chakraborty, T.; Ghosh, I. Real-time forecasts and risk assessment of novel coronavirus (COVID-19) cases: A data-driven analysis. Chaos Solitons Fractals 2020, 135, 109850.
  20. Cai, W.; Pacheco-Vega, A.; Sen, M.; Yang, K.T. Heat transfer correlations by symbolic regression. Int. J. Heat Mass Transf. 2006, 49, 4352–4359.
  21. Gustafson, S.; Burke, E.K.; Krasnogor, N. On improving genetic programming for symbolic regression. In Proceedings of the 2005 IEEE Congress on Evolutionary Computation, Scotland, UK, 2–5 September 2005; Volume 1, pp. 912–919.
  22. Keijzer, M. Scaled symbolic regression. Genet. Program. Evolvable Mach. 2004, 5, 259–269.
  23. Raymond, C.; Chen, Q.; Xue, B.; Zhang, M. Adaptive weighted splines: A new representation to genetic programming for symbolic regression. In Proceedings of the 2020 Genetic and Evolutionary Computation Conference, Cancún, Mexico, 8–12 July 2020; pp. 1003–1011.
  24. Marko, K.A.; Hampo, R.J. Application of genetic programming to control of vehicle systems. In Proceedings of the Intelligent Vehicles '92 Symposium, Detroit, MI, USA, 29 June–1 July 1992; pp. 191–195.
  25. Trujillo, L.; Olague, G. Using evolution to learn how to perform interest point detection. In Proceedings of the IEEE 18th International Conference on Pattern Recognition (ICPR’06), Hong Kong, China, 20–24 August 2006; Volume 1, pp. 211–214.
  26. Martin, M.C. Evolving visual sonar: Depth from monocular images. Pattern Recognit. Lett. 2006, 27, 1174–1180.
  27. Hu, X.; Ding, L.; Shang, J.; Fan, H.; Novack, T.; Noskov, A.; Zipf, A. Data-driven approach to learning salience models of indoor landmarks by using genetic programming. Int. J. Digit. Earth 2020, 13, 1–28.
  28. Chen, S.H.; Duffy, J.; Yeh, C.H. Equilibrium selection via adaptation: Using genetic programming to model learning in a coordination game. In Advances in Dynamic Games; Springer: Berlin, Germany, 2005; pp. 571–598.
  29. Neely, C.J.; Weller, P.A.; Ulrich, J.M. The adaptive markets hypothesis: Evidence from the foreign exchange market. J. Financ. Quant. Anal. 2009, 44, 467–488.
  30. Agapitos, A.; Brabazon, A.; O’Neill, M. Genetic programming with memory for financial trading. In European Conference on the Applications of Evolutionary Computation; Springer: Berlin, Germany, 2016; pp. 19–34.
  31. Michell, K.; Kristjanpoller, W. Generating trading rules on U.S. Stock Market using strongly typed genetic programming. Soft Comput. 2020, 24, 3257–3274.
  32. Cpalka, K.; Łapa, K.; Przybył, A. A new approach to design of control systems using genetic programming. Inf. Technol. Control. 2015, 44, 433–442.
  33. Enríquez-Zárate, J.; Trujillo, L.; de Lara, S.; Castelli, M.; Emigdio, Z.; Muñoz, L.; Popovič, A. Automatic modeling of a gas turbine using genetic programming: An experimental study. Appl. Soft Comput. 2017, 50, 212–222.
  34. Zhang, Y.; Hu, T.; Liang, X.; Ali, M.Z.; Shabbir, M.N.S.K. Fault detection and classification for induction motors using genetic programming. In European Conference on Genetic Programming; Springer: Berlin, Germany, 2019; pp. 178–193.
  35. Dou, T.; Lopes, Y.K.; Rockett, P.; Hathway, E.A.; Saber, E. Model predictive control of non-domestic heating using genetic programming dynamic models. Appl. Soft Comput. 2020, 97, 106695.
  36. Tan, M.S.; Tan, J.W.; Chang, S.W.; Yap, H.J.; Kareem, S.A.; Zain, R.B. A genetic programming approach to oral cancer prognosis. PeerJ 2016, 4, e2482.
  37. Brameier, M.; Banzhaf, W. A comparison of linear genetic programming and neural networks in medical data mining. IEEE Trans. Evol. Comput. 2001, 5, 17–26.
  38. Salgotra, R.; Gandomi, M.; Gandomi, A.H. Time Series Analysis and Forecast of the COVID-19 Pandemic in India using Genetic Programming. Chaos Solitons Fractals 2020, 135, 109945.
  39. Koza, J.R. Genetic Programming: On the Programming of Computers by Means of Natural Selection; MIT Press: Cambridge, MA, USA, 1992; Volume 1.
  40. Stephens, T. GPLearn (2015). 2019. Available online: https://gplearn.readthedocs.io/en/stable/index.html (accessed on 12 December 2020).
  41. Anđelić, N.; Šegota, S.B.; Lorencin, I.; Mrzljak, V.; Car, Z. Estimation of COVID-19 epidemic curves using genetic programming algorithm. Health Inform. J. 2021, 27, 1460458220976728.
  42. Md, J.M. When Did COVID-19 Arrive and Could We Have Spotted It Earlier? 2020. Available online: https://www.medpagetoday.com/infectiousdisease/covid19/86291 (accessed on 12 December 2020).
  43. Public Health Response to the Initiation and Spread of Pandemic COVID-19 in the United States, 24 February–21 April 2020. Available online: https://www.cdc.gov/mmwr/volumes/69/wr/mm6918e2.htm (accessed on 12 December 2020).
  44. Alex Horton, M.B. Trump Announces Travel Ban from Most of Europe. 2020. Available online: https://www.washingtonpost.com/world/2020/03/11/coronavirus-live-updates/ (accessed on 12 December 2020).
  45. Liptak, K. White House Advises Public to Avoid Groups of More Than 10, Asks People to Stay Away from Bars and Restaurants. 2020. Available online: https://edition.cnn.com/2020/03/16/politics/white-house-guidelines-coronavirus/index.html (accessed on 8 January 2021).
  46. U.S. Embassy Panama City|19 March,..T.E. Global Level 4 Health Advisory—Do Not Travel. 2020. Available online: https://pa.usembassy.gov/globallevel-4-health-advisory-do-not-travel-march-19-2020/ (accessed on 8 January 2021).
  47. Khazan, O. The 4 Key Reasons the U.S. Is So Behind on Coronavirus Testing. 2020. Available online: https://www.theatlantic.com/health/archive/2020/03/whycoronavirus-testing-us-so-delayed/607954/ (accessed on 7 January 2021).
  48. Hernandez, S. This Is How a Group Linked to Betsy DeVos Is Organizing Protests to End Social Distancing, Now with Trump’s Support. 2020. Available online: https://www.buzzfeednews.com/article/salvadorhernandez/coronavirus-quarantine-protests-facebook-groups (accessed on 7 January 2021).
  49. Wu, J.; Chiwaya, N.; Smith, S. Map: Protests and Rallies for George Floyd Spread Across the Country. 2020. Available online: https://www.nbcnews.com/news/us-news/map-protests-rallies-george-floyd-spread-across-country-n1220976 (accessed on 8 January 2021).
  50. Durkee, A. Medical Experts Tell Government: ‘Shut It Down Now, and Start Over’. 2020. Available online: https://www.forbes.com/sites/alisondurkee/2020/07/24/medical-experts-tell-government-shut-it-down-now-and-start-over/ (accessed on 8 January 2021).
  51. Board, T.E. America Could Control the Pandemic by October. Let’s Get to It. 2020. Available online: https://www.nytimes.com/2020/08/08/opinion/testing-lockdown.html (accessed on 8 January 2021).
  52. Resetting Our Response: Changes Needed in the U.S. Approach to COVID-19. Available online: https://www.centerforhealthsecurity.org/our-work/publications/resetting-our-response-changes-needed-in-the-us-approach-to-covid-19 (accessed on 7 January 2021).
  53. Walker, M.; Healy, J. A Motorcycle Rally in a Pandemic? We Kind of Knew What Was Going to Happen. 2020. Available online: https://www.nytimes.com/2020/11/06/us/sturgis-coronavirus-cases.html (accessed on 8 January 2021).
  54. COVID-19 Outbreak Associated with a 10-Day Motorcycle Rally in a Neighboring State—Minnesota, August–September 2020. 2020. Available online: https://www.cdc.gov/mmwr/volumes/69/wr/mm6947e1.htm (accessed on 7 January 2021).
  55. Mansfield, E.; Salman, J.; Pulver, D.V. Trump’s Campaign Made Stops Nationwide. Coronavirus Cases Surged in his Wake in at Least Five Places. USA Today. 2020. Available online: https://eu.usatoday.com/story/news/investigations/2020/10/22/trumps-campaign-made-stops-nationwidethen-coronavirus-cases-surged/3679534001/ (accessed on 7 January 2021).
  56. Moon, S. A Seemingly Healthy Woman’s Sudden Death Is Now the First Known US Coronavirus-Related Fatality. 2020. Available online: https://edition.cnn.com/2020/04/23/us/california-woman-first-coronavirus-death/index.html (accessed on 7 January 2021).
  57. Shumaker, L. U.S. Coronavirus Deaths Top 20,000, Highest in World Exceeding Italy: Reuters Tally. 2020. Available online: https://cn.reuters.com/article/health-coronavirus-usa-casualties/u-s-coronavirus-deaths-highest-in-world-exceeding-italy-reuters-tally-idINKCN21T0O2 (accessed on 7 January 2021).
  58. U.S. Coronavirus Death Toll Surpasses 100,000. 2020. Available online: https://www.washingtonpost.com/graphics/2020/national/100000-deaths-american-coronavirus/ (accessed on 7 January 2021).
  59. Woolf, S.H.; Chapman, D.A.; Lee, J.H. COVID-19 as the Leading Cause of Death in the United States. JAMA 2020, 325, 123–124. [Google Scholar]
  60. Sobol, I.M. Sensitivity analysis for non-linear mathematical models. Math. Model. Comput. Exp. 1993, 1, 407–414. [Google Scholar]
  61. Herman, J.; Usher, W. SALib: An open-source Python library for sensitivity analysis. J. Open Source Softw. 2017, 2, 97. [Google Scholar] [CrossRef]
Figure 1. The number of confirmed, deceased and recovered patients in the U.S. on 3 December 2020.
Figure 2. The obtained results for confirmed patients: (a) the accuracy of the obtained symbolic expressions used for estimation of the number of confirmed patients for each state, achieved on the testing dataset in terms of $R^2$ score; and (b) the comparison of estimated and actual numbers of confirmed patients through time.
Figure 3. The obtained results for deceased patients: (a) the accuracy of the obtained symbolic expressions used for estimation of the number of deceased patients for each state, achieved on the testing dataset in terms of $R^2$ score; and (b) the comparison of estimated and actual numbers of deceased patients through time.
Figure 4. The obtained results for recovered patients: (a) the accuracy of the obtained symbolic expressions used for estimation of the number of recovered patients for each state, achieved on the testing dataset in terms of $R^2$ score; and (b) the comparison of estimated and actual numbers of recovered patients through time.
Figure 5. The comparison of estimated actual patients in the U.S. with the real data from the dataset.
Table 1. The last ten instances of the confirmed number of patients dataset for state of Alabama.

| Instance Number | Latitude | Longitude | Day | Number of Confirmed Patients |
|---|---|---|---|---|
| 308 | 32.318230 | −86.902298 | 308 | 236,865 |
| 309 | 32.318230 | −86.902298 | 309 | 239,318 |
| 310 | 32.318230 | −86.902298 | 310 | 241,957 |
| 311 | 32.318230 | −86.902298 | 311 | 242,874 |
| 312 | 32.318230 | −86.902298 | 312 | 244,993 |
| 313 | 32.318230 | −86.902298 | 313 | 247,229 |
| 314 | 32.318230 | −86.902298 | 314 | 249,524 |
| 315 | 32.318230 | −86.902298 | 315 | 252,900 |
| 316 | 32.318230 | −86.902298 | 316 | 256,828 |
| 317 | 32.318230 | −86.902298 | 317 | 260,359 |
Table 2. Input and output parameters defined and used in GP to obtain symbolic expression for estimation of number of confirmed/deceased patients for each state and recovered patients for the entire U.S.

| Parameters | Confirmed Patients Analysis | Deceased Patients Analysis | Recovered Patients Analysis |
|---|---|---|---|
| Latitude | $X_0$ | $X_0$ | $X_0$ |
| Longitude | $X_1$ | $X_1$ | $X_1$ |
| Day | $X_2$ | $X_2$ | $X_2$ |
| Number of patients per day | $y_{Confirmed}$ | $y_{Deceased}$ | $y_{Recovered}$ |
Table 3. The list of GP parameters used to obtain symbolic expressions for the estimation of the number of confirmed patients for each state.

| Parameter | Confirmed: Lower Bound | Confirmed: Upper Bound | Deceased: Lower Bound | Deceased: Upper Bound | Recovered: Lower Bound | Recovered: Upper Bound |
|---|---|---|---|---|---|---|
| Population Size | 200 | 1000 | 1000 | 2000 | 1000 | 2000 |
| Number of generations | 100 | 200 | 100 | 200 | 100 | 200 |
| Tournament Size | 20 | 100 | 100 | 200 | 100 | 200 |
| Tree Depth | 3–6 | 7–12 | 3–6 | 7–12 | 3–6 | 7–12 |
| Crossover coefficient | 0.9 | 1 | 0.9 | 1 | 0.9 | 1 |
| Subtree mutation coefficient | 0.001 | 0.1 | 0.001 | 0.1 | 0.001 | 0.1 |
| Hoist mutation coefficient | 0.001 | 0.1 | 0.001 | 0.1 | 0.001 | 0.1 |
| Point mutation coefficient | 0.001 | 0.1 | 0.001 | 0.1 | 0.001 | 0.1 |
| Stopping criteria | 0.001 | 1 | 0.001 | 1 | 0.001 | 1 |
| Maximum number of samples | 0.9 | 1 | 0.9 | 1 | 0.9 | 1 |
| Constant range | −100,000 | 100,000 | −100,000 | 100,000 | −100,000 | 100,000 |
| Parsimony coefficient | 0.1 | 2 | 0.01 | 0.2 | 0.1 | 2 |
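The lower and upper bounds in Table 3 can be read as a search space from which candidate GP configurations are drawn. The following is a minimal sketch of drawing one configuration from the Confirmed columns. It assumes uniform random sampling, and all names (`sample_gp_parameters`, the dictionary keys) are hypothetical. The parsimony coefficient row is left out, and treating the constant range as a fixed interval is also an assumption.

```python
import random

# Bounds transcribed from the "Confirmed" columns of Table 3.
# The key names are illustrative and not taken from the article.
INT_BOUNDS = {
    "population_size": (200, 1000),
    "generations": (100, 200),
    "tournament_size": (20, 100),
}
FLOAT_BOUNDS = {
    "crossover": (0.9, 1.0),
    "subtree_mutation": (0.001, 0.1),
    "hoist_mutation": (0.001, 0.1),
    "point_mutation": (0.001, 0.1),
    "stopping_criteria": (0.001, 1.0),
    "max_samples": (0.9, 1.0),
}
# Tree depth: minimum depth drawn from 3-6, maximum depth drawn from 7-12.
TREE_DEPTH_BOUNDS = ((3, 6), (7, 12))
# Assumption: the constant range is a fixed interval, not a sampled one.
CONSTANT_RANGE = (-100_000, 100_000)


def sample_gp_parameters(rng):
    """Draw one candidate GP configuration uniformly from the bounds above."""
    params = {name: rng.randint(lo, hi) for name, (lo, hi) in INT_BOUNDS.items()}
    params.update(
        {name: rng.uniform(lo, hi) for name, (lo, hi) in FLOAT_BOUNDS.items()}
    )
    (min_lo, min_hi), (max_lo, max_hi) = TREE_DEPTH_BOUNDS
    params["tree_depth"] = (rng.randint(min_lo, min_hi), rng.randint(max_lo, max_hi))
    params["constant_range"] = CONSTANT_RANGE
    return params


# A seeded generator keeps the drawn configuration reproducible.
candidate = sample_gp_parameters(random.Random(0))
```

Each call returns an independent candidate, so a search loop would call `sample_gp_parameters` repeatedly and keep the configuration whose model scores best on held-out data.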
Table 4. First-order and total-effect Sobol indices measuring model sensitivity to parameters in symbolic expression for the estimation of the number of confirmed cases for the entire U.S. ($X_0$ latitude, $X_1$ longitude and $X_2$ day).

| Variable | Distribution | First-Order Sobol Index | Total-Effect Sobol Index |
|---|---|---|---|
| $X_0$ | (19.74176, 66.16051) | 0.592586 | 1.036779 |
| $X_1$ | (−155.844, −69.9722) | 0.00047 | 0.00417 |
| $X_2$ | (0, 317) | 0.367189 | 1.15179916 |
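For reference, the first-order and total-effect indices reported in these tables are conventionally defined as follows. This is the standard variance-based formulation, given as general background and not quoted from the article. Here $Y$ is the model output, $X_i$ is one input and $X_{\sim i}$ denotes all inputs except $X_i$.

```latex
S_i = \frac{\operatorname{Var}_{X_i}\!\left( \mathbb{E}_{X_{\sim i}}[\,Y \mid X_i\,] \right)}{\operatorname{Var}(Y)},
\qquad
S_{T_i} = \frac{\mathbb{E}_{X_{\sim i}}\!\left[ \operatorname{Var}_{X_i}(\,Y \mid X_{\sim i}\,) \right]}{\operatorname{Var}(Y)}
        = 1 - \frac{\operatorname{Var}_{X_{\sim i}}\!\left( \mathbb{E}_{X_i}[\,Y \mid X_{\sim i}\,] \right)}{\operatorname{Var}(Y)}
```

$S_i$ measures the share of output variance explained by $X_i$ alone, while $S_{T_i}$ also includes every interaction in which $X_i$ takes part.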
Table 5. First-order and total-effect Sobol indices measuring model sensitivity to parameters in symbolic expression for the estimation of the number of deceased cases for the entire U.S. ($X_0$ latitude, $X_1$ longitude and $X_2$ day).

| Variable | Distribution | First-Order Sobol Index | Total-Effect Sobol Index |
|---|---|---|---|
| $X_0$ | (19.74176, 66.16051) | 0.10282 | 0.124253 |
| $X_1$ | (−155.844, −69.9722) | 0.002036 | 0.0127 |
| $X_2$ | (0, 317) | 0.87679 | 0.99746 |
Table 6. First-order and total-effect Sobol indices measuring model sensitivity to parameters in symbolic expression for the estimation of the number of recovered cases for the entire U.S.

| Variable | Distribution | First-Order Sobol Index | Total-Effect Sobol Index |
|---|---|---|---|
| $X_0$ | (19.74176, 66.16051) | 0.13976 | 0.010748 |
| $X_1$ | (−155.844, −69.9722) | 0.016542 | 0.13089 |
| $X_2$ | (0, 317) | 0.836388 | 0.986203 |
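Sobol indices such as those in Tables 4–6 can be estimated by Monte Carlo sampling over the input ranges listed in the Distribution column. The sketch below is a plain-Python illustration and not the authors' procedure. It uses simple pseudo-random sampling instead of a quasi-random Sobol sequence, and the commonly used Saltelli (first-order) and Jansen (total-effect) estimators. `toy_model` is a hypothetical stand-in and is not one of the article's symbolic expressions.

```python
import random

# Input ranges taken from the "Distribution" column of the Sobol tables.
BOUNDS = [
    (19.74176, 66.16051),  # X0: latitude
    (-155.844, -69.9722),  # X1: longitude
    (0.0, 317.0),          # X2: day
]


def toy_model(x):
    """Hypothetical stand-in model (NOT the article's symbolic expression)."""
    return 3.0 * x[0] + 0.1 * x[1] + x[2]


def sample_matrix(n, rng):
    """Draw n input rows, each variable uniform over its range."""
    return [[rng.uniform(lo, hi) for lo, hi in BOUNDS] for _ in range(n)]


def sobol_indices(model, n=8192, seed=0):
    """Estimate first-order and total-effect Sobol indices by Monte Carlo."""
    rng = random.Random(seed)
    a = sample_matrix(n, rng)
    b = sample_matrix(n, rng)
    f_a = [model(row) for row in a]
    f_b = [model(row) for row in b]

    # Centre the outputs; this leaves the estimators unbiased in expectation
    # and reduces their sampling noise.
    mean = (sum(f_a) + sum(f_b)) / (2 * n)
    f_a = [v - mean for v in f_a]
    f_b = [v - mean for v in f_b]
    variance = (sum(v * v for v in f_a) + sum(v * v for v in f_b)) / (2 * n)

    first, total = [], []
    for i in range(len(BOUNDS)):
        # AB_i: rows of A with column i replaced by column i of B.
        f_ab = [
            model(ra[:i] + [rb[i]] + ra[i + 1:]) - mean for ra, rb in zip(a, b)
        ]
        # Saltelli-style first-order estimator.
        v_i = sum(fb * (fab - fa) for fa, fb, fab in zip(f_a, f_b, f_ab)) / n
        # Jansen-style total-effect estimator.
        vt_i = sum((fa - fab) ** 2 for fa, fab in zip(f_a, f_ab)) / (2 * n)
        first.append(v_i / variance)
        total.append(vt_i / variance)
    return first, total


first_order, total_effect = sobol_indices(toy_model)
```

Because `toy_model` is additive, its first-order and total-effect indices coincide up to sampling noise. A model with interactions between inputs would show total-effect values above the first-order ones.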
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Anđelić, N.; Šegota, S.B.; Lorencin, I.; Jurilj, Z.; Šušteršič, T.; Blagojević, A.; Protić, A.; Ćabov, T.; Filipović, N.; Car, Z. Estimation of COVID-19 Epidemiology Curve of the United States Using Genetic Programming Algorithm. Int. J. Environ. Res. Public Health 2021, 18, 959. https://doi.org/10.3390/ijerph18030959