1 Introduction

The novel coronavirus disease 2019 (COVID-19) poses a significant and urgent threat to global health. Since the outbreak in early December 2019 in the Hubei Province of the People’s Republic of China, the number of patients confirmed to have the disease has exceeded 775 000 in more than 160 countries, and the number of people infected is probably much more significant. Despite public health risks targeted at containing the disease and delaying the spread, many countries have been faced with a critical care catastrophe. Outbreaks lead to significant increases in the demand for hospital beds and medical gear shortage, while medical personnel themselves could also get contaminated (Wynants et al. 2020; Ohata et al. 2020).

Furthermore, epidemiological time-series prediction represents an essential role in public health, leaving the directors to improve strategic plans. Forecasting diseases as realistic as possible is essential due to their impact on the public health system. Machine learning models have been used to forecast the epidemiological time series over the years (Wynants et al. 2020).

Recognition of epidemic growth patterns across temporal and social factors can improve our capability to create epidemic transmission designs, including the critical job of predicting the estimated intensity of the outbreak morbidity or mortality impact at the end. Several studies consider the epidemic growth in a large population a stochastic event; the infection increases exponentially among subjects, each of direct contact, closeness, or ambient traces (Fanelli and Piazza 2020). Explore the rise kinetics of an epidemic can help create well-grounded algorithms to predict and learn the essential features of infectious diseases’ growth dynamics. The strength of the outbreak is represented in mathematical functions, modeling the transmission, and this is commonly estimated using time-series analysis describing the plague spread as a function of time (Viboud et al. 2016).

The study’s primary motivation is to estimate with a particular level of accuracy the number of deaths because of COVID-19, handling to model the development of the pandemic. Predicting the number of potential deaths from COVID-19 can provide authorities and decision-makers with signs for purchasing respirators and pandemic prevention policies. Therefore, this work presents itself as an essential contribution to fighting the pandemic.

Kalman Filter (KF) is a widely used method for tracking and navigation and filtering and time series (Zeng and Ghanem 2020). The problem of Monitoring Outbreak spreading is pertinent to the control of morbidity. A compartment model can clarify the transmission dynamics of an outbreak. Precisely, the estimation of epidemic spreading on networks can be accomplished by a nonlinear Kalman filter, and it is an instrument for state estimation of nonlinear systems (Wang et al. 2020).

Designing and tuning machine learning methods are a labor- and time-intensive task that requires extensive experience. The field of automated machine learning (AutoML) relies on automating this task. AutoML tools enable novice users to create useful machine learning units, while experts can use them to free up valuable time for other tasks. Many strategies have been developed to build and optimize model learning pipelines or optimize and build deep neural networks in recent years (Gijsbers et al. 2019).

Ceará is one of the 27 federative units in Brazil. It is located in the north of the Northeast Region and borders the Atlantic Ocean to the north and northeast, the Rio Grande do Norte and Paraba to the east, Pernambuco to the south, and Piau to the west. Its total area is 148,920,472 km, or 9.37% of the Northeast area and 1.74% of Brazil’s surface. The state’s population is 9,075,649 inhabitants, as indicated by the Brazilian Institute of Geography and Statistics (IBGE), in 2018, which is the eighth-most populous state in the country. Today, Ceará has more than 235,222 confirmed cases of COVID-19 and 8850 deaths due to the disease. The cities with the highest incidence of confirmed cases per 100 thousand inhabitants are Acarape (11,434.1), Frecheirinha (10,560), Groaras (6532.3), Chaval (6106.1), and Quixel (6051.4).

This paper presents an objective method of forecasting the continuation of this COVID-19, working with a straightforward but highly effective process to achieve that. Assuming that the information used is dependable and the future will continue to stick to this disease’s latest pattern, our predictions suggest a continuing growth in the supported COVID-19 instances. This paper clarifies the deadline of a live calling exercise with enormous potential consequences for planning and decision making and offers objective forecasts for its confirmed instances of COVID-19.

This study’s main novelty uses an AutoML solution to forecast the epidemic growth of the state of Ceará in Brazil. In a nutshell, the primary contributions of this paper are:

  • use a Kalman Filter solution to forecast the epidemic growth on Ceará State ;

  • use an AutoML solution to forecast the epidemic growth on Ceará State ;

  • apply a comparative study of different methods of the forecast using AutoML.

The use of Kalman Filter was applied to merge the death curve of other countries with data of the state of Ceará in Brazil in order to obtain a long-term prediction. We use Auto Machine Learning tools to discover the best models for predicting the number of cases. We could only use these tools after the pandemic, where sufficient training data for the models can be obtained. The third contribution is applying the two techniques presented in the state of Ceará, validating the accuracy and precision of the techniques.

2 Literature review

A model is described of several numerical equations that are set to describe the interaction between various variables within specific methods. A model is not a perfect portrayal of reality. Commonly, we have no perfect understanding of the boundary conditions of the model and its uncertainty. We need to recognize the time progression of the probability density function (pdf) for the model state. With knowledge of the pdf for the model state, we can obtain knowledge about the model uncertainty. For time-based solutions, sequential data assimilation methods utilize the previous data analysis scheme to update the model state consecutively. The before-mentioned approaches have demonstrated helpful for several purposes, where new observations are sequentially absorbed into the model when they become ready.

Yang et al. use the ensemble Kalman Filter as a short period predictor and test non-pharmaceutical interventions’ success on the epidemic spreading. The study builds an individual-level-based network representation and performs stochastic reproductions to study the pestilences in Hubei Province at its initial stage and examine the plague dynamics under several situations (Yang et al. 2020). Sameni uses an extended Kalman Filter for joint parameters and variables for the estimates (Fanelli and Piazza 2020).

The task of tuning hyperparameters for different machine learning models is also highly likely to be time-consuming. In a more extended Computer Science-specific period, tuning of hyperparameters is an investigation procedure which, in this case, can be hugely exhaustive.

Deep learning (DL) methods have penetrated all facets of our lives and brought us a fantastic advantage. However, building a high-quality DL platform for a particular task depends upon human experience, hindering DL software to more regions (He et al. 2021).

To decrease these onerous development expenses, a novel notion of automating the whole pipeline of machine learning (ML) has surfaced, i.e., automatic machine learning (AutoML). There are a variety of definitions of AutoML. According to (Zöller and Huber 1993), AutoML was made to decrease information scientists’ need and enable domain experts to automatically assemble ML applications without much demand for statistical and ML knowledge. In [9], AutoML is described as a blend of automation and ML.

Automated machine learning is a natural solution to the shortage of information scientists. It can drastically increase information scientists’ performance and efficacy by speeding up work cycles, improving model accuracy, and even potentially replacing the need for data scientists. Automated machine learning (AutoML) becomes a promising strategy to construct a DL system with no expert support and an increasing number of researchers (He et al. 2021). AutoML aims to enhance a new way to develop ML applications by automation. ML experts can benefit from AutoML by automating tiresome tasks like hyperparameter optimization (HPO), leading to higher efficiency (Zöller and Huber 1993).

From the early 2000s, the earliest efficient approaches for HPO are suggested, for restricted applications, e.g., tuning C and \(\gamma \) of a support vector system (SVM) (Momma and Bennett 2002). Additionally, in 2004, the first automatic feature selection methods are released (Samanta 2004). A full model selection has been the initial effort to automatically construct a whole ML pipeline by simultaneously choosing a preprocessing, feature selection, and classification algorithm while controlling every method’s hyperparameters (Escalante et al. 2009). From 2011, several different ways of applying Bayesian optimization for hyperparameter tuning (Komer et al. 2014; Snoek et al. 2012) and model selection (Thornton et al. 2013). In 2015, Kanter and Veeramachaneni presented the automatic feature engineering without domain knowledge (Kanter and Veeramachaneni 2015). Ardabili et al. use a multi-layered perceptron (MLP) and adaptive network-based fuzzy inference system (ANFIS) to forecast COVID-19 cases (Ardabili et al. 2020). Pinter et al. use hybrid machine learning methods of adaptive network-based fuzzy inference system (ANFIS) and multi-layered perceptron-imperialist competitive algorithm (MLP-ICA) to predict time series of COVID-19 infected individuals and mortality rate (Pinter et al. 2020). Erraissi et al. present a Spark ML approach to predict COVID-19 cases (Erraissi et al. 2020).

Nanda et al. use the ARIMA model and SIR Model to generate the short term forecasts of the COVID-19 spread in SAARC countries, i.e., India, Afghanistan, Sri-Lanka, Maldives, Bhutan, Pakistan, Nepal, and Bangladesh, using the daily reported number of cases from 22 January 2020 up to 01 April 2020 (Nanda 2020).

Escobar et al. develop a method that estimates the probability that a sample will test positive for SARS-Cov-2 based on the sample’s complementary information using H2O.Ai AutoML. The study trained a machine learning model on samples from more than 8,000 patients tested for SARS-Cov-2 from April to July in Bogot, Colombia (Escobar et al. 2020). Ribeiro et al. use the autoregressive integrated moving average (ARIMA), cubist regression (CUBIST), random forest (RF), ridge regression (RIDGE), support vector regression (SVR), and stacking ensemble learning are evaluated in the task of time-series forecasting with one, three, and six days ahead the COVID-19 cumulative confirmed cases in ten Brazilian states with a high daily incidence (Ribeiro et al. 2020).

3 Material and methods

3.1 AutoML

Machine learning (ML) is at the forefront of the rising popularity of data-driven software applications. The consequent rapid proliferation of ML technology, explosive data growth, and lack of data science expertise have caused the industry to face increasingly challenging demands to stay informed about fast-paced develop-and-deploy design lifecycles. Recent academic and industrial research efforts have started to deal with this issue through automated machine learning (AutoML) pipelines and have concentrated on design performance because of the first-order design aim (Yakovlev et al. 2020; Santos et al. 2018; Chouhan et al. 2020; Ding et al. 2020; Rodrigues et al. 2018; De Souza et al. 2019; Dourado et al. 2020; Muhammad et al. 2020; Selvachandran et al. 2019; Sodhro et al. 2016, 2017, 2019a, b, 2020).

The AutoML pipeline comprises several processes: data preparation, feature engineering, model generation, and model analysis. Model generation can be further divided into optimization and search methods. The search space defines the design principles of ML versions, which may be divided into two classes: the conventional ML models (e.g., SVM and KNN), and neural architectures (He et al. 2021). Researchers have handled this optimization problem using several different methods. The first approach is primarily based on Bayesian Optimization (Komer et al. 2014; Kotthoff et al. 2017), which employs a probabilistic model to catch distinct hyperparameter configurations and their performance. Auto-sklearn, one of the most notable works relying on this approach, embraced a random-forest-based sequential model-based optimization technique for overall algorithm configuration. It utilizes meta-learning to recognize a previously optimized dataset closest to the given dataset and utilizes the famous dataset’s configuration to bootstrap the iterative optimization procedure.

AutoML approaches differ in their optimization process (e.g., Bayesian Optimization or Genetic Programming) and the pipelines they Create (e.g., with or without fixed arrangement). There are lots of Python libraries offered for performing automatic machine learning. All of these try to attain more or less the same goal, that of accomplishing the machine learning procedure. The following are a few of the most widely-used Python libraries for automatic machine learning:

  • Auto-Sklearn

  • TPOT

  • Auto-Keras

  • H2O.ai

  • Google’s AutoML.

Google’s AutoML and Auto-Keras use an algorithm called Neural Architecture Search (NAS). TPOT is a Python automatic system learning that optimizes machine learning pipelines using genetic programming.

3.2 Neural architecture search problem

Deep learning has empowered outstanding progress over the past years on an assortment of tasks, including image recognition, speech recognition, and machine translation. Architectures have been mainly developed manually by human experts, which can be a time consuming and error-prone procedure. As a result of this, there is growing interest in automatic neural search procedures (Wistuba et al. 2019).

Neural Architecture Search (NAS) is the process of automating architecture engineering, is consequently a logical next step in automating machine learning. Neural Architecture Search algorithm tries automatically to search the most optimal architecture and corresponding parameters for a problem. Already, NAS methods have outperformed manually designed architectures on some tasks like image classification, object detection, or semantic segmentation (Bender et al. 2018). NAS is a subfield of AutoML and has significant overlap with hyperparameter optimization and meta-learning. We categorize NAS’s approaches based on three dimensions: research space, research technique, and performance estimation strategy (Wistuba et al. 2019).

Given a neural architecture search space S, the input data D divided into \(D_{train}\) and \(D_{val}\), and the cost function C, the algorithm aim at finding an optimal neural network \(f\in F\), which could achieve the lowest cost on the dataset.

$$\begin{aligned}&f^{*} =argminCost\left( f\left( \theta ^{*}\right) ,D_{val}\right) ,\ f\ \in \ F \end{aligned}$$
(1)
$$\begin{aligned}&\theta ^{*} =argminL\left( f\left( \theta \right) ,D_{train}\right) ,\ \theta \end{aligned}$$
(2)

Cost is the metric evaluation function, e.g., accuracy, mean squared error, and \(\theta *\) is the learned parameter off. The search space F covers all the neural architectures, which can be morphed from the initial architectures.

3.3 Bayesian optimization

Bayesian optimization gives a principled technique based on the Bayes Theorem to guide a search of a global optimization problem that’s efficient and effective. It operates by making a probabilistic model of the objective function, named the surrogate function, which is then searched efficiently with an acquisition function before solution samples are chosen to evaluate the real objective function.

Bayesian optimization is an iterative way of solving these black-box optimization issues. Conventional Bayesian Optimization consists of a loop of three steps: update, generation, and observation.

  • update: train the underlying Gaussian process model with the present architectures and their performance;

  • Generation: generate the following design to observe by optimizing a delicately defined recovery function;

  • observation: receive the actual performance by training the generated neural architecture.

    figure a

3.4 H2O.ai

H2O is quick, scalable, open-source machine learning and deep learning for smarter software. The API allows making advanced calculations like deep learning, fostering, and bagging ensembles using AutoML (Candel et al. 2016). The tool provides H2O AutoML, a learning algorithm that piled ensembles within a purpose and overlooks finding candidate models. The consequence of the AutoML series is a rated list of best models for a dataset. Models in the leader board could be rated by design performance metrics or version features like typical forecast rate or coaching time. H2O AutoML uses the combination of random grid search with stacked ensembles, as diversified models improve the ensemble method’s accuracy Ledell2020. H2O AutoML maintains a variety of calculations (e.g., GBMs, Random Forests, Deep Neural Networks, GLMs), yielding a healthy amount of diversity across candidate versions, which can be exploited by stacked ensembles to generate a powerful final version. The technique’s effectiveness is reflected from the OpenML AutoML Benchmark, which compares the performance of a number of the most well-known, open-source AutoML systems across several datasets (Ledell 2020).

3.5 TPOT

The Tree-Based Pipeline Optimization Tool (TPOT) was among the earliest AutoML procedures and open-source computer software packages created for the information science community. TPOT was developed by Dr. Randal Olson as a postdoctoral student with Dr. Jason H. Moore in the Computational Genetics Laboratory at the University of Pennsylvania and is extended and encouraged by this team. The objective of TPOT would be to automate the construction of ML pipelines by mixing a flexible expression representation of pipelines with stochastic search algorithms like genetic programming (Olson and Moore 2019).

To automatically create and maximize these tree-based pipelines, TPOT utilizes a Genetic Programming (GP) algorithm. The TPOT GP algorithm follows a typical GP procedure: the GP algorithm generates 100 random tree-based pipelines and assesses their balanced cross-validation accuracy about the information collection. For every creation of the GP algorithm, the algorithm chooses the best 20 pipelines from the population in line with this NSGA-II selection strategy, in which pipelines are chosen to simultaneously optimize classification accuracy on the information collected while decreasing the number of operators in the pipeline. Every one of the top 20 chosen pipelines creates five duplicates (i.e., offspring) in the second generation’s population, 5 percent of these offspring cross with a different offspring utilizing one-point crossover, and then 90 percent of those remaining new offspring are randomly altered using a stage, fit, or mutation (1/3 possibility of each). Every creation, the algorithm updates a Pareto front of their non-dominated options found at any location in the GP run.

3.6 Auto-WEKA

Thornton built a tool, Auto-WEKA, to solve the problem for classification algorithms and feature selectors/evaluators implemented in the WEKA package. WEKA is a broadly used, open-source machine learning platform. As a result of the intuitive interface, it is very popular with novice users. Such users frequently find it tough to recognize the best approach to their specific dataset, one of the many available. Auto-WEKA considers the difficulty of concurrently choosing a learning algorithm and setting its hyperparameters, going away to previous methods that address these issues in isolation. Auto-WEKA does this using a fully automated approach using Bayesian optimization (Thornton et al. 2013).

Precisely, it reflects the merged space of WEKAs learning algorithms \(A=\left\{ A^{( 1)} ,....,\ A^{( k)}\right\} \) and their hyperparameter scopes \(\upsilon ^{( 1)} ,...,\upsilon ^{( n)}\) and intends to recognize the combination of algorithm \(A^{( j)} \in A\) and hyperparameter \(\upsilon ^{( j)} \ \in \ \upsilon \) that minimizes the cost function.

$$\begin{aligned} A^{*}_{\lambda *} \ \in \ argmin_{A^{( j)} \ \in \ A,\upsilon \ \in \ \upsilon ^{( j)} \ }\frac{1}{k}\sum ^{k}_{i=1} \lambda \left( A^{( j)}_{\lambda } ,D^{( i)}_{train} ,D^{( i)}_{test}\right) , \end{aligned}$$

where \(\lambda \left( A^{( j)}_{\lambda } ,D^{( i)}_{train} ,D^{( i)}_{test}\right) \) represents the loss function when trained on \(D^{( i)}_{train}\) and tested on \(D^{( i)}_{test}\).

3.7 Auto-Keras

Auto-Keras is an open-source software library for automated machine learning. Auto-Keras provides functions to search for architecture and hyperparameters of deep learning models automatically (Jin et al. 2019). The key idea of AutoKeras is to investigate the search space via morphing the neural architectures guided by the Bayesian optimization (BO) algorithm. The intuition behind the Auto-Keras function’s kernel function is the edit distance to morph one neural structure to another. Suppose \(f_a\) and \(f_b\) are just two neural networks. Inspired by Deep Graph Kernels. Auto-Keras suggest an edit-distance kernel for neural networks. Edit-distance here means how many operations are needed to morph one neural network to another. The concrete kernel function is defined as:

$$\begin{aligned} k( f_{a} ,f_{b}) =e^{-p^{2\ } d( f_{a} ,f_{b})} , \end{aligned}$$
(3)

where function d denotes the distance of two neural networks, a typical workflow for the Auto-Keras process is as follows: The User-initiated a study for the best neural design for the dataset and the Bayesian Optimizer in the Searcher would create a new architecture using CPU. It calls the Graph module to build the neural structure into a real neural network at the RAM. The new neural architecture is copied from the GPU to Model Trainer to train with the dataset (Jin et al. 2019).

3.8 Kalman Filter

The Kalman Filter is a method that utilizes a set of measures observed over a period, including noise and gives estimations according to the used set, by considering a joint probability distribution across the variables for each time frame. The Kalman Filter (KF), also named as linear quadratic estimation, is an optimal estimator which suggests parameters of interest from indirect, inexact, and dubious observations.

The KF aims to find the ‘most reliable estimate’ from noisy input. It is recursive; KF treats the new measures as they appear. The filter presents a recursive resolution to the linear optimal filtering problem to stationary and nonstationary situations. It is also recursive and measures the new state from the previous estimates and the new data. Unique the previous estimate needs storage, reducing the need for saving the whole past noted data (Haykin 2004). Filtering methods allow the recursive evaluation of model parameters. These techniques have found application in various disciplines and, across the last two decades, have been used to contagious infection epidemiology (Yang et al. 2014).

The KF dynamics rise from the regular periods of forecast and filtering. The change aspects of these periods are determined and translated in Gaussian probability density functions. Following new constraints on the system changes, the Kalman Filter dynamics converge to a steady-state filter, and the steady-state gain is inferred. The learning method connected with the filter, which describes the new data conveyed to the state measure by the latter system measure, is presented.

The Kalman Filter gives a linear minimum error variance estimate of the state characterized by a state-space model. The KF has the support of leading with noise in the couple, model, and the data. The main goal of the KF is to diminish the mean squared error within the real and measured data. Consequently, it gives the accurate as a possible measure of the mean squared error function data. Thought from this fact, it should be plausible to determine that the KF has much in common with the chi-square. The chi-square merit function is typically applied to fit a collection of model variables to a method named least squares fitting. The KF is usually named as recursive least squares (RLS) (Cazelles and Chau 1997).

3.9 State-space derivation

The differential equations of the KF can be incorporated into a state-space component. Let \(Y_{t},Y_{t-1},...,Y_1\) denote the observed values of a feature in time t,t-1,...,1. We assume that Y depends on an unobservable quantity \(\theta \), known as system state variables. The goal of Kalman Filter is make inferences of \(\theta \). The relation between \(Y_{t}\) and \(\theta \) is given by a equation (Cazelles and Chau 1997; Meinhold and Singpurwalla1983):

$$\begin{aligned} Y_{t}=F_{t}\theta _{t}+v_{t} \end{aligned}$$
(4)

where \(F_{t}\) is a known quantity. \(F_{t}\) is the noiseless connection between the t state vector and the measurement vector, and is assumed stationary over time. The observation error \(v_{t}\) is the associated with measurement error (Uhlmann and Julier 1997; Meinhold and Singpurwalla1983; Mandel et al. 2010). The main difference between KF and conventional linear models is that KF regression coefficients are not constant ant change over time as the system equation:

$$\begin{aligned} \theta _{t}=G_{t}\theta _{t-1}+w_{t} \end{aligned}$$
(5)

where \(\theta \) is the state vector at time t; \(G_{t}\) is the state transition matrix of the progress from the position at t-1 to the state at t , and is presumed stationary over time; \(w_{t}\) is the associated white noise with recognize covariance; \(v_{t}\) and the system equation error w t is presumed to be mutually independent random variables, spectrally white, and with normal probability distributions. \(w_{t}\) and \(v_{t}\) are sequences of white, Gaussian noise with zero mean:

$$\begin{aligned} E[w_{t}]=E[v_{t}]=0, \end{aligned}$$
(6)

The Kalman Filter is the filter that gets the least mean-square state error estimation. When \(Y_{0}\) is a Gaussian vector, the state and perceptions noises \(w_{t}\) and \(v_{t}\) are white and Gaussian, and the state and observation dynamics are linear. For the minimization of the MSE to support the optimal filter, it must be plausible to evaluate model errors using Gaussian distributions. The covariances of the noise models are considered stationary in period and are given by;

$$\begin{aligned} Q= E[w_{t}w^T_{t}] \end{aligned}$$
(7)
$$\begin{aligned} R= E[v_{t}v^T_{t}] \end{aligned}$$
(8)

The mean squared error is given by:

$$\begin{aligned} P_{k} = E[e_{t}e^T_{t}] = E[(Y_{t}-{\hat{Y}}_{t})(Y_{t}-{\hat{Y}}_{t})^T] \end{aligned}$$
(9)

where P is the error covariance matrix at time t. Consider the previous estimation of \({\hat{Y}}\) is named \({\hat{Y}}^{\prime }\) and was obtained by observation of the system. It is welcome to estimate using an update equation, mixing the old estimation with new measurement data.

3.10 Epidemiologic predictors

When it comes to contagious diseases, it is frequent to use compartmental models, such as the SIR and SEIR models. Differential equations models SIR and SEIR, seeking the variations in the model parameters to project the spreading behavior of a given disease, are applied to the new coronavirus, where many works use these models (Zhou et al. 2020; Fanelli and Piazza 2020).

3.10.1 SIR model

Martinie developed, in 1921, the SusceptibleInfectiousRemoved (SIR) model for plagues, which are spread in a human community by a vector; i.e., susceptive individuals acquire the infection from contagious vectors, and susceptive vectors acquire the disease from contagious people (Beretta and Takeuchi 1995; Zhu 2020; Schenzle 1984). The SIR model, in principle, explains the process of a virus spread. On the other hand, this factor is not ever consonant with the contagious path. Some viruses do not confer any long-lasting immunization (Zhu 2020).

The SIR model is among the most fundamental compartmental representations, and several models are extended of this basic one, including the SEIR case. The SEIR model defines three partitions: S for the amount of susceptible, I for the number of infectious, and R for the number of recuperated or death (or immune) people (Stone et al. 2000).

The equations that describe the SIR model are described in Eqs. 1011, and 12. All related to a unit of time, usually in days. Then, at each instant of time t, the values of each compartment can be changed (Beretta and Takeuchi 1995; Stone et al. 2000).

$$\begin{aligned} \frac{\mathrm{d}S}{\mathrm{d}t}= - \frac{\beta IS}{N}, \end{aligned}$$
(10)
$$\begin{aligned} \frac{\mathrm{d}I}{\mathrm{d}t}= \frac{\beta IS}{N} - \gamma I, \end{aligned}$$
(11)
$$\begin{aligned} \frac{\mathrm{d}R}{\mathrm{d}t}= \gamma I. \end{aligned}$$
(12)

The modeling is simple, since S(t) + I(t) + \( R (t) = N \) results in N, which represents the total population. Then, in each t, individuals moved from S to I. The model removes the individuals infected with the disease from the compartment. Equation 10 describes the model, where \(\beta \) is the average number of people comes into contact with another person multiplied by the likelihood of infection in that contact.

Equation 10 does different use of the faction mentioned above removing the number of infected people; in the I compartment, the new ones infected by the rate are added, with the removal of those who were recovered or died, introducing the term \(\mu \), which represents the recovery and mortality rate.

Equation 16 explains the variation in the recovered patients and the number of deaths compartment, which is described by \(\mu \) on those infected.

This model requires as input the amount of the susceptible, infected, and cured or dead population, all referring to time 0. And the necessary rates, it is transmission probability, recovery rate, and mortality.

3.10.2 SEIR model

Because the SIS and SIR model exclusively supports the cases without an incubation period, which is not the case for several classes of contagious infections, Cooke proposed a spread model for the case that after a specific period, the susceptibles person can get infectious. This model is named as the SEIR model (Cooke 1979).

The SEIR model differs from the SIR in one compartment, the E representing Exposure, which refers to diseases that are not manifested at the exact moment of infection, having an incubation period. Like COVID-19, which has an ordinary incubation period of 14 days.

The model is defined with four differential equations, described in Eqs. 131415, and 16. Some small changes are made, starting with the addition of the new Eq. 14, which represents the calculation of individuals exposed to the virus.

The model added a new rate, the incubation rate, \(\sigma \), which is the rate of latent individuals becoming infectious (typical period of incubation is 1/\(\sigma \)) (Cooke 1979).

$$\begin{aligned} \frac{\mathrm{d}S}{\mathrm{d}t}= - \frac{\beta IS}{N}, \end{aligned}$$
(13)
$$\begin{aligned} \frac{\mathrm{d}E}{\mathrm{d}t}= \frac{\beta IS}{N} - \sigma E, \end{aligned}$$
(14)
$$\begin{aligned} \frac{\mathrm{d}I}{\mathrm{d}t}= \sigma E - \gamma I, \end{aligned}$$
(15)
$$\begin{aligned} \frac{\mathrm{d}R}{\mathrm{d}t}= \gamma I. \end{aligned}$$
(16)

Analogous to the SIR representation, the sum of the compartments, which are now S(t) + E(t) + I(t) + \( R (t) = N \), results in the total population.

3.11 Nonlinear additive and multiplicative methods

3.11.1 Prophet

Prophet is an approach for prediction of time-series data based on an additive model. Prophet uses seasonality and day-off effects to calculate nonlinear tendencies. It operates appropriately with historical series that have regular periodical patterns and diverse seasons of past data. Prophet is resilient to missing data and variations in the bias and generally works well with outliers (Taylor and Letham 2018).

This method is a helpful method for time series with many distortions, lack of data, and drastic changes. What led us to use it since the lack of data on COVID-19 is excellent because it is a new disease.

$$\begin{aligned} y(t) = g(t) + s(t) + h(t) + \epsilon _t \end{aligned}$$
(17)

The Prophet equation 17 shows the following features, decomposing the time series into three elements: trend g(t), seasonality s(t), and holidays h(t).

  • g(t): piecewise linear or logistic increase curve for modelling non-seasonal changes in time series.

  • s(t): seasonal changes .

  • h(t): effects of day-off.

  • \(\epsilon _t\): error term accounts for any not common changes not accommodated by the model

3.12 Holt winters

Exponential smoothing is an ordinary procedure used to predict a time series left out the requirement of applying a parametric model (Gelper et al. 2010). The Holt-Winters also named to as double exponential smoothing, is an addition of exponential smoothing created for trended and periodic time series.

The Holt-Winters model (Winters 1960) is an expansion of the Holt method (Holt 2004), developed by Winters and divided into two groups, multiplicative and additive Holt-Winters. The multiplier model was selected for the analysis in this Chapter because it trends forecast values by seasonality, being the best for data with trends and increasing seasonality as a function of time.

The exponential and Holt-Winters procedures are susceptible to regular events or anomalies. Outliers influence prediction methods in two forms. First, the smoothed values are affected. Smoothed values depend on the present and historical values of the series, plus the outliers. The other influence concerns the choice of the parameters used in the recursive updating design (Gelper et al. 2010).

The use of the multiplicative method is explained by the characteristics of the data, using the numbers of infections and deaths of COVID-19; the curve presents an exponential shape. The trend and seasonality data have an increase according to the number of days; thereby, the multiplicative model is ideal.

In the Holt-Winters multiplicative method, the periodic partition is formulated in relative terms and used to fit the time series periodically. Equations 1819, and 20 describe the multiplicative method.

$$\begin{aligned} S_t= \alpha \frac{y_t}{I_{t-L}} + \left( 1 - \alpha \right) \left( S_{t-1} + b_{t-1}\right) . \end{aligned}$$
(18)
$$\begin{aligned} b_t= \gamma \left( S_t - S_{t-1}\right) + \left( 1 - \gamma \right) b_{t-1} \end{aligned}$$
(19)
$$\begin{aligned} I_t= \beta \frac{y_t}{S_{t-1} + b_{t-1}} + \left( 1 - \beta \right) I_{t-L} \end{aligned}$$
(20)

where \(S_t\) is the overall smoothing, \(b_t\) is the inclination smoothing, and \(I_t\) is the periodically smoothing. \(y_t\) refers to the real data at a period of t. L is the time. The \(\alpha \), \(\gamma \), and \(\beta \) are constants between 0 and 1. The model minimizes the Mean Square Error (MSE) equation using \(\alpha \), \(\gamma \), and \(\beta \).

3.13 COVID-19 epidemic on Ceará

On 9 September, Ceará reached 223,863 confirmed instances of COVID-19 and 8,634 deaths due to disease. One hundred ninety-eight thousand seven hundred eighty-eight individuals recovered from the disease. The data are from the IntegraSUS platform. There are also 88,177 suspected cases and 611 deaths under evaluation. The state has carried out 671,720 tests to spot the new coronavirus. The number of reported cases reached 679,359. Fortaleza is the leader in absolute amounts, with 47,638 confirmed instances and 3811 deaths from the illness. The funding registers 1784.6 cases per 100 thousand inhabitants. In Fortaleza’s macro-region, Maracana concentrates 6518 cases, 240 deaths, and incidence in 2861.1. Caucaia, the second city in deaths from the new coronavirus (340), has 5627 positive diagnoses and an incidence of 1557.8. In Maranguape, 4661 individuals have been infected, 115 have not resisted the disease, and the prevalence is 3613.8.

Figure 1 presents the plague evolution in Ceará between March and August of 2020.

Fig. 1
figure 1

COVID-19 cases curve in the state of Ceará

Figure  2 presents the plague evolution in Fortaleza between March and august of 2020. Fortaleza is the capital of the state of Ceará. Fortaleza has an area of 313,140 km and 2,643,247 inhabitants estimated in 2018, in addition to the highest demographic density among the country’s capitals, with 8390.76 inhabitants/km. Fortaleza continues as the epicenter of the pandemic in Ceará, with 3846 deaths and 48,855 people infected with the coronavirus.

Fig. 2
figure 2

COVID-19 cases curve in Fortaleza, capital of Ceará

3.14 Proposed method

The proposed method consists of two approaches. The first is to use the Kalman Filter method to predict the speed and behavior of the pandemic. The second approach uses the H20 framework to predict with machine learning models of the number of cases and deaths in Ceará.

Because the Kalman Filter needs a data entry to adjust the pandemic’s uncertainty and speed for forecast, a hybrid dataset was assembled with data from Ceará, Brazil, and China at the beginning of the pandemic. The proposal is that this hybrid dataset could provide long-term behavior for the Kalman filter, a model typically used for short-term forecasts (Fig. 3).

Fig. 3
figure 3

Proposed use of Kalman Filter with hybrid database

Two AutoML models were chosen for the experiments: H2o and TPOT, due to insufficient data to use the neural networks available in autokeras. The proposed analysis considers public data available of new confirmed cases and deaths reported daily for the state of Ceará, in the northeast region of Brazil, from 15 March until 17 May. The data were obtained from an open API available on https://github.com/integrasus/api-covid-ce, validated according to the Ceará Integrasus Platform (available at https://indicadores.integrasus.saude.ce.gov.br). The database has the following attributes:

  • Categorical result of COVID-19 examination

  • City of patience provided by Brazilian Geographic Institute

  • Asthma indicator

  • Indicator of cardiovascular problems

  • Date of death

  • Date of examination result

  • Date of begin of the symptoms

  • Date of examination notification

  • examination final result.

3.15 Performance metrics

The accuracy of the suggested approach is evaluated by applying a set of performance metrics as follows:

3.15.1 Root mean square error (RMSE)

$$\begin{aligned} rmse = \sqrt{\left( \frac{1}{n}\right) \sum _{i=1}^{n}(y^{\prime }_{i} - y_{i})^{2}} \end{aligned}$$
(21)

where \(y^{\prime }\) and y are the foretold and real values, respectively.

3.15.2 Mean absolute error (MSE)

$$\begin{aligned} mae = \left( \frac{1}{n}\right) \sum _{i=1}^{n}\left| y^{\prime }_{i} - y_{i} \right| \end{aligned}$$
(22)

where \(y^{\prime }\) and y are the foretold and real values, sequentially.

3.15.3 Coefficient of determination (\(R^2\))

$$\begin{aligned} R^2 = 1-\frac{\sum _{i=1}^{n}\left( y^{\prime }_{i} - y_{i}\right) ^{2}}{\sum _{i=1}^{n}\left( y^{\prime }_{i} - {y}_{i}\right) ^{2}} \end{aligned}$$
(23)

where \(y^{\prime }\) and y are the predicted and original values, respectively. y is the average of original values. The lowest value of RMSE and MAE indicates the most suitable approach. The greater rate of \(R^2\) shows a better correlation for the method.

4 Results and discussion

The results are the most critical factors for analyzing the pandemic since it shows the possible epidemic evolution according to the proposed models. The comparisons are based on standard metrics for regression models analysis, such as Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and R squared (\(R^2\)). Table 1 presents the error results by RMSE, MAE and \(R^2\). The TPOT automobile model showed the best result. However, AutoML models could only be used after a considerable amount of data, which is not available at the beginning of the pandemic. Thus, the Kalman Filter’s use was essential to project the pandemic propagation and decay time in Ceará with a reasonable margin of error.

Table 1 Method errors to short term experiments

4.1 Kalman Filter results

For the Kalman Filter, we use three approaches, the first shown in Fig. 4, which uses only the Kalman filter; as it is an adaptive method, it is necessary other data, and the forecast is based on data from Brazil. Adapting the filter to the data proved useful, making it a suitable method for short-term forecasts. The second approach using the Kalman Filter is to use the SEIR method; in this case, the data generated from the SEIR model were used in the filter. The third and last is the use of the hybrid data set, which consists of joining the data from Ceará and data generated from the SEIR model, before applying the data in the Kalman filter.

Fig. 4
figure 4

Kalman Filter result short term Ceará

Figure 5 presents the adaptative property of the Kalman Filter. The graph shows the Kalman filter’s prediction with data for 5, 10, or 15 days from the prediction day and the current curve. It is noticed that the closer to the prediction day, the filter approaches the real curve, reducing uncertainty and noise.

Fig. 5
figure 5

Kalman Filter predictions with data for 5, 10, or 15 days

Figure 6 shows the prediction for the COVID-19 death rate curve in the state of Ceará one month before the curve plateau was reached. Despite the error in the number of deaths being high of the value, the model could predict the period of stabilization and decline in the number of cases.

Fig. 6
figure 6

Prediction for the COVID-19 death rate curve in the state of Ceará one month before the curve plateau

Among the regular models for the COVID-19 global pandemic forecast, simple epidemiological and statistical models have gained more attention from authorities, and they are prevalent in the media. Due to a high level of uncertainty and lack of data, standard models have shown low accuracy for long-term prediction.

Fig. 7
figure 7

Prediction for the COVID-19 death rate curve in the state of Ceará with H2O.ai

According to the presented discussion, the use of Quadratic Kalman Filter as a predictor for the COVID-19 epidemiological data can be considered, with certain limitations being considered. The proposed Kalman Filter prediction approach is providing encouraging results for short-term predictions. Kalman filter-based proposed model is showing a large mean average error in the long-term. Hence, it can be concluded that the proposed prediction model is suitable for short-term prediction i.e., daily and weekly. The proposed prediction model can be updated to accommodate medium-term time-series predictions to discover the curve’s plateau, but with large error in the absolute number of cases.

4.2 H2O results

Table 2 presents the results for the H2O AutoML applied to Ceará COVID-19 deaths data set. The model id shows the best models chosen. The generalized linear model (GLM) was the one that obtained the best result. The first column present the name of model used (Fig. 7).

Table 2 H2O AutoML Results for Ceará COVID-19 with H2O AutoML

Figure 8 shows the prediction curve for COVID-19 deaths in Ceará with the best model obtained by the H2O.ai framework.

Fig. 8
figure 8

Prediction for the COVID-19 death rate curve in the state of Ceará with TPOT AutoML

4.3 TPOT results

Table 3 presents the results for the TPOT AutoML applied to Ceará COVID-19 deaths data set. The model was run for five generations, and the KNeighborsRegressor was chosen as the best model configured with 60 neighbors.

Table 3 TPOT AutoML Results for Ceará COVID-19 deaths data set

4.4 Comparison with state of art methods

Table 4 compares the best two approaches presented in this study with state of art regression models. TPOT and Kalman Filter obtain the best \(R^2\) score. The Prophet is a nonlinear model that modifies the seasonality, trend, and holidays of the time series. Holt-Winters is a method applied to time series. We use the multiplicative method due to the curve’s growth in the data, generally an exponential shape. The method has excellent efficacy in series with high seasonality, which is not much presented in data from the epidemic in Ceará.

Table 4 Method errors to long term predictions

The Prophet method has a large error for long-term predictions, but now its prediction has taken a different form compared to the result using data from China. It is noticeable that he was able to model the shape of the growth, peak, and decay of the curve, but the forecast values for the number of cases were different, resulting in a big error.

The use of compartmental epidemiological models as SEIR is widely popular throughout the COVID-19 pandemic. However, many predictions were not confirmed since the modeling could not represent the actual versions, dependent on several outside variables and steps of disease contention defined by general health managers. Each parameter is accountable for the speed of transitions between a single compartment along with the subsequent one. Compartmental models are legitimate approaches for understanding and analyzing epidemiological information, especially if the version is corrected to consider specific characteristics of the outbreak under investigation, as in this COVID-19 pandemic.

The forecast models infer that the amount of COVID-19 cases expands exponentially in its increasing phase. The exponential increase in cases strongly suggests that the epidemic growth is an underlying biological phenomenon instead of the number of tests completed. Some studies indicate that there is a particular generality from the temporal growth of COVID-19. Even though these facts, in a limited community, the exponential development of instances cannot stay forever. Hence, the stochastic model of disease spread saturates sometime. Forecasting plays a vital role in several study regions due to its benefits in conserving funds or improving the decision-making process to benefit the market. In the case of this COVID-19 outbreak, there are many challenges for forecasting as the COVID-19 incubation period is much more extended than other epidemic processes, and also a small number of datasets are available for this function.

The AutoML models used at work have considerable success. However, such models need to have training data that was not available at the beginning of the pandemic. The Kalman Filter model was accurate in terms of the long-term plateau date and the number of short-term forecasting cases. For long-term forecasting, AutoML models are a good option based on available training data. The distinction between the approach presented and the one usually used in other studies is that the use of Kalman Filter provides a long-term prediction at the beginning of the epidemic period using data from other countries/regions and the application of AutoML allows the semi-automatic selection of best model with better precision for the prediction of COVID-19 deaths.

5 Conclusion

Though SIR-based models have been extensively used to model the COVID-19 outbreak, they include some doubts. Several improvements are emerging to improve the standard of SIR-based models suitable for this COVID-19 outbreak. As an alternative to the SIR-based models, this study showed the use of machine learning models to predict the outbreak progression. We show that by using the Kalman Filter and AutoML models, we can achieve very high accuracy in predict COVID-19 cases. This study also shows that it is possible to achieve a \(R^2\) score of 0.99 on the prediction of COVID-19 deaths. The study presented has as main findings that using the TPOT application in predicting COVID-19 cases has a high \(R ^ 2\) score. The Kalman Filter can be used effectively for long-term prediction. The main limitations of using the method are that in AutoML approaches, training data is needed to create the models, making its use impractical at the beginning of the pandemic. The Kalman Filter approach needs data from other countries/cities to feed the model, which makes it feasible to use the approach but with the risk that if the behavior of the country’s epidemic curve used to feed the model has very different characteristics from the region where we want to get the death curve can lead to a high margin of error. The difference between the approach presented and the one commonly used in other studies is that the use of Kalman Filter allows a long-term prediction at the beginning of the epidemic period using data from other countries/regions and the application of AutoML allows the semi-automatic choice of best model with better precision for the prediction of COVID-19 deaths.