Data-driven models for predicting microbial water quality in the drinking water source using E. coli monitoring and hydrometeorological data
Created by W.Langdon from
gp-bibliography.bib Revision:1.7954
@Article{sokolova:2022:Ste,
author = "Ekaterina Sokolova and Oscar Ivarsson and
Ann Lilliestrom and Nora K Speicher and Henrik Rydberg and
Mia Bondelind",
title = "Data-driven models for predicting microbial water
quality in the drinking water source using {E. coli}
monitoring and hydrometeorological data",
journal = "Science of the Total Environment",
year = "2022",
volume = "802",
pages = "149798",
month = jan # " 1",
keywords = "genetic algorithms, genetic programming, TPOT,
Drinking Water, Environmental Monitoring, Escherichia
coli, Water Microbiology, Water Quality, Artificial
intelligence, E. coli, Machine learning, Microbial
water quality",
ISSN = "1879-1026",
DOI = "10.1016/j.scitotenv.2021.149798",
abstract = "Rapid changes in microbial water quality in surface
waters pose challenges for production of safe drinking
water. If not treated to an acceptable level, microbial
pathogens present in the drinking water can result in
severe consequences for public health. The aim of this
paper was to evaluate the suitability of data-driven
models of different complexity for predicting the
concentrations of E. coli in the river G{\"o}ta {\"a}lv at the
water intake of the drinking water treatment plant in
Gothenburg, Sweden. The objectives were to (i) assess
how the complexity of the model affects the model
performance; and (ii) identify relevant factors and
assess their effect as predictors of E. coli levels. To
forecast E. coli levels one day ahead, the data on
laboratory measurements of E. coli and total coliforms,
Colifast measurements of E. coli, water temperature,
turbidity, precipitation, and water flow were used. The
baseline approaches included Exponential Smoothing and
ARIMA (Autoregressive Integrated Moving Average), which
are commonly used univariate methods, and a naive
baseline that used the previous observed value as its
next prediction. Also, models common in the machine
learning domain were included: LASSO (Least Absolute
Shrinkage and Selection Operator) Regression and Random
Forest, and a tool for optimising machine learning
pipelines - TPOT (Tree-based Pipeline Optimization
Tool). Also, a multivariate autoregressive model VAR
(Vector Autoregression) was included. The models that
included multiple predictors performed better than
univariate models. Random Forest and TPOT resulted in
higher performance but showed a tendency of
overfitting. Water temperature, microbial
concentrations upstream and at the water intake, and
precipitation upstream were shown to be important
predictors. Data-driven modelling enables water
producers to interpret the measurements in the context
of what concentrations can be expected based on the
recent historic data, and thus identify unexplained
deviations warranting further investigation of their
origin.",
notes = "PMID: 34454142",
}