Abstract
Introduction: The East San Gabriel Valley Watershed Management Group (ESGV Group), a client of Stantec, comprises the Cities of Claremont, La Verne, Pomona, and San Dimas (Group Members). The ESGV Group has performed water quality and storm parameter monitoring for five wet seasons. The collected data were analyzed to determine whether simple relationships exist between descriptive storm parameters and the measured water quality. In other words, the group is interested in finding a linear relationship between storm parameters (e.g., storm duration, precipitation, and intensity) and the concentration of a specific water pollutant (e.g., E. coli). Traditionally, multiple linear regression (MLR) is employed for this work, with all available storm parameters used as the predictors (i.e., regressors) and the water pollutant concentration serving as the output variable (i.e., response). This traditional approach can perform well when the number of predictors is small (e.g., 2 to 3). However, when the number of predictors becomes large (e.g., 5 or more), traditional MLR can easily overfit and produce suboptimal regression results. This is mainly because MLR has no predictor selection process, so it uses all available predictors regardless of their usefulness. We therefore propose a novel linear regression framework based on the Least Absolute Shrinkage and Selection Operator (LASSO), aimed at improving performance on regression problems with many predictors. We test our framework on three water quality pollutants: E. coli, dissolved zinc, and total nitrogen. Each pollutant concentration is regressed against six available storm parameters (event precipitation, duration, peak intensity, average intensity, days since last storm, and days since last storm > 0.25 inches) using both the traditional MLR method and the proposed LASSO regression.
Although other storm parameters might also be related to water quality, these six are the ones for which data are available and are therefore selected as regressors in this study. The results show that, for all three pollutants, the proposed LASSO regression outperforms the traditional MLR, using R-squared (R²) as the evaluation metric.
Methodology: Our proposed regression framework begins with pre-processing the collected data records, in which outlier detection and influential data detection are conducted. Outliers are data values far from the other data in the collected dataset, and they adversely affect the regression. Figure 1 shows an example of how outliers can negatively influence the fit of the regression line. In our study, outliers are identified and filtered out using a preset z-score threshold. An influential data record is another type of 'outlier', one that disproportionately affects the slope of the regression line. Influential data records are identified and filtered out using a Cook's distance threshold. The processed dataset is obtained by removing both the identified outliers and the influential data records from the total available data. LASSO regression and the traditional MLR (for comparison) are then applied to the processed data. LASSO regression is a variant of MLR that encourages simple, sparse models (i.e., models with fewer storm parameters as predictors). It is often preferred because it automatically selects useful predictors and thereby improves regression performance by avoiding overfitting. The difference between LASSO and MLR lies in how the estimated coefficient for each predictor, β̂, is computed: LASSO adds a penalty term (the ℓ1 norm of the coefficients) to the minimization of the residual. This penalty shrinks the coefficient values toward zero and can set some of them exactly to zero, which is how automatic predictor selection is achieved. The magnitude of this shrinkage is controlled by a hyperparameter α, known as the LASSO parameter.
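To make the difference between the two estimators concrete, the two minimization problems can be written side by side. This is a sketch in one common convention, not necessarily the exact form used in the study: the symbols y_i (pollutant concentration for record i), x_ij (value of storm parameter j for record i), n (number of records), p (number of predictors, six here), and the 1/(2n) scaling of the residual term are assumptions introduced for illustration.

```latex
% MLR (ordinary least squares): residual sum of squares only, no penalty
\hat{\beta}^{\mathrm{MLR}}
  = \arg\min_{\beta_0,\,\beta}\;
    \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Bigr)^{2}

% LASSO: the same residual term plus an l1 penalty weighted by alpha
\hat{\beta}^{\mathrm{LASSO}}
  = \arg\min_{\beta_0,\,\beta}\;
    \Bigl\{ \frac{1}{2n} \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Bigr)^{2}
    + \alpha \sum_{j=1}^{p} \lvert \beta_j \rvert \Bigr\}
```

Setting α = 0 removes the penalty and recovers the MLR solution, since a constant positive scaling of the residual term does not change the minimizer. Increasing α drives more coefficients exactly to zero.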
The α value for each pollutant regression model is determined by tuning. The established regression models are evaluated on the processed dataset by cross-validation, using R² as the metric. R² measures how much of the variation observed in the response is explained by the predictors in a regression model. A larger R² is preferred, and the maximum is 1.
Model Results: Three common pollutants are selected to compare the regression results of the proposed LASSO with the traditional MLR approach: E. coli (36 data points), dissolved zinc (36 data points), and total nitrogen (13 data points). Table 1 shows the R² for both LASSO (red) and traditional MLR (blue). For all three selected pollutants, the R² of the proposed LASSO regression exceeds that of MLR. Total nitrogen shows the largest performance difference between the two methods, owing to its limited amount of data: a small dataset usually favors a simpler model, such as the LASSO model. To show how LASSO regression yields a simpler model than MLR for each pollutant, Table 2 presents the estimated regression coefficients for each predictor. For all pollutants, LASSO includes only a subset of the predictors in the model, those with positive coefficients, while MLR uses all the available predictors. This demonstrates how LASSO can 'automatically' select the influential predictors and produce a simpler regression model. Figures 2, 3, and 4 show the regression results vs. actual data for all three pollutants. Consistent with the R² comparisons, the LASSO regression results are in general slightly closer to the actual data than the MLR results for all three test pollutants.
Conclusions: This abstract presents a novel linear regression framework that has the potential to be widely applied in stormwater management. Compared with the traditional MLR approach, the proposed LASSO regression encourages simpler and sparser models by automatically selecting the influential predictors.
The regression relationship derived from LASSO outperforms that from the traditional MLR, as demonstrated in all three test cases: E. coli, dissolved zinc, and total nitrogen. Further discussion of the framework will be provided in the following paper, and the model will be tested further as more data become available.
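The workflow described in the Methodology (z-score filtering, Cook's distance filtering, LASSO versus MLR, and cross-validated R²) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic, the z-score threshold of 3, the Cook's distance cutoff of 4/n, the α grid, the 5-fold split, and the 1/(2n) scaling of the residual term are all assumed values, and every function name is hypothetical.

```python
import numpy as np

def zscore_mask(values, threshold=3.0):
    """True where a record's |z score| is within the (assumed) threshold."""
    z = (values - values.mean()) / values.std()
    return np.abs(z) <= threshold

def ols_fit(X, y):
    """Traditional MLR: least squares on all predictors plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[0], coef[1:]

def cooks_distance(X, y):
    """Cook's distance of each record under the ordinary least-squares fit."""
    A = np.column_stack([np.ones(len(y)), X])
    n, p = A.shape
    leverage = np.diag(A @ np.linalg.pinv(A))        # diagonal of the hat matrix
    b0, b = ols_fit(X, y)
    resid = y - (b0 + X @ b)
    mse = resid @ resid / (n - p)
    return resid**2 * leverage / (p * mse * (1.0 - leverage) ** 2)

def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; values within [-alpha, alpha] become 0."""
    return np.sign(rho) * max(abs(rho) - alpha, 0.0)

def lasso_fit(X, y, alpha, n_sweeps=300):
    """Coordinate descent on (1/(2n)) * ||y - b0 - X b||^2 + alpha * ||b||_1."""
    n, k = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean                  # centering handles the intercept
    col_scale = (Xc**2).sum(axis=0) / n
    beta = np.zeros(k)
    for _ in range(n_sweeps):
        for j in range(k):
            partial = yc - Xc @ beta + Xc[:, j] * beta[j]   # residual without predictor j
            beta[j] = soft_threshold(Xc[:, j] @ partial / n, alpha) / col_scale[j]
    return y_mean - x_mean @ beta, beta

def cv_r2(X, y, fit, n_folds=5, seed=0):
    """R^2 of pooled out-of-fold predictions from k-fold cross-validation."""
    idx = np.random.default_rng(seed).permutation(len(y))
    pred = np.empty(len(y))
    for fold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, fold)
        b0, b = fit(X[train], y[train])
        pred[fold] = b0 + X[fold] @ b
    return 1.0 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()

# Synthetic stand-in for 36 storm events x 6 storm parameters (not the ESGV data);
# by construction only predictors 0 and 2 influence the response.
rng = np.random.default_rng(42)
X = rng.normal(size=(36, 6))
y = 2.0 * X[:, 0] + 1.0 * X[:, 2] + rng.normal(size=36)

# Pre-processing: drop z-score outliers, then high-influence records.
keep = zscore_mask(y)
X1, y1 = X[keep], y[keep]
keep = cooks_distance(X1, y1) <= 4.0 / len(y1)       # 4/n rule of thumb (assumed)
Xp, yp = X1[keep], y1[keep]

# Tune alpha on a small hypothetical grid by cross-validated R^2.
scores = {a: cv_r2(Xp, yp, lambda A, b, a=a: lasso_fit(A, b, a))
          for a in (0.01, 0.05, 0.1, 0.2, 0.5)}
best_alpha = max(scores, key=scores.get)

print("cross-validated R^2, LASSO:", scores[best_alpha], "at alpha =", best_alpha)
print("cross-validated R^2, MLR:  ", cv_r2(Xp, yp, ols_fit))
print("LASSO coefficients:", lasso_fit(Xp, yp, best_alpha)[1])
print("MLR coefficients:  ", ols_fit(Xp, yp)[1])
```

The coordinate-descent solver is written out only to keep the sketch dependency-free. In practice a library estimator such as scikit-learn's `Lasso`, whose `alpha` parameter weights the ℓ1 penalty against a residual term scaled by 1/(2n), would likely be the more convenient choice. Results on synthetic data say nothing about the pollutant models reported in the abstract.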
This paper was presented at the WEF Stormwater Summit in Minneapolis, Minnesota, June 27-29, 2022.
Author(s): J. Li¹; G. Kohli²; D. Son³; J. Carver⁴; J. Abelson⁵
Author affiliation(s): Stantec¹; Stantec²; Stantec³; City of Pomona⁴; Stantec⁵
Source: Proceedings of the Water Environment Federation
Document type: Conference Paper
Print publication date: Jun 2022
DOI: 10.2175/193864718825158452
Content source: Stormwater Summit
Copyright: 2022