Abstract
Background
The most widely employed process control strategies within water resource recovery facilities (WRRFs) are model-based and built on expert knowledge. Examples include ammonia-based aeration control (ABAC) and nitrate-linked internal mixed liquor recycle (IMLR) control. Even where machine learning is integrated, it is typically based on supervised learning, in which the model is trained with the 'correct answer' embedded in the training data. While this approach is often successful, it relies on high-performing historical control data on which to train the model and can be handicapped by human biases toward certain operational modes. Control strategies based on reinforcement learning (RL) are different. In RL, the algorithm (the RL agent) interacts directly with the control environment during training, learning to map observed state data to actions that maximize a specified reward (Figure 1) (Silver et al., 2018; Sutton & Barto, 2018). In this way, the RL agent is entirely data-driven and learns from direct experience. Although RL has gained attention as a tool for WRRF control optimization, numerous open questions remain about the efficacy of employing RL agents to achieve valuable process optimization outcomes in WRRFs (Croll et al., 2023a; Nam et al., 2023). To help address these open questions, the present study pursued three objectives: 1) evaluate common RL algorithms in the context of WRRF optimization; 2) evaluate the effects of increasing the number of processes controlled by the RL agent; and 3) evaluate the best-performing RL agents to better understand how successful RL control strategies compare to domain-based control strategies such as ABAC.

Methods
The present study evaluated RL agent control optimization in the context of the Benchmark Simulation Model No. 1 (BSM1) (Figure 2). The BSM1, a five-zone bioreactor with a secondary clarifier, IMLR, return activated sludge (RAS), and waste activated sludge (WAS) streams, provides a standard model construction and testing framework for assessing and benchmarking WRRF control performance through a simulation environment (Alex et al., 2008). It also includes a predefined dynamic influent and a facility operational cost function. The cost function accounts for direct operational costs, including aeration energy, mixing energy, pumping energy, and biosolids disposal, and for indirect environmental costs based on the mass of pollutants discharged to the environment. By defining the BSM1 cost function, with variations, as the RL agent reward function, agents could be rapidly trained to minimize overall facility cost and compared against 'baseline' BSM1 operation. To facilitate training, a novel RL training environment (Croll et al., 2023b) was developed using OpenAI Gym (Brockman et al., 2016) to connect a SUMO (Dynamita) BSM1 simulation to the Stable-Baselines3 package (Raffin et al., 2021) in Python. Four scenarios were evaluated in this study (Table 1). Scenario 1 (S1) tested four algorithms, each representing a common class of RL algorithm, to determine which was best suited to WRRF control optimization: Deep Q-Network (DQN), Proximal Policy Optimization (PPO), Advantage Actor-Critic (A2C), and Twin Delayed Deep Deterministic Policy Gradient (TD3) (Croll et al., 2023b). RL agents in this scenario controlled the dissolved oxygen (DO) set point for Zones 3-5 (Z3-5).
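For illustration, a minimal sketch of how such a Gym training environment can be structured is shown below: the agent sets DO setpoints for Zones 3-5 and receives the negative of a BSM1-style operational cost as its reward, so that maximizing reward minimizes facility cost. The class name, observation layout, bounds, and the simulate_control_interval stub are illustrative placeholders and do not reproduce the study's actual interface to the SUMO simulator.

```python
import numpy as np
import gym
from gym import spaces


def simulate_control_interval(do_setpoints, hours):
    """Dummy stand-in for the interface to the process simulator (SUMO in the
    study), which is not reproduced here. Returns cost components accrued
    over one control interval."""
    return {"aeration": float(np.sum(do_setpoints)) * hours,
            "pumping": 1.0 * hours,
            "mixing": 0.5 * hours,
            "sludge_disposal": 2.0 * hours,
            "effluent": 3.0 * hours}


class BSM1DOControlEnv(gym.Env):
    """Minimal sketch of a Scenario 1-style environment: the agent sets DO
    setpoints for Zones 3-5 and is rewarded with the negative of a BSM1-style
    operational cost. Follows the classic Gym API (reset -> obs;
    step -> obs, reward, done, info)."""

    def __init__(self, control_interval_h=0.25, episode_length_d=14):
        super().__init__()
        # Action: continuous DO setpoints (mg/L) for Zones 3, 4, and 5
        self.action_space = spaces.Box(low=0.0, high=3.0, shape=(3,), dtype=np.float32)
        # Observation: placeholder sensor vector (e.g., NH4, NO3, DO, flows)
        self.observation_space = spaces.Box(low=0.0, high=np.inf, shape=(8,), dtype=np.float32)
        self.control_interval_h = control_interval_h
        self.max_steps = int(episode_length_d * 24 / control_interval_h)
        self.step_count = 0

    def reset(self):
        self.step_count = 0
        return self._read_sensors()

    def step(self, action):
        do_setpoints = np.clip(action, self.action_space.low, self.action_space.high)
        # Advance the plant simulation one control interval with the chosen setpoints
        costs = simulate_control_interval(do_setpoints, self.control_interval_h)
        # BSM1-style cost: energy + sludge disposal + effluent pollution charges
        total_cost = sum(costs.values())
        reward = -total_cost  # maximizing reward minimizes facility cost
        self.step_count += 1
        done = self.step_count >= self.max_steps
        return self._read_sensors(), reward, done, {}

    def _read_sensors(self):
        # Placeholder: would return current state measurements from the simulator
        return np.zeros(self.observation_space.shape, dtype=np.float32)
```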
Scenarios 2-4 evaluated RL agents based on the best-performing algorithm, TD3, for their capacity to control a greater number of actions throughout the BSM1 (Table 1) (Croll et al., 2024).

Results
Despite controlling only a single action, neither the DQN nor the PPO algorithm was able to produce agents that met the BSM1 effluent limits. Both the A2C and TD3 algorithms produced successful agents (Table 1) (Croll et al., 2023b). However, the actions recommended by the A2C agent were not practical for a physical system, as the agent tended to alternate between very low and very high DO setpoints, which would result in excessive wear on physical equipment (Figure 3). By contrast, the TD3 agent tended to recommend actions that closely resembled ABAC control (Figure 3). Despite its success in Scenario 1, the TD3 agent reduced the BSM1 cost by only 1.8% relative to baseline control, likely due to the limited scope of control the RL agent had over the BSM1 simulation. Increased scope of RL control was evaluated in Scenarios 2-4. As the number of actions under RL agent control increased, so did the level of BSM1 cost reduction, rising to 8.1% in Scenario 3 and 11.5% in Scenario 4 (Croll et al., 2024). However, it should be noted that under Scenario 4, with the addition of WAS control to the RL agent scope, the effluent was not compliant with BSM1 limits. Rather, the RL agent correctly determined that the single largest contributor to the BSM1 operational cost was biosolids disposal and effectively ceased all sludge wasting, opting instead to pay the defined costs for environmental pollution. This finding highlights the importance of developing reward functions that accurately reflect operational goals and regulatory limits, and suggests that the BSM1 cost function may have undervalued effluent pollution. A breakdown of RL agent actions and key process parameters during RL agent operation under Scenarios 3 and 4 is shown in Figures 4-7. These will be discussed in more detail in the final paper.
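As a further illustration, training an agent against such an environment with the Stable-Baselines3 TD3 implementation might look as sketched below; the exploration noise and timestep count are illustrative values only, not those used in the study. TD3 (like A2C and PPO) acts directly on continuous action spaces, which suits setpoint control, whereas DQN requires a discretized action set.

```python
import numpy as np
from stable_baselines3 import TD3
from stable_baselines3.common.noise import NormalActionNoise

# Assumes the BSM1DOControlEnv sketch above (and an SB3 version that accepts
# classic Gym environments); hyperparameter values are illustrative only.
env = BSM1DOControlEnv()

# Gaussian exploration noise on the continuous DO-setpoint actions
n_actions = env.action_space.shape[0]
action_noise = NormalActionNoise(mean=np.zeros(n_actions),
                                 sigma=0.1 * np.ones(n_actions))

model = TD3("MlpPolicy", env, action_noise=action_noise, verbose=1)
model.learn(total_timesteps=200_000)

# Roll out the trained agent deterministically over one evaluation episode
obs = env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
```

The same pattern extends to Scenarios 2-4 by widening the environment's action space to cover additional controlled processes (e.g., Scenario 4 added WAS control).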
The present study evaluated reinforcement learning (RL) agent control optimization in the context of the Benchmark Simulation Model No. 1 (BSM1). RL agents achieved minimal improvement when controlling a single action but achieved dramatic operational cost reductions when controlling a larger action space. An RL agent successfully maintained effluent limit compliance while controlling seven unique actions, for a total facility cost reduction of 8.1% compared to the BSM1 baseline.
Author(s): H. Croll1, K. Ikuma2, S. Ong2, S. Sarkar2
Author affiliation(s): 1Stantec, IA; 2Iowa State University, IA
Source: Proceedings of the Water Environment Federation
Document type: Conference Paper
Print publication date: Oct 2024
DOI: 10.2175/193864718825159612
Content source: WEFTEC
Copyright: 2024