Batch reinforcement learning (BRL) is an emerging field in the reinforcement learning community in which agents learn exclusively from static datasets (i.e., replay buffers). Model-free BRL methods are capable of learning the optimal policy without the need for accurate environment models or simulation environments as oracles. Model-based BRL methods learn environment dynamics models from the buffers, then use these models to predict environment responses and to generate Markov Decision Process (MDP) transitions given states and actions from policies. In the offline setting, existing replay experiences serve as the prior knowledge that BRL models learn from, so generating replay buffers is crucial for benchmarking BRL models. In our B2RL (Building Batch RL) dataset, we collect real-world data from our building database, as well as buffers generated by several behavioral policies in simulation environments. To the best of our knowledge, we are the first to open-source building datasets for the purpose of batch RL.
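As a minimal illustration of what a replay buffer contains in the batch RL setting, the sketch below (our own example, not the dataset's loader) stores (state, action, reward, next_state, done) transitions that an offline agent can only read and sample from:

```python
import random
from collections import namedtuple

# One MDP transition as stored in a batch RL replay buffer.
Transition = namedtuple(
    "Transition", ["state", "action", "reward", "next_state", "done"]
)

class StaticReplayBuffer:
    """A fixed buffer of transitions; batch RL agents only read from it."""

    def __init__(self, transitions):
        self.transitions = list(transitions)

    def sample(self, batch_size):
        # Offline training samples mini-batches from the static data only;
        # no new environment interaction is allowed.
        return random.sample(self.transitions, batch_size)
```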
The real building buffer is extracted from readings of student labs in one of the school buildings. The number of datapoints per buffer ranges from roughly 170K to 260K, depending on the number of rooms involved and on missing values. We obtain one full year of data, from the beginning of July 2017 to the end of June 2018, for 15 rooms across 3 floors. (For most rooms, however, the start and end dates of the dataset may be slightly earlier or later: because of missing and corrupted data, collecting a comparable amount of data for each room can require a slightly different time period from room to room.) Since rooms on the same side of a floor often share similar thermal dynamics, we create batch data per floor so that each replay buffer reflects each variable air volume (VAV) unit's thermal dynamics precisely.
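A rough sketch of how the per-floor replay data could be assembled from room-level readings; the file names and column names here are hypothetical placeholders, not the dataset's actual schema:

```python
import pandas as pd

# Hypothetical room-level CSVs for one floor; the real layout may differ.
room_files = ["floor3_room301.csv", "floor3_room302.csv"]

frames = []
for path in room_files:
    df = pd.read_csv(path, parse_dates=["timestamp"])
    # Drop rows with missing or corrupted sensor readings.
    df = df.dropna()
    frames.append(df)

# Concatenate rooms on the same floor so that the buffer reflects
# each VAV's thermal dynamics on that floor.
floor_buffer = pd.concat(frames).sort_values("timestamp").reset_index(drop=True)
print(len(floor_buffer), "datapoints in the floor buffer")
```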
- State: We use the following attributes as RL states to evaluate the policy: indoor air temperature, actual supply airflow, outside air temperature, and humidity. These states include the features needed for thermal comfort estimation as well as those that capture the responses to actions.
- Action: We control two important parameters, namely the zone air temperature setpoint and the actual supply airflow setpoint.
- Reward: We monitor the thermal states of the space as well as the thermal comfort index predicted by a regression model, and make control decisions with the actions selected by the BRL model. Our reward function penalizes high HVAC energy use and discourages a large absolute value of the thermal comfort index, which indicates discomfort to occupants (a sketch of this reward shape follows below).
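A hedged sketch of a reward of this shape; the weights `w_energy` and `w_comfort` and the exact comfort index are illustrative, not the values used to build the dataset:

```python
def reward(hvac_energy_kwh, comfort_index, w_energy=1.0, w_comfort=1.0):
    """Penalize HVAC energy use and a large |thermal comfort index|.

    A comfort index near 0 means occupants are comfortable; large absolute
    values in either direction indicate discomfort. Weights are illustrative.
    """
    return -(w_energy * hvac_energy_kwh + w_comfort * abs(comfort_index))
```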
We adopt Sinergym, an open-source building simulation and control framework for training RL agents that interfaces with EnergyPlus models through Python APIs. Our approach follows the BRL paradigm: (1) We first train behavioral RL agents for 500K timesteps and select the one with the highest average evaluation score as the expert agent. We run on the 5-zone building, a single-floor building divided into five zones (one interior and four exterior), under three weather types (cool, hot, and mixed) in continuous action settings, and we experiment with two response types, deterministic and stochastic. We then generate an expert buffer with 500K transitions. (2) A medium buffer of 500K transitions is generated after the behavioral agent is trained "halfway", i.e., when its evaluation score reaches half of the expert agent's score. (3) A randomly initialized agent, which samples actions from the allowed action space with uniform probability during the evaluation stage, is used to generate a random buffer with 350,400 transitions. A minimal collection sketch is given after the state/action/reward list below.
- State: Site outdoor air drybulb temperature, site outdoor air relative humidity, site wind speed, site wind direction, site diffuse solar radiation rate per area, site direct solar radiation rate per area, zone thermostat heating setpoint temperature, zone thermostat cooling setpoint temperature, zone air temperature, zone thermal comfort mean radiant temperature, zone air relative humidity, zone thermal comfort clothing value, zone thermal comfort Fanger model PPD, zone people occupant count, people air temperature, facility total HVAC electricity demand rate, current day, current month, and current hour.
- Action: Heating setpoint and cooling setpoint in continuous settings.
- Reward: We follow the default linear reward setting, which considers the energy consumption and the absolute difference from the comfort temperature.
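A minimal sketch of collecting a buffer by rolling out a behavioral policy in a Sinergym environment. The environment id is a hypothetical placeholder and the gymnasium-style API shown may differ across Sinergym versions, so consult the Sinergym documentation for the exact interface; a random policy stands in for the expert/medium agents here:

```python
import gymnasium as gym
import sinergym  # importing registers Sinergym environments

# Hypothetical environment id; check the Sinergym docs for the real one.
env = gym.make("Eplus-5zone-hot-continuous-v1")

buffer = []
obs, info = env.reset()
for _ in range(1000):  # the actual expert/medium buffers use 500K steps
    # Random behavioral policy for this sketch; replace with a trained agent.
    action = env.action_space.sample()
    next_obs, reward, terminated, truncated, info = env.step(action)
    buffer.append((obs, action, reward, next_obs, terminated or truncated))
    obs = next_obs
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```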
--
@inproceedings{liu2022b2rl,
title={B2RL: an open-source dataset for building batch reinforcement learning},
author={Liu, Hsin-Yu and Fu, Xiaohan and Balaji, Bharathan and Gupta, Rajesh and Hong, Dezhi},
booktitle={Proceedings of the 9th ACM International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation},
pages={462--465},
year={2022}
}