• High intraday variations in hourly indoor PM2.5 were detected.
• Random forest regression (RFR) was applied to model hourly indoor PM2.5.
• RFR performed better than the traditional multiple linear regression (MLR) model.
• Outdoor PM2.5 level was the most important predictor of indoor PM2.5.
This study developed a predictive model for hourly indoor fine particulate matter (PM2.5) concentration based on the random forest regression (RFR) method and compared its performance with that of the traditional multiple linear regression (MLR) method. Indoor and outdoor PM2.5 concentrations were monitored at a total of 66 apartments in Nanjing (NJ) and Beijing (BJ), China, during both the heating and non-heating seasons. In total, 14,442 pairs of hourly indoor and outdoor PM2.5 concentrations were measured with light-scattering nephelometers, while potential influencing factors were obtained via questionnaires. Hourly indoor PM2.5 prediction models were developed using either the RFR or the MLR method. Ten-fold cross-validation (10-fold CV) was used to evaluate the predictive power of the models. The 10-fold CV results showed that the MLR models agreed fairly well with the measured data, with coefficients of determination (R2) ranging from 0.70 (BJ) to 0.73 (NJ) and root mean square errors (RMSE) ranging from 28.0 μg/m3 (NJ) to 28.2 μg/m3 (BJ). Overall, the RFR models outperformed the reference MLR models, as indicated by higher CV R2 (0.82 in BJ and 0.78 in NJ) and lower CV RMSE (20.4 μg/m3 in BJ and 24.3 μg/m3 in NJ). Our results show that the RFR approach can exceed the predictive power of the classic MLR method and is a promising methodology for estimating indoor PM2.5 concentrations in Chinese megacities when direct PM2.5 measurements are not possible.
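The 10-fold CV evaluation described above can be illustrated with a minimal sketch. It is not the study's code or data: the data are synthetic, a single-predictor least-squares fit (indoor against outdoor PM2.5) stands in for the fitted model, and the helper names `fit_line` and `kfold_cv` are hypothetical. R2 is assumed here to be 1 - SS_res/SS_tot computed on the pooled out-of-fold predictions; the study may have defined it differently.

```python
import math
import random


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x with one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def kfold_cv(x, y, k=10, seed=0):
    """Return (R2, RMSE) computed from pooled out-of-fold predictions."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds
    pred = [0.0] * len(x)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Fit on the k-1 training folds only, then predict the held-out fold.
        a, b = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in held_out:
            pred[i] = a + b * x[i]
    my = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot, math.sqrt(ss_res / len(y))


# Synthetic hourly pairs for illustration only (assumed relationship and noise).
rng = random.Random(42)
outdoor = [rng.uniform(5, 300) for _ in range(500)]
indoor = [10 + 0.5 * o + rng.gauss(0, 15) for o in outdoor]

cv_r2, cv_rmse = kfold_cv(outdoor, indoor)
print(f"CV R2 = {cv_r2:.2f}, CV RMSE = {cv_rmse:.1f} ug/m3")
```

Because every prediction comes from a model that never saw that observation, the resulting R2 and RMSE estimate out-of-sample performance. The same loop applies to any model, including a random forest, by swapping the fit and predict steps.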
Estimating hourly average indoor PM2.5 using the random forest approach in two megacities, China