Sakshi Jain a, Albert A. Presto b and Naomi Zimmerman *a
aDepartment of Mechanical Engineering, University of British Columbia, Vancouver, Canada. E-mail: nzimmerman@mech.ubc.ca; Fax: +1-604-822-2403; Tel: +1-604-822-9433
bDepartment of Mechanical Engineering, Carnegie Mellon University, Pittsburgh, USA
First published on 24th October 2023
To date, epidemiological studies have generally not accounted for the spatiotemporal variations in PM2.5 concentration that populations experience. These studies typically infer exposure using home address and annually-averaged concentrations measured by a few centrally-located monitors. To quantify the impact of spatiotemporal variation on exposure estimates, this study uses land-use random forest models to estimate daily-average ambient PM2.5 concentrations in Allegheny County, USA. The data were collected using a network of 47 low-cost air quality sensors, and predictions were made for 50 × 50 m grids in Pittsburgh. Residential (PR) and commercial (PC) probability weighting values were assigned to each grid. The daily-average predictions were divided into “weekday” and “weekend” concentrations for each grid and averaged annually to estimate total annual exposure. Weighted stratified sampling was conducted using PR and PC values as probabilities, and weekdays and weekends as strata. Static models (population spends 24 hours per day in a fixed residential area) and dynamic models (estimates that account for time spent in residential and commercial areas) were created using these samples. The daily-average predicted concentrations across all grids ranged from 4–75 μg m−3 (μ = 12.0 μg m−3). Weekend concentrations were 10% higher than weekday concentrations, and commercial area concentrations were 9% higher than residential areas. These results support the hypotheses that exposure profiles vary due to movement between different areas and that exposure is underestimated when residents' mobility is ignored. Furthermore, exposure estimates may be affected by the observed temporal variations between weekdays and weekends. As adoption of low-cost sensor networks grows, this work suggests that epidemiological exposure models can leverage these data to further refine exposure estimates and identify behaviors that may reduce exposure.
Environmental significance
This study estimated the impact of spatiotemporal ambient PM2.5 variations on exposure using a low-cost air quality sensor (LCS) network in Pittsburgh, PA, USA. Exposure epidemiology typically relies on inferring exposure from residential address. We found that exposure estimates are consistently about 10% higher when the population spends more time in commercially-dense locations (dynamic model) vs. residentially-dense locations (static model) and that exposure was higher on weekends. This work demonstrates that LCS networks can be used to improve PM2.5 exposure estimates by informing concentration models that are more refined in space and time. Previous epidemiology research has shown that there is no PM2.5 concentration below which health effects are not observed, thus improvement in exposure estimates may improve or refine the existing knowledge on health impacts of low levels of PM2.5.
To mitigate the public health effects of PM2.5, it is important to accurately estimate the population's exposure. The cumulative exposure of an individual is typically estimated by considering their exposure in three key contexts: (1) indoor environments, (2) during commuting, and (3) outdoor settings. To accurately gauge an individual's overall exposure, it is essential to account for a variety of factors specific to each context. Exposure assessments within indoor spaces are significantly affected by activities such as cooking and cleaning,5 and the building characteristics (e.g., presence of air filters, infiltration of ambient concentrations, windows open/closed).6 For commuting, factors such as the duration and mode of transportation (e.g., walking, driving, using public transit) play a dominant role in influencing the overall exposure levels.7 Ambient concentrations are inherently influenced by meteorological conditions (e.g., wind direction)8 and geography (e.g., elevation).9 These factors contribute to the complex interplay of exposure elements that need to be considered for a comprehensive estimation of an individual's exposure profile. However, there are challenges with properly assessing exposure considering these complex elements. Exposure is often inferred from PM2.5 concentrations taken via only a few centrally-located outdoor monitors.10–12 In contrast, previous studies have shown that small-scale spatial variations in PM2.5 exist.13,14 As such, people's movement exposes them to changing pollution concentrations, resulting in varying exposure profiles, which can impact a given person's consequent health.4,15 Additionally, exposure misclassification due to unaccounted human mobility can affect the epidemiological inferences derived and, consequently, relevant policies. Residence-only exposure profiles have been found to result in negative biases in the estimates;16–19 in essence, the relative risk is underestimated by ignoring mobility.
The effect of mobility on exposure levels has previously been assessed using wearable or portable personal monitors, which typically collect integrated filter samples for offline chemical analysis, and comparing the personal monitoring concentrations with ambient concentrations at home residences and applying correction factors. However, even though personal monitors are the most accurate method to estimate personal exposure, they have both logistic and cost constraints, such as recruiting an adequate number of individuals from representative populations to carry the monitors. Furthermore, the characteristics of participants (e.g., age, gender) may affect the accuracy of the correction factors used to infer personal exposure from ambient PM2.5.20 Personal monitors also suffer from measurement uncertainties, due to the low temporal resolution of the data (often 4–24 h).21 An alternate way of addressing the impacts of mobility and the discrepancy between personal exposure and at-residence concentrations is by including the spatial variability of the pollutant.15,22–24
Most exposure epidemiology studies are based on residential address;25 the daily movement of an individual (for work, recreation etc.) isn't typically accounted for. Consequently, the impact of spatial movement in a person's day is often not represented in epidemiology studies. There are some studies that have estimated movement-based exposure, using mobile phone data,16,17,26 activity-based data27 or agent-based models.18 However, their spatial resolution is coarse, ranging from 400 m16 up to 3 km.17,27
The lack of spatial resolution via ground measurements primarily exists because dense networks of regulatory monitoring stations aren't feasible due to their high initial capital investment and ongoing maintenance costs (USD 10000–100000 per pollutant). To overcome the shortcomings associated with monitoring stations, lower-cost sensing technologies have increasingly been used as an alternative due to a combination of improved sensor technologies and researcher-developed methods for sensor calibration.28–32 Due to the low-cost and low power demands of low-cost sensors (LCS), they can be deployed to form a dense network, which can assist in capturing small-scale spatial variations. As such, although there are disadvantages associated with using low-cost sensors (sensitivities to environmental conditions28,33 and other pollutants,29,34 drifting of sensor readings35 that typically require calibration across the full range of meteorological conditions and pollutant concentrations), there are opportunities to use LCS to increase our understanding of air pollution exposure. By combining high spatial and temporal resolution surface maps of PM2.5 modelled from dense LCS networks,36 indoor–outdoor ratios in different micro-environments (home, commercial buildings, vehicles) and activity-based breathing rates, more accurate personal exposures can be estimated.37
In our previous work,36 we used data from a network of 47 low-cost PM2.5 sensors deployed between January–December 2017 in Allegheny County, Pennsylvania to develop land use regression models to predict daily PM2.5 concentrations at each 50 m × 50 m grid in Allegheny County in 2017 (see Sections S1 and S2 of the ESI† for more details on the data collection, days of data for each sensor, links to data repositories and the QA/QC protocols for the data from this prior study). In this work, we use these daily ambient PM2.5 predictions to compare the base case used in epidemiology (PM2.5 estimated at home addresses; static models) with an estimate where people spend time at both home and work/commercial locations (dynamic models). In this work, we use the term ‘exposure’ as a proxy for time-weighted ambient concentrations that the population experiences. As such, indoor concentrations are not considered for this work. Additionally, we have excluded exposure during commuting or transit; i.e., we are not replicating personal exposure. Instead, the term ‘movement’ implies that people aren't necessarily always located in the same place and may move from one land-use type to another. Overall, we aim to highlight the potential utility of the high spatiotemporal resolution ambient PM2.5 concentrations surface maps made possible by LCS networks on these exposure estimates.
While full details are available in Jain et al.,36 briefly, to build the land use regression models the collected RAMP data was processed using signal decomposition (wavelet decomposition) into 4 separate signals:40 (1) regional concentrations, (2) persistent enhancements above the regional background (lasting >8 h), (3) long-lived (2–8 h) events and (4) short-lived (<2 h) events. The latter three signals were individually modeled using land-use random forests (LURF) and subsequently added together with the regional concentrations to re-create the total concentration and tested for validation using the leave-one location-out cross-validation (LOLOCV) technique.41,42 Various spatial and temporal variables were used as predictors in the model. The variables used in the final models can be found in Fig. 1 (blue box). Full details of all variables assessed are detailed in Jain et al.36 (see Table S2 in Section S4 in the ESI† for a summary). The value in brackets refers to the buffer sizes. Multiple buffer sizes represent different buffers used for different signals. Detailed information on the steps followed for prediction modeling for this work can be found in Section S5 of ESI.†
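The leave-one-location-out validation mentioned above can be illustrated with a minimal sketch. It only shows how the folds would be formed (all records from one sensor site are held out at a time); the site labels are hypothetical, and the actual model fitting in Jain et al.36 is not reproduced here.

```python
from typing import Iterator, List, Tuple


def leave_one_location_out(
    site_ids: List[str],
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx), holding out every record of one site per fold."""
    for held_out in sorted(set(site_ids)):
        test_idx = [i for i, s in enumerate(site_ids) if s == held_out]
        train_idx = [i for i, s in enumerate(site_ids) if s != held_out]
        yield train_idx, test_idx


# Hypothetical example: five daily records collected at three sensor sites.
sites = ["A", "A", "B", "C", "C"]
folds = list(leave_one_location_out(sites))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Because a whole site is withheld in each fold, the score reflects how well the model predicts at a location it has never seen, which is the situation faced when predicting at unmonitored grid cells.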
Fig. 1 Flowchart of steps involved in this work. The blue box at the top represents the results from Jain et al.36 used for this work. The grey boxes are the outcomes. EPA CO and EPA PM in the blue box refer to daily measurements of CO and PM2.5 by the US EPA's Lawrenceville site in the city of Pittsburgh.
The land use random forest model from Jain et al.36 was then applied to the City of Pittsburgh using a grid size of 50 × 50 m (total grids = 57768) to quantify the small-scale variations in PM2.5 concentrations. Grids where ≥50% of the spatial predictor variables exceeded the training model limits (both upper and lower limits from 47 training sites) were excluded from the assessment since random forests are incapable of extrapolation (remaining grids = 44595, 77% retained).
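The exclusion rule described above could be sketched as follows. This is an illustration only: the predictor names and ranges are hypothetical placeholders, and the function simply encodes "drop a grid when at least 50% of its predictors lie outside the range spanned by the training sites".

```python
from typing import Dict, Tuple


def keep_grid(
    predictors: Dict[str, float],
    training_range: Dict[str, Tuple[float, float]],
    max_outside_fraction: float = 0.5,
) -> bool:
    """Return True when fewer than `max_outside_fraction` of the predictors
    fall outside the [min, max] range observed at the training sites."""
    outside = sum(
        1
        for name, value in predictors.items()
        if not (training_range[name][0] <= value <= training_range[name][1])
    )
    return outside / len(predictors) < max_outside_fraction


# Hypothetical predictor ranges from the training sites (placeholder values).
training_range = {
    "road_length_m": (0.0, 500.0),
    "elevation_m": (200.0, 400.0),
    "restaurant_count": (0.0, 30.0),
    "housing_density": (0.0, 50.0),
}
grid_a = {"road_length_m": 120.0, "elevation_m": 410.0,
          "restaurant_count": 5.0, "housing_density": 12.0}   # 1 of 4 outside
grid_b = {"road_length_m": 900.0, "elevation_m": 150.0,
          "restaurant_count": 5.0, "housing_density": 12.0}   # 2 of 4 outside

print(keep_grid(grid_a, training_range), keep_grid(grid_b, training_range))
print(round(44595 / 57768, 2))  # retained fraction reported in the text
```

The last line reproduces the retained share quoted above (44595 of 57768 grids, about 77%).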
Daily predictions were then consolidated at each grid in three ways: (1) annual average concentrations, (2) average winter (November–April) and average summer concentrations (May–October) and (3) average weekday and weekend concentrations. As such, daily predictions were separated seasonally (either summer or winter) or weekly (either weekday or weekend) and then averaged for each grid.
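As a rough sketch of this consolidation step for a single grid cell, the daily values can be grouped by season and by day type and then averaged. The daily concentrations below are placeholder values, not model output; only the month and weekday definitions come from the text.

```python
from datetime import date, timedelta
from statistics import mean
from typing import Dict, List


def season(d: date) -> str:
    """Summer is May-October, winter is November-April (as defined in the text)."""
    return "summer" if 5 <= d.month <= 10 else "winter"


def day_type(d: date) -> str:
    """Saturday and Sunday are weekend days; Monday-Friday are weekdays."""
    return "weekend" if d.weekday() >= 5 else "weekday"


def consolidate(daily: Dict[date, float]) -> Dict[str, float]:
    """Average one grid cell's daily predictions annually, by season and by day type."""
    groups: Dict[str, List[float]] = {
        "annual": [], "summer": [], "winter": [], "weekday": [], "weekend": []
    }
    for d, conc in daily.items():
        groups["annual"].append(conc)
        groups[season(d)].append(conc)
        groups[day_type(d)].append(conc)
    return {name: mean(values) for name, values in groups.items() if values}


# Placeholder series for 2017: 10.0 on weekdays and 11.0 on weekends (illustrative only).
start = date(2017, 1, 1)
daily = {}
for offset in range(365):
    d = start + timedelta(days=offset)
    daily[d] = 11.0 if day_type(d) == "weekend" else 10.0

summary = consolidate(daily)
print(summary["weekday"], summary["weekend"])
```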
We chose to assess population exposure using this split between residential and commercial areas to acknowledge that a population spends time in both areas almost every day, but the exposure profiles might be different due to various factors (e.g., higher vehicle emissions in commercially-dense areas). The exposure estimate can be further improved by tracking individual people via personal notes or cellular network data. However, such movement data were not available, which is a recognized limitation of this work.
$$n = \left(\frac{z\,\sigma}{\mathrm{MOE}}\right)^{2} \qquad (1)$$
In eqn (1), z represents the desired confidence level (z = 1.96 at 95% CI) and σ is the standard deviation (annual average at each grid cell, 1.88 μg m−3). MOE, margin of error, is the acceptable tolerance level or sensitivity, set as the least count of PM2.5 measured by the RAMPs for this work (=0.01 μg m−3). With these inputs, the number of samples was determined to be approximately 140000, as shown in eqn (2).
$$n = \left(\frac{1.96 \times 1.88}{0.01}\right)^{2} \approx 140\,000 \qquad (2)$$
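The arithmetic behind this sample size can be checked with a short sketch. It assumes eqn (1) takes the standard margin-of-error form n = (zσ/MOE)², which is consistent with the symbols defined above; the function name is illustrative.

```python
def sample_size(z: float, sigma: float, moe: float) -> float:
    """Sample size for estimating a mean to within +/- moe,
    assuming the standard form n = (z * sigma / moe) ** 2."""
    return (z * sigma / moe) ** 2


# Inputs quoted in the text: z = 1.96, sigma = 1.88 ug/m3, MOE = 0.01 ug/m3.
n = sample_size(z=1.96, sigma=1.88, moe=0.01)
print(round(n))       # about 1.36e5
print(round(n, -4))   # rounded to the nearest ten thousand
```

Under this assumption the formula gives roughly 136000, which rounds to the 140000 samples used in this work.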
The daily predictions were then separated into ‘weekday’ and ‘weekend’ concentrations for each grid to acknowledge differences in human movement patterns on different days of the week, and then averaged annually to estimate total annual exposure for the different models (i.e., the weekend concentration of a sample is the average concentration over all 52 Saturdays and Sundays in 2017 at the selected grid cell; see Section S8 in the ESI† for more details).
Sampling was achieved using a weighted stratified sampling (with replacement) method, in which the population is divided into homogeneous strata (strata for this work: weekdays and weekends) and samples are selected from each stratum based on the assigned probability weights (Fig. 2). The probability weights PR and PC were used to calculate sampling fraction, such that, grids with higher probability weights were sampled more. For instance, between two grid cells with PR values as 0.1 and 0.2, the latter is twice as likely to be picked as a sample.
Fig. 2 Flowchart for weighted stratified samples and resultant static and dynamic models. ‘a’ and ‘b’ represent the hours spent in the selected land-use type over weekdays or weekends.
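The weighted draw within one stratum can be sketched as below, assuming sampling with replacement in proportion to the probability weight. The grid identifiers and weights are hypothetical; the same draw would be repeated for the weekday and weekend strata and for PR and PC.

```python
import random
from typing import List, Sequence


def weighted_sample(
    grid_ids: Sequence[int], weights: Sequence[float], k: int, seed: int = 0
) -> List[int]:
    """Draw k grid cells with replacement, with probability proportional to weight."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return rng.choices(grid_ids, weights=weights, k=k)


# Hypothetical grid cells with residential weights PR of 0.1, 0.0 and 0.2.
grid_ids = [101, 102, 103]
p_r = [0.1, 0.0, 0.2]

draws = weighted_sample(grid_ids, p_r, k=30000)
ratio = draws.count(103) / draws.count(101)
print(round(ratio, 2))  # close to 2: a weight of 0.2 is picked about twice as often as 0.1
```

A cell with zero weight (here 102) is never selected, so purely commercial cells would not enter the residential sample and vice versa.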
For each of the land use types (residential and commercial), a total of 140000 samples were taken. The total number of samples was then divided into weekday and weekend concentrations based on population size of the strata (weekday size = 5 days, weekend size = 2 days). Therefore, 5/7 × 140000 = 100000 samples were taken of weekday concentrations and 2/7 × 140000 = 40000 samples were taken of weekend concentrations. The samples for residential and commercial areas were assessed for statistically significant differences using the Welch two sample t-test, which was chosen since it doesn't assume that the two data sets have equal variances.
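A minimal sketch of the proportional allocation and of the Welch statistic is given below. The two small data sets are hypothetical and only exercise the formula; the degrees of freedom use the Welch-Satterthwaite approximation, and the p-value would then be read from a Student t distribution.

```python
from math import sqrt
from statistics import mean, variance
from typing import Dict, Sequence, Tuple


def allocate(total: int, stratum_sizes: Dict[str, int]) -> Dict[str, int]:
    """Split `total` samples across strata in proportion to stratum size."""
    whole = sum(stratum_sizes.values())
    return {name: round(total * size / whole) for name, size in stratum_sizes.items()}


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.
    Unlike the pooled t-test, the two sample variances are kept separate."""
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


allocation = allocate(140000, {"weekday": 5, "weekend": 2})
print(allocation)

# Hypothetical samples with clearly unequal variances.
residential = [1.0, 2.0, 3.0, 4.0, 5.0]
commercial = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, dof = welch_t(residential, commercial)
print(round(t_stat, 3), round(dof, 2))
```

In practice a statistics library would normally be used for the full test, for example SciPy's `scipy.stats.ttest_ind(a, b, equal_var=False)`.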
(3) |
The ‘Static’ models assume that residents spend 24 hours per day in residential areas. ‘Dynamic’ models were defined as the models that account for movement between commercial and residential areas. These models were created using eqn (4) and (5) and were used to estimate the difference in exposure to ambient PM2.5 due to daily movement.
(4) |
(5) |
In eqn (4) and (5), R and C refer to sample concentrations taken for residential and commercial areas respectively. Subscripts D and E are the time periods, used for weekdays and weekends respectively (e.g., RD is the sampled concentration for the sample residential area over weekdays). α represents the number of hours spent in residential areas over weekdays, whereas β represents the number of hours spent in residential areas over weekends. As such, the individual exposure level will vary depending on the amount of time spent in each area.
For our analysis, α and β were estimated as 12 and 18 hours respectively for the Dynamic models, informed by data provided by US Bureau of Labor Statistics,45 to facilitate comparison with static models. The concentrations for static and dynamic models were assessed for statistically significant differences using the Welch two sample t-test. As mentioned previously, this test was chosen as it doesn't assume that the populations have equal variances.
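The fractional split can be illustrated with a short sketch. The functional form here is an assumption: concentrations are weighted by hours out of 24 within each day type, and weekdays and weekends are combined with 5:2 weights; it is consistent with the definitions of R, C, α and β above but is not a reproduction of eqn (4) and (5). The input concentrations are illustrative values, chosen so that commercial areas are 1.1 μg m−3 above residential areas.

```python
def time_weighted(res: float, com: float, hours_res: float) -> float:
    """Daily concentration when `hours_res` of 24 h are spent in residential
    areas and the remainder in commercial areas (assumed form)."""
    return (hours_res * res + (24 - hours_res) * com) / 24


def annual_exposure(r_d: float, c_d: float, r_e: float, c_e: float,
                    alpha: float, beta: float) -> float:
    """Combine weekday (D) and weekend (E) exposure with assumed 5:2 day weights."""
    weekday = time_weighted(r_d, c_d, alpha)
    weekend = time_weighted(r_e, c_e, beta)
    return (5 * weekday + 2 * weekend) / 7


# Illustrative concentrations in ug/m3 (same values used for weekdays and weekends).
R, C = 11.8, 12.9

static = annual_exposure(R, C, R, C, alpha=24, beta=24)   # always at home
dynamic = annual_exposure(R, C, R, C, alpha=12, beta=18)  # alpha, beta from the text
diff = dynamic - static
print(round(static, 2), round(dynamic, 2), round(diff, 2))
```

With these illustrative inputs the difference is about 0.47 μg m−3. Setting α = β = 0 (all time in commercial areas) returns the full 1.1 μg m−3 gap, which is the upper bound of this simple model.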
The models (eqn (4) and (5)) are a fractional split and are not a true representation of time spent in residential or commercial areas (i.e., the model isn't informed by sub-daily movement; PM2.5 concentrations are modeled as daily averages). While the RAMP sensors have sub-daily measurement time resolution, PM2.5 concentrations were modeled as daily averages due to prediction modeling constraints (specifically a lack of sub-daily model inputs such as hourly traffic) and this is an identified limitation of this work.
Nonetheless, we observed sub-daily variations at the 47 sites where low-cost sensors were deployed (Fig. 3). Although the nighttime concentrations were similar across the residential and commercial land use types, the sites with high commercial density (PC) were characterized by higher concentrations during daytime. As such, the static and dynamic models compared in this study are likely to have larger differences than reported here; this is because people are more likely to be in a commercial area at 2 PM (when the concentrations are higher in commercial areas) than 2 AM.
On average, summer (May–October) had higher concentrations than winter (November–April), with mean summer and winter concentrations of 13.4 and 11.0 μg m−3, respectively. Weekend (Saturday–Sunday) concentrations were 10% higher than weekday concentrations, with average weekend and weekday concentrations of 13.1 and 11.9 μg m−3, respectively, across all grids. Fig. 5 shows the spatial variations in annual averages of predicted PM2.5 during weekdays (Monday–Friday) and weekends (Saturday–Sunday) and during summer (May–October) and winter (November–April) seasons.
Fig. 5 Spatial variations in annual averages of predicted PM2.5 during weekdays (Monday–Friday) and weekends (Saturday–Sunday) and during summer (May–October) and winter (November–April) seasons.
Overall, our results highlight important temporal variations in concentrations. Summer concentrations were 20% higher than winter concentrations, and weekend concentrations were 10% higher than weekday concentrations. These patterns are consistent with PM2.5 data obtained via a US EPA monitor in the City of Pittsburgh, which showed approximately 1.1 μg m−3 higher concentrations over weekends compared to weekdays.47 The higher weekend concentrations are likely due to increased traffic (especially trucks) in Allegheny County on weekends.48 While previous studies have examined daily variations in concentrations,16,18 our findings reinforce the existence of temporal variations and suggest the potential for reducing short-term exposure through behavioral changes, such as choosing lower traffic roads for periods of active transportation (e.g., walking, cycling) when out on weekends.
Upon visual inspection, highways and major roads were found to have higher predicted PM2.5 concentration (red lines in Fig. 5), which is a typical pattern that was expected since highways and major roadways experience elevated PM2.5 concentrations due to emissions from combustion, brake wear, tire wear, and resuspended dust.49 The figure also indicates downtown Pittsburgh had higher concentrations, which can be attributed to higher traffic densities and high restaurant density.50 These results are also comparable to black carbon spatial maps prepared in the Breathe Project.51 As such, along with details about personal movement between different areas (e.g., between different grid cells), the maps developed in Fig. 5 can be a useful tool in estimating the exposure of an individual.
When the weighted stratified samples were separated into grid cells labeled as residential or commercial, the commercial areas had 0.4 μg m−3 higher median values. The mean for commercial areas was 1.1 μg m−3 higher (sample standard deviations: σresidential: 0.7 μg m−3; σcommercial: 2.6 μg m−3) (Fig. 4) and the difference between averages was found to be statistically significant (p < 0.05). Additionally, the overall range of concentrations that the modeled population was exposed to in commercial areas was noticeably higher, with the difference in 90th percentile concentrations up to 3.7 μg m−3 (30%).
Both measurement and modeling uncertainties pertain to this work. We estimated substantially higher normalized mean error (10–50% higher) for modeling by considering the range of outputs from the random forests, and therefore assumed uncertainties in modeling to have an overall higher effect. For uncertainties due to random forest modeling, we extracted the 5th and 95th percentile values, along with mean values, from the decision trees in the random forest model. This detailed uncertainty analysis can be found in the ESI, Section S9.† Broadly speaking, although absolute difference between static and dynamic models differed when uncertainties were taken into account, we found that average concentrations at commercial areas were always higher. As such, addressing the uncertainties reinforced our results that the average ambient PM2.5 concentration that the population was exposed to was always higher when the population stays in commercially-dense areas or moves between residential and commercial areas vs. when the population stays in residential areas only.
We used the annual average of the daily predicted concentrations for assessment of the static and dynamic models. The differences between the static and dynamic models were found to be statistically significant (p < 0.05), observed for 140000 samples. The difference in concentration between different land-use types (residential and commercial) resulted in variations in exposure, i.e., this resulted in higher exposure across population for dynamic models compared to static models. In all instances, the dynamic model had higher average exposure compared to the static model, with an average difference up to 1.1 μg m−3, when population spends all their time in commercial areas (Fig. 6). To understand potential individual mobility effects, we also report the 10th and 90th percentile of differences between the static and dynamic models (10th percentile: 0.1; 90th percentile: 3.73 μg m−3), suggesting that for an individual, pollutant exposure differences may be as high as approximately 4 μg m−3.
Fig. 6 Scalar graph of difference in exposure between static and dynamic models informed by amount of time spent in residential area over weekdays and weekends separately, calculated using eqn (4) and (5).
When assessed for α and β as 12 and 18 hours respectively in eqn (5), the mean exposure using the dynamic model was 0.5 μg m−3 higher compared to the static model, and the difference between averages was found to be statistically significant (p < 0.05; static = 11.8 μg m−3, s = 1.0 μg m−3; dynamic = 12.3 μg m−3, s = 1.3 μg m−3) (see Section S10, ESI†). The 90th percentile concentration for the dynamic model was 0.9 μg m−3 (7%) more than the static model. The mean difference (MD) and mean absolute error (MAE) were 0.5 μg m−3 and 0.7 μg m−3, respectively. The mean difference was higher over the weekdays (0.6 μg m−3) compared to weekends (0.3 μg m−3).
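The two summary metrics can be sketched as follows, assuming they are computed over paired static and dynamic samples (the pairing is an assumption, and the values are hypothetical). The sketch also shows why MAE can exceed MD: whenever some samples have a lower dynamic than static exposure, the negative differences cancel in MD but not in MAE.

```python
from statistics import mean
from typing import Sequence


def mean_difference(static: Sequence[float], dynamic: Sequence[float]) -> float:
    """Average signed difference (dynamic minus static)."""
    return mean([d - s for s, d in zip(static, dynamic)])


def mean_absolute_error(static: Sequence[float], dynamic: Sequence[float]) -> float:
    """Average magnitude of the difference, ignoring its sign."""
    return mean([abs(d - s) for s, d in zip(static, dynamic)])


# Hypothetical paired exposures in ug/m3; the second pair has dynamic < static.
static_samples = [11.0, 12.0, 12.5]
dynamic_samples = [11.8, 11.9, 13.3]

md = mean_difference(static_samples, dynamic_samples)
mae = mean_absolute_error(static_samples, dynamic_samples)
print(round(md, 2), round(mae, 2))
```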
A few studies have previously calculated dynamic exposures and the impact of movement on exposure estimates. Nyhan et al.16 used mobile network data for mobility and estimated a difference between static and dynamic models of 0.02 μg m−3. Similarly, Lu18 used agent-based models and estimated a difference of 0.05 μg m−3. However, the above-mentioned research lacked fine spatial resolution (≤50 m), which is potentially one of the reasons behind the smaller differences between the static and dynamic models than what was observed here (0.5 μg m−3, for the typical case we have considered). This may be due to our models capturing fine spatial scale variations in PM2.5 concentration. Our work also supports the importance of low-cost sensors to improve exposure estimates. This is in line with the findings of Lu.18 However, our approach is likely less computationally intensive than agent-based models and, as a result, may be more readily applied elsewhere. Additionally, none of the previous studies to our knowledge have separately analyzed weekday and weekend concentrations, which is an important outcome of our work and is recommended in future studies.
While this study excludes important aspects of true personal exposure (time indoors, exposure during commuting), the relative impact of higher concentrations in commercial areas is supported by other research on indoor/outdoor ratios used to convert ambient concentrations into indoor pollution estimates. For example, in Stamp et al.,52 indoor–outdoor ratios were determined hourly in London in several environments including an office building and apartments over a 6–9 month period. They found that the indoor–outdoor ratios were strongly influenced by building activity patterns with an average increase in the office indoor–outdoor ratio from 0.5 during non-operating hours to approximately 0.71 during operating hours. While this is only one study, it underscores that time spent in commercial zones (whether indoors or outdoors) is likely an important consideration for improving personal exposure estimates.
Given our findings, a centrally-located monitoring station is not recommended for exposure assessment of the whole population as it could result in negative biases in health effect estimates, i.e., we may be underestimating exposure by using a few centrally-located monitors and residential address. Even though absolute PM2.5 concentration differences in this study were small, the resulting impact on health may still be substantial. This is supported by a recent report from the Health Effects Institute describing that even a 4.16 μg m−3 (one interquartile range in the study population long-term concentrations) increase in average annual PM2.5 concentration is associated with a 1.034 hazard ratio for total nonaccidental death [95% CI: 1.030–1.039].53 Furthermore, this same study concluded that there was no PM2.5 concentration below which no health effects were observed.53 This suggests that even small-scale reductions in PM2.5 concentration are beneficial and this warrants further research.
This study used low-cost sensor network data to create spatiotemporal pollutant concentration models and investigated the models' utility to identify hotspots and subsequent variations in exposure to ambient PM2.5 based on location and movement patterns. However, this work doesn't capture the unique movement of an individual, which is one of the identified limitations. Capturing it would require movement data via cellular networks or personal notes, both of which were outside the scope of this work. Additionally, daily pollutant concentrations are used to approximate sub-daily movement. This is due to a lack of time resolution in prediction model inputs, such as hourly average traffic volume, and is another identified limitation of this work. Going forward, if the appropriate sub-daily predictors become available, we recommend the development of hourly pollutant land use regression models, which could then be paired with agent-based models to simulate individual daily exposures. This work also assumes that indoor concentrations of PM2.5 are comparable to outdoor concentrations. To date, most epidemiology studies assume a static indoor–outdoor ratio;54 as such, the conclusions of this work may remain unchanged if indoor concentrations are introduced under this assumption. However, we recommend that this assumption be routinely reassessed as more dynamic indoor–outdoor ratios (hourly or better) across many micro-environments are made available. Lastly, this work defines residential and commercial areas based on the residential and commercial densities via land cover information.43 The outcomes of this research may vary if alternative definitions are used for demarcating these areas.
Going forward, it is likely that buildings and vehicles will become increasingly optimized using Internet of Things (IoT) devices as part of smart city infrastructure; such infrastructure will likely include air quality sensors. This IoT infrastructure could be used to refine indoor–outdoor ratios and activity patterns, which paired with ambient LCS networks, such as the one used here, could address many of these gaps.
Footnote
† Electronic supplementary information (ESI) available: More information on the selection of prediction models and variables, spatial distribution at 100 m buffers for residential and commercial areas, effect of LOD on total amount of data, average daily concentrations and uncertainties in measurements and models. See DOI: https://doi.org/10.1039/d3ea00051f
This journal is © The Royal Society of Chemistry 2023