armahdavi/unsupervised-clustering-ml---pm_source_detection-indoor-air

Unsupervised Clustering Machine Learning (ML) to Estimate Indoor PM Emission Duration

How long do you cook or smoke indoors?

Combustion activities such as cooking and smoking are major contributors to indoor particulate matter (PM) emissions. Because we spend most of our lives indoors (> 85%), primarily at home, identifying and understanding combustion-driven PM sources is crucial. Knowing the duration and intensity of these activities helps us judge how urgently exposure mitigation strategies are needed, such as shortening smoking or cooking time and using a range hood. PM poses significant long-term health risks, including lung cancer, heart disease, asthma exacerbation, and premature death.

This Repository

In this repository, I developed an unsupervised machine learning (ML) algorithm that uses k-means clustering and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) to detect and characterize combustion-driven source regimes for PM2.5, and to calculate the duration and emission rate of PM2.5 within those regimes. The approach is feasible because PM2.5 concentrations follow distinct temporal patterns during source emission regimes compared with other periods. Because no labeled data on PM2.5 sources were available, an unsupervised learning framework was required. This work is a side project of the Mahdavi et al. (2021) paper published in "Environmental Pollution", in which PM concentrations were measured with optical sensors in a single-family home for close to six weeks in summer 2018. The full-length article can be found in the "About" section.

Brief Abstract of the Work

PM2.5 and other particulate metrics (total suspended particles (TSP), PM10, and PM1) were measured with a TSI DustTrak particle monitor over a six-week period, and the resulting one-minute time-series concentrations were analyzed. Key features were extracted, including concentration, rate of change in concentration, outdoor concentration, nearby peak concentrations, proximity to baseline levels, wind speed and direction (which affect the air exchange rate), and HVAC runtime status. These features were used to identify source, decay, and baseline clusters, with the source clusters determined using the k-means and/or DBSCAN unsupervised ML algorithms.
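To make the clustering step concrete, here is a minimal sketch using scikit-learn on a synthetic one-minute series. It is not the repository's code: the synthetic data, the choice of only two features (concentration and its rate of change), the number of clusters, the `eps` and `min_samples` values, and the rules used to pick the source cluster are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic one-minute PM2.5 series (ug/m3), for illustration only:
# a flat baseline plus two 20-minute emission events, each followed by decay.
rng = np.random.default_rng(0)
n_minutes, baseline = 720, 5.0
conc = np.full(n_minutes, baseline)
for start in (100, 400):
    for t in range(start, n_minutes):
        if t < start + 20:
            conc[t] = conc[t - 1] + 10.0  # source regime: steady rise
        else:
            conc[t] = baseline + (conc[t - 1] - baseline) * np.exp(-0.03)  # decay
conc += rng.normal(0.0, 0.3, n_minutes)  # sensor noise

# Two features per minute: concentration and its rate of change (ug/m3 per min).
rate = np.gradient(conc)
X = StandardScaler().fit_transform(np.column_stack([conc, rate]))

# k-means with three clusters (nominally source, decay, baseline). The cluster
# whose centroid has the highest rate of change is taken as the source regime.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
source_cluster = int(np.argmax(km.cluster_centers_[:, 1]))
kmeans_source = km.labels_ == source_cluster

# DBSCAN: the dense baseline forms the dominant cluster. As an assumed rule,
# points outside that cluster with a rising concentration are flagged as source.
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
labels, counts = np.unique(db.labels_, return_counts=True)
dominant_label = int(labels[np.argmax(counts)])
dbscan_source = (db.labels_ != dominant_label) & (rate > 0)

print("k-means source minutes:", int(kmeans_source.sum()))
print("DBSCAN source minutes:", int(dbscan_source.sum()))
```

Standardizing the features first matters here because both algorithms are distance-based, and concentration and rate of change are on very different scales.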

Results indicated that k-means and DBSCAN are effective for identifying extended source emission durations for PM2.5. Models using only a few features (e.g., concentration and its rate of change) significantly outperformed those using many features, even when the latter were combined with dimensionality reduction techniques such as principal component analysis (PCA) or t-distributed stochastic neighbor embedding (t-SNE). Extensive source emissions were estimated to occupy 1.2% and 2.9% of the monitoring period by k-means and DBSCAN, respectively, showing that high-emission events, such as cooking, are typically brief. This finding is consistent with survey data. However, DBSCAN slightly overestimated durations, likely because it also captured normal concentration increases from other sources, such as exterior infiltration.
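Once each minute carries a source or non-source label, the duration figures follow from simple counting. The sketch below assumes a boolean per-minute mask (for example, the output of a clustering step) and reports the share of samples flagged as source and the length of each contiguous event. The function name and the example mask are hypothetical, and the numbers are made up rather than taken from the study.

```python
import numpy as np

def source_duration_summary(is_source, minutes_per_sample=1.0):
    """Return (fraction of samples flagged, array of event durations in minutes)."""
    is_source = np.asarray(is_source, dtype=bool)
    fraction = float(is_source.mean()) if is_source.size else 0.0
    # Pad with False on both sides so every event has a start and an end edge.
    padded = np.concatenate(([False], is_source, [False])).astype(int)
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)   # index where an event begins
    ends = np.flatnonzero(edges == -1)    # index just past where it ends
    durations = (ends - starts) * minutes_per_sample
    return fraction, durations

# Made-up mask: 1000 minutes with two flagged events of 15 and 10 minutes.
mask = np.zeros(1000, dtype=bool)
mask[100:115] = True
mask[500:510] = True

fraction, durations = source_duration_summary(mask)
print(f"{100 * fraction:.1f}% of samples flagged as source")
print("event durations (min):", durations)
```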

The source regimes characterized by the unsupervised ML algorithms can then inform mass balance equations, giving indoor air quality (IAQ) experts more accurate estimates of indoor particle concentrations and emission rates.
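As an illustration of how labelled regimes could feed a mass balance, the sketch below uses a generic well-mixed, single-zone model, dC/dt = E/V - L (C - C_bg), where E is the emission rate, V the zone volume, L the total particle loss rate, and C_bg the background concentration. This textbook form is an assumption and not necessarily the formulation used in the paper. The function names, the 300 m3 volume, and all numeric values are hypothetical.

```python
import numpy as np

def loss_rate_from_decay(conc, background, dt_min=1.0):
    """Fit ln(C - C_bg) against time over a decay period; return L in 1/min.

    Requires every concentration in the period to exceed the background.
    """
    excess = np.asarray(conc, dtype=float) - background
    t = np.arange(excess.size) * dt_min
    slope, _ = np.polyfit(t, np.log(excess), 1)
    return -slope

def mean_emission_rate(conc, background, loss_rate, volume_m3, dt_min=1.0):
    """Mean emission rate (ug/min) over a source period.

    Integrates the mass balance over the period:
    E = V * [(C_end - C_start) / T + L * time-mean(C - C_bg)].
    """
    conc = np.asarray(conc, dtype=float)
    period = (conc.size - 1) * dt_min
    excess = conc - background
    mean_excess = ((excess[:-1] + excess[1:]) / 2.0).mean()  # trapezoidal mean
    return volume_m3 * ((conc[-1] - conc[0]) / period + loss_rate * mean_excess)

# Noise-free synthetic check with known (hypothetical) parameters.
V, L_true, E_true, C_bg = 300.0, 0.03, 1000.0, 5.0  # m3, 1/min, ug/min, ug/m3
t_src = np.arange(0, 21)  # 20-minute source period starting at background
c_src = C_bg + E_true / (V * L_true) * (1.0 - np.exp(-L_true * t_src))
t_dec = np.arange(0, 61)  # 60-minute decay period after the source stops
c_dec = C_bg + (c_src[-1] - C_bg) * np.exp(-L_true * t_dec)

L_est = loss_rate_from_decay(c_dec, C_bg)
E_est = mean_emission_rate(c_src, C_bg, L_est, V)
print(f"loss rate: {L_est:.4f} 1/min, emission rate: {E_est:.0f} ug/min")
```

In this sketch the decay cluster supplies the loss rate and the source cluster supplies the period over which the emission rate is averaged, which is one way the cluster labels could be used.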