Engineering Design Projects (ES 100), the capstone course at the Harvard John A. Paulson School of Engineering and Applied Sciences (SEAS), challenges seniors to engineer a creative solution to a real-world problem.
Real-time, high-resolution downscaling of fine particulate matter (PM2.5) in the U.S. using machine learning
Maggie Shultz, S.B. '23, Environmental Science and Engineering
Please give a brief summary of your project.
Air pollution accounts for 100,000 deaths each year in the United States. Accurate measurements are paramount to better understanding the causes and effects of particulate pollution and thus to designing emissions reduction strategies. This project uses microscale features of the local environment to create an accurate model of fine particulate matter pollution, PM2.5, in the U.S. in real time. I used a machine learning algorithm known as a random forest, which I trained on two years of historical data, including temperature, humidity, wind, population, elevation, longitude and latitude, and fires. I then applied the trained model to real-time sources of the same data to predict PM2.5 concentrations and the associated air quality index for a given latitude, longitude, and time.
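For readers curious about the last step mentioned above, turning a PM2.5 concentration into an air quality index (AQI), the following is a minimal sketch and not the project's actual code. It assumes the pre-2024 U.S. EPA breakpoints for 24-hour PM2.5 (the EPA revised this table in 2024) and the EPA's piecewise-linear interpolation between them; the article does not say which table or averaging method the project used. The function name `pm25_to_aqi` and the cap at 500 for off-scale values are illustrative choices.

```python
from decimal import Decimal, ROUND_DOWN

# Pre-2024 U.S. EPA breakpoints for 24-hour PM2.5, in micrograms per cubic meter:
# (conc_low, conc_high, aqi_low, aqi_high).
# Assumption: the article does not state which breakpoint table the project used.
PM25_BREAKPOINTS = [
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
    (150.5, 250.4, 201, 300),
    (250.5, 350.4, 301, 400),
    (350.5, 500.4, 401, 500),
]


def pm25_to_aqi(concentration):
    """Convert a PM2.5 concentration (ug/m3) to an AQI value (hypothetical helper)."""
    if concentration < 0:
        raise ValueError("concentration must be non-negative")

    # The EPA method truncates PM2.5 to one decimal place before the lookup.
    # Decimal avoids floating-point surprises when truncating.
    truncated = Decimal(str(concentration)).quantize(Decimal("0.1"), rounding=ROUND_DOWN)
    c = float(truncated)

    for c_low, c_high, i_low, i_high in PM25_BREAKPOINTS:
        if c_low <= c <= c_high:
            # Linear interpolation within the matching breakpoint band.
            aqi = (i_high - i_low) / (c_high - c_low) * (c - c_low) + i_low
            return round(aqi)

    # Illustrative simplification: values above the top breakpoint are capped at 500.
    return 500


if __name__ == "__main__":
    for value in (8.0, 12.0, 40.0, 160.0):
        print(value, "->", pm25_to_aqi(value))
```

In a pipeline like the one described, the concentration passed to such a function would be the model's prediction for a given latitude, longitude, and time.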
How did you come up with this idea for your final project?
While studying abroad in Paris in 2019, I was running along a heavily trafficked highway and could see lots of smog in the air. I wondered if there was a way to plan running routes for the times and locations that would minimize exposure to air pollution.
What real-world challenge does this project address?
This project aims to build a public model for predicting exposure to PM2.5 in real time, which is not currently available within academia.
What was the timeline of your project?
I started working on my project in January of 2022 and conducted a thorough literature review to understand the current state of the field. Then over that spring, I acquired the data I needed for the project and built an initial prototype of the model for a subset of the data. I expanded to the full dataset to finalize the model at the beginning of this fall. I spent the subsequent few weeks building a script to acquire the data the model needs in real time and output a prediction, along with a visual graphic illustrating the user's current predicted PM2.5 concentration.
What part of the project proved the most challenging?
The data set and model were too large for my laptop's memory, so I had to do my work remotely on a supercomputing cluster, which I had never worked with before. There was a steep learning curve.
What part of the project did you enjoy the most?
I enjoyed working with the data and learning something new every day. I liked the challenge of running into roadblocks and having to troubleshoot ways to work around a problem.
What did you learn through this project?
I learned basically everything I know about coding and machine learning through this project. I went into the project being vaguely familiar with Python, and now I can do thorough data analysis in Python and R, which is a huge accomplishment for me.
Press Contact
Matt Goisman | mgoisman@g.harvard.edu