A team of four Harvard students recently took second place at the inaugural Citadel Women’s Datathon in Miami.
Second-year students Christina Li, Hannah Zhou, Doris Yang and Vivian Yee had just six hours to analyze data sets about the development of green spaces, such as parks and tree cover, in cities and correlate them with data on mental and physical health. By bringing in outside data on historical financial discrimination in certain areas of cities, a practice known as “redlining,” the team was able to make specific policy recommendations about how and where to develop future green spaces.
“We realized that in the most heavily redlined areas, a higher green space index actually correlated with negative mental and physical health, whereas in areas that weren’t as heavily redlined, there were indicators that green spaces positively affected mental health,” said Yang, a statistics concentrator.
“Because green spaces obviously require a lot of land, developing them comes at the cost of perhaps more urgent needs, such as developing more affordable housing units or building a community health clinic.”
Zhou and Li, both computer science concentrators at the Harvard John A. Paulson School of Engineering and Applied Sciences (SEAS), discovered the datathon through their involvement with the Harvard Women in Computer Science (WiCS) student organization. When the four realized they’d all signed up as individuals, they decided to form a team.
“The datathon was very much a high-energy and focused environment,” said Yee, who’s concentrating in applied math and economics with a secondary in computer science. “Everyone was so fixated on their projects and delivering high-quality products. Although I was initially a little bit stressed, the collaborative nature of working with my team helped to calm my nerves.”
Redlining is a concept Zhou learned about in one of her current courses, “ECON50: Using Big Data to Solve Economic and Social Problems.” The practice dates back to the Great Depression, when banks refused to invest in certain areas of cities, which they marked off with red lines on maps.
Seeing that the provided data sets were organized by census demographics inspired the team to bring in redlining data, which gave their project a clear focus.
“Those places that were redlined were frequently populated by underrepresented minorities, so that just greatly exacerbated the racial inequities still at play today,” Li said.
Zhou added, “A really important part of research is defining the problem you want to solve, and that was something really great that the datathon forced me to think about.”
With their problem statement now defined, the team could spend the bulk of their six hours focused on cleaning and wrangling data and writing the report. Zhou focused more on the green space data, Yee on the redlining data, and Yang on the health data. As they tackled their individual sections, Li documented their process so the actual report would be easier to write at the end.
“Because we were writing as we went, we were able to ultimately cover more ground,” Li said. “As they were cleaning the data, I was writing down exactly what we did, which gave us more time to create more data plots. Even if we all tried to code at once, there were only so many things we could code at a certain point. It was good that we parallel processed, and that ultimately contributed to our success. When the judges gave us feedback at the award ceremony, they really liked that we had an actionable plan.”
Defining a problem, dividing labor and working under pressure are invaluable skills for engineers. But the datathon also taught the quartet that simpler data visualizations and models are sometimes the best choice: their report relied on linear regressions.
“In this age where artificial intelligence is so sexy and everyone is talking about neural networks and more complicated data science tools, at the end of the day our linear regressions worked pretty well,” Li said. “In class, we talk about how we balance interpretability with complex models, and the linear regressions that we did at the datathon were very interpretable.”
Yang added, “My professors have really emphasized linear regressions. Those are the most common models used in the real world, and you can derive a lot of insights from them. All the more fancy-sounding models are built off the principles used in linear regressions, and it was cool to see through this experience that we did derive super meaningful conclusions from just using linear regressions.”
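For readers curious what that kind of analysis can look like, here is a minimal sketch, not the team's actual code. It fits a separate one-variable linear regression for each redlining group and compares the slopes. The rows, the group labels and the function names are all invented for illustration, and the numbers were chosen only to mimic the pattern Yang described.

```python
from statistics import mean


def ols_slope_intercept(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Made-up illustrative rows, NOT the team's data:
# (redlining_group, green_space_index, health_score)
rows = [
    ("heavily_redlined", 0.1, 62.0),
    ("heavily_redlined", 0.3, 60.0),
    ("heavily_redlined", 0.5, 57.0),
    ("heavily_redlined", 0.7, 55.0),
    ("less_redlined", 0.1, 64.0),
    ("less_redlined", 0.3, 67.0),
    ("less_redlined", 0.5, 69.0),
    ("less_redlined", 0.7, 72.0),
]


def slopes_by_group(rows):
    """Group rows by redlining label and fit one regression per group."""
    groups = {}
    for group, green, health in rows:
        xs, ys = groups.setdefault(group, ([], []))
        xs.append(green)
        ys.append(health)
    return {g: ols_slope_intercept(xs, ys)[0] for g, (xs, ys) in groups.items()}


slopes = slopes_by_group(rows)
for group, slope in slopes.items():
    # The sign of each slope is the interpretable part: it shows whether
    # more green space goes with better or worse health in that group.
    print(f"{group}: slope = {slope:+.2f}")
```

Each group yields a single slope whose sign can be read directly, which is the kind of interpretability the students credit for their results.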
The second-place finish qualified the team to compete in the world championships in New York City later this year. Between now and then, the quartet will get to take the skills they developed at the datathon back into their classes.
“I was happy that we were able to consolidate all these data sets into a few meaningful graphs that were quite interpretable,” said Zhou. “We haven’t had a chance to work on individual projects in class yet, but I’m excited to see how I can apply the skills that I learned from this on a larger-scale project.”
The datathon’s first-place team included Mila Ivanovska, a second-year student studying computer science and physics at Harvard. That team’s analysis found that designating large areas with natural grass cover alongside walkable streets had a positive effect on mental health in residential areas.