MIT master’s student Doron Hazan, left, and SEAS students Aakash Mishra, Viet Vu and Frank D’Agostino won first place at the Citadel Boston Regional Datathon. (Aakash Mishra)
Last year, Aakash Mishra and Frank D’Agostino learned an important distinction in data science while competing in the Citadel Boston Regional Datathon. Their team built a model to accurately predict Airbnb real estate prices in the southern United States, but failed to place. Another Harvard team won the competition by linking public trust in the government with increased mortality rates during the COVID-19 pandemic.
That loss taught the two incoming fourth-year students at the Harvard John A. Paulson School of Engineering and Applied Sciences (SEAS) that no matter how good a model is, data scientists should seek to use that model to address a concrete challenge.
“You have to understand what is an important issue that needs an answer, and then you need to use your technical know-how to answer that question,” said D’Agostino, who’s pursuing an A.B. in applied mathematics. “It’s the combination of not only the topic and how you come up with a problem, but also how you approach it in a way that’s rigorous enough to be accepted.”
Mishra and D’Agostino took that lesson into their second attempt at the Boston Regional Datathon earlier this year, with much better results. Along with Viet Vu (A.B. ’23, statistics) and MIT master’s student Doron Hazan, the team proved a causal relationship between the FDA-approved drug glipizide and an increased rate of heart failure in diabetic patients. Their efforts netted them the $15,000 first prize and a trip to the World Championship Datathon later this year in New York.
“What we did this time was very actionable,” Mishra said. “Knowing the effect of glipizide or another type of drug on diabetic patients, and being able to provide exact numbers linking it to heart failure, could help inform doctors.”
The 2022 datathon was a one-day event in which teams were given data sets in the morning, then had six hours to complete their analyses and submit their methodology and reports to the judges. For the SEAS team, the data consisted of 70,000 anonymized patient records and medical histories.
“When we originally got the data, it was really messy,” said Mishra, who’s pursuing an A.B. in computer science. “There were a lot of missing rows, parts of the data that didn’t make sense, and parts of the data set that weren’t filled out correctly. We had to figure out what we were going to keep, what we were going to throw away, and what we were going to infer.”
Cleaning up the data and deciding what challenge to address took about three hours. The second half of the day consisted of running the models while team members worked on the background portions of the report, then analyzing the results in the final hour.
“We all have different backgrounds,” Mishra said. “Frank is more into data science, Viet is into computational biology, and I’m computer science. So, a lot of the things I did early on were making sure we had proper features to train our models on and developing the code for the models themselves. Frank tried to figure out the best way to approach creating these models and what we could infer from them, and Viet came up with the mathematical background for them and what results we could take away.”
The datathon forced the team to draw on numerous lessons from their coursework at SEAS. They derived their results using concepts such as hierarchical linear models and synthetic controls, both of which they learned in Harvard courses, as well as an overall approach to data analysis.
“In class, we’re taught that you have to explore the data first, and then after that you choose some features,” Mishra said. “Then you want to figure out what the most interesting trends in the data are, and after that try to develop a model. Having that in mind helps with doing this quickly.”
That approach paid off for Mishra, D’Agostino and their teammates, and they’ll need to stick to that formula if they want to capture the $100,000 prize at the World Championship Datathon in New York City.
Just like in the Boston Regional Datathon, the key will be coming up with the right question to answer using the data.
“In a lot of coursework, they kind of just give us the questions in a problem set and we solve them,” D’Agostino said. “In the real world, you don’t even know the question half the time. Once you have the question, then it’s easy to answer it.”
Press Contact
Matt Goisman | mgoisman@g.harvard.edu
