Master's student capstone spotlight: AI-Enabled Information Extraction for Investment Management

Extracting complicated data from long documents


For their master's capstone project, Sudhan Chitgopkar, Noah Dohrmann, Stephanie Monson and Jimmy Mendez built a machine learning model to extract information from limited partnership agreements (Brittany Buzzutto/SEAS)

Data science and computational science and engineering master’s students at the Harvard John A. Paulson School of Engineering and Applied Sciences (SEAS) take “AC297R: Computational Science and Engineering Capstone Project.” Taught by Weiwei Pan, Assistant Director for Graduate Studies in Data Science, the course groups students together for semester-long research projects in which they work with “client” organizations to tackle real-world challenges.

AI-Enabled Information Extraction for Investment Management

Noah Dohrmann (S.M. ’24), Sudhan Chitgopkar (S.M. ’24), Jimmy Mendez (S.M. ’24), Stephanie Monson (S.M. ’24, M.E. ’25)

Client: Harvard Management Company

What real-world challenge does this project address?

This project focuses on extracting information from Limited Partnership Agreements (LPAs), legal documents that outline terms for a monetary partnership between two entities. These documents are long, verbose, and difficult to parse. As a result, entire legal teams are sometimes necessary to summarize them. Here, we streamline the process by developing a machine learning model able to ingest LPAs and extract salient information from them, reducing the workload and complexity that financial firms like the Harvard Management Company face.

How does this research attempt to solve that real-world challenge?

This research takes large semi-structured data (such as long documents with different sections and subsections) and extracts a set of terms from it. At present, machine learning models aren't well suited to understanding and retaining large amounts of context (as you might need when reading 100+ page documents). To solve this problem, we developed both machine learning and classical software engineering applications that pass only the most relevant context to large language models (LLMs). We also build on the existing literature on prompt engineering to help LLMs generate accurate answers to tough, industry-specific questions.
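As a rough illustration of what passing only the most relevant context might look like, here is a minimal Python sketch. It is not the team's implementation: the blank-line section splitting, the keyword-overlap score, and the names `split_sections`, `score` and `select_context` are all assumptions made for this example, and a real system would likely use a stronger retrieval method.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_sections(document):
    """Assumption: sections are separated by one or more blank lines."""
    return [s.strip() for s in re.split(r"\n\s*\n", document) if s.strip()]


def score(section, query):
    """Count how often the query's distinct terms occur in the section."""
    query_terms = set(tokenize(query))
    counts = Counter(tokenize(section))
    return sum(counts[term] for term in query_terms)


def select_context(document, query, top_k=2):
    """Return the top_k highest-scoring sections (hypothetical helper)."""
    sections = split_sections(document)
    ranked = sorted(sections, key=lambda s: score(s, query), reverse=True)
    return ranked[:top_k]


# Made-up example text, not taken from any real agreement.
document = """Section 1. Term. The partnership shall continue for ten years.

Section 2. Management Fee. The management fee is two percent per year.

Section 3. Governing Law. This agreement is governed by Delaware law."""

best = select_context(document, "management fee", top_k=1)
print(best[0])
```

Only the selected sections, rather than the full 100+ page document, would then be placed in the prompt sent to the LLM.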

How did you apply the skills you learned at SEAS to your project?

The skills we’ve learned at SEAS helped us design and develop a cohesive and production-level application end-to-end. SEAS has taught us how to design good, robust, and versatile software and how to turn those designs into reality at scale to be deployed at companies of all sizes – which is hopefully the future for this project!

What part of the project proved the most challenging?

LLMs are very prone to “hallucination,” or providing a false response when they cannot find the answer to a query or when the answer does not exist in the document at all. These false positives are highly undesirable when the information extracted from the document will be used to make key business decisions, such as investments. Through prompt engineering techniques, we were able to greatly reduce the false positive rate.
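One common way to guard against this kind of false positive is to give the model an explicit way to say that the answer is missing, and to accept an answer only when its supporting quote really appears in the source text. The Python sketch below illustrates that general idea under stated assumptions; it is not the team's prompt or code. The `NOT_FOUND` sentinel, the `|||` response format, and the functions `build_prompt` and `check_response` are hypothetical, and the model call itself is left out.

```python
NOT_FOUND = "NOT_FOUND"  # hypothetical sentinel the model is asked to return


def build_prompt(context, question):
    """Build a prompt that lets the model decline instead of guessing."""
    return (
        "Answer the question using only the excerpt below.\n"
        f"If the excerpt does not contain the answer, reply exactly {NOT_FOUND}.\n"
        "Otherwise reply as: <answer> ||| <verbatim supporting quote>\n\n"
        f"Excerpt:\n{context}\n\nQuestion: {question}\n"
    )


def normalize(text):
    """Lowercase and collapse whitespace so quote matching is less brittle."""
    return " ".join(text.lower().split())


def check_response(response, context):
    """Return the answer only if its quote occurs in the context, else None."""
    response = response.strip()
    if response == NOT_FOUND or "|||" not in response:
        return None
    answer, quote = (part.strip() for part in response.split("|||", 1))
    if quote and normalize(quote) in normalize(context):
        return answer
    return None


# Made-up excerpt and model responses, for illustration only.
context = "The management fee is two percent per year."
print(check_response("2% ||| management fee is two percent", context))
print(check_response("5% ||| fee is five percent", context))
```

In this sketch, a response whose quote cannot be found in the excerpt is treated the same as a missing answer, so an unsupported claim is dropped rather than passed on to a decision-maker.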

What part of the project did you enjoy the most?

We really enjoyed working closely alongside the C-suite, data science team, and lawyers at the Harvard Management Company, and gaining insights from different stakeholders in the project. Often, we think software projects like these are limited to developers and their direct managers, but having such a diverse set of people working alongside us gave us fresh new perspectives and helped us consider new ideas.

What did you learn, or what skills did you gain, through this project?

We were able to broaden our understanding of different machine learning paradigms while getting rigorous hands-on experience and developing production-level software with some great people!

Topics: Academics, AI / Machine Learning, Applied Computation, Computer Science, Industry

Press Contact

Matt Goisman