Imperiled information

Students find website data leaks pose greater risks than most people realize

By Adam Zewe

January 17, 2020


Computer science concentrator Kian Attari, A.B. ’22 (left), and statistics concentrator Dasha Metropolitansky, A.B. ’22, showed that data leaks pose much greater threats than most people realize. (Photo courtesy of Kian Attari and Dasha Metropolitansky)

It seems that every few weeks, news breaks of another company attacked by hackers, with personal data provided by thousands or millions of individual users stolen.

But how dangerous are these data leaks? Does the average person face any real risk if a hacker manages to steal one of their account passwords?

It turns out data leaks pose much greater threats than most people realize: a hacker could easily find and exploit sensitive information about not only a person’s virtual identity, but also their real identity.

That’s the conclusion reached by two students at the Harvard John A. Paulson School of Engineering and Applied Sciences, who explored data leaks for their final project in Privacy and Technology (CS 105), taught by Jim Waldo, Gordon McKay Professor of the Practice of Computer Science.

“The immediate response to a company being breached is fear and outrage, but quickly the public response dissipates and people move on with their lives,” said Dasha Metropolitansky, A.B. ’22, a statistics concentrator. “What hacker has time to go through hundreds of thousands of login credentials and break into each one of them? Most of us just think we’re average individuals—why would a hacker want to target me or you if we’re not especially powerful or prominent?”

Metropolitansky and Kian Attari, A.B. ’22, a computer science concentrator, first wondered how easy it would be for a nefarious individual to find a dataset of leaked personal information. They began by searching the “dark web,” a part of the internet that isn’t indexed by search engines like Google and is typically reached through anonymizing software such as Tor.

The students quickly found a number of forums where hackers share data leaks, making the information public for anyone to access.

“The hackers and malicious people who would exploit this kind of data can find it pretty easily,” Attari said.

The students found a dataset from a 2015 breach of credit reporting company Experian that drew little news coverage at the time. It contained personal information on six million individuals. The dataset was divided by state, so Metropolitansky and Attari decided to focus on Washington, D.C. The data included 69 variables, covering everything from a person’s home address and phone number to their credit score, history of political donations, and even how many children they have.

But this was data from just one leak in isolation. Metropolitansky and Attari wondered if they could identify an individual across all other leaks that have occurred, combining stolen personal information from perhaps hundreds of sources.

There are sites on the dark web that archive data leaks, allowing an individual to enter an email and view all leaks in which the email appears. Attari built a tool that performs this look-up at scale.

“The program takes in a list of personally identifiable information, such as a list of emails or usernames, and searches across the leaks for all the credential data it can find for each person,” he said.
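The article does not describe how Attari's tool works internally, so the following is only a minimal sketch of the general idea of a bulk look-up, written against entirely made-up toy data. The names `build_index` and `lookup`, the leak names, and the record fields are all assumptions for illustration. It indexes each record by a normalized email once, then answers many queries against that index.

```python
from collections import defaultdict


def build_index(leaks):
    """Map each lower-cased email to the names of the leaks that contain it."""
    index = defaultdict(list)
    for leak_name, records in leaks.items():
        for record in records:
            index[record["email"].strip().lower()].append(leak_name)
    return index


def lookup(index, emails):
    """Return, for every requested email, the list of leaks it appears in."""
    return {email: list(index.get(email.strip().lower(), [])) for email in emails}


# Hypothetical toy data; none of these records are real.
leaks = {
    "forum_a": [{"email": "alice@example.com", "username": "alice1"}],
    "shop_b": [
        {"email": "Alice@example.com", "username": "alice1"},
        {"email": "bob@example.com", "username": "bobby"},
    ],
}

index = build_index(leaks)
hits = lookup(index, ["alice@example.com", "carol@example.com"])
```

Building the index once means each additional query is a dictionary look-up rather than a new scan of every leak, which is what makes checking tens of thousands of addresses practical.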

The Experian dataset for Washington, D.C., contained more than 40,000 unique email addresses. Attari extracted these unique emails and entered them into the tool, which searched for every data leak in which the emails appear, along with leaked credentials such as passwords and usernames.

The tool output a dataset of the leaks and credentials associated with the Experian email addresses. Metropolitansky then joined this data with the complete 69-variable Experian dataset, linking users’ cyber identities with their real-world identities.
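The two steps described here, extracting unique emails and then joining the look-up results back onto the full records, can be illustrated with a short sketch. This is not the students' code: the function names, the field names, and the sample rows are hypothetical, and the join is assumed to be a left join keyed on a normalized email address.

```python
def unique_emails(records):
    """Return distinct emails in first-seen order, ignoring case and stray spaces."""
    seen = set()
    ordered = []
    for record in records:
        email = record["email"].strip().lower()
        if email and email not in seen:
            seen.add(email)
            ordered.append(email)
    return ordered


def join_on_email(profiles, leak_hits):
    """Left join: keep every profile row and attach the leaks found for its email."""
    joined = []
    for row in profiles:
        key = row["email"].strip().lower()
        joined.append({**row, "leaks": leak_hits.get(key, [])})
    return joined


# Hypothetical toy rows standing in for a multi-variable record set.
profiles = [
    {"email": "Alice@example.com", "city": "Springfield"},
    {"email": "alice@example.com ", "city": "Springfield"},
    {"email": "bob@example.com", "city": "Shelbyville"},
]
leak_hits = {"alice@example.com": ["forum_a", "shop_b"]}

emails = unique_emails(profiles)
joined = join_on_email(profiles, leak_hits)
```

Normalizing the key on both sides matters because the same address often appears with different capitalization across sources, and an exact-match join would silently miss those rows.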

“What we were able to do is alarming because we can now find vulnerabilities in people’s online presence very quickly,” Metropolitansky said. “For instance, if I can aggregate all the leaked credentials associated with you in one place, then I can see the passwords and usernames that you use over and over again.”

Of the 96,000 passwords contained in the dataset the students used, only 26,000 were unique.
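To put those figures in proportion, the arithmetic can be worked through directly. The snippet below uses only the two rounded numbers quoted above; the variable names are illustrative. Note that a repeated password may come from one person reusing it or from different people choosing the same common password, and the article does not break that down.

```python
total_passwords = 96_000   # rounded figure quoted in the article
unique_passwords = 26_000  # rounded figure quoted in the article

# Share of entries that are distinct values.
unique_share = unique_passwords / total_passwords

# Entries that repeat a password value already present elsewhere in the set.
repeated_entries = total_passwords - unique_passwords
repeated_share = repeated_entries / total_passwords

print(f"unique: {unique_share:.1%}, repeated: {repeated_share:.1%}")
```

In other words, roughly 27 percent of the password entries were distinct, and about 70,000 entries, or roughly 73 percent, repeated a value that appears elsewhere in the data.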

“We also showed that a cyber criminal doesn’t have to have a specific victim in mind. They can now search for victims who meet a certain set of criteria,” Metropolitansky said.

For example, in less than 10 seconds she produced a dataset with more than 1,000 people who have high net worth, are married, have children, and also have a username or password on a cheating website. Another query pulled up a list of senior-level politicians, revealing the credit scores, phone numbers, and addresses of three U.S. senators, three U.S. representatives, the mayor of Washington, D.C., and a Cabinet member.

“Hopefully, this serves as a wake-up call that leaks are much more dangerous than we think they are,” Metropolitansky said. “We’re two college students. If someone really wanted to do some damage, I’m sure they could use these same techniques to do something horrible.”

For Attari, the biggest surprise of this project was the fact that everything they used is publicly available. They didn’t even have to dive too deeply to find it.

“Once something is leaked, it becomes public information for anyone to access and use,” he said. “People feel like they still have a right to their information after it is leaked, but once it is leaked, it’s gone. That is something everyone has to realize.”

Their best advice is simple: don’t reuse passwords or usernames. Reused credentials gave the students a very easy way to link a person’s information across websites.
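As a small illustration of acting on that advice, the sketch below groups a person's own accounts by password and reports which sites share one. The function name `reused_groups` and the sample accounts are hypothetical. In practice a password manager's reuse audit does this job, and passwords should not be kept in plain text in a script.

```python
from collections import defaultdict


def reused_groups(accounts):
    """Return sorted groups of site names that share the same password."""
    sites_by_password = defaultdict(list)
    for site, password in accounts.items():
        sites_by_password[password].append(site)
    groups = [sorted(sites) for sites in sites_by_password.values() if len(sites) > 1]
    return sorted(groups)


# Made-up example accounts for one person.
accounts = {
    "mail": "hunter2",
    "bank": "correct horse battery staple",
    "forum": "hunter2",
}

groups = reused_groups(accounts)
```

Here the mail and forum accounts come back as one group, meaning a leak of either site exposes the other.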

“I think many of us are aware that some of our data has been leaked. What Dasha and Kian showed was how linking different leaked data sources can reveal much more than any of us are aware of, or comfortable with, being out in the network,” said Waldo. “It’s the accumulation of this data that can be dangerous for all of us.”

Another important issue raised by this project is that most major companies have already been breached, and not all data leaks make headlines, said Metropolitansky.

“I hope this project leads to a mindset change. Most people just wait until the media tells them that a company has been breached to act. We should all operate under the assumption that a company has already been breached,” she said. “There are a ton of breaches that happen all the time that none of us know about, and many of these companies are not held accountable. Hopefully, if more people are aware of the danger of leaks, there could be more pressure to enforce stringent protections against companies that don’t do enough to keep user data secure.”
