What kind of data could your PDF files be leaking online?


In 2003, in the lead-up to the Iraq war, a British government representative published a Word document on their website containing information related to “Iraq’s security and intelligence organizations.” Colin Powell had previously referenced this information during a United Nations address. When the British dossier was made public, it was revealed that the information had been plagiarized from a researcher in the US.

That was the least of their concerns. According to cybersecurity researcher Richard M. Smith, an analysis of the Word document’s revision log identified the four officials involved in creating the dossier. Smith explains that “word document files which are converted to RTF files, HTML files, or PDF files will not contain revision logs and other metadata.”

How PDFs could be a privacy and security risk

But here is some food for thought: did you know that a malicious actor can tell how long it has been since you last updated your software simply by getting hold of a PDF file you shared? Unsanitized PDF files are a rich source of leaked information for anyone looking to target users or companies with poor security habits.

A paper published by researchers at University Grenoble Alpes and Inria, France, analyzed nearly 40,000 PDF files released by 75 security agencies from 47 countries to determine what type of information they might leak.

Only 24% of the files had been sanitized before being published, the researchers point out. Worse, only 3 of the 75 security agencies had implemented a satisfactory sanitization process. Even where other agencies had applied sanitization methods, sensitive information could still be extracted from the PDFs collected.

The team found that some PDF files expose system architecture details, user PATH variable information, hardware and network information, email addresses, geolocation, the operating system used, vulnerable or outdated software, and names, among other details, making it easy to find weak entry points in an organization. The bigger concern is not the visible content but the hidden content, which can leak critical information about an organization. Hidden information can be stored not only in text but also in pictures and videos, the paper explains.
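To make this kind of leak concrete, here is a minimal Python sketch that pulls a few common document information fields out of a PDF's raw bytes. It is an illustration only, not the researchers' tooling: the function name `extract_info_fields`, the key list, and the sample values are all assumptions, and the regular expression only catches uncompressed, literal-string entries without nested parentheses.

```python
import re

# Document information keys that commonly reveal authoring details.
# This list is an illustrative assumption, not an exhaustive one.
INFO_KEYS = ("Author", "Creator", "Producer", "CreationDate", "ModDate", "Title")


def extract_info_fields(pdf_bytes):
    """Return literal-string values of common Info keys found in raw PDF bytes.

    Limitations: misses hex strings, strings with nested or escaped
    parentheses, XMP metadata, and anything inside compressed streams.
    """
    found = {}
    for key in INFO_KEYS:
        # Matches "/Key (value)" with optional whitespace before the "(".
        pattern = rb"/" + key.encode("ascii") + rb"\s*\(([^()]*)\)"
        match = re.search(pattern, pdf_bytes)
        if match:
            found[key] = match.group(1).decode("latin-1")
    return found


if __name__ == "__main__":
    # Hypothetical fragment of an Info dictionary; all values are made up.
    sample = (
        b"<< /Author (jdoe) /Creator (ExampleEditor 4.1) "
        b"/Producer (ExampleWriter 1.2) /CreationDate (D:20150312101500Z) >>"
    )
    for name, value in extract_info_fields(sample).items():
        print(name, "=", value)
```

A producer string paired with a creation date is exactly the kind of detail that hints at which software version was in use and when. A real audit would rely on a full PDF parser rather than pattern matching on raw bytes.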

Confidential data could still be accessed, provided the malicious actor has access to extraction tools. According to the National Security Agency (NSA), an electronic document may contain hidden and embedded data in:

  1. Metadata
  2. Embedded Content and Attached Files
  3. Scripts
  4. Hidden Layers
  5. Embedded Search Index
  6. Stored Interactive Form Data
  7. Reviewing and Commenting
  8. Hidden Page, Image, and Update Data
  9. Obscured Text and Images
  10. PDF Comments (Non-Displayed)
  11. Unreferenced Data
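As a rough first pass over some of the categories above, one could count PDF name tokens that tend to accompany them. The sketch below is a triage aid under stated assumptions: the mapping from category to token in `MARKERS` and the function name `count_markers` are illustrative choices, and the scan will miss tokens hidden inside compressed object streams or written with hex-escaped names.

```python
import re

# PDF name tokens loosely associated with some hidden-data categories.
# The mapping is an illustrative assumption, not an official NSA list.
MARKERS = {
    "Metadata": b"/Metadata",
    "Embedded files": b"/EmbeddedFiles",
    "Scripts": b"/JavaScript",
    "Hidden layers": b"/OCProperties",
    "Interactive form data": b"/AcroForm",
    "Comments and annotations": b"/Annots",
}


def count_markers(pdf_bytes):
    """Count occurrences of each marker token in raw PDF bytes."""
    counts = {}
    for label, token in MARKERS.items():
        # The lookahead stops a token from matching a longer name.
        pattern = re.escape(token) + rb"(?![A-Za-z0-9])"
        counts[label] = len(re.findall(pattern, pdf_bytes))
    return counts


if __name__ == "__main__":
    # Hypothetical uncompressed fragment; object numbers are made up.
    sample = (
        b"<< /Type /Catalog /AcroForm 5 0 R /OCProperties << >> >>\n"
        b"<< /Type /Page /Annots [6 0 R] >>\n"
        b"<< /S /JavaScript /JS (x) >>\n"
    )
    for label, n in count_markers(sample).items():
        print(label, n)
```

A nonzero count does not prove that sensitive data is present, and a zero count does not prove the file is clean. It only flags files that deserve a closer look with a proper parser or sanitization tool.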

Organizations are not effectively sanitizing their documents before distributing them, which could expose them to cyberattacks. With so much business now conducted through digital documents, this issue needs to be addressed immediately. Once malicious actors gain access to this type of information, they are likely one step away from exploiting your digital footprint to launch an attack against your organization.

If efficient sanitization methods are not implemented and documents are published on an organization’s website, for instance, malicious actors can use relatively cheap methods to recover information. The impact could be disastrous and possibly even turn into a national security threat, if a government agency is targeted. Always remember that the security choices you make today will affect your business, team, customers and partners tomorrow.


Research Team

Flare’s research team conducts investigations and experiments in order to gather data, create new knowledge, and develop new ideas. This helps our team stay ahead of emerging threats and also add insight to our product roadmap.
