Are Your Documents Leaking Sensitive Information? Scrub Your Metadata!

With high-profile data breaches making headlines on a near-daily basis, most people are aware of the dangers posed by hackers, yet far fewer are aware of the data leaks they themselves may cause when sharing their own files. Sharing more information than intended is called "oversharing," and it often occurs when sending document files that contain metadata.

What Is Metadata?

Metadata is data about data: any information in your files beyond the intended (explicit) content. Metadata exists in some form in nearly every type of document file, but most problems occur in images, PDFs, and Microsoft Office files. Because metadata is generally not included in a file on purpose, it can be hard to notice, examine, or remove.

What Kind of Information Can Be Found in Metadata?

  • Tracked changes: Inserted or deleted text you thought was gone
  • Speaker notes
  • Hidden cells
  • Comments
  • Your name and/or initials
  • Your e-mail address
  • Your company or organization's name
  • The name of your computer
  • The name of the network server or hard disk on which you saved the document
  • Other file properties and summary information
  • The names of previous document authors
  • Document revisions
  • Document versions
  • Template information
  • Hidden text
  • Macros
  • Hyperlinks
  • Routing information
  • Nonvisible portions of embedded Object Linking and Embedding (OLE) objects
  • GPS coordinates
  • Image features thought to be blurred, removed, or hidden under an upper layer
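
Several of the properties listed above (author names, the last editor, revision count, and timestamps) can be checked without opening the file in Office. The following is a minimal Python sketch, assuming the file is in an Office Open XML format (.docx, .xlsx, or .pptx), which is a ZIP archive that normally stores these fields in a part named docProps/core.xml. The function name read_core_properties and the FIELDS mapping are illustrative choices, not part of any Microsoft tool, and the sketch does not cover comments, hidden text, or older binary formats such as .doc.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the core-properties part of an Office Open XML file.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Illustrative selection of fields to report (labels are our own).
FIELDS = {
    "creator": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
    "revision": "cp:revision",
}


def read_core_properties(path):
    """Return a dict of the selected core properties present in the file.

    `path` may be a filename or a binary file-like object.
    """
    with zipfile.ZipFile(path) as archive:
        if "docProps/core.xml" not in archive.namelist():
            return {}
        root = ET.fromstring(archive.read("docProps/core.xml"))
    found = {}
    for label, tag in FIELDS.items():
        element = root.find(tag, NS)
        if element is not None and element.text:
            found[label] = element.text
    return found
```

Calling read_core_properties("report.docx") (a hypothetical filename) before sending a file gives a quick view of which names and dates would travel with it.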

Why Should I Care about Metadata?

Oversharing through metadata can have consequences ranging from merely being embarrassing to having a severe financial or regulatory impact on a person or organization. Consider the following cases of oversharing metadata.

  • Authorship information: Documents whose author metadata differs from their stated source can prove embarrassing or reveal plagiarism. For example, a 2005 speech by President Bush to the U.S. Naval Academy was discovered to have been largely written by a political scientist at Duke University.
  • Improperly redacted data: When columns, rows, or cells are deleted from a spreadsheet to remove identifying information, the deleted data may still be recoverable by viewing revisions stored with the "Track changes" feature. If the recovered data includes payment information such as credit card numbers, or associates specific individuals with health information, this could constitute a PCI or HIPAA violation and could trigger state data breach notification requirements.
  • Notes and comments: Presentation slides often contain speaker notes that are not intended for the audience. Similarly, spreadsheets and text documents may contain comments and tracked changes to content that reveal the private thought processes of the authors. Both metadata types can be particularly destructive to business relationships and negotiations. In the context of legal proceedings, this oversharing could affect determinations of criminal conduct, which is why attorneys must be particularly wary of metadata when exchanging documents.
  • Hacker reconnaissance: Publicly shared documents that contain their full file path or network location are a valuable aid to hackers looking for locations where sensitive information might be stored.
  • Images: Images are often altered to conceal the identity of vulnerable people, such as minors, or to redact confidential information. If the edited file is saved in a way that preserves its layers, these changes can be undone. Furthermore, many cameras and smartphones embed the GPS coordinates of where a photograph was taken in the image's metadata (EXIF data). Sharing these images could compromise the privacy and security of their subjects.
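
The tracked-changes and comments cases above can be illustrated for Word files. This is a rough Python sketch, assuming a .docx file in which tracked insertions and deletions are stored as w:ins and w:del elements in word/document.xml and comments live in word/comments.xml. The function name find_tracked_changes is hypothetical, and the sketch ignores headers, footers, footnotes, and spreadsheet or presentation formats.

```python
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def find_tracked_changes(path):
    """Report tracked insertions, deletions, and comments in a .docx file.

    `path` may be a filename or a binary file-like object.
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
        has_comments = "word/comments.xml" in archive.namelist()

    # Deleted text is kept in w:delText runs inside w:del elements.
    deleted = [
        "".join(t.text or "" for t in change.iter(f"{{{W}}}delText"))
        for change in root.iter(f"{{{W}}}del")
    ]
    # Inserted text is kept in ordinary w:t runs inside w:ins elements.
    inserted = [
        "".join(t.text or "" for t in change.iter(f"{{{W}}}t"))
        for change in root.iter(f"{{{W}}}ins")
    ]
    return {
        # Drop empty entries, e.g. markers for deleted paragraph marks.
        "deleted": [text for text in deleted if text],
        "inserted": [text for text in inserted if text],
        "has_comments": has_comments,
    }
```

A non-empty "deleted" list means text the author believed was gone is still stored in the file and would be visible to any recipient who looks for it.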

How Do I Prevent Metadata Oversharing?

  • The most important step in preventing metadata oversharing in an organization is to raise awareness of the problem through proper training and monitoring. Most organizations already have some sort of information security training about best practices as part of their compliance plan. Adding content regarding document metadata is an efficient way to build this awareness as part of an existing process.
  • Once awareness is built, metadata can generally be scrubbed from documents fairly easily. This metadata scrubbing video shows how to use the Microsoft Office "Inspect Document" feature to remove metadata before sharing.
  • Another way to remove metadata from Microsoft Office documents is to export them as a PDF file. Beware that PDF files have their own metadata such as author and creation date, so PDF files may require further scrubbing.
  • A number of tools such as Metadata Assistant or BigHand Scrub 8 can be used to automate the metadata scrubbing process for a batch of files. (Note that while some of these companies offer free trial versions, most of these tools must be purchased under an individual or enterprise-wide license.)
  • Similarly, EXIF data can be scrubbed from JPEG and PNG image files using a variety of free tools.
  • When creating images where information is altered or concealed, be sure to use the "flatten layers" command in a photo-editing tool before sharing the file.
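
To make the EXIF scrubbing step above more concrete, here is a minimal Python sketch of what such a tool does for a JPEG file. It assumes a well-formed JPEG in which EXIF data sits in an APP1 segment (marker 0xFFE1) whose payload begins with "Exif" followed by two zero bytes, ahead of the compressed image data. The function name strip_exif is hypothetical. The sketch handles JPEG only (not PNG) and does not remove other metadata carriers such as XMP packets or comment segments, so it is an illustration rather than a replacement for a dedicated tool.

```python
import struct


def strip_exif(jpeg_bytes):
    """Return a copy of a JPEG byte string with Exif APP1 segments removed."""
    if jpeg_bytes[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG file (missing start-of-image marker)")

    out = bytearray(b"\xff\xd8")
    pos = 2
    while pos + 4 <= len(jpeg_bytes):
        if jpeg_bytes[pos] != 0xFF:
            raise ValueError("unexpected byte where a segment marker should be")
        marker = jpeg_bytes[pos + 1]

        # Start of scan: everything from here on is image data, copy it as-is.
        if marker == 0xDA:
            out += jpeg_bytes[pos:]
            return bytes(out)

        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", jpeg_bytes[pos + 2:pos + 4])
        segment = jpeg_bytes[pos:pos + 2 + length]

        is_exif = marker == 0xE1 and segment[4:10] == b"Exif\x00\x00"
        if not is_exif:
            out += segment
        pos += 2 + length

    raise ValueError("reached end of data without finding image data")
```

A typical use would read the file in binary mode, pass its contents to strip_exif, and write the result to a new file, leaving the original untouched until the output has been checked.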

What Should I Do If I Receive Sensitive Information in Document Metadata?

If you find sensitive information that may have been overshared in a document, the best practice is to immediately delete the file, including backups, and inform the sender of the steps you have taken to remediate the problem. While you may not have an obligation to do anything, deleting sensitive information such as payment information, customer records, or health information protects your organization from liability should that data later be breached. The risk of a breach could be greater if the document was received by e-mail on a less secure part of the organization's network that is not configured to handle sensitive information. For this reason, immediate deletion is the best policy and should be included in any basic information security awareness and training program.


Michael Spiegel is a technology attorney and information security professional in Pittsburgh specializing in compliance, privacy, and intellectual property issues.

Disclaimer: This post does not constitute legal advice and does not establish an attorney-client relationship.

© 2017 Michael Spiegel. This EDUCAUSE Review blog is licensed under Creative Commons BY-NC-SA 4.0.