Redacting PDFs – A Simple Task With Lots Of Consequences


In 2019, lawyers representing the former political advisor Paul Manafort filed a response to special counsel Robert Mueller’s claims that Manafort violated his cooperation agreement by repeatedly lying to prosecutors. Specific sections of this response were redacted before it was released to the public due to the sensitive nature of some of the content. Or at least that’s what they thought.

This filing put PDF redaction in a whole new light, showcasing the importance of doing it right. At first glance, parts of the public version appeared to be redacted with black bars. It quickly became apparent, however, that the text still existed under the redaction blocks: anyone with Adobe Acrobat, another PDF viewer, or even a browser-based viewing tool could copy that text and paste it into another document to reveal the redacted passages.

This was not the first time it had happened, however. A similar incident occurred in 2011 at the UK Ministry of Defence, where a technical error meant blacked-out parts of an online MoD report could be read by pasting them into another document.

What probably went wrong?

The PDF format covers several different types of document, and the differences could have played a role in how the redaction was incorrectly carried out. Typically, a document that is scanned in is referred to as a PDF–Image. A scanned document, like a fax, is made up of black and white or colour dots and does not contain any additional text for searching or copying. Redacting this type of document simply involves converting the dots that represent the image of the text to black.

Given lawyers’ workloads and the number of scanned documents required for signatures, we can imagine that whoever performed the redaction believed the document was a scanned document. If it had been, drawing black boxes over the text would have worked: the boxes would have replaced the black and white dots that made up the actual words, leaving nothing beneath them to read.

In fact, there are two other types of PDF documents that contain text data in addition to what is shown on the page, often described as sitting “underneath” the visible text. In these PDFs, the text data is what allows searching within the document and copy-pasting of the document’s text. Such documents can be created in two ways.

Either the image document is run through an Optical Character Recognition (OCR) module, which embeds the recognised text behind the image to enable search and other text capabilities such as copy and paste.

Alternatively, the document could be created directly as a PDF from a word processor or other font-capable program, so that it includes the text and fonts.

Making an educated guess, the document was likely created in this second manner and never scanned. We can draw this conclusion because the Manafort document is very clean: it has none of the stray dots typically associated with scanned documents, and its file size is very small compared with that of a scanned image of text. In either of these cases, simply drawing a box over the words will not remove the underlying text.
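To make the difference concrete, here is a minimal sketch in Python. It does not use a real PDF library: the `Page`, `TextSpan` and `Rect` classes are hypothetical stand-ins for a page that has a text layer, and the sample wording is invented for illustration. It only shows why a box drawn on top leaves the text available to copy and paste, while a true redaction removes the text itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Rect:
    x0: float
    y0: float
    x1: float
    y1: float

    def intersects(self, other: "Rect") -> bool:
        # Two rectangles overlap only if they overlap on both axes.
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)


@dataclass(frozen=True)
class TextSpan:
    text: str
    box: Rect


@dataclass
class Page:
    # Hypothetical model: a text layer plus shapes drawn on top of it.
    spans: List[TextSpan] = field(default_factory=list)
    overlays: List[Rect] = field(default_factory=list)

    def extract_text(self) -> str:
        # What search or copy and paste sees: the text layer only.
        return " ".join(span.text for span in self.spans)


def draw_black_box(page: Page, area: Rect) -> None:
    # Cosmetic only: adds a shape on top and leaves the text layer intact.
    page.overlays.append(area)


def redact(page: Page, area: Rect) -> None:
    # Removes every text span that touches the area, then covers the gap.
    page.spans = [s for s in page.spans if not s.box.intersects(area)]
    page.overlays.append(area)


def make_page() -> Page:
    return Page(spans=[
        TextSpan("The witness met", Rect(0, 0, 100, 10)),
        TextSpan("SECRET NAME", Rect(100, 0, 180, 10)),
        TextSpan("in May.", Rect(180, 0, 230, 10)),
    ])


secret_area = Rect(100, 0, 180, 10)

covered = make_page()
draw_black_box(covered, secret_area)   # looks redacted, but is not

redacted = make_page()
redact(redacted, secret_area)          # text is actually gone

print(covered.extract_text())
print(redacted.extract_text())
```

In this toy model the first page still yields the covered words when its text is extracted, which mirrors what happens when text is copied out from under a drawn box.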

What can safeguard against human error when it comes to redaction? 

While this was a very public redaction mishap with extreme consequences, it could have happened to any of us. The simple act of drawing blocks over text does not suffice where redaction is required, so proper procedures should be in place to help safeguard against error. There are software tools that support a range of redaction needs.

Some tools provide ‘suggested redactions’: options to redact text based on patterns, which can be pre-configured. In this manner phone numbers, social security numbers and other consistent patterns can be identified, and the user is given the choice of whether or not to redact each one. Additionally, to cross-check individual items, a ‘search and redact’ feature lets the user enter a search term to select a specific word or phrase for redaction, such as PII like names or addresses.
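The pattern-based approach can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular product works: the two patterns are deliberately simple US-style formats, the function names are made up for this example, and it operates on a plain string rather than on a PDF.

```python
import re

# Hypothetical, deliberately simple patterns; real tools let you configure these.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def suggest_redactions(text):
    """Return (label, start, end, matched_text) tuples for a reviewer to confirm."""
    suggestions = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            suggestions.append((label, match.start(), match.end(), match.group()))
    # Sort by position so the reviewer sees them in reading order.
    return sorted(suggestions, key=lambda s: s[1])


def search_term(text, term):
    """'Search and redact': find every case-insensitive occurrence of a term."""
    return [("term", m.start(), m.end(), m.group())
            for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE)]


def apply_redactions(text, accepted):
    """Replace each accepted span with 'X' characters of the same length."""
    chars = list(text)
    for _label, start, end, _matched in accepted:
        chars[start:end] = "X" * (end - start)
    return "".join(chars)


sample = "Call Jane Doe on 555-867-5309. SSN: 123-45-6789."
suggested = suggest_redactions(sample)

# Here every suggestion is accepted; a real workflow would ask the user first.
accepted = suggested + search_term(sample, "jane doe")
cleaned = apply_redactions(sample, accepted)
print(cleaned)
```

The key design point is that the matched characters are replaced rather than covered. In a real PDF the equivalent step is removing the text from the file, not drawing over it.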

In terms of targeting content for selection, ‘selective text redaction’ means any text can be selected and redacted. With ‘selected box redaction’, users can also draw a box around text or graphics to redact both the image and the text underneath. Sensitive information is highlighted and users confirm the redaction when finalising the document; this sanitises the document and permanently removes the information. Remember to remove document metadata as well where needed, since it could list your name as the author, for instance. You may also want to sanitise content that can alter the document’s appearance, such as JavaScript, actions, and form fields.

To redact at speed, bulk redaction at the case or folder level means users can redact common attributes or values across documents without having to sift through the text of each one individually. Bulk redaction can also be combined with search, or applied on migration. Once you’ve isolated all the text that needs to be redacted, it is key to follow best practice for file naming and content management after redaction. You can append ‘_redacted’ to the file name to distinguish between versions and delete any content that is past its sell-by date.
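The naming convention is easy to automate. The following is a small Python sketch; the helper names `redacted_name` and `is_redacted` are hypothetical, and the example paths are made up.

```python
from pathlib import Path

SUFFIX = "_redacted"


def redacted_name(original):
    """Return the path with '_redacted' appended before the file extension."""
    path = Path(original)
    return path.with_name(f"{path.stem}{SUFFIX}{path.suffix}")


def is_redacted(path):
    """True if the file name already follows the '_redacted' convention."""
    return Path(path).stem.endswith(SUFFIX)


print(redacted_name("cases/filing.pdf"))
print(is_redacted("cases/filing_redacted.pdf"))
```

Keeping the suffix in one place makes it harder for the original and redacted versions of a file to be confused when documents are processed in bulk.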

While many of us may not find ourselves in the middle of a trial surrounded by global media attention, dealing with sensitive data of many kinds does fall into lots of job descriptions. Understanding how to protect that information through redaction is key to ensuring privacy and data protection, but asking individuals to get this right manually every time, on possibly thousands of documents, is unfair. This is where redaction software can help.

Dave Giordano, Chief Strategy Officer at Alfresco
