TIP 139: Creating Editable Text from an Image PDF
A scanned or image PDF is only an image of a page, and you can't manipulate its content by extracting images or modifying the text. However, Acrobat can convert the image of the document into actual text or add a text layer to the document using optical character recognition (OCR). Be sure to evaluate the captured document when the OCR process is complete to make sure Acrobat interpreted the content correctly. It is easy to confuse a bitmap that may be the letter I with the number 1, for example.
Ways and Means
You can convert an image to captured text in three ways, and choose from four different image options. The sample document in this tip shows the worst possible outcomes. The text uses a wide range of fonts, some of the text isn't recognized at all, and the background graphic's gradient is strongly banded. In the Recognize Text - Settings dialog, click the PDF Output Style pull-down menu and choose from three options:
Searchable Image (Exact) keeps the foreground of the page intact and places the searchable text behind the image.
Searchable Image (Compact) compresses the foreground and places the searchable text behind the image; compressing affects the image quality.
Formatted Text & Graphics rebuilds the entire page, converting the content into text, fonts, and graphics.
As well as choosing a conversion, choose an Image Downsampling option. Click the Downsample Image pull-down arrow and choose from four options, anywhere from 600 down to 72 dpi. Downsampling reduces file size, but can result in unusable images.
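To get a feel for the trade-off, the following sketch works out how many pixels a page keeps at each resolution. It is illustrative arithmetic only and does not call Acrobat. The US Letter page size is an assumption, and the intermediate values of 300 and 150 dpi are assumed for illustration; only the 600 and 72 dpi endpoints come from the text.

```python
# Illustrative arithmetic only; this is not an Acrobat API.
PAGE_INCHES = (8.5, 11.0)  # assumed US Letter page, width x height


def pixel_dimensions(dpi, page_inches=PAGE_INCHES):
    """Return (width, height) in pixels for a page imaged at the given dpi."""
    width_in, height_in = page_inches
    return round(width_in * dpi), round(height_in * dpi)


def relative_pixel_count(dpi, reference_dpi=600):
    """Fraction of the reference image's pixels that remain at the given dpi."""
    return (dpi / reference_dpi) ** 2


if __name__ == "__main__":
    # 600 and 72 dpi are named in the text; 300 and 150 are assumed values.
    for dpi in (600, 300, 150, 72):
        width, height = pixel_dimensions(dpi)
        share = relative_pixel_count(dpi)
        print(f"{dpi:>3} dpi: {width} x {height} px, {share:.1%} of the 600 dpi pixel count")
```

Because the pixel count falls with the square of the resolution, halving the dpi leaves a quarter of the pixels. This is why heavy downsampling shrinks files so sharply and also why fine detail can disappear.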
To capture the content of a scanned document:
1. Choose Document > Recognize Text Using OCR > Start. The Recognize Text dialog opens (Figure 139a). Specify whether you want to capture the current page, the entire document, or specified pages in a multipage document.
Figure 139a. Choose settings for working with OCR in the Recognize Text dialog.
2. Click the Edit button to open the Recognize Text - Settings dialog (Figure 139b). Choose a language, PDF Output Style, and Downsample Image setting, and then click OK to return to the Recognize Text dialog. (See the sidebar "Ways and Means" for more information about the choices.)
Figure 139b. You can convert the content in different ways.
Do You Have to Convert a Page?
The answer is: it depends. Why are you scanning the page into Acrobat in the first place? Do you need a visual image of a document to put into storage, or to use as part of your customer service information package? For either of these purposes, you probably don't have to convert the content. Here are some reasons you'd need to convert content from an image PDF to text and images:
You need to be able to search the text, as within a document collection.
You want to make the content available to people using a screen reader or other assistive device.
You want to repurpose the content for different output, such as a Web page or a text document.
You want to reuse or change the content, such as moving paragraphs or extracting tables.
3. Click OK to start the capture process. Be patient. Depending on the size and complexity of the document, the process can take a minute or two. When it is complete, the dialog closes.
Converting a bitmap of letters and numbers into actual letters and numbers may result in items that can't be definitively identified, known as suspects. First take a quick look at the job ahead. Choose Document > Recognize Text Using OCR > Find All OCR Suspects. All content on the page that needs confirmation is outlined with red boxes (Figure 139c). The sample document was captured using the Formatted Text & Graphics option.
Figure 139c. Show all the capture suspects to evaluate the conversion.
Scan and Convert
If you are scanning a document, you can convert it to searchable text as part of the scan. Choose Create PDF > From Scanner to open the dialog. Select Recognize Text Using OCR, and click Settings to open the Recognize Text - Settings dialog shown in Figure 139b.
Select the TouchUp Text tool on the Advanced Editing toolbar and click a suspect on the document to open the Find Element dialog (you can also select Document > Recognize Text Using OCR > Find First OCR Suspect).
The Usual Suspects
Here are some tips for working with scanned or image documents with a minimum of suspects:
Evaluate the content of the document. Determine whether you can simply scan or create an image PDF (such as those you create in Photoshop), or whether you must scan and capture the document, creating editable, searchable text.
If you plan to capture the content, scan using specific resolutions: scan black and white at 200 to 600 dpi, with 300 dpi an optimal resolution, and scan at 200 to 400 dpi for grayscale or color. Acrobat requires a minimum of 144 dpi to perform OCR; otherwise you see a warning message and have to rescan or reconvert the image.
Not all fonts and colors scan well. In the sample document, the decorative "T" wasn't recognized as a letter, and much of the font information is lost when converting to letters. The word "before" isn't captured at all since it overlays the background graphic.
Use OCR fonts if possible, or any clear font at about 12 points. Black text on a white background scans and converts the best, while colored or decorative fonts are the most difficult.
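The resolution guidance above can be written down as a simple check. This is a minimal sketch, not anything Acrobat provides: the function name, the mode labels, and the returned messages are all hypothetical, and only the numeric thresholds come from the text.

```python
# Thresholds taken from the tip; names and messages are hypothetical.
MIN_OCR_DPI = 144  # below this, the text says Acrobat warns and OCR cannot run

RECOMMENDED_DPI = {
    "bw": (200, 600),         # black and white; 300 dpi is described as optimal
    "grayscale": (200, 400),
    "color": (200, 400),
}


def check_scan_resolution(dpi, mode):
    """Classify a scan resolution against the ranges quoted in the tip."""
    if mode not in RECOMMENDED_DPI:
        raise ValueError(f"unknown mode: {mode!r}")
    if dpi < MIN_OCR_DPI:
        return "too low for OCR: rescan"
    low, high = RECOMMENDED_DPI[mode]
    if dpi < low:
        return "below recommended range"
    if dpi > high:
        return "above recommended range"
    return "within recommended range"


if __name__ == "__main__":
    for dpi, mode in [(100, "bw"), (150, "color"), (300, "bw"), (600, "grayscale")]:
        print(dpi, mode, "->", check_scan_resolution(dpi, mode))
```

A check like this could be run over a batch of scans before capture, so that pages below the 144 dpi minimum are rescanned first rather than discovered through a warning.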
In Figure 139d, the word "the" is suspect. Acrobat's interpretation of the word is spelled "tlie" because of the shape of the font's letters. Click the text in the Suspect field and type the correct letters. If the suspect isn't a word at all, click Not Text. Click Find Next to go to the next suspect, click Accept and Find to confirm the interpretation and go to the next suspect, or click Close to end the process.
Figure 139d. Confirm or modify suspect entries in this dialog.
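Misreadings such as "tlie" for "the" follow a pattern: shapes that look alike in a bitmap are swapped. The sketch below shows one way to propose corrections for a suspect word outside Acrobat, for example when proofreading exported text. It is not how Acrobat identifies suspects; the confusion table and the function name are hypothetical, and the vocabulary is whatever word list you supply.

```python
# Hypothetical confusion table: misread fragment -> fragments it may stand for.
CONFUSIONS = {
    "li": ["h"],       # "tlie" read in place of "the"
    "1": ["I", "l"],   # the digit one in place of the letter I or l
    "0": ["O"],
    "rn": ["m"],
}


def candidate_corrections(word, vocabulary):
    """Return vocabulary words reachable from `word` by one confusion swap."""
    candidates = set()
    for wrong, rights in CONFUSIONS.items():
        start = word.find(wrong)
        while start != -1:
            for right in rights:
                fixed = word[:start] + right + word[start + len(wrong):]
                if fixed in vocabulary:
                    candidates.add(fixed)
            # Look for a later occurrence of the same misread fragment.
            start = word.find(wrong, start + 1)
    return sorted(candidates)


if __name__ == "__main__":
    words = {"the", "tile", "In", "modern"}
    for suspect in ("tlie", "1n", "modem"):
        print(suspect, "->", candidate_corrections(suspect, words))
```

The sketch only undoes a misreading in one direction and makes a single substitution per candidate, so treat its output as suggestions to confirm by eye, much as you confirm each suspect in the Find Element dialog.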
Depending on the characteristics of the document's text, you may have to modify some conversion results, such as the font or character spacing (Figure 139e). Use the TouchUp Text tool. When you are pleased with the results, save the document; if you want to start again, choose File > Revert or save the document with an alternate name.
Figure 139e. Depending on the characteristics of the document and the conversion settings you choose, the results can be dreadful.