Dealing with PDF files during a translation project

by Nancy Matis

Introduction

Translating PDF files can turn into a nightmare in some translation projects. Receiving them without their source files often causes trouble that would probably not have arisen if clients had sent the text in its original format. So why do some customers send PDF files for translation? There’s certainly not just one answer to this question. What can we do when there is no chance of receiving the source files? Should we charge more when working on PDF files? This article explores some aspects of handling PDFs during a translation project.

Why can receiving only PDF files for a translation project be an issue?

Nowadays, most translators overwrite the source text they receive in an electronic format for translation. They open the original file in the same program as the one used by the author (or by the publishing team) or they process it using a translation memory (TM) tool. Any layout work required on the target text is then done by the translators or by DTP (desktop publishing) experts using the source program.

PDF files are not normally meant to be overwritten. The format is generated from the original files and is used to exchange documents easily, since no specific program is required to read the content. Receiving them as a reference during a translation project can, therefore, be highly practical. They can be given to linguists who do not have the original program and/or who work within TM tools that don’t let them visualize the actual layout. They can also be a useful reference during DTP to ensure the same final output is obtained. They can even be sent for QA or client validation, as it is easy to insert comments on the translated text or layout.

Although these files are helpful for reference, providing them for translation can prove quite inconvenient. Overwriting text in Acrobat Pro is feasible but not very practical, because problems arise when the target text is longer (or much shorter) than the source. Although more and more translation memory tools support the PDF format, the result is not always optimal. Sometimes the text to be translated is not correctly extracted. More issues may appear when the original layout is complex or when the translation has to be delivered in a particular format, like Adobe InDesign or Microsoft PowerPoint.

As we will see later in this article, there are techniques to extract the text from a PDF file and rework the layout. However, when the files contain images and have to be printed in high resolution, these processes might not give the expected results.

That’s why it’s generally advisable to ask clients for the source files and to use the PDF files only for reference.


Why do some clients send PDF files for translation?

Unfortunately, not all clients send the files to be translated in their original format.

Some of them think it’s easy to overwrite the content in the PDF itself. Others assume that translators create brand new files (for instance in Microsoft Word) in which they directly type the target content. They quite often fail to realize that it’s far easier for us to receive their editable document and overwrite it with the translation. In most cases, this also simplifies the final layout process. I remember when a client sent us a PDF file generated from a text he had just written himself. We immediately asked for the source file, but his reply was quite puzzling. After generating the PDF file, he had deleted his editable file, not realizing anyone else would ever need it. Fortunately, a simple explanation will often persuade clients to send us the appropriate files for our translation projects.

Some translation requestors send PDF files when they judge the source files too complex for translators, such as Adobe InDesign documents. It’s true that not all linguists own this layout program. But providing them with a format compatible with translation memory tools (in this case IDML files) means that the source text can be overwritten and the original formatting easily recovered for target layout adaptation.

Unfortunately, from time to time, we might come across clients who are not fully responsive. They feel we should find the solution on our own or be talented enough to handle any file format.

In some client companies, the people requesting translation (for their own or others’ purposes) haven’t worked on the source files at all. They were provided with PDF files or retrieved them from a common repository and they don’t have any idea who created them. In these cases, it can, of course, prove difficult for them to provide us with editable files for translation. And when the source files were created by an external team, such as a PR agency, it can be virtually impossible. Such teams may deliver only PDF files to their clients and not share the original material they created, either so that they can also charge for the layout of the target languages if needed or simply because they don’t understand what translation teams require.

Finally, there are exceptional situations, for instance when the source text is only available on paper and then scanned. With the best will in the world, the client won’t be able to send us the actual sources in their original editable format.

How can we handle PDF files?

The methods used to handle PDF files depend on the programs we work with, the complexity of the PDF files, our own skills or even the client’s expectations.

  1. Retype

Creating a new document and writing the target text is not really complex. Nonetheless, translators used to overwriting may find it takes them longer to type from scratch. This isn’t really an issue if you use dictation software, as dictating from a printout or a second screen can be relatively fast. For repetitive texts or similar projects, however, you will also lose the advantage of retrieving existing translations from a translation memory. And when the client asks for the layout to be retained, attempting to format the target file to match the original PDF can increase project time significantly.

In any case, this method will sometimes be recommended for scanned text or non-extractable portions of text.

  2. Copy and paste

As long as the source content wasn’t scanned, you can select the text of a PDF file, copy it and paste it into a new document. For plain text, some adjustments may be essential, such as removing carriage returns at the end of each line. You might also need to redo tables. Obviously, the more complex the original layout, the more work you will have to do, not only to obtain an editable text but also to reproduce a format similar to the original one.
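The line-break cleanup mentioned above can be partly automated. The following Python sketch is only an illustration, not a feature of any particular tool. The function name `unwrap_pasted_text` is hypothetical, and the sketch assumes that blank lines in the pasted text mark real paragraph breaks while single line breaks are layout residue. The hyphen rule is a rough heuristic and can wrongly merge genuinely hyphenated words, so the result still needs a read-through.

```python
import re


def unwrap_pasted_text(text: str) -> str:
    """Join hard-wrapped lines into paragraphs (hypothetical helper).

    Assumption: a blank line is a real paragraph break; a single
    line break is just where the PDF line ended.
    """
    # Normalize Windows (\r\n) and old Mac (\r) line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    cleaned = []
    for paragraph in re.split(r"\n\s*\n", text):
        lines = [line.strip() for line in paragraph.split("\n") if line.strip()]
        if not lines:
            continue
        joined = lines[0]
        for line in lines[1:]:
            if joined.endswith("-") and line[:1].islower():
                # Heuristic: a word was probably split at the line end.
                joined = joined[:-1] + line
            else:
                joined += " " + line
        cleaned.append(joined)
    return "\n\n".join(cleaned)


if __name__ == "__main__":
    pasted = "Dealing with PDF files\r\ncan be diffi-\r\ncult.\r\n\r\nSecond paragraph\r\nhere."
    print(unwrap_pasted_text(pasted))
```

Tables and multi-column layouts are outside the scope of this sketch and would still have to be rebuilt by hand.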

  3. Save as

You can also save the accessible PDF content in editable formats. Most solutions have to be paid for, but they produce quite good output that requires only a few adjustments. Nevertheless, some source formats might not be properly supported, and complex layouts, frames, tables, org charts, etc., will more often than not complicate the task and require preparation of the source content and/or hard work on the final target layout.

  4. Use a TM tool

For some time now, more and more translation memory tool vendors have been integrating PDF support. Some tools even include features for handling scanned text within the PDF files to be translated. The result is often very good for simple layouts, and even heavily formatted content, with tables, graphics, etc., might be correctly processed. It is strongly recommended to check whether all the segments to be translated are actually made available to the translator and whether any adjustments are needed (correcting double or missing spaces, for example).
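As an illustration of the kind of check described above, here is a minimal Python sketch that flags double spaces and a likely missing space after punctuation in extracted segments. The function name `find_spacing_issues` and the two rules are assumptions made for this example. They do not reproduce the QA checks of any specific TM tool and will miss many cases.

```python
import re


def find_spacing_issues(segment: str) -> list[str]:
    """Return a list of spacing problems found in one segment (hypothetical helper)."""
    issues = []
    # Two or more consecutive spaces.
    if re.search(r" {2,}", segment):
        issues.append("double space")
    # Lowercase letter, punctuation mark, then an uppercase letter with no
    # space in between, e.g. "ready.Please".
    if re.search(r"[a-z][.!?;:,][A-Z]", segment):
        issues.append("missing space after punctuation")
    return issues


if __name__ == "__main__":
    for segment in ["The file  is ready.", "It is ready.Please check.", "All good here."]:
        print(repr(segment), find_spacing_issues(segment))
```

A check like this only points to segments worth a second look. It cannot tell whether a segment is missing altogether, which still requires comparing the tool’s segments with the PDF.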

As far as layout is concerned, the TM tool output could suit the client, possibly with some adaptation. However, if the client expressly requests a specific format, other than the proposed output, major formatting work may be necessary, potentially leading to full page layout recreation using the same program(s) as the source file creator.

  5. Extract the content into an editable format

Several PDF extraction tools are available, sometimes even for free. Do make sure, however, that you are not contravening any NDA and/or contract you have signed with the client by uploading files to online sites.

These tools may allow you to select the required format and most of them will correctly extract the text while keeping most of the original layout intact. Once again, the result will mostly depend on the complexity of the source material. A preparation or pre-layout step is, therefore, recommended, particularly when projects involve multiple target languages. Any work you do in advance won’t have to be repeated for each target language, which will reduce DTP time. Translators should be aware of any potential issues occurring during extraction. You may decide to fix the source text before starting to translate or to fix problematic segments as you go along. When you receive these extractions from a translation agency or from a client, you should check whether they optimized the extracted source text first and, if not, inform them that it might need fixing.

Some extraction tools might suit all your needs or, on the contrary, you might need several to extract various content types into different formats. For example, tables appearing in some PDF files might be extremely well extracted with one tool, whereas another will be a must for extracting org charts. Some are also limited to one extraction format, like Microsoft Word, while others will give you the opportunity to properly extract an MS Excel spreadsheet, an MS PowerPoint presentation or even an Adobe InDesign file.

  6. Use optical character recognition

Instead of being generated from a specific application, some PDF files result from a scanning process. In this case, you can turn to optical character recognition (OCR) software. The output will vary greatly depending on the quality and resolution of the scanned document and on correct language detection (if possible, define the source language of the PDF to be processed). It goes without saying that the output should be checked carefully before you launch into the translation. Spotting mistakes linked not only to the format or to missing text but also to badly recognized letters or figures is often crucial and prevents serious quality problems in the target text (for instance an “i” extracted as an “l” or “3 cm³” extracted as “3 cm²”).
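Part of this check can be narrowed down with a small script. The Python sketch below is an assumption-laden illustration, not an OCR feature. It lists every line of the OCR output that contains a figure (including superscripts such as ³), so that each one can be compared with the scan, and it highlights tokens that mix ASCII letters and digits, such as “l0”, a common sign of confused characters. The names `lines_to_review` and `MIXED_TOKEN` are hypothetical. The script cannot tell whether “cm³” should have been “cm²”; it only shows where to look.

```python
import re

# Tokens mixing ASCII letters and digits, e.g. "l0" or "2O19".
# Legitimate tokens such as "A4" are flagged too; a human decides.
MIXED_TOKEN = re.compile(
    r"\b(?=[A-Za-z0-9]*[A-Za-z])(?=[A-Za-z0-9]*[0-9])[A-Za-z0-9]+\b"
)


def lines_to_review(ocr_text: str) -> list[tuple[int, str, list[str]]]:
    """Return (line number, line, suspicious tokens) for lines containing figures."""
    flagged = []
    for number, line in enumerate(ocr_text.splitlines(), start=1):
        # str.isdigit() is also true for superscript digits such as "²" and "³".
        if any(ch.isdigit() for ch in line):
            flagged.append((number, line.strip(), MIXED_TOKEN.findall(line)))
    return flagged


if __name__ == "__main__":
    sample = "Add 3 cm³ of water.\nStir well.\nWait l0 minutes."
    for number, line, tokens in lines_to_review(sample):
        print(number, line, tokens)
```

Letter confusions inside ordinary words (an “i” read as an “l”) are not caught here. A spell-check pass in the source language is a more suitable tool for those.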


What are the costs linked to PDF translation?

Basically, the key is to assess the steps needed to produce the expected result from a PDF file and to make sure you are compensated for the extra work. Sometimes it’s quite hard to make the right guess, but often a few minutes is all you need to make an estimate, either based on your experience or on some tests, opening the file in a TM tool or checking the rough output from a text extractor.

If the client only expects translated content, without any layout, the extra effort might be minimal. It will then be a question of deciding whether the work should be paid like any other job or whether you should charge slightly more, either by increasing the rate (per word, line, character, etc.) or adding billable minutes or hours, or even apply a flat rate (e.g., 25 euros more when processing PDF files).

If the request is to deliver the target text following the original file layout, I would advise you to carefully analyze the scope of the task and price it accordingly, especially for a complex layout. For instance, you might add preparation hours to the quote as well as the usual DTP work negotiated per page and/or illustration. Or you could increase the rate per page for DTP when PDF files have to be processed, for instance invoicing 15 euros per page instead of 10.
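To make these pricing options concrete, here is a small Python sketch of the arithmetic. Only the 25-euro flat fee and the 15 versus 10 euros per page come from the examples above. The word count, the rate per word, the page count, and the function name `quote_total` are hypothetical values chosen purely for illustration.

```python
def quote_total(
    word_count: int,
    rate_per_word: float,
    pages: int = 0,
    dtp_rate_per_page: float = 0.0,
    prep_hours: float = 0.0,
    hourly_rate: float = 0.0,
    flat_pdf_fee: float = 0.0,
) -> float:
    """Add up translation, DTP, preparation and flat PDF fee (all in euros)."""
    translation = word_count * rate_per_word
    dtp = pages * dtp_rate_per_page
    preparation = prep_hours * hourly_rate
    return round(translation + dtp + preparation + flat_pdf_fee, 2)


if __name__ == "__main__":
    # Hypothetical job: 2,000 words at 0.10 euro per word, 10 pages.
    # Translation only, with the 25-euro flat PDF fee:
    print(quote_total(2000, 0.10, flat_pdf_fee=25))                 # 225.0
    # With layout, at the usual 10 euros per page:
    print(quote_total(2000, 0.10, pages=10, dtp_rate_per_page=10))  # 300.0
    # With layout, at the increased PDF rate of 15 euros per page:
    print(quote_total(2000, 0.10, pages=10, dtp_rate_per_page=15))  # 350.0
```

For this hypothetical job, the higher page rate adds 50 euros. Whether that covers the extra preparation time is exactly what a quick test on the actual file should tell you.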

In any case, the first action I would recommend is to always ask clients for the editable source files with illustrations containing editable text layers, any proprietary fonts, templates, etc. Explain to them that the goal is not only to ease the linguists’ work but also to reduce costs and guarantee a proper file resolution. Looking at the PDF file properties (via the File menu in Adobe Reader for instance) may also give you a good indication of what the source files were.

Conclusion

We might encounter clients who won’t be able to send us any other format than PDF files for translation and who might even forget to unlock them at times. Knowing how to handle these files in the various cases is extremely helpful. Whether you write a brand new text, make a basic extraction without any layout, prepare the file for the client so they can easily format it or recreate the full layout yourself, I suggest you measure the approximate time and effort it takes you to complete the tasks and include the related costs in your project price. Clients should understand that extra work means extra charges. But they might need some clear explanations on the challenges posed by the files they send and they should definitely be warned in advance of any potential price increase.

Nancy Matis has been involved in the translation business for more than 20 years, working as a translator, reviser, technical specialist, project manager, and teacher, among other roles. After obtaining degrees in translation and social and economic sciences, she worked for an international translation firm for several years. She currently manages her own company based in Belgium, specializing in localization, translation project management, consulting, and training. She also teaches translation project management at Université Lille 3 (France), KU Leuven (Belgium), Université Libre de Bruxelles (Belgium), and through webinars. Besides publishing articles on project management and the importance of teaching this subject to future translators, she has also written about terminology management in projects and quality assurance in translation. She is the author of How to Manage Your Translation Projects. Contact: nancy@nmatis.be. Website: www.translation-project-management.com.
