tekom - conferences

Using LLMs to Convert PDF Documents into DITA Source

  • Presentation
  • Artificial Intelligence (AI) in Technical Communication
  • November 12
  • 14:00 - 14:45 (CET)
  • C6.1
  • Dr. Zhijun Gao

    • Peking University

Contents

We will introduce our research on automatically converting PDF documents into structured DITA source. Converting legacy documents to DITA has traditionally been difficult because it demands extensive time and manual labor. To address this, we first optimized layout parsing using PaddleX, enhanced with manually labeled DITA layout data. This optimization enables the model to identify content types such as headings, paragraphs, and code blocks more accurately.

Furthermore, we fine-tuned the Qwen2.5-7B large language model to improve the accuracy of converting content into appropriate DITA topics. After obtaining the DITA source files, we scan them to identify reusable elements and turn these into DITA content references (the conref attribute). We also apply clustering algorithms to automatically generate relationship tables (<reltable>). Finally, our approach extracts terms and tags the content with them (<indexterm>).
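
The abstract does not say how reusable elements are detected. As an illustration only, the sketch below assumes the simplest possible criterion: elements with the same tag whose text is identical after whitespace normalization. The function names `normalized_text` and `apply_conrefs`, the `reuse-N` id scheme, and the choice of `<note>` as the element to deduplicate are hypothetical and not taken from the presented tool. It keeps the first occurrence and replaces later copies with a `conref` attribute pointing at it, using the standard `file#topicid/elementid` address form.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def normalized_text(elem):
    """Return the element's text with runs of whitespace collapsed."""
    return " ".join("".join(elem.itertext()).split())


def apply_conrefs(topics, tag="note"):
    """Replace repeated <tag> elements with conref pointers (illustrative).

    topics: dict mapping a file name to a DITA topic as an XML string.
    Assumes every topic root carries an id attribute, as DITA requires.
    Returns a dict mapping each file name to its rewritten XML string.
    """
    roots = {name: ET.fromstring(xml) for name, xml in topics.items()}

    # Group elements by normalized text: text -> [(file, root, element)]
    seen = defaultdict(list)
    for name, root in roots.items():
        for elem in root.iter(tag):
            text = normalized_text(elem)
            if text:
                seen[text].append((name, root, elem))

    for index, hits in enumerate(seen.values()):
        if len(hits) < 2:
            continue  # used only once, nothing to reuse
        first_name, first_root, first = hits[0]
        if first.get("id") is None:
            first.set("id", f"reuse-{index}")  # hypothetical id scheme
        target = f"{first_name}#{first_root.get('id')}/{first.get('id')}"
        for _, _, dup in hits[1:]:
            tail = dup.tail  # clear() also drops the tail, so keep it
            dup.clear()
            dup.tail = tail
            dup.set("conref", target)

    return {name: ET.tostring(root, encoding="unicode")
            for name, root in roots.items()}
```

A real pipeline would likely need fuzzier matching (near-duplicates, differing inline markup) and would usually move shared content into a dedicated library topic. Exact matching is used here only to keep the example short.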

Finally, we will demonstrate our tool, which automatically converts PDFs while keeping technical writers in the loop.

Takeaways

  1. Automated PDF → DITA conversion pipeline
  2. Optimized layout parsing with PaddleX
  3. LLM-driven semantic tagging of topics
  4. Automatic conref, <reltable>, and <indexterm> generation

Prior knowledge

Basic understanding of DITA and LLMs

Speaker

Dr. Zhijun Gao

  • Peking University
Biography

Gao Zhijun is an Assistant Professor in the School of Software and Microelectronics at Peking University (Beijing, China) and Secretary-General of the China Technical Communication Alliance (CTCA). He holds a Ph.D. in Technical Communication from the University of Twente and leads the Information Experience Design research group at Peking University. Dr. Gao chairs the development of the Chinese standard “Guidelines for Evaluating the User Experience of Technical Documentation.” His work centers on AI-driven technical writing and information experience design.