Using LLMs to Convert PDF Documents into DITA Source

Presentation
Artificial Intelligence (AI) in Technical Communication

November 12
14:00 - 14:45 (CET)
C6.1

On site

Dr. Zhijun Gao
- Peking University

We will introduce our research on automatically converting PDF documents into structured DITA sources. The conversion of legacy documents into DITA format traditionally poses significant challenges due to the extensive time and manual labor required. To address these challenges, we first optimized layout parsing using PaddleX, enhanced by manually labeled DITA layout data. This optimization enables the model to more accurately identify various content types such as headings, paragraphs, and code blocks.
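As a rough illustration of what happens once layout parsing has labeled each block, the sketch below maps typed blocks onto a DITA topic skeleton. The block format (dicts with `type` and `text` keys), the label names, and the function `blocks_to_topic` are assumptions made for this example; they are not PaddleX's actual output schema or the speakers' implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from layout labels to DITA body elements.
TAG_FOR_LABEL = {"paragraph": "p", "code": "codeblock"}

def blocks_to_topic(blocks, topic_id):
    """Build a generic DITA <topic> from a list of labeled layout blocks.

    Each block is assumed to be a dict like {"type": "heading", "text": "..."}.
    The first heading becomes the topic title; paragraphs and code blocks
    are appended to <body> in reading order. Other labels are skipped.
    """
    topic = ET.Element("topic", id=topic_id)
    title = ET.SubElement(topic, "title")
    body = ET.SubElement(topic, "body")
    for block in blocks:
        if block["type"] == "heading" and not title.text:
            title.text = block["text"]
        elif block["type"] in TAG_FOR_LABEL:
            ET.SubElement(body, TAG_FOR_LABEL[block["type"]]).text = block["text"]
    return topic

sample_blocks = [
    {"type": "heading", "text": "Install the driver"},
    {"type": "paragraph", "text": "Run the installer as administrator."},
    {"type": "code", "text": "setup.exe /quiet"},
]
topic = blocks_to_topic(sample_blocks, "install_driver")
print(ET.tostring(topic, encoding="unicode"))
```

In a real pipeline the choice of topic type (concept, task, reference) would come from the language model rather than from a fixed table like this one.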

Furthermore, we fine-tuned the Qwen2.5-7B large language model to improve the accuracy of converting content into appropriate DITA topics. After obtaining the DITA source files, we scan them to identify reusable elements and turn these into DITA content references (the conref attribute). We also apply clustering algorithms to automatically generate relationship tables (<reltable>). Finally, our approach extracts terms and tags the content with them (<indexterm>).
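The reuse-detection step can be pictured with a minimal sketch. It assumes the simplest possible criterion, paragraphs whose whitespace-normalized text is identical across topics, and it only considers <p> elements without child markup. The function name `add_conrefs` and the `reuse-N` id scheme are invented for this example; the approach presented in the talk may use a different matching strategy.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def add_conrefs(topics):
    """Replace repeated paragraphs with conref pointers.

    topics: dict mapping file name -> DITA topic XML string.
    Returns a dict with the same keys and rewritten XML strings.
    The first occurrence keeps its text and receives an id; later
    occurrences become empty <p> elements that reference it.
    """
    roots = {name: ET.fromstring(xml) for name, xml in topics.items()}
    seen = defaultdict(list)
    for name, root in roots.items():
        for p in root.iter("p"):
            key = " ".join((p.text or "").split())
            if key and len(p) == 0:  # plain-text paragraphs only
                seen[key].append((name, root, p))

    counter = 0
    for hits in seen.values():
        if len(hits) < 2:
            continue
        counter += 1
        src_name, src_root, src_p = hits[0]
        elem_id = src_p.get("id") or f"reuse-{counter}"
        src_p.set("id", elem_id)
        # conref syntax: file#topicid/elementid
        target = f"{src_name}#{src_root.get('id')}/{elem_id}"
        for _, _, p in hits[1:]:
            p.text = None
            p.set("conref", target)
    return {name: ET.tostring(root, encoding="unicode") for name, root in roots.items()}

example = {
    "clean.dita": '<topic id="clean"><title>Cleaning</title><body>'
                  '<p>Unplug the device before you begin.</p></body></topic>',
    "repair.dita": '<topic id="repair"><title>Repair</title><body>'
                   '<p>Unplug the device before you begin.</p></body></topic>',
}
for name, xml in add_conrefs(example).items():
    print(name, xml)
```

Exact matching misses near-duplicates, so a production tool would likely add fuzzy or embedding-based comparison and let a technical writer confirm each proposed reference.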

Finally, we will demonstrate our tool, which automatically converts PDFs while keeping technical writers in the loop.

Takeaways

Automated PDF → DITA conversion pipeline
Optimized layout parsing with PaddleX
LLM-driven semantic tagging of topics
Automatic conref, <reltable>, and <indexterm> generation

Prior knowledge

Basic understanding of DITA and LLMs

Speaker

Dr. Zhijun Gao

Peking University

Biography

Gao Zhijun is an Assistant Professor in the School of Software and Microelectronics at Peking University (Beijing, China) and Secretary-General of the China Technical Communication Alliance (CTCA). He holds a Ph.D. in Technical Communication from the University of Twente and leads the Information Experience Design research group at Peking University. Dr. Gao chairs the development of the Chinese standard “Guidelines for Evaluating the User Experience of Technical Documentation.” His work centers on AI-driven technical writing and information experience design.
