Contents
We present our research on automatically converting PDF documents into structured DITA sources. Converting legacy documents to DITA has traditionally required extensive time and manual labor. To address this, we first optimized layout parsing with PaddleX, using manually labeled DITA layout data. This optimization lets the model identify content types such as headings, paragraphs, and code blocks more accurately.
Furthermore, we fine-tuned the Qwen2.5-7B large language model to improve the accuracy of converting content into appropriate DITA topics. After obtaining the DITA source files, we scan them to identify reusable elements and turn them into DITA content references (the conref attribute). We also apply clustering algorithms to automatically generate relationship tables (<reltable>). Finally, our approach extracts terms and tags the content with them (<indexterm>).
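To make the content-reference step concrete, here is a minimal sketch of one way repeated paragraphs could be detected and replaced with conref attributes. It is not the tool's implementation: the function name apply_conrefs, the min_chars threshold, the generated reuse-N ids, and the exact-match rule on whitespace-normalized text are all assumptions made for illustration. It also assumes every topic root carries an id, as DITA requires.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def apply_conrefs(topics, min_chars=20):
    """Replace repeated <p> elements with conref pointers (hypothetical helper).

    topics maps a file name to a DITA topic XML string.
    Returns a dict with the same keys and the rewritten XML strings.
    """
    roots = {name: ET.fromstring(xml) for name, xml in topics.items()}

    # Group <p> elements by their whitespace-normalized text content.
    groups = defaultdict(list)
    for name, root in roots.items():
        for p in root.iter("p"):
            text = " ".join("".join(p.itertext()).split())
            if len(text) >= min_chars:  # skip short fragments (assumed threshold)
                groups[text].append((name, root, p))

    counter = 0
    for hits in groups.values():
        if len(hits) < 2:
            continue  # text occurs only once, nothing to reuse
        counter += 1

        # The first occurrence stays as the reuse source and needs an id.
        src_name, src_root, src_p = hits[0]
        src_id = src_p.get("id") or f"reuse-{counter}"
        src_p.set("id", src_id)
        target = f"{src_name}#{src_root.get('id')}/{src_id}"

        # Every later occurrence is emptied and pointed at the source.
        for _, _, p in hits[1:]:
            attrs, tail = dict(p.attrib), p.tail
            p.clear()  # drops text, children, attributes, and tail
            p.attrib.update(attrs)
            p.tail = tail
            p.set("conref", target)

    return {name: ET.tostring(root, encoding="unicode")
            for name, root in roots.items()}
```

Given two topics that share a warning paragraph, the copy in the second topic would become an empty <p> whose conref points at the first, for example a.dita#a/reuse-1. A real pipeline would likely need fuzzy matching and a writer's review before any replacement, which fits the human-in-the-loop design described next.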
Finally, we will demonstrate our tool, which automatically converts PDFs while keeping technical writers in the loop.
Takeaways
- Automated PDF → DITA conversion pipeline
- Optimized layout parsing with PaddleX
- LLM-driven semantic tagging of topics
- Automatic conref, <reltable>, and <indexterm> generation
Prior knowledge
Basic understanding of DITA and LLMs