Building a PDF to EPUB/AZW3 Converter for Thai-Language Books

Author: Khomkrid Lerdprasert

Filed: April 23, 2026

Anyone who has tried converting a Thai PDF to EPUB or AZW3 for reading on a Kindle has probably run into garbled characters, missing vowels, misplaced tone marks, or paragraphs broken into fragments. This article walks through how I built PDF to EPUB/AZW3 Converter to fix these problems. It comes in a Go version (fast) and a Python version (with OCR support).

The Problem with Thai PDFs

Many Thai PDFs use fonts that encode glyphs in the Unicode Private Use Area (PUA), so text extraction yields the wrong characters. For example:

The vowels sara i, sara ii, sara ue, and sara uee (อิ อี อึ อื) are encoded as U+F701 to U+F704
The tone marks mai ek, mai tho, mai tri, and mai chattawa are encoded as U+F705 to U+F708
Sara am is split into nikhahit + sara aa (ํา instead of ำ)
Sometimes it appears as consonant + space + sara aa (ก า instead of กำ)
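
Before remapping anything, it can help to confirm that a PDF is actually affected. The following is a minimal sketch, not part of the project: countPUA is a hypothetical helper that counts code points in the Basic Multilingual Plane's Private Use Area (U+E000 to U+F8FF) in already-extracted text.

```go
package main

import "fmt"

// countPUA (hypothetical helper) returns how many runes in text fall inside
// the BMP Private Use Area, U+E000 through U+F8FF. A non-zero count suggests
// the extracted text still contains font-specific glyph codes.
func countPUA(text string) int {
	n := 0
	for _, r := range text {
		if r >= 0xE000 && r <= 0xF8FF {
			n++
		}
	}
	return n
}

func main() {
	clean := "กิน"        // sara i stored as standard U+0E34
	broken := "ก\uf701น" // sara i stored as PUA code point U+F701
	fmt.Println(countPUA(clean), countPUA(broken)) // prints: 0 1
}
```

A count of zero does not prove the text is clean (the split sara am forms use standard code points), but a non-zero count is a strong hint that a PUA mapping step is needed.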

Paragraph layout is a second problem. A PDF stores text line by line, so a naive conversion to EPUB produces paragraphs chopped into fragments.

Solution: Go + Python Dual Approach

Go Version: Fast, Converts a 9,479-Page File in About 2 Minutes

The Go version uses the go-fitz library (a binding for MuPDF) to read PDFs and go-epub to build the EPUB.

1. Fixing PUA Characters

The core is a mapping table that converts PUA characters back to standard Unicode:

var thaiPUAMap = map[rune]string{
    '\uf701': "\u0e34", // sara i
    '\uf702': "\u0e35", // sara ii
    '\uf703': "\u0e36", // sara ue
    '\uf704': "\u0e37", // sara uee
    '\uf705': "\u0e48", // mai ek
    '\uf706': "\u0e49", // mai tho
    '\uf707': "\u0e4a", // mai tri
    '\uf708': "\u0e4b", // mai chattawa
    '\uf709': "\u0e4c", // thanthakhat
    // ... and others
}

Next, normalize the decomposed sara am:

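func fixThaiText(text string) string {
    // Replace PUA characters with their standard Unicode equivalents
    var result strings.Builder
    for _, r := range text {
        if replacement, ok := thaiPUAMap[r]; ok {
            result.WriteString(replacement)
        } else {
            result.WriteRune(r)
        }
    }
    text = result.String()

    // nikhahit + sara aa (ํา) → sara am (ำ)
    text = strings.ReplaceAll(text, "\u0e4d\u0e32", "\u0e33")

    // consonant + space + sara aa → consonant + sara am.
    // The braces in ${1} are required: ำ (U+0E33) is a Unicode letter, so
    // "$1ำ" would be read as a group named "1ำ" and expand to nothing.
    text = thaiConsonantPattern.ReplaceAllString(text, "${1}\u0e33")

    return text
}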
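
The consonant pattern itself is not shown in the article. As a minimal sketch, assume it captures a single Thai consonant (ก U+0E01 through ฮ U+0E2E) followed by a space and sara aa; the names consonantSpaceAa and normalizeSaraAm are hypothetical. The sketch also shows why the replacement template needs `${1}`: Go's regexp package treats letters after `$1` as part of the group name, and ำ counts as a letter.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Assumption: one Thai consonant (U+0E01-U+0E2E), a space, then sara aa (U+0E32).
// The project's real pattern may differ.
var consonantSpaceAa = regexp.MustCompile(`([\x{0E01}-\x{0E2E}]) \x{0E32}`)

// normalizeSaraAm (hypothetical name) rejoins the two broken forms of sara am.
func normalizeSaraAm(text string) string {
	// nikhahit + sara aa → sara am
	text = strings.ReplaceAll(text, "\u0e4d\u0e32", "\u0e33")
	// consonant + space + sara aa → consonant + sara am
	return consonantSpaceAa.ReplaceAllString(text, "${1}\u0e33")
}

func main() {
	fmt.Println(normalizeSaraAm("น\u0e4d\u0e32") == "น\u0e33") // split form
	fmt.Println(normalizeSaraAm("ก \u0e32") == "ก\u0e33")      // spaced form

	// Without braces the template refers to a group named "1ำ", which does
	// not exist, so the whole match is replaced by an empty string.
	lost := consonantSpaceAa.ReplaceAllString("ก \u0e32", "$1\u0e33")
	fmt.Println(lost == "")
}
```

Note that the spaced-form rule is a heuristic: it will also rewrite a legitimate consonant, space, sara aa sequence, so it is safest on text known to come from an affected font.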

2. Smart Paragraph Formatting

The formatThaiParagraphs function merges consecutive lines into a single paragraph, using these rules:

A blank line starts a new paragraph
A line beginning with "บทที่" (chapter), "ตอนที่" (episode), or "ภาคที่" (part) starts a new paragraph
A very short line after a long paragraph may be a heading
Consecutive Thai text is joined with no space between lines

// Thai characters range: 0x0E00-0x0E7F
isThai := (lastChar >= 0x0E00 && lastChar <= 0x0E7F) ||
          (firstChar >= 0x0E00 && firstChar <= 0x0E7F)

if isThai {
    // Thai text is joined without a space between lines
    currentPara.WriteString(line)
} else {
    currentPara.WriteString(" " + line)
}
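
To show how these rules might fit together, here is a self-contained sketch. It is not the project's formatThaiParagraphs: joinLines and startsNewParagraph are hypothetical names, the short-line heading heuristic is left out, and treating a chapter-style line as a paragraph of its own is an assumption.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// isThaiRune reports whether r is in the Thai block (U+0E00-U+0E7F).
func isThaiRune(r rune) bool { return r >= 0x0E00 && r <= 0x0E7F }

// startsNewParagraph (hypothetical) detects chapter-style line prefixes.
func startsNewParagraph(line string) bool {
	for _, p := range []string{"บทที่", "ตอนที่", "ภาคที่"} {
		if strings.HasPrefix(line, p) {
			return true
		}
	}
	return false
}

// joinLines (hypothetical) merges extracted lines into paragraphs.
func joinLines(lines []string) []string {
	var paras []string
	var cur strings.Builder
	flush := func() {
		if cur.Len() > 0 {
			paras = append(paras, cur.String())
			cur.Reset()
		}
	}
	for _, raw := range lines {
		line := strings.TrimSpace(raw)
		switch {
		case line == "": // blank line ends the current paragraph
			flush()
		case startsNewParagraph(line): // heading stands alone (assumption)
			flush()
			paras = append(paras, line)
		case cur.Len() == 0:
			cur.WriteString(line)
		default:
			last, _ := utf8.DecodeLastRuneInString(cur.String())
			first, _ := utf8.DecodeRuneInString(line)
			if isThaiRune(last) || isThaiRune(first) {
				cur.WriteString(line) // Thai: no space at the line break
			} else {
				cur.WriteString(" " + line)
			}
		}
	}
	flush()
	return paras
}

func main() {
	lines := []string{"สวัสดี", "ครับ", "", "Hello", "world", "บทที่ 1", "เริ่มต้น"}
	for i, p := range joinLines(lines) {
		fmt.Printf("%d: %s\n", i, p)
	}
}
```

For the sample input this yields four paragraphs: the two Thai lines joined without a space, "Hello world" joined with one, the chapter heading, and the line after it.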

3. Automatic Cover Extraction

The Go version renders the first page of the PDF as a PNG image and embeds it as the cover of the EPUB/AZW3:

func extractCover(pdfPath string) (string, error) {
    doc, err := fitz.New(pdfPath)
    if err != nil {
        return "", err
    }
    defer doc.Close()

    img, err := doc.Image(0) // render the first page
    if err != nil {
        return "", err
    }

    // Save as PNG
    tmpFile := filepath.Join(os.TempDir(), "cover_"+...)
    f, err := os.Create(tmpFile)
    if err != nil {
        return "", err
    }
    defer f.Close()

    if err := png.Encode(f, img); err != nil {
        return "", err
    }

    return tmpFile, nil
}
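
The file-writing half of this step can be tried without MuPDF. The sketch below is an illustration, not the project's code: saveCoverPNG, blankPage, and pngSize are hypothetical helpers, blankPage stands in for the image go-fitz would render, and the "cover_*.png" name pattern is an assumption. It uses os.CreateTemp from the standard library so that each run gets a unique file name.

```go
package main

import (
	"fmt"
	"image"
	"image/png"
	"os"
)

// blankPage (hypothetical) stands in for the page image rendered from the PDF.
func blankPage(w, h int) image.Image {
	return image.NewRGBA(image.Rect(0, 0, w, h))
}

// saveCoverPNG (hypothetical) writes img to a uniquely named temp file and
// returns its path. The partial file is removed if encoding or closing fails.
func saveCoverPNG(img image.Image) (string, error) {
	f, err := os.CreateTemp("", "cover_*.png") // name pattern is an assumption
	if err != nil {
		return "", err
	}
	if err := png.Encode(f, img); err != nil {
		f.Close()
		os.Remove(f.Name())
		return "", err
	}
	if err := f.Close(); err != nil {
		os.Remove(f.Name())
		return "", err
	}
	return f.Name(), nil
}

// pngSize (hypothetical) reads back the dimensions of a PNG file.
func pngSize(path string) (int, int, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, 0, err
	}
	defer f.Close()
	cfg, err := png.DecodeConfig(f)
	if err != nil {
		return 0, 0, err
	}
	return cfg.Width, cfg.Height, nil
}

func main() {
	path, err := saveCoverPNG(blankPage(600, 800))
	if err != nil {
		fmt.Println("save failed:", err)
		return
	}
	defer os.Remove(path) // the caller owns the temp file
	w, h, err := pngSize(path)
	fmt.Println(w, h, err)
}
```

Closing the file explicitly before returning means a failed flush is reported instead of silently producing a truncated cover.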

4. Automatic Chapter Splitting

A regex detects Thai chapter headings:

var chapterPattern = regexp.MustCompile(
    `(?m)^((?:บทที่|ตอนที่|ภาคที่|อารัมภบท)\s*(?:\d+)?(?:\s*[–\-])?.*)$`,
)

The test file Martial.pdf (9,479 pages) was split into 1,384 chapters automatically.
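
How the matches become chapters is not shown in the article. One plausible approach, sketched here under that assumption, is to take the byte offsets of every heading match and slice the text between consecutive headings. The chapter type and splitChapters function are hypothetical; the pattern is copied from above.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var chapterPattern = regexp.MustCompile(
	`(?m)^((?:บทที่|ตอนที่|ภาคที่|อารัมภบท)\s*(?:\d+)?(?:\s*[–\-])?.*)$`,
)

// chapter (hypothetical type) holds one heading and the text under it.
type chapter struct {
	Title string
	Body  string
}

// splitChapters (hypothetical) cuts text at every heading match. Text before
// the first heading is kept as an untitled leading chapter.
func splitChapters(text string) []chapter {
	locs := chapterPattern.FindAllStringIndex(text, -1)
	if len(locs) == 0 {
		return []chapter{{Body: strings.TrimSpace(text)}}
	}
	var out []chapter
	if pre := strings.TrimSpace(text[:locs[0][0]]); pre != "" {
		out = append(out, chapter{Body: pre})
	}
	for i, loc := range locs {
		end := len(text)
		if i+1 < len(locs) {
			end = locs[i+1][0] // body runs up to the next heading
		}
		out = append(out, chapter{
			Title: strings.TrimSpace(text[loc[0]:loc[1]]),
			Body:  strings.TrimSpace(text[loc[1]:end]),
		})
	}
	return out
}

func main() {
	text := "คำนำ\nบทที่ 1 เริ่มต้น\nเนื้อหาแรก\nบทที่ 2\nเนื้อหาสอง\n"
	for _, c := range splitChapters(text) {
		fmt.Printf("title=%q body=%q\n", c.Title, c.Body)
	}
}
```

Because the pattern is anchored with `(?m)^`, a heading is only recognized at the start of a line, so chapter splitting should run before lines are merged into paragraphs.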

5. AZW3 (Kindle Format) Support

Kindle users can convert the EPUB to AZW3 with Calibre's ebook-convert command:

func convertEPUBtoAZW3(epubPath, azw3Path string) error {
    cmd := exec.Command("ebook-convert", epubPath, azw3Path,
        "--enable-heuristics")
    output, err := cmd.CombinedOutput()
    if err != nil {
        // Include the tool's output so the failure reason is visible
        return fmt.Errorf("conversion failed: %w: %s", err, output)
    }
    return nil
}
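
Since this step depends on an external program, a missing Calibre install is a likely failure. A minimal sketch of a friendlier check, assuming only the standard library: convertWith is a hypothetical wrapper that looks the binary up on PATH with exec.LookPath before running it.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// convertWith (hypothetical) runs a converter binary such as ebook-convert,
// first checking that it can be found on PATH.
func convertWith(bin, inPath, outPath string) error {
	path, err := exec.LookPath(bin)
	if err != nil {
		return fmt.Errorf("%s not found on PATH: %w", bin, err)
	}
	out, err := exec.Command(path, inPath, outPath).CombinedOutput()
	if err != nil {
		return fmt.Errorf("%s failed: %w\n%s", bin, err, out)
	}
	return nil
}

func main() {
	err := convertWith("ebook-convert", "book.epub", "book.azw3")
	switch {
	case errors.Is(err, exec.ErrNotFound):
		fmt.Println("ebook-convert is missing; install Calibre first")
	case err != nil:
		fmt.Println(err)
	default:
		fmt.Println("converted")
	}
}
```

Wrapping with `%w` keeps exec.ErrNotFound detectable through errors.Is, so the caller can distinguish "Calibre is not installed" from "the conversion itself failed".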

Python Version: OCR Support for Scanned PDFs

The Python version's advantage is OCR support (via Tesseract) for scanned PDFs, plus conversion to DOCX:

def fix_thai_text(text: str) -> str:
    """Fix PUA characters and normalize Thai text."""
    text = text.translate(PUA_TRANS)
    text = text.replace("\u0e4d\u0e32", "\u0e33")  # ํา → ำ
    text = re.sub(rf'({THAI_CONSONANT}) า', r'\1ำ', text)
    return text

There are three scripts to choose from:

pdf2epub.py: converts a text-based PDF to EPUB
pdf2docx.py: converts a PDF to DOCX (for editing in Word)
pdf2epub_ocr.py: converts a scanned PDF using OCR

Usage

Go Version (Recommended)

# Build
cd pdf-converter
go build -o main main.go
chmod +x main
mv main ../

# Convert to EPUB with cover + metadata
./main -i Martial.pdf

# Convert to AZW3 for Kindle
./main -i Martial.pdf -t azw3

# Specify the output filename
./main -i book.pdf -o my-ebook.epub
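
The commands above imply three flags: -i for the input, -t for the target format, and -o for an optional output name. How the tool parses them is not shown, so the following is only a sketch of behavior consistent with those examples: parseArgs and options are hypothetical, and the "epub" default and the rule for deriving the output name are assumptions.

```go
package main

import (
	"flag"
	"fmt"
	"path/filepath"
	"strings"
)

// options (hypothetical) holds the parsed command-line settings.
type options struct {
	Input  string
	Format string
	Output string
}

// parseArgs (hypothetical) parses -i, -t and -o. When -o is omitted, the
// output name is the input name with its extension swapped for the format.
func parseArgs(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("main", flag.ContinueOnError)
	fs.StringVar(&o.Input, "i", "", "input PDF path")
	fs.StringVar(&o.Format, "t", "epub", "output format: epub or azw3")
	fs.StringVar(&o.Output, "o", "", "output path")
	if err := fs.Parse(args); err != nil {
		return o, err
	}
	if o.Input == "" {
		return o, fmt.Errorf("missing -i input file")
	}
	if o.Format != "epub" && o.Format != "azw3" {
		return o, fmt.Errorf("unsupported format %q", o.Format)
	}
	if o.Output == "" {
		base := strings.TrimSuffix(o.Input, filepath.Ext(o.Input))
		o.Output = base + "." + o.Format
	}
	return o, nil
}

func main() {
	for _, args := range [][]string{
		{"-i", "Martial.pdf"},
		{"-i", "Martial.pdf", "-t", "azw3"},
		{"-i", "book.pdf", "-o", "my-ebook.epub"},
	} {
		o, err := parseArgs(args)
		fmt.Println(o.Output, err)
	}
}
```

Using a dedicated flag.FlagSet rather than the global flag functions keeps the parser testable, since it can be called with any argument slice.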

Python Version

# Install dependencies
pip3 install PyMuPDF EbookLib python-docx pillow pytesseract

# Convert a text PDF → EPUB
python3 pdf2epub.py myfile.pdf

# Convert a scanned PDF using OCR
python3 pdf2epub_ocr.py scanned-book.pdf

# Convert a PDF → DOCX
python3 pdf2docx.py myfile.pdf

Comparison Results

Tested with Martial.pdf (20MB, 9,479 pages, 1,384 chapters):

Tool	Output	Size	Cover	Metadata	Time
Go	Martial.epub	8.4MB	Yes	Yes	~2 min
Go	Martial.azw3	17MB	Yes	Yes	~2 min
Python	book.epub	~7MB	No	No	~10 min

Which Version Should You Use?

Feature	Go Version	Python Version
Speed	Very fast (~2 min for 9,479 pages)	Slower (~10 min)
EPUB	Supported, with cover + metadata	Supported (basic)
AZW3 (Kindle)	Supported	Not supported
DOCX	Not supported	Supported
OCR	Not supported	Supported
Paragraph formatting	Merges consecutive lines	Splits every line
Cover	Extracted from the PDF automatically	None

Summary:

Use the Go version for general PDF → EPUB/AZW3 conversion (fast, high quality)
Use the Python version for scanned PDFs that need OCR, or for converting to DOCX

Dependencies

Go Version

# macOS: Calibre is required for AZW3
brew install calibre

Python Version

pip3 install PyMuPDF EbookLib python-docx pillow pytesseract

# For OCR (macOS)
brew install tesseract tesseract-lang

# For OCR (Linux)
sudo apt install tesseract-ocr tesseract-ocr-tha

Source Code

The project is on GitHub and ready to clone. If converting Thai PDFs gives you garbled characters, give it a try. PUA characters are a very common problem in Thai books distributed as PDF.

Download on GitHub

git clone https://github.com/aofiee/pdf-converter.git
cd pdf-converter
go build -o main main.go
