[python] Extracting text from an hwp file without Hancom Viewer

IT/python 2021. 12. 12. 13:57

# Related post: [Project][2021-09] Generating a word cloud image with konlpy

# Required libraries

########################################################################################

python 3.7 version (64bit)

olefile (pip install olefile)

Name: olefile
Version: 0.46
Summary: Python package to parse, read and write Microsoft OLE2 files (Structured Storage or Compound Document, Microsoft Office)
Home-page: https://www.decalage.info/python/olefileio
Author: Philippe Lagadec
Author-email: nospam@decalage.info
License: BSD
Location: c:\users\home\anaconda3\envs\py37\lib\site-packages
Requires:
Required-by: pyhwp

########################################################################################
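
olefile only understands OLE2 compound files, so it can help to check the file signature before opening anything. The snippet below is a minimal sketch using only the standard library; `looks_like_ole2` is a hypothetical helper name, and the demo files are throwaway fakes, not real documents. olefile also provides `olefile.isOleFile()` for the same kind of check.

```python
import os
import tempfile

# OLE2 compound files begin with this fixed 8-byte signature.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_ole2(path):
    """Return True if the file at `path` starts with the OLE2 signature."""
    with open(path, "rb") as fh:
        return fh.read(len(OLE2_MAGIC)) == OLE2_MAGIC


# Demo with throwaway files: one with the OLE2 signature, one with a ZIP signature.
with tempfile.TemporaryDirectory() as tmp:
    fake_ole = os.path.join(tmp, "fake.hwp")
    fake_zip = os.path.join(tmp, "fake.hwpx")
    with open(fake_ole, "wb") as fh:
        fh.write(OLE2_MAGIC + b"\x00" * 8)
    with open(fake_zip, "wb") as fh:
        fh.write(b"PK\x03\x04" + b"\x00" * 8)
    ole_result = looks_like_ole2(fake_ole)
    zip_result = looks_like_ole2(fake_zip)

print(ole_result, zip_result)
```

A file that fails this check (for example a ZIP-based .hwpx file) would need a different approach than the olefile code in this post.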

# Code

########################################################################################

import olefile
import zlib
import struct


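def get_hwp_text(filename):
    f = olefile.OleFileIO(filename)
    try:
        dirs = f.listdir()

        # Validate that this is an HWP file
        if ["FileHeader"] not in dirs or \
                ["\x05HwpSummaryInformation"] not in dirs:
            raise ValueError("Not Valid HWP.")

        # Check whether the document is compressed (bit 0 of the flags at offset 36)
        header_data = f.openstream("FileHeader").read()
        is_compressed = (header_data[36] & 1) == 1

        # Collect the BodyText/SectionN streams in numeric order
        nums = []
        for d in dirs:
            if len(d) == 2 and d[0] == "BodyText" and d[1].startswith("Section"):
                nums.append(int(d[1][len("Section"):]))
        sections = ["BodyText/Section" + str(x) for x in sorted(nums)]

        # Extract the full text
        text = ""
        for section in sections:
            data = f.openstream(section).read()
            if is_compressed:
                # Raw deflate stream without a zlib header, hence wbits=-15
                unpacked_data = zlib.decompress(data, -15)
            else:
                unpacked_data = data

            # Extract the text inside each section, record by record
            section_text = ""
            i = 0
            size = len(unpacked_data)
            while i + 4 <= size:
                rec_header = struct.unpack_from("<I", unpacked_data, i)[0]
                rec_type = rec_header & 0x3ff
                rec_len = (rec_header >> 20) & 0xfff
                i += 4

                # A length of 0xfff means the real length follows as a 4-byte integer
                if rec_len == 0xfff:
                    rec_len = struct.unpack_from("<I", unpacked_data, i)[0]
                    i += 4

                if rec_type == 67:  # paragraph text record
                    rec_data = unpacked_data[i:i + rec_len]
                    # Text is stored as little-endian UTF-16; control codes are not filtered here
                    section_text += rec_data.decode('utf-16-le', errors='ignore')
                    section_text += "\n"

                i += rec_len

            text += section_text
            text += "\n"

        return text
    finally:
        # Always release the file handle, even if validation or parsing fails
        f.close()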


text = get_hwp_text('text.hwp')
print(text)

########################################################################################
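
The record loop in the code above depends on how the 32-bit record header is packed, so here is a small standalone sketch of that layout, run on synthetic bytes instead of a real file. It assumes, based on my reading of the HWP 5.0 format description, that the header holds a 10-bit tag ID, a 10-bit level and a 12-bit size, and that a size of 0xFFF means the real size follows as an extra 4-byte integer. `iter_records` and `make_record` are hypothetical helper names for this demo only.

```python
import struct

HWPTAG_PARA_TEXT = 67  # the record type the code above filters on


def iter_records(buf):
    """Yield (tag_id, level, payload) for each record in a decompressed section."""
    pos = 0
    while pos + 4 <= len(buf):
        (header,) = struct.unpack_from("<I", buf, pos)
        tag_id = header & 0x3FF          # bits 0-9
        level = (header >> 10) & 0x3FF   # bits 10-19
        size = (header >> 20) & 0xFFF    # bits 20-31
        pos += 4
        if size == 0xFFF:                # assumed marker for an extended size field
            (size,) = struct.unpack_from("<I", buf, pos)
            pos += 4
        yield tag_id, level, buf[pos:pos + size]
        pos += size


def make_record(tag_id, level, payload):
    """Build one synthetic record for the demo (not part of any library)."""
    size = len(payload)
    if size < 0xFFF:
        return struct.pack("<I", tag_id | (level << 10) | (size << 20)) + payload
    # Payload too large for 12 bits: store 0xFFF and append the real size
    return struct.pack("<II", tag_id | (level << 10) | (0xFFF << 20), size) + payload


short_payload = "안녕".encode("utf-16-le")
long_payload = ("가" * 3000).encode("utf-16-le")  # 6000 bytes, needs the extended size
stream = (make_record(HWPTAG_PARA_TEXT, 1, short_payload)
          + make_record(HWPTAG_PARA_TEXT, 1, long_payload))

records = list(iter_records(stream))
for tag_id, level, payload in records:
    print(tag_id, level, len(payload))
```

Splitting the parsing into a generator like this keeps the header decoding in one place, so the text extraction loop only has to check the tag ID.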

# Execution screen

########################################################################################

(Screenshot of the printed text; image not available.)

########################################################################################
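
Depending on the document, the extracted text can contain stray characters, because paragraph text records mix control codes in with the visible characters. The snippet below is a rough sketch of one way to filter them out. The classification is an assumption taken from my reading of the HWP 5.0 format description (codes below 32 are controls; codes 0, 10, 13 and 24 to 31 take one 16-bit unit, all other control codes take eight units), so please check it against the official document before relying on it. `clean_para_text` is a hypothetical helper name.

```python
import struct

# Assumed: control codes that occupy a single 16-bit code unit.
CHAR_CONTROLS = {0, 10, 13, 24, 25, 26, 27, 28, 29, 30, 31}
# Assumed: every other control code (below 32) occupies 8 code units in total.
CONTROL_UNITS = 8


def clean_para_text(rec_data):
    """Decode a paragraph text payload, skipping the assumed control blocks."""
    count = len(rec_data) // 2
    units = struct.unpack("<%dH" % count, rec_data[:count * 2])
    out = []
    i = 0
    while i < len(units):
        code = units[i]
        if code >= 32:
            out.append(code)            # ordinary character (or surrogate half)
            i += 1
        elif code in CHAR_CONTROLS:
            if code == 10:
                out.append(10)          # keep line breaks
            elif code in (30, 31):
                out.append(32)          # special blanks become a plain space
            i += 1
        else:
            if code == 9:
                out.append(9)           # keep tabs, drop the attached tab data
            i += CONTROL_UNITS
    packed = struct.pack("<%dH" % len(out), *out)
    return packed.decode("utf-16-le", errors="ignore")


# Synthetic payload: an 8-unit control block, "Hi", a tab block, "there", paragraph end.
demo_units = ([2, 0x6364, 0x7365, 0, 0, 0, 0, 2]
              + [ord(c) for c in "Hi"]
              + [9, 0, 0, 0, 0, 0, 0, 9]
              + [ord(c) for c in "there"]
              + [13])
demo_payload = struct.pack("<%dH" % len(demo_units), *demo_units)
cleaned = clean_para_text(demo_payload)
print(repr(cleaned))
```

To try it with the function in this post, the `rec_data.decode(...)` call would be swapped for `clean_para_text(rec_data)`.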
