[python] Extracting text from HWP files without Hancom Viewer
IT/python 2021. 12. 12. 13:57
# Related post: [Project][2021-09] Generating a word cloud image with konlpy
# Required libraries
########################################################################################
Python 3.7 (64-bit)
olefile (pip install olefile)
Name: olefile
Version: 0.46
Summary: Python package to parse, read and write Microsoft OLE2 files (Structured Storage or Compound Document, Microsoft Office)
Home-page: https://www.decalage.info/python/olefileio
Author: Philippe Lagadec
Author-email: nospam@decalage.info
License: BSD
Location: c:\users\home\anaconda3\envs\py37\lib\site-packages
Requires:
Required-by: pyhwp
########################################################################################
# Code
########################################################################################
import olefile
import zlib
import struct


def get_hwp_text(filename):
    f = olefile.OleFileIO(filename)
    dirs = f.listdir()

    # Validate that the file is an HWP document
    if ["FileHeader"] not in dirs or \
            ["\x05HwpSummaryInformation"] not in dirs:
        f.close()
        raise Exception("Not Valid HWP.")

    # Check whether the document is compressed (bit 0 of the flags at offset 36)
    header = f.openstream("FileHeader")
    header_data = header.read()
    is_compressed = (header_data[36] & 1) == 1

    # Collect the body sections, ordered by section number
    nums = []
    for d in dirs:
        if d[0] == "BodyText":
            nums.append(int(d[1][len("Section"):]))
    sections = ["BodyText/Section" + str(x) for x in sorted(nums)]

    # Extract the full text
    text = ""
    for section in sections:
        bodytext = f.openstream(section)
        data = bodytext.read()
        if is_compressed:
            # Negative wbits: raw deflate stream without a zlib header
            unpacked_data = zlib.decompress(data, -15)
        else:
            unpacked_data = data

        # Extract the text inside each section
        section_text = ""
        i = 0
        size = len(unpacked_data)
        while i + 4 <= size:
            header = struct.unpack_from("<I", unpacked_data, i)[0]
            rec_type = header & 0x3ff
            rec_len = (header >> 20) & 0xfff
            i += 4
            # A size field of 0xfff means the real length follows as a 4-byte integer
            if rec_len == 0xfff:
                rec_len = struct.unpack_from("<I", unpacked_data, i)[0]
                i += 4
            # Tag 67 is the paragraph text record
            if rec_type == 67:
                rec_data = unpacked_data[i:i + rec_len]
                # Explicit little-endian so no bytes are mistaken for a BOM
                section_text += rec_data.decode('utf-16-le', errors='ignore')
                section_text += "\n"
            i += rec_len
        text += section_text
        text += "\n"

    f.close()
    return text


text = get_hwp_text('text.hwp')
print(text)
########################################################################################
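The loop in the code above walks a stream of records, each of which starts with a 32-bit header. As a minimal sketch, assuming the header layout described in the public HWP 5.0 format document (tag ID in bits 0-9, level in bits 10-19, size in bits 20-31, with a size of 0xfff meaning that the real length follows as an extra 4-byte integer), the walk can be isolated into a small generator. `iter_records` and `make_record` are hypothetical helper names introduced only for this illustration; `make_record` exists just to build sample data so the sketch runs without an actual .hwp file.

```python
import struct


def iter_records(buf):
    """Yield (tag_id, level, payload) for each record in a decompressed section."""
    i = 0
    while i + 4 <= len(buf):
        header = struct.unpack_from("<I", buf, i)[0]
        tag_id = header & 0x3ff          # bits 0-9
        level = (header >> 10) & 0x3ff   # bits 10-19
        size = (header >> 20) & 0xfff    # bits 20-31
        i += 4
        if size == 0xfff:
            # Assumed extended form: the real size follows as a 4-byte integer
            size = struct.unpack_from("<I", buf, i)[0]
            i += 4
        yield tag_id, level, buf[i:i + size]
        i += size


def make_record(tag_id, level, payload):
    """Hypothetical helper that builds one record, used only for sample data."""
    size = len(payload)
    if size >= 0xfff:
        head = struct.pack("<II", tag_id | (level << 10) | (0xfff << 20), size)
    else:
        head = struct.pack("<I", tag_id | (level << 10) | (size << 20))
    return head + payload


# Sample stream: one unrelated record followed by one text record (tag 67)
sample = make_record(66, 0, b"\x00" * 22) + \
    make_record(67, 1, "안녕".encode("utf-16-le"))

for tag_id, level, payload in iter_records(sample):
    if tag_id == 67:
        print(payload.decode("utf-16-le"))
```

Separating the header parsing from the text handling makes it easier to check the extended-size case on its own, since long paragraphs are where a fixed 12-bit length would otherwise desynchronize the loop.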
# Execution result
########################################################################################
########################################################################################
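Text extracted this way can contain stray characters, because a paragraph text record mixes ordinary characters with control codes below 32. The following is a rough sketch of one way to filter them, not part of the original script. It assumes the control-code table in the public HWP 5.0 format document: codes 0, 10, 13 and 24-31 occupy a single 16-bit unit, and every other code below 32 occupies eight units. `decode_para_text` and `CHAR_CONTROLS` are hypothetical names chosen for this illustration.

```python
import struct

# Assumption (from the HWP 5.0 format document): these control codes take one
# 16-bit unit; all other codes below 32 take eight units.
CHAR_CONTROLS = {0, 10, 13, 24, 25, 26, 27, 28, 29, 30, 31}


def decode_para_text(rec_data):
    """Decode a paragraph text payload, skipping control sequences."""
    count = len(rec_data) // 2
    units = struct.unpack_from("<%dH" % count, rec_data, 0)
    out = []
    i = 0
    while i < count:
        code = units[i]
        if code >= 32:
            out.append(code)             # ordinary character
            i += 1
        elif code in CHAR_CONTROLS:
            if code == 10:
                out.append(ord("\n"))    # line break inside a paragraph
            i += 1                       # other single-unit controls are dropped
        else:
            if code == 9:
                out.append(ord("\t"))    # tab is stored as an eight-unit control
            i += 8
    # Re-pack the kept units so surrogate pairs decode correctly
    packed = struct.pack("<%dH" % len(out), *out)
    return packed.decode("utf-16-le", errors="replace")


# Sample payload: "AB", a tab control (eight units), "C", paragraph break (13)
units = [0x41, 0x42, 9, 0, 0, 0, 0, 0, 0, 9, 0x43, 13]
payload = struct.pack("<%dH" % len(units), *units)
print(repr(decode_para_text(payload)))
```

In the script, a function like this would stand in for the direct `decode` call on each tag 67 record. The paragraph break code (13) is dropped here because the script already appends a newline after every record.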