  • [python] Extracting text from HWP files without the Hancom viewer
    IT/python 2021. 12. 12. 13:57

    # Related post: [Project][2021-09] Generating a word cloud image with konlpy

     

    # Required libraries

    ########################################################################################

    Python 3.7 (64-bit)

    olefile (pip install olefile)

    Name: olefile
    Version: 0.46
    Summary: Python package to parse, read and write Microsoft OLE2 files (Structured Storage or Compound Document, Microsoft Office)
    Home-page: https://www.decalage.info/python/olefileio
    Author: Philippe Lagadec
    Author-email: nospam@decalage.info
    License: BSD
    Location: c:\users\home\anaconda3\envs\py37\lib\site-packages
    Requires:
    Required-by: pyhwp

    ########################################################################################

     

    # Code

    ########################################################################################

    import olefile
    import zlib
    import struct
    
    
    def get_hwp_text(filename):
        f = olefile.OleFileIO(filename)
        try:
            dirs = f.listdir()

            # Validate that this is an HWP file
            if ["FileHeader"] not in dirs or \
                    ["\x05HwpSummaryInformation"] not in dirs:
                raise ValueError("Not a valid HWP file.")

            # Check whether the document body is compressed
            # (bit 0 of the flags field at offset 36 of FileHeader)
            header_data = f.openstream("FileHeader").read()
            is_compressed = (header_data[36] & 1) == 1

            # Collect the body sections in numeric order
            nums = []
            for d in dirs:
                if d[0] == "BodyText":
                    nums.append(int(d[1][len("Section"):]))
            sections = ["BodyText/Section" + str(x) for x in sorted(nums)]

            # Extract the text of the whole document
            text = ""
            for section in sections:
                data = f.openstream(section).read()
                if is_compressed:
                    # Negative wbits: raw deflate stream without a zlib header
                    unpacked_data = zlib.decompress(data, -15)
                else:
                    unpacked_data = data

                # Extract the text of each section, record by record
                section_text = ""
                i = 0
                size = len(unpacked_data)
                while i + 4 <= size:
                    header = struct.unpack_from("<I", unpacked_data, i)[0]
                    rec_type = header & 0x3ff
                    rec_len = (header >> 20) & 0xfff
                    i += 4

                    # A size field of 0xfff means the real length
                    # follows as a separate 4-byte integer
                    if rec_len == 0xfff:
                        rec_len = struct.unpack_from("<I", unpacked_data, i)[0]
                        i += 4

                    # Tag 67 is the paragraph text record
                    if rec_type == 67:
                        rec_data = unpacked_data[i:i + rec_len]
                        section_text += rec_data.decode('utf-16-le', errors='replace')
                        section_text += "\n"

                    i += rec_len

                text += section_text
                text += "\n"

            return text
        finally:
            # Always release the file handle, even if parsing fails
            f.close()
    
    
    text = get_hwp_text('text.hwp')
    print(text)
    

    ########################################################################################
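
The function above reads only bit 0 of the flags at offset 36 of the FileHeader stream. As a minimal sketch, assuming the layout given in the public HWP 5.0 format document (a 32-byte signature, a 4-byte version, then a 4-byte flags field), the same bytes can also show whether a file is password-protected or a distribution document. Those are two cases where reading BodyText as above is not expected to yield text. The helper name `read_hwp_flags` and the dictionary keys are hypothetical and not part of olefile.

```python
import struct

HWP_SIGNATURE = b"HWP Document File"


def read_hwp_flags(header_data):
    """Decode the property flags from the raw bytes of the FileHeader stream."""
    if len(header_data) < 40:
        raise ValueError("FileHeader stream is too short.")
    # The signature field is 32 bytes, padded with NUL bytes (assumed layout).
    if header_data[:32].rstrip(b"\x00") != HWP_SIGNATURE:
        raise ValueError("Missing HWP signature.")
    # Bytes 32-35 hold the version; bytes 36-39 hold the little-endian flags.
    flags = struct.unpack_from("<I", header_data, 36)[0]
    return {
        "compressed": bool(flags & 0x01),    # bit 0, as used in get_hwp_text
        "encrypted": bool(flags & 0x02),     # bit 1 (assumption from the spec)
        "distribution": bool(flags & 0x04),  # bit 2 (assumption from the spec)
    }
```

Inside `get_hwp_text`, this could be called as `read_hwp_flags(header_data)` right after the FileHeader stream is read, raising a clear error when `encrypted` or `distribution` is set.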

     

    # Execution screen

    ########################################################################################

    ########################################################################################
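
One thing to watch for in the output: the paragraph text record (tag 67) holds more than plain characters. According to the public HWP 5.0 format document, 16-bit code units below 32 are control characters, and most of them occupy eight units each. Decoding the record directly can therefore leave stray characters in the result. The following is a minimal sketch of a filter under that assumption. `clean_para_text` and `CHAR_CONTROLS` are hypothetical names, and the choice of replacement characters is illustrative only.

```python
import struct

# Controls that occupy a single 16-bit unit (assumed from the HWP 5.0 spec);
# every other code below 32 is treated as an 8-unit inline/extended control.
CHAR_CONTROLS = {0, 10, 13, 24, 25, 26, 27, 28, 29, 30, 31}


def clean_para_text(payload):
    """Decode a paragraph text payload, skipping HWP control sequences."""
    n = len(payload) // 2
    units = struct.unpack_from("<%dH" % n, payload)
    kept = []
    i = 0
    while i < n:
        code = units[i]
        if code >= 32:
            kept.append(code)          # ordinary UTF-16 code unit
            i += 1
        elif code in CHAR_CONTROLS:
            if code in (10, 13):
                kept.append(0x0A)      # line or paragraph break -> newline
            elif code in (30, 31):
                kept.append(0x20)      # special spaces -> plain space
            elif code == 24:
                kept.append(0x2D)      # hyphen
            i += 1
        else:
            if code == 9:
                kept.append(0x09)      # keep tabs as a tab character
            i += 8                     # skip the whole 8-unit control
    raw = struct.pack("<%dH" % len(kept), *kept)
    return raw.decode("utf-16-le", errors="replace")
```

In `get_hwp_text`, this could stand in for the `rec_data.decode(...)` call. Because code 13 already yields a newline here, the extra `"\n"` appended per record may then become redundant.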
