Extract Office Content
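Extract PPT:
When extracting content from a PPT, text boxes that contain formulas are lost.
Extracted tables lose part of their formatting.
Tables in a PPT are extracted to corresponding Excel files. Is there a better way?
Extract Word: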
Install extract_office_content
$ pip install extract_office_content
Run from the CLI.
Extract the content of all Office files.
$ extract_office_content -h
usage: extract_office_content [-h] [-img_dir SAVE_IMG_DIR] file_path
positional arguments:
file_path
optional arguments:
-h, --help show this help message and exit
-img_dir SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR
$ extract_office_content tests/test_files
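When the command needs to be launched from another program, the argument list can be assembled first and inspected before anything runs. The sketch below only builds that list from the flags shown in the help output above; `build_extract_command` is a hypothetical helper, not part of the package, and the `outputs/images` directory is an illustrative value.

```python
from typing import List, Optional


def build_extract_command(file_path: str, save_img_dir: Optional[str] = None) -> List[str]:
    """Build the argv list for the `extract_office_content` CLI.

    Hypothetical helper: it only mirrors the usage line printed by `-h`.
    """
    cmd = ["extract_office_content"]
    if save_img_dir is not None:
        # -img_dir is the short form of --save_img_dir in the help output.
        cmd += ["-img_dir", save_img_dir]
    # file_path is the single positional argument.
    cmd.append(file_path)
    return cmd


cmd = build_extract_command("tests/test_files", save_img_dir="outputs/images")
print(cmd)
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would execute it, provided the package is installed and the command is on `PATH`.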
Extract Word.
$ extract_word -h
usage: extract_word [-h] [-img_dir SAVE_IMG_DIR] word_path
positional arguments:
word_path
optional arguments:
-h, --help show this help message and exit
-img_dir SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR
$ extract_word tests/test_files/word_example.docx
Extract PPT.
$ extract_ppt -h
usage: extract_ppt [-h] [-img_dir SAVE_IMG_DIR] ppt_path
positional arguments:
ppt_path
optional arguments:
-h, --help show this help message and exit
-img_dir SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR
$ extract_ppt tests/test_files/ppt_example.pptx
Extract Excel.
$ extract_excel -h
usage: extract_excel [-h] [-f {markdown,html,latex,string}] [-o SAVE_IMG_DIR]
excel_path
positional arguments:
excel_path
optional arguments:
-h, --help show this help message and exit
-f {markdown,html,latex,string}, --output_format {markdown,html,latex,string}
-o SAVE_IMG_DIR, --save_img_dir SAVE_IMG_DIR
$ extract_excel tests/test_files/excel_example.xlsx
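Since each file type has its own command, a script that handles mixed inputs needs to pick the right one. A minimal sketch, assuming the choice can be made from the file suffix alone: the suffix table is an assumption taken from the example files above (`.docx`, `.pptx`, `.xlsx`), and `pick_command` is a hypothetical helper. The tools themselves may accept other suffixes.

```python
from pathlib import Path

# Assumed mapping, based only on the example files used in this README.
COMMAND_BY_SUFFIX = {
    ".docx": "extract_word",
    ".pptx": "extract_ppt",
    ".xlsx": "extract_excel",
}


def pick_command(file_path: str) -> str:
    """Return the CLI command name for a file, chosen by its suffix."""
    suffix = Path(file_path).suffix.lower()
    if suffix not in COMMAND_BY_SUFFIX:
        raise ValueError(f"unsupported file type: {suffix!r}")
    return COMMAND_BY_SUFFIX[suffix]


for path in ("tests/test_files/word_example.docx",
             "tests/test_files/ppt_example.pptx",
             "tests/test_files/excel_example.xlsx"):
    print(path, "->", pick_command(path))
```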
Run from a Python script.
Extract All.
from pathlib import Path

from extract_office_content import ExtractOfficeContent

extracter = ExtractOfficeContent()

# Run the extractor on every entry in the test directory.
file_list = list(Path('tests/test_files').iterdir())
for file_path in file_list:
    res = extracter(file_path)
    print(res)
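`iterdir()` returns every entry in the directory, including subdirectories and unrelated files. If the directory is not known to contain only Office files, the list can be filtered first. The sketch below is one way to do that; `office_files` and the `OFFICE_SUFFIXES` set are assumptions for illustration, and the extractor may support suffixes beyond these three.

```python
import tempfile
from pathlib import Path
from typing import List

# Assumed suffix set, taken from the example files in this README.
OFFICE_SUFFIXES = {".docx", ".pptx", ".xlsx"}


def office_files(directory: str) -> List[Path]:
    """Return regular files in `directory` with an Office suffix, sorted by path."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in OFFICE_SUFFIXES
    )


# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.docx", "b.pptx", "notes.txt"):
        (Path(tmp) / name).touch()
    (Path(tmp) / "subdir").mkdir()
    found = [p.name for p in office_files(tmp)]

print(found)
```

The filtered list can then replace `file_list` in the loop above.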
Extract Word.
from extract_office_content import ExtractWord

word_extract = ExtractWord()

# Pass a file path; the second argument is the output directory.
word_path = 'tests/test_files/word_example.docx'
text = word_extract(word_path, "outputs/word")

# or bytes
with open(word_path, 'rb') as f:
    word_content = f.read()
text = word_extract(word_content, "outputs/word")
print(text)
Extract PPT.
from pathlib import Path

from extract_office_content import ExtractPPT

ppt_extracter = ExtractPPT()

ppt_path = 'tests/test_files/ppt_example.pptx'
save_dir = 'outputs'
# Save images under outputs/<file name without suffix>.
save_img_dir = Path(save_dir) / Path(ppt_path).stem
res = ppt_extracter(ppt_path, save_img_dir=str(save_img_dir))

# or bytes
with open(ppt_path, 'rb') as f:
    ppt_content = f.read()
res = ppt_extracter(ppt_content, save_img_dir=str(save_img_dir))
print(res)
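The image directory in the example above is derived from the input file name, so each presentation gets its own folder. The same derivation can be sketched in isolation with `pathlib` alone; `image_dir_for` is a hypothetical helper name, not part of the package.

```python
from pathlib import Path


def image_dir_for(file_path: str, save_dir: str = "outputs") -> Path:
    """Return <save_dir>/<file name without suffix> for the given input file."""
    return Path(save_dir) / Path(file_path).stem


img_dir = image_dir_for("tests/test_files/ppt_example.pptx")
print(img_dir.as_posix())
```

Keeping one directory per input file avoids image name collisions when several presentations are processed into the same `save_dir`.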
Extract Excel.
from extract_office_content import ExtractExcel

excel_extract = ExtractExcel()

excel_path = 'tests/test_files/excel_with_image.xlsx'
res = excel_extract(excel_path, out_format='markdown', save_img_dir='1')

# or bytes
with open(excel_path, 'rb') as f:
    excel_content = f.read()
res = excel_extract(excel_content, out_format='markdown', save_img_dir='1')
print(res)
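The `extract_excel` help output lists four output formats: `markdown`, `html`, `latex`, and `string`. A caller that takes the format from user input may want to check it before invoking the extractor. This is a minimal sketch, assuming the Python `out_format` argument accepts the same values as the CLI `-f` flag, which the README does not state explicitly; `check_out_format` is a hypothetical helper.

```python
# Choices copied from the `extract_excel -h` output.
OUTPUT_FORMATS = ("markdown", "html", "latex", "string")


def check_out_format(out_format: str) -> str:
    """Return `out_format` unchanged if it is a listed choice, else raise ValueError."""
    if out_format not in OUTPUT_FORMATS:
        choices = ", ".join(OUTPUT_FORMATS)
        raise ValueError(f"out_format must be one of: {choices}; got {out_format!r}")
    return out_format


print(check_out_format("markdown"))
```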
2023-07-02 v0.0.6 update:
2023-06-17 v0.0.4 update:
About
Extract content (including text, tables, and images) from Office files (Word, Excel, PPT).