hpr3596 :: Extracting text, tables and images from docx files using Python
In this episode, I describe how I used two Python libraries to extract important data from docx files.
Hosted by Mr. Young on Monday, 2022-05-16. The show is flagged as Clean and is released under a CC-BY-SA license.
python, docx.
The show is available on the Internet Archive at: https://archive.org/details/hpr3596
Duration: 00:08:37
A Little Bit of Python.
Initially based on the podcast "A Little Bit of Python", by Michael Foord, Andrew Kuchling, Steve Holden, Dr. Brett Cannon and Jesse Noller. https://www.voidspace.org.uk/python/weblog/arch_d7_2009_12_19.shtml#e1138
Now the series is open to all.
Tools to extract data from docx files: docx2txt and python-docx (imported as docx).
Code Snippets
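import docx2txt

# src is the path to the docx file; images found in it are written to the img_dest directory
text = docx2txt.process(src, img_dest)
with open("data.txt", "wt") as f:
    f.write(text)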
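To show roughly what image extraction involves, here is a minimal sketch that uses only the standard library instead of docx2txt. It relies on the fact that a docx file is a ZIP archive whose embedded images normally sit under word/media/. The helper name extract_docx_images is hypothetical and is not part of docx2txt or python-docx; this is an illustration under that assumption, not a description of how docx2txt is implemented.

```python
import zipfile
from pathlib import Path


def extract_docx_images(src, img_dest):
    """Copy every file stored under word/media/ in the docx archive into img_dest.

    Hypothetical helper for illustration only. Returns the list of paths written.
    """
    dest = Path(img_dest)
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(src) as archive:
        for name in archive.namelist():
            # Skip directory entries and anything outside the media folder
            if name.startswith("word/media/") and not name.endswith("/"):
                target = dest / Path(name).name
                target.write_bytes(archive.read(name))
                written.append(target)
    return written
```

Calling extract_docx_images(src, img_dest) with the same two arguments as the snippet above would leave the images in img_dest, but it does not produce the document text.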
import docx

document = docx.Document(src)
tables = document.tables
# Build one list of rows per table; each row is a list of cell strings
data = []
for table in tables:
    table_data = []
    for row in table.rows:
        row_data = []
        for cell in row.cells:
            row_data.append(cell.text)
        table_data.append(row_data)
    data.append(table_data)
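import csv

# Write the extracted rows in data (not the docx table objects), one CSV file per table
for i, table_data in enumerate(data):
    with open(f"{i}.csv", "wt", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(table_data)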
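The CSV step can be tried without a docx file at hand. The sketch below wraps it in two hypothetical helpers, write_tables_to_csv and read_csv_rows (neither name comes from the episode), and assumes the data is a list of tables, each a list of rows of strings, as built by the loop above. Table cells often contain commas or several paragraphs, so the files are opened with newline="" as the csv module documentation recommends, which lets such cells survive a round trip.

```python
import csv
from pathlib import Path


def write_tables_to_csv(data, out_dir="."):
    """Write each table (a list of rows of strings) to <index>.csv in out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, table_data in enumerate(data):
        path = out / f"{i}.csv"
        # newline="" leaves line-ending handling to the csv module
        with open(path, "w", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(table_data)
        paths.append(path)
    return paths


def read_csv_rows(path):
    """Read a CSV file back into a list of rows, for checking the output."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f))
```

Reading a file back with read_csv_rows and comparing it to the original rows is a quick way to confirm that nothing was lost in the export.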