How to extract text from MS Word files in Python?

Sometimes, we want to extract text from MS Word files in Python.

In this article, we’ll look at how to extract text from MS Word files in Python.

How to extract text from MS Word files in Python?

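To extract text from MS Word files in Python, we can use the zipfile library, since a .docx file is a ZIP archive of XML files. This approach does not work for the older binary .doc format.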

For instance, we write

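import re
import zipfile

# A .docx file is a ZIP archive; the main body is stored in word/document.xml
with zipfile.ZipFile('/path/to/file/mydocument.docx') as docx:
    content = docx.read('word/document.xml').decode('utf-8')

# Remove every XML tag, including tags that span a line break
cleaned = re.sub(r'<(.|\n)*?>', '', content)
print(cleaned)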

to create a ZipFile object from the path to the Word file.

Then we call read with 'word/document.xml' to read the document's main XML part from the archive as bytes.

And we call decode to decode those bytes as UTF-8 into a string.

Next, we call re.sub to replace every XML tag with an empty string, which leaves only the document's text.
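
Stripping tags with a regular expression runs all paragraphs together and leaves XML entities such as &amp; escaped. As a rough sketch of one alternative, we could parse the same XML with the standard library's xml.etree.ElementTree and collect the text paragraph by paragraph. The names extract_paragraphs and W_NS below are made up for this illustration, and the code assumes the body text lives in w:t elements inside w:p elements, which is how Word normally stores it.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def extract_paragraphs(docx_file):
    """Return the text of each paragraph in a .docx file as a list of strings.

    docx_file may be a path or a binary file-like object.
    (Hypothetical helper, not part of any library.)
    """
    with zipfile.ZipFile(docx_file) as docx:
        root = ET.fromstring(docx.read('word/document.xml'))

    paragraphs = []
    for para in root.iter(W_NS + 'p'):
        # Each run of text inside a paragraph is stored in a w:t element
        text = ''.join(node.text or '' for node in para.iter(W_NS + 't'))
        paragraphs.append(text)
    return paragraphs
```

For instance, we could call print('\n'.join(extract_paragraphs('/path/to/file/mydocument.docx'))) to print one paragraph per line. Paragraphs nested inside other paragraphs, such as text boxes, may appear more than once with this simple loop.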

Conclusion

To extract text from MS Word files in Python, we can open the .docx file with the zipfile library, read word/document.xml, and remove the XML tags with re.sub.