How to extract text from a PDF file using PDFMiner in Python?

Sometimes, we want to extract the text content of a PDF file in Python.

In this article, we’ll look at how to do that with PDFMiner.

How to extract text from a PDF file using PDFMiner in Python?

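To extract text from a PDF file using PDFMiner in Python, we open the file in binary mode, parse it into a PDFDocument, and process each page with a PDFPageInterpreter. The interpreter sends its output to a TextConverter, which writes the extracted text into a StringIO buffer that we can read back as a string.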

For instance, we write:

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

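# In-memory text buffer that will receive the extracted text
output_string = StringIO()
with open('example.pdf', 'rb') as in_file:
    # Parse the raw file and build the document structure
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    # The resource manager holds resources shared between pages, such as fonts
    rsrcmgr = PDFResourceManager()
    # The converter writes the text of every processed page to output_string
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

print(output_string.getvalue())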

Here, we open the example.pdf file in binary mode with open.

Then we create the PDFParser object with the in_file.

Next, we create a PDFDocument object with the parser.

And then we create the TextConverter object with the PDFResourceManager object rsrcmgr and output_string, and we pass rsrcmgr and the converter to PDFPageInterpreter to create the interpreter.

Then we loop through the pages we get from PDFPage.create_pages(doc) with a for loop.

In the loop, we call interpreter.process_page with page, which extracts the text of that page and writes it to output_string.
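Since every page is written to the same output_string, the page boundaries are not obvious in the combined text. As far as we know, TextConverter writes a form feed character after each page, so, assuming that behavior, we can split the text back into pages. The sketch below is only an illustration: split_pages is a hypothetical helper, not part of PDFMiner, and sample is a hand-written string standing in for real extracted text.

```python
FORM_FEED = "\f"


def split_pages(text: str) -> list[str]:
    """Split combined text into one string per page.

    Assumption: the text of each page is followed by a form feed character.
    """
    pages = text.split(FORM_FEED)
    # A trailing form feed leaves one empty string at the end, so drop it.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


# Hand-written sample standing in for the extracted text of a two-page PDF.
sample = "First page\n\fSecond page\n\f"
for number, page in enumerate(split_pages(sample), start=1):
    print(number, repr(page))
```

If the output of a real PDF contains no form feeds, this helper returns the whole text as a single page, so it is worth checking the result against a file with a known page count.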

Then we get the extracted text as a string with output_string.getvalue() and print it.
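The steps above can also be collected into a reusable function. This is a minimal sketch that uses the same PDFMiner classes as the example; the name pdf_to_text and its path parameter are our own choices for illustration. The PDFMiner imports are placed inside the function here so that defining it does not require the library to be installed, which is a design choice rather than a requirement.

```python
from io import StringIO


def pdf_to_text(path: str) -> str:
    """Return the text of the PDF file at path as a single string."""
    # Imported here so that merely defining the function does not need
    # PDFMiner; these can be moved to module level if preferred.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    output_string = StringIO()
    with open(path, "rb") as in_file:
        # Parse the file and build the document structure.
        doc = PDFDocument(PDFParser(in_file))
        rsrcmgr = PDFResourceManager()
        # The converter writes the text of each processed page to output_string.
        device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)
    return output_string.getvalue()
```

With this in place, we can call pdf_to_text('example.pdf') and work with the returned string instead of printing it directly. A path that does not exist raises FileNotFoundError from open, which the caller can handle as needed.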

Conclusion

To extract text from a PDF file using PDFMiner in Python, we open the file in binary mode, process each page with a PDFPageInterpreter, and use TextConverter to write the extracted text into a StringIO buffer that we read back as a string.