Sometimes, we want to extract text from a PDF file using PDFMiner in Python.
In this article, we’ll look at how to do this with PDFMiner’s TextConverter class.
How to extract text from a PDF file using PDFMiner in Python?
To extract text from a PDF file using PDFMiner in Python, we open the PDF file in binary mode and then use TextConverter to write the extracted text into a StringIO buffer, which we read back as a string.
For instance, we write
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

output_string = StringIO()
with open('example.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

print(output_string.getvalue())
to open the example.pdf file with open in binary mode ('rb').
Then we create the PDFParser object with in_file.
Next, we create a PDFDocument object with the parser.
And then we create the TextConverter object, device, with the PDFResourceManager object rsrcmgr, output_string, and an LAParams object that holds the layout analysis settings.
We also create a PDFPageInterpreter object with rsrcmgr and device, so that the text of every page it processes is written to output_string.
Then we loop through the pages we get from PDFPage.create_pages(doc) with a for loop.
And we call interpreter.process_page with page to parse each page into text.
Finally, we get the parsed content as a string with output_string.getvalue() and print it.
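The role of output_string can be seen in isolation with a small sketch that needs no PDF at all. Here fake_process_page is a hypothetical stand-in for interpreter.process_page (it is not part of PDFMiner), and the form feed used as a page separator is an assumption of this sketch. Each call writes one page's text into the buffer, and getvalue() returns everything written so far as one string.

```python
from io import StringIO


def fake_process_page(buffer, page_text):
    """Hypothetical stand-in for interpreter.process_page: write one page's text."""
    buffer.write(page_text)
    buffer.write("\f")  # assumption: mark the page boundary with a form feed


output_string = StringIO()
for page_text in ["First page\n", "Second page\n"]:
    fake_process_page(output_string, page_text)

# getvalue() returns the whole buffer, regardless of the current stream position
text = output_string.getvalue()

# Split the accumulated text back into pages, dropping the empty trailing piece
pages = [p for p in text.split("\f") if p]
print(pages)
```

Because the buffer lives in memory, nothing is written to disk unless we save the string ourselves.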
Conclusion
To extract text from a PDF file using PDFMiner in Python, we open the PDF file in binary mode, use TextConverter to write the text of each page into a StringIO buffer, and then read the result back as a string with getvalue().
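When the default settings are good enough, the same result can usually be had with less code. The sketch below assumes the pdfminer.six distribution, which provides extract_text in pdfminer.high_level. The wrapper name pdf_to_text and the up-front file check are our own additions for illustration, not part of PDFMiner.

```python
from pathlib import Path


def pdf_to_text(path):
    """Return the text of the PDF at `path` (hypothetical helper, not part of PDFMiner)."""
    pdf_path = Path(path)
    if not pdf_path.is_file():
        raise FileNotFoundError(f"No such PDF file: {pdf_path}")
    # Imported here so a missing file is reported before PDFMiner is loaded.
    # Assumes pdfminer.six is installed (pip install pdfminer.six).
    from pdfminer.high_level import extract_text

    return extract_text(str(pdf_path))
```

Calling print(pdf_to_text('example.pdf')) should then print much the same text as the longer version, since extract_text sets up the parser, converter and interpreter for us.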