Sometimes, we want to grab only the text that a visitor actually sees on a webpage, without the scripts, styles, and comments that sit in the markup.
In this article, we’ll look at how to grab visible webpage text with BeautifulSoup.
How to grab visible webpage text with BeautifulSoup?
To grab visible webpage text with BeautifulSoup, we can collect every text node in the page and pass the result through Python’s built-in filter function to drop the nodes that browsers don’t render.
For instance, we write:
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request


def tag_visible(element):
    # Text inside these parents is never rendered as page content.
    if element.parent.name in [
        'style', 'script', 'head', 'title', 'meta', '[document]'
    ]:
        return False
    # HTML comments are text nodes too, but they are not displayed.
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    # Collect every text node in the parsed document.
    texts = soup.findAll(text=True)
    # Keep only the nodes that tag_visible accepts.
    visible_texts = filter(tag_visible, texts)
    return " ".join(t.strip() for t in visible_texts)


html = urllib.request.urlopen('https://yahoo.com').read()
print(text_from_html(html))
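The tag_visible function receives a text node and checks element.parent.name against the tags whose content browsers don’t display: style, script, head, title, meta, and [document] (the name of the top-level BeautifulSoup object).
It returns False for those nodes and for HTML comments, and True for everything else.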
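To see the check in action, here is a minimal sketch that runs tag_visible over a small inline page. The sample_html string and the results and visible names are made up for illustration, and the sketch uses find_all(string=True), the current spelling of the findAll(text=True) call from the article.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment


def tag_visible(element):
    # Same check as in the article: reject text under non-rendered parents.
    if element.parent.name in [
        'style', 'script', 'head', 'title', 'meta', '[document]'
    ]:
        return False
    # Reject HTML comments.
    if isinstance(element, Comment):
        return False
    return True


# Hypothetical sample page, written without whitespace between tags so that
# every text node carries real content.
sample_html = (
    "<html><head><title>Demo</title><style>p {color: red}</style></head>"
    "<body><p>Hello</p><!-- hidden note --><script>var x = 1;</script></body></html>"
)

soup = BeautifulSoup(sample_html, "html.parser")

# Pair each text node with the verdict from tag_visible.
results = [(str(node), tag_visible(node)) for node in soup.find_all(string=True)]
for text, is_visible in results:
    print(repr(text), is_visible)

# Only the paragraph text should survive the check.
visible = [text for text, is_visible in results if is_visible]
```

Only the text inside the p element passes the check; the title, the stylesheet, the comment, and the script body are all rejected.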
Then we define the text_from_html function to grab the text.
We call the BeautifulSoup constructor with body and the 'html.parser' parser name to parse the content.
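Then we call soup.findAll with text set to True to get all the text nodes in the document, including comments and the contents of script and style tags.
Note that findAll is the legacy alias of find_all, and recent BeautifulSoup versions prefer the string argument over text.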
And then we call filter with tag_visible and texts to keep only the visible nodes.
And finally, we strip the surrounding whitespace from each node and call join to combine the results into a single space-separated string.
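One side effect of joining every stripped node is that whitespace-only nodes between tags become empty strings, which leaves runs of spaces in the output. The sketch below is one possible variant, not part of the original example, that skips those empty strings. The name text_from_html_compact and the sample page are hypothetical.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment


def tag_visible(element):
    # Same visibility check as in the article.
    if element.parent.name in [
        'style', 'script', 'head', 'title', 'meta', '[document]'
    ]:
        return False
    if isinstance(element, Comment):
        return False
    return True


def text_from_html_compact(body):
    # Hypothetical variant of text_from_html that drops empty strings.
    soup = BeautifulSoup(body, "html.parser")
    # filter is lazy: nodes are tested only as the generators below consume them.
    visible_texts = filter(tag_visible, soup.find_all(string=True))
    stripped = (t.strip() for t in visible_texts)
    # Skip nodes that were whitespace only, so the words are single-spaced.
    return " ".join(s for s in stripped if s)


# Hypothetical sample page with the usual indentation between tags.
sample_html = """
<html>
  <head><title>Demo</title></head>
  <body>
    <h1>Heading</h1>
    <p>First paragraph.</p>
    <script>console.log("not shown");</script>
  </body>
</html>
"""

compact_text = text_from_html_compact(sample_html)
print(compact_text)
```

For this sample the function returns the heading and the paragraph separated by a single space, with the title and script contents left out.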
We then fetch the page with urllib.request.urlopen, read the response body, and call text_from_html with the returned bytes.
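The bare urlopen call works for a quick test, but it has no timeout and leaves the connection to be closed implicitly. As a rough sketch, the fetch step could be wrapped in a small helper. The fetch_html name, the 10 second default, and the User-Agent value are assumptions chosen for illustration, not something the original example requires.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Return the raw response body for url as bytes."""
    # Assumption: a browser-like User-Agent header. The original example
    # sends urllib's default header instead.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    # The with block closes the response; timeout is given in seconds.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

With this helper, the last two lines of the example would become html = fetch_html('https://yahoo.com') followed by print(text_from_html(html)). The result is still bytes, which the BeautifulSoup constructor accepts directly.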
Conclusion
To grab visible webpage text with BeautifulSoup, we can collect every text node in the page and use filter to keep only the nodes whose parent tag is rendered and that aren’t comments.