This tutorial shows how to write an HTML parsing script in Python. We’ll use the BeautifulSoup module to parse the HTML.
The goal is a Python HTML parsing package that lets us extract tags as Python lists, dictionaries, or objects.
The following is the HTML code we will parse:
<html>
  <head>Python html parse</head>
  <body class='responsive'>
    <div class='container'>
      <div id='class'>Div1 content</div>
      <div>Div2 content</div>
    </div>
  </body>
We need to reach nested tags by tag name or id, so that we can extract the text from the div with class='container' inside the body tag, or from any similar element.
What Is BeautifulSoup?
Beautiful Soup is a Python package for parsing HTML and XML files and extracting data from them. It works with your preferred parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or even days of work.
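To make "navigation, search, and modification" concrete, here is a minimal sketch. The HTML snippet and the variable names are made up for illustration only, and the parser used is Python's built-in "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for illustration.
html = "<div class='container'><p id='intro'>Hello</p><p>World</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Navigation: walk down the tree with attribute access.
first_p = soup.div.p
print(first_p.text)

# Search: collect every <p> tag into a list-like result.
paragraphs = soup.find_all("p")
print(len(paragraphs))

# Modification: replace the tag's text in place, then re-serialize.
first_p.string = "Hi"
print(soup)
```

Attribute access such as soup.div.p always returns the first matching tag, so it is handy for quick navigation but less precise than find() or find_all().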
Install package
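Install the package with pip:
$ pip install beautifulsoup4
The bs4 package on PyPI is only a wrapper that installs beautifulsoup4, so a separate "pip install bs4" is not needed.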
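A quick way to confirm the installation worked is to import the module and print its version. This is only a sanity check, not part of the parsing script itself.

```python
import bs4

# The module is imported as "bs4" even though the package is named beautifulsoup4.
print(bs4.__version__)
```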
Python script
Let’s create a Python script to parse the HTML data. We’ll find the text of the div that has the 'container' class.
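# The legacy "from BeautifulSoup import BeautifulSoup" fallback (Beautiful Soup 3)
# is dropped: it does not accept a parser name as its second argument.
from bs4 import BeautifulSoup

html = """<html>
<head>Python html parse</head>
<body class='responsive'>
<div class='container'>
<div id='class'>Div1 content</div>
<div>Div2 content</div>
</div>
</body>"""

# Parse with Python's built-in HTML parser.
parsed_html = BeautifulSoup(html, "html.parser")

# .text joins the text of all nested tags, including the whitespace between them.
print(parsed_html.body.find('div', attrs={'class': 'container'}).text)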
Output
Div1 content Div2 content
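Since the goal stated at the start was to get tags as Python lists or dictionaries, the following sketch shows one possible way to do that with the same HTML. The variable names (soup, container, by_id, texts) are illustrative choices, not part of the original script.

```python
from bs4 import BeautifulSoup

html = """<html> <head>Python html parse</head> <body class='responsive'> <div class='container'> <div id='class'>Div1 content</div> <div>Div2 content</div> </div> </body>"""
soup = BeautifulSoup(html, "html.parser")

container = soup.body.find("div", attrs={"class": "container"})

# Look up a nested tag by its id attribute.
by_id = container.find(id="class")
print(by_id.text)  # Div1 content

# Direct child <div> tags as a Python list of stripped strings.
texts = [div.get_text(strip=True)
         for div in container.find_all("div", recursive=False)]
print(texts)  # ['Div1 content', 'Div2 content']

# Each tag exposes its attributes as a dictionary via .attrs.
print(by_id.attrs)      # {'id': 'class'}
print(container.attrs)  # {'class': ['container']}
```

Note that class is a multi-valued attribute in HTML, so Beautiful Soup stores it as a list rather than a single string.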
How to Find by CSS selector
BeautifulSoup provides the select() and select_one() methods to find elements by CSS selector. The select() method returns all matching elements, whereas select_one() returns only the first matching element.
from bs4 import BeautifulSoup

html = """<html>
<head>Python html parse</head>
<body class='responsive'>
<div class='container'>
<div id='class'>Div1 content</div>
<div>Div2 content</div>
</div>
</body>"""

parsed_html = BeautifulSoup(html, "html.parser")

# 'div > *' matches every element that is a direct child of a div.
els = parsed_html.select('div > *')
for el in els:
    print(el)
Output
<div id="class">Div1 content</div>
<div>Div2 content</div>
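The example above only uses select(). As a complement, here is a small sketch of select_one() on the same HTML, together with an id selector and a selector that matches nothing. The selectors and variable names are illustrative choices.

```python
from bs4 import BeautifulSoup

html = """<html> <head>Python html parse</head> <body class='responsive'> <div class='container'> <div id='class'>Div1 content</div> <div>Div2 content</div> </div> </body>"""
soup = BeautifulSoup(html, "html.parser")

# select_one() returns a single tag, or None when nothing matches.
first = soup.select_one("div.container > div")
print(first.text)  # Div1 content

# '#class' selects by id; in this document the id value happens to be "class".
by_id = soup.select_one("#class")
print(by_id.text)  # Div1 content

# select() returns an empty list when there are no matches.
missing = soup.select("span")
print(len(missing))  # 0

# select_one() returns None in the same situation, so check before using it.
print(soup.select_one("span") is None)  # True
```

Because select_one() can return None, it is safer to test the result before accessing .text on it.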