Python HTML Parser Using BeautifulSoup

This tutorial helps you create an HTML parsing script in Python. We'll use the BeautifulSoup Python module to parse the HTML.

The goal is a Python HTML parser package that lets us extract tags as Python lists, dictionaries, or objects.

The following is the sample HTML code:

<html>
<head>Python html parse</head>
<body class='responsive'>
    <div class='container'>
        <div id='class'>Div1 content</div>
        <div>Div2 content</div>
    </div>
</body>

We need to figure out how to reach nested tags by the name or id of the HTML tag, so that we can extract the content/text from the div tag with class='container' inside the body tag, or anything similar.

What Is BeautifulSoup?

Beautiful Soup is a Python package for parsing HTML and XML files and extracting data. It works with your preferred parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or even days of work.
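As a rough illustration of those three operations (navigation, search, and modification), here is a minimal sketch. The small `doc` string and the variable names are made up for this example and are not part of the tutorial's sample page.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for this illustration.
doc = "<div class='container'><p id='intro'>Hello</p><p>World</p></div>"
soup = BeautifulSoup(doc, "html.parser")

# Navigation: walk down the tree by tag name.
first_p = soup.div.p
nav_text = first_p.text

# Search: collect the text of every <p> tag in the document.
all_p_texts = [p.text for p in soup.find_all("p")]

# Modification: replace the text of a tag in place.
first_p.string = "Hi"
modified = str(soup)

print(nav_text)
print(all_p_texts)
print(modified)
```

Attribute-style access such as `soup.div.p` returns only the first matching tag at each step, while `find_all()` returns every match.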

Install package

Install the package with pip. The bs4 name on PyPI is only a placeholder that pulls in beautifulsoup4, so a single command is enough:

$ pip install beautifulsoup4
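
To confirm that the installation worked, a quick check along these lines can be run from a Python prompt. This is only a sanity-check sketch; the `version` and `smoke` names are chosen for this example.

```python
import bs4
from bs4 import BeautifulSoup

# The installed Beautiful Soup version, as a string.
version = bs4.__version__
print(version)

# Parse a tiny document with Python's built-in parser as a smoke test.
smoke = BeautifulSoup("<p>ok</p>", "html.parser").p.text
print(smoke)
```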

Python script
Let's create a Python script to parse HTML data. We'll find the text of the div that has the 'container' class.

# Beautiful Soup 4 is imported from the bs4 package
from bs4 import BeautifulSoup

html = """<html>
<head>Python html parse</head>
<body class='responsive'>
    <div class='container'>
        <div id='class'>Div1 content</div>
        <div>Div2 content</div>
    </div>
</body>"""

parsed_html = BeautifulSoup(html, "html.parser")
print(parsed_html.body.find('div', attrs={'class':'container'}).text)

Output (leading whitespace and blank lines trimmed)

Div1 content
Div2 content
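
The same parse tree can also be queried by id, and the matching tags can be turned into plain Python lists and dictionaries. The following is a minimal sketch under that reading of the goal; the `rows` structure and its keys (`name`, `attrs`, `text`) are assumptions made for this example, not something Beautiful Soup prescribes.

```python
from bs4 import BeautifulSoup

html = """<html>
<head>Python html parse</head>
<body class='responsive'>
    <div class='container'>
        <div id='class'>Div1 content</div>
        <div>Div2 content</div>
    </div>
</body>"""

parsed_html = BeautifulSoup(html, "html.parser")

# Look up a single tag by its id attribute.
by_id = parsed_html.find(id="class")
by_id_text = by_id.get_text(strip=True)

# Collect the direct child divs of the container as plain dictionaries.
container = parsed_html.find("div", class_="container")
rows = [
    {
        "name": child.name,
        "attrs": dict(child.attrs),
        "text": child.get_text(strip=True),
    }
    for child in container.find_all("div", recursive=False)
]

print(by_id_text)
print(rows)
```

Passing `recursive=False` limits the search to direct children, and `get_text(strip=True)` drops the surrounding whitespace that `.text` would keep.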

How to Find by CSS selector

BeautifulSoup provides the select() and select_one() methods for finding elements by CSS selector. The select() method returns all matching elements, whereas select_one() returns only the first matching element.

# Beautiful Soup 4 is imported from the bs4 package
from bs4 import BeautifulSoup

html = """<html>
<head>Python html parse</head>
<body class='responsive'>
    <div class='container'>
        <div id='class'>Div1 content</div>
        <div>Div2 content</div>
    </div>
</body>"""

parsed_html = BeautifulSoup(html, "html.parser")
els = parsed_html.select('div > *')

for el in els:
    print(el)

Output

<div id="class">Div1 content</div>
<div>Div2 content</div>
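
For comparison, here is a short sketch of select_one() on the same HTML. The selectors used (`div.container > div`, `#class`, `span`) are chosen for this example.

```python
from bs4 import BeautifulSoup

html = """<html>
<head>Python html parse</head>
<body class='responsive'>
    <div class='container'>
        <div id='class'>Div1 content</div>
        <div>Div2 content</div>
    </div>
</body>"""

parsed_html = BeautifulSoup(html, "html.parser")

# select() returns a list with every match.
all_children = parsed_html.select("div.container > div")

# select_one() returns only the first match.
first = parsed_html.select_one("div.container > div")

# An id selector picks out the div whose id is 'class'.
by_id = parsed_html.select_one("#class")

# When nothing matches, select_one() returns None instead of raising.
missing = parsed_html.select_one("span")

print(len(all_children))
print(first.text)
print(by_id.text)
print(missing)
```

Because select_one() can return None, it is worth checking the result before reading attributes such as `.text` from it.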
