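In this post we introduce BeautifulSoup, a library for parsing web pages flexibly and conveniently. It supports several parsers and lets you extract information from a page without writing regular expressions.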
Installation
pip install beautifulsoup4
Parsers
Python standard library
BeautifulSoup(markup, 'html.parser')
lxml HTML parser (recommended)
BeautifulSoup(markup, 'lxml')
lxml XML parser
BeautifulSoup(markup, 'xml')
html5lib
BeautifulSoup(markup, 'html5lib')
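Because the parser is chosen by name, a script can ask for lxml first and fall back to the standard-library parser when lxml is not installed. The following is a minimal sketch, assuming bs4 raises FeatureNotFound for a parser that is unavailable; make_soup is a hypothetical helper name and is not part of the library.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=('lxml', 'html.parser')):
    """Hypothetical helper: build a soup with the first available parser."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # The requested parser library is not installed; try the next one.
            continue
    raise RuntimeError('none of the requested parsers is installed')


soup = make_soup('<p>Hello</p>')
print(soup.p.string)
```

Different parsers can build slightly different trees from invalid markup, so a fallback like this is best kept for well-formed input.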
Basic usage
from bs4 import BeautifulSoup

html = """
<html>
<head>
<title>1ess's title</title>
</head>
<body>
<p class="title" name="1ess"><b>1ess's title</b></p>
<p class="story">no content</p>
<a href="https://github.com/1ess" class="sister" id="link1">A</a>
<a href="https://0xfee1dead.cn" class="sister" id="link2">B</a>
</body>
</html>
"""

# lxml only needs to be installed; BeautifulSoup loads it by name.
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())    # the parsed document, re-indented
print(soup.title.string)  # 1ess's title
Tag selectors
Selecting elements
from bs4 import BeautifulSoup

html = """
<html>
<head>
<title>1ess's title</title>
</head>
<body>
<p class="title" name="1ess"><b>1ess's title</b></p>
<p class="story">no content</p>
<a href="https://github.com/1ess" class="sister" id="link1">A</a>
<a href="https://0xfee1dead.cn" class="sister" id="link2">B</a>
</body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')
print(soup.title)        # <title>1ess's title</title>
print(type(soup.title))  # <class 'bs4.element.Tag'>
print(soup.head)         # the whole <head> element
print(soup.p)            # only the first <p> in the document
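Attribute-style access such as soup.p returns only the first matching tag, and returns None when no such tag exists. A small sketch of that behaviour follows; the markup is made up for the illustration, and 'html.parser' is used only so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

html = '<body><p class="title">first</p><p class="story">second</p></body>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.p                # first <p> only, equivalent to soup.find('p')
all_p = soup.find_all('p')    # every <p> in the document
missing = soup.table          # no <table> in the markup, so this is None

print(first.string)
print(len(all_p))
print(missing)
```

Checking for None before chaining further attribute access avoids an AttributeError on pages where the tag may be absent.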
Getting the name
from bs4 import BeautifulSoup

html = """
<html>
<head>
<title>1ess's title</title>
</head>
<body>
<p class="title" name="1ess"><b>1ess's title</b></p>
<p class="story">no content</p>
<a href="https://github.com/1ess" class="sister" id="link1">A</a>
<a href="https://0xfee1dead.cn" class="sister" id="link2">B</a>
</body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')
print(soup.title.name)  # title
Getting attributes
from bs4 import BeautifulSoup

html = """
<html>
<head>
<title>1ess's title</title>
</head>
<body>
<p class="title" name="1ess"><b>1ess's title</b></p>
<p class="story">no content</p>
<a href="https://github.com/1ess" class="sister" id="link1">A</a>
<a href="https://0xfee1dead.cn" class="sister" id="link2">B</a>
</body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')
# Both forms read the name attribute of the first <p> and print 1ess.
print(soup.p['name'])
print(soup.p.attrs['name'])
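Indexing a tag with a missing attribute raises KeyError, while get() returns None or a default, and multi-valued attributes such as class come back as a list. The sketch below illustrates this; the extra class value "external" and the 'untitled' default are made up for the example, and 'html.parser' is an assumption chosen to avoid depending on lxml.

```python
from bs4 import BeautifulSoup

html = '<a href="https://github.com/1ess" class="sister external" id="link1">A</a>'
soup = BeautifulSoup(html, 'html.parser')
link = soup.a

href = link['href']                            # raises KeyError if the attribute is absent
classes = link['class']                        # class is multi-valued, so this is a list
title = link.get('title')                      # absent attribute: None instead of an error
title_default = link.get('title', 'untitled')  # or a caller-supplied default

print(href, classes, title, title_default)
```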
Getting the content
from bs4 import BeautifulSoup

html = """
<html>
<head>
<title>1ess's title</title>
</head>
<body>
<p class="title" name="1ess"><b>1ess's title</b></p>
<p class="story">no content</p>
<a href="https://github.com/1ess" class="sister" id="link1">A</a>
<a href="https://0xfee1dead.cn" class="sister" id="link2">B</a>
</body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')
# The first <p> holds a single <b> child, so .string reaches through to its text.
print(soup.p.string)  # 1ess's title
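The .string attribute only works when a tag has exactly one child; with mixed content it is None, and get_text() is the usual alternative. A short sketch with made-up markup follows ('html.parser' is assumed here so that lxml is not required).

```python
from bs4 import BeautifulSoup

html = '<p class="title"><b>bold</b> and plain</p>'
soup = BeautifulSoup(html, 'html.parser')
p = soup.p

inner = p.b.string                  # <b> has a single text child
outer = p.string                    # <p> has two children, so .string is None
full_text = p.get_text()            # concatenates every string inside <p>
pieces = list(p.stripped_strings)   # each string with surrounding whitespace removed

print(inner, outer, full_text, pieces)
```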
Nested selection
from bs4 import BeautifulSoup

html = """
<html>
<head>
<title>1ess's title</title>
</head>
<body>
<p class="title" name="1ess"><b>1ess's title</b></p>
<p class="story">no content</p>
<a href="https://github.com/1ess" class="sister" id="link1">A</a>
<a href="https://0xfee1dead.cn" class="sister" id="link2">B</a>
</body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')
# Each attribute access returns a Tag, so selections can be chained.
print(soup.head.title.string)  # 1ess's title
Child and parent nodes
from bs4 import BeautifulSoup

html = """
<html>
<head>
<title>1ess's title</title>
</head>
<body>
<p class="title" name="1ess"><b>1ess's title</b></p>
<p class="story">no content</p>
<a href="https://github.com/1ess" class="sister" id="link1">A</a>
<a href="https://0xfee1dead.cn" class="sister" id="link2">B</a>
</body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')
print(soup.body.contents)  # list of direct children, including whitespace strings
print(soup.a.parent)       # the enclosing <body> element
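Besides .contents and .parent, a tag also exposes iterators over its children, descendants, ancestors, and siblings. The sketch below walks a small made-up tree; it is written without whitespace between tags so that no whitespace-only strings show up, and 'html.parser' is assumed so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

html = '<div id="outer"><ul><li>Foo</li><li>Bar</li></ul></div>'
soup = BeautifulSoup(html, 'html.parser')
ul = soup.ul

# .children yields direct children only.
child_names = [child.name for child in ul.children]

# .descendants walks the whole subtree; text nodes are str subclasses.
texts = [node for node in ul.descendants if isinstance(node, str)]

# .parents climbs to the root; the BeautifulSoup object itself is named '[document]'.
parent_names = [parent.name for parent in soup.li.parents]

# .next_sibling moves sideways at the same level.
second = soup.li.next_sibling

print(child_names, texts, parent_names, second.string)
```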
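Standard selectors
find_all
find_all(name, attrs, recursive, string, limit, **kwargs) searches the document by tag name, attributes, or text content. The string argument was called text before Beautiful Soup 4.4.0, so older code may still use that name.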
name
from bs4 import BeautifulSoup

html = """
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
"""

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))           # list-like result holding both <ul> tags
print(type(soup.find_all('ul')[0]))  # <class 'bs4.element.Tag'>
# Each result is a Tag, so find_all can be called again on it.
for ul in soup.find_all('ul'):
    for li in ul.find_all('li'):
        print(li)
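The name argument is not limited to a single string: it also accepts a list of names or a compiled regular expression, and limit and recursive narrow the search further. A minimal sketch on a cut-down version of the markup follows ('html.parser' is assumed so the example does not need lxml).

```python
import re

from bs4 import BeautifulSoup

html = '<div><h4>Hello</h4><ul><li>Foo</li><li>Bar</li><li>Jay</li></ul></div>'
soup = BeautifulSoup(html, 'html.parser')

# A list matches any of the given tag names, in document order.
either = [tag.name for tag in soup.find_all(['h4', 'ul'])]

# A compiled pattern is matched against each tag name.
headings = [tag.name for tag in soup.find_all(re.compile('^h[1-6]$'))]

# limit stops after the given number of matches.
first_two = [li.string for li in soup.find_all('li', limit=2)]

# recursive=False looks at direct children only; the <li> tags sit one level deeper.
direct = soup.div.find_all('li', recursive=False)

# find() returns the first match instead of a list.
first_li = soup.find('li')

print(either, headings, first_two, direct, first_li.string)
```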
attrs
from bs4 import BeautifulSoup

html = """
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
"""

soup = BeautifulSoup(html, 'lxml')
# Attributes can be passed as a dict through attrs ...
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'class': 'element'}))
# ... or as keyword arguments; class is a Python keyword, hence class_.
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))
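Two details of attribute matching are easy to miss: class_ matches any one of a tag's classes, and the attrs dict is the way to filter on attribute names that are not valid Python identifiers. The sketch below illustrates both; the data-id attribute is made up for the example, and 'html.parser' is assumed so lxml is not required.

```python
from bs4 import BeautifulSoup

html = (
    '<ul class="list" id="list-1"><li data-id="7">Foo</li></ul>'
    '<ul class="list list-small" id="list-2"></ul>'
)
soup = BeautifulSoup(html, 'html.parser')

# class_ matches when any single class on the tag equals the value.
both = [ul['id'] for ul in soup.find_all(class_='list')]
small = [ul['id'] for ul in soup.find_all('ul', class_='list-small')]

# data-id cannot be written as a keyword argument, so it goes in attrs.
tagged = soup.find_all(attrs={'data-id': '7'})

print(both, small, tagged[0].string)
```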
string (called text before Beautiful Soup 4.4.0)
from bs4 import BeautifulSoup

html = """
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
"""

soup = BeautifulSoup(html, 'lxml')
# Returns the matching strings themselves, not the tags that contain them.
print(soup.find_all(string='Foo'))  # ['Foo', 'Foo']
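Text matching also accepts a compiled regular expression, and combining it with a tag name returns the tags rather than the bare strings. The following is a minimal sketch, assuming Beautiful Soup 4.4.0 or later for the string keyword; the "Foobar" item is made up for the illustration, and 'html.parser' is used so lxml is not needed.

```python
import re

from bs4 import BeautifulSoup

html = '<ul><li>Foo</li><li>Bar</li><li>Foobar</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

# An exact string matches whole text nodes only.
exact = [str(s) for s in soup.find_all(string='Foo')]

# A pattern is searched within each text node.
prefixed = [str(s) for s in soup.find_all(string=re.compile('^Foo'))]

# With a tag name as well, the result is the tags whose .string matches.
tags = soup.find_all('li', string=re.compile('bar$'))

print(exact, prefixed, [tag.get_text() for tag in tags])
```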
CSS selectors
Pass a CSS selector straight to select() to make the selection.
from bs4 import BeautifulSoup

html = """
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
"""

soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))  # descendant selector by class
print(soup.select('ul li'))                  # every <li> inside any <ul>
print(soup.select('#list-2 .element'))       # by id, then by class
for li in soup.select('ul li'):
    print(li.get_text())
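select() always returns a list; when only the first match is needed, select_one() returns a single tag or None. The sketch below also shows a child combinator and an attribute selector. It assumes a Beautiful Soup version that ships with the Soup Sieve selector engine (4.7.0 or later), uses a shortened version of the markup, and uses 'html.parser' so lxml is not required.

```python
from bs4 import BeautifulSoup

html = (
    '<div class="panel">'
    '<ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>'
    '<ul class="list list-small" id="list-2"><li class="element">Jay</li></ul>'
    '</div>'
)
soup = BeautifulSoup(html, 'html.parser')

first = soup.select_one('#list-1 .element')   # first match only
absent = soup.select_one('#missing')          # no match: None, not an empty list

# Child combinator: <li> tags directly under a <ul> carrying the list-small class.
small_items = [li.get_text() for li in soup.select('ul.list-small > li')]

# Attribute selector: <ul> tags whose id starts with "list-".
ids = [ul['id'] for ul in soup.select('ul[id^="list-"]')]

print(first.get_text(), absent, small_items, ids)
```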