Advanced Web Scraping (2)

Web Page Parsing

test_data = """
        <div>
            <ul>
                 <li class="item-0"><a href=" 1.html" id="places_neighbours__row">9,596,960first item</a></li>
                 <li class="item-1"><a href=" 2.html">second item</a></li>
                 <li class="item-inactive"><a href=" 3.html">third item</a></li>
                 <li class="item-1"><a href=" 4.html" id="places_neighbours__row">fourth item</a></li>
                 <li class="item-0"><a href=" 5.html">fifth item</a></li>
                 <li class="good-0"><a href=" 5.html">fifth item</a></li>
            </ul>
            <book>
                <title lang="aaengbb">Harry Potter</title>
                <price id="places_neighbours__row">29.99</price>
            </book>
            <book>
                <title lang="zh">Learning XML</title>
                <price>39.95</price>
            </book>
            <book>
                <title>Python</title>
                <price>40</price>
            </book>
        </div>
        """

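import re
import json
import lxml.html
import requests
from bs4 import BeautifulSoup

"""
XPath quick reference
/   child step: at the start of an expression it begins at the document root, and every step must be a direct child of the previous one
//  descendant step: matches nodes at any depth, so a strict parent-child chain is not required
*   wildcard: matches every element
//div/book[1]/title  selects the title element of the first book under div
//div/book/title[@lang='zh']  selects title elements whose lang attribute equals zh
//div/book/title   //book/title   //title  give the same result here, because each relative path ends at title
//book/title/@*  selects every attribute of title
//book/title/text()  selects the text content of title, using text()
//a[@href=" 1.html" and @id="places_neighbours__row"]  selects a elements that satisfy both conditions
//a[@href=" 1.html" or @id="places_neighbours__row"]  selects a elements that satisfy either condition
//div/book[last()]/title/text()  selects the title text of the last book element
//div/book[price > 39]/title  selects the title of each book whose price child is greater than 39
//li[starts-with(@class,'item')]  selects li elements whose class attribute begins with item
//title[starts-with(@lang,'eng')]  selects title elements whose lang attribute begins with eng;
    to match eng anywhere in the value (for example aaengbb), use contains(@lang,'eng') instead
"""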
html = lxml.html.fromstring(test_data)
# Each assignment below overwrites html_data, so only the last query is printed.
# Comment out the later lines to inspect the result of an earlier one.
html_data = html.xpath("//div/ul/li/a[@href=' 1.html']")  # a elements with this exact href
html_data = html.xpath("//div/ul/li/a[@id]")  # a elements that have an id attribute
html_data = html.xpath("//div/ul/li[2]/a")  # the a inside the second li
html_data = html.xpath("//div/book/title")
html_data = html.xpath("//book/title")
html_data = html.xpath("//title")
html_data = html.xpath("//book/title/text()")
html_data = html.xpath('//a[@href=" 1.html" and @id="places_neighbours__row"]')
html_data = html.xpath('//a[@href=" 1.html" and @id="places_neighbours__row"]/@href')
html_data = html.xpath("//div/book[last()]/title/text()")
html_data = html.xpath("//div/book[price > 39]/title/text()")
# print(dir(html_data[0]))
for i in html_data:
    print(i)
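
Because html_data is reassigned on every line in the script above, it can help to run each expression separately and compare the results side by side. The following is a minimal sketch of that idea, not the author's code: the names SAMPLE, QUERIES and run_queries are hypothetical, and the fragment is a shortened stand-in for test_data with simplified href and id values. It also exercises the or, starts-with, contains and @* forms from the notes, which the script itself never runs.

```python
import lxml.html

# Hypothetical sample, shortened from the test data in this post.
SAMPLE = """
<div>
  <ul>
    <li class="item-0"><a href="1.html" id="row">first item</a></li>
    <li class="item-1"><a href="2.html">second item</a></li>
    <li class="good-0"><a href="3.html" id="row">third item</a></li>
  </ul>
  <book><title lang="aaengbb">Harry Potter</title><price>29.99</price></book>
  <book><title lang="zh">Learning XML</title><price>39.95</price></book>
  <book><title>Python</title><price>40</price></book>
</div>
"""

# Label -> XPath expression; the labels are only used for printing.
QUERIES = {
    "and": '//a[@href="1.html" and @id="row"]/text()',
    "or": '//a[@href="1.html" or @id="row"]/text()',
    "starts-with": '//li[starts-with(@class, "item")]/a/@href',
    "contains": '//title[contains(@lang, "eng")]/text()',
    "attributes": '//book/title/@*',
    "last": '//div/book[last()]/title/text()',
    "price": '//div/book[price > 39]/title/text()',
}


def run_queries(source=SAMPLE, queries=QUERIES):
    """Parse the fragment once and evaluate every expression against it."""
    root = lxml.html.fromstring(source)
    # text() and @attr results are string-like objects; convert to plain str.
    return {
        name: [str(value) for value in root.xpath(expr)]
        for name, expr in queries.items()
    }


results = run_queries()
for name, values in results.items():
    print(name, values)
```

Every query here ends in text() or an attribute step, so each result is a list of strings. A query that ends on an element (for example //book/title) returns element objects instead, and their text is read through the .text attribute.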

Views: 1901 2026-05-09
