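While working on a web crawler project today, I ran into a problem: pyquery could not find an element that was clearly present in the page.
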
from pyquery import PyQuery as pq

html = '''
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>TEST</title>
</head>
<body>
    <div class="warp">
        <ul class="goodsList">
            <li>this is the test1</li>
            <li>this is the test2</li>
            <li>this is the test3</li>
            <li>this is the test4</li>
        </ul>
    </div>
</body>
</html>
'''
# No parser argument: pyquery chooses the parser itself.
doc = pq(html)
element = doc('.warp ul li:first-child')
# An empty PyQuery result is falsy, so report it explicitly as None.
if element:
    print(element)
else:
    print('None')

Output:

None

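The selector itself is correct, yet the result is always None. Why? After checking the documentation, I found the cause: the selector expects the string to be parsed as HTML, but the document above is XHTML. It declares the XHTML namespace, so when pyquery parses it as XML the elements carry that namespace, and a plain CSS selector no longer matches them. Passing parser="html" when calling pq() tells pyquery to parse the string under HTML rules, which solves the problem.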
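
To see why a namespace declaration can make elements "disappear", here is a minimal sketch that uses Python's standard xml.etree.ElementTree instead of pyquery. It illustrates the general behavior of XML parsers, on the assumption that pyquery's XML parsing treats namespaced tags in a similar way. The names XHTML_NS, plain_matches and qualified_matches are made up for this example.

```python
import xml.etree.ElementTree as ET

# Illustrative constant: the XHTML namespace declared by the document.
XHTML_NS = "http://www.w3.org/1999/xhtml"

xhtml = f'''
<html xmlns="{XHTML_NS}">
<body>
    <ul class="goodsList">
        <li>this is the test1</li>
        <li>this is the test2</li>
    </ul>
</body>
</html>
'''

root = ET.fromstring(xhtml.strip())

# An XML parser stores each tag name together with its namespace.
print(root.tag)  # {http://www.w3.org/1999/xhtml}html

# A query for the bare tag name therefore matches nothing.
plain_matches = root.findall(".//li")
print(len(plain_matches))  # 0

# The same query with the namespace-qualified name finds the items.
qualified_matches = root.findall(f".//{{{XHTML_NS}}}li")
print(len(qualified_matches))  # 2
```

An HTML parser ignores the namespace declaration, which is why switching pyquery to parser="html" lets the unqualified selector match again.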

from pyquery import PyQuery as pq

html = '''
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>TEST</title>
</head>
<body>
    <div class="warp">
        <ul class="goodsList">
            <li>this is the test1</li>
            <li>this is the test2</li>
            <li>this is the test3</li>
            <li>this is the test4</li>
        </ul>
    </div>
</body>
</html>
'''
# Force the HTML parser so the XHTML namespace is ignored.
doc = pq(html, parser="html")
element = doc('.warp ul li:first-child')
# An empty PyQuery result is falsy, so report it explicitly as None.
if element:
    print(element)
else:
    print('None')

Output:

<li>this is the test1</li>

 
