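How to Use XPath for Exact and Fuzzy Matching of Baidu Promotion Results
- Author: zhanhy. Source: original. XPath is used most often in web scraping tools, where it locates the data to be scraped quickly and precisely. Sometimes a node has to be matched exactly, and sometimes only loosely. This article uses a real example, the promoted results on a Baidu search page, to show how to do both with XPath.
The Baidu search results and the search term are shown in the figure below, and part of the result page's source follows. The goal is to extract each promotion's title and abstract. To do that, first use XPath to select the region of markup that holds a promotion, then extract the title and abstract from inside that region.
Part of the source is shown below (link URLs have been shortened for readability; if you need the link URLs as well, adjust the expressions accordingly):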
<div id="3001" cmatchid="225" data-ecimtimesign="39" class="pJsPIR Naeqyh vLQtYD EUtKfl EC_ppim_new_gap_bottom" data-click="{"fm":"pp", "p1":3001, "p5":3001, "rsv_srcid":"49509"}"> <div class="kcJoV_"> <h3 class="t MdiUjv PdUOce"><a href="http://www.baidu.com/baidu.php" class="TtjIpz" data-is-main-url="true" data-landurl="https://www.vipkid.com.cn/web/sem?channel_id=212&channel_keyword=I5M427K65195O5_v1_10537462_66757469_2982263217_142001710787_37115329078_94_63_cl1_1&utm_source=sem_baidu&utm_medium=oldpc&utm_campaign=tongtou_tongyong&utm_content=zaixian&utm_term=N073853" target="_blank">学<font color="#CC0000">儿童英语哪里好</font>_【VIPKID】在线青<font color="#CC0000">少儿英语</font>_在家学英语</a></h3> </div> <div class="c-abstract gzGZwQ uBmWPn "> <div class=""> <a hidefocus="hidefocus" target="_blank" class="xVhiEH" href="http://www.baidu.com/baidu.php" data-landurl="https://www.vipkid.com.cn/web/sem?channel_id=212&channel_keyword=I5M427K65195O5_v1_10537462_66757469_2982263217_142001710787_37115329078_94_63_cl1_1&utm_source=sem_baidu&utm_medium=oldpc&utm_campaign=tongtou_tongyong&utm_content=zaixian&utm_term=N073853">VIPKID在线青<font color="#CC0000">少儿英语</font>,限时免费试听,真人外教在线实时互动,帮助孩子全方位提高<font color="#CC0000">英语</font>水平,现在注册即可在线免费领取价值288元试听礼包!</a> </div> <div class="XOQmu_"> <a class="cLrkye bWrcOE" target="_blank" href="http://www.baidu.com/baidu.php">刘涛代言</a> <a class="cLrkye " target="_blank" href="http://www.baidu.com/baidu.php">北美师资</a> <a class="cLrkye " target="_blank" href="http://www.baidu.com/baidu.php">1对1授课</a> </div> <div class="c-row wnAXjs c-gap-top-small "> <div class="i_vPxp c-span6 "> <a href="http://www.baidu.com/baidu.php" target="_blank"> <div class="ITjXAJ c-img c-img6 "> <span class="PkVuOH"></span> <img src="https://fc5tn.baidu.com/it/u=870837688,3029268417&fm=202" class="nLQkUy" /> <div class="UqJvhD"> </div> <div class="RxvMjk"> <span class="PkVuOH"></span> <div class="sSiIgq"> <p class="c-gap-bottom-small">刘涛代言</p> <span class="c-btn c-btn-primary 
c-btn-mini" href="http://www.baidu.com/baidu.php">去看看</span> </div> </div> </div> <div class="yixtZh" title="刘涛代言"> 刘涛代言 </div></a> </div> <div class="i_vPxp c-span6 "> <a href="http://www.baidu.com/baidu.php" target="_blank"> <div class="ITjXAJ c-img c-img6 "> <span class="PkVuOH"></span> <img src="https://fc2tn.baidu.com/it/u=1798040235,525758260&fm=202" class="nLQkUy" /> <div class="UqJvhD"> </div> <div class="RxvMjk"> <span class="PkVuOH"></span> <div class="sSiIgq"> <p class="c-gap-bottom-small">北美外教</p> <span class="c-btn c-btn-primary c-btn-mini" href="http://www.baidu.com/baidu.php">去看看</span> </div> </div> </div> <div class="yixtZh" title="北美外教"> 北美外教 </div></a> </div> <div class="i_vPxp c-span6 "> <a href="http://www.baidu.com/baidu.php" target="_blank"> <div class="ITjXAJ c-img c-img6 "> <span class="PkVuOH"></span> <img src="https://fc6tn.baidu.com/it/u=472412231,3219462096&fm=202" class="nLQkUy" /> <div class="UqJvhD"> </div> <div class="RxvMjk"> <span class="PkVuOH"></span> <div class="sSiIgq"> <p class="c-gap-bottom-small">在线授课</p> <span class="c-btn c-btn-primary c-btn-mini" href="http://www.baidu.com/baidu.php">去看看</span> </div> </div> </div> <div class="yixtZh" title="在线授课"> 在线授课 </div></a> </div> <div class="i_vPxp c-span6 c-span-last"> <a href="http://www.baidu.com/baidu.php" target="_blank"> <div class="ITjXAJ c-img c-img6 "> <span class="PkVuOH"></span> <img src="https://fc5tn.baidu.com/it/u=1176971384,3599299217&fm=202" class="nLQkUy" /> <div class="UqJvhD"> </div> <div class="RxvMjk"> <span class="PkVuOH"></span> <div class="sSiIgq"> <p class="c-gap-bottom-small">多样课程</p> <span class="c-btn c-btn-primary c-btn-mini" href="http://www.baidu.com/baidu.php">去看看</span> </div> </div> </div> <div class="yixtZh" title="多样课程"> 多样课程 </div></a> </div> </div> </div> <div class="bTwItd uBmWPn"> <a href="http://www.baidu.com/baidu.php" target="_blank" class="FZKtvO"><span class="FtGTVL">www.vipkid.com.cn</span> <span class="LHCArh">2020-08</span></a> <div 
id="tools_213_0" style="margin-left:5px;" class="c-tools"> <a class="c-tip-icon"><i class="c-icon c-icon-triangle-down-g"></i></a> </div> <span class="icons PdUOce FYwnyN"><a class="_haGlQ c-icon ec-baobiao ec-baobiao-first" data-baobiao="{"baobiao_text":"\u8be5\u4f01\u4e1a\u5df2\u901a\u8fc7\u5b9e\u540d\u8ba4\u8bc1\uff0c\u67e5\u770b <a href=\"https:\/\/www.baidu.com\/s?wd=%E5%8C%97%E4%BA%AC%E5%A4%A7%E7%B1%B3%E7%A7%91%E6%8A%80%E6%9C%89%E9%99%90%E5%85%AC%E5%8F%B8@v&vmp_ec=1593772510&vmp_ectm=b8d240fb9c7877bf46bdd8YMykzd3fNdla92bdc3daN5DmedlTNEwfc3Xee160X39df887c6&from=fc\" target=\"_blank\">\u4f01\u4e1a\u6863\u6848<\/a>\u3002<\/br>\u60a8\u5df2\u767b\u5f55\u767e\u5ea6\u8d26\u53f7\uff0c <a href=\"http:\/\/baozhang.baidu.com\/guarantee\/?from=fcad\" target=\"_blank\">\u767e\u5ea6\u7f51\u6c11\u6743\u76ca\u4fdd\u969c\u8ba1\u5212<\/a> \u4e3a\u60a8\u641c\u7d22\u62a4\u822a\u3002","baobiao_title":"\u5317\u4eac\u5927\u7c73\u79d1\u6280\u6709\u9650\u516c\u53f8"}"></a></span> <font class="LbmOSd lNOPyK zpoRhUppouter " size="-1">- <a href="http://www.baidu.com/baidu.php" target="_blank" class="m enCEkc">评价</a></font> <font class="LbmOSd ec_tuiguang_ppouter ec_tuiguang_container" size="-1"><a class="PdUOce m IAu_Hy m ec_tuiguang_ppouter ec_tuiguang_pplink " target="_blank" href="http://www.baidu.com/baidu.php" style="margin-left:5px;"><span data-tuiguang="{"tuiguang_text":"\u672c\u641c\u7d22\u7ed3\u679c\u4e3a <a href=\"https:\/\/isite.baidu.com\/site\/e.baidu.com\/4a98a7ec-2715-49b3-b97a-6f87b8617926?refer=919\" target=\"_blank\">\u5546\u4e1a\u63a8\u5e7f<\/a> \u4fe1\u606f\uff0c\u8bf7\u6ce8\u610f\u53ef\u80fd\u7684\u98ce\u9669\u3002<br\/>","tuiguang_title":""}" class="MnZBAi">广告</span></a></font> </div> <a href="http://www.baidu.com/baidu.php" target="_blank" class=" FCmXHq" style="display: none;" data-rank="0">幼儿蒙氏英语</a> <a href="http://www.baidu.com/baidu.php" target="_blank" class="c-gap-left-large FCmXHq" style="display: none;" data-rank="0">少儿英语</a> <a 
href="http://www.baidu.com/baidu.php" target="_blank" class="c-gap-left-large FCmXHq" style="display: none;" data-rank="0">儿童学英语哪里好</a> <a href="http://www.baidu.com/baidu.php" target="_blank" class="c-gap-left-large FCmXHq" style="display: none;" data-rank="0">幼儿英语教材</a> <a href="http://www.baidu.com/baidu.php" target="_blank" class="c-gap-left-large FCmXHq" style="display: none;" data-rank="0">儿童英语培训</a> <a href="http://www.baidu.com/baidu.php" target="_blank" class="c-gap-left-large FCmXHq" style="display: none;" data-rank="0">少儿英语应该如何学</a> <a href="http://www.baidu.com/baidu.php" target="_blank" class="c-gap-left-large FCmXHq" style="display: none;" data-rank="0">少儿英语主要学什么</a> <a href="http://www.baidu.com/baidu.php" target="_blank" class="c-gap-left-large FCmXHq" style="display: none;" data-rank="0">英孚少儿英语 价格</a> <a href="http://www.baidu.com/baidu.php" target="_blank" class="c-gap-left-large FCmXHq" style="display: none;border:none;" data-rank="0">上海英孚儿童英语价格表</a> </div> <!-- pc jieou new --> <div id="3002" cmatchid="225" data-ecimtimesign="39" data-general-xst="TjYknH0vnjD4PWcKmWYkPb7jwRc1PDmLn16vwW-jrHK7fWFjfbRvnDmdwDRkw0715HDYrHfYrHn4rj6drjbYPjcdnW63g1czPNtk0gTqYlQhv_MuVEMHCV5EEP83tUZV0gDqzIhvSIrBYSODknjE8_nKIHYzP1b1n1n4r07Y5HDdrHTLnHmknH0KUgDqn0cs0BYKmv6quhPxTAnKnHndrHTLnWTvP6" class="pJsPIR Naeqyh vLQtYD sDhKNG EUtKfl EC_ppim_new_gap_bottom" data-click="{"fm":"pp", "p1":3002, "p5":3002, "rsv_srcid":"49509"}"> <div class="kcJoV_"> <h3 class="t MdiUjv PdUOce"><a href="http://www.baidu.com/baidu.php" class="TtjIpz" data-is-main-url="true" data-landurl="https://www.acadsoc.com.cn/sem/baidu/Perry/shaoer/children1-pc-kf.htm?utm_source=baidu&utm_keyword=162093323369&utm_terminal=PC&utm_seat=cl2&utm_page=1&C_idea=c265-195-159" target="_blank">阿卡索_专注3-18岁在线青<font color="#CC0000">少儿英语</font>_低至13.8/节</a></h3> </div> <div class="c-abstract gzGZwQ uBmWPn "> <div class="cIjW_e"> <span class="HjKhFx"><span class="qvlCax">获得网友: </span><a class="sTKFSc" 
href="http://koubei.baidu.com/s/d8b13979178ae080408f018e98cf50aa" target="_blank" data-click="{"rsv_ct":1031,"p2":1}"><span class="QAXGkf">89%好评</span></a></span> <span class="HjKhFx"><span class="QAXGkf">164条评价</span></span> <span class="HjKhFx GNJ_EY"><span class="QAXGkf">“性价比高 | 价格便宜 | 老师好”</span></span> </div> <div class=""> <a hidefocus="hidefocus" target="_blank" class="xVhiEH" href="http://www.baidu.com/baidu.php" data-landurl="https://www.acadsoc.com.cn/sem/baidu/Perry/shaoer/children1-pc-kf.htm?utm_source=baidu&utm_keyword=162093323369&utm_terminal=PC&utm_seat=cl2&utm_page=1&C_idea=c265-195-159">阿卡索<font color="#CC0000">少儿英语</font>,暖爸佟大为代言品牌,教材全面覆盖中小学新课标,外教1对1在线教学,严选全球10000+持TESOL证书外教,全英互动课堂,让孩子爱学敢说!</a> </div> </div> <div class="bTwItd uBmWPn"> <a href="http://www.baidu.com/baidu.php" target="_blank" class="FZKtvO"><span class="FtGTVL">www.acadsoc.com.cn</span> <span class="LHCArh">2020-08</span></a> <div id="tools_213_1" style="margin-left:5px;" class="c-tools"> <a class="c-tip-icon"><i class="c-icon c-icon-triangle-down-g"></i></a> </div> <span class="icons PdUOce FYwnyN"><a class="_haGlQ c-icon ec-baobiao ec-baobiao-first" data-baobiao="{"baobiao_text":"\u8be5\u4f01\u4e1a\u5df2\u901a\u8fc7\u5b9e\u540d\u8ba4\u8bc1\uff0c\u67e5\u770b <a href=\"https:\/\/www.baidu.com\/s?wd=%E6%B7%B1%E5%9C%B3%E5%B8%82%E9%98%BF%E5%8D%A1%E7%B4%A2%E8%B5%84%E8%AE%AF%E8%82%A1%E4%BB%BD%E6%9C%89%E9%99%90%E5%85%AC%E5%8F%B8@v&vmp_ec=1578044188&vmp_ectm=c263ee24fec303c978c79bakMId2zmXc2cfe71O85T243Mz25zl4kfXlfe4Nd4a1485247a2&from=fc\" target=\"_blank\">\u4f01\u4e1a\u6863\u6848<\/a>\u3002<\/br>\u60a8\u5df2\u767b\u5f55\u767e\u5ea6\u8d26\u53f7\uff0c <a href=\"http:\/\/baozhang.baidu.com\/guarantee\/?from=fcad\" target=\"_blank\">\u767e\u5ea6\u7f51\u6c11\u6743\u76ca\u4fdd\u969c\u8ba1\u5212<\/a> \u4e3a\u60a8\u641c\u7d22\u62a4\u822a\u3002","baobiao_title":"\u6df1\u5733\u5e02\u963f\u5361\u7d22\u8d44\u8baf\u80a1\u4efd\u6709\u9650\u516c\u53f8"}"></a></span> <font class="LbmOSd lNOPyK 
zpoRhUppouter " size="-1">- <a href="http://www.baidu.com/baidu.php" target="_blank" class="m enCEkc">评价</a></font> <font class="LbmOSd ec_tuiguang_ppouter ec_tuiguang_container" size="-1"><a class="PdUOce m IAu_Hy m ec_tuiguang_ppouter ec_tuiguang_pplink " target="_blank" href="http://www.baidu.com/baidu.php" style="margin-left:5px;"><span data-tuiguang="{"tuiguang_text":"\u672c\u641c\u7d22\u7ed3\u679c\u4e3a <a href=\"https:\/\/isite.baidu.com\/site\/e.baidu.com\/4a98a7ec-2715-49b3-b97a-6f87b8617926?refer=919\" target=\"_blank\">\u5546\u4e1a\u63a8\u5e7f<\/a> \u4fe1\u606f\uff0c\u8bf7\u6ce8\u610f\u53ef\u80fd\u7684\u98ce\u9669\u3002<br\/>","tuiguang_title":""}" class="MnZBAi">广告</span></a></font> </div> <a href="http://www.baidu.com/baidu.php" target="_blank" class=" FCmXHq" style="display: none;" data-rank="1">幼儿蒙氏英语</a> <a href="http://www.baidu.com/baidu.php" target="_blank" class="c-gap-left-large FCmXHq" style="display: none;" data-rank="1">少儿英语</a> <a href="http://www.baidu.com/baidu.php" target="_blank" class="c-gap-left-large FCmXHq" style="display: none;" data-rank="1">儿童学英语哪里好</a> <a href="http://www.baidu.com/baidu.php" target="_blank" class="c-gap-left-large FCmXHq" style="display: none;" data-rank="1">幼儿园英语教材</a> <a href="http://www.baidu.com/baidu.php" target="_blank" class="c-gap-left-large FCmXHq" style="display: none;" data-rank="1">少儿英语应该如何学</a> <a href="http://www.baidu.com/baidu.php" target="_blank" class="c-gap-left-large FCmXHq" style="display: none;" data-rank="1">英孚 少儿英语 价格</a> <a href="http://www.baidu.com/baidu.php" target="_blank" class="c-gap-left-large FCmXHq" style="display: none;" data-rank="1">幼儿英语和少儿英语的区别</a> <a href="http://www.baidu.com/baidu.php" target="_blank" class="c-gap-left-large FCmXHq" style="display: none;border:none;" data-rank="1">少儿英语主要学什么</a> </div> <!-- pc jieou new -->
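Looking at the source above, every promotion sits in its own div that carries both an id attribute and a cmatchid attribute, while ordinary search results do not carry both. The XPath expression can therefore be written as: //div[@id and @cmatchid]. Note that this expression ignores the attribute values; it only requires that both attributes are present.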
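The attribute-existence test can be tried out with a short sketch. It assumes Python with the third-party lxml library installed; the markup is a trimmed stand-in for the source above, and the organic result with id="1" is a hypothetical example added for contrast.

```python
from lxml import html

# Trimmed stand-in for the result page. The two ad blocks mirror the source
# above; the organic result (id="1") is hypothetical and has no cmatchid.
PAGE = """
<div id="content_left">
  <div id="3001" cmatchid="225" class="pJsPIR Naeqyh">
    <h3 class="t MdiUjv PdUOce"><a href="http://www.baidu.com/baidu.php">Ad one</a></h3>
  </div>
  <div id="3002" cmatchid="225" class="pJsPIR Naeqyh">
    <h3 class="t MdiUjv PdUOce"><a href="http://www.baidu.com/baidu.php">Ad two</a></h3>
  </div>
  <div id="1" class="result c-container">
    <h3 class="t"><a href="http://example.com/">Organic result</a></h3>
  </div>
</div>
"""

tree = html.fromstring(PAGE)

# Both attributes must be present; their values are never compared.
ad_blocks = tree.xpath("//div[@id and @cmatchid]")
ad_ids = [div.get("id") for div in ad_blocks]
print(ad_ids)
```

Only the two ad blocks are selected. The outer container and the organic result each have an id but no cmatchid, so they are skipped.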
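If you only need the promotion titles, a fuzzy-match XPath expression is enough: //h3[contains(@class,"t ")]. Keep in mind that contains() is a plain substring test, so it matches any h3 whose class value contains "t " anywhere, not only those whose class list starts with t.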
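The difference between exact and fuzzy matching on the class attribute can be sketched as follows, again assuming Python with lxml. The second heading and its class value are hypothetical, chosen to show where a substring test over-matches. The third expression is a common XPath 1.0 idiom for matching a whole class token; it is not part of the original article and is shown only as a stricter alternative.

```python
from lxml import html

# The first h3 copies the class list of an ad title from the source above.
# The second h3 is hypothetical: "hint c-gap" also contains the substring "t ".
PAGE = """
<div>
  <h3 class="t MdiUjv PdUOce"><a href="#">Ad title</a></h3>
  <h3 class="hint c-gap"><a href="#">Other heading</a></h3>
</div>
"""

tree = html.fromstring(PAGE)

# Exact match compares the whole attribute string, so "t" alone finds nothing.
exact = tree.xpath('//h3[@class="t"]')

# Fuzzy match from the article: the class value only has to contain "t ".
fuzzy = tree.xpath('//h3[contains(@class, "t ")]')

# Stricter variant (an addition, not from the article): match "t" as a
# complete, space-delimited class token.
token = tree.xpath(
    '//h3[contains(concat(" ", normalize-space(@class), " "), " t ")]'
)

fuzzy_titles = [h.text_content() for h in fuzzy]
token_titles = [h.text_content() for h in token]
print(len(exact), fuzzy_titles, token_titles)
```

On a real result page the fuzzy expression is usually scoped to the promotion region first, which removes most of the accidental matches.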
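If you only need the abstracts, use an exact-match XPath expression: //div[@class=""]/a/text(). The predicate matches only div elements whose class attribute is exactly the empty string. Because text() returns only the link's own text nodes, the keywords that Baidu wraps in font tags are left out; take the string value of the a element if the complete abstract is needed.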
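A small sketch of the exact match on the empty class attribute, assuming Python with lxml. The markup is a shortened copy of the first abstract in the source above. It also compares text() with string(), since the highlighted keyword sits inside a nested font element.

```python
from lxml import html

# Shortened copy of the first ad's abstract block from the source above.
PAGE = (
    '<div class="c-abstract gzGZwQ uBmWPn ">'
    '<div class="">'
    '<a class="xVhiEH" href="http://www.baidu.com/baidu.php">'
    'VIPKID在线青<font color="#CC0000">少儿英语</font>,限时免费试听</a>'
    '</div>'
    '<div class="XOQmu_"><a class="cLrkye" href="#">北美师资</a></div>'
    '</div>'
)

tree = html.fromstring(PAGE)

# Expression from the article: only the direct text nodes of the link,
# so the text inside <font> is not included.
pieces = [str(t) for t in tree.xpath('//div[@class=""]/a/text()')]

# string() concatenates all descendant text of the first matching link.
full = tree.xpath('string(//div[@class=""]/a)')

print(pieces)
print(full)
```

The div with class XOQmu_ is not selected, because the comparison is against the entire attribute value rather than a substring.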
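To get both the title and the abstract, still select the region first and then extract the title and abstract from inside it. This keeps each title paired with its own abstract, so the data cannot slip out of alignment when a promotion is missing one of the two fields. If you have other questions, leave a comment.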
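The region-first approach can be sketched like this, assuming Python with lxml. The markup and the helper name extract_ads are hypothetical; the second ad deliberately has no abstract to show why page-wide lists would drift out of alignment.

```python
from lxml import html

# Hypothetical, simplified page: three ad regions, the second has no abstract.
PAGE = """
<div id="content_left">
  <div id="3001" cmatchid="225">
    <h3 class="t MdiUjv PdUOce"><a href="#">Ad A title</a></h3>
    <div class="c-abstract"><div class=""><a href="#">Ad A abstract</a></div></div>
  </div>
  <div id="3002" cmatchid="225">
    <h3 class="t MdiUjv PdUOce"><a href="#">Ad B title</a></h3>
  </div>
  <div id="3003" cmatchid="225">
    <h3 class="t MdiUjv PdUOce"><a href="#">Ad C title</a></h3>
    <div class="c-abstract"><div class=""><a href="#">Ad C abstract</a></div></div>
  </div>
</div>
"""


def extract_ads(tree):
    """Return one dict per ad region, with fields read from inside that region."""
    ads = []
    for region in tree.xpath("//div[@id and @cmatchid]"):
        # The leading "." keeps each query inside the current region.
        # string() yields "" when the region has no matching node.
        title = region.xpath('string(.//h3[contains(@class, "t ")])').strip()
        abstract = region.xpath('string(.//div[@class=""]/a)').strip()
        ads.append({"id": region.get("id"), "title": title, "abstract": abstract})
    return ads


tree = html.fromstring(PAGE)
ads = extract_ads(tree)

# Page-wide queries return lists of different lengths (3 titles, 2 abstracts),
# so pairing them by position would attach Ad C's abstract to Ad B.
all_titles = tree.xpath('//h3[contains(@class, "t ")]')
all_abstracts = tree.xpath('//div[@class=""]/a')

for ad in ads:
    print(ad)
```

Each region yields exactly one record, and a missing field shows up as an empty string instead of shifting the later rows.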
For anything else, you can search this site for related topics: 火车脚本网