How to Use XPath for Exact and Fuzzy Matching of Baidu Promoted Results

- Author: zhanhy. Source: original.

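XPath is most often used in web scrapers, where it locates the data to be extracted quickly and precisely. Sometimes a node has to be matched exactly, and sometimes only a fuzzy (partial) match will do. This article works through a concrete example: using XPath to match Baidu promoted (ad) results both exactly and fuzzily.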

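The Baidu search results and the search term are shown in the figure below, and part of the result page source is listed further down. The goal is to extract the title and the summary of each promoted result. To do that, first use XPath to select the region of markup that contains the data, then extract the title and summary from within that region.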

[Figure: 百度推广.jpg (Baidu search results and search term)]

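Part of the source is shown below (link URLs have been shortened for readability; adjust the expressions yourself if you need to extract the link addresses):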

  <div id="3001" cmatchid="225" data-ecimtimesign="39" class="pJsPIR Naeqyh vLQtYD EUtKfl EC_ppim_new_gap_bottom" data-click="{&quot;fm&quot;:&quot;pp&quot;, &quot;p1&quot;:3001, &quot;p5&quot;:3001, &quot;rsv_srcid&quot;:&quot;49509&quot;}">
   <div class="kcJoV_">
    <h3 class="t MdiUjv PdUOce"><a href="http://www.baidu.com/baidu.php" class="TtjIpz" data-is-main-url="true" data-landurl="https://www.vipkid.com.cn/web/sem?channel_id=212&amp;channel_keyword=I5M427K65195O5_v1_10537462_66757469_2982263217_142001710787_37115329078_94_63_cl1_1&amp;utm_source=sem_baidu&amp;utm_medium=oldpc&amp;utm_campaign=tongtou_tongyong&amp;utm_content=zaixian&amp;utm_term=N073853" target="_blank">学<font color="#CC0000">儿童英语哪里好</font>_【VIPKID】在线青<font color="#CC0000">少儿英语</font>_在家学英语</a></h3>
   </div>
   <div class="c-abstract gzGZwQ uBmWPn ">
    <div class="">
     <a hidefocus="hidefocus" target="_blank" class="xVhiEH" href="http://www.baidu.com/baidu.php" data-landurl="https://www.vipkid.com.cn/web/sem?channel_id=212&amp;channel_keyword=I5M427K65195O5_v1_10537462_66757469_2982263217_142001710787_37115329078_94_63_cl1_1&amp;utm_source=sem_baidu&amp;utm_medium=oldpc&amp;utm_campaign=tongtou_tongyong&amp;utm_content=zaixian&amp;utm_term=N073853">VIPKID在线青<font color="#CC0000">少儿英语</font>,限时免费试听,真人外教在线实时互动,帮助孩子全方位提高<font color="#CC0000">英语</font>水平,现在注册即可在线免费领取价值288元试听礼包!</a>
    </div>
    <div class="XOQmu_">
     <a class="cLrkye bWrcOE" target="_blank" href="http://www.baidu.com/baidu.php">刘涛代言</a>
     <a class="cLrkye " target="_blank" href="http://www.baidu.com/baidu.php">北美师资</a>
     <a class="cLrkye " target="_blank" href="http://www.baidu.com/baidu.php">1对1授课</a>
    </div>
    <div class="c-row wnAXjs c-gap-top-small ">
     <div class="i_vPxp c-span6 ">
      <a href="http://www.baidu.com/baidu.php" target="_blank">
       <div class="ITjXAJ c-img c-img6 ">
        <span class="PkVuOH"></span>
        <img src="https://fc5tn.baidu.com/it/u=870837688,3029268417&amp;fm=202" class="nLQkUy" />
        <div class="UqJvhD"> 
        </div>
        <div class="RxvMjk">
         <span class="PkVuOH"></span>
         <div class="sSiIgq">
          <p class="c-gap-bottom-small">刘涛代言</p>
          <span class="c-btn c-btn-primary c-btn-mini" href="http://www.baidu.com/baidu.php">去看看</span>
         </div>
        </div>
       </div>
       <div class="yixtZh" title="刘涛代言">
        刘涛代言
       </div></a>
     </div>
     <div class="i_vPxp c-span6 ">
      <a href="http://www.baidu.com/baidu.php" target="_blank">
       <div class="ITjXAJ c-img c-img6 ">
        <span class="PkVuOH"></span>
        <img src="https://fc2tn.baidu.com/it/u=1798040235,525758260&amp;fm=202" class="nLQkUy" />
        <div class="UqJvhD"> 
        </div>
        <div class="RxvMjk">
         <span class="PkVuOH"></span>
         <div class="sSiIgq">
          <p class="c-gap-bottom-small">北美外教</p>
          <span class="c-btn c-btn-primary c-btn-mini" href="http://www.baidu.com/baidu.php">去看看</span>
         </div>
        </div>
       </div>
       <div class="yixtZh" title="北美外教">
        北美外教
       </div></a>
     </div>
     <div class="i_vPxp c-span6 ">
      <a href="http://www.baidu.com/baidu.php" target="_blank">
       <div class="ITjXAJ c-img c-img6 ">
        <span class="PkVuOH"></span>
        <img src="https://fc6tn.baidu.com/it/u=472412231,3219462096&amp;fm=202" class="nLQkUy" />
        <div class="UqJvhD"> 
        </div>
        <div class="RxvMjk">
         <span class="PkVuOH"></span>
         <div class="sSiIgq">
          <p class="c-gap-bottom-small">在线授课</p>
          <span class="c-btn c-btn-primary c-btn-mini" href="http://www.baidu.com/baidu.php">去看看</span>
         </div>
        </div>
       </div>
       <div class="yixtZh" title="在线授课">
        在线授课
       </div></a>
     </div>
     <div class="i_vPxp c-span6  c-span-last">
      <a href="http://www.baidu.com/baidu.php" target="_blank">
       <div class="ITjXAJ c-img c-img6 ">
        <span class="PkVuOH"></span>
        <img src="https://fc5tn.baidu.com/it/u=1176971384,3599299217&amp;fm=202" class="nLQkUy" />
        <div class="UqJvhD"> 
        </div>
        <div class="RxvMjk">
         <span class="PkVuOH"></span>
         <div class="sSiIgq">
          <p class="c-gap-bottom-small">多样课程</p>
          <span class="c-btn c-btn-primary c-btn-mini" href="http://www.baidu.com/baidu.php">去看看</span>
         </div>
        </div>
       </div>
       <div class="yixtZh" title="多样课程">
        多样课程
       </div></a>
     </div>
    </div>
   </div>
   <div class="bTwItd uBmWPn">
    <a href="http://www.baidu.com/baidu.php" target="_blank" class="FZKtvO"><span class="FtGTVL">www.vipkid.com.cn</span>&nbsp;<span class="LHCArh">2020-08</span></a>
    <div id="tools_213_0" style="margin-left:5px;" class="c-tools">
     <a class="c-tip-icon"><i class="c-icon c-icon-triangle-down-g"></i></a>
    </div>
    <span class="icons PdUOce FYwnyN"><a class="_haGlQ c-icon ec-baobiao ec-baobiao-first" data-baobiao="{&quot;baobiao_text&quot;:&quot;\u8be5\u4f01\u4e1a\u5df2\u901a\u8fc7\u5b9e\u540d\u8ba4\u8bc1\uff0c\u67e5\u770b&nbsp;&lt;a href=\&quot;https:\/\/www.baidu.com\/s?wd=%E5%8C%97%E4%BA%AC%E5%A4%A7%E7%B1%B3%E7%A7%91%E6%8A%80%E6%9C%89%E9%99%90%E5%85%AC%E5%8F%B8@v&amp;vmp_ec=1593772510&amp;vmp_ectm=b8d240fb9c7877bf46bdd8YMykzd3fNdla92bdc3daN5DmedlTNEwfc3Xee160X39df887c6&amp;from=fc\&quot; target=\&quot;_blank\&quot;&gt;\u4f01\u4e1a\u6863\u6848&lt;\/a&gt;\u3002&lt;\/br&gt;\u60a8\u5df2\u767b\u5f55\u767e\u5ea6\u8d26\u53f7\uff0c&nbsp;&lt;a href=\&quot;http:\/\/baozhang.baidu.com\/guarantee\/?from=fcad\&quot; target=\&quot;_blank\&quot;&gt;\u767e\u5ea6\u7f51\u6c11\u6743\u76ca\u4fdd\u969c\u8ba1\u5212&lt;\/a&gt;&nbsp;\u4e3a\u60a8\u641c\u7d22\u62a4\u822a\u3002&quot;,&quot;baobiao_title&quot;:&quot;\u5317\u4eac\u5927\u7c73\u79d1\u6280\u6709\u9650\u516c\u53f8&quot;}"></a></span>
    <font class="LbmOSd lNOPyK zpoRhUppouter " size="-1">-&nbsp;<a href="http://www.baidu.com/baidu.php" target="_blank" class="m enCEkc">评价</a></font>
    <font class="LbmOSd ec_tuiguang_ppouter ec_tuiguang_container" size="-1"><a class="PdUOce m IAu_Hy m ec_tuiguang_ppouter ec_tuiguang_pplink " target="_blank" href="http://www.baidu.com/baidu.php" style="margin-left:5px;"><span data-tuiguang="{&quot;tuiguang_text&quot;:&quot;\u672c\u641c\u7d22\u7ed3\u679c\u4e3a&nbsp;&lt;a href=\&quot;https:\/\/isite.baidu.com\/site\/e.baidu.com\/4a98a7ec-2715-49b3-b97a-6f87b8617926?refer=919\&quot; target=\&quot;_blank\&quot;&gt;\u5546\u4e1a\u63a8\u5e7f&lt;\/a&gt;&nbsp;\u4fe1\u606f\uff0c\u8bf7\u6ce8\u610f\u53ef\u80fd\u7684\u98ce\u9669\u3002&lt;br\/&gt;&quot;,&quot;tuiguang_title&quot;:&quot;&quot;}" class="MnZBAi">广告</span></a></font>
   </div>
   <a href="http://www.baidu.com/baidu.php" target="_blank" class=" FCmXHq" style="display: none;" data-rank="0">幼儿蒙氏英语</a>
   <a href="http://www.baidu.com/baidu.php" target="_blank" class="c-gap-left-large FCmXHq" style="display: none;" data-rank="0">少儿英语</a>
   <a href="http://www.baidu.com/baidu.php" target="_blank" class="c-gap-left-large FCmXHq" style="display: none;" data-rank="0">儿童学英语哪里好</a>
   <a href="http://www.baidu.com/baidu.php" target="_blank" class="c-gap-left-large FCmXHq" style="display: none;" data-rank="0">幼儿英语教材</a>
   <a href="http://www.baidu.com/baidu.php" target="_blank" class="c-gap-left-large FCmXHq" style="display: none;" data-rank="0">儿童英语培训</a>
   <a href="http://www.baidu.com/baidu.php" target="_blank" class="c-gap-left-large FCmXHq" style="display: none;" data-rank="0">少儿英语应该如何学</a>
   <a href="http://www.baidu.com/baidu.php" target="_blank" class="c-gap-left-large FCmXHq" style="display: none;" data-rank="0">少儿英语主要学什么</a>
   <a href="http://www.baidu.com/baidu.php" target="_blank" class="c-gap-left-large FCmXHq" style="display: none;" data-rank="0">英孚少儿英语 价格</a>
   <a href="http://www.baidu.com/baidu.php" target="_blank" class="c-gap-left-large FCmXHq" style="display: none;border:none;" data-rank="0">上海英孚儿童英语价格表</a>
  </div>
  <!-- pc jieou new -->
  <div id="3002" cmatchid="225" data-ecimtimesign="39" data-general-xst="TjYknH0vnjD4PWcKmWYkPb7jwRc1PDmLn16vwW-jrHK7fWFjfbRvnDmdwDRkw0715HDYrHfYrHn4rj6drjbYPjcdnW63g1czPNtk0gTqYlQhv_MuVEMHCV5EEP83tUZV0gDqzIhvSIrBYSODknjE8_nKIHYzP1b1n1n4r07Y5HDdrHTLnHmknH0KUgDqn0cs0BYKmv6quhPxTAnKnHndrHTLnWTvP6" class="pJsPIR Naeqyh vLQtYD sDhKNG EUtKfl EC_ppim_new_gap_bottom" data-click="{&quot;fm&quot;:&quot;pp&quot;, &quot;p1&quot;:3002, &quot;p5&quot;:3002, &quot;rsv_srcid&quot;:&quot;49509&quot;}">
   <div class="kcJoV_">
    <h3 class="t MdiUjv PdUOce"><a href="http://www.baidu.com/baidu.php" class="TtjIpz" data-is-main-url="true" data-landurl="https://www.acadsoc.com.cn/sem/baidu/Perry/shaoer/children1-pc-kf.htm?utm_source=baidu&amp;utm_keyword=162093323369&amp;utm_terminal=PC&amp;utm_seat=cl2&amp;utm_page=1&amp;C_idea=c265-195-159" target="_blank">阿卡索_专注3-18岁在线青<font color="#CC0000">少儿英语</font>_低至13.8/节</a></h3>
   </div>
   <div class="c-abstract gzGZwQ uBmWPn ">
    <div class="cIjW_e">
     <span class="HjKhFx"><span class="qvlCax">获得网友:&nbsp;</span><a class="sTKFSc" href="http://koubei.baidu.com/s/d8b13979178ae080408f018e98cf50aa" target="_blank" data-click="{&quot;rsv_ct&quot;:1031,&quot;p2&quot;:1}"><span class="QAXGkf">89%好评</span></a></span>
     <span class="HjKhFx"><span class="QAXGkf">164条评价</span></span>
     <span class="HjKhFx GNJ_EY"><span class="QAXGkf">“性价比高&nbsp;|&nbsp;价格便宜&nbsp;|&nbsp;老师好”</span></span>
    </div>
    <div class="">
     <a hidefocus="hidefocus" target="_blank" class="xVhiEH" href="http://www.baidu.com/baidu.php" data-landurl="https://www.acadsoc.com.cn/sem/baidu/Perry/shaoer/children1-pc-kf.htm?utm_source=baidu&amp;utm_keyword=162093323369&amp;utm_terminal=PC&amp;utm_seat=cl2&amp;utm_page=1&amp;C_idea=c265-195-159">阿卡索<font color="#CC0000">少儿英语</font>,暖爸佟大为代言品牌,教材全面覆盖中小学新课标,外教1对1在线教学,严选全球10000+持TESOL证书外教,全英互动课堂,让孩子爱学敢说!</a>
    </div>
   </div>
   <div class="bTwItd uBmWPn">
    <a href="http://www.baidu.com/baidu.php" target="_blank" class="FZKtvO"><span class="FtGTVL">www.acadsoc.com.cn</span>&nbsp;<span class="LHCArh">2020-08</span></a>
    <div id="tools_213_1" style="margin-left:5px;" class="c-tools">
     <a class="c-tip-icon"><i class="c-icon c-icon-triangle-down-g"></i></a>
    </div>
    <span class="icons PdUOce FYwnyN"><a class="_haGlQ c-icon ec-baobiao ec-baobiao-first" data-baobiao="{&quot;baobiao_text&quot;:&quot;\u8be5\u4f01\u4e1a\u5df2\u901a\u8fc7\u5b9e\u540d\u8ba4\u8bc1\uff0c\u67e5\u770b&nbsp;&lt;a href=\&quot;https:\/\/www.baidu.com\/s?wd=%E6%B7%B1%E5%9C%B3%E5%B8%82%E9%98%BF%E5%8D%A1%E7%B4%A2%E8%B5%84%E8%AE%AF%E8%82%A1%E4%BB%BD%E6%9C%89%E9%99%90%E5%85%AC%E5%8F%B8@v&amp;vmp_ec=1578044188&amp;vmp_ectm=c263ee24fec303c978c79bakMId2zmXc2cfe71O85T243Mz25zl4kfXlfe4Nd4a1485247a2&amp;from=fc\&quot; target=\&quot;_blank\&quot;&gt;\u4f01\u4e1a\u6863\u6848&lt;\/a&gt;\u3002&lt;\/br&gt;\u60a8\u5df2\u767b\u5f55\u767e\u5ea6\u8d26\u53f7\uff0c&nbsp;&lt;a href=\&quot;http:\/\/baozhang.baidu.com\/guarantee\/?from=fcad\&quot; target=\&quot;_blank\&quot;&gt;\u767e\u5ea6\u7f51\u6c11\u6743\u76ca\u4fdd\u969c\u8ba1\u5212&lt;\/a&gt;&nbsp;\u4e3a\u60a8\u641c\u7d22\u62a4\u822a\u3002&quot;,&quot;baobiao_title&quot;:&quot;\u6df1\u5733\u5e02\u963f\u5361\u7d22\u8d44\u8baf\u80a1\u4efd\u6709\u9650\u516c\u53f8&quot;}"></a></span>
    <font class="LbmOSd lNOPyK zpoRhUppouter " size="-1">-&nbsp;<a href="http://www.baidu.com/baidu.php" target="_blank" class="m enCEkc">评价</a></font>
    <font class="LbmOSd ec_tuiguang_ppouter ec_tuiguang_container" size="-1"><a class="PdUOce m IAu_Hy m ec_tuiguang_ppouter ec_tuiguang_pplink " target="_blank" href="http://www.baidu.com/baidu.php" style="margin-left:5px;"><span data-tuiguang="{&quot;tuiguang_text&quot;:&quot;\u672c\u641c\u7d22\u7ed3\u679c\u4e3a&nbsp;&lt;a href=\&quot;https:\/\/isite.baidu.com\/site\/e.baidu.com\/4a98a7ec-2715-49b3-b97a-6f87b8617926?refer=919\&quot; target=\&quot;_blank\&quot;&gt;\u5546\u4e1a\u63a8\u5e7f&lt;\/a&gt;&nbsp;\u4fe1\u606f\uff0c\u8bf7\u6ce8\u610f\u53ef\u80fd\u7684\u98ce\u9669\u3002&lt;br\/&gt;&quot;,&quot;tuiguang_title&quot;:&quot;&quot;}" class="MnZBAi">广告</span></a></font>
   </div>
   <a href="http://www.baidu.com/baidu.php" target="_blank" class=" FCmXHq" style="display: none;" data-rank="1">幼儿蒙氏英语</a>
   <a href="http://www.baidu.com/baidu.php" target="_blank" class="c-gap-left-large FCmXHq" style="display: none;" data-rank="1">少儿英语</a>
   <a href="http://www.baidu.com/baidu.php" target="_blank" class="c-gap-left-large FCmXHq" style="display: none;" data-rank="1">儿童学英语哪里好</a>
   <a href="http://www.baidu.com/baidu.php" target="_blank" class="c-gap-left-large FCmXHq" style="display: none;" data-rank="1">幼儿园英语教材</a>
   <a href="http://www.baidu.com/baidu.php" target="_blank" class="c-gap-left-large FCmXHq" style="display: none;" data-rank="1">少儿英语应该如何学</a>
   <a href="http://www.baidu.com/baidu.php" target="_blank" class="c-gap-left-large FCmXHq" style="display: none;" data-rank="1">英孚 少儿英语 价格</a>
   <a href="http://www.baidu.com/baidu.php" target="_blank" class="c-gap-left-large FCmXHq" style="display: none;" data-rank="1">幼儿英语和少儿英语的区别</a>
   <a href="http://www.baidu.com/baidu.php" target="_blank" class="c-gap-left-large FCmXHq" style="display: none;border:none;" data-rank="1">少儿英语主要学什么</a>
  </div>
  <!-- pc jieou new -->

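Looking at the source above, every promoted result sits inside a div that carries both an id attribute and a cmatchid attribute, whereas ordinary search results do not carry both. The XPath expression can therefore be written as: //div[@id and @cmatchid]. Note that this expression ignores the attribute values; it only requires that both attributes be present.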
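As a minimal sketch of the region selection, the snippet below runs the expression with Python's lxml library (an assumption; the article does not say which tool evaluates the XPath). The page is a trimmed stand-in: the two ad divs reuse the id and cmatchid values from the listing above, while the wrapper div and the ordinary result with id="1" are hypothetical.

```python
from lxml import html

# Trimmed stand-in for the result page. The ad blocks reuse the id/cmatchid
# values from the listing; the wrapper and the id="1" result are hypothetical.
PAGE = """
<div id="content_left">
  <div id="3001" cmatchid="225" class="pJsPIR">ad one</div>
  <div id="3002" cmatchid="225" class="pJsPIR">ad two</div>
  <div id="1" class="result">ordinary result</div>
</div>
"""

tree = html.fromstring(PAGE)

# Only the presence of both attributes matters, not their values.
ad_blocks = tree.xpath('//div[@id and @cmatchid]')
ad_ids = [block.get("id") for block in ad_blocks]
print(ad_ids)  # ['3001', '3002']
```

The divs that have an id but no cmatchid are skipped, which is what separates promoted results from ordinary ones in this sample.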

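If only the promoted titles are needed, a fuzzy match works: //h3[contains(@class,"t ")]. contains() is a substring test, so this matches any h3 whose class value contains "t " (a t followed by a space), such as the "t MdiUjv PdUOce" class in the listing above.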
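The difference between the fuzzy predicate and an exact comparison can be sketched with lxml (an assumed tool, not named in the article). The first h3 is copied from the listing; the second h3 with class="t" is a hypothetical ordinary result, included only to show what the trailing space in "t " excludes.

```python
from lxml import html

# First h3 is taken from the listing; the second (class="t") is hypothetical.
PAGE = """
<div>
  <div id="3001" cmatchid="225">
    <h3 class="t MdiUjv PdUOce"><a href="http://www.baidu.com/baidu.php">学<font color="#CC0000">儿童英语哪里好</font>_【VIPKID】在线青<font color="#CC0000">少儿英语</font>_在家学英语</a></h3>
  </div>
  <div id="1">
    <h3 class="t"><a href="http://www.baidu.com/link">ordinary title</a></h3>
  </div>
</div>
"""

tree = html.fromstring(PAGE)

# Fuzzy: the class value only has to contain the substring "t ".
fuzzy = tree.xpath('//h3[contains(@class, "t ")]')
# Exact: the class value has to equal "t" character for character.
exact = tree.xpath('//h3[@class="t"]')

# text_content() joins the text of the h3 and all of its descendants.
fuzzy_titles = [h3.text_content().strip() for h3 in fuzzy]
exact_titles = [h3.text_content().strip() for h3 in exact]
print(fuzzy_titles)
print(exact_titles)
```

Because contains() is a plain substring test, it would also match an unrelated class such as "result box", so it is worth checking the expression against the real page before relying on it.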

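If only the summaries are needed, an exact match works: //div[@class=""]/a/text(). The predicate @class="" matches only div elements whose class attribute is present and empty. Note that text() returns just the direct text nodes of the a element, so the highlighted keywords wrapped in font tags are left out; take the string value of the a element instead if the complete summary is needed.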
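A small sketch of the exact match, again assuming lxml. The markup is a shortened copy of the first ad's summary area (the summary text is truncated here for brevity), and it contrasts text() with the string value of the a element.

```python
from lxml import html

# Shortened copy of the first ad's summary area from the listing.
PAGE = """
<div class="c-abstract gzGZwQ uBmWPn ">
  <div class="">
    <a class="xVhiEH" href="http://www.baidu.com/baidu.php">VIPKID在线青<font color="#CC0000">少儿英语</font>,限时免费试听</a>
  </div>
  <div class="XOQmu_">
    <a class="cLrkye " href="http://www.baidu.com/baidu.php">北美师资</a>
  </div>
</div>
"""

tree = html.fromstring(PAGE)

# Exact match on the empty class value; text() yields only the direct
# text nodes of <a>, so the text inside <font> is not included.
pieces = tree.xpath('//div[@class=""]/a/text()')

# string() returns the concatenated text of <a> and all its descendants.
full = tree.xpath('string(//div[@class=""]/a)')

print(pieces)
print(full)
```

The div with class "XOQmu_" is not matched, since an exact comparison accepts only the empty class value.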

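To get titles and summaries together, still select the regions first and then extract the title and summary inside each region; this keeps each title paired with its own summary instead of letting the two lists drift out of alignment. If you have other questions, leave a comment.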
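The region-first approach might look like the sketch below, assuming lxml. The ads with ids 3001 and 3002 are shortened from the listing; the ad with id 3009 is hypothetical and has no summary, to show why per-region extraction keeps rows aligned.

```python
from lxml import html

# Ads 3001 and 3002 are shortened from the listing; 3009 is hypothetical.
PAGE = """
<div id="content_left">
  <div id="3001" cmatchid="225">
    <h3 class="t MdiUjv PdUOce"><a href="http://www.baidu.com/baidu.php">学<font color="#CC0000">儿童英语哪里好</font>_【VIPKID】在线青<font color="#CC0000">少儿英语</font>_在家学英语</a></h3>
    <div class="c-abstract gzGZwQ uBmWPn ">
      <div class=""><a class="xVhiEH" href="http://www.baidu.com/baidu.php">VIPKID在线青<font color="#CC0000">少儿英语</font>,限时免费试听</a></div>
    </div>
  </div>
  <div id="3009" cmatchid="225">
    <h3 class="t MdiUjv PdUOce"><a href="http://www.baidu.com/baidu.php">Hypothetical ad with no summary</a></h3>
  </div>
  <div id="3002" cmatchid="225">
    <h3 class="t MdiUjv PdUOce"><a href="http://www.baidu.com/baidu.php">阿卡索_专注3-18岁在线青<font color="#CC0000">少儿英语</font>_低至13.8/节</a></h3>
    <div class="c-abstract gzGZwQ uBmWPn ">
      <div class=""><a class="xVhiEH" href="http://www.baidu.com/baidu.php">阿卡索<font color="#CC0000">少儿英语</font>,暖爸佟大为代言品牌</a></div>
    </div>
  </div>
</div>
"""

tree = html.fromstring(PAGE)

rows = []
for block in tree.xpath('//div[@id and @cmatchid]'):
    # The leading "." makes each expression relative to the current region.
    # string() yields "" when nothing matches, so a missing summary stays
    # in its own row instead of shifting the rows that follow.
    title = block.xpath('string(.//h3[contains(@class, "t ")])').strip()
    summary = block.xpath('string(.//div[@class=""]/a)').strip()
    rows.append((block.get("id"), title, summary))

for row in rows:
    print(row)

# Page-wide queries would give lists of different lengths here.
title_count = len(tree.xpath('//h3[contains(@class, "t ")]'))
summary_count = len(tree.xpath('//div[@class=""]/a'))
print(title_count, summary_count)
```

In this sample the page-wide queries find three titles but only two summaries, so pairing them by position would attach the 3002 summary to the 3009 title; the per-region loop avoids that.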
