Wednesday, 7 August 2013

XPath 1.0 excluding a branch of nodes

I'm trying to parse HTML scraped from a forum, and I'm trying to exclude
quoted text from my matches.
Sample HTML:
<li id="post-#####" class="message " data-author="#####">
<div class="messageUserInfo" itemscope="itemscope"
itemtype="http://data-vocabulary.org/Person">
-snip-
</div>
<div class="messageInfo primaryContent">
<div class="messageContent">
<article>
<blockquote class="messageText ugc baseHtml">
<div class="bbCodeBlock bbCodeQuote" data-author="#####">
<aside>
<div class="attribution type">##### said:
<a href="goto/post?id=#####"
class="AttributionLink">&uarr;</a>
</div>
<blockquote>#####</blockquote>
</aside>
</div>M<b>ARK</b>ER
</blockquote>
</article>
</div>
<div class="messageMeta">
<div class="privateControls">
<span class="item muted">
<a href="#####" class="username author">#####</a>,
<a href="#####" title="Permalink"
class="datePermalink"><span class="DateTime"
title="#####">Jun 4, 2013</span></a>
</span>
</div>
<div class="publicControls">
<a href="#####" title="Permalink" class="item muted
postNumber hashPermalink OverlayTrigger"
data-href="#####">#2491</a>
</div>
</div>
<div id="#####"></div>
</div>
</li>
I'm trying to pick out post bodies that contain 'MARKER'. The MARKER may
be broken up by formatting tags such as <b></b> (for example,
M<b>ARK</b>ER), so I need to extract the text recursively.
However, I do not want the post if the 'MARKER' only appears between the
<div class="bbCodeBlock bbCodeQuote"></div> tags, so I need to exclude
that branch.
What I currently have is this (using lxml on Python 3.3):
if root.xpath("./div[not(@class='bbCodeBlock bbCodeQuote')]//*[contains(.,'MARKER')]"):
This does not ignore the text inside the quote blocks, but it does
recurse successfully and returns true for a MARKER broken up by
formatting tags.
This is somewhat similar to this: Exclude node from XPATH results, but I
need to be able to recurse and extract all the text.
if root.xpath("./div[not(@class='bbCodeBlock bbCodeQuote')]/text()[contains(.,'MARKER')]"):
if root.xpath("./div[not(@class='bbCodeBlock bbCodeQuote')]//text()[contains(.,'MARKER')]"):
Neither of these works: the first doesn't pick up anything, and the second
doesn't recurse.
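One possible direction, offered as a sketch rather than a confirmed answer: XPath 1.0 has no string-join function, so the expression alone cannot glue together only the text nodes outside a quote. A workaround is to let XPath select every text node that has no quote-block ancestor and do the joining and the substring test in Python. The helper names (`unquoted_text`, `has_marker`), the cut-down sample strings, and the assumption that `body` is the `messageText` blockquote element are all mine, not part of the original code.

```python
from lxml import html

# XPath 1.0 test: is the context node inside a quote block?
# The concat/normalize-space idiom matches a single class token, so the
# order of the class names in the attribute does not matter.
IN_QUOTE = ("ancestor::div[contains(concat(' ', normalize-space(@class), ' '),"
            " ' bbCodeQuote ')]")


def unquoted_text(body):
    """Join every text node under `body` that is not inside a quote block."""
    return "".join(body.xpath(".//text()[not(" + IN_QUOTE + ")]"))


def has_marker(body, marker="MARKER"):
    """True if `marker` occurs in the post text outside quote blocks."""
    return marker in unquoted_text(body)


# Cut-down copies of the sample post body; the placeholder values are made up.
OWN_TEXT = (
    '<blockquote class="messageText ugc baseHtml">'
    '<div class="bbCodeBlock bbCodeQuote" data-author="someone">'
    '<aside><blockquote>quoted words</blockquote></aside>'
    '</div>M<b>ARK</b>ER</blockquote>'
)
# Same structure, but the marker sits only inside the quote.
QUOTED_ONLY = OWN_TEXT.replace("quoted words", "MARKER").replace(
    "M<b>ARK</b>ER", "a reply")

print(has_marker(html.fromstring(OWN_TEXT)))     # True
print(has_marker(html.fromstring(QUOTED_ONLY)))  # False
```

Joining in Python is what handles M<b>ARK</b>ER: each of "M", "ARK" and "ER" is a separate text node, so a per-node contains() test cannot see the whole word. Matching on the single `bbCodeQuote` token is a design choice of this sketch; the attempts above compare the full class string instead.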
