RE: [Htmlparser-user] Link Extraction from Tables w/ ver. 1.3

Hi Randall,

I faced very similar problems when I migrated to 1.3. As far as I can
tell, the structure the parser now returns is more DOM-like (a tree of
nodes) than SAX-like (a flat stream of nodes), as it was earlier.
Earlier, all nodes were available at the top level. Now they are nested
inside other nodes. Hence your <A> is buried in the <TABLE> tag and has
to be extracted as a child of the <TABLE> tag.

Code for extracting link tags is given in the documentation of Node.
Reproducing it here for you:

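// Accumulates every LinkTag found, at any depth
NodeList collectionList = new NodeList();
Node node;
for (NodeIterator e = parser.elements(); e.hasMoreNodes();) {
    node = e.nextNode();
    // Adds the node itself and any nested children that are LinkTags
    node.collectInto(collectionList, LinkTag.class);
}

This will collect link nodes at all levels, including those nested
inside tables.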
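To see why a loop over only the top-level nodes misses links inside
tables, here is a small self-contained sketch. It does not use
HTMLParser at all: SimpleNode, topLevelLinks, allLinks and
sampleDocument are hypothetical stand-ins I made up for illustration,
and the recursive collectInto below only mimics the idea of the real
method, not its implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NestedLinkDemo {
    // Hypothetical stand-in for a parsed node; NOT the HTMLParser Node class.
    static class SimpleNode {
        final String name;
        final List<SimpleNode> children = new ArrayList<>();

        SimpleNode(String name, SimpleNode... kids) {
            this.name = name;
            this.children.addAll(Arrays.asList(kids));
        }
    }

    // Flat scan: looks only at the nodes handed back at the top level.
    static List<SimpleNode> topLevelLinks(List<SimpleNode> roots) {
        List<SimpleNode> found = new ArrayList<>();
        for (SimpleNode node : roots) {
            if (node.name.equals("A")) {
                found.add(node);
            }
        }
        return found;
    }

    // Recursive scan: checks the node itself, then descends into its children.
    static void collectInto(SimpleNode node, String wanted, List<SimpleNode> out) {
        if (node.name.equals(wanted)) {
            out.add(node);
        }
        for (SimpleNode child : node.children) {
            collectInto(child, wanted, out);
        }
    }

    static List<SimpleNode> allLinks(List<SimpleNode> roots) {
        List<SimpleNode> found = new ArrayList<>();
        for (SimpleNode node : roots) {
            collectInto(node, "A", found);
        }
        return found;
    }

    // One top-level link, plus a table whose single cell holds another link.
    static List<SimpleNode> sampleDocument() {
        SimpleNode table = new SimpleNode("TABLE",
                new SimpleNode("TR",
                        new SimpleNode("TD", new SimpleNode("A"))));
        return Arrays.asList(new SimpleNode("A"), table);
    }

    public static void main(String[] args) {
        List<SimpleNode> doc = sampleDocument();
        System.out.println("top-level scan: " + topLevelLinks(doc).size()); // prints 1
        System.out.println("recursive scan: " + allLinks(doc).size());      // prints 2
    }
}
```

The flat scan only sees the <A> that sits next to the <TABLE>, while
the recursive scan also reaches the one inside the table cell, which
matches the "blob" behaviour described in the original question.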

Dhaval

> -----Original Message-----
> From: htm...@li...
> [mailto:htm...@li...] On Behalf Of rs...@so...
> Sent: Tuesday, May 27, 2003 11:37 AM
> To: htm...@li...
> Subject: [Htmlparser-user] Link Extraction from Tables w/ ver. 1.3
>
> Hi,
>
> I have a properly functioning link extraction program based on
> HTMLParser 1.2. I'm trying to convert it to HTMLParser 1.3 (the newly
> released final version). I've fixed up all the changed data types, but
> I'm finding that links (<A> tags) contained within tables are being
> skipped.
>
> If in my node processing loop I replace the calls to my internal node
> processing function (which is called only if the node is an instance
> of LinkTag) with System.out.println(node.toString()); (for all
> returned node types), I see tables coming through as "blobs," if you
> will. That is, I see the entire <TABLE> ... </TABLE> element in the
> output as a single node with none of its internal structure parsed.
>
> I've tried adding a TableScanner to the parser (using
> parser.addScanner(new TableScanner(parser, "-tb"));), a
> TableRowScanner (using
> parser.addScanner(new TableRowScanner("-tr", parser));) and a
> TableColumnScanner (using
> parser.addScanner(new TableColumnScanner("-tc"));), but to no avail.
>
> What am I missing? How do I process _all_ of the link tags within the
> input document?
>
> Thanks.
>
> Randall Schulz