RE: [Htmlparser-user] Link Extraction from Tables w/ ver. 1.3
From: <dha...@po...> - 2003-05-27 07:10:40
Hi Randall,

I faced very similar problems when I migrated to 1.3. Basically, I believe the structure now returned is more DOM-like than SAX-like, as it was earlier. Previously, all nodes were available at the base level; now they are nested inside other nodes. Hence your <A> is buried in the <TABLE> tag and has to be extracted as a child of the <TABLE> tag.

Code for extracting link tags is given in the documentation of Node. Reproducing it here for you:

    NodeList collectionList = new NodeList();
    Node node;
    for (NodeIterator e = parser.elements(); e.hasMoreNodes();) {
        node = e.nextNode();
        node.collectInto(collectionList, LinkTag.class);
    }

This will get link nodes at all levels.

Dhaval

> -----Original Message-----
> From: htm...@li...
> [mailto:htm...@li...] On Behalf Of rs...@so...
> Sent: Tuesday, May 27, 2003 11:37 AM
> To: htm...@li...
> Subject: [Htmlparser-user] Link Extraction from Tables w/ ver. 1.3
>
> Hi,
>
> I have a properly functioning link extraction program based on
> HTMLParser 1.2. I'm trying to convert it to HTMLParser 1.3 (the newly
> released final version). I've fixed up all the changed data types, but
> I'm finding that links (<A> tags) contained within tables are being
> skipped.
>
> If in my node processing loop I replace the calls to my internal node
> processing function (which is called only if the node is an instance of
> LinkTag) with System.out.println(node.toString()); (for all returned
> node types), I see tables coming through as "blobs," if you will. That
> is, I see the entire <TABLE> ... </TABLE> element in the output as a
> single node with none of its internal structure parsed.
>
> I've tried adding a TableScanner to the parser (using
> parser.addScanner(new TableScanner(parser, "-tb"));), a TableRowScanner
> (using parser.addScanner(new TableRowScanner("-tr", parser));) and a
> TableColumnScanner (using parser.addScanner(new
> TableColumnScanner("-tc"));), but to no avail.
>
> What am I missing? How do I process _all_ of the link tags within the
> input document?
>
> Thanks.
>
> Randall Schulz
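
To illustrate the point in the reply about links being nested inside table nodes, here is a small self-contained sketch. It is only a toy model: the Node class, its collectInto method, and the helper names below are hypothetical stand-ins written for this example, not HTMLParser's own classes or implementation. It assumes a document with one link at the base level and two links inside a table, and compares a loop that inspects only base-level nodes with a recursive collection.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these classes are hypothetical and are NOT part of HTMLParser.
public class NestedLinkDemo {

    /** A minimal tree node: a tag name plus child nodes. */
    static class Node {
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String name, Node... kids) {
            this.name = name;
            for (Node kid : kids) {
                children.add(kid);
            }
        }

        /** Adds this node to 'out' if its name matches, then recurses into every child. */
        void collectInto(List<Node> out, String wanted) {
            if (name.equals(wanted)) {
                out.add(this);
            }
            for (Node child : children) {
                child.collectInto(out, wanted);
            }
        }
    }

    /** Inspects only the base-level nodes, so links nested in a table are missed. */
    static List<Node> topLevelOnly(List<Node> roots, String wanted) {
        List<Node> found = new ArrayList<>();
        for (Node root : roots) {
            if (root.name.equals(wanted)) {
                found.add(root);
            }
        }
        return found;
    }

    /** Asks each base-level node to collect matches from its whole subtree. */
    static List<Node> allLevels(List<Node> roots, String wanted) {
        List<Node> found = new ArrayList<>();
        for (Node root : roots) {
            root.collectInto(found, wanted);
        }
        return found;
    }

    /** Assumed sample: one link at the base level, two links inside a table. */
    static List<Node> sampleDocument() {
        List<Node> roots = new ArrayList<>();
        roots.add(new Node("A"));
        roots.add(new Node("TABLE",
                new Node("TR",
                        new Node("TD", new Node("A")),
                        new Node("TD", new Node("A")))));
        return roots;
    }

    public static void main(String[] args) {
        List<Node> doc = sampleDocument();
        System.out.println("top-level only: " + topLevelOnly(doc, "A").size());
        System.out.println("all levels: " + allLevels(doc, "A").size());
    }
}
```

In this model the base-level loop finds only the one link outside the table, while the recursive collection finds all three, which mirrors the symptom described in the original message.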