htmlparser-user Mailing List for HTML Parser
Brought to you by: derrickoswald
From: Somik R. <so...@ya...> - 2003-05-30 12:08:33
Dear Dhaval,

Thank you for being a part of this project, and best wishes for your higher studies!

Cheers,
Somik
From: Derrick O. <Der...@ro...> - 2003-05-30 10:59:27
Dhaval,

Your valuable input and extensive experience will be sorely missed. Best of luck in your new endeavors.

Derrick
From: <dha...@po...> - 2003-05-30 07:13:00
Everyone,

I have been associated with this project for a shade less than one year. During this period I have made some small contributions to this project and identified a few bugs. Most of all, what I have enjoyed is the tremendous learning that I have received, both from a technical viewpoint and a design perspective. It has altered my methodology of software development. For one, it has instilled JUnit into my development methodology. It has also shown me that redesign is not such a bad thing. On the whole, it has been quite a great experience working with some amazing people like Somik, Derrick and many more among you all. I thank you all for the support that I have received, the quick bug fixes, the quick-fix solutions and the exhilarating discussions that I have been involved in within this group.

I am moving on to higher studies in the field of management, and I do not think I can keep so many things on my plate, so I am very sadly letting go of a few. One of them is the HTMLParser.

I wish it all the best for the future and hope that the tool continues for a long, long time to come.

Regards to all,
Dhaval
From: Somik R. <so...@ya...> - 2003-05-28 12:38:31
One simplification:

    parser.addScanner (new TableScanner (parser, "-t"));
    parser.addScanner (new BodyScanner ("-b"));
    parser.setURL (args[0]); /* http://whatever */
    iterator = parser.elements ();
    while (iterator.hasMoreNodes ())
    {
        node = iterator.nextNode ();
        if (node instanceof BodyTag)
        {
            contents = ((BodyTag)node).getChildrenAsNodeArray ();
            for (int h = 0; h < contents.length; h++)
                if (contents[h] instanceof TableTag)
                {
                    TableTag tableTag = (TableTag)contents[h];
                    ...

Can be:

    parser.addScanner (new TableScanner (parser, "-t"));
    parser.addScanner (new BodyScanner ("-b"));
    parser.setURL (args[0]); /* http://whatever */
    Node [] tables = parser.extractAllNodesThatAre (TableTag.class);
    for (int i = 0; i < tables.length; i++)
    {
        TableTag tableTag = (TableTag)tables[i];
        ...

Regards,
Somik
From: Derrick O. <Der...@ro...> - 2003-05-28 11:01:31
If you need to dig into tables and get a specific cell, the code you need is something like this:

    import java.util.Enumeration;
    import java.util.Hashtable;

    import org.htmlparser.Node;
    import org.htmlparser.Parser;
    import org.htmlparser.StringNode;
    import org.htmlparser.scanners.BodyScanner;
    import org.htmlparser.scanners.TableScanner;
    import org.htmlparser.tags.BodyTag;
    import org.htmlparser.tags.TableColumn;
    import org.htmlparser.tags.TableRow;
    import org.htmlparser.tags.TableTag;
    import org.htmlparser.util.NodeIterator;
    import org.htmlparser.util.ParserException;

    public class Example
    {
        /**
         * Example
         * @param args Pass arg[0] as the URL to process.
         */
        public static void main (String[] args)
        {
            Parser parser;
            NodeIterator iterator;
            Node node;
            Hashtable attributes;
            Enumeration enumeration;
            Node[] contents;

            parser = new Parser ();
            try
            {
                parser.addScanner (new TableScanner (parser, "-t"));
                parser.addScanner (new BodyScanner ("-b"));
                parser.setURL (args[0]); /* http://whatever */
                iterator = parser.elements ();
                while (iterator.hasMoreNodes ())
                {
                    node = iterator.nextNode ();
                    if (node instanceof BodyTag)
                    {
                        contents = ((BodyTag)node).getChildrenAsNodeArray ();
                        for (int h = 0; h < contents.length; h++)
                            if (contents[h] instanceof TableTag)
                            {
                                TableTag tableTag = (TableTag)contents[h];
                                for (int i = 0; i < tableTag.getRowCount (); i++)
                                {
                                    TableRow row = tableTag.getRow (i);
                                    TableColumn [] columns = row.getColumns ();
                                    for (int j = 0; j < columns.length; j++)
                                    {
                                        // do something with columns[j]
                                        // see the HTMLParser API http://htmlparser.sourceforge.net/javadoc_1_3/
                                    }
                                }
                            }
                    }
                }
            }
            catch (ParserException pe)
            {
                pe.printStackTrace ();
            }
        }
    }
From: Elizabeth W. <spa...@ya...> - 2003-05-28 03:22:56
Hi there,

I am unable to obtain table rows and table columns. When iterating through the document, as my code comes across a node (<TABLE>) which is an instanceof Tag, after applying the toHtml() method to it, it prints out the entire table (refer to below) rather than just the <TABLE> tag as expected. On the next iteration of the document, as it comes across a node (I'm expecting it to be <TR>) which is an instanceof Tag, the toHtml() method returns </p>. The same problem happens when I use instanceof TableTag. I have added the TableRowScanner and the TableColumnScanner. I've attempted to do instanceof TableRow and instanceof TableColumn, however it seems that these are simply ignored. Could I please get some help as to how to obtain the TableRow and TableColumn tags?

Sample part of the HTML document:

    <TABLE id="links">
    <TR id="r1"><TH id="h1">Hotmail</TH><TD id="col1">
    <a id="href1" href="http://www.hotmail.com">
    hotmail </a></TD></TR>
    <TR id="r2"><TH id="h2"> Yahoo</TH><TD id="col2">
    <a id="href2" href="http://www.hotmail.com"> yahoo</a></TD></TR>
    <TR id="r3"><TH id="h3">Emailcash</TH><TD id="col3">
    <a id="href3" href="http://www.emailcash.com">
    emailcash </a></TD></TR>
    <TR id="r4"><TH id="h4">Ninemsn</TH><TD id="col4">
    <a id="href4" href="http://www.ninemsn.com"> ninemsn</a></TD></TR>
    </TABLE>
    </p>

Thanks for listening. If anyone could be of help it would be much appreciated.

lisbeth
From: Derrick O. <Der...@ro...> - 2003-05-27 22:14:59
Randall,

Attributes are stored in the hashtable with uppercase keys, and getAttribute() converts its argument to uppercase also. The two examples are the same.

As Mark suggests, writing a visitor to handle links is probably the best approach, but you may want to override visitLinkTag() instead.

File a bug report regarding "file://localhostC:/"; we're pretty unix-centric and just missed that one.

Derrick
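A minimal visitor along the lines Derrick and Marc suggest might look like the sketch below. It assumes the 1.3 visitor API (NodeVisitor, visitLinkTag() and Parser.visitAllNodesWith()); the class name HrefDumper is invented for illustration, and registerScanners() is called so that anchors actually come back as LinkTag nodes:

    import org.htmlparser.Parser;
    import org.htmlparser.tags.LinkTag;
    import org.htmlparser.util.ParserException;
    import org.htmlparser.visitors.NodeVisitor;

    public class HrefDumper extends NodeVisitor
    {
        // called once for each link tag encountered during the parse
        public void visitLinkTag (LinkTag linkTag)
        {
            // getAttribute() uppercases its argument, so "href" and "HREF" behave identically
            String href = linkTag.getAttribute ("href");
            if (null != href)
                System.out.println (href);
        }

        public static void main (String[] args) throws ParserException
        {
            Parser parser = new Parser (args[0]); // URL or file name
            parser.registerScanners (); // ensure <A> tags are produced as LinkTag nodes
            parser.visitAllNodesWith (new HrefDumper ());
        }
    }

Note that this prints the attribute as stored on the tag; as discussed elsewhere in the thread, getLink() additionally resolves relative URLs.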
From: Andy N. <and...@ut...> - 2003-05-27 21:12:40
Hi HTMLParser users,

How do I deal with frames? I want to know the location a frame is pointing to, and I can't find any examples in the documentation.

Thanks,
Andy
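One way to approach this with the 1.3 API, following the extractAllNodesThatAre() pattern Somik uses earlier in the thread, is sketched below. The org.htmlparser.tags.FrameTag class and its getFrameLocation() accessor are assumptions here, so check them against the javadoc for your version:

    import org.htmlparser.Node;
    import org.htmlparser.Parser;
    import org.htmlparser.tags.FrameTag;
    import org.htmlparser.util.ParserException;

    public class FrameLocations
    {
        public static void main (String[] args) throws ParserException
        {
            Parser parser = new Parser (args[0]);
            parser.registerScanners (); // register the standard scanners so <FRAME> tags are recognized
            Node[] frames = parser.extractAllNodesThatAre (FrameTag.class);
            for (int i = 0; i < frames.length; i++)
                // getFrameLocation() reports the frame's SRC location
                System.out.println (((FrameTag)frames[i]).getFrameLocation ());
        }
    }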
From: Randall R S. <rs...@so...> - 2003-05-27 17:37:29
Hi,

Following up on my latest lament...

I'm using this method to get unmodified HREF= attribute values:

    private String linkHREF (LinkTag link)
    {
        String href = link.getAttribute ("href");
        if (href != null)
            return href;

        href = link.getAttribute ("HREF");
        if (href != null)
            return href;

        return "";
    }

The documentation doesn't make clear whether Tag.getAttribute() performs alphabetic case canonicalization on the attribute names in the tags in the HTML input and on its arguments, so I tried two of the 16 possible combinations of upper and lower case among four letters.

Comments?

Randall Schulz
From: Marc N. <ma...@ke...> - 2003-05-27 17:26:28
The default Visitors may collect the tags into collections, but it is completely possible to write your own Visitor that handles each tag as it is parsed in the source document. Just override the visitTag() method and do what you want with each tag as it's parsed.

Marc
From: Randall R S. <rs...@so...> - 2003-05-27 17:14:38
Marc,

That's not really what I'm asking for. It's not stream oriented, because it forces the entire file to be read and an array holding the entire set of links to be collected before I can process them.

Nonetheless, Dhaval's solution worked fine.

Now what I really need is unmodified HREF= attribute values.

Randall Schulz

At 09:58 2003-05-27, Marc Novakowski wrote:
>Version 1.3 now has a "visitor" pattern to do exactly what you need:
>http://htmlparser.sourceforge.net/docs/index.php/VisitorPattern
>
>See also the "LinkFindingVisitor" class (subclass of NodeVisitor).
>
>Marc
From: Marc N. <ma...@ke...> - 2003-05-27 16:58:09
Version 1.3 now has a "visitor" pattern to do exactly what you need:
http://htmlparser.sourceforge.net/docs/index.php/VisitorPattern

See also the "LinkFindingVisitor" class (subclass of NodeVisitor).

Marc
From: Randall R S. <rs...@so...> - 2003-05-27 16:14:26
Hello again,

I have another problem with changed behavior in version 1.3 compared to 1.2.

I am using LinkTag.getLink() to get the target of a link. In version 1.2 this returned the unadorned value of the HREF= attribute as it appeared in the HTML document. This is what I want and what I need.

In version 1.3, the result is modified when the link is relative. I'm (currently) processing local files, not documents retrieved via a network URL, and relative links are being resolved to something like this:

    Input file name:        "C:/SHS/Web/Biography/Biography.html"
    HTML fragment:          <a href="SHS.html"> ... </a>
    getLink() return value: "file://localhostC:/SHS/Web/Biography/SHS.html"

Not only is this not what I want, it's also malformed, since the required slash between the (gratuitous) "localhost" and the (gratuitous) "C:/SHS/Web/Biography/" is missing.

How can I get the original behavior?

Thanks.

Randy
From: <dha...@po...> - 2003-05-27 13:37:08
Hi Randall,

Let's not get critical over the design of 1.3. I think it has helped to bring about a great degree of stability, as well as correctness, in the parser. The visitor pattern employed has helped a lot, and it's really visible to a person like me who has used the parser extensively in my project.

I also had a stream-based approach, and I managed to maintain it through a recursive approach of searching the page and then reproducing it. If you let me know your design and your requirement, there will definitely be a way out; only that we will have to search for it.

Dhaval
From: Randall R S. <rs...@so...> - 2003-05-27 13:25:15
Dhaval,

Thanks for the information.

Does this mean that there is not a strictly stream-based way to examine a complete HTML document?

I don't mean to be impertinent, but this seems like a weaker design than that used in version 1.2. I don't like having to build collections containing portions of the document's contents when all I'm doing is making a single pass over the file. I think a purely stream-oriented model, whenever suitable to the driving application, is distinctly preferred. Building DOM-like structures when there's no need for random access to their sub-structure seems like a waste of memory.

If there's a way to preserve the kind of program design I was using, I'd really like to know about it.

Thanks again.

Randall Schulz
From: <dha...@po...> - 2003-05-27 07:10:40
Hi Randall,

I faced very similar problems when I migrated to 1.3. Basically, I believe the structure now returned has more of a DOM face than the SAX one it had earlier. Earlier, all nodes were available at the base level; today they are buried within other nodes. Hence your <A> is buried in the <TABLE> tag and has to be extracted as a child of the <TABLE> tag.

Code for extracting link tags is given in the documentation of Node. Reproducing it here for you:

    NodeList collectionList = new NodeList();
    Node node;
    for (NodeIterator e = parser.elements(); e.hasMoreNodes();) {
        node = e.nextNode();
        node.collectInto (collectionList, LinkTag.class);
    }

This will get link nodes at all levels.

Dhaval
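For completeness, a self-contained version of this fragment might look like the sketch below. The surrounding setup (Parser construction, registerScanners()) and NodeList's size()/elementAt() accessors are assumptions from the 1.3-era API:

    import org.htmlparser.Node;
    import org.htmlparser.Parser;
    import org.htmlparser.tags.LinkTag;
    import org.htmlparser.util.NodeIterator;
    import org.htmlparser.util.NodeList;
    import org.htmlparser.util.ParserException;

    public class AllLinks
    {
        public static void main (String[] args) throws ParserException
        {
            Parser parser = new Parser (args[0]);
            parser.registerScanners (); // so <A> tags are produced as LinkTag nodes

            NodeList collectionList = new NodeList ();
            for (NodeIterator e = parser.elements (); e.hasMoreNodes (); )
                // collectInto() recurses into composite tags such as <TABLE>
                e.nextNode ().collectInto (collectionList, LinkTag.class);

            for (int i = 0; i < collectionList.size (); i++)
                System.out.println (((LinkTag)collectionList.elementAt (i)).getLink ());
        }
    }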
From: Randall R S. <rs...@so...> - 2003-05-27 06:06:19
Hi,

I have a properly functioning link extraction program based on HTMLParser 1.2. I'm trying to convert it to HTMLParser 1.3 (the newly released final version). I've fixed up all the changed data types, but I'm finding that links (<A> tags) contained within tables are being skipped.

If, in my node processing loop, I replace the calls to my internal node processing function (which is called only if the node is an instance of LinkTag) with System.out.println(node.toString()); (for all returned node types), I see tables coming through as "blobs," if you will. That is, I see the entire <TABLE> ... </TABLE> element in the output as a single node with none of its internal structure parsed.

I've tried adding a TableScanner to the parser (using parser.addScanner(new TableScanner(parser, "-tb"));), a TableRowScanner (using parser.addScanner(new TableRowScanner("-tr", parser));) and a TableColumnScanner (using parser.addScanner(new TableColumnScanner("-tc"));), but to no avail.

What am I missing? How do I process _all_ of the link tags within the input document?

Thanks.

Randall Schulz
From: Marc N. <ma...@ke...> - 2003-05-27 04:37:16
The "obscured words in an image" trick is used to avoid web robots such as the one you're writing. Sites like Ticketmaster use it to be fair to the average human being who doesn't have the knowledge (or perhaps time?) to write an automated script to compete with real (slow) humans. I suspect they found that the automated scripts were not only pounding relentlessly on their server (which costs them money and makes the site slow for everyone), but the scripts were probably winning more tickets than humans, just from the sheer number of attempts they were able to make per second.

Other sites, such as Paypal I seem to remember, use it as an added level of security to verify that a real human being is setting up a new account.

Unless you can write sophisticated AI image recognition software, you're not going to be able to get around it.

As for handling cookies, check out the following documentation on how to handle cookies with the Parser:
http://htmlparser.sourceforge.net/docs/index.php/UsingCookiesWithParser

Marc
From: C. F. S. A. <na...@an...> - 2003-05-26 18:50:18
Hello all,

I am trying to automate the posting of a page. However, for any information I request by typing a code, it asks me to type what is written in an image. This page also saves a cookie, which is then used to read its session content to verify whether what is typed matches what is on the image. If it does, then it shows the results. They do this to avoid web robots. Does anyone know how to bypass this?

Thanks in advance,

C.F.
From: David HM S. <sp...@ze...> - 2003-05-26 04:41:08
I've just started working with HTMLParser, and before I dive off the deep end and start coding something up, I was wondering if anyone was working on an (XML, perhaps) scripting language to drive the parser? I.e., for a known web page, use a script to get relevant info from the page, populate form fields, submit, and then parse more results...

regards,
David

David HM Spector                           spector (at) zeitgeist.com
software architecture / network design / security consultation
technical due diligence / technology planning / needs analysis
Office: (631) 261-5013                     www.zeitgeist.com
Cell: (631) 827-3132
From: Derrick O. <Der...@ro...> - 2003-05-25 23:46:02
Version 1.3 of the most popular HTML parser on SourceForge is now available. Four weeks of candidate testing have culminated in a very stable, production-level product with many new user-requested features.

Features added since 1.2 include:

- constructor(URLConnection) for POST and exotic GET
- improved character set handling
- hierarchically nested tags, i.e. tables
- scanners for each type of tag
- java beans for easy integration of text and link fetching
- 'visitor' patterns
- Wiki page documentation
- improved script scanning
- improved whitespace handling

The developers of the HTML Parser hope you enjoy it.
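As an illustration of the bean integration mentioned in the feature list above, fetching all the links from a page might look like the sketch below. It assumes the org.htmlparser.beans.LinkBean class and its setURL()/getLinks() accessors, so consult the 1.3 javadoc for the exact names:

    import java.net.URL;
    import org.htmlparser.beans.LinkBean;

    public class LinkList
    {
        public static void main (String[] args)
        {
            LinkBean bean = new LinkBean ();
            // setURL() triggers the fetch and parse of the page
            bean.setURL (args[0]);
            // getLinks() returns the extracted links as java.net.URL objects
            URL[] links = bean.getLinks ();
            for (int i = 0; i < links.length; i++)
                System.out.println (links[i]);
        }
    }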
From: Derrick O. <Der...@ro...> - 2003-05-25 18:38:37
Steve,

When I parse www.sina.com.cn I get:

    INFO: detected charset "gb2312", using "EUC-CN"

I have no easy way of checking whether the characters are correct, but it seems to work, because it chooses a Chinese character set. Can you suggest a better way to choose a character set?

The parse did find an off-by-one error in the way htmlparser stacks nodes. I fixed that, and it will be available in the next integration release (1_3_20030425).

Derrick
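This exchange turns on which names java.nio.charset.Charset will accept. A small standalone probe (plain J2SE 1.4+, nothing HTMLParser-specific) shows one way to test candidate encodings in order and fall back, roughly what the parser's detection has to do; the candidate list here is illustrative:

    import java.nio.charset.Charset;

    public class CharsetProbe
    {
        public static void main (String[] args)
        {
            // candidate names for the page's declared encoding, most specific first
            String[] candidates = { "GB2312", "EUC-CN", "GBK", "ISO-8859-1" };
            for (int i = 0; i < candidates.length; i++)
                if (Charset.isSupported (candidates[i]))
                {
                    // forName() resolves an alias to the canonical charset name
                    System.out.println (candidates[i] + " -> " + Charset.forName (candidates[i]).name ());
                    break;
                }
        }
    }

On a 2003-era JVM the "GB2312" alias could be missing, which is exactly the failure Steve reports below; newer runtimes resolve it to EUC-CN.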
From: Steve <li...@e2...> - 2003-05-24 18:22:08
Hi, I try to parse some Chinese HTML pages.

First, I just used a very simple Chinese HTML page, but the ParserHelper.findCharset() method seems not to work well: java.nio.charset.Charset doesn't support "GB2312" (a Chinese encoding), so HtmlParser uses the default character set, "ISO-8859-1", and it displays a mess! When I hardcode character_set = "GB2312", it works! I am not very familiar with the java.nio.charset.Charset class and the HTMLParser package, so I don't know why you use this class to test for supported encodings.

Second, I try to parse www.sina.com.cn, which is the most popular website; when I try to parse it, HtmlParser can't work!
From: greg <gr...@we...> - 2003-05-23 16:26:14
I just started looking at HTMLParser yesterday, so forgive me if this is an incorrect question.

Basically, I have a "dirty" HTML document coming out of MS Word 2000 that I need to parse to pull some information from and store portions of in a database. One of the tags that Word uses throughout is the <a name="_TOC353535"></a> type tag. Because of this I extended the LinkTag, LinkScanner, and LinkTagData classes to become NamedLinkTag, NamedLinkScanner, and NamedLinkTagData, as I needed to be able to get the name used for the anchor. I then extended the LinkScannerTest class to have three tests of my own. The last test looks like this:

    createParser("<a name=\"FOO\">Click Here</A>");
    parser.addScanner(new NamedLinkScanner("-l"));
    parseAndAssertNodeCount(1);
    assertTrue("The node should be a named link tag", node[0] instanceof NamedLinkTag);
    NamedLinkTag namedLinkTag = (NamedLinkTag)node[0];
    assertEquals("Link URL of link tag", null, namedLinkTag.getLink());
    assertEquals("Link Text of link tag", "Click Here", namedLinkTag.getLinkText());
    assertEquals("Access key", null, namedLinkTag.getAccessKey());
    assertEquals("Link name check", "FOO", namedLinkTag.getLinkName());

This test keeps failing on the parseAndAssertNodeCount(1) method, returning a 3 instead of a 1. What is the likely cause of this? It almost seems as though the <a name=""> is not seen as a link (and thus a NamedLinkTag) by the parser?

Thanks for your help,
Greg