htmlparser-user Mailing List for HTML Parser
Brought to you by:
derrickoswald
From: Steve M. <st...@so...> - 2004-02-27 03:58:56

Derrick,

Thanks again for your help. By the way, I didn't have much luck with the threading issue. I figured upgrading would be a good course of action. I think I am getting the hang of the new parser now...

Once again, thanks, and that's two I owe you...

Steve

From: Derrick O. <Der...@Ro...> - 2004-02-27 01:26:53

Steve,

I think you've hit on the nub of the difference between the lexer and the parser. The lexer simply returns nodes, in order, and doesn't try to match end tags with start tags. So yes, you will get a TitleTag, but it hasn't been fed its children. The parser, on the other hand, collects the nodes between start and end tags so as to "know" that the thing between the TITLE and /TITLE tags is the "title of the document". See the home page for another explanation: http://htmlparser.sourceforge.net/

What you can do with the Lexer is get the next node *after* the TITLE tag and assume it's a plain-text title in a string node (people do funny things with HTML, so you're bound to see <TITLE><B>My Title</B></TITLE> and stuff like that, which I'm not sure is even completely handled by the parser code, so you have to be careful). Or perhaps get the *next* StringNode from the lexer, which is presumably the title for the same reasons as outlined before, but you have to watch out for empty <TITLE></TITLE> constructs.

Or you can use the Parser and hope it does the 'right thing'. If it doesn't, let us know.

Derrick

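The lexer-style approach described above (walk a flat node stream and take the first text node after the TITLE start tag, guarding against an empty title) can be sketched without the library. This is only a toy illustration under stated assumptions: `FlatTitleSketch`, `tokenize`, and `titleOf` are hypothetical names, the regex tokenizer is a stand-in for a real lexer, and none of it is the htmlparser API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy illustration only: this is not the htmlparser Lexer API. */
final class FlatTitleSketch {

    /** Splits markup into a flat, ordered list of tags and text runs, as a lexer would. */
    static List<String> tokenize(String html) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("<[^>]*>|[^<]+").matcher(html);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    /** Returns the first text run between <title> and </title>, or "" if there is none. */
    static String titleOf(String html) {
        boolean inTitle = false;
        for (String token : tokenize(html)) {
            if (token.equalsIgnoreCase("<title>")) {
                inTitle = true;
            } else if (token.equalsIgnoreCase("</title>")) {
                return "";            // empty <TITLE></TITLE>: no text was found
            } else if (inTitle && !token.startsWith("<")) {
                return token.trim();  // first text run after the TITLE start tag
            }
            // Any other tag inside the title (e.g. <B>) is skipped.
        }
        return "";
    }

    public static void main(String[] args) {
        System.out.println(titleOf("<html><!--remark--><head><title>Yahoo!</title></head>"));
        System.out.println(titleOf("<TITLE><B>My Title</B></TITLE>"));
        System.out.println("[" + titleOf("<title></title><p>body text</p>") + "]");
    }
}
```

Nothing in the flat stream links the text to the TITLE tag, so the caller has to track that state itself; that bookkeeping is the work a parser would otherwise do when it collects children.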
From: Steve M. <st...@so...> - 2004-02-26 20:26:57

Using the following code, the assert for the title fails (getTitle() returns an empty string). Is it not possible to retrieve that information using the lexer rather than the parser? I am using HTML Parser Integration Release 1.4-20040125.

Thank you,
Steve

    public void testTitleScan() throws ParserException
    {
        String inputHTML = "<html><!--remark--><head><title>Yahoo!</title></head>";
        Lexer lexer = new Lexer (new Page (inputHTML));

        PrototypicalNodeFactory factory =
            new PrototypicalNodeFactory(new TitleTag());
        lexer.setNodeFactory (factory);

        Node node;
        while (null != (node = lexer.nextNode ()))
        {
            if (node instanceof TitleTag)
            {
                TitleTag titleTag = (TitleTag) node;
                String test = titleTag.getTitle();
                assertEquals("Title","Yahoo!",titleTag.getTitle());
            }
            if(node instanceof RemarkNode)
            {
                RemarkNode remarkNode = (RemarkNode)node;
                String test = remarkNode.toPlainTextString();
                assertEquals("Remark","remark",test);
            }
        }
    }

From: Anthony L. <diz...@ho...> - 2004-02-19 10:34:51

Derrick,

Thanks for your quick answer, you were right about:

> I'm not sure where you are going wrong. The signature of the method
> looks correct (upper and lower case do matter). Have you imported the
> right RemarkNode class?

I had actually imported the xxxNode classes from org.htmlparser.lexer.nodes instead of org.htmlparser; I corrected this, and now those nodes are correctly detected.

Thanks very much,
Anthony

From: Derrick O. <Der...@Ro...> - 2004-02-18 23:23:35

Anthony,

I just tried the following little program, to make sure there wasn't a fundamental flaw:

    import org.htmlparser.Parser;
    import org.htmlparser.RemarkNode;
    import org.htmlparser.util.ParserException;
    import org.htmlparser.visitors.NodeVisitor;

    public class RemarkVisitor extends NodeVisitor
    {
        public void visitRemarkNode (RemarkNode remark)
        {
            System.out.println (remark.toHtml ());
            // or System.out.println (remark);
        }

        public static void main (String[] args) throws ParserException
        {
            Parser parser;
            RemarkVisitor visitor;

            parser = new Parser ("http://cbc.ca");
            visitor = new RemarkVisitor ();
            parser.visitAllNodesWith (visitor);
        }
    }

If you run this, it should show something like:

    <!--- Start Top Nav Bar --->
    <!-- Get required Javascript -->
    <!--- End Top Nav Bar --->
    ...

I'm not sure where you are going wrong. The signature of the method looks correct (upper and lower case do matter). Have you imported the right RemarkNode class? Maybe clip out the visitRemarkNode() method from this message and paste it in your class. The compiler should complain with both of them defined.

The reason you can't see RemarkNode objects in your other example is that they are nested within other tags. You would have to use getChildren() to dig into each node, and dig into the children of those, and so on recursively. For example:

    void printAllNodeClasses (Node node)
    {
        System.out.println (node.getClass().getName());
        NodeList children = node.getChildren ();
        if (null != children)
            for (int i = 0; i < children.size (); i++)
                printAllNodeClasses (children.elementAt (i));
    }

Then replace line (1) with:

    printAllNodeClasses (n);

Derrick

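Why a top-level loop misses nested comments while a recursive walk finds them can be shown on a toy tree. This is a minimal sketch under stated assumptions: `NestedRemarkSketch`, its `Node` class, and `collectRemarks` are hypothetical stand-ins, not the htmlparser `Node`, `NodeList`, or `RemarkNode` classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy node tree; a stand-in for the parser's nodes, not the htmlparser classes. */
final class NestedRemarkSketch {

    static final class Node {
        final String kind;   // e.g. "tag", "text", "remark"
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(String kind, String text) {
            this.kind = kind;
            this.text = text;
        }

        /** Appends a child and returns this node so trees can be built fluently. */
        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /** Depth-first walk that collects the text of every remark node, however deeply nested. */
    static void collectRemarks(Node node, List<String> out) {
        if (node.kind.equals("remark")) {
            out.add(node.text);
        }
        for (Node child : node.children) {
            collectRemarks(child, out);
        }
    }

    public static void main(String[] args) {
        Node html = new Node("tag", "html")
            .add(new Node("tag", "body")
                .add(new Node("remark", " Start Top Nav Bar "))
                .add(new Node("text", "Hello")));

        // Looking only at the top level sees just the html tag, never the remark.
        System.out.println("top-level kind: " + html.kind);

        List<String> remarks = new ArrayList<>();
        collectRemarks(html, remarks);
        System.out.println("remarks found: " + remarks.size());
    }
}
```

A visitor-based traversal does the same descent on the caller's behalf, which is why the visitor route reports comments that the flat iterator loop never reaches.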
From: Anthony L. <ant...@sk...> - 2004-02-18 17:18:11

Hello,

This should be very trivial to do, but I've tried all the ways presented on the HTMLParser wiki and I just can't seem to retrieve the comments in the webpages I parse. I've tried it by inheriting NodeVisitor and redefining the visitRemarkNode(RemarkNode) method, and then of course sending the message visitAllNodesWith(MyVisitor) to the parser, but there's no result (checked by trying to output a test string in visitRemarkNode).

    public void visitRemarkNode(RemarkNode remarkNode) {
        System.out.println ("COMMENT : ");
    }

I've played around with RemarkNode.accept(MyVisitor), but it was no use - and I don't think I inherited it in a bad way, since I used your documentation and the tags are visited correctly by visitTag.

What puzzles me most, though, is that when I've tried it the other way - not using a visitor - and displayed the result of:

    Parser parser = new Parser(input_filename);
    NodeIterator i = parser.elements();
    StringBuffer htmlBuffer = new StringBuffer();
    for ( ; i.hasMoreNodes(); ) {
        Node n = i.nextNode();
        System.out.println(n.getClass()); // (1)
        htmlBuffer.append(n.toHtml());
    }
    System.out.println("htmlBuffer contains:\n"+htmlBuffer.toString()); // (2)

... the output of (1) doesn't contain any reference to a RemarkNode (there are references to StringNode though, and I've unsuccessfully used accept(MyVisitor) on them), but the output of (2) does contain the commented part. I've spent hours in the documentation, the code and on the net looking for help, but didn't find anything (nor anyone) that could solve my problem. I hope someone here will have the will and time to help me.

Thanks everyone and have a nice day,
A.

From: Marc N. <ma...@ke...> - 2004-02-17 18:27:06

Just to clarify -- the library already does most of the things I list below (i.e. I've already implemented them using a semi-current version of HTMLParser). However, I'm listing them here so they may be considered as one of the many use cases for the library.

I also want to commend Derrick for all the work he's put into the project!

Marc

From: John M. <jo...@rt...> - 2004-02-17 18:25:39

Custom tags with namespaces would also be a nice feature, à la <rte:body></rte:body>. We use those for marking the text that our Lucene search engine should index. At the moment I am using a simple substring method to parse out the text between these tags, but having htmlparser support them out of the box would make things a lot more efficient for more complex pages with multiple tags.

John

--
John Moylan
ePublishing
Radio Telefis Eireann, Montrose House, Donnybrook, Dublin 4, Eire
t:+353 1 2083564 e:joh...@rt...

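The "simple substring method" mentioned in this message is not shown, so the following is only a guess at its general shape: a minimal sketch that pulls the text out of every <rte:body> section on a page. `NamespacedTagSketch` and `textBetween` are hypothetical names, and the sketch assumes exact, attribute-free, non-nested tags.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain substring scanning; an illustration, not part of htmlparser. */
final class NamespacedTagSketch {

    /** Returns the text between every openTag/closeTag pair, in document order. */
    static List<String> textBetween(String html, String openTag, String closeTag) {
        List<String> found = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = html.indexOf(openTag, from);
            if (start < 0) {
                break;                      // no further opening tag
            }
            int contentStart = start + openTag.length();
            int end = html.indexOf(closeTag, contentStart);
            if (end < 0) {
                break;                      // unclosed tag: stop rather than guess
            }
            found.add(html.substring(contentStart, end));
            from = end + closeTag.length(); // continue after this section
        }
        return found;
    }

    public static void main(String[] args) {
        String page = "<html><rte:body>First story</rte:body><hr>"
                    + "<rte:body>Second story</rte:body></html>";
        System.out.println(textBetween(page, "<rte:body>", "</rte:body>"));
    }
}
```

Tag attributes, mixed case, or nested sections would all defeat this scan, which is the kind of fragility that makes parser-level support for namespaced tags attractive.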
From: Marc N. <ma...@ke...> - 2004-02-17 18:16:16

I'm a big fan of server-side transforms. That is, scanning an HTML document and transforming parts of it into custom markup and/or DHTML. I do this using a servlet filter in Tomcat.

I'm currently using an older version of the library (from 08/24/2003) -- before the major code changes were made -- mostly because I've been too busy working on other things to port my code to the new APIs. I hope to get to it eventually! :)

However, if you're looking for feedback, then here's what I would find useful in the library. It may or may not already do the following to certain degrees. But if anything in this list can be made easy(ier), then I'm all for it:

- scan an HTML page for "custom" XML/HTML tags embedded within the HTML
- maintain both the original HTML and the location of the XML "islands" within it
- provide mechanisms to parse different kinds of custom tags, including the following:
    - very simple tags (like <br>)
    - value-only tags (like <a>value</a>)
    - composite tags (like <ul>)
    - tags that contain "anything", which the parser simply skips over (similar to <script>, but even dumber so that all it looks for is the closing tag)
- APIs that allow the definition of the custom tags (above) without having to create a custom scanner and tag class for each one

For illustrative purposes, here's an example of what some of my custom tags look like:

    <html>
    <body>
    <h2>Here is the chart</h2>
    <Component name="myChart" incorporates="Chart">
        <String name="backgroundColor" value="white"/>
        <String name="foregroundColor" value="black"/>
        <Number name="width" value="200"/>
        <Number name="height" value="400"/>
        <Reference name="data" value="dataModel"/>
        <Method name="changeSize">
            <Param name="width"/>
            <Param name="height"/>
            <Impl>
                // This is javascript code
                this.width.set(width);
                this.height.set(height);
                this.render();
            </Impl>
        </Method>
    </Component>
    <hr>
    blah blah .... (more HTML) ....
    </body>
    </html>

Hope this helps!
Marc

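One way to picture the last wish in Marc's list (defining custom tags without writing a scanner and tag class for each) is a small registry that maps tag names to one of the four kinds he describes. This is purely a hypothetical sketch: `CustomTagRegistrySketch`, `Kind`, `define`, `kindOf`, and `skipsBody` are invented for illustration and do not exist in htmlparser.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical declarative registry of custom tag kinds; not an htmlparser API. */
final class CustomTagRegistrySketch {

    /** The four kinds of custom tag from the list above. */
    enum Kind { SIMPLE, VALUE_ONLY, COMPOSITE, OPAQUE }

    private final Map<String, Kind> kinds = new LinkedHashMap<>();

    /** Registers a tag by name; names are matched case-insensitively. */
    CustomTagRegistrySketch define(String tagName, Kind kind) {
        kinds.put(tagName.toLowerCase(Locale.ROOT), kind);
        return this;
    }

    /** Returns the registered kind, or null for tags that are treated as ordinary HTML. */
    Kind kindOf(String tagName) {
        return kinds.get(tagName.toLowerCase(Locale.ROOT));
    }

    /** True when a scanner should skip straight to the closing tag without parsing the body. */
    boolean skipsBody(String tagName) {
        return kindOf(tagName) == Kind.OPAQUE;
    }

    public static void main(String[] args) {
        CustomTagRegistrySketch registry = new CustomTagRegistrySketch()
            .define("Component", Kind.COMPOSITE)
            .define("Method", Kind.COMPOSITE)
            .define("String", Kind.SIMPLE)
            .define("Impl", Kind.OPAQUE);

        System.out.println(registry.kindOf("component"));
        System.out.println(registry.skipsBody("Impl"));
        System.out.println(registry.kindOf("body"));
    }
}
```

In this picture a single generic scanner would consult the registry for each tag it meets, so adding a new custom tag becomes one `define` call instead of a new pair of classes.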
From: Ian M. <ia...@nt...> - 2004-02-17 16:07:27

I haven't looked for quite a while, but I think the documentation and examples of the existing version could be added to. I think I grabbed a 1.4beta and the examples hadn't quite caught up with the main coding.

It's a great project though, and one I will make more use of in the future. Congrats to those involved in keeping on plugging away at it.

Apologies for not catching up on the latest developments before posting.

Ian

:: http://ianmoss.com :: http://alteris.co.uk :: http://rock666.com ::

From: Derrick O. <Der...@Ro...> - 2004-02-17 12:44:13

Now that version 1.4 is nearly put to bed, it's time to look forward into the future to visualize or 'blue sky' the features that could be incorporated in the next version of the parser. There are a small number of feature requests that have accumulated over the last few months that can serve as a starting point:
http://sourceforge.net/tracker/?group_id=24399&atid=381402

But what is really required are some real use-cases that aren't addressed by the current parser, which will lead to real requirements, which lead to real features that can be added to the parser for the next version. What does everyone do with the htmlparser that could be built into it? Or more to the point, what capabilities are lacking that cause a developer to *not* use htmlparser and do it themselves some other way?

Does anybody have any ideas? Does anybody have some applications they would like to add to the htmlparser codebase so that 'out-of-the-box' it does what they want? In general, what directions should development take, e.g. HTML correction or editing, XML, robots, server side transforms etc.? Has anybody got some pet peeves they want cleared up?

Come on, give it up. Now's the time.

Derrick

From: Gifti <gi...@my...> - 2004-02-15 02:21:53
|
Derrick, you are great! Thanks for this perfect support. Now I understand the htmlparser, and my class is working! Gifti |
From: Derrick O. <Der...@Ro...> - 2004-02-14 23:00:41
|
Gifti,

If you have the FORM tag, the ACTION attribute can be set with the setFormLocation(String url) method. Be careful of any BASE tag that specifies the root URL for the page; this will alter what you need to specify for a URL. Maybe absolute is best. I think something like this should work:

    import org.htmlparser.filters.NodeClassFilter;
    import org.htmlparser.tags.FormTag;
    import org.htmlparser.tags.InputTag;
    import org.htmlparser.util.NodeIterator;
    import org.htmlparser.util.NodeList;

    // get all the nodes on the page
    NodeList list = new NodeList ();
    NodeIterator iterator = parser.elements ();
    while (iterator.hasMoreNodes ())
        list.add (iterator.nextNode ());
    // get the FORM tag somehow
    FormTag form = (FormTag)list.extractAllNodesThatMatch (
        new NodeClassFilter (FormTag.class), true).elementAt (0);
    // set its action
    form.setFormLocation ("http://whatever.org/cgi-bin");
    // create a hidden tag like <input type="hidden" name="dunno" value="what">
    InputTag input = new InputTag ();
    input.setTagName ("input");
    input.setAttribute ("type", "hidden");
    input.setAttribute ("name", "dunno");
    input.setAttribute ("value", "what");
    // add the input as a child of the form
    form.getChildren ().add (input);
    // then print it all out as HTML
    iterator = list.elements ();
    while (iterator.hasMoreNodes ())
        System.out.println (iterator.nextNode ().toHtml ());

Derrick |
From: Gifti <gi...@my...> - 2004-02-14 19:48:27
|
Hi all, I have to parse an HTML file for all of its input fields. After a lot of work I got that working. ;) Now I have to change the action field. I can change it, but when I write everything back out to the file, the change to the action field is not written. My other problem is that I have to insert a hidden field into the HTML source, and I don't know how to do that. Sorry for my bad English, but I'm a noob in English (like in Java...). Thanks for the help! gifti |
From: Derrick O. <Der...@Ro...> - 2004-02-02 23:11:24
|
Steve,

Not your server, the other side. If you are using a Sun 1.4 JDK, you can try forcing the TCP/IP read to time-out after a set period. In your mainline, before starting, set these system properties (milliseconds):

    System.setProperty ("sun.net.client.defaultReadTimeout", "7000");
    System.setProperty ("sun.net.client.defaultConnectTimeout", "7000");

Derrick |
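As a rough sketch of the suggestion above, the two properties can be set once at startup, before any connection is opened. The property names are the ones given in the message and are specific to Sun-derived HTTP handlers. The class name `ReadTimeouts` and its method names are made up for illustration. The per-connection variant assumes Java 5 or later, which is newer than the 1.4 JDK discussed in the thread.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

final class ReadTimeouts {
    /** Sets JVM-wide default connect and read timeouts, in milliseconds. */
    static void applyDefaults(int millis) {
        String value = Integer.toString(millis);
        System.setProperty("sun.net.client.defaultConnectTimeout", value);
        System.setProperty("sun.net.client.defaultReadTimeout", value);
    }

    /** Per-connection alternative (assumes Java 5+); does not connect yet. */
    static URLConnection open(URL url, int millis) throws IOException {
        URLConnection connection = url.openConnection();
        connection.setConnectTimeout(millis);
        connection.setReadTimeout(millis);
        return connection;
    }

    public static void main(String[] args) {
        // Call this first thing in the mainline, before any page is fetched.
        applyDefaults(7000);
        System.out.println(System.getProperty("sun.net.client.defaultReadTimeout"));
    }
}
```

With a read timeout in place, a stalled read should surface as a `java.net.SocketTimeoutException` (an `IOException`) instead of blocking indefinitely.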
From: Steve M. <st...@so...> - 2004-02-02 14:59:09
|
Derrick,

I have not been able to get the problem to reproduce. That is, if a particular URL in a set hangs, if I rerun the exact same set, it processes fine. It has been a little frustrating...

You mentioned it might be a server-side issue. Do you mean an issue with my server dropping the socket? If so, any suggestions on how to get the parser to recognize this situation and throw an exception of some sort?

Thanks,
Steve

Date: Thu, 29 Jan 2004 17:55:00 -0500
From: Derrick Oswald <Der...@Ro...>
To: htm...@li...
Subject: Re: [Htmlparser-user] thread question
Reply-To: htm...@li...

Steve,

The 1.3 NodeReader.getNextLine() issues that message when it encounters an IOException. It's unclear why it's looping though. Have you got a URL that causes it? Maybe there is a server-side problem and it drops the socket.

Derrick

Steve McCann wrote:

> I am currently using Version 1.3 and have a thread problem. In summary,
> I have a list of URLs (approximately 800 at a time) and I start a fixed
> number of threads to parse those pages. On occasion, a thread never
> completes and the log file logs "I/O Exception occurred while reading!"
> until I stop Tomcat. It does not happen often (maybe once every few
> thousand pages...) and I have not been able to reproduce the error in a
> test environment.
>
> Is this an issue that you have come across before? If so, do you have
> any advice on addressing the issue?
>
> Thank you in advance for your assistance.
>
> Steve |
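Independently of the parser, one way to stop a single stuck page from holding up a batch is to give each page a deadline in the calling code. The sketch below is only an illustration of that idea, not something proposed in the thread: it assumes Java 5 or later (`java.util.concurrent`), and `PageWatchdog` and `runWithDeadline` are hypothetical names. The task passed in stands for whatever code parses one URL.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class PageWatchdog {
    /** Runs one task on the pool; returns null if it misses the deadline. */
    static <T> T runWithDeadline(ExecutorService pool, Callable<T> task, long timeoutMillis)
            throws InterruptedException, ExecutionException {
        Future<T> future = pool.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the worker. A thread blocked in a plain socket read
            // may not react to the interrupt, so socket timeouts still matter.
            future.cancel(true);
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            // Stand-in for "parse one page"; sleeps to simulate a hung read.
            String result = runWithDeadline(pool, () -> {
                Thread.sleep(5000);
                return "parsed";
            }, 200);
            System.out.println(result == null ? "gave up on page" : result);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The caller moves on after the deadline, but a worker that ignores the interrupt still occupies a pool thread, so this complements a read timeout and does not replace it.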
From: Xavier G. <xav...@si...> - 2004-02-02 12:54:47
|
Hi, While using the StringNoLinksBean class to extract the page text, I found that it does not filter style tags, so CSS code appears in the middle of the extracted text. I've modified the class, adding another filter flag, mIsStyle (like mIsA, etc.), which solves the problem. I'm not a registered developer, nor do I know if this is a 'generically desirable' feature, so I'm just mentioning it on the list. Cheers, Xavi. |
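The flag approach described above can be illustrated with a toy extractor that does not use htmlparser at all. It is a sketch under stated assumptions: the class `StyleSkippingText`, its regular expression, and the `inStyle` variable are hypothetical stand-ins for the bean's mIsStyle flag, and the regex is far cruder than a real HTML lexer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StyleSkippingText {
    // Matches an opening or closing tag; group 1 is "/" for closing tags,
    // group 2 is the tag name. Deliberately simplistic.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

    /** Returns the text outside tags, dropping anything inside STYLE elements. */
    static String extract(String html) {
        StringBuilder text = new StringBuilder();
        boolean inStyle = false;   // plays the role of the mIsStyle flag
        int position = 0;
        Matcher tag = TAG.matcher(html);
        while (tag.find()) {
            if (!inStyle) {
                text.append(html, position, tag.start());
            }
            if ("style".equalsIgnoreCase(tag.group(2))) {
                // <style ...> switches the flag on, </style> switches it off
                inStyle = tag.group(1).isEmpty();
            }
            position = tag.end();
        }
        if (!inStyle) {
            text.append(html.substring(position));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(extract("<p>Hello</p><style>p { color: red; }</style><p>World</p>"));
    }
}
```

The same pattern, set a flag on the start tag, clear it on the end tag, and drop text while it is set, is what the message describes for the bean's existing flags.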