htmlparser-user Mailing List for HTML Parser
Brought to you by:
derrickoswald
From: Lee F. W. <le...@st...> - 2005-03-31 14:47:13
Hi

I have been using htmlparser 1.5 for a short while now and am getting pretty frustrated. :-) Not because it doesn't seem to be a good utility that exposes a rich API, but because the documentation needed to get started with key issues is severely lacking. It seems much of my frustration is due to the API's flexibility: the ability to do the same things using different techniques. For someone getting started with the parser, comparing and picking these techniques is difficult, and I implore the developers and users who have experience with the API to share their knowledge in relevant forums, add more demos and write a JavaWorld (or other) "getting started" article on basic usage. This would also lead to more exposure to other Java developers unaware of the htmlparser project.

Anyway, as the title says, I need to transform some HTML for presentation in another web site. To do this I need to remove some tags (and their children), alter some tags, remove some attributes, alter some attributes etc. Up to now I have been following a simple procedure using a combination of registering tag handlers and using filters to achieve my goal. From what I have seen there seem to be many ways of retrieving tag content, but only one way to change tag content while parsing (by overriding doSemanticAction() in a tag subclass).

OK, my code looks something like this:

    // register tag handlers for tags I know I want to handle
    PrototypicalNodeFactory factory = new PrototypicalNodeFactory();
    factory.registerTag(new LinkTag() {
        public void doSemanticAction() throws ParserException {
            // do something with the Link tag or attributes
        }
    });
    factory.registerTag(new BodyTag() {
        public void doSemanticAction() throws ParserException {
            // do something with the body tag or attributes
        }
    });
    //factory.registerTag(new general tag that processes all tags
    //                    even those that have specific handlers
    //                    see text below

    Parser parser = new Parser(url);
    parser.setNodeFactory(factory);

    // this step seems a bit ridiculous. why doesn't parser have a
    // public NodeList getNodeList() method? Or am I missing something?
    NodeList list = new NodeList();
    for (NodeIterator iter = parser.elements(); iter.hasMoreNodes(); ) {
        list.add(iter.nextNode());
    }

    // get the body tag only. can be part of a complete html document
    // or just a html fragment (containing a body tag)
    NodeList body = list.extractAllNodesThatMatch(
        new NodeClassFilter(BodyTag.class), true);

    // remove unwanted tags and their children
    body.keepAllNodesThatMatch(new NotFilter(
        new NodeClassFilter(ScriptTag.class)), true);
    body.keepAllNodesThatMatch(new NotFilter(
        new NodeClassFilter(MetaTag.class)), true);

    // print the result for debugging
    for (int i = 0; i < body.size(); i++) {
        System.out.print(body.elementAt(i).toHtml());
    }

Does this look correct? It seems to give the result I want...

However, what I can't quite figure out is how to register a general tag listener that will take care of removing an attribute for all tags (or a subset of them), while still being able to register the more specific tag listeners at the same time. Is this possible? I tried to create a subclass of TagNode and override the doSemanticAction() method, but it didn't work and was never called. Am I on the right track? Any ideas?

Regards
Lee Francis Wilhelmsen

--
Programs should be written for people to read, and only incidentally for machines to execute.
-- Structure and Interpretation of Computer Programs, MIT Press
From: Derrick O. <Der...@Ro...> - 2005-03-28 21:32:15
I think you will probably just need to subclass CompositeTag and register it with the PrototypicalNodeFactory. See the numerous examples of subclassing in the org.htmlparser.tags package, and see org.htmlparser.parserapplications.SiteCapturer for an example of registration.

Bart Smith wrote:

> Hello,
> I need to create my own custom tags to search for.
> On the wiki there is an article "Writing your own scanners" but it
> says do not use this one.
>
> I have to create a custom tag without attributes. It will just be some
> data between <mytag> </mytag>
>
> Thanks
> Randy
From: Bart S. <rtp...@ya...> - 2005-03-28 18:41:50
Hello,

I need to create my own custom tags to search for. On the wiki there is an article "Writing your own scanners" but it says do not use this one.

I have to create a custom tag without attributes. It will just be some data between <mytag> </mytag>

Thanks
Randy
From: biao l. <sav...@ya...> - 2005-03-17 01:48:46
Hi, everybody!

I need a kind of XPath to describe or mark patterns of the interesting elements, but I am not sure htmlparser supports XPath. If not, how can I solve this problem?

Any information is appreciated!
From: Derrick O. <Der...@Ro...> - 2005-03-15 12:15:28
You might be adding nodes (tags) to the node currently being visited by the node visitor, so it is recursing and applying your changes many, many times to the same node.

Qingyi Gu wrote:

> Hi,
>
> I am using "NodeVisitor" to modify the links in the HTML page. It works
> pretty well, but I get the following error for certain pages. Does anybody
> have any idea how to fix it? I appreciate your help.
>
> **************************************************************
> java.lang.StackOverflowError
>     at org.htmlparser.lexer.InputStreamSource.getCharacters(InputStreamSource.java:588)
>     at org.htmlparser.lexer.Page.getText(Page.java:954)
>     at org.htmlparser.lexer.PageAttribute.getRawValue(PageAttribute.java:383)
>     at org.htmlparser.Attribute.toString(Attribute.java:706)
>     at org.htmlparser.nodes.TagNode.toHtml(TagNode.java:699)
>     at org.htmlparser.tags.CompositeTag.toHtml(CompositeTag.java:156)
>     at org.htmlparser.tags.CompositeTag.putEndTagInto(CompositeTag.java:151)
>     at org.htmlparser.tags.CompositeTag.toHtml(CompositeTag.java:161)
>     ....
> **************************************************************
>
> Thanks,
> Jenny
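The diagnosis in the reply above (edits made to the node being visited get visited again) can be illustrated with a toy tree. This is only a sketch using plain Java collections: the `Node` class and the `addMarkerToEveryNode` helper are hypothetical and are not part of the htmlparser API. The idea is to collect the target nodes in a read-only first pass and apply the changes afterwards, so newly added nodes are never themselves visited.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (not htmlparser): shows why edits should be deferred until
// the traversal has finished.
class DeferredEdit {
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        Node(String name) { this.name = name; }
        Node add(Node child) { children.add(child); return this; }
    }

    // Pass 1: read-only depth-first walk that records every node.
    static void collect(Node node, List<Node> out) {
        out.add(node);
        for (Node child : node.children) {
            collect(child, out);
        }
    }

    // Pass 2: mutate only the nodes recorded in the snapshot.
    // Adding the child inside collect() instead would make the walk descend
    // into each freshly added node and add another one, without end.
    static void addMarkerToEveryNode(Node root) {
        List<Node> snapshot = new ArrayList<>();
        collect(root, snapshot);
        for (Node node : snapshot) {
            node.children.add(new Node("span"));
        }
    }

    // Counts the nodes in the subtree rooted at the given node.
    static int count(Node node) {
        int total = 1;
        for (Node child : node.children) {
            total += count(child);
        }
        return total;
    }

    public static void main(String[] args) {
        Node root = new Node("body")
            .add(new Node("a"))
            .add(new Node("p").add(new Node("a")));
        addMarkerToEveryNode(root);
        System.out.println(count(root));
    }
}
```

With a real visitor the same pattern would mean recording the tags of interest while visiting and modifying them once the traversal has returned.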
From: Derrick O. <Der...@Ro...> - 2005-03-15 12:11:31
I believe this is the same situation as bug #976448 "Parent TableTag has incomplete HTML", or bug #923146 "tag nesting rule too strict for forms". The tag closing mechanism is somewhat intolerant of badly formed HTML. It takes one of the possible options, which is to close the table off and start another.

You could add your own <table>, <td> and <tr> custom tags that have their own scanner. Your scanner could be more tolerant. See the custom tag examples.

e3oginos X wrote:

> hi
> I have run into a situation where a site's HTML contained two
> closing </td> tags (for example </td> </td>) instead of one. This still
> displayed fine in a normal browser, however when running my program
> that uses htmlparser, it automatically treated the second </td>
> as a </table> and totally messed up the output.
> How can I stop this from happening? How can I provide custom
> methods for handling badly coded HTML situations such as this?
>
> thanks
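As a rough illustration of what "more tolerant" could mean for the duplicate </td> case, the sketch below drops end tags that have no matching open tag instead of treating them as closing an enclosing element. It is a toy written against plain strings with a regular expression, not an htmlparser scanner; the class and method names are hypothetical, and it ignores comments, void elements and other real-world HTML details.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy filter (not htmlparser): removes end tags that match no open tag.
class StrayEndTagFilter {
    // Group 1 captures the tag name; the second alternative matches text runs.
    private static final Pattern TOKEN =
        Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)[^>]*>|[^<]+");

    static String dropStrayEndTags(String html) {
        Deque<String> open = new ArrayDeque<>();   // currently open tag names
        StringBuilder out = new StringBuilder();
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            String token = m.group();
            String name = m.group(1);
            if (name == null) {                    // plain text: copy through
                out.append(token);
                continue;
            }
            name = name.toLowerCase();
            if (token.startsWith("</")) {
                if (!open.contains(name)) {
                    continue;                      // stray end tag: drop it
                }
                // Discard open tags down to and including the matching one.
                while (!open.pop().equals(name)) {
                    // unclosed inner tags are simply forgotten in this sketch
                }
                out.append(token);
            } else {
                open.push(name);
                out.append(token);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(
            dropStrayEndTags("<table><tr><td>x</td></td></tr></table>"));
    }
}
```

A custom scanner would make the equivalent decision inside the parser, when it meets an end tag that does not belong to the tag it is currently scanning.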
From: Derrick O. <Der...@Ro...> - 2005-03-15 12:05:01
Can you post the error message you are getting regarding "can't resolve"?

umasurya sykam wrote:

> I am a user of Htmlparser1_4 with j2sdk1.4.2_06.
>
> Everything is working fine.
>
> But there are 2 problems that are very important to get solved.
>
> 1. For any Visitor, even when I register it, it gives a "can't resolve
> symbol" error.
>
> 2. If there are any nested tables in the HTML page, it is unable to find
> the text using the method getExtractedText().
>
> So, could someone please give me a solution?
>
> Mainly, Visitors are not working in my software; except for that,
> everything is fine.
>
> First of all, I am very thankful to you for providing this kind of great
> software.
>
> Please reply as soon as possible.
>
> Regards Umasuryasykam
From: umasurya s. <uma...@ya...> - 2005-03-15 08:22:46
I am a user of Htmlparser1_4 with j2sdk1.4.2_06.

Everything is working fine.

But there are 2 problems that are very important to get solved.

1. For any Visitor, even when I register it, it gives a "can't resolve symbol" error.

2. If there are any nested tables in the HTML page, it is unable to find the text using the method getExtractedText().

So, could someone please give me a solution?

Mainly, Visitors are not working in my software; except for that, everything is fine.

First of all, I am very thankful to you for providing this kind of great software.

Please reply as soon as possible.

Regards Umasuryasykam
From: e3oginos X <e3o...@ho...> - 2005-03-15 02:58:08
hi

I have run into a situation where a site's HTML contained two closing </td> tags (for example </td> </td>) instead of one. This still displayed fine in a normal browser, however when running my program that uses htmlparser, it automatically treated the second </td> as a </table> and totally messed up the output.

How can I stop this from happening? How can I provide custom methods for handling badly coded HTML situations such as this?

thanks
From: Qingyi Gu <q_z...@ya...> - 2005-03-14 22:16:39
Hi,

I am using "NodeVisitor" to modify the links in the HTML page. It works pretty well, but I get the following error for certain pages. Does anybody have any idea how to fix it? I appreciate your help.

**************************************************************
java.lang.StackOverflowError
    at org.htmlparser.lexer.InputStreamSource.getCharacters(InputStreamSource.java:588)
    at org.htmlparser.lexer.Page.getText(Page.java:954)
    at org.htmlparser.lexer.PageAttribute.getRawValue(PageAttribute.java:383)
    at org.htmlparser.Attribute.toString(Attribute.java:706)
    at org.htmlparser.nodes.TagNode.toHtml(TagNode.java:699)
    at org.htmlparser.tags.CompositeTag.toHtml(CompositeTag.java:156)
    at org.htmlparser.tags.CompositeTag.putEndTagInto(CompositeTag.java:151)
    at org.htmlparser.tags.CompositeTag.toHtml(CompositeTag.java:161)
    ....
**************************************************************

Thanks,
Jenny
From: Derrick O. <Der...@Ro...> - 2005-03-10 02:07:56
It's possible to use the Lexer, which just gives a flat, linear sequence of nodes, but the parser will parse all of a composite tag (and place the child nodes in a node list) before it hands it back to you. I guess you could tailor the nodes returned, so that only some are composite, and process the rest linearly.

Rob Shields wrote:

> Hi Derrick,
>
> So is it not possible to use HTMLParser as a SAX-like parser? I recall
> I used HTMLParser in a very SAX-like way about 3-4 years ago.
>
> Rob
>
> From: Derrick Oswald <Der...@Ro...>
> Reply-To: htm...@li...
> To: htm...@li...
> Subject: Re: [Htmlparser-user] performance benchmarks
>
> Sorry, no. There were some old benchmarks that indicated htmlparser is
> 40% to 600% faster than JTidy.
> In any case, the current SAX implementation of HTML Parser is based
> on the DOM model under the hood.
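To make the "flat, linear sequence of nodes" idea concrete, here is a minimal sketch of SAX-like processing: the input is cut into tag and text tokens in document order and no tree is built. It deliberately uses a plain regular expression rather than the htmlparser Lexer, so the class and method names are hypothetical, and it does not cope with comments, scripts or a `>` inside an attribute value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy tokenizer (not the htmlparser Lexer): emits tags and text runs in
// document order without nesting them into composite nodes.
class FlatTokens {
    // Either a whole tag "<...>" or a run of text up to the next "<".
    private static final Pattern TOKEN = Pattern.compile("<[^>]*>|[^<]+");

    static List<String> tokenize(String html) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Each token can be handled as it arrives, as a SAX handler would.
        for (String token : tokenize("<p>Hi <b>there</b></p>")) {
            System.out.println(token.startsWith("<") ? "TAG  " + token
                                                     : "TEXT " + token);
        }
    }
}
```

The trade-off is the one described in the reply: a linear stream is cheap and can be processed as it arrives, but parent/child relationships have to be tracked by the caller.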
From: Rob S. <bob...@ho...> - 2005-03-09 14:58:30
Hi Derrick,

So is it not possible to use HTMLParser as a SAX-like parser? I recall I used HTMLParser in a very SAX-like way about 3-4 years ago.

Rob

> From: Derrick Oswald <Der...@Ro...>
> Reply-To: htm...@li...
> To: htm...@li...
> Subject: Re: [Htmlparser-user] performance benchmarks
>
> Sorry, no. There were some old benchmarks that indicated htmlparser is
> 40% to 600% faster than JTidy.
> In any case, the current SAX implementation of HTML Parser is based
> on the DOM model under the hood.
From: Derrick O. <Der...@Ro...> - 2005-03-08 23:18:58
Sorry, no. There were some old benchmarks that indicated htmlparser is 40% to 600% faster than JTidy.

In any case, the current SAX implementation of HTML Parser is based on the DOM model under the hood.

Rob Shields wrote:

> Hi
>
> I'm working on a product that uses a JavaCC-based HTML parser and I
> have suggested that a SAX-like parser may be faster for our use. Are
> there any performance benchmarks available for HTMLParser?
>
> Thanks
> Rob
From: Rob S. <bob...@ho...> - 2005-03-08 14:43:33
Hi

I'm working on a product that uses a JavaCC-based HTML parser and I have suggested that a SAX-like parser may be faster for our use. Are there any performance benchmarks available for HTMLParser?

Thanks
Rob
From: Derrick O. <Der...@Ro...> - 2005-03-07 23:26:01
The only ones that have 'children' are derived from CompositeTag. There aren't any tag classes for the font, strong and h1 through h5 tags you mention, so they are parsed as generic tags, which are not composite. You can make them composite by adding them to the set to be recognized, by defining classes for them:

    public class FontTag extends CompositeTag
    {
        private static final String[] mIds = new String[] {"FONT"};

        public String[] getIds ()
        {
            return (mIds);
        }
    }

and adding them to the list of tags to be recognized:

    PrototypicalNodeFactory factory = new PrototypicalNodeFactory ();
    factory.registerTag (new FontTag ());
    parser.setNodeFactory (factory);

e3oginos X wrote:

> hi
>
> I am using the html parser in a project that recognizes patterns
> within a page, however when using the getParent() method on a Node
> recursively to get its "ancestors" it skips <font> <strong> <h1-h5>
> tags. It seems to only get parents if they are <body> <html> <table>
> <td> <tr>
> How can I get it to return these tags?
>
> thanks
From: e3oginos X <e3o...@ho...> - 2005-03-07 17:50:14
hi

I am using the html parser in a project that recognizes patterns within a page, however when using the getParent() method on a Node recursively to get its "ancestors" it skips <font> <strong> <h1-h5> tags. It seems to only get parents if they are <body> <html> <table> <td> <tr>

How can I get it to return these tags?

thanks