htmlparser-user Mailing List for HTML Parser
Brought to you by:
derrickoswald
From: ope t. <op...@ho...> - 2003-03-31 21:08:04
|
Thanks a lot, it worked!
Sincerely,
Ope
|
|
From: Somik R. <so...@ya...> - 2003-03-31 04:43:54
|
Hi Folks,
This week's integration release is packed with goodies!
From the change log:
Integration Build 1.3 - 20030330
--------------------------------
[1] fixed bug (an enhancement really) 694477 quotes in content-type header
[2] fix bug #699886 and #707447 by using a buffered stream reader with
infinite mark
[3] fixed bug in CompositeTagScanner, filter not being set correctly
[4] fixed thread safety issue in TagParser (bug 711073)
[5] fixed out of memory error when parsing custom composite tags (bug
709152)
[6] fixed bug 701159, 696455 - redesigned script scanner.
Javascript parsing is now much more robust.
As you can see, a lot of bug fixes have gone in. There are three major
fixes - one by Derrick Oswald (#2) addresses the charset issue. The parser
should now be able to handle different charsets dynamically. We hope you can
test this and give us feedback.
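As a rough illustration of what a "buffered stream reader with infinite mark" makes possible, the sketch below marks a stream, decodes it with a default guess, and rewinds to decode again if the page declares a different charset. This is not the parser's actual code: the class name CharsetRestart, the ISO-8859-1 default, and the naive search for "charset=" are all assumptions made for the example.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not taken from the HTML Parser sources.
public class CharsetRestart {
    // Reads the rest of the stream as text in the given charset.
    static String readAll(InputStream in, Charset cs) throws IOException {
        Reader r = new InputStreamReader(in, cs);
        StringBuilder sb = new StringBuilder();
        for (int c; (c = r.read()) != -1; ) sb.append((char) c);
        return sb.toString();
    }

    // Decodes with a default guess, then rewinds and decodes again
    // if the page declares another charset.
    static String decode(byte[] page) {
        try {
            BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(page));
            in.mark(Integer.MAX_VALUE);          // "infinite" mark: reset() stays valid
            String first = readAll(in, StandardCharsets.ISO_8859_1);
            int i = first.indexOf("charset=");
            if (i < 0) return first;             // nothing declared, keep the first pass
            int start = i + "charset=".length();
            // Skip an optional opening quote around the charset name.
            if (start < first.length() && "\"'".indexOf(first.charAt(start)) >= 0) start++;
            int end = start;
            while (end < first.length() && "\"'>; ".indexOf(first.charAt(end)) < 0) end++;
            // Charset.forName throws for unknown or malformed names.
            Charset declared = Charset.forName(first.substring(start, end));
            in.reset();                          // rewind to the marked position
            return readAll(in, declared);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">caf\u00e9";
        System.out.println(decode(html.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The mark limit of Integer.MAX_VALUE is what keeps reset() valid however much of the page was read before the declared charset was found; the cost is that the whole prefix stays buffered in memory.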
The second big change is a redesign of the way Javascript is handled by the
parser. It had been riddled with problems for some time, so we've changed
its internals. The new implementation is much more robust, and hopefully we
can get some feedback on that too.
There were some thread safety issues (thanks to Joe Robbins for reporting
this). These have been addressed in this release, and the parser should be
totally thread-safe now.
Regards,
Somik
|
|
From: Somik R. <so...@ya...> - 2003-03-30 06:16:42
|
FYI, I've just found that the CompositeTagScanner had a bug, due to which
the filters were not being set. Ope -->
node.collectInto(nodeList, LinkTag.LINK_TAG_FILTER);
will work in the next integration release.
Regards,
Somik
|
|
From: Somik R. <so...@ya...> - 2003-03-27 22:38:36
|
Instead of this,
> node.collectInto(nodeList,LinkTag.LINK_TAG_FILTER);
use:
node.collectInto(nodeList,LinkTag.class);
Regards,
Somik
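The idea behind collecting by class rather than by filter string can be sketched without the library. The Node, LinkTag and ImageTag types below are hypothetical stand-ins written for this example, not the classes in org.htmlparser, and the generic collectInto signature is an assumption rather than the library's own.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the real library's Node and tag classes differ.
public class CollectByClass {
    static class Node {
        final List<Node> children = new ArrayList<>();

        Node add(Node child) {
            children.add(child);
            return this;
        }

        // Adds this node and every descendant that is an instance of the given type.
        <T extends Node> void collectInto(List<T> out, Class<T> type) {
            if (type.isInstance(this)) out.add(type.cast(this));
            for (Node c : children) c.collectInto(out, type);
        }
    }

    static class LinkTag extends Node {
        final String link;
        LinkTag(String link) { this.link = link; }
    }

    static class ImageTag extends Node {}

    public static void main(String[] args) {
        // One link at the top level and one nested inside another tag.
        Node root = new Node()
            .add(new LinkTag("a.html"))
            .add(new ImageTag().add(new LinkTag("b.html")));
        List<LinkTag> links = new ArrayList<>();
        root.collectInto(links, LinkTag.class);
        for (LinkTag l : links) System.out.println(l.link);
    }
}
```

Matching on the class does not depend on a filter string having been set on each tag by its scanner, which is why it can serve as a workaround when the filter was not being set.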
|
|
From: Marc N. <ma...@ke...> - 2003-03-27 22:19:58
|
Try removing the following line from your code:
nodeList.add(node);
It's most likely adding non-LinkTag nodes into nodeList which causes the
ClassCastException later on.
Marc
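The failure described above can be reproduced in miniature. The sketch below uses hypothetical stand-in types (Node, DoctypeTag, LinkTag) rather than the library's classes, and a boolean flag that plays the role of the nodeList.add(node) line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; these are stand-ins, not the org.htmlparser classes.
public class CastDemo {
    interface Node {}

    static class DoctypeTag implements Node {}

    static class LinkTag implements Node {
        final String link;
        LinkTag(String link) { this.link = link; }
    }

    // Mimics the collecting loop: optionally adds every node, then adds the links.
    static List<Node> collect(List<Node> page, boolean alsoAddEveryNode) {
        List<Node> out = new ArrayList<>();
        for (Node n : page) {
            if (alsoAddEveryNode) out.add(n);      // plays the role of nodeList.add(node)
            if (n instanceof LinkTag) out.add(n);  // plays the role of collectInto
        }
        return out;
    }

    // Casts every element to LinkTag, as the processing loop does.
    static List<String> links(List<Node> nodes) {
        List<String> result = new ArrayList<>();
        for (Node n : nodes) result.add(((LinkTag) n).link);
        return result;
    }

    public static void main(String[] args) {
        List<Node> page = Arrays.asList(new DoctypeTag(), new LinkTag("http://example.com/"));
        System.out.println(links(collect(page, false)));
        try {
            links(collect(page, true));
        } catch (ClassCastException e) {
            System.out.println("ClassCastException on the non-link node");
        }
    }
}
```

With the flag off, the list holds only links and the cast is safe. With it on, the first element is the doctype node and the cast fails, which matches the DoctypeTag named in the reported exception.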
-----Original Message-----
From: ope tomori [mailto:op...@ho...]
Sent: Thursday, March 27, 2003 1:31 PM
To: htm...@li...
Subject: [Htmlparser-user] Re: Htmlparser-user digest, Vol 1 #226 - 2
msgs
I figured out the part using the nodeList.collectInto. My debug output shows
the right output, but when I try to process the link information, I get this
error (this is part of the error):
Exception occurred during event dispatching:
java.lang.ClassCastException: org.htmlparser.tags.DoctypeTag
Thanks in advance for your help
Sincerely,
Ope T.
This is my code below:
try{
    //create the parser with the url to be parsed
    parser = new Parser(urlAddressComplete,new DefaultParserFeedback());
    parser.registerScanners();
    nodeList = new NodeList();

    //to extract all the embedded links and images
    for (NodeIterator e = parser.elements();e.hasMoreNodes();) {
        Node node = (Node)e.nextNode();
        nodeList.add(node);
        //node.collectInto(nodeList,ImageTag.IMAGE_TAG_FILTER);
        node.collectInto(nodeList,LinkTag.LINK_TAG_FILTER);
    }//for

    System.out.print("CHECKING NODES.. " + nodeList.toString()+ "\n");

    //now process the links and images
    //this is the part that doesn't seem to work
    for (SimpleNodeIterator e = nodeList.elements();e.hasMoreNodes();) {
        LinkTag linkTag = (LinkTag)e.nextNode();
        //put the links and their texts into vectors
        allTextLinkVector.addElement(linkTag.getLinkText());
        allLinkVector.addElement(linkTag.getLink());
    }
    // System.out.print( "All Links " + "Size: "+ allTextLinkVector.size() + " "+ allTextLinkVector.toString()+ "\n");
}//inner try
catch (ParserException e) {
    System.err.println("Error, could not create parser object");
    e.printStackTrace();
}//catch
}// outer try
catch(IOException ex) { ex.printStackTrace(); }
>From: htm...@li... Reply-To:=20
>htm...@li... To:=20
>htm...@li... Subject: Htmlparser-user digest, =
Vol=20
>1 #226 - 2 msgs Date: Thu, 27 Mar 2003 12:49:39 -0800
>
>Send Htmlparser-user mailing list submissions to=20
>htm...@li...
>
>To subscribe or unsubscribe via the World Wide Web, visit=20
>https://lists.sourceforge.net/lists/listinfo/htmlparser-user or, via =
email,=20
>send a message with subject or body 'help' to=20
>htm...@li...
>
>You can reach the person managing the list at=20
>htm...@li...
>
>When replying, please edit your Subject line so it is more specific =
than=20
>"Re: Contents of Htmlparser-user digest..."
>
>
>Today's Topics:
>
>1. Help with method --> node.collectInto() (ope tomori) 2. RE: Help =
with=20
>method --> node.collectInto() (Marc Novakowski)
>
>--__--__--
>
>Message: 1 From: "ope tomori" To: htm...@li... =
>Date: Thu, 27 Mar 2003 15:00:17 +0000 Subject: [Htmlparser-user] Help =
with=20
>method --> node.collectInto() Reply-To:=20
>htm...@li...
>
>
>Hi Im trying to use the method node.collectInto(...) to extract =
embedded=20
>links and images on webpages. Im using the latest integration release =
which=20
>means its now Parser, not HTMLParser, nodeIterator, etc and all the =
other=20
>changes.
>
>
>
>I followed the sample code:
>
>HTMLParser parser =3D new HTMLParser("http://www.yahoo.com");=20
>parser.registerScanners(); int i =3D 0; Vector collectionVector =3D new =
>Vector(); HTMLNode node; for (HTMLEnumeration e =3D=20
|
|
From: ope t. <op...@ho...> - 2003-03-27 21:30:53
|
I figured out the part using the nodeList.collectInto. My debug output shows
the right output, but when I try to process the link information, I get this
error (this is part of the error):
Exception occurred during event dispatching:
java.lang.ClassCastException: org.htmlparser.tags.DoctypeTag
Thanks in advance for your help
Sincerely,
Ope T.
This is my code below:
try {
    // create the parser with the URL to be parsed
    parser = new Parser(urlAddressComplete, new DefaultParserFeedback());
    parser.registerScanners();
    nodeList = new NodeList();
    // to extract all the embedded links and images
    for (NodeIterator e = parser.elements(); e.hasMoreNodes();) {
        Node node = (Node) e.nextNode();
        nodeList.add(node);
        //node.collectInto(nodeList, ImageTag.IMAGE_TAG_FILTER);
        node.collectInto(nodeList, LinkTag.LINK_TAG_FILTER);
    } // for
    System.out.print("CHECKING NODES.. " + nodeList.toString() + "\n");
    // now process the links and images
    // this is the part that doesn't seem to work
    for (SimpleNodeIterator e = nodeList.elements(); e.hasMoreNodes();) {
        LinkTag linkTag = (LinkTag) e.nextNode();
        // put the links and their texts into vectors
        allTextLinkVector.addElement(linkTag.getLinkText());
        allLinkVector.addElement(linkTag.getLink());
    }
    // System.out.print("All Links " + "Size: " + allTextLinkVector.size() + " " + allTextLinkVector.toString() + "\n");
} // inner try
catch (ParserException e) {
    System.err.println("Error, could not create parser object");
    e.printStackTrace();
} // catch
} // outer try
catch (IOException ex) { ex.printStackTrace(); }
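A likely cause, judging only from the code above: nodeList.add(node) puts every top-level node into the list, including non-link nodes such as the DoctypeTag named in the exception, so the later cast to LinkTag fails. The sketch below reproduces that pattern with stand-in classes (Node, LinkTag and DoctypeTag here are hypothetical minimal types, not the org.htmlparser classes) and filters with instanceof before casting. With the real library, dropping the nodeList.add(node) call and relying on collectInto alone may be enough, but that has not been verified against the 1.3 integration build.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in types for illustration only; these are NOT the org.htmlparser classes.
interface Node {}

final class DoctypeTag implements Node {}

final class LinkTag implements Node {
    private final String link;
    private final String text;

    LinkTag(String link, String text) {
        this.link = link;
        this.text = text;
    }

    String getLink() { return link; }
    String getLinkText() { return text; }
}

final class LinkFilterSketch {
    // Returns only the LinkTag nodes, so a later cast can never hit a DoctypeTag.
    static List<LinkTag> onlyLinks(List<Node> nodes) {
        List<LinkTag> links = new ArrayList<>();
        for (Node node : nodes) {
            if (node instanceof LinkTag) {
                links.add((LinkTag) node);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        List<Node> mixed = new ArrayList<>();
        mixed.add(new DoctypeTag());
        mixed.add(new LinkTag("http://example.com/", "Example"));
        for (LinkTag link : onlyLinks(mixed)) {
            System.out.println(link.getLinkText() + " -> " + link.getLink());
        }
    }
}
```

The same instanceof guard could be placed directly inside the second loop of the original code if the mixed list has to be kept for debugging.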
|
|
From: Marc N. <ma...@ke...> - 2003-03-27 16:31:00
|
If you can paste the actual code you're trying to compile, I'd be more
than happy to take a look at it.
Marc
|
|
From: ope t. <op...@ho...> - 2003-03-27 15:00:29
|
Hi, I'm trying to use the method node.collectInto(...) to extract embedded
links and images on web pages.
I'm using the latest integration release, which means it's now Parser, not
HTMLParser, NodeIterator, etc., and all the other changes.
I followed the sample code:
HTMLParser parser = new HTMLParser("http://www.yahoo.com");
parser.registerScanners();
int i = 0;
Vector collectionVector = new Vector();
HTMLNode node;
for (HTMLEnumeration e = parser.elements(); e.hasMoreNodes();) {
    node = e.nextHTMLNode();
    node.collectInto(collectionVector, HTMLLinkTag.LINK_TAG_FILTER);
}
// All items in the collection vector should be links
for (Enumeration e = collectionVector.elements(); e.hasMoreElements();) {
    HTMLLinkTag linkTag = (HTMLLinkTag) e.nextElement();
    // you can now process the links as you like
}
***********************************************************
I'm getting an error because this line:
node.collectInto(collectionVector,HTMLLinkTag.LINK_TAG_FILTER);
requires a NodeList and not a Vector. I've tried changing it, creating a
NodeList instead of a Vector, without any success.
Can you please help me?
Thanks,
Ope
|
|
From: Marc N. <ma...@ke...> - 2003-03-24 23:23:45
|
Somik,
Thanks for fixing 702614! Unfortunately I can't seem to get the latest
build to work. It's throwing an OOM exception in my own code when using
the NodeIterator returned by parser.elements(). I'm looking into this
to make sure I'm not doing something stupid in my code. However, the
library seems to be acting differently than previous releases even
out-of-the-box. For example, the following used to return a list of the
links on Yahoo (in the 0302 release):
java -jar ./htmlparser.jar http://www.yahoo.com -l
In the 0323 release, however, it returns nothing.
Marc
|
|
From: mohammad a. <re...@em...> - 2003-03-24 11:45:15
|
I didn't mean to jump on you or anyone else, or even to complain. I totally understand your situation; it's the same for me, I only have my free time to work on my personal projects.
What I meant was that this kind of bug, if you can call it a bug, should be easier and faster to fix, but that's only how I see it; it may be more complicated. What I've understood from the source code is that this only happens once, when meta-scanning starts, and therefore it should be easy to fix so that the meta tag can use different charsets.
When I say "stupid bug", I mean it shouldn't be there at all. I can't understand why the designers and developers would assume every page uses ISO charsets when there are so many others. But that's just my opinion.
I hope you didn't misunderstand me about the "put everything down and fix the bug" thing; it's just that I see it as easy to fix, and it really would help me.
I've seen a new integration release, but what a disappointment that the charset bug is not fixed! I hope everyone has noticed the bug report for the META charset bug.
As I said before, my solution was just temporary and is not a good one, for two reasons: I don't have enough skill in this matter to come up with good solutions, and I haven't yet checked through the whole code, which I consider important before suggesting fixes.
I hope the bug report is enough to fix the problem.
rezamotori, Sweden
|
|
From: Somik R. <so...@ya...> - 2003-03-24 01:22:13
|
Hi Folks,
This week's integration release has two important fixes :
Integration build 1.3 - 20030323
--------------------------------
[1] Fixed bug 702547 - single quotes parsed more robustly now
[2] Fixed bug 702614 - empty tags handled correctly now. Tag now has a
method isEmptyXmlTag().
#2 refers to tags like <tag/>.
Thanks to Joe Robbins for a fine bug report that helped in putting in the
fix for #1 faster. Thanks also to Marc Novakowski for the other report.
Thanks are also due to Huang-Chun Yu for uncovering a serious bug with the
script scanning mechanism. The parser can currently handle script tags like
:
<script>
<!--
code here
-->
</script>
But when the tags are like:
<script>
code here
</script>
the parser is unable to identify the code and treats it like regular tags.
Such pages are quite widespread and ought to be supported. I was curious if
anyone has ideas on solving this - given the existing design - fresh ideas
often lead to a better perspective. If you have some ideas, feel free to
join the developer list
(http://lists.sourceforge.net/lists/listinfo/htmlparser-developer) and post.
Regards,
Somik
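One possible direction for the script problem described above, offered only as a sketch and not as a description of how the library's scanners work: once an opening <script ...> tag is seen, stop tokenizing and take everything up to the next </script> as raw text, whether or not it is wrapped in <!-- -->. The names below (ScriptBodySketch, scriptBodies, indexOfIgnoreCase) are hypothetical, and the code uses plain string searching only.

```java
import java.util.ArrayList;
import java.util.List;

final class ScriptBodySketch {
    // Case-insensitive indexOf that works on the original string,
    // so the returned index is always valid for substring().
    static int indexOfIgnoreCase(String text, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    // Collects the raw text between each <script ...> and the next </script>.
    // The body is never tokenized as HTML, so "a < b" inside it is harmless.
    static List<String> scriptBodies(String html) {
        List<String> bodies = new ArrayList<>();
        int from = 0;
        while (true) {
            int open = indexOfIgnoreCase(html, "<script", from);
            if (open < 0) break;
            int openEnd = html.indexOf('>', open);
            if (openEnd < 0) break;                 // unterminated opening tag
            int close = indexOfIgnoreCase(html, "</script", openEnd + 1);
            if (close < 0) break;                   // no closing tag: stop
            bodies.add(html.substring(openEnd + 1, close));
            from = close + "</script".length();
        }
        return bodies;
    }

    public static void main(String[] args) {
        System.out.println(scriptBodies("<script>\ncode here\n</script>"));
    }
}
```

This deliberately ignores edge cases such as a literal "</script" inside a JavaScript string or a tag name that merely starts with "script"; a real scanner would need to decide how to handle those.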
|
|
From: Somik R. <so...@ya...> - 2003-03-23 16:41:07
|
mohammad azadi wrote:
> I really think it's an stupid bug that all pages must use ISO-charset! cant u just fix the damn thing and make it as a patch so we can continue with our work??
If you're objecting to my request to file a bug report, then please note that I cannot devote weekdays to the project, only my personal time on weekends. And when I do get the time, I prefer not to search all emails on the user list to find what bugs need to be tackled.
As far as the bug in question being stupid: all bugs are stupid, it's just that one person does not have the time to find them all, and code is often written by more than one person. There are also development priorities; certain bugs take precedence, in my opinion, which I often base on feedback. Since this is not a paid project, you cannot expect me or any other developer to jump on an incomplete bug report; the least we expect is for the community to help out. However, if a certain bug hurts you and needs fixing, you could always make a polite request. Or solve it yourself and give it to the community, for which all of us will be grateful.
> my suggestion is to have an String[] containing all the common charsets, and enable it to expand for new charsets.
> I don't think it should take long to fix it, i've tried myself, but it just was a temperary fix.
Thank you for the suggestion. Perhaps you can give us the patch in question. And just so you don't think I am being sarcastic, I'd be happy to have you on our developer team; anyone who wants to improve the system earns a right to be on the dev team.
In general, I think it will be good to have guidelines for posting questions to make us a more effective community. I try to follow Eric Raymond's well-written paper:
http://www.catb.org/%7Eesr/faqs/smart-questions.html
Regards,
Somik
|
|
From: mohammad a. <re...@em...> - 2003-03-23 14:09:14
|
I really think it's a stupid bug that all pages must use an ISO charset! Can't you just fix the damn thing and release it as a patch so we can continue with our work?
My suggestion is to have a String[] containing all the common charsets, and make it possible to extend it with new charsets.
I don't think it should take long to fix. I've tried myself, but it was just a temporary fix.
Rezamotori, Sweden
|
|
From: Somik R. <so...@ya...> - 2003-03-21 19:48:40
|
You should be able to suppress all the feedback. Check
http://htmlparser.sourceforge.net/docs/index.php/FeedbackMechanism
Regards,
Somik
|
|
From: Sean_Syslab <se...@sy...> - 2003-03-21 19:14:46
|
Sorry, I misunderstood the return strings. The WARNING messages are
not within the return strings of the methods, but are shown after that.
Dear all:
When I used the sample program to extract links or strings, there were
sometimes WARNING messages shown within the return strings. I don't want
those WARNING strings accompanied with the return value. What should I
do...
Yours, Sean
|
|
From: Sean_Syslab <se...@sy...> - 2003-03-21 18:30:17
|
Dear all:
When I used the sample program to extract links or strings, there were
sometimes WARNING messages shown within the return strings. I don't want
those WARNING strings accompanied with the return value. What should I
do...
Yours, Sean
|
|
From: Somik R. <so...@ya...> - 2003-03-21 17:39:17
|
To log in to SourceForge, you need to have a SourceForge ID. Get one from
http://sourceforge.net/account/register.php
Regards
Somik
|
|
From: Aminudin K. <ami...@mi...> - 2003-03-21 00:48:57
|
Can somebody else help me file this bug? I could not log in to SourceForge.
Thanks :)
--
Mohd. Aminudin bin Mohd. Khalid
Linux Programmer
Asian Open Source Centre (http://www.asiaosc.org)
Mimos Berhad (http://www.mimos.my)
|
|
From: Sean_YZU90 <s9...@ma...> - 2003-03-20 17:15:59
|
The member who posted about the compilation problem set a correct classpath, I think. The problem is that he used the latest version of htmlparser, which doesn't contain the class HtmlNode... . So he should use htmlparser 1.2; then the sample LinkExtractor.java can be compiled correctly.
Yours, Sean
|
|
From: Somik R. <so...@ya...> - 2003-03-19 06:26:08
|
Sounds like a bug. Can you file a bug report at
http://htmlparser.sourceforge.net
Regards,
Somik
|
|
From: Aminudin K. <ami...@mi...> - 2003-03-18 02:41:56
|
>It will help if you can post the stack trace.
>
I dunno how to do that.
Well, I think the error comes from htmlparser.jar. Simply parse a
file that contains the following code and you will notice the "error".
Actually there is no error; it just doesn't parse the file correctly.
OK, I have a file ( thisfile.html) . Below is HTML code inside
thisfile.html .
<html>
<head>
<meta http-equiv="content-type" content="text/html;
charset=windows-1252">
</head>
</html>
Try to parse thisfile.html with htmlparser.jar .
java -jar htmlparser.jar thisfile.html
Below is the only output, (It doesn't detect html code ???? ):
HTMLParser v1.3 (Integration Build Mar 16, 2003)
INFO: file://localhost/thisfile.html
Parsing file://localhost/thisifle.html
|
|
From: Somik R. <so...@ya...> - 2003-03-17 21:35:52
|
It will help if you can post the stack trace.
Regards
Somik
|
|
From: Aminudin K. <ami...@mi...> - 2003-03-17 08:46:01
|
I have a problem parsing HTML code that contains the following META tag.
<html>
<head>
<meta http-equiv="content-type" content="text/html;
charset=windows-1252">
</head>
</html>
I wrote a visitor class to parse several web sites, but it failed to
parse this kind of HTML code. I also tried ( java -jar htmlparser.jar
thisfile.html ); it also failed.
I guess it couldn't read the *http-equiv="content-type"*
Any idea?
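For illustration, here is a minimal sketch of how the charset named in a META tag like the one above could be read and checked against what the JVM supports. It is not the library's code: the class and method names (MetaCharsetSketch, extractCharsetName, resolve) are hypothetical, and falling back to a caller-supplied default when the name is missing or unknown is an assumption. It uses the JDK's Charset.isSupported rather than a hand-maintained list of charset names.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

final class MetaCharsetSketch {
    // Pulls the value after "charset=" out of a content attribute such as
    // "text/html; charset=windows-1252"; returns null when none is declared.
    static String extractCharsetName(String content) {
        for (String part : content.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim();
                // Tolerate quoted values such as charset="utf-8".
                if (name.length() >= 2
                        && (name.charAt(0) == '"' || name.charAt(0) == '\'')
                        && name.charAt(name.length() - 1) == name.charAt(0)) {
                    name = name.substring(1, name.length() - 1).trim();
                }
                return name.isEmpty() ? null : name;
            }
        }
        return null;
    }

    // Maps the declared name to a Charset, or returns the fallback when the
    // name is absent, syntactically invalid, or not supported by this JVM.
    static Charset resolve(String content, Charset fallback) {
        String name = extractCharsetName(content);
        if (name == null) return fallback;
        try {
            return Charset.isSupported(name) ? Charset.forName(name) : fallback;
        } catch (IllegalCharsetNameException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        String content = "text/html;\ncharset=windows-1252";
        System.out.println(extractCharsetName(content));
        System.out.println(resolve(content, StandardCharsets.ISO_8859_1));
    }
}
```

Whether a given name such as windows-1252 resolves depends on the charsets installed in the running JVM, which is why the sketch asks the platform instead of assuming.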
|
|
From: Somik R. <so...@ya...> - 2003-03-16 21:36:46
|
Hi Folks,
This is a major milestone release. A massive refactoring has been
completed (took two weeks) - which has brought all the robust error handling
cases into CompositeTagScanner. This means, all tags that have children will
be able to do error correction uniformly. Form tag (and table tags too)
should be robust.
Table tags are not yet in the standard set of scanners (you still need
to add them manually). They should make the cut next week.
We have a new method - registerDomScanners() in Parser - that allows you
to build html dom objects.
Interesting fact, as a result of the refactorings, the LOC of the
scanners package has reduced from 1553 to 1355 (I was surprised at the
digits).
Documentation has been updated - we've started putting up answers by our
list members to common questions. Pls feel free to update the Wiki and
improve it. No login is required.
From the change log:
Integration build 1.3 - 20030316
--------------------------------
[1] Added method finishedParsing() to NodeVisitor
[2] LinkScanner uses CompositeTagScanner.scan()
[3] BulletScanner added
[4] FormScanner uses CompositeTagScanner.scan()
[5] AppletScanner uses CompositeTagScanner.scan()
We highly recommend an upgrade to this version.
Regards,
Somik
|
|
From: Derrick O. <Der...@ro...> - 2003-03-15 20:59:22
|
Guilherme,
I think what you need is in
src/org/htmlparser/util/Translate.java
Something like this should work:
String htmltext = Translate.encode (resultset.getString ("databasetext"));
If you have to do a lot of it though, you'll probably want to rewrite
that method.
As it stands it allocates one Character for each character in the input
string.
If you do want to rewrite it, you should probably instead adjust the
Generate class
in the same package since the Translate.java source is created by
running Generate.
Derrick
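For reference, a minimal sketch of the kind of encoding discussed here, written without the library: it escapes &, <, > and the double quote in a single pass using one StringBuilder, so no per-character objects are allocated. The class and method names (HtmlEscapeSketch, escape) are hypothetical; this is not the source of Translate.encode, which may handle many more characters.

```java
final class HtmlEscapeSketch {
    // Escapes the four characters that matter inside HTML text and
    // double-quoted attribute values; everything else is copied unchanged.
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a text with < won't work in a html"));
    }
}
```

The ampersand is handled as well because text that already contains something like &lt; would otherwise be shown decoded in the textarea.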
>To: htm...@li...
>Date: Fri, 14 Mar 2003 20:40:12 +0000 (WET)
>From: Guilherme Zambon <gz...@sa...>
>Subject: [Htmlparser-user] html code parsing
>Reply-To: htm...@li...
>
>Anyone using htmlparser to parse ", <, > from user input to
>&quot;, &lt; and &gt; ?
>I have the following scenario:
>my database has texts with these chars (",< and >) and I have to
>put them from database to a <textarea> in the html. Is there any
>taglib or other solution to I filter this database information,
>to show in a html form field?
>
>Thanks in advance,
>
>Guilherme Zambon
>
>Example of code that I need to threat:
>
><textarea><%= rs.getString("databasetext") %></textarea>
>
>it generates something like
><textarea>a text with < won't work in a html</textarea>
>
>and I want something like
><textarea><sometag:encode string="<%=
>rs.getString("databasetext")" /></textarea>
>
|