htmlparser-user Mailing List for HTML Parser
Brought to you by:
derrickoswald
|
From: Derrick O. <Der...@ro...> - 2003-02-27 02:54:12
|
If I recall correctly, implementing this feature would require deferring
not only the connect but also the determination of the character set
(from the header returned by the connect) and creation of the reader
(because it needs the character set, and an input stream) until
elements() is called. elements() would need to check for a null reader
and do the work. Then getReader() and getEncoding() would also have to
handle a null reader or null character_set too. Are there other subtleties?
Maybe tricky, but probably do-able. I think all the constructors have
test cases.
But then, all that's really being saved is the user coding:
Parser parser = new Parser ("http://yadda");
URLConnection connection = parser.getConnection ();
...process the connection as appropriate
... parser.elements ()
instead of:
URL url = new URL ("http://yadda");
URLConnection connection = url.openConnection ();
...process the connection as appropriate
Parser parser = new Parser (connection);
... parser.elements ()
So it's probably not really worth the convoluted coding, unless I'm
missing something in the use-case.
Derrick
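The deferral Derrick describes (leave the reader null at construction, then have every accessor check for null and do the work on first use) can be sketched roughly as below. This is only an illustration of the pattern, not htmlparser code: `LazySource`, `getReader()`, `isConnected()` and `openCount()` are hypothetical names, and the `Supplier` stands in for the connect, character-set detection and reader creation steps.

```java
import java.io.Reader;
import java.util.function.Supplier;

/** Hypothetical stand-in for a parser that defers opening its input. */
final class LazySource {
    private final Supplier<Reader> opener; // connect + charset detection + reader creation
    private Reader reader;                 // stays null until first needed
    private int opens;                     // how many times the opener actually ran

    LazySource(Supplier<Reader> opener) {
        this.opener = opener;              // nothing is opened here
    }

    /** Every accessor funnels through here, so the null check lives in one place. */
    Reader getReader() {
        if (reader == null) {
            reader = opener.get();
            opens++;
        }
        return reader;
    }

    boolean isConnected() {
        return reader != null;
    }

    int openCount() {
        return opens;
    }
}
```

Constructing `new LazySource(() -> new java.io.StringReader("<html></html>"))` opens nothing; the first `getReader()` call runs the opener and later calls reuse the same reader. The cost Derrick points out is visible even here: any method that touches the reader or the character set has to go through the null-checking accessor.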
htm...@li... wrote:
>
>Also, on another note, if I try to initialize the
>parser directly, I am unable to work with the
>URLConnection. For example:
>
> HttpURLConnection urlConn = null;
> HTMLParser parser = new HTMLParser("http://somedomain/somepath");
> urlConn = (HttpURLConnection)parser.getConnection();
> urlConn.setDoInput(true);
> // ...
>
>This code throws an exception because the HTTP request
>has already been made.
>
>Exception in thread "main" java.lang.IllegalAccessError: Already connected
> at java.net.URLConnection.setDoInput(URLConnection.java:677)
>
>--- Bob Lewis <bob...@ya...> wrote:
>
>
<snip>
>
>--__--__--
>
>Message: 3
>From: "Somik Raha" <so...@ya...>
>To: <htm...@li...>
>Subject: Re: [Htmlparser-user] Malformed Input Exception
>Date: Tue, 25 Feb 2003 22:46:16 -0800
>Reply-To: htm...@li...
>
>That sounds like a good feature request. Derrick ->what do you think ?
>
>Regards,
>Somik
|
|
From: Mohd-Taqiyuddin Z. <mt...@ec...> - 2003-02-27 00:21:15
|
Hi there,

I think a form tag should end when the parser sees another form tag, even though that is not an end tag. Another way to determine where a form tag ends is to check whether the end of the HTML page has been reached, by looking for the end tag of the html tag.

The reasoning is that a form tag consists of input tags, and the important information about a form is its method, its action, and its input tags. So when the parser first sees a form tag, it should keep parsing nodes until it sees the end tag of the form tag, another form tag, or the end of the HTML document. That way we can logically group the Vector of input tags and the other attributes with the appropriate form tag (if there is more than one form tag).

I hope my explanation can help us improve htmlparser. Thank you.

Quoting Somik Raha <so...@ya...>:
> This is a known limitation. The problem is in guessing
> when a form tag really should have ended. Can you
> suggest something looking at the page that failed ?
>
> Regards,
> Somik
<snip>
|
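The termination rule proposed in this message (a form ends at its end tag, at the next form tag, or at the end of the document) can be sketched as follows. This is a simplified illustration and not htmlparser's scanner code: `FormGrouper`, `Kind`, `Token` and `groupInputs` are hypothetical names, and the token list stands in for the parser's node stream.

```java
import java.util.ArrayList;
import java.util.List;

final class FormGrouper {
    /** Simplified token kinds; real htmlparser nodes carry far more detail. */
    enum Kind { FORM_OPEN, FORM_CLOSE, INPUT, OTHER }

    static final class Token {
        final Kind kind;
        final String text;

        Token(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    /**
     * Groups INPUT tokens under the form they belong to. A form ends at
     * FORM_CLOSE, at the next FORM_OPEN, or when the token list runs out.
     */
    static List<List<String>> groupInputs(List<Token> tokens) {
        List<List<String>> forms = new ArrayList<>();
        List<String> current = null; // null means "not inside a form"
        for (Token t : tokens) {
            switch (t.kind) {
                case FORM_OPEN:
                    // A previous form that was never closed implicitly ends here.
                    current = new ArrayList<>();
                    forms.add(current);
                    break;
                case FORM_CLOSE:
                    current = null;
                    break;
                case INPUT:
                    if (current != null) {
                        current.add(t.text);
                    }
                    break;
                default:
                    break; // other tags do not affect grouping
            }
        }
        // Reaching the end of the list ends any form that is still open.
        return forms;
    }
}
```

With this rule a page that omits `</form>` still yields its inputs, because the loop simply stops at the end of the input instead of searching for an end tag that never arrives.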
|
From: Somik R. <so...@ya...> - 2003-02-26 18:05:06
|
This is a known limitation. The problem is in guessing
when a form tag really should have ended. Can you
suggest something looking at the page that failed ?

Regards,
Somik

--- Mohd-Taqiyuddin Zalfan <mt...@ec...> wrote:
<snip>
|
|
From: Mohd-Taqiyuddin Z. <mt...@ec...> - 2003-02-26 16:34:06
|
Hi,

I'm writing a harvester to collect the information in form tags. It works fine on every HTML page I need to parse except for this URL:
http://developer.java.sun.com/developer/Quizzes/misc/earlyadopterjxta.html

It seems that the page that gives the error does not have an end tag for the form tag, and the parser loops back trying to find it. Is this a bug? Do you know of a way to parse the page and still get the Vector of FormInput for further processing? I hope you can help me with this. Below is the generated error.

"
ERROR: HTMLReader.readElement() : Error occurred while trying to decipher the tag using scanners
    Tag being processed : FORM
    Current Tag Line : <form action="earlyadopterjxtaanswers.jsp" method="POST">
    at Line 690 : null
    Previous Line 689 : </HTML>
ERROR: HTMLReader.readElement() : Error occurred while trying to read the next element,
    at Line 690 : null
    Previous Line 689 : </HTML>
ERROR: Unexpected Exception occurred while reading http://developer.java.sun.com/developer/Quizzes/misc/earlyadopterjxta.html, in nextHTMLNode
    at Line 690 : null
    Previous Line 689 : </HTML>
org.htmlparser.util.ParserException: Unexpected Exception occurred while reading http://developer.java.sun.com/developer/Quizzes/misc/earlyadopterjxta.html, in nextHTMLNode
    at Line 690 : null
    Previous Line 689 : </HTML>"
|
|
From: Bob L. <bob...@ya...> - 2003-02-26 16:16:41
|
Hi,
I tried this, as you suggested, and received the same
Exception while reading the InputStream. Which led me
to discover that I was setting the wrong character set
in the InputStreamReader.
My app was erroneously using the system default
character set (UTF8 in this case), but the actual
stream was using ISO-8859-1.
The getCharset and getCharacterSet methods in Parser
are very useful here. You may want to consider making
them static and public, or moving them to a Utility
class. That way they can be used by applications
which construct their own Readers.
Thanks for the help,
Bob Lewis
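For applications that build their own Reader, the fix Bob describes amounts to taking the character set from the response's Content-Type header instead of the platform default. A minimal sketch, assuming the header value is obtained from `URLConnection.getContentType()`: `CharsetSniffer`, `charsetFromContentType` and `openReader` are hypothetical names, not the Parser methods mentioned above, and the ISO-8859-1 fallback is an assumption.

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class CharsetSniffer {
    /**
     * Extracts the charset parameter from a Content-Type value such as
     * "text/html; charset=ISO-8859-1". Returns the fallback when the header
     * is missing, has no charset parameter, or names an unknown charset.
     */
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim();
                // Some servers quote the value: charset="utf-8"
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException unknownOrIllegalName) {
                    return fallback;
                }
            }
        }
        return fallback;
    }

    /** Wraps the stream in a Reader that decodes with the declared charset. */
    static Reader openReader(InputStream in, String contentType) {
        Charset cs = charsetFromContentType(contentType, StandardCharsets.ISO_8859_1);
        return new InputStreamReader(in, cs);
    }
}
```

In Bob's setup this would be called as `CharsetSniffer.openReader(urlConn.getInputStream(), urlConn.getContentType())` before handing the reader to the parser. Pages that declare their encoding only in a meta tag are not covered by this sketch.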
--- Somik Raha <so...@ya...> wrote:
> Hi Bob,
> Can you try this - get the data from the url in
> question into a file
> (using a post request). Then try to parse the file.
> If it breaks, we would
> know why.
>
> Regards,
> Somik
|
|
From: Somik R. <so...@ya...> - 2003-02-26 06:44:52
|
That sounds like a good feature request. Derrick -> what do you think?
Regards,
Somik
----- Original Message -----
From: "Bob Lewis" <bob...@ya...>
To: <htm...@li...>
Sent: Tuesday, February 25, 2003 12:20 PM
Subject: Re: [Htmlparser-user] Malformed Input Exception
> Sorry, there was a typo in my last message:
>
> > while (parser.hasMoreNodes())
> > {
> > // ... Do Something
> > }
>
> should be
>
> while (tags.hasMoreNodes())
> {
> // ... Do Something
> }
>
> Also, on another note, if I try to initialize the
> parser directly, I am unable to work with the
> URLConnection. For example:
>
> HttpURLConnection urlConn = null;
> HTMLParser parser = new HTMLParser("http://somedomain/somepath");
> urlConn = (HttpURLConnection)parser.getConnection();
> urlConn.setDoInput(true);
> // ...
>
> This code throws an exception because the HTTP request
> has already been made.
>
> Exception in thread "main" java.lang.IllegalAccessError: Already connected
> at java.net.URLConnection.setDoInput(URLConnection.java:677)
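The "Already connected" failure quoted in this message comes from a general rule of `java.net.URLConnection`: settings such as `setDoInput` must be applied before the connection is made. The sketch below reproduces the rule against a temporary local file so that no network is needed; `ConnectOrderDemo` and `configureAfterConnectFails` are hypothetical names. Current JDKs throw `IllegalStateException` here, while the trace in the thread shows `IllegalAccessError`, so the sketch catches both.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URLConnection;

final class ConnectOrderDemo {
    /**
     * Returns true if changing a setting after connect() is rejected.
     * A temporary file URL is used purely to avoid network access.
     */
    static boolean configureAfterConnectFails() {
        try {
            File tmp = File.createTempFile("connect-order", ".html");
            tmp.deleteOnExit();
            URLConnection conn = tmp.toURI().toURL().openConnection();
            conn.setDoInput(true); // allowed: not connected yet
            conn.connect();        // from here on the settings are frozen
            try {
                conn.setDoInput(true);
                return false;      // the late change was accepted
            } catch (IllegalStateException | IllegalAccessError alreadyConnected) {
                return true;       // the late change was rejected
            } finally {
                conn.getInputStream().close(); // release the opened file
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

This is why a parser that connects inside its constructor leaves the caller no chance to set headers, cookies or POST data afterwards; the caller has to open and configure the connection first and pass it in.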
|
|
From: Somik R. <so...@ya...> - 2003-02-26 06:44:02
|
Hi Bob,
Can you try this - get the data from the url in question into a file
(using a post request). Then try to parse the file. If it breaks, we would
know why.
Regards,
Somik
----- Original Message -----
From: "Bob Lewis" <bob...@ya...>
To: <htm...@li...>
Sent: Tuesday, February 25, 2003 12:07 PM
Subject: Re: [Htmlparser-user] Malformed Input Exception
>
> I tried using the parser directly, as you suggested,
> and it seems to work. However, I need to be able to work
> with the URLConnection to set headers, cookies and
> send POST data.
>
> Typically, this is what I'm doing:
>
> //create and initialize the URL Connection
> HttpURLConnection urlConn = null;
> URL url = new URL("http://somedomain/somepath");
> urlConn = (HttpURLConnection)url.openConnection();
> urlConn.setDoInput(true);
> urlConn.setDoOutput(true);
> urlConn.setUseCaches(false);
> urlConn.setAllowUserInteraction(false);
> urlConn.setRequestMethod("POST");
>
> // ... usually many HTTP Headers and cookie values set
> urlConn.setRequestProperty("someHeader", "someValue");
> urlConn.setRequestProperty("anotherHeader", "anotherValue");
>
> StringBuffer postData = new StringBuffer();
> // ... generate post data in buffer
>
> //Send the post data
> PrintWriter printWriter = new PrintWriter(urlConn.getOutputStream());
> printWriter.println(postData.toString());
> printWriter.close();
>
> //parse the response
> HTMLEnumeration tags = parser.elements();
>
> while (parser.hasMoreNodes())
> {
>     // ... Do Something
> }
>
> This works fine on most URLs. I am normally able to
> execute the server-side web application, obtain and
> parse the HTML response. However, in the case of
> these two URLs, I get the MalformedInputException.
>
> Is there something I'm missing?
>
> Thanks,
>
> Bob Lewis
>
> --- Somik Raha <so...@ya...> wrote:
>
> >Date: 2003-02-24 21:33
> >Sender: somik
> >Logged In: YES
> >user_id=187944
> >
> >I ran the parser on these pages and it worked fine. Try
> >runParser.bat http://www.flytango.com/en/index.html.
> >
> >It could be that you have initialized your urlconnection
> >incorrectly. Have you tried using the parser directly, like:
> >
> >HTMLParser parser = new HTMLParser("http://www.flytango.com/en/index.html");
> >for (NodeIterator i = parser.elements(); i.hasMoreNodes();) {
> >    System.out.println(i.nextNode().toHtml());
> >}
>
> --- Somik Raha <so...@ya...> wrote:
> > Hi Bob,
> > Sounds like a bug.
> > Can you file a bug report at
> > http://htmlparser.sourceforge.net?
> >
> > Regards,
> > Somik
> > --- Bob Lewis <bob...@ya...> wrote:
> > > Hi,
> > >
> > > I am trying to use htmlparser 1.3 to parse the HTML at
> > > http://www.flytango.com/en/taschedule.html and
> > > http://www.flytango.com/en/index.html. When I attempt
> > > to parse these pages, I get a
> > > sun.io.MalformedInputException:
> > >
> > > sun.io.MalformedInputException
> > > at sun.io.ByteToCharUTF8.convert(ByteToCharUTF8.java:105)
> > > at java.io.InputStreamReader.convertInto(InputStreamReader.java:132)
> > > at java.io.InputStreamReader.fill(InputStreamReader.java:181)
> > > at java.io.InputStreamReader.read(InputStreamReader.java:244)
> > > at java.io.BufferedReader.fill(BufferedReader.java:134)
> > > at java.io.BufferedReader.readLine(BufferedReader.java:294)
> > > at java.io.BufferedReader.readLine(BufferedReader.java:357)
> > > at org.htmlparser.HTMLReader.getNextLine(HTMLReader.java:139)
> > > at org.htmlparser.HTMLReader.readElement(HTMLReader.java:176)
> > > at org.htmlparser.util.HTMLEnumerationImpl.peek(HTMLEnumerationImpl.java:60)
> > > at org.htmlparser.util.HTMLEnumerationImpl.hasMoreNodes(HTMLEnumerationImpl.java:91)
> > >
> > > Now, if I copy the source of these pages from a
> > > browser into a file and put them on my own webserver,
> > > I can parse them without any errors.
> > >
> > > It's my guess that there is some strange control
> > > character in the source that is causing the exception,
> > > but I'm not entirely sure. Any suggestions? If it is
> > > a bad character, would it be possible to add code to
> > > HTMLReader that strips offending characters from the
> > > input stream?
> > >
> > > Here is the code I am using to parse:
> > >
> > > DefaultHTMLParserFeedback feedback
> > >     = new DefaultHTMLParserFeedback(DefaultHTMLParserFeedback.DEBUG);
> > >
> > > HTMLReader reader = null;
> > > HTMLParser parser = null;
> > > InputStreamReader isr
> > >     = new InputStreamReader(urlConn.getInputStream());
> > > reader = new HTMLReader(isr, 8192);
> > > parser = new HTMLParser(reader, feedback);
> > > boolean inForm = false;
> > >
> > > parser.addScanner(new HTMLInputTagScanner());
> > >
> > > HTMLEnumeration tags = parser.elements();
> > >
> > > RequestParameters params = new RequestParameters();
> > >
> > > while (tags.hasMoreNodes())
> > > {
> > >     ...
> > > }
> > >
> > >
> > > Thanks,
> > >
> > > Bob Lewis
> > >
> > >
> > > __________________________________________________
> > > Do you Yahoo!?
> > > Yahoo! Tax Center - forms, calculators, tips, more
> > > http://taxes.yahoo.com/
> > >
> > >
> > >
> >
> -------------------------------------------------------
> > > This sf.net email is sponsored by:ThinkGeek
> > > Welcome to geek heaven.
> > > http://thinkgeek.com/sf
> > > _______________________________________________
> > > Htmlparser-user mailing list
> > > Htm...@li...
> > >
> >
> https://lists.sourceforge.net/lists/listinfo/htmlparser-user
|
|
From: Bob L. <bob...@ya...> - 2003-02-25 20:20:39
|
Sorry, there was a typo in my last message:
> while (parser.hasMoreNodes())
> {
> // ... Do Something
> }
should be
while (tags.hasMoreNodes())
{
// ... Do Something
}
Also, on another note, if I try to initialize the
parser directly, I am unable to work with the
URLConnection. For example:
HttpURLConnection urlConn = null;
HTMLParser parser = new HTMLParser("http://somedomain/somepath");
urlConn = (HttpURLConnection)parser.getConnection();
urlConn.setDoInput(true);
// ...
This code throws an exception because the HTTP request
has already been made.
Exception in thread "main" java.lang.IllegalAccessError: Already connected
at java.net.URLConnection.setDoInput(URLConnection.java:677)
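The "Already connected" failure can be reproduced without htmlparser at all, which suggests it comes from java.net.URLConnection itself: its setters are only accepted before the connection is opened. The sketch below is an illustration under assumptions, not code from the library. A temporary local file stands in for the remote page, and the class name AlreadyConnectedDemo is made up for the example. Recent JDKs report the problem as IllegalStateException, while the trace above shows IllegalAccessError, so the sketch catches both.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URLConnection;

public class AlreadyConnectedDemo {
    /** Returns true if setDoInput() is rejected once the connection has been opened. */
    static boolean setterFailsAfterConnect() {
        try {
            // Assumption for the demo: a temporary local file replaces the remote page.
            File page = File.createTempFile("page", ".html");
            page.deleteOnExit();

            URLConnection conn = page.toURI().toURL().openConnection();
            conn.setDoInput(true);   // accepted: the connection is not open yet
            conn.connect();          // after this call the request settings are fixed

            try {
                conn.setDoInput(true);
                return false;        // the setter was accepted after connect()
            } catch (IllegalStateException | IllegalAccessError e) {
                return true;         // the setter was rejected, as in the trace above
            } finally {
                conn.getInputStream().close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("setter rejected after connect: " + setterFailsAfterConnect());
    }
}
```

If this is what happens here, headers, cookies and POST data have to be set on a connection you open yourself, before anything reads from it, which is the approach in the earlier message that hands urlConn.getInputStream() to an HTMLReader.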
|
|
From: Bob L. <bob...@ya...> - 2003-02-25 20:07:38
|
I tried using the parser directly, as you suggested,
and it seems to work. However, I need to be able to work
with the URLConnection to set headers, cookies and
send POST data.
Typically, this is what I'm doing:
//create and initialize the URL Connection
HttpURLConnection urlConn = null;
URL url = new URL("http://somedomain/somepath");
urlConn = (HttpURLConnection)url.openConnection();
urlConn.setDoInput(true);
urlConn.setDoOutput(true);
urlConn.setUseCaches(false);
urlConn.setAllowUserInteraction(false);
urlConn.setRequestMethod("POST");
// ... usually many HTTP Headers and cookie values set
urlConn.setRequestProperty("someHeader", "someValue");
urlConn.setRequestProperty("anotherHeader", "anotherValue");
StringBuffer postData = new StringBuffer();
// ... generate post data in buffer
//Send the post data
PrintWriter printWriter = new PrintWriter(urlConn.getOutputStream());
printWriter.println(postData.toString());
printWriter.close();
//parse the response
HTMLEnumeration tags = parser.elements();
while (parser.hasMoreNodes())
{
// ... Do Something
}
This works fine on most URLs. I am normally able to
execute the server-side web application, obtain and
parse the HTML response. However, in the case of
these two URLs, I get the MalformedInputException.
Is there something I'm missing?
Thanks,
Bob Lewis
--- Somik Raha <so...@ya...> wrote:
>Date: 2003-02-24 21:33
>Sender: somik
>Logged In: YES
>user_id=187944
>
>I ran the parser on these pages and it worked fine. Try
>runParser.bat http://www.flytango.com/en/index.html.
>
>It could be that you have initialized your urlconnection
>incorrectly. Have you tried using the parser directly, like:
>
>HTMLParser parser = new HTMLParser("http://www.flytango.com/en/index.html");
>for (NodeIterator i = parser.elements(); i.hasMoreNodes();) {
>  System.out.println(i.nextNode().toHtml());
>}
|
|
From: Somik R. <so...@ya...> - 2003-02-24 18:29:52
|
Hi Bob,
Sounds like a bug.
Can you file a bug report at
http://htmlparser.sourceforge.net?
Regards,
Somik
|
|
From: Somik R. <so...@ya...> - 2003-02-24 18:12:00
|
I was trying to integrate the changes of the latest parser with some
existing projects at work, and of course I had to modify the code to
use the new API. I have some suggestions, as I know many of you will be
facing the same issue.
I use Eclipse, and I hope most of you use a decent IDE that supports
refactoring. Get the parser into your IDE, and let all your other
project code refer to it (that's how it is set up in my IDE). Then
rename Parser to HTMLParser using your refactoring tool. Rename it back
to Parser, and all your existing code will automatically get fixed. Do
this for some other classes like HTMLNode/Node, etc., and within
minutes it should be done.
Regards,
Somik
|
|
From: Bob L. <bob...@ya...> - 2003-02-24 14:49:22
|
Hi,
I am trying to use htmlparser 1.3 to parse the HTML at
http://www.flytango.com/en/taschedule.html and
http://www.flytango.com/en/index.html. When I attempt
to parse these pages, I get com.sun.io.MalformedInputException:
sun.io.MalformedInputException
at sun.io.ByteToCharUTF8.convert(ByteToCharUTF8.java:105)
at java.io.InputStreamReader.convertInto(InputStreamReader.java:132)
at java.io.InputStreamReader.fill(InputStreamReader.java:181)
at java.io.InputStreamReader.read(InputStreamReader.java:244)
at java.io.BufferedReader.fill(BufferedReader.java:134)
at java.io.BufferedReader.readLine(BufferedReader.java:294)
at java.io.BufferedReader.readLine(BufferedReader.java:357)
at org.htmlparser.HTMLReader.getNextLine(HTMLReader.java:139)
at org.htmlparser.HTMLReader.readElement(HTMLReader.java:176)
at org.htmlparser.util.HTMLEnumerationImpl.peek(HTMLEnumerationImpl.java:60)
at org.htmlparser.util.HTMLEnumerationImpl.hasMoreNodes(HTMLEnumerationImpl.java:91)
Now, if I copy the source of these pages from a
browser into a file and put them on my own webserver,
I can parse them without any errors.
It's my guess that there is some strange control
character in the source that is causing the exception,
but I'm not entirely sure. Any suggestions? If it is
a bad character, would it be possible to add code to
HTMLReader that strips offending characters from the
input stream?
Here is the code I am using to parse:
DefaultHTMLParserFeedback feedback
    = new DefaultHTMLParserFeedback(DefaultHTMLParserFeedback.DEBUG);
HTMLReader reader = null;
HTMLParser parser = null;
InputStreamReader isr
    = new InputStreamReader(urlConn.getInputStream());
reader = new HTMLReader(isr, 8192);
parser = new HTMLParser(reader, feedback);
boolean inForm = false;
parser.addScanner(new HTMLInputTagScanner());
HTMLEnumeration tags = parser.elements();
RequestParameters params = new RequestParameters();
while (tags.hasMoreNodes())
{
...
}
Thanks,
Bob Lewis
|
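One possible explanation, not confirmed anywhere in this thread, is an encoding mismatch rather than a control character. The trace goes through ByteToCharUTF8, so InputStreamReader(urlConn.getInputStream()) is decoding with UTF-8, and a page served in a single-byte encoding such as ISO-8859-1 can contain bytes that are not valid UTF-8. The sketch below shows only the general technique of passing the charset to InputStreamReader explicitly. The class name, the helper firstLine and the sample bytes are made up for the example, and ISO-8859-1 is an assumption about these pages.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {
    /** Reads the first line of the stream, decoding its bytes with the given charset. */
    static String firstLine(InputStream in, Charset charset) {
        // Naming the charset avoids depending on the platform default encoding.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "cafe" with an accented e, encoded as ISO-8859-1.
        // The final byte 0xE9 on its own is not a valid UTF-8 sequence.
        byte[] latin1 = {'c', 'a', 'f', (byte) 0xE9};
        InputStream in = new ByteArrayInputStream(latin1);
        System.out.println(firstLine(in, StandardCharsets.ISO_8859_1));
    }
}
```

In the code from the message, the equivalent change would be to construct the InputStreamReader with a charset before passing it to HTMLReader. The right charset would normally come from the response's Content-Type header (urlConn.getContentType()) instead of being hard-coded.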
|
From: Somik R. <so...@ya...> - 2003-02-24 06:15:44
|
Hi Folks,
This week's release is out. I've finally taken heed of all the feedback
I had been receiving about the terrible naming convention, and have removed
"HTML" from all class names. In addition, HTMLEnumeration is now
NodeIterator and SimpleEnumeration is SimpleNodeIterator. HTMLParser is just
Parser.
This is a big step, so to make it easy for everyone, there have been no
major bug fixes that will require you to upgrade right away. I apologize in
advance for inconvenience caused - I hope you don't curse me too much for
having to modify your programs. I had the option of doing it in stages, and
forcing you to modify some small thing in every release, or get it over with
in one sweep. I chose the latter because there were too many changes and
suffering over a long period of time didn't make sense. Hopefully, once you
have migrated to the new names, you will appreciate not having to type
"HTML" each time.
The BodyScanner contributed by Dhaval Udani is finally in (Dhaval -
sorry for the delay).
The interesting part is that the documentation accompanying the package
is now the latest one on the site - it has been ripped off a Php Wiki. I am
thinking that the ripping program might be useful for those who wish to
provide wiki content as offline documentation (any feedback on this is
welcome).
From the change log :
Integration build 1.3 - 20030223
--------------------------------
[1] Modification of documentation packaging
    - the new documentation is actually produced
      by a tiny program that converts wiki pages
      into documentation (works with PhpWiki)
[2] Inclusion of BodyScanner, BodyTag
[3] HTMLVisitor is now NodeVisitor - and has an extra param to
    visit itself
[4] HTMLParser is now Parser. No class has HTML prefix anymore.
[5] HTMLEnumeration is now NodeIterator, SimpleEnumeration is
    SimpleNodeIterator
Regards,
Somik
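For code that cannot easily be run through an IDE refactoring, the renames in the change log above could be applied as a plain textual pass. This is a naive sketch and not a tool shipped with the parser. The class RenameSketch and its method migrate are hypothetical, and only the four class renames listed in the change log are covered. Method names and import statements are left untouched.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class RenameSketch {
    // Old-to-new class names, taken from the 20030223 change log.
    static final Map<String, String> RENAMES = new LinkedHashMap<>();
    static {
        RENAMES.put("HTMLParser", "Parser");
        RENAMES.put("HTMLEnumeration", "NodeIterator");
        RENAMES.put("SimpleEnumeration", "SimpleNodeIterator");
        RENAMES.put("HTMLVisitor", "NodeVisitor");
    }

    /** Replaces whole-word occurrences of the old class names in a piece of source text. */
    static String migrate(String source) {
        String result = source;
        for (Map.Entry<String, String> rename : RENAMES.entrySet()) {
            // \b keeps longer identifiers such as HTMLParserException intact.
            String wholeWord = "\\b" + Pattern.quote(rename.getKey()) + "\\b";
            result = result.replaceAll(wholeWord, rename.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(migrate("HTMLEnumeration e = new HTMLParser(url).elements();"));
    }
}
```

A textual pass like this knows nothing about scope or packages, so the result still needs to be compiled and reviewed.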
|
|
From: Mohd-Taqiyuddin Z. <mt...@ec...> - 2003-02-23 14:47:17
|
hi,
sorry to bother you. I know that the input tag is in the HTMLFormTag.
However, when I try to parse this page with HTMLFormScanner
http://developer.java.sun.com/developer/Quizzes/jbasics1-1/
it returns an error and the process is terminated. Below is my
testing code (just to see if an HTMLFormTag exists in the page):
public String extractStrings() throws HTMLParserException {
    HTMLParser parser = new HTMLParser(resource);
    parser.addScanner(new HTMLFormScanner(""));
    HTMLNode node;
    String check;
    StringBuffer results = new StringBuffer();
    for (HTMLEnumeration e = parser.elements(); e.hasMoreNodes();) {
        node = e.nextHTMLNode();
        if (node instanceof HTMLFormTag) { // check the existence of HTMLFormTag
            System.out.print(node.toString());
        }
        check = node.toPlainTextString();
        results.append(check);
    }
    return results.toString();
}
However, this error is printed in the console. The code compiles but
generates a runtime error. Below is the error:
ERROR: HTMLReader.readElement() : Error occurred while trying to decipher the tag using scanners
at Line 72 : <form method="get" action="http://servlet.java.sun.com/logRedirect/frontpage-head/http://search.java.sun.com/search/java/">
Previous Line 71 : <td><table border="0" cellspacing="0" cellpadding="0" width="100%" height="109">
ERROR: HTMLReader.readElement() : Error occurred while trying to read the next element,
at Line 72 : <form method="get" action="http://servlet.java.sun.com/logRedirect/frontpage-head/http://search.java.sun.com/search/java/">
Previous Line 71 : <td><table border="0" cellspacing="0" cellpadding="0" width="100%" height="109">
ERROR: Unexpected Exception occurred while reading http://developer.java.sun.com/developer/Quizzes/jbasics1-1/, in nextHTMLNode
at Line 72 : <form method="get" action="http://servlet.java.sun.com/logRedirect/frontpage-head/http://search.java.sun.com/search/java/">
Previous Line 71 : <td><table border="0" cellspacing="0" cellpadding="0" width="100%" height="109">
org.htmlparser.util.HTMLParserException: Unexpected Exception occurred while reading http://developer.java.sun.com/developer/Quizzes/jbasics1-1/, in nextHTMLNode
at Line 72 : <form method="get" action="http://servlet.java.sun.com/logRedirect/frontpage-head/http://search.java.sun.com/search/java/">
Previous Line 71 : <td><table border="0" cellspacing="0" cellpadding="0" width="100%" height="109">;
org.htmlparser.util.HTMLParserException: HTMLReader.readElement() : Error occurred while trying to read the next element,
at Line 72 : <form method="get" action="http://servlet.java.sun.com/logRedirect/frontpage-head/http://search.java.sun.com/search/java/">
Previous Line 71 : <td><table border="0" cellspacing="0" cellpadding="0" width="100%" height="109">;
org.htmlparser.util.HTMLParserException: HTMLReader.readElement() : Error occurred while trying to decipher the tag using scanners
at Line 72 : <form method="get" action="http://servlet.java.sun.com/logRedirect/frontpage-head/http://search.java.sun.com/search/java/">
Previous Line 71 : <td><table border="0" cellspacing="0" cellpadding="0" width="100%" height="109">;
org.htmlparser.util.HTMLParserException: HTMLTag.scan() : Error while scanning tag, tag contents = form method="get" action="http://servlet.java.sun.com/logRedirect/frontpage-head/http://search.java.sun.com/search/java/", tagLine = <form method="get" action="http://servlet.java.sun.com/logRedirect/frontpage-head/http://search.java.sun.com/search/java/">;
org.htmlparser.util.HTMLParserException: HTMLFormScanner.scan() : Error while scanning the form tag, current line = <form method="get" action="http://servlet.java.sun.com/logRedirect/frontpage-head/http://search.java.sun.com/search/java/">;
java.lang.NullPointerException
at org.htmlparser.HTMLParser.addScanner(HTMLParser.java:863)
at org.htmlparser.scanners.HTMLFormScanner.scan(HTMLFormScanner.java:164)
at org.htmlparser.scanners.HTMLTagScanner.createScannedNode(HTMLTagScanner.java:193)
at org.htmlparser.tags.HTMLTag.scan(HTMLTag.java:266)
at org.htmlparser.HTMLReader.readElement(HTMLReader.java:193)
at org.htmlparser.util.HTMLEnumerationImpl.peek(HTMLEnumerationImpl.java:60)
at org.htmlparser.util.HTMLEnumerationImpl.hasMoreNodes(HTMLEnumerationImpl.java:91)
at StringExtractor.extractStrings(StringExtractor.java:27)
at StringExtractor.main(StringExtractor.java:49)
There are two forms in the page: one is for the search part of the
site, and the other, the form with the questions, is the one I'm
interested in. Please help me with this. Is this a bug?
thank you.
|
|
From: Somik R. <so...@ya...> - 2003-02-23 05:22:19
|
You could go through the docs at http://htmlparser.sourceforge.net/docs/index.php/LinkExtraction
Forms and frames are represented by HTMLFormTag and HTMLFrameTag. You could write your own visitor that collects form tags and string nodes and, on encountering a frame tag, opens a new parser object for the frame URL and visits it with the same visitor class (probably a different object).
Try out the programs on that page, and it should be easy. Feel free to post here if you face any problems.
Regards,
Somik
|
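The visitor idea described above can be sketched without the library. This is only an illustration of the control flow: `PageNode`, `TextNode`, `FormNode`, `FrameNode`, `HarvestVisitor` and the in-memory `pages` map are hypothetical stand-ins, not htmlparser classes, and a real harvester would build its nodes by running a parser on each URL instead of looking them up in a map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Stand-in node types for illustration only; these are NOT the htmlparser classes.
interface PageNode { void accept(HarvestVisitor visitor); }

final class TextNode implements PageNode {
    final String text;
    TextNode(String text) { this.text = text; }
    public void accept(HarvestVisitor visitor) { visitor.visitText(this); }
}

final class FormNode implements PageNode {
    final String action;
    FormNode(String action) { this.action = action; }
    public void accept(HarvestVisitor visitor) { visitor.visitForm(this); }
}

final class FrameNode implements PageNode {
    final String src;
    FrameNode(String src) { this.src = src; }
    public void accept(HarvestVisitor visitor) { visitor.visitFrame(this); }
}

// One visitor class handles plain text, forms and frames.
final class HarvestVisitor {
    final List<String> texts = new ArrayList<>();
    final List<String> formActions = new ArrayList<>();
    private final Map<String, List<PageNode>> pages; // hypothetical store: URL -> parsed nodes
    private final Set<String> seen;                  // URLs already visited, shared by all visitors

    HarvestVisitor(Map<String, List<PageNode>> pages, Set<String> seen) {
        this.pages = pages;
        this.seen = seen;
    }

    void harvest(String url) {
        if (!seen.add(url)) return;            // already harvested: avoids frame loops
        List<PageNode> nodes = pages.get(url);
        if (nodes == null) return;             // unknown page: nothing to harvest
        for (PageNode node : nodes) node.accept(this);
    }

    void visitText(TextNode node) { texts.add(node.text); }

    void visitForm(FormNode node) { formActions.add(node.action); }

    // On a frame, visit the framed page with a fresh visitor and merge its results.
    void visitFrame(FrameNode node) {
        HarvestVisitor child = new HarvestVisitor(pages, seen);
        child.harvest(node.src);
        texts.addAll(child.texts);
        formActions.addAll(child.formActions);
    }
}
```

The shared `seen` set is a design choice of this sketch: frame pages that point back at each other would otherwise recurse forever.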
|
From: Mohd-Taqiyuddin Z. <mt...@ec...> - 2003-02-22 18:45:38
|
Hi,
I would like to write a program that can harvest certain information (mostly text) from web pages. Some of the pages require input from the user (they contain a <form> tag) before they show more information. Some pages are just plain text, and some use frames. How can I write a single harvester that handles all three types of pages?
Below are sample pages that I want to harvest (harvest the questions and get the correct answers):
i) with a form: http://developer.java.sun.com/developer/Quizzes/jbasics1-1/
ii) plain text: http://www.jchq.net/mockexams/exam3.htm
iii) with frames: http://www.angelfire.com/or/abhilash/Main.html
I hope you can give me some advice on how to do this. Thank you.
|
|
From: Somik R. <so...@ya...> - 2003-02-19 19:37:15
|
The last line of all mails on this list (including the one you sent) has the link to go to the mailing list admin interface, from which you can unsubscribe yourself.
Regards,
Somik
--- ChennaDulla <che...@go...> wrote:
> Thanks,
> Chenna Dulla,
> GoneHome Inc.
> 1278 SouthMain St.
> Canton, Ohio - 44720
> tel: 330-649-9258 (W)
> 440-605-1628 (R)
>
> _______________________________________________
> Htmlparser-user mailing list
> Htm...@li...
> https://lists.sourceforge.net/lists/listinfo/htmlparser-user
|
|
From: ChennaDulla <che...@go...> - 2003-02-19 16:44:40
|
Thanks,
Chenna Dulla,
GoneHome Inc.
1278 SouthMain St.
Canton, Ohio - 44720
tel: 330-649-9258 (W)
440-605-1628 (R)
|
|
From: Somik R. <so...@ya...> - 2003-02-19 16:41:50
|
setText() should not be used. We'll probably remove it from the API ASAP. Please use setAttribute() instead. Regards, Somik |
|
From: Somik R. <so...@ya...> - 2003-02-19 16:41:12
|
> May I know what is the key for each attribute in <a> tag ?
Usually href, but you can get all the keys like this:
// needs: import java.util.Enumeration;
for (Enumeration keys = tag.getAttributes().keys(); keys.hasMoreElements();)
{
    String key = (String)keys.nextElement();
    String value = tag.getAttribute(key);
    //...
}
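For readers without the library at hand, the same loop can be tried against a plain `java.util.Hashtable`. This is a minimal sketch: the `.keys()` call in the snippet above suggests a Hashtable-style attribute table, but that is an assumption, and `AttributeDump`, `dump` and the sample attribute values are hypothetical names invented for the example.

```java
import java.util.Enumeration;
import java.util.Hashtable;
import java.util.Map;
import java.util.TreeMap;

class AttributeDump {
    // Copies every key/value pair out of an attribute table, sorted by key.
    static Map<String, String> dump(Hashtable<String, String> attributes) {
        Map<String, String> result = new TreeMap<>();
        for (Enumeration<String> keys = attributes.keys(); keys.hasMoreElements();) {
            String key = keys.nextElement();
            result.put(key, attributes.get(key));
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for the table a parsed <a> tag might expose (assumed shape).
        Hashtable<String, String> attributes = new Hashtable<>();
        attributes.put("href", "index.html");
        attributes.put("target", "_blank");
        System.out.println(dump(attributes));
    }
}
```

Printing the keys this way is a quick check of which names and which letter case the table actually uses before calling getAttribute() with a hard-coded key.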
> I've been trying to modify HTML tags and attributes but it doesnt work
> pretty well
If you show us your code and tell us what's not working, we might be able to
help.
Regards,
Somik
|
|
From: Aminudin K. <ami...@mi...> - 2003-02-19 09:43:30
|
HTMLTag::getText() works fine, but setText() doesn't seem to work. Is that right? If possible, I would like to use setText(). |
|
From: Aminudin K. <ami...@mi...> - 2003-02-19 09:14:59
|
May I know what the key is for each attribute in an <a> tag? I've been trying to modify HTML tags and attributes, but it doesn't work very well. Thanks |
|
From: <wf...@ma...> - 2003-02-18 04:27:30
|
>From: "Somik Raha" <so...@ya...>
>To: <htm...@li...>
>Subject: Re: [Htmlparser-user] Anyone around using htmlparser together with Lotus Domino?
>Date: Sat, 15 Feb 2003 20:23:09 -0800
>Thats interesting - can you tell us how you are using the parser with Lotus Domino, and what your doubt is ?
Thank you for your reply, Somik.
Things have changed a little since Domino R6; however, it will take some time until this release becomes widely accepted. So what I'm investigating relates to R5, which supports Java 1.1.8 natively. There are several things I'm investigating:
1) Referrer Spamming:
This is becoming increasingly popular since referrers can be tweaked so easily. The blogging scene often presents a list of recent referrers w/o any validation. This can trick webmasters and visitors into clicking spammed ones. I'm looking for a way to filter for valid references only. Using Domino one can retrieve an HTML page including a list of hyperlinks; however, a) performance is not impressive and b) this requires that a web interface database (perweb.nsf) is set up on the server. I'd prefer to use the HTMLParser class instead. This looks like a simple one.
2) HTML translation/validation/repair
Domino's proprietary rich text format dates back to the 80s when HTML wasn't a standard. Domino's rich-text capabilities are impressive, including nested interactive sections, features like hotspots, script-enabled buttons, tabbed forms and the like. For compatibility reasons Domino was web-enabled mainly not by downsizing this format to HTML's native capabilities but by adding a richtext-to-html task and a special URL syntax. Although displayed properly by browsers, the generated HTML is not clean, e.g. list tags are not closed, stuff like this.
I'm investigating whether HTMLParser could be used to do some automatic repair: content would be edited in Domino's RTF for convenience, and the resulting HTML parsed, corrected and separately stored for web delivery. I assume that to parse HTML forgivingly the parser needs to perform some stack correction, and I hope this can easily be used for HTML repair as well?
--
Mit freundlichen Grüßen / Kind regards
Wolfgang Flamme
wf...@ma...
Am Jungstück 32
55130 Mainz-Laubenheim
Tel.: +49 (6131) 8 74 02
Mobil: +49 163 25 43 166
|
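To make the "stack correction" idea concrete, here is a minimal sketch of closing unclosed tags with a stack of open elements. It is written against plain strings with a regular expression and says nothing about how HTMLParser works internally; `TagRepair`, `repair` and the short list of void elements are assumptions made for the example, and real-world HTML needs far more rules than this.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagRepair {
    // Matches start and end tags; group 1 is "/" for an end tag, group 2 is the tag name.
    private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");
    // Elements that never take an end tag (deliberately incomplete list).
    private static final Set<String> VOID =
            new HashSet<>(Arrays.asList("br", "hr", "img", "input", "meta", "link"));

    static String repair(String html) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>();   // currently open elements, innermost first
        Matcher m = TAG.matcher(html);
        int pos = 0;
        while (m.find()) {
            out.append(html, pos, m.start());      // copy the text between tags unchanged
            pos = m.end();
            String name = m.group(2).toLowerCase(Locale.ROOT);
            boolean closing = !m.group(1).isEmpty();
            if (!closing) {
                // A new <li> implicitly ends an <li> that is still open.
                if (name.equals("li") && "li".equals(open.peek())) {
                    out.append("</").append(open.pop()).append('>');
                }
                out.append(m.group());
                if (!VOID.contains(name)) open.push(name);
            } else if (open.contains(name)) {
                // Close everything opened after the matching start tag, then the tag itself.
                while (!open.peek().equals(name)) {
                    out.append("</").append(open.pop()).append('>');
                }
                open.pop();
                out.append(m.group());
            }
            // An end tag with no matching start tag is dropped.
        }
        out.append(html.substring(pos));
        while (!open.isEmpty()) {                  // close whatever is still open at the end
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }
}
```

For example, `TagRepair.repair("<ul><li>one<li>two</ul>")` returns `<ul><li>one</li><li>two</li></ul>`, which is the unclosed-list case mentioned in the message.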
|
From: Aminudin K. <ami...@mi...> - 2003-02-18 01:04:07
|
You need the latest integration release.
HTMLVisitor is not in version 1.2.
PS: make sure your classpath is correct.
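One quick way to check the classpath point is to ask the JVM whether it can locate the class at all. This is a minimal sketch using only the standard library; `ClasspathCheck` is a hypothetical helper, and the fully qualified name `org.htmlparser.visitors.HTMLVisitor` is an assumption about the package layout, so pass the real name as an argument if your release differs.

```java
class ClasspathCheck {
    // Returns true if the named class can be located by this class's class loader.
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed fully qualified name; adjust it to match the release you installed.
        String name = args.length > 0 ? args[0] : "org.htmlparser.visitors.HTMLVisitor";
        System.out.println(name + (isOnClasspath(name) ? " found" : " NOT found on classpath"));
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));
    }
}
```

Run it with the same classpath as your application; if the class is reported as not found, either the jar is missing from the classpath or it is a release that predates the visitor classes.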
|
|
From: anumodh n. k. <anu...@ho...> - 2003-02-18 00:50:02
|
> public class MyCustomizedVisitor extends HTMLVisitor {
>     public MyCustomizedVisitor(HTMLParser parser) {
>         super(true); // It's usually a good idea to perform recursion
>         // Add the scanners you want.
>         // This decouples your application from having to know which scanners are required
>         parser.addScanner(new HTMLLinkScanner(""));
>         parser.addScanner(new HTMLImageScanner(""));
>         // or add all scanners with registerScanners()
>     }
>
>     public void visitTag(HTMLTag tag) {
>         // Collect any tags you want
>         // You can also do type checking like so:
>         if (tag instanceof HTMLMetaTag) {
>             // This tag is a meta tag
>             HTMLMetaTag metaTag = (HTMLMetaTag)tag;
>         }
>     }
>
*****************************************************************
Hello Somik,
Thanks for the information, but I couldn't find the HTMLVisitor class.
Where is it located? Please let me know.
Regards,
ANUMODH
|