htmlparser-developer Mailing List for HTML Parser
Brought to you by:
derrickoswald
From: Somik R. <so...@ya...> - 2002-01-18 23:55:06

> What is the Parse.jar file in htmlparser.jar?

Ah, I was wondering why the size was so much. Thanks for pointing it out.

> I would like if htmlparser.jar would be named to HTMLParser.jar
> according to the name of the application.
>
> I happened to call it with capital letters in my application
> and it's easy for me to make this change but perhaps
> if someone else does it he does not notice the difference.

Well, class naming conventions are different from jar naming conventions. I thought keeping all small letters is simple.

> I today replaced my modified version 0.98
> with the official version 1.02 and after I solved some
> incompatibilities (mainly the BufferedReader thing)
> it seemed to go as it should.

Great! Any suggestions on where we go from here? It really bothers me that the parser does not show up on Google when I type "html parser java" in the search. How do we go about giving it more visibility?

Cheers,
Somik

_________________________________________________________
Do You Yahoo!?
Get your free @yahoo.com address at http://mail.yahoo.com

From: Kaarle K. <kaa...@ik...> - 2002-01-18 20:24:23

hi,

What is the Parse.jar file in htmlparser.jar?

I would like if htmlparser.jar would be named to HTMLParser.jar
according to the name of the application.

I happened to call it with capital letters in my application
and it's easy for me to make this change but perhaps
if someone else does it he does not notice the difference.

I today replaced my modified version 0.98
with the official version 1.02 and after I solved some
incompatibilities (mainly the BufferedReader thing)
it seemed to go as it should.

Kaarle

---------------------------------------------
Kaarle Kaila
http://www.iki.fi/kaila
mailto:kaa...@ik...
tel: +358 50 3725844

From: Somik R. <so...@ya...> - 2002-01-16 14:09:40

Hi Folks,

Check http://htmlparser.sourceforge.net for a totally new look. Design documentation with sample programs has been added. Feedback is welcome.

Regards,
Somik

From: Somik R. <so...@ki...> - 2002-01-16 14:08:44

Hi Folks,

Check http://htmlparser.sourceforge.net for a totally new look. Design documentation with sample programs has been added. Feedback is welcome.

Regards,
Somik

From: Somik R. <so...@ya...> - 2002-01-09 16:36:33

Hi Folks,

Another bug was detected in HTMLStyleScanner, and has been immediately fixed. v1.02 has been released with this fix, and another one - which allows scanning of Finnish pages to proceed properly.

Regards,
Somik

From: Somik R. <so...@ya...> - 2002-01-09 11:50:17

Dear Kaarle,

Thank you very much! You are quite right, I forgot I was using Shift-JIS for Japanese encoding support and SJIS is a Microsoft specific standard - not Unicode, but if I use a Unicode encoding, it should be fine. I will try with UTF8, will need your help to co-ordinate some more tests.

Meanwhile this style thing is proving to be a headache, just got a report that it's crashing on Google. Need to add more test cases.

Regards,
Somik

----- Original Message -----
From: Kaarle Kaila
To: Somik Raha
Sent: Wednesday, January 09, 2002 2:40 AM
Subject: Re: [Htmlparser-developer] htmlparser 1.0 (Issue with mtv3 is that of internationalization)

At 22:37 8.1.2002 +0530, Somik Raha wrote:

> Hi Kaarle,
> I found the reason for the last problem - the site http://www.mtv3.fi
> has a link in Finnish. That link is not being interpreted correctly by the parser.
> The link is:
> <a href="/ks/ks_20020701b.shtml">Palveluun pääset tästä</a>

hi Somik,

HTMLParser reads lines from the net. It initiates the contact to that line with a command

    reader = new HTMLReader(new BufferedReader(new InputStreamReader(uc.getInputStream(),"SJIS")),resourceLocn);

I don't know what SJIS stands for. The Java API does not list that, but lists among others ISO-8859-1.
Check InputStreamReader constructor. By using ISO-8859-1 it does not hang like it did with SJIS!
SJIS seems to make everything 7-bit ascii.

    reader = new HTMLReader(new BufferedReader(new InputStreamReader(uc.getInputStream(),"ISO-8859-1")),resourceLocn);

With this setting at least Finnish characters come correctly.

I also downloaded two files you had made changes from CVS
and I could read www.mtv3.fi. It even reads my webpage (rather strange output though).

In Japan I would expect the internationalizing to be an issue?? Wouldn't UNICODE
be required there?

regards
Kaarle

> What's happening is that the last < is being corrupted. I haven't faced a
> problem with internationalization till now - and I am kind of stuck with this one.
> Maybe you'd be in a better position to solve it than me. I will make the release
> with the other bug fixed, and I'd be grateful if you can proceed from there.
>
> Regards,
> Somik

---------------------------------------------
Kaarle Kaila
http://www.iki.fi/kaila
mailto:kaa...@ik...
tel: +358 50 3725844

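The charset point in this exchange can be illustrated with a small, self-contained sketch. It is not HTMLParser code: the class `CharsetDemo` and its method `firstLine` are hypothetical names, and a byte array stands in for `uc.getInputStream()`. The sketch encodes the Finnish link text as ISO-8859-1 bytes and reads it back through an `InputStreamReader`, once with the matching charset and once with UTF-8 as a deliberately mismatched one. It only shows that a mismatched charset damages the non-ASCII characters; it does not reproduce the hang reported with SJIS.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of HTMLParser.
final class CharsetDemo {

    /** Reads the first line of the given bytes, decoding them with the named charset. */
    static String firstLine(byte[] raw, String charsetName) {
        // The byte array is an assumption standing in for a network stream.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(raw), charsetName))) {
            return reader.readLine();
        } catch (IOException e) {
            // Also covers UnsupportedEncodingException for unknown charset names.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The link from the thread; \u00e4 is the Finnish letter a-umlaut.
        String link = "<a href=\"/ks/ks_20020701b.shtml\">Palveluun p\u00e4\u00e4set t\u00e4st\u00e4</a>";

        // Pretend the server sent the page as ISO-8859-1 bytes.
        byte[] latin1Bytes = link.getBytes(StandardCharsets.ISO_8859_1);

        // Matching charset: the text survives the round trip.
        System.out.println(firstLine(latin1Bytes, "ISO-8859-1").equals(link));

        // Mismatched charset: the a-umlaut bytes are not valid UTF-8,
        // so the decoder substitutes replacement characters.
        System.out.println(firstLine(latin1Bytes, "UTF-8").equals(link));
    }
}
```

In practice the charset would be taken from the HTTP `Content-Type` header or the page's own declaration rather than hard-coded; the hard-coded names here only keep the sketch short.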
From: Somik R. <so...@ya...> - 2002-01-08 17:35:21

Hi Folks,

An important bug fix has been done. The parser was crashing on style tags - this has been fixed.

Regards,
Somik

From: Somik R. <so...@ya...> - 2002-01-08 15:46:16

Hi Kaarle,

To answer your basic question - the crawler will crawl through a url (like websnake and similar robot crawlers). It will pick up links and visit those links and so on recursively, depending on the depth you define.

The bugs you see are not because of the crawler code, but because of some parser bugs. The scanner bugs came in when I tried to fix the case when the style tags are in one big line with other stuff. Obviously, not enough test cases. Thankfully, you are htmlparser's best tester :)

Your site and http://www.yle.fi are working fine now. mtv3 is giving the weird out of memory exception and I am now fixing that. As soon as that's done, maintenance release 1.01 will be out.

Cheers,
Somik

----- Original Message -----
From: "Kaarle Kaila" <kaa...@ik...>
To: <htm...@li...>
Sent: Tuesday, January 08, 2002 3:34 AM
Subject: [Htmlparser-developer] htmlparser 1.0

> I tried the example applications using the bat-files
> with htmlparser 1.0 with not very good success.
>
> 1)
> runCrawler http://www.google.com 1
> This gives a list of links on the abovementioned page I assume
>
> 2) (Finnish broadcasting company)
> runCrawler http://www.yle.fi 1
> This throws
> Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 27
>
> 3) (Finnish commercial TV station)
> runCrawler http://www.mtv3.fi 1
> this throws
> Exception in thread "main" java.lang.OutOfMemoryError
> <<no stack trace available>>
>
> 4) my own simple homepage
>
> After a rather long time throws:
> Crawling to http://www.microsoft.com/ContentRedirect.asp?prd=iis&sbp=&pver=5.0&pid=&ID=404&cat=web&os=&over=&hrd=&Opt1=&Opt2=&Opt3= crawlDepth = 0
> Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 23
>         at java.lang.String.substring(Unknown Source)
>         ........
> I don't think I have such microsoft links on my page. Probably something to
> do with the activeisp.com that provides me with diskspace??
>
> Similar result from my software page at www.kk-software.fi
> --------------------
> As a result of these experiments I did not understand what the Robot tries
> to do??
>
> Any explanations to this?
> regards
> Kaarle
>
> ---------------------------------------------
> Kaarle Kaila
> http://www.iki.fi/kaila
> mailto:kaa...@ik...
> tel: +358 50 3725844
>
> _______________________________________________
> Htmlparser-developer mailing list
> Htm...@li...
> https://lists.sourceforge.net/lists/listinfo/htmlparser-developer

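The recursive, depth-limited crawl described in this reply can be sketched as follows. This is an illustration under stated assumptions, not the project's crawler: `CrawlSketch`, `extractLinks`, and `crawl` are hypothetical names, an in-memory map stands in for fetching pages over HTTP, and a naive regular expression stands in for the parser's link scanning. Depth 0 is assumed to mean "visit the start page only".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a depth-limited crawl; not HTMLParser's own code.
final class CrawlSketch {

    // Naive matcher for double-quoted href values, for illustration only.
    private static final Pattern HREF =
            Pattern.compile("href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Returns the href values found in the given HTML, in document order. */
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    /** Crawls from start, following links at most depth levels deep. */
    static Set<String> crawl(Map<String, String> site, String start, int depth) {
        Set<String> visited = new LinkedHashSet<>();
        visit(site, start, depth, visited);
        return visited;
    }

    private static void visit(Map<String, String> site, String url,
                              int depth, Set<String> visited) {
        if (!visited.add(url)) {
            return; // already seen; this also stops link cycles
        }
        String html = site.get(url); // stand-in for an HTTP fetch
        if (html == null || depth <= 0) {
            return; // unknown page, or depth budget used up
        }
        for (String link : extractLinks(html)) {
            visit(site, link, depth - 1, visited);
        }
    }

    public static void main(String[] args) {
        Map<String, String> site = new HashMap<>();
        site.put("a", "<a href=\"b\">b</a>");
        site.put("b", "<a href=\"c\">c</a> <a href=\"a\">back</a>");
        site.put("c", "no links here");
        System.out.println(crawl(site, "a", 1)); // prints [a, b]
        System.out.println(crawl(site, "a", 2)); // prints [a, b, c]
    }
}
```

The visited set is what keeps a page that links back to its parent from looping forever. One limitation of this simple depth-first version is that a page first reached at the depth limit is not expanded again if it is later reachable by a shorter path.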
From: Somik R. <so...@ya...> - 2002-01-08 15:16:32

Hi Kaarle,

Thanks for pointing this out. It's not a bug with the crawler, but with the parser itself - in HTMLStyleScanner... I am trying to fix it asap.

Regards,
Somik

----- Original Message -----
From: "Kaarle Kaila" <kaa...@ik...>
To: <htm...@li...>
Sent: Tuesday, January 08, 2002 3:34 AM
Subject: [Htmlparser-developer] htmlparser 1.0

> I tried the example applications using the bat-files
> with htmlparser 1.0 with not very good success.
>
> 1)
> runCrawler http://www.google.com 1
> This gives a list of links on the abovementioned page I assume
>
> 2) (Finnish broadcasting company)
> runCrawler http://www.yle.fi 1
> This throws
> Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 27
>
> 3) (Finnish commercial TV station)
> runCrawler http://www.mtv3.fi 1
> this throws
> Exception in thread "main" java.lang.OutOfMemoryError
> <<no stack trace available>>
>
> 4) my own simple homepage
>
> After a rather long time throws:
> Crawling to http://www.microsoft.com/ContentRedirect.asp?prd=iis&sbp=&pver=5.0&pid=&ID=404&cat=web&os=&over=&hrd=&Opt1=&Opt2=&Opt3= crawlDepth = 0
> Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 23
>         at java.lang.String.substring(Unknown Source)
>         ........
> I don't think I have such microsoft links on my page. Probably something to
> do with the activeisp.com that provides me with diskspace??
>
> Similar result from my software page at www.kk-software.fi
> --------------------
> As a result of these experiments I did not understand what the Robot tries
> to do??
>
> Any explanations to this?
> regards
> Kaarle
>
> ---------------------------------------------
> Kaarle Kaila
> http://www.iki.fi/kaila
> mailto:kaa...@ik...
> tel: +358 50 3725844
>
> _______________________________________________
> Htmlparser-developer mailing list
> Htm...@li...
> https://lists.sourceforge.net/lists/listinfo/htmlparser-developer

From: Kaarle K. <kaa...@ik...> - 2002-01-07 22:06:18

I tried the example applications using the bat-files
with htmlparser 1.0 with not very good success.

1)
runCrawler http://www.google.com 1
This gives a list of links on the abovementioned page I assume

2) (Finnish broadcasting company)
runCrawler http://www.yle.fi 1
This throws
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 27

3) (Finnish commercial TV station)
runCrawler http://www.mtv3.fi 1
this throws
Exception in thread "main" java.lang.OutOfMemoryError
<<no stack trace available>>

4) my own simple homepage

After a rather long time throws:
Crawling to http://www.microsoft.com/ContentRedirect.asp?prd=iis&sbp=&pver=5.0&pid=&ID=404&cat=web&os=&over=&hrd=&Opt1=&Opt2=&Opt3= crawlDepth = 0
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 23
        at java.lang.String.substring(Unknown Source)
        ........
I don't think I have such microsoft links on my page. Probably something to
do with the activeisp.com that provides me with diskspace??

Similar result from my software page at www.kk-software.fi
--------------------
As a result of these experiments I did not understand what the Robot tries
to do??

Any explanations to this?
regards
Kaarle

---------------------------------------------
Kaarle Kaila
http://www.iki.fi/kaila
mailto:kaa...@ik...
tel: +358 50 3725844

From: Somik R. <so...@ya...> - 2002-01-05 17:11:17

Hi Folks,

Sorry about that, the zip file that was uploaded seemed to be corrupted. It's fixed, and you should be able to download it now.

Regards,
Somik

From: Somik R. <so...@ya...> - 2002-01-03 20:04:50

Hi Folks,

A new year present - HTMLParser 1.0 is released. We've finally made the transition from alpha to a beta stage. Modifications henceforth would only be of a maintenance nature and the API should remain constant.

There are huge changes in the architecture, and lots of bug fixes. Thanks a lot to Kaarle Kaila for some great support and ideas. Thanks also to Rodney Foley, for some nice ideas for improvement. And thanks to everyone else who's been supporting this project.

Looking forward to your continuing support, and wishing you a very happy new year.

Cheers,
Somik