htmlparser-user Mailing List for HTML Parser

Brought to you by: derrickoswald

htmlparser-user — The user mailing list for users of the htmlparser library

[Htmlparser-user] ConnectionTimeout after reconnect, caching?

From: Johann H. <h.h...@ic...> - 2010-07-28 12:47:26

Hello community,
I am writing a website parser with htmlparser, and I think it's a great
library.
My problem is that the website I'm parsing shows me a captcha after a
certain number of crawls.
As a workaround, I wrote a redial routine that reconnects my router to
get a new IP.
That works quite well, but my JVM seems to cache DNS lookups.
I read this post http://forum.vis.ethz.ch/showthread.php?t=13457 and
applied everything suggested there, but I still can't continue parsing
after a reconnect: I get a ConnectionTimeoutException from htmlparser.
It seems that there might still be some kind of cache.
Could anybody tell me how I can get a new instance of Parser to
connect after a reconnect?

Thank you.
Hans.
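
For readers hitting the same symptom, here is a minimal sketch of the JDK-level settings usually involved. It assumes the stale state lives in the JVM's DNS cache or in pooled keep-alive HTTP connections, which the thread does not confirm. The class and method names are hypothetical. The property names are standard JDK ones, and they generally need to be set before the first lookup or HTTP request in the process.

```java
import java.security.Security;

// Hypothetical helper, not part of htmlparser.
class DnsCacheSketch {

    /** Applies JVM-wide settings; call once, before any network access. */
    static void disableNetworkCaches() {
        // Do not cache successful DNS lookups.
        Security.setProperty("networkaddress.cache.ttl", "0");
        // Do not cache failed DNS lookups either.
        Security.setProperty("networkaddress.cache.negative.ttl", "0");
        // Assumption: timeouts may come from reusing a pooled connection
        // that was opened before the router reconnected.
        System.setProperty("http.keepAlive", "false");
    }

    public static void main(String[] args) {
        disableNetworkCaches();
        System.out.println("ttl=" + Security.getProperty("networkaddress.cache.ttl"));
        System.out.println("keepAlive=" + System.getProperty("http.keepAlive"));
    }
}
```

If the timeout persists with these settings, the cause is probably outside the JVM, for example the router still renegotiating or the operating system's own resolver cache.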

Re: [Htmlparser-user] HTML parser parsing script incorrectly

From: Derrick O. <der...@gm...> - 2010-07-08 04:38:36

Did you set STRICT to false? See:

http://htmlparser.sourceforge.net/javadoc/org/htmlparser/scanners/ScriptScanner.html
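
To illustrate what the flag changes, here is a simplified, library-free sketch of the two ways to find the end of script content. This is not htmlparser's actual implementation. The class and method names are hypothetical, and the lenient mode tracks only quoted strings, ignoring JavaScript comments and regex literals. In strict mode the content ends at the first "</", even inside a string, which is what splits the document.write() calls in the reported page. In lenient mode, "</" inside quotes is skipped.

```java
// Hypothetical illustration, not part of htmlparser.
class ScriptEndSketch {

    /**
     * Returns the index of the "</" that ends the script content,
     * or -1 if none is found.
     * strict:  stop at the first "</", quoted or not.
     * lenient: skip quoted strings and stop only at "</script".
     */
    static int findScriptEnd(String html, int start, boolean strict) {
        char quote = 0; // 0 means "not inside a string literal"
        for (int i = start; i < html.length() - 1; i++) {
            char c = html.charAt(i);
            if (!strict && quote != 0) {
                if (c == '\\') {
                    i++;            // skip the escaped character
                } else if (c == quote) {
                    quote = 0;      // closing quote
                }
                continue;
            }
            if (!strict && (c == '"' || c == '\'')) {
                quote = c;          // opening quote
                continue;
            }
            if (c == '<' && html.charAt(i + 1) == '/') {
                if (strict || html.regionMatches(true, i + 2, "script", 0, 6)) {
                    return i;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String html = "document.write(\"<a href=\\\"/x\\\">Sign Off</a>\");</script>";
        System.out.println("strict ends at  " + findScriptEnd(html, 0, true));
        System.out.println("lenient ends at " + findScriptEnd(html, 0, false));
    }
}
```

In htmlparser itself, the suggestion above amounts to setting the STRICT flag described in the linked ScriptScanner Javadoc to false before parsing.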



On Wed, Jul 7, 2010 at 9:48 PM, Niket Arora <nik...@ex...> wrote:

>  I m parsing a page
> http://www.healthline.com/search?q1=how+to+improve+prostate+blood+levels using
> htmlparser api and I m getting content inside a script tag in some other tag
> and reason for this is html tags are present in a string inside javascript
> tags and are not escaped …. so htmlparser api is closing on those tags.
>
>
>
>
>
>
> ================================================================================================================================================================================================
>
>
>
> <div id="myHealthlineHeader">
>
>         <script>
>
>               if(isLoggedIn()) {
>
>                 document.write("<a href=\"/action/LogOutServlet\">Sign
> Off</a> | <a rel=\"nofollow\" href=\"/myhealthline/account_overview.jsp\">My
> Healthline</a> | Welcome, <strong>" + getNickname() + "</strong>");
>
>                 document.getElementById("myHealthlineHeader").className =
> "hl_state_top_signed_in";
>
>               } else {
>
>
>
>                 document.write("<div
> style=\"float:right;text-align:right;padding:0 5px 0
> 0;\">&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;<a class=\"underlineless\"
> rel=\"nofollow\"
> href=\"/yourfeedback.jsp?url=%2Fsearch%3Fq1%3Dhow%2Bto%2Bimprove%2Bprostate%2Bblood%2Blevels\">Feedback</a></div>");
>
>                 document.write("<div style=\"float:right\"><a
> class=\"underlineless\" rel=\"nofollow\" href=\"/signin.jsp\">Sign
> in</a>&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;<a class=\"underlineless\"
> rel=\"nofollow\" href=\"/registration.jsp\">Join Now</a>&nbsp;</div>")
>
>                 document.getElementById("myHealthlineHeader").className =
> "hl_state_top";
>
>               }
>
>         </script>
>
> </div>
>
>
>
>
> ================================================================================================================================================================================================
>
>
>
> Is there anyway to fix this issue?
>
>
>
> Regards
>
> Niket

[Htmlparser-user] HTML parser parsing script incorrectly

From: Niket A. <nik...@ex...> - 2010-07-07 20:07:05

I'm parsing the page http://www.healthline.com/search?q1=how+to+improve+prostate+blood+levels using the htmlparser API, and content from inside a script tag ends up inside some other tag. The reason is that HTML tags appear inside a string within the JavaScript and are not escaped, so htmlparser closes the script on those tags.


================================================================================================================================================================================================

<div id="myHealthlineHeader">
        <script>
              if(isLoggedIn()) {
                document.write("<a href=\"/action/LogOutServlet\">Sign Off</a> | <a rel=\"nofollow\" href=\"/myhealthline/account_overview.jsp\">My Healthline</a> | Welcome, <strong>" + getNickname() + "</strong>");
                document.getElementById("myHealthlineHeader").className = "hl_state_top_signed_in";
              } else {

                document.write("<div style=\"float:right;text-align:right;padding:0 5px 0 0;\">&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;<a class=\"underlineless\" rel=\"nofollow\" href=\"/yourfeedback.jsp?url=%2Fsearch%3Fq1%3Dhow%2Bto%2Bimprove%2Bprostate%2Bblood%2Blevels\">Feedback</a></div>");
                document.write("<div style=\"float:right\"><a class=\"underlineless\" rel=\"nofollow\" href=\"/signin.jsp\">Sign in</a>&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;<a class=\"underlineless\" rel=\"nofollow\" href=\"/registration.jsp\">Join Now</a>&nbsp;</div>")
                document.getElementById("myHealthlineHeader").className = "hl_state_top";
              }
        </script>
</div>

================================================================================================================================================================================================

Is there any way to fix this issue?

Regards
Niket

[Htmlparser-user] Extract HTML Body and output as (X)HTML standards

From: Oliver S. <oli...@gm...> - 2010-07-05 16:31:02

Hi,

I need to read arbitrary HTML (HTML 4 transitional, XHTML 1.0 strict, ...), extract the body as a fragment, and output it again in another standard (XHTML).

Reading the file is simple enough:

		// Classes used: org.htmlparser.{Parser, Node, NodeFilter},
		// org.htmlparser.filters.NodeClassFilter, org.htmlparser.tags.BodyTag,
		// org.htmlparser.util.NodeList
		Parser p = new Parser(resource);  // resource: e.g. a URL or file name string
		NodeFilter f = new NodeClassFilter(BodyTag.class);
		NodeList listOfBodies = p.extractAllNodesThatMatch(f);
		Node firstBody = listOfBodies.elementAt(0);  // assumes at least one <body>
		NodeList bodyChildren = firstBody.getChildren();
		System.out.println(bodyChildren.toHtml());

From this, how can I output either valid HTML 4.0 code or valid XHTML 1.0 code?

Best regards
Oliver

43 messages have been excluded from this view by a project administrator.