htmlparser-user Mailing List for HTML Parser
Brought to you by:
derrickoswald
From: Johann H. <h.h...@ic...> - 2010-07-28 12:47:26
Hello community,

I am writing a website parser with htmlparser, and I think it's a great library. My problem is that the website I'm parsing shows me a captcha after a certain number of crawls. As a workaround, I wrote a redial routine that reconnects my router to get a new IP. That works quite well, but my JVM seems to cache DNS. I read this post, http://forum.vis.ethz.ch/showthread.php?t=13457, and applied everything suggested there, but I still can't continue parsing after a reconnect: I get a ConnectionTimeoutException from htmlparser. It seems there might still be some kind of cache. Could anybody tell me how I can get a new instance of Parser to connect after a reconnect?

Thank you.
Hans
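For readers hitting the same symptom, here is a minimal sketch of the JVM-level settings usually involved, using only standard Java APIs. It is not a confirmed fix for this report: the class name `DnsCacheSetup` is made up for illustration, and the idea that stale keep-alive connections (rather than DNS) might be the remaining "cache" is an assumption.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.security.Security;

// Hypothetical helper class; the name is illustrative only.
public class DnsCacheSetup {

    /**
     * Asks the JVM not to cache DNS lookups and not to reuse HTTP connections.
     * Call this once at startup, before the first lookup or HTTP request,
     * because the JVM may read these settings only once.
     */
    public static void disableCaching() {
        // 0 = do not cache successful lookups.
        Security.setProperty("networkaddress.cache.ttl", "0");
        // 0 = do not cache failed lookups either.
        Security.setProperty("networkaddress.cache.negative.ttl", "0");
        // Assumption: connections kept alive from before the reconnect may
        // also cause timeouts, so connection reuse is switched off as well.
        System.setProperty("http.keepAlive", "false");
    }

    public static void main(String[] args) throws UnknownHostException {
        disableCaching();
        // Each call now triggers a fresh resolution instead of a cached answer.
        System.out.println(InetAddress.getByName("localhost"));
    }
}
```

Calling `DnsCacheSetup.disableCaching()` as the first statement of the crawler's `main` method, before any Parser is created, is the safest placement under these assumptions.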
From: Derrick O. <der...@gm...> - 2010-07-08 04:38:36
Did you set STRICT to false?
http://htmlparser.sourceforge.net/javadoc/org/htmlparser/scanners/ScriptScanner.html

On Wed, Jul 7, 2010 at 9:48 PM, Niket Arora <nik...@ex...> wrote:

> I'm parsing a page
> http://www.healthline.com/search?q1=how+to+improve+prostate+blood+levels
> using the htmlparser API, and I'm getting content from inside a script tag
> placed in some other tag. The reason is that HTML tags are present in a
> string inside the JavaScript and are not escaped ... so the htmlparser API
> is closing on those tags.
>
> ================================================================
>
> <div id="myHealthlineHeader">
> <script>
> if(isLoggedIn()) {
>     document.write("<a href=\"/action/LogOutServlet\">Sign Off</a>&nbsp;|&nbsp;<a rel=\"nofollow\" href=\"/myhealthline/account_overview.jsp\">My Healthline</a>&nbsp;|&nbsp;Welcome, <strong>" + getNickname() + "</strong>");
>     document.getElementById("myHealthlineHeader").className = "hl_state_top_signed_in";
> } else {
>     document.write("<div style=\"float:right;text-align:right;padding:0 5px 0 0;\">&nbsp;|&nbsp;<a class=\"underlineless\" rel=\"nofollow\" href=\"/yourfeedback.jsp?url=%2Fsearch%3Fq1%3Dhow%2Bto%2Bimprove%2Bprostate%2Bblood%2Blevels\">Feedback</a></div>");
>     document.write("<div style=\"float:right\"><a class=\"underlineless\" rel=\"nofollow\" href=\"/signin.jsp\">Sign in</a>&nbsp;|&nbsp;<a class=\"underlineless\" rel=\"nofollow\" href=\"/registration.jsp\">Join Now</a>&nbsp;</div>")
>     document.getElementById("myHealthlineHeader").className = "hl_state_top";
> }
> </script>
> </div>
>
> ================================================================
>
> Is there any way to fix this issue?
>
> Regards
> Niket
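To illustrate what a strict versus quote-aware scan of script content means, here is a small standalone sketch. It is not the library's implementation: `ScriptEndFinder` and `findEnd` are hypothetical names, and the behaviour shown (strict mode ends at the first `</`, non-strict mode ignores `</` inside string literals) is a simplified reading of the linked ScriptScanner Javadoc.

```java
// Hypothetical illustration class; not part of htmlparser.
public class ScriptEndFinder {

    /**
     * Returns the index of the "</" that ends the script content, or -1 if none.
     * strict = true:  stop at the first "</", even inside a string literal.
     * strict = false: ignore "</" inside '...' or "..." string literals.
     */
    public static int findEnd(String text, boolean strict) {
        char quote = 0; // 0 means "not inside a string literal"
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!strict && quote != 0) {
                if (c == '\\') {
                    i++;        // skip the escaped character
                } else if (c == quote) {
                    quote = 0;  // closing quote found
                }
            } else if (!strict && (c == '"' || c == '\'')) {
                quote = c;      // opening quote found
            } else if (c == '<' && i + 1 < text.length() && text.charAt(i + 1) == '/') {
                return i;       // end-tag open delimiter outside any literal
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String script = "document.write(\"<a href=\\\"/x\\\">Sign Off</a>\");</script>";
        System.out.println("strict end:      " + findEnd(script, true));
        System.out.println("quote-aware end: " + findEnd(script, false));
    }
}
```

With the library itself, the suggestion in the reply above would presumably amount to assigning `ScriptScanner.STRICT = false;` before creating the Parser, assuming STRICT is the public static flag the linked Javadoc describes.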
From: Niket A. <nik...@ex...> - 2010-07-07 20:07:05
I'm parsing a page http://www.healthline.com/search?q1=how+to+improve+prostate+blood+levels using the htmlparser API, and I'm getting content from inside a script tag placed in some other tag. The reason is that HTML tags are present in a string inside the JavaScript and are not escaped ... so the htmlparser API is closing on those tags.

================================================================

    <div id="myHealthlineHeader">
    <script>
    if(isLoggedIn()) {
        document.write("<a href=\"/action/LogOutServlet\">Sign Off</a> | <a rel=\"nofollow\" href=\"/myhealthline/account_overview.jsp\">My Healthline</a> | Welcome, <strong>" + getNickname() + "</strong>");
        document.getElementById("myHealthlineHeader").className = "hl_state_top_signed_in";
    } else {
        document.write("<div style=\"float:right;text-align:right;padding:0 5px 0 0;\"> | <a class=\"underlineless\" rel=\"nofollow\" href=\"/yourfeedback.jsp?url=%2Fsearch%3Fq1%3Dhow%2Bto%2Bimprove%2Bprostate%2Bblood%2Blevels\">Feedback</a></div>");
        document.write("<div style=\"float:right\"><a class=\"underlineless\" rel=\"nofollow\" href=\"/signin.jsp\">Sign in</a> | <a class=\"underlineless\" rel=\"nofollow\" href=\"/registration.jsp\">Join Now</a> </div>")
        document.getElementById("myHealthlineHeader").className = "hl_state_top";
    }
    </script>
    </div>

================================================================

Is there any way to fix this issue?

Regards
Niket
From: Oliver S. <oli...@gm...> - 2010-07-05 16:31:02
Hi,

I need to read arbitrary HTML (HTML 4 transitional, XHTML 1.0 strict, ...), extract the body as a fragment, and output it again as another (XHTML standard). Reading the file is simple enough:

    Parser p = new Parser(resource);
    NodeFilter f = new NodeClassFilter(BodyTag.class);
    NodeList listOfBodies = p.extractAllNodesThatMatch(f);
    Node firstBody = listOfBodies.elementAt(0);
    NodeList bodyChildren = firstBody.getChildren();
    System.out.println(bodyChildren.toHtml());

From this, how can I output either valid HTML 4.0 code or valid XHTML 1.0 code?

Best regards
Oliver