htmlparser-user Mailing List for HTML Parser
Brought to you by:
derrickoswald
From: Derrick O. <der...@ro...> - 2008-03-24 11:40:52
|
Unknown tags are returned as generic tags. The code that does this is in PrototypicalNodeFactory.createTagNode(). If you create a class that implements the NodeFactory interface, you can substitute it for the default PrototypicalNodeFactory and determine which nodes are not in the list of known tags, or handle them any way you want. You can probably just create a class derived from PrototypicalNodeFactory and override createTagNode(). Then set it on your parser with setNodeFactory(). |
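For the "which lines/tags are not recognized" part of the question, a rough pre-check can also be sketched with the JDK alone, separate from the NodeFactory approach described above. This is a regex scan, not HTML Parser's own logic: the class name UnknownTagScan and the caller-supplied set of known tag names are assumptions made for illustration, and the scan will also pick up tag-like text inside comments, scripts, and attribute values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical JDK-only pre-check; it does not use HTML Parser at all. */
final class UnknownTagScan {
    // Matches the name of an opening tag such as "<td" or "<o:p" (closing tags are skipped).
    private static final Pattern OPEN_TAG = Pattern.compile("<([A-Za-z][A-Za-z0-9:_-]*)");

    /**
     * Returns one "line N: <NAME>" entry for every opening tag whose
     * upper-cased name is not contained in knownTags.
     */
    static List<String> scan(String html, Set<String> knownTags) {
        List<String> report = new ArrayList<>();
        String[] lines = html.split("\\r?\\n", -1);
        for (int i = 0; i < lines.length; i++) {
            Matcher m = OPEN_TAG.matcher(lines[i]);
            while (m.find()) {
                String name = m.group(1).toUpperCase(Locale.ROOT);
                if (!knownTags.contains(name)) {
                    report.add("line " + (i + 1) + ": <" + name + ">");
                }
            }
        }
        return report;
    }
}
```

The set of known names has to come from somewhere, for example a hand-maintained list of the tags your application handles; the sketch makes no attempt to read HTML Parser's registered tag list.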
From: Binod D. <bdo...@is...> - 2008-03-24 04:24:51
|
Hi, I have a question about using HTML Parser. Does HTML Parser provide an HTML validator that can be run before parsing a document? I have some HTML documents created with different HTML development tools, and these tools produce different HTML tags. I am looking for a validator that reports the lines/tags not recognized by HTML Parser. Any ideas or suggestions would be appreciated. Thanks in advance, Warm Regards, Binod Dokania |
From: Derrick O. <der...@ro...> - 2008-03-19 22:26:24
|
If you have the tag, it's just tag.setImageURL(String url). The trick is to get the output again. You probably want a list of everything, then filter to extract your particular IMG tag, fix it, and then call toHtml() on everything.

    NodeList list = parser.parse(null);
    // recursive = true, so IMG tags nested inside other tags are found as well
    NodeList images = list.extractAllNodesThatMatch(<my special filter>, true);
    // elementAt() returns a Node, so cast to org.htmlparser.tags.ImageTag first
    ((ImageTag) images.elementAt(0)).setImageURL("whatever");
    System.out.println(list.toHtml());
|
From: Wojciech G. <woj...@gm...> - 2008-03-19 21:19:54
|
Hi Martin, You're absolutely correct -- it's because of the incorrect closing (at least that's what it seems to be), and it occurs multiple times. Thanks for letting me know -- I ended up writing a quick regex script to extract this stuff, so it works in its own special way now. :) Thanks again, Wojciech -- Five Minutes to Midnight: Youth on human rights and current affairs http://www.fiveminutestomidnight.org/ |
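The regex script mentioned above was not posted, so the following is only a guess at what such a workaround might look like: a small JDK-only sketch that pulls the href value out of every opening <a> tag, whatever the link wraps. The class name HrefExtractor and the pattern are assumptions, and a regex like this handles only quoted href values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical regex-based link extraction; not the script from the thread. */
final class HrefExtractor {
    // An opening <a ...> tag with a double- or single-quoted href attribute.
    // What follows the tag (text, an <img>, nothing) does not matter.
    private static final Pattern A_HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')",
            Pattern.CASE_INSENSITIVE);

    /** Returns the href values of all anchors in document order. */
    static List<String> hrefs(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = A_HREF.matcher(html);
        while (m.find()) {
            // Group 1 holds a double-quoted value, group 2 a single-quoted one.
            out.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return out;
    }
}
```

Filtering the returned list for feed URLs (for example, those containing "feeds.") would then be a plain string check on each entry.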
From: Narindra J. <Nar...@te...> - 2008-03-19 14:58:52
|
Hi, How do I edit the src attribute in img tag within an html file? Thanks, Narindra |
From: Narindra J. <Nar...@te...> - 2008-03-19 13:55:09
|
Hi Henry, Try this:

    // needs org.htmlparser.Parser, org.htmlparser.filters.TagNameFilter,
    // and org.htmlparser.util.NodeList
    public static String getKeywords(String file) {
        String keywords = "";
        try {
            Parser parser = new Parser(file);
            // keep only the <table> tags; toHtml() includes their rows and cells
            NodeList list = parser.parse(new TagNameFilter("table"));
            keywords = list.toHtml();
            System.out.println(keywords);
        } catch (Exception pe) {
            pe.printStackTrace();
        }
        return keywords;
    }

Narindra Jeethan Office: 780.493.7211 Mobile: 780.288.5961 |
From: Henry T. <htr...@ya...> - 2008-03-19 06:35:44
|
Hi, I would like to read the content of all the tables from a web page using HTML Parser. Below is an example of what makes up an HTML table:

<table>
<tr>
<td class="propType"><b>Address</b></td>
<td class="propType"><b>Company</b></td>
<td class="propType"><b>Department</b></td>
<td class="propType" align="right"><b>Employee</b></td>
<td colspan="6"><strong class="propType">
<td><strong>Firstname</strong></td>
<td><strong>Surname</strong></td>
<td><strong>DOB</strong></td>
<td><strong>Sex</strong></td>
<td class="even">John</td>
<td class="even">Smith</td>
<td class="even">01/02/2001</td>
<td class="even">Male</td>
</tr>
</table>

I am using the following example provided on the HTML Parser filter page, but I am still not quite there:

 1 import java.io.*;
 2 import java.net.*;
 3 import org.htmlparser.*;
 4 import org.htmlparser.filters.TagNameFilter;
 5 import org.htmlparser.filters.NodeClassFilter;
 6 import org.htmlparser.filters.HasParentFilter;
 7 import org.htmlparser.filters.*;
 8 import org.htmlparser.util.*;
 9
10 public class DnldURL {
11   public static void main (String[] args) throws ParserException {
12     DnldURL dnldURL = new DnldURL();
13   }
14   public DnldURL() throws ParserException {
15     try {
16       Parser parser = new Parser ("http://www.abc.com");
17       parser.parse (new HasParentFilter());
18       NodeList list = new NodeList();
19       NodeFilter filter = new OrFilter(
20         new TagNameFilter ("table"),
21         new HasChildFilter(
22           new TagNameFilter("tr")));
23       for (NodeIterator e = parser.elements(); e.hasMoreNodes(); )
24         // System.out.println(e.nextNode().toHtml());
25         System.out.println(e.nextNode().collectInto(list, filter);
26     } catch (MalformedURLException mue) {
27       System.out.println("Ouch - a MalformedURLException happened.");
28       mue.printStackTrace();
29       System.exit(1);
30     } catch (IOException ioe) {
31       System.out.println("Oops- an IOException happened.");
32       ioe.printStackTrace();
33       System.exit(1);
34     }
35   }

The important thing is to get lines 17 and 19-22 set up correctly so that the filter picks up the content and it is printed on line 25. I am confused not only about how to set up the table filter dependencies (<table> ... <tr> ... <td> ...) but also about how to get line 25 to combine the filter and toHtml(). For instance:

System.out.println(e.nextNode().collectInto(list, filter).toHtml());

which doesn't work currently. I would also like to set up some dependency on what the content of <table>, <tr> and <td> should be, so that only the relevant tables are retrieved as opposed to all the tables. Many thanks, Jack |
From: Derrick O. <der...@ro...> - 2008-03-18 23:03:05
|
You would use a filter to get the title tag, like TagNameFilter ("TITLE"), and then ask for the plain text from every node (there's probably only one) in the list of filtered tags, like tag.toPlainTextString (). |
From: Narindra J. <Nar...@te...> - 2008-03-18 20:55:14
|
Hi, How do I use the filter to return the values between tags? For example, I want to get "test" from the title tag:

<html>
<head>
<title>test</title>
</head>
<body>body test</body>
</html>

Thanks, Narindra Jeethan Office: 780.493.7211 Mobile: 780.288.5961 |
From: Martin S. <mst...@gm...> - 2008-03-17 09:42:29
|
2008/3/11, Wojciech Gryc <woj...@gm...>:
> Specifically, the link appears in the page, but surrounds an image, like so:
>
> <p><a href="http://feeds.feedburner.com/mydd"><img src="..."></a>
>
> When I use a basic tag name filter, the actual tag above doesn't get
> returned (while other <a> tag links do)... I've been playing around with the
> code and don't know where to go from here. Is it because it surrounds an
> image? Is there anything I can do to fix this?

If it is only a problem with this particular piece of HTML, then it could be useful to post it verbatim. Looking at the part you have in your e-mail, it seems that the img tag is not properly closed (it should be <img src="..." /> if it is an XHTML document, that is). However, I'm not sure if HTMLParser just ignores this... from previous experience, I noted that HTMLParser is pretty strict when it comes to standards compliance.

--
Martin Sturm |
From: Daniel D. <me...@cr...> - 2008-03-11 22:57:49
|
Was able to fix the problem using this code, pulled from the extractAllNodesThatMatch method itself:

=========================
NodeIterator e;
for (e = parser.elements (); e.hasMoreNodes (); ) {
    Node currentNode = e.nextNode();
    currentNode.collectInto(titleList, titleFilter);
    currentNode.collectInto(summaryTableList, summaryTableFilter);
}
=========================

-Daniel

--
Daniel me...@da... www.OneDanShow.com |
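The fix above most likely works because the page is read only once and each node is offered to both filters on the way through, whereas a second full extraction starts from a source that has already been consumed. The same pattern can be sketched without HTML Parser, using a plain Iterator and two predicates; SinglePassCollect and its parameter names are made up for this illustration, and plain strings stand in for nodes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

/** Illustration only: strings stand in for parsed nodes. */
final class SinglePassCollect {
    /** Drains the iterator once, adding each item to every list whose predicate accepts it. */
    static <T> void collect(Iterator<T> items,
                            Predicate<T> first, List<T> firstOut,
                            Predicate<T> second, List<T> secondOut) {
        while (items.hasNext()) {
            T item = items.next();
            if (first.test(item)) {
                firstOut.add(item);
            }
            if (second.test(item)) {
                secondOut.add(item);
            }
        }
    }

    public static void main(String[] args) {
        Iterator<String> nodes = Arrays.asList("a", "table", "a", "table", "p").iterator();
        List<String> links = new ArrayList<>();
        List<String> tables = new ArrayList<>();
        collect(nodes, "a"::equals, links, "table"::equals, tables);
        System.out.println(links + " " + tables);
        // The iterator is now exhausted, so a second pass over it would collect nothing.
        System.out.println(nodes.hasNext());
    }
}
```

If two separate passes are preferred, the Parser class should also offer a reset() method to start again from the beginning of the page; check the Javadoc for your version before relying on it.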
From: Daniel D. <me...@cr...> - 2008-03-11 14:03:57
|
Hello,

Anyone know why I can't use two extractAllNodesThatMatch(filter) methods back-to-back on the same Parser instance? More specifically I have this code:

========================================
Parser parser = new Parser(google);
NodeList titleList = parser.extractAllNodesThatMatch(titleFilter);
NodeList summaryTableList = parser.extractAllNodesThatMatch(summaryTableFilter);
========================================

The Google search results page I'm parsing has a series of these:

<a href="blah">Title</a>
<table><tr><td>.....Summary info....</td></tr></table>

The two filters above, when independent, work fine. Run them back-to-back and the second will come up empty. I don't see where the extractAllNodesThatMatch method literally pulls the nodes out of the captured source, thus affecting the second filter. Here are my filters:

========================================
// filter to pull out titles (all links that are next to a table)
NodeFilter titleFilter = new AndFilter (
    new NodeClassFilter (LinkTag.class),
    new HasSiblingFilter (new NodeClassFilter(TableTag.class))
);
// filter to pull out summaries (all tables that are next to a title link)
NodeFilter summaryTableFilter = new AndFilter (
    new NodeClassFilter (TableTag.class),
    new NodeClassFilterOnPreviousSibling (LinkTag.class) // custom filter
);
========================================

Thanks for the help. I've already tried subclassing the Parser so that I could implement the clone() method, but got the same result.

-Daniel |
From: Wojciech G. <woj...@gm...> - 2008-03-11 06:46:38
|
Hi, I'm a fairly new user of the HTML Parser software, and am already grateful for all the work that has gone into it... I'm currently building some software to extract RSS feed links. These appear in many different forms, and there's one particular site that's giving me some trouble (www.mydd.com)... Specifically, the link appears in the page, but surrounds an image, like so:

<p><a href="http://feeds.feedburner.com/mydd"><img src="..."></a>

When I use a basic tag name filter, the actual tag above doesn't get returned (while other <a> tag links do)... I've been playing around with the code and don't know where to go from here. Is it because it surrounds an image? Is there anything I can do to fix this? I'd be grateful for any help. Thank you, Wojciech -- Five Minutes to Midnight: Youth on human rights and current affairs http://www.fiveminutestomidnight.org/ |
From: <bo...@ti...> - 2008-03-01 17:03:41
|
Hi, I searched high and low for an answer to this question but I don't seem to be able to find one. I use htmlparser 1.6 (what a great product, by the way), but for one of my applications I need to change the User-Agent used by htmlparser when requesting pages. At the moment that always seems to be: HTMLParser/1.6 Is there any way I can change that field? Thanks in advance, Brian. |
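No reply to this question appears in this part of the archive. One possible route, offered as a sketch and not as a confirmed answer, is to prepare the connection yourself with the JDK and set the header before any request is sent. The helper below uses only java.net; the name UserAgentConnection is hypothetical. Whether and how the prepared connection is handed to HTML Parser is an assumption to verify against the 1.6 Javadoc: the library is understood to have a Parser constructor that accepts a URLConnection and a ConnectionManager that holds default request properties.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

/** Hypothetical helper: prepares a connection that will send a custom User-Agent. */
final class UserAgentConnection {
    /**
     * Creates the connection object and sets the header.
     * No network traffic happens here; the request is sent only when the
     * connection is actually used (connect(), getInputStream(), ...).
     */
    static URLConnection prepare(String url, String userAgent) {
        try {
            URLConnection connection = URI.create(url).toURL().openConnection();
            connection.setRequestProperty("User-Agent", userAgent);
            return connection;
        } catch (IOException e) {
            // Covers MalformedURLException as well; rethrown unchecked to keep callers simple.
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller would then do something like prepare("http://www.example.com/", "MyCrawler/1.0") and pass the result on to whatever reads the page.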