htmlparser-user Mailing List for HTML Parser
Brought to you by:
derrickoswald
From: Derrick O. <der...@ro...> - 2008-05-30 01:43:45
|
Applying new AndFilter (new TagNameFilter ("TD"), new HasSiblingFilter (new StringFilter ("Job ID", true))) would give you the <td class="FormContentFieldValue">524</td> tag, so you could ask for toPlainTextString() and convert the resulting string into an integer value if you want. |
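The idea in the reply above (find the value by its "Job ID" label rather than by its position in the table, then convert the text to an integer) can be sketched without any third-party library. The sketch below is only an illustration under stated assumptions: it uses java.util.regex as a stand-in for the HtmlParser filters, it assumes every row has exactly the FormContentFieldLabel / FormContentFieldValue cell pair shown in the question, and the class and method names (JobIdLookup, fields, jobId) are hypothetical. Regular expressions are fragile on real-world HTML, so this is not a replacement for the filter approach.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: looks a field up by its label instead of its row position.
class JobIdLookup {
    // One label cell immediately followed by its value cell, as in the sample rows.
    private static final Pattern ROW = Pattern.compile(
        "<td\\s+class=\"FormContentFieldLabel\">(.*?)</td>\\s*"
        + "<td\\s+class=\"FormContentFieldValue\">(.*?)</td>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Collects every label/value pair, keeping the order in which the rows appear.
    static Map<String, String> fields(String html) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = ROW.matcher(html);
        while (m.find()) {
            result.put(m.group(1).trim(), m.group(2).trim());
        }
        return result;
    }

    // Returns the "Job ID" value as an int, wherever its row sits in the table.
    static int jobId(String html) {
        String value = fields(html).get("Job ID");
        if (value == null) {
            throw new IllegalArgumentException("no \"Job ID\" row found");
        }
        return Integer.parseInt(value);
    }

    public static void main(String[] args) {
        String html =
            "<tr class=\"FormContent\"> <td class=\"FormContentFieldLabel\">Job Title</td>"
            + " <td class=\"FormContentFieldValue\">Director</td> </tr>"
            + "<tr class=\"FormContent\"> <td class=\"FormContentFieldLabel\">Job ID</td>"
            + " <td class=\"FormContentFieldValue\">524</td> </tr>";
        System.out.println(jobId(html)); // prints 524
    }
}
```

Because the lookup is keyed on the label text, adding or reordering rows on the page should not change the result, which is the property the original question asked for.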
From: neethu j. <nee...@gm...> - 2008-05-29 05:07:29
|
Thanks for your reply. Could you please explain a little more on this one? Ultimately I'm interested in the field value of the Job ID, i.e. 524. |
From: Derrick O. <der...@ro...> - 2008-05-29 00:53:12
|
You should be able to construct a filter using the FilterBuilder application to look for the "Job ID" in the adjacent TD. It will be something like: new AndFilter (new TagNameFilter ("TD"), new HasSiblingFilter (new StringFilter ("Job ID", true))) |
From: neethu j. <nee...@gm...> - 2008-05-28 17:06:01
|
Hi, I'm new to HtmlParser. Could you please help me extract the Job ID from the table? I was trying to locate it as the 3rd element of the table, but the page is modified from day to day, so I need an alternative way to find the Job ID.

</tr>
<tr class="FormContent">
<td class="FormContentFieldLabel">City</td>
<td class="FormContentFieldValue">St. Louis</td>
</tr>
<tr class="FormContent">
<td class="FormContentFieldLabel">State/Province</td>
<td class="FormContentFieldValue">Missouri [MO]</td>
</tr>
<tr class="FormContent">
<td class="FormContentFieldLabel">Job Title</td>
<td class="FormContentFieldValue">Director, Graduate Studies in IS Management</td>
</tr>
<tr class="FormContent">
<td class="FormContentFieldLabel">Job ID</td>
<td class="FormContentFieldValue">524</td>
</tr>
<tr class="FormContent">
<td class="FormContentFieldLabel">Job Type</td>
<td class="FormContentFieldValue">Director</td>
</tr>

regards
NAT |
From: answers s. <fas...@gm...> - 2008-05-26 09:15:16
|
Hi, I am seeing the same behaviour for the div tag as well. I am using it like this: NodeFilter filterStyleClass = new HasAttributeFilter("class",(String)list.get(j)); NodeList listStyleClass=parse.extractAllNodesThatMatch(filterStyleClass); I want to extract a div tag with the attribute "class", together with the child tags inside that div tag. |
From: <Sri...@ba...> - 2008-05-26 07:08:41
|
Hi Abdullah and everyone else,

Thank you for looking into my request for help. I have attached an example of the HTML file I want to parse using HTMLParser.

Regards,
Sridhar Venkataraman
Summer Analyst, Global Technology (Asia-Pacific)
Barclays Capital Services Ltd
60B Orchard Road #10-00, TheAtrium@Orchard, Singapore - 238891
+ (65) 6828 4609 (O)
+ (65) 9871 0076 (m) | sri...@ba...

-----Original Message-----
From: htm...@li... [mailto:htm...@li...] On Behalf Of htm...@li...
Sent: 22 May 2008 21:14
To: htm...@li...
Subject: Htmlparser-user Digest, Vol 23, Issue 3

Today's Topics:
1. Help with a link extraction program (Sri...@ba...)
2. Replacing attributes of DOCTYPE tag (?? ??)
3. Re: Help with a link extraction program (abdullah)
4. How to extract table without a nested table in it (answers solutions)
5. Re: How to extract table without a nested table in it (Derrick Oswald)

----------------------------------------------------------------------
Message: 1
Date: Tue, 20 May 2008 15:13:39 +0800
From: <Sri...@ba...>
Subject: [Htmlparser-user] Help with a link extraction program
To: <htm...@li...>
Message-ID: <B89...@SG...RCA PINT.COM>
Content-Type: text/plain; charset="us-ascii"

Hi everyone,

I am a new user of the HTMLParser API. I have found the link extraction features to be very useful even in this short space of time.

I would like to seek help with a program that I have to write. It involves link extraction, but the logic is slightly more convoluted.

Currently, I know how to use the LinkExtractor to supply a HTML document as input and output the links in that document to either the command prompt or a text file (with suitable modifications where required of course). I have a HTML document in which there is a hierarchy of links in the form of lists. I would like the output of the link information given by LinkExtractor to reflect this hierarchy in some way.

For example, I have a list of items in a <ul> tag. Each of these items may/may not contain their own sub-items with their own links, so that the HTML looks something like:

<ul>
<li> <a href="...."> Item 1 </a>
<ul>
<li> <a href="...."> Sub-Item 1 </a> </li>
<li> <a href="...."> Sub-Item 2 </a> </li>
</ul>
<li> Item 2 </li>
</ul>

I would like to know how I can parse a document full of lists like these and extract the links while having some indication of the hierarchy, either the "tree path" of the link (i.e. if I extract the link underlying Sub-Item 1 in my example, my text file should contain something along the lines of "Item 1 > Sub-Item 1" before printing the actual link path) or outputting a page identical to the one I am parsing but with the full path of the link printed beside each of those list items.

Thanks for all your help in this regard.

Warm Regards,
Sridhar Venkataraman
Summer Analyst, Global Technology (Asia-Pacific)
Barclays Capital Services Ltd
60B Orchard Road #10-00, TheAtrium@Orchard, Singapore - 238891
+ (65) 6828 4609 (O)
+ (65) 9871 0076 (m) | sri...@ba...

_______________________________________________
This e-mail may contain information that is confidential, privileged or otherwise protected from disclosure. If you are not an intended recipient of this e-mail, do not duplicate or redistribute it by any means. Please delete it and any attachments and notify the sender that you have received it in error. Unless specifically indicated, this e-mail is not an offer to buy or sell or a solicitation to buy or sell any securities, investment products or other financial product or service, an official confirmation of any transaction, or an official statement of Barclays. Any views or opinions presented are solely those of the author and do not necessarily represent those of Barclays. This e-mail is subject to terms available at the following link: www.barcap.com/emaildisclaimer. By messaging with Barclays you consent to the foregoing. Barclays Capital is the investment banking division of Barclays Bank PLC, a company registered in England (number 1026167) with its registered office at 1 Churchill Place, London, E14 5HP. This email may relate to or be sent from other members of the Barclays Group.
_______________________________________________

------------------------------
Message: 2
Date: Tue, 20 May 2008 17:34:15 +0900
From: ?? ?? <nag...@by...>
Subject: [Htmlparser-user] Replacing attributes of DOCTYPE tag
To: htm...@li...
Message-ID: <483...@by...>
Content-Type: text/plain; charset=ISO-2022-JP

Dear All,

I am new to HTML Parser, and I don't understand well how to handle the !DOCTYPE tag. In short, I'd like to replace a tag like this:

<!DOCTYPE html PUBLIC "XXXX" "AAAA">

with:

<! DOCTYPE html PUBLIC "YYYY" "BBBB">

I went through a lot of trial and error, but it didn't work. I'd appreciate it if you could give me advice.

(My e-mail address has changed.)

------------------------------
Message: 3
Date: Tue, 20 May 2008 15:37:18 +0300
From: abdullah <abd...@id...>
Subject: Re: [Htmlparser-user] Help with a link extraction program
To: "htmlparser user list" <htm...@li...>
Message-ID: <17d...@ma...>
Content-Type: text/plain; charset="iso-8859-1"

You don't need a LinkExtractor, you need a list extractor. If all the links are inside lists, you should get each list and navigate to its children, which are the links. For this case I suggest you parse the page with a filter as follows:

    Parser parser = new Parser();
    NodeList lists = parser.parse(new NodeClassFilter(BulletList.class));
    for (int i = 0; i < lists.size(); i++) {
        BulletList list = (BulletList) lists.elementAt(i);
        NodeList links = list.getChildren(); // this will give you another NodeList with the child tags
        // do whatever you want with the links; note that you need to cast each child from Node to LinkTag
    }

I didn't test this code, but hopefully it will work. If you give me a specific example of the HTML page you want to parse, I may be able to help more.

Good luck :)

------------------------------
Message: 4
Date: Thu, 22 May 2008 18:06:00 +0530
From: "answers solutions" <fas...@gm...>
Subject: [Htmlparser-user] How to extract table without a nested table in it
To: htm...@li...
Message-ID: <992...@ma...>
Content-Type: text/plain; charset="iso-8859-1"

Hi, I would like to extract a table that does not have a nested table inside it.

nodefilter filtertable = new AndFilter( new HasParentFilter(new TagNameFilter("table"),new NotFilter(new HasChildFilter(new TagNameFilter("table)));

In the output I still see a table with a nested table in it.

------------------------------
Message: 5
Date: Thu, 22 May 2008 06:14:19 -0700 (PDT)
From: Derrick Oswald <der...@ro...>
Subject: Re: [Htmlparser-user] How to extract table without a nested table in it
To: htmlparser user list <htm...@li...>
Message-ID: <423...@we...>
Content-Type: text/plain; charset="us-ascii"

You probably catch these because the inner tables are not direct children of the outer table. You need the HasChildFilter (NodeFilter filter, boolean recursive) constructor with recursive set to true.

------------------------------
End of Htmlparser-user Digest, Vol 23, Issue 3
********************************************** |
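For the nested-list question in Message 1 of the digest, the "tree path" output ("Item 1 > Sub-Item 1" followed by the link) comes down to a depth-first walk that keeps a trail of the enclosing item texts. The sketch below shows only that traversal, under stated assumptions: Item is a hypothetical in-memory stand-in for a parsed <li> (it is not an HtmlParser class), the href values are made up because the original post elides them as "....", and the " -> " separator is just one possible output format. With HtmlParser the same walk would presumably run over the children of each list tag instead.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: prints each link together with the path of list items above it.
class LinkPaths {
    // Stand-in for one <li>: its text, its link target (null if none), and nested sub-items.
    static final class Item {
        final String text;
        final String href;
        final List<Item> children = new ArrayList<>();

        Item(String text, String href) {
            this.text = text;
            this.href = href;
        }

        Item add(Item child) {
            children.add(child);
            return this;
        }
    }

    // Returns one line per linked item, e.g. "Item 1 > Sub-Item 1 -> sub1.html".
    static List<String> paths(List<Item> items) {
        List<String> out = new ArrayList<>();
        walk(items, new ArrayDeque<>(), out);
        return out;
    }

    // Depth-first walk; trail holds the texts of the items enclosing the current one.
    private static void walk(List<Item> items, Deque<String> trail, List<String> out) {
        for (Item item : items) {
            trail.addLast(item.text);
            if (item.href != null) {
                out.add(String.join(" > ", trail) + " -> " + item.href);
            }
            walk(item.children, trail, out);
            trail.removeLast(); // leave this item before moving to its next sibling
        }
    }

    public static void main(String[] args) {
        Item item1 = new Item("Item 1", "item1.html")
            .add(new Item("Sub-Item 1", "sub1.html"))
            .add(new Item("Sub-Item 2", "sub2.html"));
        Item item2 = new Item("Item 2", null); // plain text item, no link
        for (String line : paths(Arrays.asList(item1, item2))) {
            System.out.println(line);
        }
    }
}
```

Items without a link, such as "Item 2" in the example, still contribute to the trail of their own sub-items but produce no output line themselves.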
From: Derrick O. <der...@ro...> - 2008-05-23 21:28:00
|
The HtmlParser doesn't come with a Font tag that is composite. If you want this behaviour you need to define your own tag as described here: http://htmlparser.sourceforge.net/faq.html#composite |
From: abdullah <abd...@id...> - 2008-05-23 13:47:45
|
What I've understood is that you want the child tags of the FONT tag, and that you've been able to get the FONT tag itself. So just call the getChildren() function on the Node you've extracted, e.g.: NodeList children = fontTag.getChildren(); |
From: answers s. <fas...@gm...> - 2008-05-23 09:00:02
|
Hi,

I am filtering like this:

NodeFilter filterClass = new AndFilter(new TagNameFilter("font"), new HasAttributeFilter("class", "leftnavi"));

I am using this filter against this text:

<FONT class=leftnavi size=2>
<a href="http://epaper.thehindu.com">ePaper</a><br>
<A href="01hdline.htm">Front Page</A><BR>
<A href="02hdline.htm">National</A><BR>
<font class=leftnavi color=black>States:</font><br>
• <A href="23hdline.htm">Tamil Nadu</A><BR>
• <A href="21hdline.htm">Andhra Pradesh</A><BR>
• <A href="22hdline.htm">Karnataka</A><BR>
• <A href="25hdline.htm">Kerala</A><BR>
• <A href="24hdline.htm">New Delhi</A><BR>
• <A href="14hdline.htm">Other States</A><BR>
<A href="03hdline.htm">International</A><BR>
<A href="05hdline.htm">Opinion</A><BR>
<A href="06hdline.htm">Business</A><BR>
<A href="07hdline.htm">Sport</A><BR>
<A href="10hdline.htm">Miscellaneous</A><BR>
• <A href="10hdline.htm#019">Cartoons</A><BR>
<A href="26hdline.htm">Engagements</A><BR>
</FONT>

But the output I am getting is only:

<FONT class=leftnavi size=2>

I want the whole FONT tag as output:

<FONT class=leftnavi size=2>
<a href="http://epaper.thehindu.com">ePaper</a><br>
<A href="01hdline.htm">Front Page</A><BR>
<A href="02hdline.htm">National</A><BR>
<font class=leftnavi color=black>States:</font><br>
• <A href="23hdline.htm">Tamil Nadu</A><BR>
• <A href="21hdline.htm">Andhra Pradesh</A><BR>
• <A href="22hdline.htm">Karnataka</A><BR>
• <A href="25hdline.htm">Kerala</A><BR>
• <A href="24hdline.htm">New Delhi</A><BR>
• <A href="14hdline.htm">Other States</A><BR>
<A href="03hdline.htm">International</A><BR>
<A href="05hdline.htm">Opinion</A><BR>
<A href="06hdline.htm">Business</A><BR>
<A href="07hdline.htm">Sport</A><BR>
<A href="10hdline.htm">Miscellaneous</A><BR>
• <A href="10hdline.htm#019">Cartoons</A><BR>
<A href="26hdline.htm">Engagements</A><BR>
</FONT> |
From: Derrick O. <der...@ro...> - 2008-05-22 13:14:28
|
You probably still catch these because the inner tables are not direct children of the outer table; they are typically nested inside its TR and TD tags. You need the HasChildFilter(NodeFilter filter, boolean recursive) constructor with recursive set to true.

----- Original Message ----
From: answers solutions <fas...@gm...>
To: htm...@li...
Sent: Thursday, May 22, 2008 5:36:00 AM
Subject: [Htmlparser-user] How to extract table without a nested table in it

Hi,

I would like to extract a table so that it does not have a nested table inside it.

NodeFilter filterTable = new AndFilter(new HasParentFilter(new TagNameFilter("table")), new NotFilter(new HasChildFilter(new TagNameFilter("table"))));

Still, in the output I see a table with a nested table in it. |
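The difference between a direct-child check and a recursive one can be sketched without the library. This is only an illustration under stated assumptions: the `Tag` class and the `hasDirectChild` / `hasDescendant` helpers below are hypothetical stand-ins written for this example, not the HTMLParser API, and the sample tree assumes the usual table > tr > td > table nesting.

```java
import java.util.ArrayList;
import java.util.List;

final class ChildCheck {
    /** Hypothetical stand-in for a parsed tag; not the HTMLParser Node class. */
    static final class Tag {
        final String name;
        final List<Tag> children = new ArrayList<>();

        Tag(String name, Tag... kids) {
            this.name = name;
            for (Tag kid : kids) {
                children.add(kid);
            }
        }
    }

    /** Non-recursive check: only the immediate children are inspected. */
    static boolean hasDirectChild(Tag tag, String childName) {
        for (Tag child : tag.children) {
            if (child.name.equals(childName)) {
                return true;
            }
        }
        return false;
    }

    /** Recursive check: every descendant is inspected, however deeply nested. */
    static boolean hasDescendant(Tag tag, String childName) {
        for (Tag child : tag.children) {
            if (child.name.equals(childName) || hasDescendant(child, childName)) {
                return true;
            }
        }
        return false;
    }

    /** Builds table > tr > td > table, an assumed shape for a nested table. */
    static Tag nestedTable() {
        return new Tag("table", new Tag("tr", new Tag("td", new Tag("table"))));
    }

    public static void main(String[] args) {
        Tag outer = nestedTable();
        System.out.println("direct child table: " + hasDirectChild(outer, "table"));
        System.out.println("descendant table:   " + hasDescendant(outer, "table"));
    }
}
```

With only the direct-child check, the outer table looks as if it had no table inside it, which would explain why a negated non-recursive filter still lets it through.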
From: answers s. <fas...@gm...> - 2008-05-22 12:36:02
|
Hi,

I would like to extract a table so that it does not have a nested table inside it.

NodeFilter filterTable = new AndFilter(new HasParentFilter(new TagNameFilter("table")), new NotFilter(new HasChildFilter(new TagNameFilter("table"))));

Still, in the output I see a table with a nested table in it. |
From: abdullah <abd...@id...> - 2008-05-20 12:37:28
|
You don't need a LinkExtractor, you need a list extractor. If all the links are inside lists, you should get each list and navigate to its children to reach the links. For this case I suggest you parse the page with a filter as follows:

Parser parser = new Parser(url); // url: the page you want to parse
NodeList lists = parser.parse(new NodeClassFilter(BulletList.class));
for (int i = 0; i < lists.size(); i++) {
    BulletList list = (BulletList) lists.elementAt(i);
    NodeList children = list.getChildren(); // this will give you another NodeList with the child tags
    // do whatever you want with the links; note that the children come back as Node,
    // so check each one's type before casting it from Node to LinkTag
}

I didn't test this code, but hopefully it will work. If you give me a specific example of the HTML page you want to parse, I may be able to help more.

Good luck : )

On Tue, May 20, 2008 at 10:13 AM, <Sri...@ba...> wrote:
> Hi everyone,
>
> I am a new user of the HTMLParser API. I have found the link extraction
> features to be very useful even in this short space of time.
>
> I would like to seek help with a program that I have to write. It
> involves link extraction, but the logic is slightly more convoluted.
>
> Currently, I know how to use the LinkExtractor to supply a HTML document
> as input and output the links in that document to either the command
> prompt or a text file (with suitable modifications where required of
> course). I have a HTML document in which there is a hierarchy of links
> in the form of lists. I would like the output of the link information
> given by LinkExtractor to reflect this hierarchy in some way.
>
> For example, I have a list of items in a <ul> tag. Each of these items
> may/may not contain their own sub-items with their own links, so that
> the HTML looks something like:
>
> <ul>
>   <li> <a href="...."> Item 1 </a>
>     <ul>
>       <li> <a href="...."> Sub-Item 1 </a> </li>
>       <li> <a href="...."> Sub-Item 2 </a> </li>
>     </ul>
>
>   <li> Item 2 </li>
> </ul>
>
> I would like to know how I can parse a document full of lists like these
> and extract the links while having some indication of the hierarchy,
> either the "tree path" of the link (i.e. if I extract the link
> underlying Sub-Item 1 in my example, my text file should contain
> something along the lines of "Item 1 > Sub-Item 1" before printing the
> actual link path) or outputting a page identical to the one I am parsing
> but with the full path of the link printed beside each of those list
> items.
>
> Thanks for all your help in this regard.
>
> Warm Regards,
>
> Sridhar Venkataraman
> Summer Analyst, Global Technology (Asia-Pacific)
> Barclays Capital Services Ltd
> 60B Orchard Road #10-00, TheAtrium@Orchard,
> Singapore - 238891
> + (65) 6828 4609 (O)
> + (65) 9871 0076 (m) | sri...@ba... |
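The "tree path" output asked for in this thread comes down to a depth-first walk that carries the path of labels along with it. A minimal sketch of that idea follows, assuming the nested lists have already been turned into some tree of items. The `Item` class, the `collect` helper, the " -> " separator and the sample hrefs (`item1.html` and so on) are all made up for illustration; they are not part of HTMLParser or of the original page.

```java
import java.util.ArrayList;
import java.util.List;

final class LinkPaths {
    /** Hypothetical model of one list item: label, optional link, nested sub-items. */
    static final class Item {
        final String label;
        final String href; // null when the item carries no link
        final List<Item> subItems = new ArrayList<>();

        Item(String label, String href, Item... subs) {
            this.label = label;
            this.href = href;
            for (Item sub : subs) {
                subItems.add(sub);
            }
        }
    }

    /** Depth-first walk that records "path -> href" for every item that has a link. */
    static void collect(List<Item> items, String prefix, List<String> out) {
        for (Item item : items) {
            String path = prefix.isEmpty() ? item.label : prefix + " > " + item.label;
            if (item.href != null) {
                out.add(path + " -> " + item.href);
            }
            collect(item.subItems, path, out);
        }
    }

    /** Mirrors the list in the question; the hrefs are placeholders. */
    static List<Item> sample() {
        List<Item> top = new ArrayList<>();
        top.add(new Item("Item 1", "item1.html",
                new Item("Sub-Item 1", "sub1.html"),
                new Item("Sub-Item 2", "sub2.html")));
        top.add(new Item("Item 2", null));
        return top;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        collect(sample(), "", lines);
        for (String line : lines) {
            System.out.println(line);
        }
    }
}
```

Passing the accumulated path down as a parameter keeps the walk stateless, so the same method handles any nesting depth.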
From: 長弘 大樹 <nag...@by...> - 2008-05-20 08:34:13
|
Dear All,

I am new to HTML Parser, and I don't understand well how to handle the !DOCTYPE tag. In short, I'd like to replace a tag like this:

<!DOCTYPE html PUBLIC "XXXX" "AAAA">

with:

<!DOCTYPE html PUBLIC "YYYY" "BBBB">

I went through a lot of trial and error, but it didn't work. I'd appreciate it if you could give me advice.

(My e-mail address has changed.) |
From: <Sri...@ba...> - 2008-05-20 07:13:52
|
Hi everyone,

I am a new user of the HTMLParser API. I have found the link extraction features to be very useful even in this short space of time.

I would like to seek help with a program that I have to write. It involves link extraction, but the logic is slightly more convoluted.

Currently, I know how to use the LinkExtractor to supply a HTML document as input and output the links in that document to either the command prompt or a text file (with suitable modifications where required of course). I have a HTML document in which there is a hierarchy of links in the form of lists. I would like the output of the link information given by LinkExtractor to reflect this hierarchy in some way.

For example, I have a list of items in a <ul> tag. Each of these items may/may not contain their own sub-items with their own links, so that the HTML looks something like:

<ul>
  <li> <a href="...."> Item 1 </a>
    <ul>
      <li> <a href="...."> Sub-Item 1 </a> </li>
      <li> <a href="...."> Sub-Item 2 </a> </li>
    </ul>

  <li> Item 2 </li>
</ul>

I would like to know how I can parse a document full of lists like these and extract the links while having some indication of the hierarchy, either the "tree path" of the link (i.e. if I extract the link underlying Sub-Item 1 in my example, my text file should contain something along the lines of "Item 1 > Sub-Item 1" before printing the actual link path) or outputting a page identical to the one I am parsing but with the full path of the link printed beside each of those list items.

Thanks for all your help in this regard.

Warm Regards,

Sridhar Venkataraman
Summer Analyst, Global Technology (Asia-Pacific)
Barclays Capital Services Ltd
60B Orchard Road #10-00, TheAtrium@Orchard,
Singapore - 238891
+ (65) 6828 4609 (O)
+ (65) 9871 0076 (m) | sri...@ba...
_______________________________________________
This e-mail may contain information that is confidential, privileged or otherwise protected from disclosure.
If you are not an intended recipient of this e-mail, do not duplicate or redistribute it by any means. Please delete it and any attachments and notify the sender that you have received it in error. Unless specifically indicated, this e-mail is not an offer to buy or sell or a solicitation to buy or sell any securities, investment products or other financial product or service, an official confirmation of any transaction, or an official statement of Barclays. Any views or opinions presented are solely those of the author and do not necessarily represent those of Barclays. This e-mail is subject to terms available at the following link: www.barcap.com/emaildisclaimer. By messaging with Barclays you consent to the foregoing. Barclays Capital is the investment banking division of Barclays Bank PLC, a company registered in England (number 1026167) with its registered office at 1 Churchill Place, London, E14 5HP. This email may relate to or be sent from other members of the Barclays Group. _______________________________________________ |
From: <bo...@ti...> - 2008-05-14 08:23:47
|
All the pages which don't work come from the same source. They all have these meta tags.

I believe there is an option to force decoding with a different character set, but with the way I retrieve the pages I don't seem to have the opportunity to do so. If someone can give me a few lines of sample code on how to do that, I would appreciate it.

What I do at the moment is:

parser = new Parser(URL);
ThePage = parser.parse(null);
MyPage = ThePage.toHtml();

And that doesn't give the opportunity to change the decoding. I believe you can read the page and then "force" decoding with a different character set, but I can't figure out how to do that. Is there an example somewhere of how to do this?

Thanks again
Brian

----- Original Message ----
There might be an issue between ISO-8859-1 and UTF-8. Here's a random explanation, out of many on the net: http://www.stanford.edu/~laurik/fsmbook/faq/utf8.html

You'll have to determine if the character you want has an encoding in ISO-8859-1. The parser should switch to interpreting in UTF-8 when it encounters the meta tag. Do all pages have the meta tag, or just the ones that are OK? |
From: <bo...@ti...> - 2008-05-13 07:34:11
|
Thanks Derrick,

The relevant section of the ConnectionMonitor output is:

INFO: HTTP/1.1 200 OK
Cache-Control: private
Content-Type: text/html; charset=ISO-8859-1
Transfer-Encoding: chunked

Does that help?

Thanks
Brian

----- Original Message ----
That <meta> tag doesn't look like the problem. If you use the built in ConnectionMonitor on the parser, you can see the header:

C:>java -classpath parser\target\htmlparser.jar;lexer\target\htmllexer.jar org.htmlparser.Parser http://cbc.ca
INFO: GET http://cbc.ca HTTP/1.1
Accept-Encoding: gzip, deflate
User-Agent: HTMLParser/2.0
INFO: HTTP/1.1 301 Moved Permanently
Date: Tue, 13 May 2008 01:12:31 GMT
Server: Apache/2.0.59 (Linux/SuSE) mod_jk/1.2.6-dev
Location: http://www.cbc.ca/
Cache-Control: max-age=120
Expires: Tue, 13 May 2008 01:14:31 GMT
Content-Length: 226
Keep-Alive: timeout=15, max=150
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1
INFO: GET http://www.cbc.ca/ HTTP/1.1
Accept-Encoding: gzip, deflate
User-Agent: HTMLParser/2.0
INFO: HTTP/1.1 200 OK
Server: Apache/2.0.59 (Linux/SuSE) mod_jk/1.2.6-dev
Accept-Ranges: bytes
Content-Type: text/html
Cache-Control: max-age=61
Expires: Tue, 13 May 2008 01:13:32 GMT
Date: Tue, 13 May 2008 01:12:31 GMT
Content-Length: 28625
Connection: keep-alive |
From: Derrick O. <der...@ro...> - 2008-05-13 01:17:42
|
That <meta> tag doesn't look like the problem. If you use the built in ConnectionMonitor on the parser, you can see the header:

C:>java -classpath parser\target\htmlparser.jar;lexer\target\htmllexer.jar org.htmlparser.Parser http://cbc.ca
INFO: GET http://cbc.ca HTTP/1.1
Accept-Encoding: gzip, deflate
User-Agent: HTMLParser/2.0
INFO: HTTP/1.1 301 Moved Permanently
Date: Tue, 13 May 2008 01:12:31 GMT
Server: Apache/2.0.59 (Linux/SuSE) mod_jk/1.2.6-dev
Location: http://www.cbc.ca/
Cache-Control: max-age=120
Expires: Tue, 13 May 2008 01:14:31 GMT
Content-Length: 226
Keep-Alive: timeout=15, max=150
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1
INFO: GET http://www.cbc.ca/ HTTP/1.1
Accept-Encoding: gzip, deflate
User-Agent: HTMLParser/2.0
INFO: HTTP/1.1 200 OK
Server: Apache/2.0.59 (Linux/SuSE) mod_jk/1.2.6-dev
Accept-Ranges: bytes
Content-Type: text/html
Cache-Control: max-age=61
Expires: Tue, 13 May 2008 01:13:32 GMT
Date: Tue, 13 May 2008 01:12:31 GMT
Content-Length: 28625
Connection: keep-alive

----- Original Message ----
From: "bo...@ti..." <bo...@ti...>
To: htm...@li...
Sent: Monday, May 12, 2008 7:55:56 AM
Subject: [Htmlparser-user] Character Encoding

Thanks Derrick,

The page in question includes the following tags:

<META http-equiv=Content-Type content="text/html; charset=utf-8">
<META http-equiv=content-type>

I don't understand why the second one is there, but it really is. With that information can you suggest a resolution? I am not entirely sure how to verify your point (1).

Best Regards
Brian |
From: <bo...@ti...> - 2008-05-12 11:56:08
|
Thanks Derrick,

The page in question includes the following tags:

<META http-equiv=Content-Type content="text/html; charset=utf-8">
<META http-equiv=content-type>

I don't understand why the second one is there, but it really is. With that information can you suggest a resolution? I am not entirely sure how to verify your point (1).

Best Regards
Brian

-----------------------------------------------------------------------------
There are two possibilities.

1) The HTTP server is/is not serving up content type meta information in the HTTP header like so:
   text/html; charset=utf-8
2) The source HTML does/does not contain a meta tag like so:
   <meta http-equiv="Content-type" content="text/html; charset=utf-8" />

You need to determine which one so the appropriate 'fix' can be applied. |
From: <bo...@ti...> - 2008-05-12 11:31:57
|
Hi,

I have a strange problem and I can't get my head around it. Hopefully someone can point me in the right direction.

I'm using the following code with HTMLParser 1.6 to retrieve web pages:

parser = new Parser(URL);
ThePage = parser.parse(null);
MyPage = ThePage.toHtml();

On some pages (not all), if the HTML page contains "£10 Free", MyPage contains "?10 Free". On other pages it works fine.

I guess it has something to do with character encoding? Can someone suggest what I should add, and where, to get this to work correctly? (I would like to keep the "£10 Free".)

Thanks in advance
Brian
_______________________________
How can you protect children online? Find out - http://www.tiscali.co.uk/protection |
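The "?10 Free" symptom in this thread is the kind of result a charset mismatch produces, and it can be reproduced with the Java standard library alone. The sketch below does not use HTMLParser and makes no claim about what the parser does internally; the `CharsetDemo` class and its `decode` and `asciiRoundTrip` helpers are names invented for this example. The pound sign is written as the escape `\u00A3` so the source file's own encoding does not affect the outcome.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetDemo {
    /** Decodes raw page bytes with the given charset. */
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    /** Pushes text through a charset that cannot represent every character. */
    static String asciiRoundTrip(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String original = "\u00A310 Free"; // pound sign followed by "10 Free"
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = original.getBytes(StandardCharsets.ISO_8859_1);

        // Same charset on both sides: the text survives unchanged.
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(original));

        // UTF-8 bytes read as ISO-8859-1: an extra U+00C2 appears before the pound sign.
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1).equals("\u00C2\u00A310 Free"));

        // ISO-8859-1 bytes read as UTF-8: the lone byte is malformed and becomes U+FFFD.
        System.out.println(decode(latin1, StandardCharsets.UTF_8).equals("\uFFFD10 Free"));

        // A charset with no pound sign at all substitutes a question mark.
        System.out.println(asciiRoundTrip(original));
    }
}
```

Whichever of these cases applies to a given page, the remedy has the same shape: the bytes have to be decoded with the charset the page was really written in, which is why the header and meta tag values discussed above matter.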
From: Derrick O. <der...@ro...> - 2008-05-11 13:03:18
|
A brute force approach would be to generate the parse tree in a NodeList with Parser.parse(null). Then recursively traverse the tree, converting each sublist into text, until a plain text match occurs. In pseudo code the method would look something like this:

findString (string, node_list)
    make a new StringBean
    apply visitAllNodesWith to the node list using the string_bean
    get the plain_text from the string_bean
    if string matches plain_text
        you are done, return the node_list
    else
        for each child in node_list
            try recursing into findString with the string and child

----- Original Message ----
From: Davide Taibi <da...@ta...>
To: htmlparser user list <htm...@li...>
Sent: Sunday, May 11, 2008 1:10:04 AM
Subject: Re: [Htmlparser-user] Regex Filter

Unfortunately I think that I need to remember the container tag. I'll try to explain my problem better.

My aim is to extract all the text included in a tag that contains a substring. I have a list of excerpts from an RSS feed, and I need to extract the whole content of a web post knowing only the excerpt (the first sentence of the post).

For example, I have this excerpt:

"Davide Taibi, Luigi Lavazza, and Sandro Morasca Università dell'Insubria People and organizations that are considering the adoption of OSS..."

and I have to extract the content of this post: http://www.taibi.it/?p=39

The first part of the excerpt is in a <strong> tag while the second is not. My idea is to find the container tag and then extract all the content.

Which strategy should I use?

Thanks
Davide |
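The brute force search described in this thread can be sketched on a tiny hand-built tree. This is a variation under stated assumptions, not the pseudo code verbatim: it treats a "match" as the node's whitespace-normalised plain text containing the search string, and it returns the deepest tag that still contains it. The `Node` class and the `leaf`, `element`, `plainText` and `find` helpers are hypothetical; they stand in for the parser's node types and for StringBean and are not the HTMLParser API.

```java
import java.util.ArrayList;
import java.util.List;

final class FindContainer {
    /** Hypothetical node: a text leaf (tag == null) or a tag with children. */
    static final class Node {
        final String tag;
        final String text;
        final List<Node> children = new ArrayList<>();

        private Node(String tag, String text) {
            this.tag = tag;
            this.text = text;
        }

        static Node leaf(String value) {
            return new Node(null, value);
        }

        static Node element(String name, Node... kids) {
            Node node = new Node(name, null);
            for (Node kid : kids) {
                node.children.add(kid);
            }
            return node;
        }
    }

    /** Concatenates all text below the node and collapses runs of whitespace. */
    static String plainText(Node node) {
        StringBuilder sb = new StringBuilder();
        append(node, sb);
        return sb.toString().replaceAll("\\s+", " ").trim();
    }

    private static void append(Node node, StringBuilder sb) {
        if (node.text != null) {
            sb.append(node.text);
        }
        for (Node child : node.children) {
            append(child, sb);
        }
    }

    /** Returns the deepest tag whose plain text contains needle, or null if none does. */
    static Node find(String needle, Node node) {
        if (!plainText(node).contains(needle)) {
            return null; // nothing below this node can match either
        }
        for (Node child : node.children) {
            if (child.tag == null) {
                continue; // only tags count as containers
            }
            Node hit = find(needle, child);
            if (hit != null) {
                return hit;
            }
        }
        return node;
    }

    /** Builds a tree shaped like "sentence b" from the question, wrapped in a body. */
    static Node sample() {
        Node post = Node.element("div",
                Node.leaf("Dear All, "),
                Node.element("br"),
                Node.leaf("After "),
                Node.element("i", Node.leaf("hours of trying")),
                Node.leaf(" to sort the "),
                Node.element("strong",
                        Node.leaf(" problem with "),
                        Node.element("a", Node.leaf("uploading pictures")),
                        Node.leaf(" ")),
                Node.leaf("to this thing I decided..."));
        return Node.element("body", Node.element("h1", Node.leaf("My blog")), post);
    }

    public static void main(String[] args) {
        Node hit = find("After hours of trying to sort the problem with uploading", sample());
        System.out.println(hit == null ? "text not found" : "container tag: " + hit.tag);
    }
}
```

Because the sentence spans the i, strong and a tags, no single child holds all of it, so the search settles on the enclosing div, which is the container whose full content can then be extracted.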
From: Davide T. <da...@ta...> - 2008-05-11 08:10:08
|
Unfortunately, I think I do need to remember the container tag. Let me try to explain my problem better.

My aim is to extract all the text inside the tag that contains a given substring. I have a list of excerpts from an RSS feed, and I need to extract the whole content of a web post knowing only its excerpt (the first sentence of the post).

For example, I have this excerpt:

"Davide Taibi, Luigi Lavazza, and Sandro Morasca* Università dell'Insubria* People and organizations that are considering the adoption of OSS..."

and I have to extract the content of this post: http://www.taibi.it/?p=39

The first part of the excerpt is inside a <strong> tag, while the second part is not. My idea is to find the container tag and then extract all of its content. Which strategy should I use?

Thanks,
Davide

On Sun, May 11, 2008 at 5:28 AM, Derrick Oswald <der...@ro...> wrote:
> Do you want to keep the tags? If not just use the StringBean to extract all
> the text and then look for the string to get its position.
> If you need to keep the tags it is more difficult.
> Someone else had modified the StringBean to remember the node or offset of
> each piece of text added to the buffer.
> This list of nodes or offsets could be used after a straight string
> comparison on the text to figure out the start and end node or offsets. From
> there you can extract the complete html.
|
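The "find the container tag" idea can be sketched in plain Java, independent of the HTML Parser API. This is only an illustration under stated assumptions: the start and end offsets of the excerpt in the HTML source are already known, the class and method names (`ContainerFinder`, `enclosingElement`) are hypothetical, and a simple regex tag scanner stands in for a real parser, so malformed markup, comments, and scripts are not handled.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of HTML Parser. */
final class ContainerFinder {
    // Matches opening, closing, and self-closing tags; group 2 is the tag name.
    private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");
    // A few elements that never have a closing tag (list is not exhaustive).
    private static final Set<String> VOID = Set.of("br", "hr", "img", "input", "meta", "link");

    private static final class Open {
        final String name;
        final int start;
        Open(String name, int start) { this.name = name; this.start = start; }
    }

    /** Returns the source of the innermost element covering [start, end), or null. */
    static String enclosingElement(String html, int start, int end) {
        Deque<Open> stack = new ArrayDeque<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String name = m.group(2).toLowerCase();
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = html.charAt(m.end() - 2) == '/';
            if (!closing) {
                if (!selfClosing && !VOID.contains(name)) {
                    stack.push(new Open(name, m.start()));
                }
                continue;
            }
            // Ignore stray closing tags that have no matching open element.
            if (stack.stream().noneMatch(o -> o.name.equals(name))) continue;
            // Pop until the matching open tag; unclosed inner elements are dropped.
            while (!stack.isEmpty()) {
                Open open = stack.pop();
                if (!open.name.equals(name)) continue;
                // Inner elements close first, so the first hit is the innermost one.
                if (open.start <= start && m.end() >= end) {
                    return html.substring(open.start, m.end());
                }
                break;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String html = "<div><p><strong>Davide Taibi</strong> People and organizations</p></div>";
        int start = html.indexOf("Davide");
        int end = html.indexOf("organizations") + "organizations".length();
        System.out.println(enclosingElement(html, start, end));
    }
}
```

Because the excerpt starts inside <strong> and ends outside it, the innermost element that covers the whole range is the surrounding paragraph rather than the <strong> element. Which ancestor counts as "the post" depends on the site, so in practice one might need to keep walking outward from this element.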
From: Derrick O. <der...@ro...> - 2008-05-11 03:29:16
|
Do you want to keep the tags? If not, just use the StringBean to extract all the text and then search for the string to get its position.

If you need to keep the tags, it is more difficult. Someone else modified the StringBean to remember the node or offset of each piece of text added to the buffer. After a straight string comparison on the text, this list of nodes or offsets could be used to work out the start and end nodes or offsets. From there you can extract the complete HTML.
|
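The offset-remembering idea can be sketched in plain Java. The modified StringBean itself is not shown in this thread, so the code below is not that implementation: a naive tag stripper stands in for it, and the names `OffsetText`, `strip`, and `htmlRangeOf` are hypothetical. It assumes reasonably well-formed markup, does not decode entities, and does not skip script or style content.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: tag-free text plus the HTML offset of every kept character. */
final class OffsetText {
    final String text;    // text with tags removed and whitespace runs collapsed
    final int[] offsets;  // offsets[i] = index in the HTML source of text.charAt(i)

    private OffsetText(String text, int[] offsets) {
        this.text = text;
        this.offsets = offsets;
    }

    static OffsetText strip(String html) {
        StringBuilder sb = new StringBuilder();
        List<Integer> map = new ArrayList<>();
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') { inTag = true; continue; }
            if (inTag) {
                if (c == '>') inTag = false;
                continue;
            }
            if (Character.isWhitespace(c)) {
                // Collapse whitespace runs (also across tags) and drop leading whitespace.
                if (sb.length() == 0 || sb.charAt(sb.length() - 1) == ' ') continue;
                c = ' ';
            }
            sb.append(c);
            map.add(i);
        }
        int[] offsets = new int[map.size()];
        for (int i = 0; i < offsets.length; i++) offsets[i] = map.get(i);
        return new OffsetText(sb.toString(), offsets);
    }

    /** Returns {start, endExclusive} in the HTML source, or null if the text is absent. */
    int[] htmlRangeOf(String needle) {
        String n = needle.trim().replaceAll("\\s+", " ");
        int at = n.isEmpty() ? -1 : text.indexOf(n);
        if (at < 0) return null;
        return new int[] { offsets[at], offsets[at + n.length() - 1] + 1 };
    }

    public static void main(String[] args) {
        String html = "Dear All, <br/>After <i>hours of trying</i> to sort the problem";
        int[] range = strip(html).htmlRangeOf("After hours of trying to sort");
        System.out.println(html.substring(range[0], range[1]));
    }
}
```

The plain string comparison runs on the tag-free text, and the offset table translates the hit back into a range in the HTML source, so the tags between the first and last matched characters are preserved in the extracted substring.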
From: Davide T. <da...@ta...> - 2008-05-10 21:30:42
|
Dear all,

I have a problem with regular expressions. I'd like to extract a block of text from an HTML page. I know how the text starts (the first 10 words), but I don't know whether there are any tags inside it.

In other words, I have to find out whether a sentence "A" appears in an HTML page "B". My problem is that sentence "A" is plain text, while page "B" is HTML, so the sentence could be spread across several nested nodes. The first sentence can therefore appear in the second with HTML tags or extra spaces between its words.

Example:

sentence a: "After hours of trying to sort the problem with uploading..."
sentence b: "Dear All, <br/>After <i>hours of trying</i> to sort the <strong> problem with <a href="xxxxxx.html" >uploading pictures</a> </strong>to this thing I decided..."

Sentence a should match sentence b at position 15.

I've tried the following, but it doesn't work:

    protected static String extractContent(String html, String searchText)
            throws ParserException {
        Page page = new Page(html);
        Lexer lex = new Lexer(page);
        Parser parser = new Parser(lex);
        NodeList list = new NodeList();

        NodeFilter filter = new RegexFilter(searchText);
        for (NodeIterator it = parser.elements(); it.hasMoreNodes();) {
            it.nextNode().collectInto(list, filter);
        }
        if (list.size() > 0) {
            System.out.println("text found " + list.size() + " times");
            return Translate.decode(list.toHtml());
        } else {
            System.out.println("text not found");
            return null;
        }
    }

Thanks in advance,

Davide Taibi
http://www.taibi.it
|
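A likely reason the attempt above finds nothing is that a filter of this kind is tested against each text node separately, and no single node contains the whole sentence. One alternative, sketched here with only `java.util.regex` and not the HTML Parser API, is to build a pattern from the words of the sentence that tolerates whitespace and tags between them and run it over the raw HTML. The names `TagTolerantMatcher`, `compile`, and `indexOf` are hypothetical, and the sketch assumes tags never fall in the middle of a word and that the caller has already removed the trailing "..." from the excerpt.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of HTML Parser. */
final class TagTolerantMatcher {
    // Between two words, allow one or more whitespace characters or tags in any mix.
    private static final String GAP = "(?:\\s|<[^>]*>)+";

    /** Builds a pattern matching the words of the sentence separated by GAP. */
    static Pattern compile(String sentence) {
        StringBuilder regex = new StringBuilder();
        for (String word : sentence.trim().split("\\s+")) {
            if (regex.length() > 0) regex.append(GAP);
            // Quote each word so punctuation such as '.' or '?' is matched literally.
            regex.append(Pattern.quote(word));
        }
        return Pattern.compile(regex.toString());
    }

    /** Returns the offset in the HTML where the sentence starts, or -1 if absent. */
    static int indexOf(String html, String sentence) {
        Matcher m = compile(sentence).matcher(html);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        String b = "Dear All, <br/>After <i>hours of trying</i> to sort the <strong> problem";
        System.out.println(indexOf(b, "After hours of trying to sort the problem"));
    }
}
```

Applied to sentence b from the message with the first ten words of sentence a, this reports offset 15, the position the message asks for. It only locates the start of the sentence; extracting the rest of the post still requires deciding which enclosing element to return.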