htmlparser-user Mailing List for HTML Parser

Brought to you by: derrickoswald

htmlparser-user — The user mailing list for users of the htmlparser library



Re: [Htmlparser-user] How to parse the form tag input attributes when table tag is placed above the form tag

From: Ian M. <ian...@gm...> - 2006-08-30 16:09:08

Can you give a copy of the file that shows this problem?

On 8/25/06, Srinivas N <sn...@os...> wrote:
>
>
>
> Hi all,
>
> Please help me; it is very urgent.
>
>
> I have HTML content with 48 input tags inside a form tag. When
> formTag.getFormInputs() is called on it, it returns all 48 inputs, even
> though the form contains many table tags. But when the same content,
> form tag included, is placed inside a table tag, the parser returns only
> 14 input tags instead of the expected 48.
>
> Please let me know whether the problem lies with the parser or with the
> way the table tag is placed above the form tag.
>
> with regards
> Srinivas
>
> _______________________________________________
> Htmlparser-user mailing list
> Htm...@li...
> https://lists.sourceforge.net/lists/listinfo/htmlparser-user
>
>
>

Re: [Htmlparser-user] parsing XHTML and XML

From: Eugeny N D. <bo...@re...> - 2006-08-28 08:10:15

On Fri, Aug 25, 2006 at 09:56:48AM +0100, Ian Macfarlane wrote:
> If it's guaranteed to be valid XML, I'd use an XML parser instead.
> Java has one built in, or look into Xerces.

The thing is I will get the document as input, and I don't know which of
formats - HTML, XHTML or XML - it will be, so I'm looking for common way to
build DOM for these formats.

-- 
Eugene N Dzhurinsky
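
When the input format is not known in advance, one possible approach is to try the JDK's built-in XML parser first and fall back to a lenient HTML parser only if the input is not well-formed. The following is a minimal sketch of that idea using only the standard library; the class name `InputKindSniffer`, the `Kind` enum, and the rule "root element named html means XHTML" are illustrative assumptions, not anything from htmlparser or from this thread.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: classifies input by attempting a strict XML parse.
class InputKindSniffer {
    enum Kind { XHTML, XML, NOT_WELL_FORMED }

    static Kind classify(String text) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Resolve every external entity (e.g. the XHTML DTD) to an empty
            // stream so that parsing never goes to the network.
            builder.setEntityResolver(
                    (publicId, systemId) -> new InputSource(new StringReader("")));
            // DefaultHandler rethrows fatal errors and stays quiet otherwise.
            builder.setErrorHandler(new DefaultHandler());
            Document doc = builder.parse(new InputSource(new StringReader(text)));
            // Assumption: a well-formed document whose root is <html> is XHTML.
            String root = doc.getDocumentElement().getLocalName();
            return "html".equals(root) ? Kind.XHTML : Kind.XML;
        } catch (SAXException e) {
            // Not well-formed XML: hand the text to a lenient HTML parser instead.
            return Kind.NOT_WELL_FORMED;
        } catch (Exception e) {
            throw new IllegalStateException("Unexpected parser failure", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(classify("<feed><entry/></feed>"));
        System.out.println(classify("<html><body><p>unclosed<br></body></html>"));
    }
}
```

Input classified as `NOT_WELL_FORMED` is the case where a tolerant parser such as htmlparser would still be needed; the two branches produce different tree types, so a common DOM would require a conversion step that this sketch does not attempt.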

[Htmlparser-user] How to parse the form tag input attributes when table tag is placed above the form tag

From: Srinivas N <sn...@os...> - 2006-08-25 12:12:14

Hi all,

Please help me; it is very urgent.


I have HTML content with 48 input tags inside a form tag. When
formTag.getFormInputs() is called on it, it returns all 48 inputs, even
though the form contains many table tags. But when the same content,
form tag included, is placed inside a table tag, the parser returns only
14 input tags instead of the expected 48.

Please let me know whether the problem lies with the parser or with the
way the table tag is placed above the form tag.

with regards
Srinivas







Re: [Htmlparser-user] parsing XHTML and XML

From: Ian M. <ian...@gm...> - 2006-08-25 08:56:52

If it's guaranteed to be valid XML, I'd use an XML parser instead.
Java has one built in, or look into Xerces.

Ian
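
A minimal sketch of the built-in route Ian mentions, using the JAXP `DocumentBuilder` that ships with the JDK. It only works for well-formed XML or XHTML; the class name `XhtmlDomExample` and the sample markup are illustrative assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XhtmlDomExample {
    /** Parses well-formed XML (including well-formed XHTML) into a W3C DOM Document. */
    static Document parseXml(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Resolve external entities (such as the XHTML DTD) to an empty
            // stream so the parser does not try to download them.
            builder.setEntityResolver(
                    (publicId, systemId) -> new InputSource(new StringReader("")));
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseXml(
                "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>this is a test</title></head><body/></html>");
        // Prints the text of the first <title> element.
        System.out.println(doc.getElementsByTagName("title").item(0).getTextContent());
    }
}
```

Typical real-world HTML (unclosed tags, unquoted attributes) makes `parseXml` throw, which is why this route is only suitable when the input is guaranteed to be valid XML.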

On 8/23/06, Eugeny N Dzhurinsky <bo...@re...> wrote:
> Is it possible to parse XML documents as well as XHTML documents with
> htmlparser?
>
> --
> Eugene N Dzhurinsky
>

[Htmlparser-user] parsing XHTML and XML

From: Eugeny N D. <bo...@re...> - 2006-08-23 08:34:56

Is it possible to parse XML documents as well as XHTML documents with
htmlparser?

-- 
Eugene N Dzhurinsky

[Htmlparser-user] user stories

From: Derrick O. <Der...@Ro...> - 2006-08-10 02:59:48

Hi,

I would be interested to hear some real user stories.  The traffic on 
this list is pretty much all problems encountered - and solutions 
provided hopefully - but there must be a whole bunch of people who are 
using it for weird and wild projects without a problem.  After all there 
are 3000 downloads a month, and it's not that hard to use is it?

So how about it?  Tell us your success story or something small or large 
you are proud of accomplishing with htmlparser.

Derrick

[Htmlparser-user] (no subject)

From: lu d. <dom...@gm...> - 2006-08-09 02:38:54

Re: [Htmlparser-user] A strange question?

From: Derrick O. <Der...@Ro...> - 2006-08-08 20:37:07

Jesse,

The problem may be within the HtmlUtils.registerTags.
What does this do? What tags does it register?

The div tag filter will return multiple elements with the same text as
in the case of:
<div class='A'><div class='B'>the text</div></div>
will return a list containing two items:
1) <div class='A'><div class='B'>the text</div></div>
2) <div class='B'>the text</div>
which if you pass it to string extractor will return:
the textthe text

Derrick
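
The effect Derrick describes can be reproduced with any tree-based parser. The sketch below uses the JDK's own DOM API as a stand-in (it is not htmlparser code, and the class and method names are illustrative assumptions): a recursive search for div elements returns both the outer and the inner div, so their text is emitted twice, while keeping only divs that have no div ancestor emits it once.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NestedDivExample {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    /** Concatenates the text of every div; nested text is repeated once per enclosing div. */
    static String textOfAllDivs(Document doc) {
        StringBuilder sb = new StringBuilder();
        NodeList divs = doc.getElementsByTagName("div");
        for (int i = 0; i < divs.getLength(); i++) {
            sb.append(divs.item(i).getTextContent());
        }
        return sb.toString();
    }

    /** Returns only the divs that are not themselves inside another div. */
    static List<Element> outermostDivs(Document doc) {
        List<Element> result = new ArrayList<>();
        NodeList divs = doc.getElementsByTagName("div");
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            if (!hasDivAncestor(div)) {
                result.add(div);
            }
        }
        return result;
    }

    private static boolean hasDivAncestor(Node node) {
        for (Node p = node.getParentNode(); p != null; p = p.getParentNode()) {
            if (p instanceof Element && "div".equals(((Element) p).getTagName())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Document doc = parse("<div class='A'><div class='B'>the text</div></div>");
        System.out.println(textOfAllDivs(doc));        // prints "the textthe text"
        System.out.println(outermostDivs(doc).size()); // prints 1
    }
}
```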

hpq852 wrote:

> Hi All, I encountered a very strange question. My code is very simple
> as following:
> public void doTest() throws Exception
> {
> URL url = new URL("http://www.uume.com/play_CPRz8a2si4zK");
> InputStream in = url.openStream();
> BufferedReader br = new BufferedReader(new InputStreamReader(in,
> "GB2312"));
> String line = null;
> StringBuffer sb = new StringBuffer();
> while ((line = br.readLine()) != null)
> {
> sb.append(line);
> sb.append("\n");
> }
> extractText2(sb.toString());
> }
>
> public String extractText2(String inputHtml) throws Exception
> {
> Parser parser = Parser.createParser(new
> String(inputHtml.getBytes(),"GB2312"), "GB2312");
> HtmlUtils.registerTags(parser);
> NodeFilter tagNameFilter = new TagNameFilter("div");
> NodeList nodeList = parser.extractAllNodesThatMatch(tagNameFilter);
>
> System.out.println(nodeList.toHtml());
> return null;
> }
> I just want to get all the div tags, so I used a TagNameFilter, but the
> console output is strange: it includes many repeated div tags with the
> same content.
> I have tried many times with the same result, and I really don't know
> what the reason is. Could you help me, please?
> Thanks and Best Regards
> Jesse
>

[Htmlparser-user] A strange question?

From: hpq852 <hp...@gm...> - 2006-08-08 16:19:16

Hi All,  I have run into a very strange problem. My code is very simple, as follows:

public void doTest() throws Exception
{
    URL url = new URL("http://www.uume.com/play_CPRz8a2si4zK");
    InputStream in = url.openStream();
    BufferedReader br = new BufferedReader(new InputStreamReader(in, "GB2312"));
    String line = null;
    StringBuffer sb = new StringBuffer();
    while ((line = br.readLine()) != null)
    {
        sb.append(line);
        sb.append("\n");
    }
    extractText2(sb.toString());
}

public String extractText2(String inputHtml) throws Exception
{
    Parser parser = Parser.createParser(new String(inputHtml.getBytes(),"GB2312"), "GB2312");
    HtmlUtils.registerTags(parser);
    NodeFilter tagNameFilter = new TagNameFilter("div");
    NodeList nodeList = parser.extractAllNodesThatMatch(tagNameFilter);

    System.out.println(nodeList.toHtml());
    return null;
}
 
I just want to get all the div tags, so I used a TagNameFilter, but the console output is strange: it includes many repeated div tags with the same content.
I have tried many times with the same result, and I really don't know what the reason is. Could you help me, please?

Thanks and Best Regards
Jesse
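
One detail in the posted code, separate from the repeated-div question: `new String(inputHtml.getBytes(), "GB2312")` re-encodes an already decoded string with the platform default charset and then decodes those bytes as GB2312, which can garble non-ASCII text whenever the default charset is not compatible with GB2312. A minimal sketch of decoding the bytes exactly once follows; the class name `CharsetRead` is illustrative, and UTF-8 is used in the example only because it is guaranteed to be available on every JVM.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetRead {
    /** Reads the whole stream, decoding it once with the given charset. */
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The result is already correct Unicode text; pass it on unchanged
        // instead of round-tripping it through getBytes().
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] page = "<div>\u4e2d\u6587</div>".getBytes(StandardCharsets.UTF_8);
        String html = readAll(new ByteArrayInputStream(page), StandardCharsets.UTF_8);
        System.out.println(html.length()); // 14 chars: 13 for the markup plus one newline
    }
}
```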

Re: [Htmlparser-user] How to extract more than one tag by only once parsering?

From: Derrick O. <Der...@Ro...> - 2006-08-04 11:42:35

Jesse,

From your example, you can also get all the div tags at once and filter
on class in a secondary pass:

NodeList divs = nodelist.extractAllTagsThatMatch (new TagNameFilter ("DIV"));
DivTag div_a = divs.extractAllTagsThatMatch (new HasAttributeFilter ("class", "A")).element (0); // presuming there is only one
DivTag div_b = divs.extractAllTagsThatMatch (new HasAttributeFilter ("class", "B")).element (0); // presuming there is only one

and this may be faster than searching the entire page each time.

Derrick
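
The same parse-once, query-many pattern can be sketched with the JDK's DOM and XPath APIs. This is an analogue for well-formed markup only, not htmlparser code; the class name, the sample page, and the XPath expressions are illustrative assumptions based on Jesse's example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ParseOnceQueryMany {
    /** Parses the page a single time; the returned tree can be queried repeatedly. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    /** Evaluates an XPath expression against the already parsed tree. */
    static String query(Document doc, String expression) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("Bad XPath expression: " + expression, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<html><head><title>this is a test </title></head><body>"
                + "<div class=\"A\"><span class=\"B\">"
                + "<a href=\" www.google.com \">google</a></span></div>"
                + "<div class=\"C\"><div class=\"D\">"
                + "<input type=\"text\" id=\"E\" value=\"msn\" /></div></div>"
                + "</body></html>");
        // Each query walks the in-memory tree; the markup is not parsed again.
        System.out.println(query(doc, "normalize-space(//title)"));
        System.out.println(query(doc, "normalize-space(//div[@class='A']//a/@href)"));
        System.out.println(query(doc, "string(//div[@class='D']/input/@value)"));
    }
}
```

The sample page declares no XML namespace; a real XHTML document with an `xmlns` attribute would need namespace-aware parsing and a namespace context for these expressions to match.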

Ian Macfarlane wrote:

>As long as you keep the original reference to the NodeList created by
>Parser.parse, and you haven't modified that NodeList, you should be
>able to reuse it, I think.
>
>Ian
>
>On 8/3/06, Jesse Hou <hp...@gm...> wrote:
>  
>
>>Hi All,   When I'm using the htmlparser library, I suffered from a
>>difficulty. In a html there are many tags such as title, div, input, span
>>and so on. For example:
>>
>><title>this is a test </title>
>>
>>
>>//...... any other tags
>>
>><div class="A">
>>       <span class="B"><a href=" www.google.com ">google</a></span>
>></div>
>>
>>
>>//...... any other tags
>>
>><div class="C">
>>       <div class="D"><input type="text" id="E" value="msn" /></div>
>></div>
>>
>>//...... any other tags
>>
>>
>><div class="C">
>>       <div class="E"><span class="B"><input type="text" id="E" value="aol"
>>/><a href=" www.live.com ">live</a></span></div>
>></div>
>>
>>In this example maybe the whole html include many tags. if I want to get the
>>content 'this is a test',  maybe I can use a TagNameFilter, I have to parse
>>the whole html. If I want to get the content 'google' or ' www.google.com'
>>then I have to parse the whole html for the second time and if I want to get
>>'msn', 'aol', 'live' maybe I should parse the whole html for several times.
>>In this way I can get the content what I need but maybe this way will impact
>>the performance. Is there any other way to do that?  Maybe I can also use
>>OrFilter to get the Nodes but how can I identify a text match which tag? If
>>I want to store them into DB I have no idea how to do that by only once
>>parsing the html (the best performance).  I beg your help. :-)
>>
>>Thanks and Best Regards
>>
>>Jesse
>>
>  
>

Re: [Htmlparser-user] How to extract more than one tag by only once parsering?

From: Ian M. <ian...@gm...> - 2006-08-04 10:42:24

As long as you keep the original reference to the NodeList created by
Parser.parse, and you haven't modified that NodeList, you should be
able to reuse it, I think.

Ian

On 8/3/06, Jesse Hou <hp...@gm...> wrote:
>
> Hi All,   When I'm using the htmlparser library, I suffered from a
> difficulty. In a html there are many tags such as title, div, input, span
> and so on. For example:
>
> <title>this is a test </title>
>
>
> //...... any other tags
>
> <div class="A">
>        <span class="B"><a href=" www.google.com ">google</a></span>
> </div>
>
>
> //...... any other tags
>
> <div class="C">
>        <div class="D"><input type="text" id="E" value="msn" /></div>
> </div>
>
> //...... any other tags
>
>
> <div class="C">
>        <div class="E"><span class="B"><input type="text" id="E" value="aol"
> /><a href=" www.live.com ">live</a></span></div>
> </div>
>
> In this example maybe the whole html include many tags. if I want to get the
> content 'this is a test',  maybe I can use a TagNameFilter, I have to parse
> the whole html. If I want to get the content 'google' or ' www.google.com'
> then I have to parse the whole html for the second time and if I want to get
> 'msn', 'aol', 'live' maybe I should parse the whole html for several times.
> In this way I can get the content what I need but maybe this way will impact
> the performance. Is there any other way to do that?  Maybe I can also use
> OrFilter to get the Nodes but how can I identify a text match which tag? If
> I want to store them into DB I have no idea how to do that by only once
> parsing the html (the best performance).  I beg your help. :-)
>
> Thanks and Best Regards
>
> Jesse

[Htmlparser-user] How to extract more than one tag by only once parsering?

From: Jesse H. <hp...@gm...> - 2006-08-03 02:21:56

Hi All,   While using the htmlparser library I ran into a difficulty.
An HTML page contains many tags such as title, div, input, span and so
on. For example:

<title>this is a test </title>

//...... any other tags

<div class="A">
       <span class="B"><a href=" www.google.com ">google</a></span>
</div>

//...... any other tags

<div class="C">
       <div class="D"><input type="text" id="E" value="msn" /></div>
</div>

//...... any other tags

<div class="C">
       <div class="E"><span class="B"><input type="text" id="E" value="aol"
/><a href=" www.live.com ">live</a></span></div>
</div>

In this example the full HTML may include many tags. If I want the content
'this is a test', I can use a TagNameFilter, but I have to parse the whole
HTML. If I want 'google' or 'www.google.com', I have to parse the whole HTML
a second time, and to get 'msn', 'aol' and 'live' I would have to parse it
several more times. This gets me the content I need, but it may hurt
performance. Is there another way to do this?  I could also use an OrFilter
to get the nodes, but then how can I tell which tag a piece of text came
from? I want to store the values in a database, and I have no idea how to do
that while parsing the HTML only once (for the best performance).  I would
appreciate your help. :-)

Thanks and Best Regards

Jesse
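
One way to answer the "which tag did this text come from" part of the question is to walk the parsed tree once and record each value under a key that names the rule that matched. The sketch below shows the idea with the JDK's DOM API on well-formed markup; it is not htmlparser code, and the class name and the key names (`title`, `link-text`, `link-href`, `input-value`) are illustrative assumptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class SinglePassCollector {
    /** Parses once, walks the tree once, and groups the extracted values by rule. */
    static Map<String, List<String>> collect(String xml) {
        Map<String, List<String>> found = new LinkedHashMap<>();
        walk(parse(xml).getDocumentElement(), found);
        return found;
    }

    private static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    private static void walk(Element element, Map<String, List<String>> found) {
        // Each case is one extraction rule; the map key records which rule fired.
        switch (element.getTagName()) {
            case "title":
                record(found, "title", element.getTextContent().trim());
                break;
            case "a":
                record(found, "link-text", element.getTextContent().trim());
                record(found, "link-href", element.getAttribute("href").trim());
                break;
            case "input":
                record(found, "input-value", element.getAttribute("value"));
                break;
            default:
                break;
        }
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child instanceof Element) {
                walk((Element) child, found);
            }
        }
    }

    private static void record(Map<String, List<String>> found, String key, String value) {
        found.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    public static void main(String[] args) {
        System.out.println(collect(
                "<html><head><title>this is a test </title></head><body>"
                + "<div class=\"A\"><span class=\"B\"><a href=\" www.google.com \">google</a></span></div>"
                + "<div class=\"C\"><div class=\"D\"><input type=\"text\" id=\"E\" value=\"msn\" /></div></div>"
                + "</body></html>"));
    }
}
```

The resulting map can be written to a database row by row, since every value is already labeled with the rule that produced it.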
