htmlparser-user Mailing List for HTML Parser
Brought to you by:
derrickoswald
From: Ian M. <ian...@gm...> - 2006-08-30 16:09:08
Can you give a copy of the file that shows this problem?

On 8/25/06, Srinivas N <sn...@os...> wrote:
> Hi all,
>
> Please help me, it is very urgent.
>
> I have HTML content with 48 input tags inside a form tag, along with
> many table tags inside the form. When formTag.getFormInputs() is called,
> it returns a count of 48. But when the same content, including the form
> tag, is placed inside a table tag, the parser parses only up to 14 input
> tags and does not return the expected count of 48.
>
> Please let me know whether the problem is with the parser or with the
> way the table tag is placed above the form tag.
>
> With regards,
> Srinivas
From: Eugeny N D. <bo...@re...> - 2006-08-28 08:10:15
On Fri, Aug 25, 2006 at 09:56:48AM +0100, Ian Macfarlane wrote:
> If it's guaranteed to be valid XML, I'd use an XML parser instead.
> Java has one built in, or look into Xerces.

The thing is, I will get the document as input, and I don't know which of the formats (HTML, XHTML or XML) it will be, so I'm looking for a common way to build a DOM for all of them.

--
Eugene N Dzhurinsky
From: Srinivas N <sn...@os...> - 2006-08-25 12:12:14
Hi all,

Please help me, it is very urgent.

I have HTML content with 48 input tags inside a form tag, along with many table tags inside the form. When formTag.getFormInputs() is called, it returns a count of 48. But when the same content, including the form tag, is placed inside a table tag, the parser parses only up to 14 input tags and does not return the expected count of 48.

Please let me know whether the problem is with the parser or with the way the table tag is placed above the form tag.

With regards,
Srinivas
From: Ian M. <ian...@gm...> - 2006-08-25 08:56:52
If it's guaranteed to be valid XML, I'd use an XML parser instead. Java has one built in, or look into Xerces.

Ian

On 8/23/06, Eugeny N Dzhurinsky <bo...@re...> wrote:
> Is it possible to parse XML documents as well as XHTML documents with
> htmlparser?
>
> --
> Eugene N Dzhurinsky
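To illustrate the "Java has one built in" suggestion, here is a minimal sketch that uses the JAXP DOM parser shipped with the JDK to build a DOM from well-formed XML or XHTML text. The class name `XmlDomSketch` and the sample markup are made up for the example; the sketch assumes the input really is well-formed, since the parser throws an exception on ordinary tag-soup HTML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration; not part of htmlparser.
final class XmlDomSketch {

    /** Parses well-formed XML/XHTML text into a DOM document. */
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    /** Counts all elements with the given tag name anywhere in the document. */
    static int countElements(Document doc, String tagName) {
        return doc.getElementsByTagName(tagName).getLength();
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample input: a tiny well-formed XHTML fragment.
        String xhtml = "<html><body><form>"
                + "<input type=\"text\"/><input type=\"submit\"/>"
                + "</form></body></html>";
        Document doc = parse(xhtml);
        System.out.println(countElements(doc, "input")); // prints 2
    }
}
```

Because the parser is strict, this only covers the XML and well-formed XHTML cases; a lenient HTML parser is still needed for everything else.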
From: Eugeny N D. <bo...@re...> - 2006-08-23 08:34:56
Is it possible to parse XML documents as well as XHTML documents with htmlparser?

--
Eugene N Dzhurinsky
From: Derrick O. <Der...@Ro...> - 2006-08-10 02:59:48
Hi,

I would be interested to hear some real user stories.

The traffic on this list is pretty much all problems encountered (and, hopefully, solutions provided), but there must be a whole bunch of people who are using it for weird and wild projects without a problem. After all, there are 3000 downloads a month, and it's not that hard to use, is it?

So how about it? Tell us your success story or something small or large you are proud of accomplishing with htmlparser.

Derrick
From: lu d. <dom...@gm...> - 2006-08-09 02:38:54
|
From: Derrick O. <Der...@Ro...> - 2006-08-08 20:37:07
Jesse,

The problem may be within the HtmlUtils.registerTags. What does this do? What tags does it register?

The div tag filter will return multiple elements with the same text, as in the case of:

    <div class='A'><div class='B'>the text</div></div>

which will return a list containing two items:

    1) <div class='A'><div class='B'>the text</div></div>
    2) <div class='B'>the text</div>

which, if you pass it to the string extractor, will return:

    the textthe text

Derrick

hpq852 wrote:
> Hi All, I encountered a very strange question. My code is very simple
> as following:
>
> public void doTest() throws Exception
> {
>     URL url = new URL("http://www.uume.com/play_CPRz8a2si4zK");
>     InputStream in = url.openStream();
>     BufferedReader br = new BufferedReader(new InputStreamReader(in, "GB2312"));
>     String line = null;
>     StringBuffer sb = new StringBuffer();
>     while ((line = br.readLine()) != null)
>     {
>         sb.append(line);
>         sb.append("\n");
>     }
>     extractText2(sb.toString());
> }
>
> public String extractText2(String inputHtml) throws Exception
> {
>     Parser parser = Parser.createParser(new String(inputHtml.getBytes(),"GB2312"), "GB2312");
>     HtmlUtils.registerTags(parser);
>     NodeFilter tagNameFilter = new TagNameFilter("div");
>     NodeList nodeList = parser.extractAllNodesThatMatch(tagNameFilter);
>
>     System.out.println(nodeList.toHtml());
>     return null;
> }
>
> I just want to get all of div tags, so I used a TagNameFilter, but the
> result I got in the console is strange, it includes many repeated div
> tags with same content.
> I have tried for many times, but what I got was the same, I really
> don't know what the reason is. Could you help me please?
>
> Thanks and Best Regards
> Jesse
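The duplication Derrick describes can be reproduced without the library. The following is a minimal sketch using a hypothetical `Tag` class (a stand-in for a parsed node, not an htmlparser type): a recursive search that keeps descending into matching tags returns both the outer and the inner div, while a search that stops at the first match on each branch returns only the outermost one.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model for illustration; none of these names come from htmlparser.
final class NestedMatchSketch {

    /** Stand-in for a parsed tag with its own text and child tags. */
    static final class Tag {
        final String name;
        final String text; // text directly inside this tag
        final List<Tag> children = new ArrayList<>();

        Tag(String name, String text) { this.name = name; this.text = text; }

        Tag add(Tag child) { children.add(child); return this; }

        /** Text of this tag followed by the text of all descendants. */
        String allText() {
            StringBuilder sb = new StringBuilder(text);
            for (Tag c : children) sb.append(c.allText());
            return sb.toString();
        }
    }

    /** Collects every tag with the given name, descending into matches too. */
    static void collectAll(Tag node, String name, List<Tag> out) {
        if (node.name.equals(name)) out.add(node);
        for (Tag c : node.children) collectAll(c, name, out);
    }

    /** Collects only outermost matches: stops descending once a tag matches. */
    static void collectOutermost(Tag node, String name, List<Tag> out) {
        if (node.name.equals(name)) { out.add(node); return; }
        for (Tag c : node.children) collectOutermost(c, name, out);
    }

    /** Concatenates the full text of every tag in the list. */
    static String joinText(List<Tag> tags) {
        StringBuilder sb = new StringBuilder();
        for (Tag t : tags) sb.append(t.allText());
        return sb.toString();
    }

    public static void main(String[] args) {
        // Models <body><div class='A'><div class='B'>the text</div></div></body>
        Tag body = new Tag("body", "")
                .add(new Tag("div", "").add(new Tag("div", "the text")));

        List<Tag> all = new ArrayList<>();
        collectAll(body, "div", all);
        List<Tag> outer = new ArrayList<>();
        collectOutermost(body, "div", outer);

        System.out.println(joinText(all));   // prints "the textthe text"
        System.out.println(joinText(outer)); // prints "the text"
    }
}
```

Whether to keep the outermost or the innermost match depends on what the text is needed for; the point is only that the repeated output comes from nesting, not from a parser fault.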
From: hpq852 <hp...@gm...> - 2006-08-08 16:19:16
Hi All,

I encountered a very strange problem. My code is very simple, as follows:

public void doTest() throws Exception
{
    URL url = new URL("http://www.uume.com/play_CPRz8a2si4zK");
    InputStream in = url.openStream();
    BufferedReader br = new BufferedReader(new InputStreamReader(in, "GB2312"));
    String line = null;
    StringBuffer sb = new StringBuffer();
    while ((line = br.readLine()) != null)
    {
        sb.append(line);
        sb.append("\n");
    }
    extractText2(sb.toString());
}

public String extractText2(String inputHtml) throws Exception
{
    Parser parser = Parser.createParser(new String(inputHtml.getBytes(),"GB2312"), "GB2312");
    HtmlUtils.registerTags(parser);
    NodeFilter tagNameFilter = new TagNameFilter("div");
    NodeList nodeList = parser.extractAllNodesThatMatch(tagNameFilter);

    System.out.println(nodeList.toHtml());
    return null;
}

I just want to get all of the div tags, so I used a TagNameFilter, but the result I got in the console is strange: it includes many repeated div tags with the same content. I have tried many times and always got the same result, and I really don't know what the reason is. Could you help me please?

Thanks and Best Regards
Jesse
From: Derrick O. <Der...@Ro...> - 2006-08-04 11:42:35
Jesse,

From your example, you can also get all the div tags at once and filter on class in a secondary pass:

    NodeList divs = nodelist.extractAllNodesThatMatch (new TagNameFilter ("DIV"), true);
    // presuming there is only one of each
    Div div_a = (Div)divs.extractAllNodesThatMatch (new HasAttributeFilter ("class", "A")).elementAt (0);
    Div div_b = (Div)divs.extractAllNodesThatMatch (new HasAttributeFilter ("class", "B")).elementAt (0);

and this may be faster than searching the entire page each time.

Derrick

Ian Macfarlane wrote:
> As long as you keep the original reference to the NodeList created by
> Parser.parse, and you haven't modified that NodeList, you should be
> able to reuse it, I think.
>
> Ian
>
> On 8/3/06, Jesse Hou <hp...@gm...> wrote:
>> Hi All,
>>
>> While using the htmlparser library I ran into a difficulty. An HTML
>> page has many tags such as title, div, input, span and so on. For
>> example:
>>
>> <title>this is a test </title>
>>
>> //...... any other tags
>>
>> <div class="A">
>>   <span class="B"><a href=" www.google.com ">google</a></span>
>> </div>
>>
>> //...... any other tags
>>
>> <div class="C">
>>   <div class="D"><input type="text" id="E" value="msn" /></div>
>> </div>
>>
>> //...... any other tags
>>
>> <div class="C">
>>   <div class="E"><span class="B"><input type="text" id="E" value="aol" /><a href=" www.live.com ">live</a></span></div>
>> </div>
>>
>> In this example the whole HTML may include many tags. If I want to get
>> the content 'this is a test', I can use a TagNameFilter, but I have to
>> parse the whole HTML. If I want to get 'google' or 'www.google.com', I
>> have to parse the whole HTML a second time, and if I want to get 'msn',
>> 'aol' and 'live', I may have to parse the whole HTML several more
>> times. This way I can get the content I need, but it may hurt
>> performance. Is there any other way to do that? Maybe I could use an
>> OrFilter to get the nodes, but then how can I identify which tag a
>> piece of text matched? If I want to store them in a DB, I have no idea
>> how to do that while parsing the HTML only once (for the best
>> performance). I beg your help. :-)
>>
>> Thanks and Best Regards
>>
>> Jesse
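The "parse once, then filter in a second pass" idea can also be sketched with only the JDK, which may help show its shape. This example swaps in the built-in JAXP DOM parser instead of htmlparser, so it assumes well-formed markup; the class name `ParseOnceSketch`, the helper `divsByClass`, and the trimmed sample page are all made up for illustration. The document is parsed a single time, every div is grouped by its class attribute in one walk, and each later lookup reuses that grouping, which also answers "which tag did this text come from".

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration; uses the JDK DOM API, not htmlparser.
final class ParseOnceSketch {

    /** Parses well-formed markup into a DOM document. */
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Groups every div in the document by its class attribute, in document order. */
    static Map<String, List<Element>> divsByClass(Document doc) {
        Map<String, List<Element>> byClass = new LinkedHashMap<>();
        NodeList divs = doc.getElementsByTagName("div");
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            byClass.computeIfAbsent(div.getAttribute("class"), k -> new ArrayList<>()).add(div);
        }
        return byClass;
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample page, cut down from the example in the thread.
        String page = "<html><head><title>this is a test</title></head><body>"
                + "<div class=\"A\"><span class=\"B\"><a href=\"www.google.com\">google</a></span></div>"
                + "<div class=\"C\"><div class=\"D\"><input type=\"text\" id=\"E\" value=\"msn\"/></div></div>"
                + "</body></html>";

        Document doc = parse(page);                         // the only parse
        Map<String, List<Element>> divs = divsByClass(doc); // one walk over the divs

        // Every lookup below reuses the parsed tree and the grouping.
        System.out.println(doc.getElementsByTagName("title").item(0).getTextContent()); // this is a test
        System.out.println(divs.get("A").get(0).getTextContent());                      // google
        Element input = (Element) divs.get("D").get(0).getElementsByTagName("input").item(0);
        System.out.println(input.getAttribute("value"));                                // msn
    }
}
```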
From: Ian M. <ian...@gm...> - 2006-08-04 10:42:24
As long as you keep the original reference to the NodeList created by Parser.parse, and you haven't modified that NodeList, you should be able to reuse it, I think.

Ian

On 8/3/06, Jesse Hou <hp...@gm...> wrote:
> Hi All,
>
> While using the htmlparser library I ran into a difficulty. An HTML
> page has many tags such as title, div, input, span and so on. For
> example:
>
> <title>this is a test </title>
>
> //...... any other tags
>
> <div class="A">
>   <span class="B"><a href=" www.google.com ">google</a></span>
> </div>
>
> //...... any other tags
>
> <div class="C">
>   <div class="D"><input type="text" id="E" value="msn" /></div>
> </div>
>
> //...... any other tags
>
> <div class="C">
>   <div class="E"><span class="B"><input type="text" id="E" value="aol" /><a href=" www.live.com ">live</a></span></div>
> </div>
>
> In this example the whole HTML may include many tags. If I want to get
> the content 'this is a test', I can use a TagNameFilter, but I have to
> parse the whole HTML. If I want to get 'google' or 'www.google.com', I
> have to parse the whole HTML a second time, and if I want to get 'msn',
> 'aol' and 'live', I may have to parse the whole HTML several more
> times. This way I can get the content I need, but it may hurt
> performance. Is there any other way to do that? Maybe I could use an
> OrFilter to get the nodes, but then how can I identify which tag a
> piece of text matched? If I want to store them in a DB, I have no idea
> how to do that while parsing the HTML only once (for the best
> performance). I beg your help. :-)
>
> Thanks and Best Regards
>
> Jesse
From: Jesse H. <hp...@gm...> - 2006-08-03 02:21:56
Hi All,

While using the htmlparser library I ran into a difficulty. An HTML page has many tags such as title, div, input, span and so on. For example:

<title>this is a test </title>

//...... any other tags

<div class="A">
  <span class="B"><a href=" www.google.com ">google</a></span>
</div>

//...... any other tags

<div class="C">
  <div class="D"><input type="text" id="E" value="msn" /></div>
</div>

//...... any other tags

<div class="C">
  <div class="E"><span class="B"><input type="text" id="E" value="aol" /><a href=" www.live.com ">live</a></span></div>
</div>

In this example the whole HTML may include many tags. If I want to get the content 'this is a test', I can use a TagNameFilter, but I have to parse the whole HTML. If I want to get 'google' or 'www.google.com', I have to parse the whole HTML a second time, and if I want to get 'msn', 'aol' and 'live', I may have to parse the whole HTML several more times. This way I can get the content I need, but it may hurt performance.

Is there any other way to do that? Maybe I could use an OrFilter to get the nodes, but then how can I identify which tag a piece of text matched? If I want to store them in a DB, I have no idea how to do that while parsing the HTML only once (for the best performance). I beg your help. :-)

Thanks and Best Regards
Jesse