htmlparser-user Mailing List for HTML Parser
Brought to you by:
derrickoswald
You can subscribe to this list here.
From: Derrick O. <Der...@Ro...> - 2006-07-31 04:52:07
Sorry, replied without thinking. You can apply the StringBean directly to a node list:

    Parser parser = new Parser ("http://yadda.yadda");
    NodeList list = parser.parse (my_spiffo_DIV_finding_filter);
    Div div = (Div)list.elementAt (0); // elementAt() returns a Node, so a cast is needed
    StringBean bean = new StringBean ();
    div.getChildren ().visitAllNodesWith (bean);
    System.out.println (bean.getStrings ());

Derrick

From: Derrick O. <Der...@Ro...> - 2006-07-31 04:47:16
Jesse,

The job breaks down into two tasks:
 1) get the outermost tag (your <div id="video_infobox_con"> tag) using a filter you construct.
 2) use a StringBean as a visitor on that node and its children to extract the text, like so:

    Parser parser = new Parser ("http://yadda.yadda");
    NodeList list = parser.parse (my_spiffo_DIV_finding_filter);
    Div div = (Div)list.elementAt (0); // elementAt() returns a Node, so a cast is needed
    // now re-create the HTML and pass it into another Parser
    Parser parser2 = new Parser (div.toHtml ()); // Note: for older versions you need to use setInputHTML()
    StringBean bean = new StringBean ();
    parser2.visitAllNodesWith (bean);
    System.out.println (bean.getStrings ());

Derrick

From: h p. <hp...@gm...> - 2006-07-31 03:35:57
Hi all,

I have a question about parsing HTML content. The content contains many tags. If I want to get the text of one kind of tag, such as LinkTag or TableTag, it's very easy to use the LinkRegexFilter or TagNameFilter. But if I want to get the content of more than one tag, is there a filter chain? Maybe the following example will explain what I mean:

<div id="video_infobox_con">
    ·add by:<span class="fcolor_03">2006.07.27 - 01:22</span><br />
    ·Label:
    <a href="search.do?q=%B0%CD%B6%FB%C4%E1%D1%C7%C4%E1" class="lnk_04" target=_self><u>test_a</u></a>
    <a href="search.do?q=%D7%B4%D4%AA%D0%E3" class="lnk_04" target=_self><u>test_b</u></a>
    <a href="search.do?q=%C0%BA%C7%F2" class="lnk_04" target=_self><u>test_c</u></a>
    <a href="search.do?q=%CC%E5%D3%FD" class="lnk_04" target=_self><u>test_d</u></a>
</div>
<input type="text" id="htmlurl" name="htmlurl" value='value_test' />

There are four tags (div, span, a, input), and I need all of the content in them: 2006.07.27 - 01:22, test_a, test_b, test_c, test_d and value_test.
What should I do? Maybe I could parse the HTML four times to get the four tags' content, but I think that would hurt performance. Could you help me? Thank you very much.

Best Regards
Jesse

From: Derrick O. <Der...@Ro...> - 2006-07-30 12:12:21
Kavorka,

Maybe if you just want to remove the whole link, use something like:

    getParent ().getChildren ().remove (this);

in the doSemanticAction() override of your custom LinkTag class. That will remove the current link tag from the enclosing parent tag by altering the children list.

Derrick

From: kavorka <the...@gm...> - 2006-07-29 13:07:11
Hi Oswald,

Yes, I want to remove the text within <a></a>. I'll try to do what you said, but I'm a newbie Java coder and I didn't understand it clearly. I tried to override LinkTag so that it does not take the text within <a></a>, but now myLinkTag doesn't find links. How can I get the text other than what is inside <a></a>?
Sorry if I ask too much.

Thanks a lot
Murat

From: Derrick O. <Der...@Ro...> - 2006-07-29 11:26:58
Eugeny,

Perhaps the web page is broken and has characters that can't be encoded by the encoding specified in the HTTP header or META tag. Or perhaps those are lying and the real encoding is something else. What does it look like in your browser? What encoding is it using to interpret it?

Use

    parser.setEncoding ("XXXXX");

to set the encoding before beginning the parse.

Derrick

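One possible way a single non-ASCII letter can turn into two question marks, sketched with plain Java and no HTMLParser calls. It assumes, purely for illustration, that the file name contains the Lithuanian letter U+0173, that the page bytes are UTF-8 but get decoded as ISO-8859-1, and that the result is then written to an ASCII-only log. The class and method names are hypothetical, and none of this is confirmed for the page in question.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // Decode UTF-8 bytes with the wrong charset (ISO-8859-1):
    // every byte becomes one char, so a two-byte letter becomes two chars.
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Round-trip through US-ASCII: String.getBytes(Charset) substitutes
    // the charset's replacement byte ('?') for unmappable characters.
    static String lossyAscii(String s) {
        byte[] ascii = s.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String name = "istorik\u0173";          // hypothetical file name fragment
        String wrong = misdecode(name);          // "istorik" plus two Latin-1 chars
        System.out.println(wrong.length());
        System.out.println(lossyAscii(wrong));
    }
}
```

If this is what happens, telling the parser the real encoding before the parse starts, as suggested above, avoids the first step.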
From: Derrick O. <Der...@Ro...> - 2006-07-29 11:18:57
Xue-Feng,

There are many examples of collecting the parsed nodes in a NodeList, modifying them and printing the list. Something like this should work:

    NodeList list = parser.parse (null);
    // pass true to search recursively, so text nested inside tags is found too
    NodeList text = list.extractAllNodesThatMatch (new NodeClassFilter (TextNode.class), true);
    // modify the text items in the text list
    System.out.println (list.toHtml ());

Derrick

Xue-Feng Yang wrote:

>I am trying to modify the TextNodes in a lexer by
>TextNode.setText(String). Then I tried to print the
>lexer by
>
>    Page toPage=lexer.getPage();
>    String toString=toPage.getText();
>    System.out.println(toString);
>
>The page was unchanged.
>
>Does anyone have an idea how to modify a lexer or simply
>an html page?
>
>Thanks,

From: Derrick O. <Der...@Ro...> - 2006-07-29 11:14:28
Murat,

I'm not sure what you mean by 'pure' text.
The stringextractor program uses the StringBean under the hood. It only collects text which would be presented in a browser - or at least it's supposed to.
The stringextractor program has an option (-links) to output the links within angle brackets. Make sure this is not used.
If you want to remove text within <a></a> pairs you will need to override the default LinkTag to not do this and register it with the PrototypicalNodeFactory.

Derrick

From: Eugeny N D. <bo...@re...> - 2006-07-28 21:30:11
Hello,

I'm trying to parse the page http://www.vu.lt/lt/naujienos/337/ but HtmlParser fails with this error:

ERROR org.htmlparser.util.EncodingChangeException: character mismatch (new: ? [0x2013] != old: [0xe2?]) for encoding change from ISO-8859-1 to UTF-8 at character offset 218
[junit] org.htmlparser.util.EncodingChangeException: character mismatch (new: ? [0x2013] != old: [0xe2?]) for encoding change from ISO-8859-1 to UTF-8 at character offset 218
[junit]     at org.htmlparser.lexer.InputStreamSource.setEncoding(InputStreamSource.java:280)
[junit]     at org.htmlparser.lexer.Page.setEncoding(Page.java:865)
[junit]     at org.htmlparser.tags.MetaTag.doSemanticAction(MetaTag.java:150)
[junit]     at org.htmlparser.scanners.TagScanner.scan(TagScanner.java:69)
[junit]     at org.htmlparser.scanners.CompositeTagScanner.scan(CompositeTagScanner.java:160)
[junit]     at org.htmlparser.util.IteratorImpl.nextNode(IteratorImpl.java:92)
[junit]     at org.htmlparser.Parser.extractAllNodesThatMatch(Parser.java:768)

at this line:

    Lexer lexer = new Lexer(new Page(document, encoding));
    Parser parser = new Parser(lexer);
    ---->NodeList list = parser.extractAllNodesThatMatch(new InterestedTagsFilter());<----

I don't know the document encoding initially, and thus it's null.

Could somebody please advise?

--
Eugene N Dzhurinsky

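The two code points in the error message (old 0xe2, new 0x2013) are what one gets when the same bytes are decoded first as ISO-8859-1 and then as UTF-8. A minimal sketch in plain Java, assuming the page contains a UTF-8 encoded en dash at the reported offset; the class and method names are made up for the illustration and are not part of HTMLParser:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {

    // UTF-8 encoding of U+2013 (EN DASH) is the three bytes E2 80 93.
    static final byte[] EN_DASH_UTF8 = { (byte) 0xE2, (byte) 0x80, (byte) 0x93 };

    // Decode the bytes with the given charset and return the first character.
    static char firstChar(byte[] bytes, Charset charset) {
        return new String(bytes, charset).charAt(0);
    }

    public static void main(String[] args) {
        char asLatin1 = firstChar(EN_DASH_UTF8, StandardCharsets.ISO_8859_1);
        char asUtf8 = firstChar(EN_DASH_UTF8, StandardCharsets.UTF_8);
        // The characters already handed out under the old encoding differ
        // from what the new encoding produces for the same bytes.
        System.out.printf("old: 0x%x, new: 0x%x%n", (int) asLatin1, (int) asUtf8);
    }
}
```

Because the text read before the META tag was decoded under the default encoding, a switch to UTF-8 can only be applied safely if both decodings agree up to that point; here they do not, which is consistent with the exception being raised.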
From: Eugeny N D. <bo...@re...> - 2006-07-28 21:24:31
Hello!

I'm trying to parse this page and extract all links there:
http://www.vu.lt/lt/naujienos/337/

For some reason the link to the PDF file looks like:
http://www.vu.lt/site_files/InfS/Naujienos/istorik??%20dienos.pdf

which is wrong. It seems like some wrong charset was used?

Here is the part of my code which does the parsing:

public LinkedList parseDocument(InputStream document, String encoding) {
    try {
        Lexer lexer = new Lexer(new Page(document, encoding));
        String href;
        try {
            lexer.reset();
            if (banner != null)
                validateBanner(lexer);
            lexer.reset();
            Parser parser = new Parser(lexer);
            NodeList list = null;
            try {
                list = parser.extractAllNodesThatMatch(new InterestedTagsFilter());
            } catch (EncodingChangeException e) {
                log.warn(e);
                lexer.reset();
                lexer.getPage().setEncoding(parser.getEncoding());
                list = parser.extractAllNodesThatMatch(new InterestedTagsFilter());
            }
            for (SimpleNodeIterator it = list.elements(); it.hasMoreNodes();) {
                TagNode node = (TagNode) it.nextNode();
                href = null;
                if (LinkTag.class.equals(node.getClass())
                        && validateLink((LinkTag) node)) {
                    href = ((LinkTag) node).getLink();
                } else if (ImageTag.class.equals(node.getClass())
                        || FrameTag.class.equals(node.getClass())) {
                    href = node.getAttribute("src");
                } else if (TitleTag.class.equals(node.getClass())) {
                    title = ((TitleTag) node).getTitle();
                } else if (BaseHrefTag.class.equals(node.getClass())) {
                    try {
                        baseTag = getBaseURL(new URI(((BaseHrefTag) node).getBaseUrl(), false));
                    } catch (URIException e2) {
                    }
                } else if (MetaTag.class.equals(node.getClass())
                        && "refresh".equalsIgnoreCase(((MetaTag) node).getHttpEquiv())) {
                    String URL = ((MetaTag) node).getMetaContent();
                    if (URL != null && URL.length() > 0) {
                        String arr[] = URL.split("URL=");
                        if (arr != null && arr.length == 2)
                            href = arr[1];
                    }
                }
                if (href != null && href.length() > 0) {
                    if (log.isDebugEnabled())
-------> log.debug(href); <-----------
                    results.add(getURL(StringEscapeUtils
                            .unescapeHtml(getEscapedURL(href.trim()))));
                }
            }
            this.encoding = parser.getEncoding();
            if (log.isDebugEnabled())
                log.debug(this.encoding);
        } catch (ParserException e1) {
            log.error(e1, e1);
        }
    } catch (UnsupportedEncodingException e) {
        log.error(e, e);
    }
    return results;
}

And on the marked line the application logs
/site_files/InfS/Naujienos/istorik??%20dienos.pdf

What could be wrong there?

--
Eugene N Dzhurinsky

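A side note on the META refresh branch in the code above: split("URL=") is case-sensitive, so a content value written as "5; url=..." yields no link. A hedged alternative using only java.util.regex is sketched below; the class name, method name and pattern are illustrative choices, not part of HTMLParser, and a target URL that itself contains a semicolon or quote would be cut short by this pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaRefresh {

    // Case-insensitive "url=", optional whitespace, optional opening quote,
    // then everything up to the next quote or semicolon.
    private static final Pattern URL_PART =
            Pattern.compile("(?i)\\burl\\s*=\\s*['\"]?([^'\";]+)");

    // Returns the refresh target from a META content value, or null if none.
    static String extractUrl(String content) {
        if (content == null) {
            return null;
        }
        Matcher m = URL_PART.matcher(content);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        System.out.println(extractUrl("5; url=http://example.com/next"));
        System.out.println(extractUrl("0;URL='/a b.html'"));
        System.out.println(extractUrl("30"));
    }
}
```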
From: Xue-Feng Y. <jus...@ya...> - 2006-07-28 21:14:58
I am trying to modify the TextNodes in a lexer by TextNode.setText(String). Then I tried to print the lexer by

    Page toPage=lexer.getPage();
    String toString=toPage.getText();
    System.out.println(toString);

The page was unchanged.

From: kavorka <the...@gm...> - 2006-07-28 20:53:56
Hi Oswald,

I have another question. In HTMLParser, is it possible to extract only the text of the web page? The stringextractor program also extracts the link text in the page; I want to extract "pure" text. Can I do that?

Thanks
Murat

From: Xue-Feng Y. <jus...@ya...> - 2006-07-28 19:43:07
I am trying to modify the TextNodes in a lexer by TextNode.setText(String). Then I tried to print the lexer by

    Page toPage=lexer.getPage();
    String toString=toPage.getText();
    System.out.println(toString);

The page was unchanged.

Does anyone have an idea how to modify a lexer or simply an html page?

Thanks,
Xue-Feng

From: kavorka <the...@gm...> - 2006-07-25 08:49:52
Hi Oswald,

Thanks a lot for your help.

Murat

From: Derrick O. <Der...@Ro...> - 2006-07-24 03:25:32
Kavorka,

This should give you the meta tag, from which you can get the information you want:

    NodeList nodes = parser.parse (null);
    NodeList metas = nodes.extractAllNodesThatMatch (new TagNameFilter ("META"));
    MetaTag meta = (MetaTag)metas.elementAt (0);
    System.out.println (meta);

Derrick

From: Derrick O. <Der...@Ro...> - 2006-07-24 03:20:33
Eugeny,

In general, you probably want to look at the filter package.
Try running the filterbuilder application (the startup script is in the bin directory) and read the help and tutorial.
Using this application you can create a Java program that selects only the 'sometext' you want.

Derrick

From: Ian M. <ian...@gm...> - 2006-07-19 15:23:22
HTMLParser is usually capable of parsing just an HTML fragment: Parser.setInputHTML("html") and then Parser.parse(null).

Ian

From: kavorka <the...@gm...> - 2006-07-19 11:44:23
Hi all,

I'm new to HTMLParser. I used the sample programs to understand how I can find the meta data of a page, but I couldn't get it to work. Do you have any code samples that find the meta data of a page using HTMLParser?

Thank you
Best regards

From: Eugeny N D. <bo...@re...> - 2006-07-17 07:30:43
Hello!

I need to search for HTML code in a page. For instance, the code to search for looks like this:

<div class="someclass"><a href="somelocation" ><img src="image/here" border="0"></a></div><span style="style2">sometext</span>

This code could be placed on a single line or formatted somehow, containing one or more line breaks.

I also need to detect the case where this code is commented out, or placed outside the <body> section.

For now I create a Lexer instance for the document and another for this code and compare them token by token, but maybe there is some better way?

--
Eugene N Dzhurinsky

From: Dennis G. <ge...@al...> - 2006-07-14 20:15:09
Since it was just a string I added html and body tags and it seems I'm on my way.

    str = "<html><body>" + str + "</body></html>";

--Dennis

--
Dennis R. Gesker
email: de...@al...
Key Id: 0xEFA10A51

From: Dennis G. <ge...@al...> - 2006-07-14 20:07:47
I would like to parse a portion of html that I have in a buffer (String), that is to say not a complete page. The string contains an html table only.

Could someone point to or provide some sample code for how to parse just a fragment of html?

Dennis

--
Dennis R. Gesker
email: de...@al...
Key Id: 0xEFA10A51

From: Derrick O. <Der...@Ro...> - 2006-07-01 23:34:18
This should give you the "Content":

    NodeList nodes = parser.parse (null);
    NodeList metas = nodes.extractAllNodesThatMatch (new TagNameFilter ("META"));
    // elementAt() returns a Node, so cast to MetaTag before calling getMetaContent()
    System.out.println (((MetaTag)metas.elementAt (0)).getMetaContent ());

vasantha reddy wrote:

> Hi,
>
> I am using HTML parser in my project. The HTML parser doesn't give
> the contents of the meta tag as its output. I need the content of the
> meta tag. Is there any method that I can use to get the content of a
> particular tag by giving the tag name as input?
>
> Thank you,
> Regards,
> Vasantha