htmlparser-user Mailing List for HTML Parser
Brought to you by:
derrickoswald
From: <ale...@cp...> - 2005-06-30 17:10:31
|
Zac,
It seems that it extracts text from comments for some reason.
I just started using the parser, too, but I am using it like this (found
it in one of the examples, import org.htmlparser.beans.*):
StringBean sb = new StringBean();
sb.setLinks(false);
sb.setURL(url);
String alltext = sb.getStrings();
It even seems to be faster than the method you are using.
Darko Aleksic
> Hello all,
>
> I just started using HTML Parser for a project I am working on. I need
> to extract all the text from a web page (i.e. text that is meant to be read,
> such as that inside <p> tags), so I can analyse it. However, when I
> use the following code, it often extracts fragments of HTML and other
> unwanted data. Probably a simple mistake, but what am I doing wrong?
>
> Thanks,
>
> Zac Craven
> ------------------------------------------------
>
> public String getText(String url){
> try {
> System.out.println("Parsing: " + url);
> Parser parser = new Parser(url);
> TextExtractingVisitor visitor = new TextExtractingVisitor();
> parser.visitAllNodesWith(visitor);
> String alltext = visitor.getExtractedText();
> System.out.println("----------BEGINNING DUMP OF ALL TEXT EXTRACTED FROM "+url+"----------");
> System.out.println(alltext);
> System.out.println("----------END OF TEXT DUMP FROM "+url+"----------");
> return alltext;
> }
> catch (Exception e) {
> System.err.println(e.getMessage());
> return null;
> }
> }
>
>
> _______________________________________________
> Htmlparser-user mailing list
> Htm...@li...
> https://lists.sourceforge.net/lists/listinfo/htmlparser-user
>
|
|
From: Zac C. <zac...@gm...> - 2005-06-30 16:25:02
|
Hello all,
I just started using HTML Parser for a project I am working on. I need
to extract all the text from a web page (i.e. text that is meant to be read,
such as that inside <p> tags), so I can analyse it. However, when I
use the following code, it often extracts fragments of HTML and other
unwanted data. Probably a simple mistake, but what am I doing wrong?
Thanks,
Zac Craven
------------------------------------------------
public String getText(String url){
try {
System.out.println("Parsing: " + url);
Parser parser = new Parser(url);
TextExtractingVisitor visitor = new TextExtractingVisitor();
parser.visitAllNodesWith(visitor);
String alltext = visitor.getExtractedText();
System.out.println("----------BEGINNING DUMP OF ALL TEXT EXTRACTED FROM "+url+"----------");
System.out.println(alltext);
System.out.println("----------END OF TEXT DUMP FROM "+url+"----------");
return alltext;
}
catch (Exception e) {
System.err.println(e.getMessage());
return null;
}
}
|
|
From: Derrick O. <Der...@Ro...> - 2005-06-27 22:33:54
|
Rubén,
Every time you ask for an iterator using elements() it will give you a
new one.
So using "filas.getChildren().elements().hasMoreNodes()" will always
return true if the node has any children at all.
I think what you want to do is something like this (but you still need
to watch out for tables within tables, since extractAllNodesThatMatch()
flattens the hierarchy without unhooking children from their parents):
try
{
// Filter all nodes that are tables
// and that also have a class attribute
Parser parser = new Parser (url);
NodeFilter filter = new NodeClassFilter (TableTag.class);
NodeList nodes = parser.extractAllNodesThatMatch (filter);
for (int i = 0; i < nodes.size (); i++)
{
TagNode tagnode=(TagNode)nodes.elementAt (i);
if (tagnode.getAttribute ("class")!=null
&& tagnode.getAttribute ("class").indexOf
("productListing")>-1)
{
NodeList list = tagnode.getChildren ();
if (null != list)
{
NodeIterator iterator = list.elements ();
while (iterator.hasMoreNodes ())
{
Node filas = iterator.nextNode ();
NodeList list2 = filas.getChildren ();
if (null != list2)
{
NodeIterator iterator2 = list2.elements ();
while (iterator2.hasMoreNodes ())
System.out.println
(iterator2.nextNode ().toHtml ());
}
}
}
}
}
}
catch (ParserException e)
{
e.printStackTrace ();
}
Derrick
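The pitfall above can be isolated in plain Java, without HTML Parser at all. In this sketch (FreshIteratorPitfall is an illustrative class, not part of the library), elements() returns a brand-new iterator on every call, just as NodeList.elements() does, so a loop whose condition calls it on each pass can never terminate:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a NodeList: elements() hands back a
// brand-new iterator on every call, like NodeList.elements().
public class FreshIteratorPitfall {
    private final List<String> children = Arrays.asList("tr1", "tr2");

    public Iterator<String> elements() {
        return children.iterator(); // a new iterator each time
    }

    public static void main(String[] args) {
        FreshIteratorPitfall list = new FreshIteratorPitfall();

        // Broken pattern: the condition consults a fresh iterator on every
        // pass, so it is always true; a guard keeps this demo finite.
        int guard = 0;
        while (list.elements().hasNext() && guard < 3) {
            String first = list.elements().next(); // also fresh: always "tr1"
            System.out.println("pass " + guard + ": " + first);
            guard++;
        }

        // Correct pattern: obtain the iterator once and reuse it.
        for (Iterator<String> it = list.elements(); it.hasNext(); )
            System.out.println(it.next());
    }
}
```

The fix is the same as in Derrick's corrected code: call elements() once, keep the iterator in a variable, and loop on that.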
Rubén del Castillo Glez wrote:
> I have this code and it seems it goes into an infinite loop.
>
> I would like to parse the table rows that have a class attribute with
> the value productListing.
>
> Do you have any hint?
>
> try
> {
>
> // Filter all nodes that are tables
> // and that also have a class attribute
>
> parser = new Parser (url);
>
> filter = new NodeClassFilter (TableTag.class);
>
> nodes = parser.extractAllNodesThatMatch (filter);
>
> for (int i = 0; i < nodes.size (); i++) {
>
> TagNode tagnode=(TagNode)nodes.elementAt (i);
>
>
>
> if (tagnode.getAttribute("class")!=null
> && tagnode.getAttribute("class").indexOf("productListing")>-1) {
>
>
>
> while(tagnode.getChildren()!=null &&
> tagnode.getChildren().elements()!=null &&
> tagnode.getChildren().elements().hasMoreNodes()) {
>
>
>
> Node filas= tagnode.getChildren().elements().nextNode();
>
>
>
>
> while (
> filas.getChildren()!=null &&
> filas.getChildren().elements()!=null
> &&
>
> filas.getChildren().elements().hasMoreNodes()
>
>
> ) {
>
> System.out.println
> (tagnode.getChildren().elements().nextNode().toHtml ());
>
> }
>
> }
>
> }
>
> }
>
> }
> catch (ParserException e)
> {
> e.printStackTrace ();
> }
> System.exit (0);
> }
> }
|
|
From: Rubén d. C. G. <r_d...@ho...> - 2005-06-27 21:55:42
|
I have this code and it seems it goes into an infinite loop.
I would like to parse the table rows that have a class attribute with the
value productListing.
Do you have any hint?
try
{
// Filter all nodes that are tables
// and that also have a class attribute
parser = new Parser (url);
filter = new NodeClassFilter (TableTag.class);
nodes = parser.extractAllNodesThatMatch (filter);
for (int i = 0; i < nodes.size (); i++) {
TagNode tagnode=(TagNode)nodes.elementAt (i);
if (tagnode.getAttribute("class")!=null
&& tagnode.getAttribute("class").indexOf("productListing")>-1) {
while(tagnode.getChildren()!=null &&
tagnode.getChildren().elements()!=null &&
tagnode.getChildren().elements().hasMoreNodes()) {
Node filas= tagnode.getChildren().elements().nextNode();
while (
filas.getChildren()!=null &&
filas.getChildren().elements()!=null
&&
filas.getChildren().elements().hasMoreNodes()
) {
System.out.println (tagnode.getChildren().elements().nextNode().toHtml ());
}
}
}
}
}
catch (ParserException e)
{
e.printStackTrace ();
}
System.exit (0);
}
}
|
|
From: Derrick O. <Der...@Ro...> - 2005-06-26 19:56:05
|
Mike,
I think you need to use the form's 'action' attribute as the target URL
for the query (and note that it may be relative), probably ...
URL url = new URL("http://www.msn.com.tw/include/searchexecute.asp");
Derrick
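Derrick's point about the 'action' URL aside, the code posted in this thread also hardcodes a Content-Length of 25 and appends raw multibyte text to the request body without URL-encoding it. The sketch below (FormBody is a hypothetical helper, not part of HTML Parser) shows one way to build an application/x-www-form-urlencoded body whose byte length can then be measured rather than guessed:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper (not part of HTML Parser): builds an
// application/x-www-form-urlencoded request body from name/value
// pairs, so Content-Length can be computed instead of hardcoded.
public class FormBody {
    public static String encode(String charset, String... pairs)
            throws UnsupportedEncodingException {
        StringBuilder body = new StringBuilder();
        for (int i = 0; i + 1 < pairs.length; i += 2) {
            if (body.length() > 0)
                body.append('&');
            body.append(URLEncoder.encode(pairs[i], charset));     // field name
            body.append('=');
            body.append(URLEncoder.encode(pairs[i + 1], charset)); // field value
        }
        return body.toString();
    }

    public static void main(String[] args) throws Exception {
        String body = encode("UTF-8", "id", "searchtheweb", "q", "Hello Kitty");
        // Content-Length should be the byte length of the encoded body.
        System.out.println(body + " (" + body.getBytes("UTF-8").length + " bytes)");
    }
}
```

The Content-Length header can then be set from body.getBytes(charset).length, so it always matches what is actually written to the connection.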
mike liu wrote:
> regards:
> I want to use org.htmlparser.http.*;
> post Keyword "HelloKitty" on http://www.msn.com.tw to do search action.
> but in vain.The following codes return the html source of
> http://www.msn.com.tw.
> but not the search result.
> do I miss something important?
|
|
From: mike l. <s9...@ya...> - 2005-06-26 16:58:03
|
Regards:
I want to use org.htmlparser.http.* to POST the keyword "HelloKitty" on http://www.msn.com.tw and perform a search, but in vain. The following code returns the HTML source of http://www.msn.com.tw, but not the search result. Do I miss something important?
Any positive suggestion is welcome.
May goodness be with you all.
The following is my code:
/*---------------------------------------------------------------------------
 Hook up the HTML-to-XHTML 1.0 translator
---------------------------------------------------------------------------*/
import org.htmlparser.Parser;
import org.htmlparser.http.*;
import org.htmlparser.util.*;
import java.io.*;
import java.net.*;
import java.util.*;
public class htmlparserToDoPost
{
    public static void scopy (InputStreamReader ins, FileOutputStream outs) throws IOException
    {
        synchronized (ins)
        {
            synchronized (outs)
            {
                while (true)
                {
                    int br = ins.read ();
                    outs.write (br);
                    if (br == -1)
                        break;
                }
            }
        }
    } // end scopy
    public static void main (String args[]) throws Exception
    {
        Properties prop = System.getProperties ();
        System.out.println (prop.getProperty ("http.proxyHost"));
        System.out.println (prop.getProperty ("http.proxyPort"));
        prop.put ("http.proxyHost", "140.138.2.10");
        prop.put ("http.proxyPort", "8080");
        try
        {
            // from the "action" (relative to the referring page)
            URL url = new URL ("http://www.msn.com.tw");
            HttpURLConnection myconnection = (HttpURLConnection)url.openConnection ();
            myconnection.setRequestMethod ("POST");
            myconnection.setDoOutput (true);
            myconnection.setDoInput (true);
            myconnection.setUseCaches (true);
            // more or less of these may be required
            // see Request Header Definitions: http://www.ietf.org/rfc/rfc2616.txt
            myconnection.setRequestProperty ("Accept-Charset", "utf-8");
            myconnection.setRequestProperty ("Content-Type", "application/x-www-form-urlencoded");
            myconnection.setRequestProperty ("Content-Length", "25");
            // <FORM id=searchtheweb method=post action="/https/sourceforge.net/include/searchexecute.asp" name=searchfrm>
            // Build up the "input" fields separated by ampersands (&)
            StringBuffer buffer = new StringBuffer (1024);
            buffer.append ("id");
            buffer.append ("=");
            buffer.append ("searchtheweb");
            buffer.append ("&");
            buffer.append ("name=");
            buffer.append ("哈利波特");
            System.out.println (buffer.toString ().length ());
            // Output the input fields
            PrintWriter out = new PrintWriter (myconnection.getOutputStream ());
            out.print (buffer);
            out.flush ();
            InputStream in = myconnection.getInputStream ();
            // Create an InputStreamReader that uses the named charset.
            InputStreamReader fromPostSource = new InputStreamReader (in, "ISO8859-1");
            FileOutputStream MsnFileOutputStream = new FileOutputStream ("samplekitty.html");
            scopy (fromPostSource, MsnFileOutputStream);
            MsnFileOutputStream.close ();
            fromPostSource.close ();
            in.close ();
            out.close ();
            System.out.println ("Post Location=" + myconnection.getURL ().toString ());
            // So the parser can perform operations of a logged in user.
            Parser parser = new Parser ();
            parser.setConnection (myconnection);
        }
        catch (UnknownHostException e)
        {
            System.out.println ("Unable to obtain the IP address");
        }
    }
}
|
|
|
From: Derrick O. <Der...@Ro...> - 2005-06-25 14:12:29
|
Ian,
Assuming you've got the form input correct, it's likely that the login
operation is trying to set a Cookie. For that you'll need to move to
version 1.5 of the parser and enable cookie processing:
Page.getConnectionManager ().setCookieProcessingEnabled (true);
It's a global static connection manager, so new Parsers will inherit the
cookies set by the login.
Hope that helps.
Derrick
Ian Moss wrote:
> Hi,
>
> I've just joined the list.
> htmlparser seems to be pretty damn cool.
> Thanks to all those that have contributed code / help / doc.
> (What's with the Wiki vandalisation - guess it must be a bot?)
>
> Anyways, what I'm trying to do at the moment is extract my vote list
> from imdb.com ...
>
> So being a geek I want to do this in a way that others can also use.
> And it also frustrates me that they've never implemented a share history
> with friends function.
> Knowing htmlparser is there helps here.
>
> - Complexity : Easy (if you've done it before ;)
> - Requirement for helping : an account at imdb.com (with a vote history).
> - I'm using htmlparser.jar from 1.4
> - Problem : Vote History is obviously a function that requires you to
> log on. When I try to log on using the htmlparser Wiki doc as a reference
> I don't seem to be able to get it to log on.
> (It does compile and execute as I would expect ;)
> - I will include the relevant code below (I can email the full class
> if needed / can contribute it if you want)...
|
|
From: Ian M. <em...@ia...> - 2005-06-25 09:15:20
|
Hi,
I've just joined the list. htmlparser seems to be pretty damn cool.
Thanks to all those that have contributed code / help / doc.
(What's with the Wiki vandalisation - guess it must be a bot?)
Anyways, what I'm trying to do at the moment is extract my vote list from imdb.com ...
So being a geek I want to do this in a way that others can also use. And it also frustrates me that they've never implemented a share history with friends function. Knowing htmlparser is there helps here.
- Complexity : Easy (if you've done it before ;)
- Requirement for helping : an account at imdb.com (with a vote history).
- I'm using htmlparser.jar from 1.4
- Problem : Vote History is obviously a function that requires you to log on. When I try to log on using the htmlparser Wiki doc as a reference I don't seem to be able to get it to log on. (It does compile and execute as I would expect ;)
- I will include the relevant code below (I can email the full class if needed / can contribute it if you want)...
//-------------------------------------------------------------------------
String currentImdbUsername = "yourOwnID";
String currentImdbPassword = "yourOwnPassword";
String imdbBaseURL = "http://uk.imdb.com/";
String imdbloginSubmitURL = "http://uk.imdb.com/register/login";
String imdbStartURL = imdbloginSubmitURL; // "http://uk.imdb.com/register/login";
String imdbVoteHistoryURL = "http://uk.imdb.com/mymovies/list?l=3477529&s=uservote&s=reverse_uservote";
private void imdbLogin()
{
    URL url;
    StringBuffer buffer;
    PrintWriter out;
    System.out.println("Attempting IMDB logon with user=" + this.currentImdbUsername);
    try
    {
        // from the 'action' (relative to the referring page)
        url = new URL(imdbloginSubmitURL);
        connection = (HttpURLConnection)url.openConnection();
        connection.setRequestMethod("POST");
        connection.setDoOutput(true);
        connection.setDoInput(true);
        connection.setUseCaches(false);
        // more or less of these may be required
        // see Request Header Definitions: http://www.ietf.org/rfc/rfc2616.txt
        connection.setRequestProperty("Accept-Charset", "*");
        connection.setRequestProperty("Referer", imdbStartURL);
        connection.setRequestProperty("User-Agent", "IE/5.0");
        System.out.println("Current Location=" + connection.getURL().toString());
        // Build up the logon 'input' fields separated by ampersands (&)
        buffer = new StringBuffer(1024);
        buffer.append("login=");
        buffer.append(currentImdbUsername);
        buffer.append("&");
        buffer.append("password=");
        buffer.append(currentImdbPassword);
        System.out.println("Logon input fields=" + buffer.toString());
        // Output the input fields
        out = new PrintWriter(connection.getOutputStream());
        out.print(buffer);
        out.close();
        System.out.println("Post Location=" + connection.getURL().toString());
        // So the parser can perform operations of a logged in user.
        parser = new Parser();
        parser.setConnection(connection);
        // Temp inline version of getVoteHistory
        try
        {
            System.out.println("About to get vote history at " + this.imdbVoteHistoryURL);
            parser.setURL(this.imdbVoteHistoryURL);
            TextExtractingVisitor visitor = new TextExtractingVisitor();
            parser.visitAllNodesWith(visitor);
            String textInPage = visitor.getExtractedText();
            System.out.println("Text=" + textInPage);
        }
        catch (ParserException pex)
        {
            System.out.println("ParserException in logon" + pex);
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
        System.out.println("Login Complete");
    }
}
|
|
From: Derrick O. <Der...@Ro...> - 2005-06-22 22:10:42
|
Hi Mike,
You should be able to register as a ConnectionMonitor and change the
request method and other stuff in the preConnect() method:
class Poster implements ConnectionMonitor
{
    public void preConnect (HttpURLConnection connection)
        throws ParserException
    {
        connection.setRequestMethod ("POST");
    }
    public void postConnect (HttpURLConnection connection)
        throws ParserException
    {
    }
}
Poster poster = new Poster ();
Page.getConnectionManager ().setMonitor (poster);
Derrick
mike liu wrote:
> Regards:
> Could I use org.htmlparser.http
> to do other HTTP method like
> POST,PUT,DELETE,HEAD,TRACE,OPTIONS,CONNECT?....
>
> Any positive suggestion is welcome.
> thank you
> May goodness be with you all
>
>
> <http://cn.mail.yahoo.com/?id=77071>
|
|
From: mike l. <s9...@ya...> - 2005-06-22 12:27:29
|
Regards:
Could I use org.htmlparser.http to do other HTTP methods, like POST, PUT, DELETE, HEAD, TRACE, OPTIONS, CONNECT? ....
Any positive suggestion is welcome.
Thank you.
May goodness be with you all
|
|
From: Derrick O. <Der...@Ro...> - 2005-06-21 22:28:03
|
Jeremy,
It's not finding the HTML parser NodeVisitor class.
Try including the path to the HTML Parser class files (and also
StringDemo.class), either:
java -classpath .:../htmlparser1_5/src StringDemo
or
java -classpath .:../htmlparser1_5/lib/htmlparser.jar StringDemo
depending on where the jar file is of course.
Derrick
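A quick way to confirm which classes the runtime can actually see is a tiny probe class, run with the same -classpath you intend to use for StringDemo. ClasspathCheck here is a hypothetical utility, not part of HTML Parser:

```java
// Hypothetical probe (not part of HTML Parser): reports whether a class
// can be loaded from the current runtime classpath.
public class ClasspathCheck {
    public static boolean isPresent(String className) {
        try {
            Class.forName(className); // attempt to load the named class
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = (args.length > 0)
                ? args[0]
                : "org.htmlparser.visitors.NodeVisitor";
        System.out.println(name + (isPresent(name) ? " found" : " NOT found"));
    }
}
```

Running "java -classpath .:../htmlparser1_5/lib/htmlparser.jar ClasspathCheck" should report the NodeVisitor class as found; if it reports NOT found, the java invocation's classpath is the problem, not javac's.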
bu...@co... wrote:
>I have been trying to get the StringDemo example program to work and keep getting an exception. I am using Ant to compile the program. Here is the output:
>
>[staruser@arc HTMLParser_EXAMPLE]# ant
>Buildfile: build.xml
>
>build:
> [javac] Compiling 1 source file
>
>BUILD SUCCESSFUL
>Total time: 1 second
>
>
>Here is the build.xml file:
>
><?xml version="1.0"?>
><!-- build file for text parser -->
>
><project name="parser" default="build" basedir=".">
> <target name="build">
> <javac
> srcdir="."
> classpath="../htmlparser1_5/src/"
> />
> </target>
></project>
>
>Here is the output when I try to run the program:
>
>[staruser@arc HTMLParser_EXAMPLE]# java StringDemo
>Exception in thread "main" java.lang.NoClassDefFoundError: org/htmlparser/visitors/NodeVisitor
>
>It doesn't make sense that java cannot find the class when javac could. Am I missing a step in the build.xml file? Any help in understanding this problem would be greatly appreciated.
>
>-Jeremy
>
>
>
|
|
From: <bu...@co...> - 2005-06-21 17:15:22
|
I have been trying to get the StringDemo example to work and kept getting an exception. I am using Ant to compile the program. Here is the output:
[staruser@arc HTMLParser_EXAMPLE]# ant
Buildfile: build.xml
build:
[javac] Compiling 1 source file
BUILD SUCCESSFUL
Total time: 1 second
Here is the build.xml file:
<?xml version="1.0"?>
<!-- build file for text parser -->
<project name="parser" default="build" basedir=".">
<target name="build">
<javac
srcdir="."
classpath="../htmlparser1_5/src/"
/>
</target>
</project>
Here is the output when I try to run the program:
[staruser@arc HTMLParser_EXAMPLE]# java StringDemo
Exception in thread "main" java.lang.NoClassDefFoundError: org/htmlparser/visitors/NodeVisitor
It doesn't make sense that java cannot find the class when javac could. Am I missing a step in the build.xml file? Any help in understanding this problem would be greatly appreciated.
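The compile succeeded because javac was handed the classpath in build.xml, but `java StringDemo` was run with no classpath at all, so the parser classes are missing at runtime. One way to keep the two in sync, a sketch only (the `run` target name and the jar location under `../htmlparser1_5/lib/` are assumptions based on Derrick's reply), is to declare the classpath once and reuse it for both compiling and running:

```xml
<?xml version="1.0"?>
<project name="parser" default="build" basedir=".">
    <path id="parser.classpath">
        <pathelement location="."/>
        <pathelement location="../htmlparser1_5/lib/htmlparser.jar"/>
    </path>
    <target name="build">
        <javac srcdir="." classpathref="parser.classpath"/>
    </target>
    <target name="run" depends="build">
        <java classname="StringDemo" classpathref="parser.classpath"/>
    </target>
</project>
```

Then `ant run` launches the program with the same classpath the compiler saw.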
-Jeremy
|
|
From: Qingyi Gu <q_z...@ya...> - 2005-06-17 21:20:52
|
Yes, you are right. The character set in the <meta> tag causes the problem. The version I am using is v1.5, downloaded in March 2005; I don't think it is the latest one. The URL I connect to returns a page that has "<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">". My code was using "ISO-8859" as the charset to parse this InputStream. I will try the latest version to see if this problem still exists. In the meantime, I use "page = new Page(String)" instead of "page = new Page(InputStream, CharSet)", and it works well. Thanks for your help.
--- Derrick Oswald <Der...@Ro...> wrote:
> The parser has run into a <meta> tag that specifies the character set.
> This causes a rescan of the characters read so far to get the Reader
> back in sync with the Stream with the new character set.
> The reset() call on the stream has caused an exception since reset()
> isn't supported.
> I thought this was cured in recent releases where the input stream is
> wrapped if it doesn't support reset.
> Which version are you using and what URL causes the problem?
>
> Qingyi Gu wrote:
> >Hi,
> >
> >I got the following errors when I use the parser.
> >Anyone has a clue what causes this problem. Thanks.
> >
> >***********************************************
> >org.htmlparser.util.ParserException: mark/reset not supported;
> >java.io.IOException: mark/reset not supported
> >at java.io.InputStream.reset(InputStream.java:332)
> >at java.io.FilterInputStream.reset(FilterInputStream.java:207)
> >at org.htmlparser.lexer.InputStreamSource.setEncoding(InputStreamSource.java:236)
> >at org.htmlparser.lexer.Page.setEncoding(Page.java:766)
> >at org.htmlparser.tags.MetaTag.doSemanticAction(MetaTag.java:119)
> >at org.htmlparser.scanners.TagScanner.scan(TagScanner.java:69)
> >at org.htmlparser.scanners.CompositeTagScanner.scan(CompositeTagScanner.java:160)
> >at org.htmlparser.util.IteratorImpl.nextNode(IteratorImpl.java:91)
> >at org.htmlparser.Parser.visitAllNodesWith(Parser.java:522)
> >***********************************************
> >
> >ZZ |
|
From: Derrick O. <Der...@Ro...> - 2005-06-17 01:53:13
|
The parser has run into a <meta> tag that specifies the character set.
This causes a rescan of the characters read so far to get the Reader
back in sync with the Stream with the new character set.
The reset() call on the stream has caused an exception since reset()
isn't supported.
I thought this was cured in recent releases where the input stream is
wrapped if it doesn't support reset.
Which version are you using and what URL causes the problem?
Qingyi Gu wrote:
>Hi,
>
>I got the following errors when I use the parser.
>Anyone has clue what cause this problem. Thanks.
>
>***********************************************
>org.htmlparser.util.ParserException: mark/reset not supported;
>java.io.IOException: mark/reset not supported
>at java.io.InputStream.reset(InputStream.java:332)
>at java.io.FilterInputStream.reset(FilterInputStream.java:207)
>at org.htmlparser.lexer.InputStreamSource.setEncoding(InputStreamSource.java:236)
>at org.htmlparser.lexer.Page.setEncoding(Page.java:766)
>at org.htmlparser.tags.MetaTag.doSemanticAction(MetaTag.java:119)
>at org.htmlparser.scanners.TagScanner.scan(TagScanner.java:69)
>at org.htmlparser.scanners.CompositeTagScanner.scan(CompositeTagScanner.java:160)
>at org.htmlparser.util.IteratorImpl.nextNode(IteratorImpl.java:91)
>at org.htmlparser.Parser.visitAllNodesWith(Parser.java:522)
>***********************************************
>
>ZZ |
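On the caller's side, the workaround is to hand the parser a stream that actually supports mark()/reset(), for example by wrapping it in a BufferedInputStream before constructing the Page. A minimal standalone sketch (the `MarkableStream` helper name is illustrative):

```java
import java.io.BufferedInputStream;
import java.io.InputStream;

public class MarkableStream {
    // Wrap a stream in BufferedInputStream so mark()/reset() are supported,
    // which the parser needs when a <meta> tag changes the encoding mid-scan.
    public static InputStream markable(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    public static void main(String[] args) {
        // A raw InputStream subclass does not support mark/reset by default.
        InputStream raw = new InputStream() {
            public int read() { return -1; }
        };
        System.out.println(raw.markSupported());            // false
        System.out.println(markable(raw).markSupported());  // true
    }
}
```

Passing `markable(stream)` instead of the bare stream avoids the "mark/reset not supported" ParserException on older parser versions.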
|
From: Qingyi Gu <q_z...@ya...> - 2005-06-16 14:25:25
|
Hi,
I got the following errors when I use the parser.
Anyone has clue what cause this problem. Thanks.
***********************************************
org.htmlparser.util.ParserException: mark/reset not supported;
java.io.IOException: mark/reset not supported
at java.io.InputStream.reset(InputStream.java:332)
at java.io.FilterInputStream.reset(FilterInputStream.java:207)
at org.htmlparser.lexer.InputStreamSource.setEncoding(InputStreamSource.java:236)
at org.htmlparser.lexer.Page.setEncoding(Page.java:766)
at org.htmlparser.tags.MetaTag.doSemanticAction(MetaTag.java:119)
at org.htmlparser.scanners.TagScanner.scan(TagScanner.java:69)
at org.htmlparser.scanners.CompositeTagScanner.scan(CompositeTagScanner.java:160)
at org.htmlparser.util.IteratorImpl.nextNode(IteratorImpl.java:91)
at org.htmlparser.Parser.visitAllNodesWith(Parser.java:522)
***********************************************
ZZ |
|
From: Derrick O. <Der...@Ro...> - 2005-06-11 22:57:53
|
Without seeing the code in MyVisitor.java it's hard to say. If you look at line 89 of MyVisitor.java, it is probably trying to get the children list, which may be null, or getting a property from a specific tag that is null. To solve it, you will need to check for null before using it.
mike liu wrote:
> regards:
>
> Why does sometimes the following error happen?....
> http://0rz.net/4d0ry
>
> Any positive suggestion is welcome.
> thank you
> May goodness be with you all
> Thank you, positive feedback to mi...@gm...
> <mailto:mi...@gm...> |
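The defensive pattern Derrick recommends can be sketched generically; since MyVisitor.java was never posted, the method and class names below are hypothetical stand-ins for whatever line 89 dereferences:

```java
import java.util.Arrays;
import java.util.List;

public class NullSafeVisit {
    // Treat a missing child list as empty instead of dereferencing null,
    // the null check Derrick recommends before using a tag's children.
    public static int childCount(List<?> children) {
        if (children == null) {
            return 0; // tags with no body may report null children
        }
        return children.size();
    }

    public static void main(String[] args) {
        System.out.println(childCount(null));                 // 0
        System.out.println(childCount(Arrays.asList("a", "b"))); // 2
    }
}
```

The same guard applies to any attribute lookup that can return null for a tag that lacks the attribute.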
|
From: mike l. <s9...@ya...> - 2005-06-11 15:31:21
|
regards:
Why does the following error sometimes happen?
http://0rz.net/4d0ry
Any positive suggestion is welcome.
thank you
May goodness be with you all
Thank you, positive feedback to mi...@gm...
|