java.lang.Object
  |
  +--collector.Collector
Collects from Web servers the HTML pages to be classified. The collection starts from an "index site" and stays within its boundary. An index site is defined by:
1. a list of base URLs,
2. a list of boundary URL prefixes,
3. a stop-list.
The collector retrieves any URL that:
1. is reachable from a base URL,
2. is within the boundary URLs,
3. does not match an element in the stop-list.
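The three retrieval conditions can be sketched as a breadth-first traversal. This is only an illustration, not the Collector's actual code: the class `CrawlSketch`, the method `select`, and the in-memory `links` map (which stands in for fetching pages from the network) are all hypothetical. It also assumes that boundary matching is a plain prefix test, that a stop-list entry matches when it occurs anywhere in the URL, and that base URLs are themselves subject to both checks.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CrawlSketch {
    // Hypothetical helper: returns the URLs that would be retrieved.
    // 'links' maps a page URL to the URLs found inside that page.
    static Set<String> select(List<String> baseUrls,
                              List<String> boundaryPrefixes,
                              List<String> stopList,
                              Map<String, List<String>> links) {
        Set<String> collected = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>(baseUrls);
        while (!queue.isEmpty()) {
            String url = queue.removeFirst();
            if (collected.contains(url)) {
                continue; // already retrieved
            }
            // Assumption: boundary test is String.startsWith on the raw URL.
            boolean inside = boundaryPrefixes.stream().anyMatch(url::startsWith);
            // Assumption: a stop-list entry matches as a substring of the URL.
            boolean stopped = stopList.stream().anyMatch(url::contains);
            if (!inside || stopped) {
                continue; // outside the index site, or excluded by the stop-list
            }
            collected.add(url);
            // Follow the links of the retrieved page (condition 1: reachability).
            queue.addAll(links.getOrDefault(url, List.of()));
        }
        return collected;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
            "http://www.server.it/indice/", List.of(
                "http://www.server.it/indice/a.html",
                "http://www.server.it/cgi-bin/search",
                "http://www.example.org/"),
            "http://www.server.it/indice/a.html", List.of(
                "http://www.server.it/indice/"));
        System.out.println(select(
            List.of("http://www.server.it/indice/"),
            List.of("http://www.server.it/"),
            List.of("/cgi-bin/"),
            links));
    }
}
```

With this sample graph only the two pages under "http://www.server.it/indice/" are selected: the CGI link is dropped by the stop-list and "http://www.example.org/" falls outside the boundary.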
Any URL present in a retrieved document is examined to determine whether:
1. it links to an internal document, in which case the referred document is collected;
2. it links to an external document, in which case the referred document is classified according to the context of the URL.
The boundary URL prefixes are used to distinguish internal from external pages.
Ex.: if "http://www.yahoo.com/" is a boundary URL prefix, any page in Yahoo is considered within the "index site".
Ex.: if "http://www.server.it/indice/" is a boundary URL prefix, "http://www.server.it/doc/paper1.html" is NOT considered within the "index site".
Ex.: if "http://www.server.it/index.html" is a boundary URL prefix, only the page "index.html" is within the index site.
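The internal/external distinction in the examples above can be expressed as a prefix test. The sketch below is an assumption about how such a test might look, not the class's real implementation: `BoundaryCheck`, `LinkKind`, and `classify` are hypothetical names, and the comparison is a case-sensitive `String.startsWith` on the URL exactly as written, with no normalization.

```java
import java.util.List;

final class BoundaryCheck {
    // Hypothetical result type for a link found in a retrieved document.
    enum LinkKind { INTERNAL, EXTERNAL }

    // A link is INTERNAL when it starts with at least one boundary prefix.
    static LinkKind classify(String url, List<String> boundaryPrefixes) {
        for (String prefix : boundaryPrefixes) {
            if (url.startsWith(prefix)) {
                return LinkKind.INTERNAL;
            }
        }
        return LinkKind.EXTERNAL;
    }

    public static void main(String[] args) {
        // Any page under the Yahoo prefix is inside the index site.
        System.out.println(classify("http://www.yahoo.com/Computers/",
                List.of("http://www.yahoo.com/")));
        // A page outside "/indice/" is external.
        System.out.println(classify("http://www.server.it/doc/paper1.html",
                List.of("http://www.server.it/indice/")));
        // A prefix naming a single page matches that page.
        System.out.println(classify("http://www.server.it/index.html",
                List.of("http://www.server.it/index.html")));
    }
}
```

A raw prefix test also accepts any URL that merely extends the prefix string (for example, the same page followed by a query string), so an implementation that needs exact single-page matching would probably have to normalize or parse URLs first.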
Finally, the stop-list contains strings that identify pages which are internal to the site but should not be visited. This is particularly useful to avoid pages referring to CGI scripts or other services, or in general non-HTML pages.
Ex.: if "/cgi-bin/" is in the stop-list, a link such as "http://www.server.it/cgi-bin/" will not be collected.
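A stop-list check might look like the following minimal sketch. The documentation does not say how entries are matched, so substring matching is an assumption here, and `StopListCheck` and `isStopped` are hypothetical names.

```java
import java.util.List;

final class StopListCheck {
    // Assumption: a URL is excluded when any stop-list entry occurs in it.
    static boolean isStopped(String url, List<String> stopList) {
        for (String entry : stopList) {
            if (url.contains(entry)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> stopList = List.of("/cgi-bin/");
        // The example from the text: this link is not collected.
        System.out.println(isStopped("http://www.server.it/cgi-bin/", stopList));
        // An ordinary HTML page is unaffected.
        System.out.println(isStopped("http://www.server.it/indice/a.html", stopList));
    }
}
```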
Constructor Summary
Collector()
Method Summary
void addDaUsare(java.lang.String url, float[] pesi)
        Specifies a URL to process.
void collect(Site site)
        Collects pages from site.
Contesto getContext(java.lang.String url)
        Builds the Context for the URL, either reading it from the net or from the collected Contexts.
boolean indiceUsato(java.lang.String URL)
static void main(java.lang.String[] args)
        The class can be tested from the command line.
static java.io.File newFile(java.lang.String dir, java.lang.String prefix, java.lang.String ext)
        Builds the local file to which the URL is downloaded. Assumes the URL is absolute.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Constructor Detail
public Collector()
Method Detail
public void addDaUsare(java.lang.String url, float[] pesi)
public boolean indiceUsato(java.lang.String URL)
public void collect(Site site)
public static java.io.File newFile(java.lang.String dir, java.lang.String prefix, java.lang.String ext)
public Contesto getContext(java.lang.String url)
public static void main(java.lang.String[] args) throws java.lang.Exception