collector
Class Collector

java.lang.Object
  |
  +--collector.Collector

public class Collector
extends java.lang.Object

 Collects the HTML pages to be classified from Web servers.
 

The collection starts from an "index site" and stays within its boundary. An index site is defined by: 1. a list of base URLs, 2. a list of boundary URL prefixes, 3. a stop-list.

The collector retrieves any URL which: 1. is reachable from a base URL, 2. is within the boundary URLs, 3. does not match an element in the stop-list.

Any URL present in a retrieved document is examined to determine whether: 1. it links to an internal document, in which case the referred document is collected; 2. it links to an external document, in which case the referred document is classified according to the context of the URL.

The boundary URL prefixes are used to distinguish internal from external pages.

Ex.: if "http://www.yahoo.com/" is a boundary URL prefix, any page in Yahoo is considered within the "index site".

Ex.: if "http://www.server.it/indice/" is a boundary URL prefix, "http://www.server.it/doc/paper1.html" is NOT considered within the "index site".

Ex.: if "http://www.server.it/index.html" is a boundary URL prefix, only the page "index.html" is within the index site.
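
The three examples above can be reproduced with a plain prefix test. The following is a minimal sketch, not the Collector's actual code: the class BoundaryCheck and the method isInternal are hypothetical names, and the sketch assumes that a boundary URL prefix is matched with a simple "starts with" comparison on the absolute URL.

```java
import java.util.List;

// Hypothetical helper, not part of the collector package.
class BoundaryCheck {

    // Assumption: a URL is internal when it starts with at least one
    // boundary URL prefix (plain string comparison, no normalization).
    static boolean isInternal(String url, List<String> boundaryPrefixes) {
        for (String prefix : boundaryPrefixes) {
            if (url.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The page URL under www.yahoo.com is an illustrative assumption.
        System.out.println(isInternal(
                "http://www.yahoo.com/news/index.html",
                List.of("http://www.yahoo.com/")));
        // A page outside the "/indice/" prefix is external.
        System.out.println(isInternal(
                "http://www.server.it/doc/paper1.html",
                List.of("http://www.server.it/indice/")));
        // A prefix that names a single page admits that page.
        System.out.println(isInternal(
                "http://www.server.it/index.html",
                List.of("http://www.server.it/index.html")));
    }
}
```

Under this assumption a prefix such as "http://www.server.it/index.html" would also admit URLs that merely extend it (for instance with a query string), so a real implementation may need a stricter comparison.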

Finally, the stop-list contains strings that identify pages which are internal to the site but should not be visited. This is particularly useful to avoid pages referring to CGI scripts or other services, or non-HTML pages in general. Ex.: if "/cgi-bin/" is in the stop-list, a link such as "http://www.server.it/cgi-bin/" will not be collected.
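
The three retrieval conditions (reachable from a base URL, within the boundary, not in the stop-list) can be sketched as a breadth-first traversal. This is an illustration under stated assumptions, not the Collector's implementation: the class CrawlSketch and its methods are hypothetical, the link graph is an in-memory map instead of real HTTP downloads, stop-list entries are assumed to match as substrings of the URL, and base URLs are assumed to pass through the same filters as any other link. Classification of external links is not modeled; they are simply skipped.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not part of the collector package.
class CrawlSketch {

    // Assumption: boundary prefixes are matched with startsWith.
    static boolean withinBoundary(String url, List<String> prefixes) {
        for (String prefix : prefixes) {
            if (url.startsWith(prefix)) return true;
        }
        return false;
    }

    // Assumption: a stop-list entry matches anywhere inside the URL.
    static boolean inStopList(String url, List<String> stopList) {
        for (String stop : stopList) {
            if (url.contains(stop)) return true;
        }
        return false;
    }

    // Returns the URLs that would be collected, in visit order.
    // 'links' maps each page URL to the URLs found in that page.
    static Set<String> collect(List<String> baseUrls,
                               List<String> prefixes,
                               List<String> stopList,
                               Map<String, List<String>> links) {
        Set<String> collected = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>(baseUrls);
        while (!queue.isEmpty()) {
            String url = queue.removeFirst();
            if (collected.contains(url)) continue;           // already visited
            if (!withinBoundary(url, prefixes)) continue;    // external page
            if (inStopList(url, stopList)) continue;         // internal, but excluded
            collected.add(url);
            for (String next : links.getOrDefault(url, List.of())) {
                queue.addLast(next);
            }
        }
        return collected;
    }

    public static void main(String[] args) {
        // Illustrative link graph; a.html is an assumed page name.
        Map<String, List<String>> links = Map.of(
                "http://www.server.it/indice/", List.of(
                        "http://www.server.it/indice/a.html",
                        "http://www.server.it/cgi-bin/",
                        "http://www.server.it/doc/paper1.html"),
                "http://www.server.it/indice/a.html", List.of(
                        "http://www.server.it/indice/"));
        Set<String> result = collect(
                List.of("http://www.server.it/indice/"),
                List.of("http://www.server.it/indice/", "http://www.server.it/cgi-bin/"),
                List.of("/cgi-bin/"),
                links);
        System.out.println(result);
    }
}
```

In this run only the two pages under "/indice/" are collected: the "/cgi-bin/" link lies within a boundary prefix but matches the stop-list, and "paper1.html" lies outside every boundary prefix.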


Constructor Summary
Collector()
           
 
Method Summary
 void addDaUsare(java.lang.String url, float[] pesi)
          Specifies a URL to process.
 void collect(Site site)
          Collects pages from the site.
 Contesto getContext(java.lang.String url)
          Builds the Context for the URL, either by reading it from the network or from the collected Contexts.
 boolean indiceUsato(java.lang.String URL)
           
static void main(java.lang.String[] args)
          The class can be tested from the command line.
static java.io.File newFile(java.lang.String dir, java.lang.String prefix, java.lang.String ext)
          Builds the local file to which the URL is downloaded.
Assumes that the URL is absolute.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

Collector

public Collector()
Method Detail

addDaUsare

public void addDaUsare(java.lang.String url,
                       float[] pesi)
Specifies a URL to process.

indiceUsato

public boolean indiceUsato(java.lang.String URL)
Returns:
true if the link has already been visited.

collect

public void collect(Site site)
Collects pages from the site. Also attempts to retrieve the indexes whose download failed.

newFile

public static java.io.File newFile(java.lang.String dir,
                                   java.lang.String prefix,
                                   java.lang.String ext)
Builds the local file to which the URL is downloaded.
Assumes that the URL is absolute.

getContext

public Contesto getContext(java.lang.String url)
Builds the Context for the URL, either by reading it from the network or from the collected Contexts.
Returns:
the new Context, or null on failure

main

public static void main(java.lang.String[] args)
                 throws java.lang.Exception
The class can be tested from the command line. A URL to classify and the site specification file must be provided. After the URL, pairs that define the initial history for the cataloguing can be specified.