collector
Class Collector

java.lang.Object
  |
  +--collector.Collector

public class Collector
extends java.lang.Object

Collects from Web servers the HTML pages to be classified.

The collection starts from an "index site" and stays within its boundary. An index site is defined by:

1. a list of base URLs,
2. a list of boundary URL prefixes,
3. a stop-list.

The collector retrieves any URL which:

1. is reachable from a base URL,
2. is within the boundary URLs,
3. does not match an element in the stop-list.
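The three conditions above can be sketched as a breadth-first traversal. This is only an illustration, not the Collector's actual code: the class `CrawlSketch`, the method `collectable`, and the in-memory `links` map (standing in for pages fetched from the network) are all hypothetical. It also assumes that "within the boundary" means the URL starts with a boundary prefix and that a stop-list entry matches when it occurs anywhere in the URL.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these names belong to the collector package.
class CrawlSketch {

    /** Returns the URLs that satisfy all three conditions, in visit order. */
    static Set<String> collectable(List<String> baseUrls,
                                   List<String> boundaryPrefixes,
                                   List<String> stopList,
                                   Map<String, List<String>> links) {
        Set<String> collected = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>(baseUrls);
        while (!queue.isEmpty()) {
            String url = queue.removeFirst();
            // Assumption: boundary test is a plain prefix match.
            boolean inside = boundaryPrefixes.stream().anyMatch(url::startsWith);
            // Assumption: stop-list test is a substring match.
            boolean stopped = stopList.stream().anyMatch(url::contains);
            // Skip URLs outside the boundary, stopped, or already collected.
            if (!inside || stopped || !collected.add(url)) {
                continue;
            }
            // Follow the links found in the "retrieved" page.
            queue.addAll(links.getOrDefault(url, List.of()));
        }
        return collected;
    }

    public static void main(String[] args) {
        // In-memory link graph used instead of real HTTP requests.
        Map<String, List<String>> links = Map.of(
            "http://www.server.it/indice/", List.of(
                "http://www.server.it/indice/a.html",
                "http://www.server.it/indice/cgi-bin/search",
                "http://www.other.org/"),
            "http://www.server.it/indice/a.html", List.of(
                "http://www.server.it/indice/"));
        System.out.println(collectable(
            List.of("http://www.server.it/indice/"),
            List.of("http://www.server.it/indice/"),
            List.of("/cgi-bin/"),
            links));
    }
}
```

In this sketch the base URLs themselves go through the same boundary and stop-list checks as every other URL, which is a simplifying assumption.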

Any URL present in a retrieved document is examined to determine whether:

1. it links to an internal document, in which case the referred document is collected;
2. it links to an external document, in which case the referred document is classified according to the context of the URL.

The boundary URL prefixes are used to distinguish internal from external pages.

Ex.: if "http://www.yahoo.com/" is a boundary URL prefix, any page in Yahoo is considered within the "index site".

Ex.: if "http://www.server.it/indice/" is a boundary URL prefix, "http://www.server.it/doc/paper1.html" is NOT considered within the "index site".

Ex.: if "http://www.server.it/index.html" is a boundary URL prefix, only page "index.html" is within the index site.
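The three examples above are consistent with a plain string-prefix test. The following is a minimal sketch under that assumption; `BoundaryCheck` and `isInternal` are hypothetical names, not part of the documented class.

```java
import java.util.List;

// Hypothetical helper illustrating the boundary examples; not the Collector's code.
class BoundaryCheck {

    /** True if the URL starts with at least one boundary URL prefix. */
    static boolean isInternal(String url, List<String> boundaryPrefixes) {
        for (String prefix : boundaryPrefixes) {
            if (url.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Any page under the Yahoo prefix is internal.
        System.out.println(isInternal("http://www.yahoo.com/Science/",
                List.of("http://www.yahoo.com/")));            // true
        // A page outside /indice/ is external.
        System.out.println(isInternal("http://www.server.it/doc/paper1.html",
                List.of("http://www.server.it/indice/")));     // false
        // A prefix naming a single page admits that page.
        System.out.println(isInternal("http://www.server.it/index.html",
                List.of("http://www.server.it/index.html")));  // true
    }
}
```

Note that a pure prefix test would also accept URLs that merely extend the prefix, such as the same page followed by a query string; whether the real implementation does so is not stated here.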

Finally, the stop-list contains strings that identify pages which are internal to the site but should not be visited. This is particularly useful for avoiding pages that refer to CGI scripts or other services, or non-HTML pages in general.

Ex.: if "/cgi-bin/" is in the stop-list, a link such as "http://www.server.it/cgi-bin/" will not be collected.
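Putting the boundary and the stop-list together, each link found in a page falls into one of three cases: collect it, skip it, or treat it as external and classify it by context. The sketch below illustrates that decision only; `LinkTriage`, `Action`, and `triage` are hypothetical names, and substring matching for stop-list entries is an assumption drawn from the "/cgi-bin/" example.

```java
import java.util.List;

// Hypothetical sketch of the per-link decision; not the Collector's code.
class LinkTriage {

    enum Action { COLLECT, SKIP, CLASSIFY_BY_CONTEXT }

    static Action triage(String url,
                         List<String> boundaryPrefixes,
                         List<String> stopList) {
        // External links are not downloaded; they are classified by context.
        boolean internal = boundaryPrefixes.stream().anyMatch(url::startsWith);
        if (!internal) {
            return Action.CLASSIFY_BY_CONTEXT;
        }
        // Internal links matching a stop-list entry are not visited.
        // Assumption: an entry matches if it occurs anywhere in the URL.
        boolean stopped = stopList.stream().anyMatch(url::contains);
        return stopped ? Action.SKIP : Action.COLLECT;
    }

    public static void main(String[] args) {
        List<String> boundary = List.of("http://www.server.it/");
        List<String> stop = List.of("/cgi-bin/");
        System.out.println(triage("http://www.server.it/doc/paper1.html", boundary, stop));
        System.out.println(triage("http://www.server.it/cgi-bin/", boundary, stop));
        System.out.println(triage("http://www.yahoo.com/", boundary, stop));
    }
}
```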

Constructor Summary
`Collector()`

Method Summary
`void`	`addDaUsare(java.lang.String url, float[] pesi)` Specifies a URL to process.
`void`	`collect(Site site)` Collects pages from site.
`Contesto`	`getContext(java.lang.String url)` Builds the Context for the URL, reading it either from the network or from the collected Contexts.
`boolean`	`indiceUsato(java.lang.String URL)` Returns whether the link has been visited.
`static void`	`main(java.lang.String[] args)` The class can be tested from the command line.
`static java.io.File`	`newFile(java.lang.String dir, java.lang.String prefix, java.lang.String ext)` Builds the local file to which the URL is downloaded. Assumes that the URL is absolute.

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Detail

Collector

public Collector()

Method Detail

addDaUsare

public void addDaUsare(java.lang.String url,
                       float[] pesi)

Specifies a URL to process.

indiceUsato

public boolean indiceUsato(java.lang.String URL)

Returns: whether the link has been visited.

collect

public void collect(Site site)

Collects pages from site. Also attempts to retrieve the indexes whose download failed.

newFile

public static java.io.File newFile(java.lang.String dir,
                                   java.lang.String prefix,
                                   java.lang.String ext)

Builds the local file to which the URL is downloaded.
Assumes that the URL is absolute.

getContext

public Contesto getContext(java.lang.String url)

Builds the Context for the URL, reading it either from the network or from the collected Contexts.

Returns: a new Context, or null on failure

main

public static void main(java.lang.String[] args)
                 throws java.lang.Exception

The class can be tested from the command line. A URL to classify and the site specification file must be provided. After the URL, pairs can be specified that define the initial history for the classification.