Tanl Linguistic Pipeline

Tanl::Corpus Class Reference
[Classifier, Dependency Parser]

Represents common aspects of a Corpus.

#include <Corpus.h>

Inherited by Tanl::CombCorpus, Tanl::Conll08Corpus, Tanl::ConllXCorpus, Tanl::DgaCorpus, Tanl::TextCorpus, and Tanl::TokenizedTextCorpus.

Public Member Functions

 Corpus (Language const &lang)
 Corpus (Language const &lang, CorpusFormat &format)
 Create from specified CorpusFormat.
 Corpus (Language const &lang, char const *formatFile)
 Read the corpus format from file formatFile.
AttributeId attributeId (const char *name)
virtual SentenceReader * sentenceReader (std::istream *is)
virtual void print (std::ostream &os, Sentence const &sent) const
 Print the sentence in the standard format for the corpus.
virtual std::string toString (Sentence const &sent) const

Static Public Member Functions

static Corpus * create (Language const &language, char const *inputFormat)
 Factory pattern for creating a Corpus based on the provided format.
static Corpus * create (char const *language, char const *inputFormat)
static CorpusFormat * parseFormat (char const *formatFile)
 Read the corpus format from file formatFile.

Public Attributes

Language const & language
AttributeIndex index
 associates an index to field names
TokenFields tokenFields
 describes properties of fields in tokens

Static Protected Member Functions

static CorpusFormat * parseFormat (std::istream &is)

Detailed Description

Represents common aspects of a Corpus.


Constructor & Destructor Documentation

Tanl::Corpus::Corpus ( Language const &  lang  )  [inline]
Parameters:
lang the default language for sentences in the corpus.
Tanl::Corpus::Corpus ( Language const &  lang,
CorpusFormat format 
) [inline]

Create from specified CorpusFormat.

Parameters:
lang the default language for sentences in the corpus.
Tanl::Corpus::Corpus ( Language const &  lang,
char const *  formatFile 
)

Read the corpus format from file formatFile.

Parameters:
lang the default language for sentences in the corpus.

References Tanl::CorpusFormat::index, index, parseFormat(), Tanl::CorpusFormat::tokenFields, and tokenFields.

Member Function Documentation

AttributeId Tanl::Corpus::attributeId ( const char *  name  )  [inline]
Returns:
the index (into the vector of values for tokens) of the attribute with the given name.
Parameters:
name the name of the attribute.

References index, and Tanl::AttributeIndex::insert().
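
The lookup described above can be pictured with a small self-contained sketch. It is not the Tanl implementation: SketchAttributeIndex and valueOf are hypothetical stand-ins, and the insert-or-lookup behaviour (a repeated name returns the id it was first given) is an assumption based on attributeId() referencing Tanl::AttributeIndex::insert().

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for Tanl::AttributeId; the real type may differ.
typedef std::size_t AttributeId;

// Hypothetical stand-in for Tanl::AttributeIndex (assumed semantics).
class SketchAttributeIndex {
public:
  // Return the id of `name`, assigning the next free id the first time
  // the name is seen.
  AttributeId insert(const char* name) {
    assert(name != nullptr);
    auto result = ids_.emplace(name, names_.size());
    if (result.second)          // newly inserted: remember the name
      names_.push_back(name);
    return result.first->second;
  }

  std::size_t size() const { return names_.size(); }

private:
  std::unordered_map<std::string, AttributeId> ids_;
  std::vector<std::string> names_;
};

// A token's attribute values are stored positionally, so the id returned
// by insert() indexes directly into the token's vector of values.
inline const std::string& valueOf(const std::vector<std::string>& values,
                                  AttributeId id) {
  return values.at(id);
}
```

With such an index, a caller would resolve a field name to an id once and then reuse that id for every token, instead of comparing strings per token.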

Corpus * Tanl::Corpus::create ( Language const &  language,
char const *  inputFormat 
) [static]

Factory pattern for creating a Corpus based on the provided format.

Parameters:
language the default language for sentences in the corpus.
inputFormat either the name of a built-in format (CoNLL, conll08, DgaXML, Text, or TokenizedText) or the name of a file containing the format specification.

References parseFormat().
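
The dispatch that the inputFormat parameter implies can be sketched as follows. This is an illustration under stated assumptions, not the Tanl code: FormatSketch, BuiltinFormatSketch, FileFormatSketch and createSketch are hypothetical names, the comparison is assumed to be case-sensitive, and the sketch returns a std::unique_ptr whereas Corpus::create() returns a raw Corpus pointer whose ownership is not stated in this reference.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-ins; these are not the Tanl classes.
struct FormatSketch {
  virtual ~FormatSketch() {}
  virtual bool builtin() const = 0;
};

// Chosen when inputFormat names one of the built-in formats.
struct BuiltinFormatSketch : FormatSketch {
  explicit BuiltinFormatSketch(const std::string& n) : name(n) {}
  bool builtin() const override { return true; }
  std::string name;
};

// Chosen otherwise: inputFormat is taken to be a format-specification file.
struct FileFormatSketch : FormatSketch {
  explicit FileFormatSketch(const std::string& f) : formatFile(f) {}
  bool builtin() const override { return false; }
  std::string formatFile;
};

// Factory: try the built-in names first, then fall back to a file name.
std::unique_ptr<FormatSketch> createSketch(const char* inputFormat) {
  assert(inputFormat != nullptr);
  // Built-in names as listed in the parameter description above.
  static const char* const builtins[] = {
      "CoNLL", "conll08", "DgaXML", "Text", "TokenizedText"};
  const std::string name(inputFormat);
  for (const char* b : builtins)
    if (name == b)
      return std::unique_ptr<FormatSketch>(new BuiltinFormatSketch(name));
  return std::unique_ptr<FormatSketch>(new FileFormatSketch(name));
}
```

In the real class, the file branch presumably goes through parseFormat(), which returns 0 when reading fails, so callers of the actual factory may need to check the result before using it.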

CorpusFormat * Tanl::Corpus::parseFormat ( char const *  formatFile  )  [static]

Read the corpus format from file formatFile.

Returns:
created format or 0 if reading failed.

Referenced by Corpus(), and create().

virtual SentenceReader* Tanl::Corpus::sentenceReader ( std::istream *  is  )  [virtual]
Returns:
an appropriate reader for reading sentences of the corpus from the given stream is.

Reimplemented in Tanl::ConllXCorpus, Tanl::DgaCorpus, Tanl::TextCorpus, and Tanl::TokenizedTextCorpus.
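
How a subclass might supply its own reader can be sketched with stand-in types. Everything here is hypothetical: the interface of Tanl::SentenceReader is not shown in this reference, so SketchReader::next(), LineReader and the one-sentence-per-line, whitespace-separated input are assumptions made only to illustrate the virtual dispatch.

```cpp
#include <cassert>
#include <istream>
#include <memory>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical sentence type: just a list of token strings.
typedef std::vector<std::string> SketchSentence;

// Hypothetical reader interface (not Tanl::SentenceReader).
struct SketchReader {
  virtual ~SketchReader() {}
  // Fill `out` with the next sentence; return false at end of input.
  virtual bool next(SketchSentence& out) = 0;
};

// Reads one sentence per line, tokens separated by whitespace.
class LineReader : public SketchReader {
public:
  explicit LineReader(std::istream* is) : is_(is) { assert(is != nullptr); }

  bool next(SketchSentence& out) override {
    std::string line;
    if (!std::getline(*is_, line)) return false;
    out.clear();
    std::istringstream tokens(line);
    for (std::string tok; tokens >> tok;) out.push_back(tok);
    return true;
  }

private:
  std::istream* is_;  // not owned; the caller keeps the stream alive
};

// Base class with a virtual factory method for readers.
struct ReaderCorpusSketch {
  virtual ~ReaderCorpusSketch() {}
  virtual std::unique_ptr<SketchReader> sentenceReader(std::istream* is) = 0;
};

// A subclass selects the reader that matches its format.
struct TokenizedCorpusSketch : ReaderCorpusSketch {
  std::unique_ptr<SketchReader> sentenceReader(std::istream* is) override {
    return std::unique_ptr<SketchReader>(new LineReader(is));
  }
};
```

Client code written against the base class only calls sentenceReader() and loops on the returned reader, so supporting a new corpus format would mean adding a subclass rather than changing the callers.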


Copyright © 2005-2011 G. Attardi. Generated on 4 Mar 2011 by doxygen 1.6.1.