Tanl Linguistic Pipeline

Tanl::Text Namespace Reference

Text handling and internationalization support.

Namespaces

namespace  RegExp
 Regular Expression matching.
namespace  Unicode
 Utilities to handle UTF-8 strings.


Classes

class  Char
 Representation of Unicode characters.
class  Utf8Char
 This is just a type specifier for use in CharBuffer.
class  CChar
 This is just a type specifier for use in CharBuffer.
class  CharBuffer
 A text buffer that provides a random access iterator through it.
class  Encoding
class  HtmlTokenizer
 Similar to StringTokenizer, except that it skips HTML tags.
struct  Latin1Normalizer
 String normalizer interface.
struct  Normalizer
 String normalizer interface.
class  StreamTokenizer
class  String
 String class. This class stores and manipulates strings of characters defined according to ISO 10646.
class  StringTokenizer
class  Suffixes
 List of string suffixes.
struct  eqstr
struct  eqstrcase
struct  WordIndex
 Associates an ID to each word in a set.
class  WordSetBase
class  WordSet
 Set of words.
struct  NormEqual
 Compare strings by normalizing to lowercase and discarding dots.
struct  NormHash
class  NormWordSet

Typedefs

typedef unsigned short UCS2
 UCS2 holds a single UTF-16 code unit.
typedef int UCS4
 UCS4 represents a Unicode code point.

Functions

char iso8859_to_ascii (char c)
 Convert an 8-bit ISO 8859-1 (Latin 1) character to its closest 7-bit ASCII equivalent.
bool operator== (const String &s1, const String &s2)
bool operator== (const String &s1, const std::string &s2)
bool operator== (const String &s1, const char *s2)
bool operator== (const std::string &s1, const String &s2)
bool operator== (const char *s1, const String &s2)
bool operator!= (const String &s1, const String &s2)
bool operator< (const String &s1, const String &s2)
bool operator> (const String &s1, const String &s2)
bool operator<= (const String &s1, const String &s2)
bool operator>= (const String &s1, const String &s2)
String operator+ (const String &s1, const String &s2)
String operator+ (const String &s1, String::CharType *c)
String operator+ (String::CharType *c, const String &s1)
String operator+ (const String &s1, String::CharType c)
String operator+ (String::CharType c, const String &s1)
bool strStartsWith (const char *s1, const char *init)
 Determine whether string s1 starts with the sequence in init, disregarding case.
void itoa (register long n, register char *s)
 Convert a long integer to a string.
void to_lower (register char *d, register char const *s)
 Convert a string to lower case.
char * to_lower (register char *s)
 Destructively convert a string to lower case.
string & to_lower (string &s)
 Convert a string to lower case.
void to_upper (register char *d, register char const *s)
 Convert a string to upper case.
char * to_upper (register char *s)
 Destructively convert a string to upper case.
string & to_upper (string &s)
 Convert a string to upper case.
char const * next_token (char const *&ptr, const char *sep, char esc)
 Simple string tokenizer, with escape.
char const * next_token_line (char const *&ptr, const char *sep, char esc)
 Simple string tokenizer, which returns the next token within the line.
char * strstr (const char *haystack, const char *needle, size_t count)
 Variant of strstr() which limits search to count characters in haystack.
std::string operator+ (const std::string s, const int i)
std::string operator+ (const int i, const std::string s)
std::string operator+ (const std::string s, const unsigned i)
std::string operator+ (const unsigned i, const std::string s)
void itoa (long, char *)
 String utilities.
char to_lower (char c)
char * to_lower (char *)
std::string & to_lower (std::string &)
char to_upper (char c)
char * to_upper (char *)
std::string & to_upper (std::string &)
char * strndup (char const *s, int len)
 Variant of strdup() which copies len characters from s.
int strncasecmp (const char *s1, const char *s2)
bool strempty (const char *s)
 Test for empty string.

Variables

char const iso8859_map []

Detailed Description

Text handling and internationalization support.

See also:
ICU

Function Documentation

char Tanl::Text::iso8859_to_ascii ( char  c  )  [inline]

Convert an 8-bit ISO 8859-1 (Latin 1) character to its closest 7-bit ASCII equivalent.

(This mostly means that accents are stripped.)

This function exists to ensure that the value of the character used to index the iso8859_map[] vector declared above is unsigned.

Parameters:
c The character to be converted.
Returns:
The closest 7-bit ASCII equivalent of c.

See also:

International Standards Organization. ISO 8859-1: Information Processing -- 8-bit single-byte coded graphic character sets -- Part 1: Latin alphabet No. 1, 1987.
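
The remark about the unsigned index can be illustrated with a small sketch. The names demo_map and demo_iso8859_to_ascii are hypothetical, and the table holds only a few illustrative entries; the contents of the real iso8859_map[] are not shown in this reference.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for iso8859_map[]: identity for every byte except a
// handful of accented letters. These entries are illustrative assumptions.
static std::array<char, 256> make_demo_map() {
    std::array<char, 256> m{};
    for (std::size_t i = 0; i < m.size(); ++i) m[i] = static_cast<char>(i);
    m[0xC0] = 'A';  // A with grave
    m[0xE0] = 'a';  // a with grave
    m[0xE8] = 'e';  // e with grave
    m[0xE9] = 'e';  // e with acute
    m[0xF1] = 'n';  // n with tilde
    return m;
}

static const std::array<char, 256> demo_map = make_demo_map();

// Where plain char is signed, bytes >= 0x80 are negative; casting to
// unsigned char first keeps the index within 0..255.
inline char demo_iso8859_to_ascii(char c) {
    return demo_map[static_cast<unsigned char>(c)];
}
```

Without the cast, a Latin 1 byte such as 0xE9 could index the table with a negative value on platforms where char is signed.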

void Tanl::Text::itoa ( register long  n,
register char *  s 
)

Convert a long integer to a string.

Parameters:
n The long integer to be converted.
s A pointer to the string.
char const * Tanl::Text::next_token ( char const *&  ptr,
const char *  sep,
char  esc 
)

Simple string tokenizer, with escape.

A token is a sequence of characters delimited by characters in sep, except when preceded by esc.

Parameters:
ptr The string to tokenize. Advanced to the end of the token.
sep Sequence of delimiting characters.
esc Escape character for line continuation.
Returns:
The first token from ptr.
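
The escape rule can be sketched as follows. demo_next_token is a hypothetical variant that copies the token into a std::string rather than returning char const*. It assumes that leading separators are skipped and that the escape character is dropped while the character after it is kept; the reference does not confirm either detail.

```cpp
#include <cstring>
#include <string>

// Hypothetical variant of next_token(); see the assumptions stated above.
// Returns false when no token is left, otherwise fills `token` and leaves
// `ptr` at the end of the token.
bool demo_next_token(const char*& ptr, const char* sep, char esc,
                     std::string& token) {
    token.clear();
    // Skip any run of leading separators.
    while (*ptr && std::strchr(sep, *ptr)) ++ptr;
    if (!*ptr) return false;  // nothing left to tokenize
    while (*ptr) {
        if (*ptr == esc && ptr[1]) {
            token += ptr[1];  // escaped character, even if it is a separator
            ptr += 2;
        } else if (std::strchr(sep, *ptr)) {
            break;  // unescaped separator ends the token
        } else {
            token += *ptr++;
        }
    }
    return true;
}
```

Calling it repeatedly on the same ptr walks through the input one token at a time.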
char const * Tanl::Text::next_token_line ( char const *&  ptr,
const char *  sep,
char  esc 
)

Simple string tokenizer, which returns the next token within the line.

A token is a sequence of characters delimited by characters in sep, except if preceded by esc.

Parameters:
ptr The string to tokenize. Advanced to the end of the token.
sep Sequence of delimiting characters.
esc Escape character for line continuation.
Returns:
The first token from ptr.

Referenced by Tanl::NER::conf_feature::parseValue(), IXE::conf< Replacements >::parseValue(), and Parser::conf_feature::parseValue().

bool Tanl::Text::strempty ( const char *  s  )  [inline]

Test for empty string.

Returns:
true if string s is null or empty.
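
A check with this behaviour might look like the sketch below. demo_strempty is a hypothetical name used for illustration.

```cpp
// Hypothetical equivalent: true for a null pointer or a zero-length string.
// The null test comes first so that *s is never evaluated on a null pointer.
inline bool demo_strempty(const char* s) {
    return s == nullptr || *s == '\0';
}
```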
string& Tanl::Text::to_lower ( string &  s  ) 

Convert a string to lower case.

Parameters:
s The string to be converted.
Returns:
The modified string converted to lower-case.
char* Tanl::Text::to_lower ( register char *  s  ) 

Destructively convert a string to lower case.

Parameters:
s The string to be converted.
Returns:
The same string, after conversion.

References to_lower().

void Tanl::Text::to_lower ( register char *  d,
register char const *  s 
)

Convert a string to lower case.

Parameters:
d The destination string.
s The string to be converted.

Referenced by Tanl::Text::NormWordSet::insert(), and to_lower().

string& Tanl::Text::to_upper ( string &  s  ) 

Convert a string to upper case.

Parameters:
s The string to be converted.
Returns:
The modified string converted to upper-case.
char* Tanl::Text::to_upper ( register char *  s  ) 

Destructively convert a string to upper case.

Parameters:
s The string to be converted.
Returns:
The same string, after conversion.

References to_upper().

void Tanl::Text::to_upper ( register char *  d,
register char const *  s 
)

Convert a string to upper case.

Parameters:
d The destination string.
s The string to be converted.

Referenced by Tanl::SST::SstFeatureExtractor::extract(), Tanl::NER::NerFeatureExtractor::extract(), and to_upper().

Copyright © 2005-2011 G. Attardi. Generated on 4 Mar 2011 by doxygen 1.6.1.