Tanl Linguistic Pipeline

Tanl::Text::RegExp::Pattern Class Reference

Regular expression matching.

#include <RegExp.h>

Inherited by IXE::Pattern2.

Public Member Functions

 Pattern (std::string const &expression, int cflags=0)
 Pattern (char const *expression, int cflags=0)
 Pattern (Pattern const &other)
 Copy constructor.
Pattern & operator= (Pattern const &other)
 Assignment.
bool test (std::string const &str, int eflags=0) const
 Tests whether the pattern matches the given string str.
bool test (char const *str, size_t len=0, int eflags=0)
 Tests whether the pattern matches the given string str, within the given length len.
int matchSize (std::string const &text, int eflags=0)
 Computes the size of the match.
int match (const char *start, const char *end, MatchGroups &pos, int eflags=0)
 Matches the text between start and end and returns the matching positions in pos, expressed as byte offsets from start.
int match (std::string const &text, MatchGroups &pos, int eflags=0)
 Matches the text in text and returns the matching positions in pos, expressed as adjusted character offsets (not byte offsets into the UTF-8 stream).
std::vector< std::string > match (std::string const &str, int eflags=0)
std::string replace (std::string &text, std::string &rewrite, bool replaceAll=false)
 Replaces the first substring matching the expression within text with the string rewrite.

Static Public Member Functions

static std::string escape (std::string &str)
 Escapes all meta characters.
static const unsigned char * setLocale (char const *locale)
 Set the locale for use during matching.

Static Public Attributes

static const unsigned char * CharTables = Pattern::setLocale(setlocale(LC_CTYPE, 0))
 The current chartable to use for matching.

Detailed Description

Regular Expression matching.

A pattern is compiled from a regular expression and used in matching. Regular expressions are written using the Perl 5 syntax.

A simple use for testing whether a string matches a pattern is:

      Pattern p("a*b");
      bool b = p.test("aaab");
  

In order to extract the portions of the string that match, MatchGroups can be used:

      Pattern p("(a*)b");
      MatchGroups m(2);
      string s("daaab");
      int n = p.match(s, m);
  

n is the number of groups matched: group 0 represents the substring captured by the whole pattern.
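
The group numbering can be illustrated with a minimal sketch that uses std::regex from the C++ standard library as a stand-in, since the Tanl headers are not assumed to be available here. The helper name captureGroups is hypothetical, and std::regex defaults to ECMAScript rather than Perl 5 syntax, although the two agree for this pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not part of Tanl): returns the text of every group
// for the first match of `expression` in `text`, or an empty vector.
std::vector<std::string> captureGroups(const std::string& expression,
                                       const std::string& text)
{
    std::regex re(expression);
    std::smatch m;
    std::vector<std::string> groups;
    if (std::regex_search(text, m, re)) {
        // m[0] is the whole match; m[1..] are the parenthesized groups.
        for (std::size_t i = 0; i < m.size(); ++i)
            groups.push_back(m[i].str());
    }
    return groups;
}
```

For the pattern "(a*)b" and the text "daaab", this sketch yields two entries: the whole match "aaab" as group 0 and "aaa" as group 1.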


Constructor & Destructor Documentation

Tanl::Text::RegExp::Pattern::Pattern ( std::string const &  expression,
int  cflags = 0 
)
Parameters:
expression the regular expression
cflags a combination of CompileFlags

NOTE. The ISO 8859-15 (Latin-9) locale is used by default: ensure that the locale files for LC_CTYPE=en_US.iso885915 are installed in the OS. This can be changed using setLocale().

Tanl::Text::RegExp::Pattern::Pattern ( char const *  expression,
int  cflags = 0 
)
Parameters:
expression the regular expression
cflags a combination of CompileFlags

NOTE. The ISO 8859-15 (Latin-9) locale is used by default: ensure that the locale files for LC_CTYPE=en_US.iso885915 are installed in the OS. This can be changed using setLocale().

References CharTables.

Tanl::Text::RegExp::Pattern::Pattern ( Pattern const &  other  )  [inline]

Copy constructor.

Uses pcre_refcount() to avoid freeing _pcre twice.
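
The idea behind reference-counted copying can be sketched without PCRE. The sketch below is an assumption about the general technique, not the Tanl source: Compiled is a hypothetical stand-in for the compiled pattern held in _pcre, its refs field plays the role of the count that pcre_refcount() adjusts, and Handle is a hypothetical owner class.

```cpp
#include <cassert>

// Hypothetical stand-in for a compiled pattern that carries its own count.
struct Compiled {
    int refs = 1;          // number of Handle objects sharing this object
    static int live;       // number of Compiled objects currently allocated
    Compiled()  { ++live; }
    ~Compiled() { --live; }
};
int Compiled::live = 0;

// Hypothetical owner: copies share the Compiled object instead of cloning it.
class Handle {
public:
    Handle() : code_(new Compiled) {}
    Handle(const Handle& other) : code_(other.code_) { ++code_->refs; }
    Handle& operator=(const Handle& other) {
        if (code_ != other.code_) {
            ++other.code_->refs;   // take the new reference first
            release();             // then drop the old one
            code_ = other.code_;
        }
        return *this;
    }
    ~Handle() { release(); }
    int refs() const { return code_->refs; }
private:
    void release() {
        // Free only when the last sharer goes away, so nothing is freed twice.
        if (--code_->refs == 0) delete code_;
    }
    Compiled* code_;
};
```

Each copy or assignment increments the shared count, and only the destructor that brings the count to zero frees the object.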


Member Function Documentation

std::vector<std::string> Tanl::Text::RegExp::Pattern::match ( std::string const &  str,
int  eflags = 0 
)
Parameters:
str the text to match.
eflags any combination of EvaluateFlags
Returns:
a vector<string>: element [0] is the matched substring, elements [1..n] are the subexpressions captured with '()'.
int Tanl::Text::RegExp::Pattern::match ( std::string const &  text,
MatchGroups &  pos,
int  eflags = 0 
)

Matches the text in text and returns the matching positions in pos, expressed as adjusted character offsets (not byte offsets into the UTF-8 stream).

Parameters:
text the string to match.
pos the identified matching positions.
eflags any combination of EvaluateFlags
Returns:
0 if not matching, otherwise the count of matched expressions.
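
The difference between byte offsets and character offsets matters only for multi-byte UTF-8 text. The following sketch shows one way to convert a byte offset into a character (code point) offset; the helper name charOffset is hypothetical, and whether Tanl performs the adjustment this way is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: number of UTF-8 code points that start before
// byteOffset in text. Continuation bytes have the bit pattern 10xxxxxx.
std::size_t charOffset(const std::string& text, std::size_t byteOffset)
{
    assert(byteOffset <= text.size());
    std::size_t chars = 0;
    for (std::size_t i = 0; i < byteOffset; ++i) {
        unsigned char c = static_cast<unsigned char>(text[i]);
        if ((c & 0xC0) != 0x80)   // count only lead bytes
            ++chars;
    }
    return chars;
}
```

For text that contains one two-byte character before the match, the character offset is one less than the byte offset; for pure ASCII text the two coincide.
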
int Tanl::Text::RegExp::Pattern::match ( const char *  start,
const char *  end,
MatchGroups &  pos,
int  eflags = 0 
)

Matches the text between start and end and returns the matching positions in pos, expressed as byte offsets from start.

Parameters:
start start of the text to match.
end end of the text to match.
pos the identified matching positions.
eflags any combination of EvaluateFlags
Returns:
0 if not matching, otherwise the count of matched expressions.

References Tanl::Text::RegExp::MatchGroups::size().

Referenced by Tanl::SST::TokenCategorizer::analyze(), Tanl::NER::TokenCategorizer::analyze(), Tanl::TokenSentenceReader::MoveNext(), Tanl::ConllXSentenceReader::MoveNext(), and Tanl::SentenceReader::MoveNext().

int Tanl::Text::RegExp::Pattern::matchSize ( std::string const &  text,
int  eflags = 0 
)

Computes the size of the match.

Parameters:
text the text to match.
eflags any combination of EvaluateFlags.
Returns:
0 if not matching, otherwise the size of the match.
Pattern& Tanl::Text::RegExp::Pattern::operator= ( Pattern const &  other  )  [inline]

Assignment.

Uses pcre_refcount() to avoid freeing _pcre twice.

Reimplemented in IXE::Pattern2.

std::string Tanl::Text::RegExp::Pattern::replace ( std::string &  text,
std::string &  rewrite,
bool  replaceAll = false 
)

Replaces the first substring matching the expression within text with the string rewrite.

If replaceAll is true, all occurrences are replaced. Within rewrite, backslash-escaped digits (\1 to \9) can be used to insert the text matching the corresponding parenthesized group from the pattern; \0 in rewrite refers to the entire matching text. E.g.,

	string s = "yabba dabba doo";
	string rewrite = "d";	// replace() takes rewrite by non-const reference
	RegExp::Pattern("b+").replace(s, rewrite);
	

returns "yada dabba doo". The result is undefined if rewrite contains invalid group references.
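
The first-versus-all behaviour can be reproduced with std::regex_replace from the standard library, shown here only as an analogue and not as the Tanl implementation. The helper replaceWith is hypothetical; note that std::regex writes group references as $1 rather than in the backslash form used by replace().

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical analogue of replace(): returns a copy of text with the first
// match (or every match, if replaceAll is true) replaced by rewrite.
std::string replaceWith(const std::string& expression,
                        const std::string& text,
                        const std::string& rewrite,
                        bool replaceAll = false)
{
    std::regex re(expression);
    std::regex_constants::match_flag_type flags =
        replaceAll ? std::regex_constants::format_default
                   : std::regex_constants::format_first_only;
    return std::regex_replace(text, re, rewrite, flags);
}
```

With the pattern "b+" and the rewrite "d", the input "yabba dabba doo" becomes "yada dabba doo" when only the first match is replaced and "yada dada doo" when replaceAll is true.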

Referenced by Tanl::SST::SstFeatureExtractor::extract(), and Tanl::NER::NerFeatureExtractor::extract().

const unsigned char * Tanl::Text::RegExp::Pattern::setLocale ( char const *  locale  )  [static]

Set the locale for use during matching.

Use "en_US.iso885915" or similar for recognizing ISO 8859-15 (Latin-9) letters.

References CharTables.
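
The expression setlocale(LC_CTYPE, 0) used to initialise CharTables queries the current locale without changing it. A small sketch of querying and switching the C locale with the standard std::setlocale call follows; the helper names are hypothetical, and whether a given locale name such as "en_US.iso885915" is accepted depends on the locale files installed in the OS.

```cpp
#include <cassert>
#include <clocale>
#include <string>

// Hypothetical helper: name of the current LC_CTYPE locale.
// Passing a null pointer queries without modifying the locale.
std::string currentCtypeLocale()
{
    const char* name = std::setlocale(LC_CTYPE, nullptr);
    return name ? name : "";
}

// Hypothetical helper: try to switch LC_CTYPE, reporting success.
// std::setlocale returns a null pointer if the locale is not available.
bool trySetCtypeLocale(const char* locale)
{
    return std::setlocale(LC_CTYPE, locale) != nullptr;
}
```

Checking the return value before relying on locale-dependent character classes avoids silently matching with the wrong tables when the requested locale is missing.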

bool Tanl::Text::RegExp::Pattern::test ( char const *  str,
size_t  len = 0,
int  eflags = 0 
)

Tests whether the pattern matches the given string str, within the given length len.

Parameters:
str the string to match.
len the length of the string to match.
eflags any combination of EvaluateFlags
Returns:
true if the pattern matches.
bool Tanl::Text::RegExp::Pattern::test ( std::string const &  str,
int  eflags = 0 
) const

Tests whether the pattern matches the given string str.

Parameters:
str the string to match.
eflags any combination of EvaluateFlags
Returns:
true if the pattern matches.

Referenced by Tanl::NER::NerFeatureExtractor::extract(), and Parser::ParseState::transition().


Copyright © 2005-2011 G. Attardi. Generated on 4 Mar 2011 by doxygen 1.6.1.