Tanl Linguistic Pipeline
Regular Expression matching.
#include <RegExp.h>
Public Member Functions | |
Pattern (std::string const &expression, int cflags=0) | |
Pattern (char const *expression, int cflags=0) | |
Pattern (Pattern const &other) | |
Copy constructor. | |
Pattern & | operator= (Pattern const &other) |
Assignment. | |
bool | test (std::string const &str, int eflags=0) const |
Tests if the pattern matches the given string str. | |
bool | test (char const *str, size_t len=0, int eflags=0) |
Tests if the pattern matches the given string str, within the given length len. | |
int | matchSize (std::string const &text, int eflags=0) |
Computes the size of the match. | |
int | match (const char *start, const char *end, MatchGroups &pos, int eflags=0) |
Matches the text between start and end and returns the matching positions in pos, expressed as byte offsets from start. | |
int | match (std::string const &text, MatchGroups &pos, int eflags=0) |
Matches the text in text and returns the matching positions in pos, expressed as adjusted character offsets (not byte offsets into the UTF-8 stream). | |
std::vector< std::string > | match (std::string const &str, int eflags=0) |
std::string | replace (std::string &text, std::string &rewrite, bool replaceAll=false) |
Replaces the first substring matching the expression within text with the string rewrite. | |
Static Public Member Functions | |
static std::string | escape (std::string &str) |
Escapes all meta characters. | |
static const unsigned char * | setLocale (char const *locale) |
Set the locale for use during matching. | |
Static Public Attributes | |
static const unsigned char * | CharTables = Pattern::setLocale(setlocale(LC_CTYPE, 0)) |
The current character table used for matching. |
Regular Expression matching.
A pattern is compiled from a regular expression and used in matching. Regular expressions are written using the Perl 5 syntax.
A simple use for testing whether a string matches a pattern is:
Pattern p("a*b"); bool b = p.test("aaab");
In order to extract the portions of the string that match, MatchGroups can be used:
Pattern p("(a*)b"); MatchGroups m(2); string s("daaab"); int n = p.match(s, m);
n is the number of groups matched: group 0 represents the substring captured by the whole pattern.
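The group numbering described above can be illustrated without the Tanl headers. The following is a minimal sketch that uses the standard library's std::regex as a stand-in (ECMAScript syntax, not PCRE), so it does not show MatchGroups itself; the helper name groupSpans is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of Tanl): returns the (offset, length) of
// every group of the first match of `expression` in `text`, or an empty
// vector when nothing matches. Entry 0 is the span of the whole pattern.
std::vector<std::pair<int, int>> groupSpans(const std::string& expression,
                                            const std::string& text) {
    std::vector<std::pair<int, int>> spans;
    std::regex re(expression);
    std::smatch m;
    if (std::regex_search(text, m, re)) {
        // A successful search always reports at least group 0.
        assert(!m.empty());
        for (std::size_t i = 0; i < m.size(); ++i) {
            spans.emplace_back(static_cast<int>(m.position(i)),
                               static_cast<int>(m.length(i)));
        }
    }
    return spans;
}
```

For the pattern "(a*)b" and the string "daaab", this sketch reports two spans: group 0 covers "aaab" and group 1 covers "aaa", both starting at offset 1.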
Tanl::Text::RegExp::Pattern::Pattern | ( | std::string const & | expression, | |
int | cflags = 0 | |||
) |
expression | the regular expression | |
cflags | a combination of CompileFlags |
NOTE. The ISO 8859-15 (Latin-9) locale is used by default: ensure that the locale files for LC_CTYPE=en_US.iso885915 are installed in the OS. This can be changed using setLocale().
Tanl::Text::RegExp::Pattern::Pattern | ( | char const * | expression, | |
int | cflags = 0 | |||
) |
expression | the regular expression | |
cflags | a combination of CompileFlags |
NOTE. The ISO 8859-15 (Latin-9) locale is used by default: ensure that the locale files for LC_CTYPE=en_US.iso885915 are installed in the OS. This can be changed using setLocale().
References CharTables.
Tanl::Text::RegExp::Pattern::Pattern | ( | Pattern const & | other | ) | [inline] |
Copy constructor.
Uses pcre_refcount() to avoid freeing _pcre twice.
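The idea behind a reference-counted copy can be sketched as follows. This is not the Tanl implementation: Compiled and Handle are hypothetical stand-ins, and the counter is a plain integer rather than the count that pcre_refcount() keeps inside a compiled pattern.

```cpp
#include <cassert>

// Hypothetical stand-in for a compiled pattern that carries its own
// reference count.
struct Compiled {
    int refcount = 1;
};

// Hypothetical handle: copies share one Compiled object, and only the last
// owner frees it, so the object is never freed twice.
class Handle {
public:
    Handle() : code_(new Compiled) {}
    Handle(const Handle& other) : code_(other.code_) { ++code_->refcount; }
    Handle& operator=(const Handle& other) {
        if (code_ != other.code_) {
            release();            // drop our share of the old object
            code_ = other.code_;
            ++code_->refcount;    // take a share of the new one
        }
        return *this;
    }
    ~Handle() { release(); }
    int refs() const { return code_->refcount; }

private:
    void release() {
        assert(code_->refcount > 0);
        if (--code_->refcount == 0) delete code_;
    }
    Compiled* code_;
};
```

Copying a Handle only increments the shared counter, which keeps copies cheap and leaves the compiled object alive until its last owner is destroyed.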
std::vector<std::string> Tanl::Text::RegExp::Pattern::match | ( | std::string const & | str, | |
int | eflags = 0 | |||
) |
str | the text to match. | |
eflags | any combination of EvaluateFlags |
int Tanl::Text::RegExp::Pattern::match | ( | std::string const & | text, | |
MatchGroups & | pos, | |||
int | eflags = 0 | |||
) |
Matches the text in text and returns the matching positions in pos, expressed as adjusted character offsets (not byte offsets into the UTF-8 stream).
text | the string to match. | |
pos | the identified matching positions. | |
eflags | any combination of EvaluateFlags |
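The difference between byte offsets and character offsets in UTF-8 text can be illustrated with a small conversion routine. This is only a sketch of the general technique; the helper name byteToCharOffset is hypothetical, and how Tanl performs the adjustment internally is not stated here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: converts a byte offset into a UTF-8 string to a
// character (code point) offset by counting the bytes before it that are
// not continuation bytes. Continuation bytes have the bit pattern 10xxxxxx.
std::size_t byteToCharOffset(const std::string& utf8, std::size_t byteOffset) {
    assert(byteOffset <= utf8.size());
    std::size_t chars = 0;
    for (std::size_t i = 0; i < byteOffset; ++i) {
        unsigned char c = static_cast<unsigned char>(utf8[i]);
        if ((c & 0xC0) != 0x80) ++chars;  // lead byte or ASCII: a new character
    }
    return chars;
}
```

In the UTF-8 string "caf\xC3\xA9 b" the letter "b" sits at byte offset 6 but at character offset 5, because the accented letter occupies two bytes.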
int Tanl::Text::RegExp::Pattern::match | ( | const char * | start, | |
const char * | end, | |||
MatchGroups & | pos, | |||
int | eflags = 0 | |||
) |
Matches the text between start and end and returns the matching positions in pos, expressed as byte offsets from start.
start | start of the text to match. | |
end | end of the text to match. | |
pos | the identified matching positions. | |
eflags | any combination of EvaluateFlags |
References Tanl::Text::RegExp::MatchGroups::size().
Referenced by Tanl::SST::TokenCategorizer::analyze(), Tanl::NER::TokenCategorizer::analyze(), Tanl::TokenSentenceReader::MoveNext(), Tanl::ConllXSentenceReader::MoveNext(), and Tanl::SentenceReader::MoveNext().
int Tanl::Text::RegExp::Pattern::matchSize | ( | std::string const & | text, | |
int | eflags = 0 | |||
) |
Computes the size of the match.
text | the text to match. | |
eflags | any combination of EvaluateFlags. |
std::string Tanl::Text::RegExp::Pattern::replace | ( | std::string & | text, | |
std::string & | rewrite, | |||
bool | replaceAll = false | |||
) |
Replaces the first substring matching the expression within text with the string rewrite.
If replaceAll is true, all occurrences are replaced. Within rewrite, backslash-escaped digits (\1 to \9) can be used to insert the text matching the corresponding parenthesized group from the pattern. \0 in rewrite refers to the entire matching text. E.g.,
string s = "yabba dabba doo"; string r = "d"; RegExp::Pattern("b+").replace(s, r);
returns "yada dabba doo". The result is undefined if rewrite contains invalid group references.
Referenced by Tanl::SST::SstFeatureExtractor::extract(), and Tanl::NER::NerFeatureExtractor::extract().
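The first-match versus all-matches behaviour can be tried out with the standard library. The sketch below uses std::regex_replace as a stand-in for Pattern::replace; the helper name replaceSketch is hypothetical, and note that std::regex writes group references as $1 to $9 (and $& for the whole match) rather than as backslash-escaped digits.

```cpp
#include <regex>
#include <string>

// Hypothetical helper mirroring the replaceAll switch, built on std::regex
// (ECMAScript syntax) instead of Tanl::Text::RegExp::Pattern.
std::string replaceSketch(const std::string& text,
                          const std::string& expression,
                          const std::string& rewrite,
                          bool replaceAll = false) {
    std::regex re(expression);
    // format_first_only stops after the first match; format_default
    // replaces every non-overlapping match.
    auto flags = replaceAll ? std::regex_constants::format_default
                            : std::regex_constants::format_first_only;
    return std::regex_replace(text, re, rewrite, flags);
}
```

With the text "yabba dabba doo", the expression "b+" and the rewrite "d", the sketch yields "yada dabba doo" by default and "yada dada doo" when replaceAll is true.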
const unsigned char * Tanl::Text::RegExp::Pattern::setLocale | ( | char const * | locale | ) | [static] |
Set the locale for use during matching.
Use "en_US.iso885915" or similar for recognizing ISO 8859-15 (Latin-9) letters.
References CharTables.
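Whether a given locale is installed can be checked with the C library before handing its name to setLocale(). The following sketch relies only on std::setlocale; the helper names trySetCtypeLocale and currentCtypeLocale are hypothetical and not part of Tanl.

```cpp
#include <clocale>
#include <string>

// Hypothetical helper: tries to switch LC_CTYPE to `name` and reports whether
// the C library accepted it. std::setlocale returns a null pointer when the
// request cannot be honored, in which case the locale is left unchanged.
bool trySetCtypeLocale(const char* name) {
    return std::setlocale(LC_CTYPE, name) != nullptr;
}

// Returns the name of the current LC_CTYPE locale. Passing a null pointer
// queries the locale without changing it, as in setlocale(LC_CTYPE, 0).
std::string currentCtypeLocale() {
    const char* name = std::setlocale(LC_CTYPE, nullptr);
    return name ? name : "";
}
```

A caller might first check trySetCtypeLocale("en_US.iso885915") and fall back to the current locale when it returns false. Which locale names are accepted depends on the C library and on the locale files installed in the OS.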
bool Tanl::Text::RegExp::Pattern::test | ( | char const * | str, | |
size_t | len = 0, | |||
int | eflags = 0 | |||
) |
Tests if the pattern matches the given string str, within the given length len.
str | the string to match. | |
len | the length of the string to match. | |
eflags | any combination of EvaluateFlags |
bool Tanl::Text::RegExp::Pattern::test | ( | std::string const & | str, | |
int | eflags = 0 | |||
) | const |
Tests if the pattern matches the given string str.
str | the string to match. | |
eflags | any combination of EvaluateFlags |
Referenced by Tanl::NER::NerFeatureExtractor::extract(), and Parser::ParseState::transition().