Tokenization

  • Each token is a distinct word.

Sentence splitting

  • One root per sentence.
  • Titles are sentences.
  • Sentences may be broken at ";" and ":" if the resulting sentences are meaningful. Exceptions are:
    • lists of short items following a ":"
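
The splitting rule above can be sketched in code as a way of proposing candidate break points for an annotator to review. This is only an illustration, not part of the guidelines. The function names and both thresholds are hypothetical. Whether a resulting sentence is "meaningful" is a human judgment, so the sketch approximates it with a minimum word count. A "list of short items" is approximated as comma-separated items of at most three words each.

```python
import re

MAX_ITEM_WORDS = 3  # assumption: what counts as a "short" list item
MIN_PART_WORDS = 3  # assumption: crude stand-in for "meaningful"


def is_short_item_list(text):
    """Return True if text looks like a comma-separated list of short items."""
    items = [item.strip() for item in text.split(",") if item.strip()]
    return len(items) >= 2 and all(
        len(item.split()) <= MAX_ITEM_WORDS for item in items
    )


def candidate_splits(sentence):
    """Propose sentence breaks at ';' and ':' followed by whitespace.

    The separator stays attached to the left-hand part so no token is lost.
    """
    parts = []
    start = 0
    for match in re.finditer(r"[;:](?=\s)", sentence):
        left = sentence[start:match.end()].strip()
        right = sentence[match.end():].strip()
        # Exception: do not break before a list of short items after ":".
        if match.group() == ":" and is_short_item_list(right):
            continue
        # Heuristic only: skip breaks that would leave a very short fragment.
        if len(left.split()) < MIN_PART_WORDS or len(right.split()) < MIN_PART_WORDS:
            continue
        parts.append(left)
        start = match.end()
    tail = sentence[start:].strip()
    if tail:
        parts.append(tail)
    return parts


if __name__ == "__main__":
    print(candidate_splits("The parser failed on the input; the tagger handled it well."))
    print(candidate_splits("We need three things: eggs, milk, and bread."))
```

The first call proposes a break at the ";", while the second leaves the sentence whole because the ":" introduces a list of short items. The output is a suggestion only; the annotator still decides whether each part stands as a sentence with its own root.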