Syntax

It is beyond the scope of this guide to give a detailed explanation of
regular expressions to beginners.  The OROMatcher™ package
is geared toward programmers who are already familiar with regular
expressions, having used them with other languages, and who now want
to apply them in their Java programs.  However, we shall make a small
attempt to cover the basics and summarize the Perl5 syntax supported
by the OROMatcher™ Perl5 classes.  For a detailed exploration of
regular expressions for both beginners and advanced users, we recommend
the book Mastering Regular Expressions by Jeffrey Friedl, published
by O'Reilly & Associates.

What is a regular expression?

Part of this discussion is based on page 94 of
"Compilers: Principles, Techniques, and Tools" by Aho, Sethi, and Ullman.

A regular expression is a pattern denoted by a sequence of symbols
representing a state machine or mini-program that is capable of matching
particular sequences of characters.  Regular expressions have their
roots in lexical analysis and tokenization, where a set of lexemes had
to be recognized before being passed on to a parser.  Since then,
regular expressions have taken on a life of their own, appearing in such
languages as AWK, Tcl, and of course Perl, for all sorts of textual data
extraction and manipulation purposes.
 
The most basic regular expression syntax consists of 4 operations.  Let
A and B each represent an alphabet (a set of characters) and s and t
represent members of those alphabets.
 
 
Operation                  Representation   Meaning
Union of A and B           A|B              s is such that s is in A or s is in B
Concatenation of A and B   AB               st is such that s is in A and t is in B
Kleene closure of A        A*               Zero or more concatenations of A
Positive closure of A      A+               One or more concatenations of A

Using this notation you can define a regular expression for positive
integers as follows:

      digit+

Here digit represents the set of characters 0-9.  A range of
characters like this can be represented in most regular expression
languages as [0-9].  Because this is such a common expression, some
languages have a special character for it: \d.

Learning a new regular expression language is quite simple once you have
learned one, because most of the operations are the same.  Only the
notation changes.
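
As a rough illustration of the notation above, the following sketch tests
strings against digit+ written both as [0-9]+ and as \d+.  It uses the JDK's
java.util.regex package as a stand-in rather than the OROMatcher classes (an
assumption made so the example runs without extra libraries), and the class
name PositiveIntegerDemo is hypothetical.

```java
import java.util.regex.Pattern;

public class PositiveIntegerDemo {
    // digit+ written with an explicit character range.
    static final Pattern RANGE = Pattern.compile("[0-9]+");
    // The same expression written with the \d shorthand.
    static final Pattern SHORTHAND = Pattern.compile("\\d+");

    // True when the whole string consists of one or more digits.
    static boolean isDigits(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Stand-in API: java.util.regex, not the OROMatcher classes.
        for (String s : new String[] {"42", "007", "", "4a2"}) {
            System.out.println("\"" + s + "\": range=" + isDigits(RANGE, s)
                    + " shorthand=" + isDigits(SHORTHAND, s));
        }
    }
}
```

Both forms should accept and reject the same strings here; only the notation
differs.
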

Perl5 regular expressions

Here we summarize the syntax of Perl5 regular expressions, all of which
is supported by the OROMatcher™ Perl5 classes.  However, for
a definitive reference, you should consult the perlre man page
that accompanies the Perl5 distribution and also the book
Programming Perl, 2nd Edition from O'Reilly & Associates.

For efficiency reasons, the character set operator [...] works only on
8-bit characters (Unicode characters 0 through 255).  Other than that
restriction, all Unicode characters should be usable in the package's
regular expressions.
 
 Alternatives separated by |
 Quantified atoms
 
       {n,m}  Match at least n but not more than m times.
       {n,}   Match at least n times.
       {n}    Match exactly n times.  
       *      Match 0 or more times.
       +      Match 1 or more times.
       ?      Match 0 or 1 times.
  Atoms
 
      regular expression within parentheses
      a . matches everything except \n
      a ^ is a null token matching the beginning of a string or line
          (i.e., the position right after a newline or right before
           the beginning of a string)
      a $ is a null token matching the end of a string or line
          (i.e., the position right before a newline or right after
           the end of a string)
      Character classes (e.g., [abcd]) and ranges (e.g. [a-z])
     
          Special backslashed characters work within a character
              class (except for backreferences and boundaries).  
          \b is backspace inside a character class
      Special backslashed characters
     
          \b  null token matching a word boundary (\w on one side
                      and \W on the other)
          \B  null token matching a boundary that isn't a
                      word boundary
          \A  Match only at beginning of string
          \Z  Match only at end of string (or before newline
                      at the end)
          \n  newline
          \r  carriage return
          \t  tab
          \f  formfeed
          \d  digit [0-9]
          \D  non-digit [^0-9]
          \w  word character [0-9a-z_A-Z]
          \W  a non-word character [^0-9a-z_A-Z]
          \s  a whitespace character [ \t\n\r\f]
          \S  a non-whitespace character [^ \t\n\r\f]
          \xnn  hexadecimal representation of character
          \cD  matches the corresponding control character
          \nn or \nnn  octal representation of character
                               unless a backreference.
          \1, \2, \3, etc.  match whatever the first, second,
          third, etc. parenthesized group matched.  This is called a
          backreference.  If there is no corresponding group, the
          number is interpreted as an octal representation of a character.
          \0  matches null character
          Any other backslashed character matches itself
      Expressions within parentheses are matched as subpattern groups
      and saved for use by certain methods.
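
To make a few of the items above concrete, here is a minimal sketch that
combines the {n} quantifier, the ^ and $ anchors, the \b word boundary, a
saved group, and the \1 backreference.  It uses the JDK's java.util.regex
package as a stand-in for the OROMatcher classes (an assumption, chosen so
the example is self-contained); the class name AtomsDemo and its helper
methods are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AtomsDemo {
    // ^ and $ anchor the match; {3} and {4} demand exact repeat counts.
    static final Pattern PHONE = Pattern.compile("^\\d{3}-\\d{4}$");

    // \b marks word boundaries, (\w+) saves a word as group 1, and \1
    // requires the same text again: a doubled word such as "the the".
    static final Pattern DOUBLED = Pattern.compile("\\b(\\w+)\\s+\\1\\b");

    // True when s looks like ddd-dddd and nothing else.
    static boolean isPhone(String s) {
        return PHONE.matcher(s).find();
    }

    // Returns the repeated word, or null when no word is doubled.
    static String doubledWord(String s) {
        Matcher m = DOUBLED.matcher(s);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Stand-in API: java.util.regex, not the OROMatcher classes.
        System.out.println(isPhone("555-1234"));
        System.out.println(isPhone("5555-1234"));
        System.out.println(doubledWord("Paris in the the spring"));
        System.out.println(doubledWord("this is fine"));
    }
}
```

The leading \b matters in the doubled-word pattern: without it, "this is"
would match because "is" appears at the end of "this".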
  
By default, a quantified subpattern is greedy.  In other words, it
matches as many times as possible while still allowing the rest of the
pattern to match.  To make a quantifier match the minimum number of
times possible, while still allowing the rest of the pattern to match,
place a "?" right after the quantifier.
 
 *?      Match 0 or more times
 +?      Match 1 or more times
 ??      Match 0 or 1 time
 {n}?    Match exactly n times
 {n,}?   Match at least n times
 {n,m}?  Match at least n but not more than m times
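
The difference between greedy and minimal matching is easiest to see side by
side.  The sketch below applies <.+> and <.+?> to the same input.  It relies
on the JDK's java.util.regex package as a stand-in for the OROMatcher classes
(an assumption), and GreedyDemo and firstMatch are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyDemo {
    // Returns the first substring matched by regex, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Stand-in API: java.util.regex, not the OROMatcher classes.
        String input = "<b>bold</b>";
        // Greedy: .+ runs to the last '>' in the input.
        System.out.println(firstMatch("<.+>", input));   // <b>bold</b>
        // Minimal: .+? stops at the first '>' that lets the match succeed.
        System.out.println(firstMatch("<.+?>", input));  // <b>
    }
}
```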
 
Perl5 extended regular expressions are fully supported.
 
 (?#text)    An embedded comment causing text to be ignored.
 (?:regexp)  Groups things like "()" but doesn't cause the
             group match to be saved.
 (?=regexp)  A zero-width positive lookahead assertion.  For
             example, \w+(?=\s) matches a word followed by
             whitespace, without including whitespace in the
             MatchResult.
 (?!regexp)  A zero-width negative lookahead assertion.  For
             example, foo(?!bar) matches any occurrence of
             "foo" that isn't followed by "bar".  Remember
             that this is a zero-width assertion, which means
             that a(?!b)d will match ad because a is followed
             by a character that is not b (the d) and a d
             follows the zero-width assertion.
 (?imsx)     One or more embedded pattern-match modifiers.
             i enables case insensitivity, m enables multiline
             treatment of the input, s enables single line treatment
             of the input, and x enables extended mode, in which
             whitespace in the pattern is ignored and comments
             are allowed.
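
The lookahead examples above can be tried out with a short sketch.  It uses
the JDK's java.util.regex package as a stand-in for the OROMatcher classes
(an assumption), so the matched text is read from a java.util.regex Matcher
rather than a MatchResult from this package.  The (?#text) form is left out
because java.util.regex does not accept it.  ExtendedSyntaxDemo and its
helper methods are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtendedSyntaxDemo {
    // Returns the first substring matched by regex, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    // Returns the start offset of the first match, or -1 if none.
    static int firstStart(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() : -1;
    }

    // Number of saved groups in regex; (?:...) groups are not counted.
    static int savedGroups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        // Stand-in API: java.util.regex, not the OROMatcher classes.
        // Positive lookahead: the whitespace is required but not consumed.
        System.out.println(firstMatch("\\w+(?=\\s)", "hello world"));
        // Negative lookahead: skips "foobar", finds the "foo" in "foobaz".
        System.out.println(firstStart("foo(?!bar)", "foobar foobaz"));
        // Zero-width: the d is still there to be matched after (?!b).
        System.out.println(firstMatch("a(?!b)d", "ad"));
        // (?:...) groups without saving; (...) groups and saves.
        System.out.println(savedGroups("(?:ab)+(c)") + " vs " + savedGroups("(ab)+(c)"));
        // Embedded modifier: (?i) turns on case insensitivity.
        System.out.println(firstMatch("(?i)perl", "I like PERL"));
    }
}
```
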
Copyright © 1997 ORO, Inc.  All rights reserved.
Original Reusable Objects, ORO, the ORO logo,
and "Component software for the Internet" are trademarks or registered
trademarks of ORO, Inc. in the United States and other countries.
 Java is a trademark of Sun Microsystems.  All other trademarks are the
property of their respective holders.