The Classes


The current set of OROMatcher™ classes implements Perl5 regular expressions, but future releases will include classes for other regular expression grammars that users request. As a side note, you do not need to include the Util or Perl5Debug classes with software you write with OROMatcher™ if your code does not use them. Omitting them can reduce the size of your software distribution by a few kilobytes.

Perl5Pattern

Perl5Pattern implements the Pattern interface for Perl5 regular expressions. It is made visible to the programmer only for type safety when calling the Perl5Matcher contains(Perl5StreamInput, Perl5Pattern) method, and for programmer accessibility once the class is made serializable in a future release incorporating JDK 1.1 features. Currently we want the package to be usable with both the 1.0.2 and 1.1.* JDKs. We will, however, release a 1.1-enhanced version of the package that takes advantage of the 1.1 features our users want, such as serializability. At that point we will distribute both 1.0.2 and 1.1 versions of the classes.

Perl5Compiler

The Perl5Compiler class creates compiled regular expressions conforming to the Perl5 regular expression syntax. It generates Perl5Pattern instances upon compilation to be used in conjunction with a Perl5Matcher instance. Please refer to the Syntax section for more information on Perl5 regular expressions.

The Perl5Compiler compile() methods accept the following flags, which can be combined with bitwise OR to affect the nature of the compiled pattern:

DEFAULT_MASK
The default mask for the compile methods. It is equal to 0. By default a regular expression is case sensitive and does not specify whether it is multiline or singleline. When neither MULTILINE_MASK nor SINGLELINE_MASK is specified, the ^, $, and . metacharacters are interpreted according to the value of isMultiline() in Perl5Matcher. The default behavior of Perl5Matcher is to treat the Perl5Pattern as though MULTILINE_MASK were enabled. If isMultiline() returns false, the pattern is treated as though SINGLELINE_MASK were set. However, compiling a pattern with either MULTILINE_MASK or SINGLELINE_MASK will ALWAYS override whatever behavior is specified by setMultiline() in Perl5Matcher.
CASE_INSENSITIVE_MASK
A mask passed as an option to the compile methods to indicate a compiled regular expression should be case insensitive.
MULTILINE_MASK
A mask passed as an option to the compile methods to indicate a compiled regular expression should treat input as having multiple lines. This option affects the interpretation of the ^ and $ metacharacters. When this mask is used, the ^ metacharacter matches at the beginning of every line, and the $ metacharacter matches at the end of every line. Additionally, the . metacharacter will not match newlines when an expression is compiled with MULTILINE_MASK, which is its default behavior. SINGLELINE_MASK and MULTILINE_MASK should not be used together.
SINGLELINE_MASK
A mask passed as an option to the compile methods to indicate a compiled regular expression should treat input as being a single line. This option affects the interpretation of the ^ and $ metacharacters. When this mask is used, the ^ metacharacter matches only at the beginning of the input, and the $ metacharacter matches only at the end of the input. The ^ and $ metacharacters will not match at the beginning and end of lines occurring between the beginning and end of the input. Additionally, the . metacharacter will match newlines when an expression is compiled with SINGLELINE_MASK, unlike its default behavior. SINGLELINE_MASK and MULTILINE_MASK should not be used together.
EXTENDED_MASK
A mask passed as an option to the compile methods to indicate a compiled regular expression should be treated as a Perl5 extended pattern (i.e., a pattern using the /x modifier). This option tells the compiler to ignore whitespace that is not backslashed or within a character class. It also tells the compiler to treat the # character as a metacharacter introducing a comment as in Perl. In other words, the # character will comment out any text in the regular expression between it and the next newline. The intent of this option is to allow you to divide your patterns into more readable parts. It is provided to maintain compatibility with Perl5 regular expressions, although it will not often make sense to use it in Java.
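
The effect of these flags can be tried out on a modern JDK. The sketch below is only a stand-in: it uses java.util.regex (which ships with later JDKs than the ones this guide targets), not OROMatcher, and the class name FlagSketch is hypothetical. It assumes that Pattern.MULTILINE is a fair analogue of MULTILINE_MASK, Pattern.DOTALL without Pattern.MULTILINE of SINGLELINE_MASK, Pattern.CASE_INSENSITIVE of CASE_INSENSITIVE_MASK, and Pattern.COMMENTS of EXTENDED_MASK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FlagSketch {
    // Counts the non-overlapping matches of regex in input under the given flags.
    public static int count(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String input = "one\nTwo\nthree";

        // Multiline analogue: ^ matches at the beginning of every line.
        System.out.println(count("^\\w+", Pattern.MULTILINE, input));   // 3

        // Singleline analogue: ^ matches only at the beginning of the input,
        // and . also matches the newline between "one" and "Two".
        System.out.println(count("^\\w+", Pattern.DOTALL, input));      // 1
        System.out.println(count("one.Two", Pattern.DOTALL, input));    // 1
        System.out.println(count("one.Two", 0, input));                 // 0

        // Flags are combined with bitwise OR.
        int flags = Pattern.MULTILINE | Pattern.CASE_INSENSITIVE;
        System.out.println(count("^t\\w+", flags, input));              // 2

        // Extended analogue: unescaped whitespace and # comments are ignored.
        String commented = "(\\w+)   # a word\n"
                         + "\\n      # followed by a newline";
        System.out.println(count(commented, Pattern.COMMENTS, input));  // 2
    }
}
```

Note that java.util.regex separates the treatment of ^ and $ from the treatment of the . metacharacter, whereas the masks described above set both at once, so the correspondence is approximate.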

Perl5Matcher

The Perl5Matcher class functions according to the PatternMatcher interface when used with Perl5Pattern instances. Perl5Matcher contains three methods that do not appear in the PatternMatcher interface:
setMultiline(boolean)
Sets whether or not subsequent calls to matches() or contains() should treat the input as consisting of multiple lines. The default behavior is for input to be treated as consisting of multiple lines. This method should only be called if the Perl5Pattern used for a match was compiled without either of the Perl5Compiler.MULTILINE_MASK or Perl5Compiler.SINGLELINE_MASK flags, and you want to alter the behavior of how the ^ and $ metacharacters are interpreted on the fly. The compilation options used when compiling a pattern ALWAYS override the behavior specified by setMultiline().
isMultiline()
Returns the last value set by setMultiline(). The default value is true.
contains(Perl5StreamInput, Perl5Pattern)
Determines if the contents of a Perl5StreamInput instance, starting from the current offset of the input, contain a pattern. If a pattern match is found, a MatchResult instance representing the first such match is made accessible via getMatch(). The current offset of the input stream is advanced to the end offset of the match. Consequently, a subsequent call to this method will continue searching where the last call left off. See Perl5StreamInput for more details.
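
The precedence between the compilation masks and setMultiline() can be restated as a small decision function. The following is a sketch of the rule as described above, not OROMatcher code: the class ModeSketch and the method treatAsMultiline are hypothetical, and only DEFAULT_MASK = 0 is taken from this guide. The other two mask values are placeholders assumed for illustration. Because the masks should not be used together, the sketch simply rejects that combination.

```java
public class ModeSketch {
    // DEFAULT_MASK is documented as 0. The two values below are assumed
    // placeholders, chosen only so that each flag occupies its own bit.
    public static final int DEFAULT_MASK    = 0;
    public static final int MULTILINE_MASK  = 1 << 0; // assumed value
    public static final int SINGLELINE_MASK = 1 << 1; // assumed value

    // Returns true if input should be treated as consisting of multiple lines.
    // compileFlags is the option mask the pattern was compiled with;
    // matcherMultiline is the value last passed to setMultiline().
    public static boolean treatAsMultiline(int compileFlags, boolean matcherMultiline) {
        boolean multi  = (compileFlags & MULTILINE_MASK) != 0;
        boolean single = (compileFlags & SINGLELINE_MASK) != 0;
        if (multi && single) {
            // A choice made by this sketch; the guide only says not to do this.
            throw new IllegalArgumentException("masks should not be used together");
        }
        if (multi) {
            return true;           // a compilation mask always overrides the matcher
        }
        if (single) {
            return false;
        }
        return matcherMultiline;   // no mask given: the matcher setting decides
    }

    public static void main(String[] args) {
        System.out.println(treatAsMultiline(DEFAULT_MASK, true));     // true
        System.out.println(treatAsMultiline(DEFAULT_MASK, false));    // false
        System.out.println(treatAsMultiline(SINGLELINE_MASK, true));  // false
        System.out.println(treatAsMultiline(MULTILINE_MASK, false));  // true
    }
}
```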

PatternMatcherInput

The PatternMatcherInput class is used to preserve state across calls to the contains() methods of PatternMatcher instances. It is also used to specify that only a subregion of a string should be used as input when looking for a pattern match. Preserving state means only that the end offset of the last match is remembered, so that the next match is attempted from the point where the last match left off. This offset can be read with the getCurrentOffset() method and set with the setCurrentOffset(int) method.

You would use a PatternMatcherInput object when you want to search for more than just the first occurrence of a pattern in a string, or when you only want to search a subregion of the string for a match. An example of its most common use is:

      PatternMatcher matcher;
      PatternCompiler compiler;
      Pattern pattern;
      PatternMatcherInput input;
      MatchResult result;

      compiler = new Perl5Compiler();
      matcher  = new Perl5Matcher();

      try {
        pattern = compiler.compile(somePatternString);
      } catch(MalformedPatternException e) {
        System.out.println("Bad pattern.");
        System.out.println(e.getMessage());
        return;
      }

      input   = new PatternMatcherInput(someStringInput);

      while(matcher.contains(input, pattern)) {
        result = matcher.getMatch();  
        // Perform whatever processing on the result you want.
      }

      // Suppose we want to start searching from the beginning again with
      // a different pattern.
      // Just set the current offset to the begin offset.
      input.setCurrentOffset(input.getBeginOffset());

      // Second search omitted

      // Suppose we're done with this input, but want to search another string.
      // There's no need to create another PatternMatcherInput instance.
      // We can just use the setInput() method.
      input.setInput(aNewInputString);
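
If you do not have OROMatcher at hand, the same search, resume, and reset cycle can be tried with java.util.regex on a modern JDK. This is an analogue, not the OROMatcher API, and the class name ResumeSketch is hypothetical. The sketch assumes that Matcher.find() plays roughly the role of contains(input, pattern), Matcher.reset() the role of resetting the current offset to the begin offset, Matcher.region(int, int) the role of searching a subregion, and Matcher.reset(CharSequence) the role of setInput().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ResumeSketch {
    // Collects every remaining match as "text@startOffset". Each call to
    // find() resumes where the previous match ended.
    public static List<String> allMatches(Matcher m) {
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group() + "@" + m.start());
        }
        return found;
    }

    public static void main(String[] args) {
        Matcher m = Pattern.compile("\\d+").matcher("a1 b22 c333");
        System.out.println(allMatches(m));   // [1@1, 22@4, 333@8]

        // Start searching from the beginning again.
        m.reset();
        System.out.println(m.find() ? m.group() : "none");   // 1

        // Search only the subregion "b22" (offsets 3 up to, not including, 6).
        m.region(3, 6);
        System.out.println(allMatches(m));   // [22@4]

        // Reuse the same matcher for a different input string.
        m.reset("x9");
        System.out.println(allMatches(m));   // [9@1]
    }
}
```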

Perl5StreamInput

The Perl5StreamInput class is used to look for pattern matches in an InputStream in conjunction with the Perl5Matcher class. It is called Perl5StreamInput instead of Perl5InputStream to stress that it is a form of streamed input for the Perl5Matcher rather than a subclass of InputStream. Perl5StreamInput performs special internal buffering to accelerate pattern searches through a stream. You can determine the size of this buffer and how it grows by using the appropriate constructor. You should avoid using buffer increments smaller than 4096 bytes, as they will adversely affect performance.

If you want to perform line by line matches on an InputStream, you should use the DataInputStream or BufferedReader class (depending on whether you are using JDK 1.0.2 or 1.1) in conjunction with one of the PatternMatcher methods taking a String, char[], or PatternMatcherInput as an argument. The DataInputStream and BufferedReader readLine() methods are implemented as native methods and are therefore more efficient than supporting line by line searching within Perl5StreamInput.
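
A minimal sketch of line by line matching with BufferedReader follows. It is written for a modern JDK and, as an assumption made only so that it runs without OROMatcher installed, uses java.util.regex for the matching step. The class name LineMatchSketch and the method matchingLines are hypothetical. With OROMatcher, the test inside the loop would instead be a call to one of the PatternMatcher contains() methods on the line.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class LineMatchSketch {
    // Reads the stream one line at a time and returns the lines in which
    // the pattern occurs. The lines are collected only to keep the sketch
    // easy to check; they could equally be processed as they are read.
    public static List<String> matchingLines(InputStream in, Pattern pattern) {
        List<String> hits = new ArrayList<>();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (pattern.matcher(line).find()) {
                    hits.add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hits;
    }

    public static void main(String[] args) {
        byte[] data = "alpha 1\nbeta\ngamma 22\n".getBytes(StandardCharsets.UTF_8);
        InputStream in = new ByteArrayInputStream(data);
        System.out.println(matchingLines(in, Pattern.compile("\\d+")));
        // prints [alpha 1, gamma 22]
    }
}
```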

In the future the programmer will be able to set this class to save all the input it sees so that it can be accessed later. This will avoid having to read a stream more than once for whatever reason.

For an example of how to use the Perl5StreamInput class, look at streamInputExample.java.

Util

The Util class is a holder for useful static utility methods that can be generically applied to Pattern and PatternMatcher instances. This class cannot be, and is not meant to be, instantiated. The Util class currently contains versions of the split() and substitute() methods inspired by Perl's split function and s operation respectively. They are implemented so as not to rely on the Perl5 implementations of the OROMatcher package's regular expression interfaces, and may operate on any implementations conforming to the OROMatcher API specification for the PatternMatcher, Pattern, and MatchResult interfaces. Future versions of the class may include additional utility methods.

A grep method is not included for two reasons:

  1. The details of reading a line at a time from an input stream differ in JDK 1.0.2 and JDK 1.1, making it difficult to retain compatibility across both Java releases.
  2. Grep style processing is trivial for the programmer to implement in a while loop. Rarely does anyone want to retrieve all occurrences of a pattern and then process them. More often a programmer will retrieve pattern matches and process them as they are retrieved, which is more efficient than storing them all in a Vector and then accessing them.

For an example of how to use the split and substitute methods, look at splitExample.java and substituteExample.java.
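
To get a feel for Perl-style split and substitute operations without the example files, the sketch below uses java.util.regex on a modern JDK as a stand-in. It does not call the Util class, whose method signatures are not reproduced in this section. The class SplitSubstituteSketch and its two helper methods are hypothetical names introduced for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class SplitSubstituteSketch {
    // Splits input wherever regex matches, in the spirit of Perl's split.
    public static List<String> splitOn(String regex, String input) {
        return Arrays.asList(Pattern.compile(regex).split(input));
    }

    // Replaces every match of regex, in the spirit of Perl's s///g.
    // $1, $2, ... in the replacement refer to captured groups.
    public static String substituteAll(String regex, String replacement, String input) {
        return Pattern.compile(regex).matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        System.out.println(splitOn("\\s*,\\s*", "a , b,c ,d"));
        // prints [a, b, c, d]

        System.out.println(substituteAll("\\s+", " ", "too   many    spaces"));
        // prints too many spaces

        System.out.println(substituteAll("(\\d+)-(\\d+)", "$2-$1", "10-20 30-40"));
        // prints 20-10 40-30
    }
}
```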

Perl5Debug

The Perl5Debug class is not intended for general use and should not be instantiated, but is provided because some users may find the output of its single method to be useful. The Perl5Compiler class generates a representation of a regular expression identical to that of Perl5 in the abstract, but not in terms of actual data structures. The Perl5Debug class allows the bytecode program contained by a Perl5Pattern to be printed out for comparison with the program generated by Perl5 with the -r option. The Perl5Debug class is provided primarily for Perl programmers used to using the Perl -r option.
Copyright © 1997 ORO, Inc. All rights reserved. Original Reusable Objects, ORO, the ORO logo, and "Component software for the Internet" are trademarks or registered trademarks of ORO, Inc. in the United States and other countries.
Java is a trademark of Sun Microsystems. All other trademarks are the property of their respective holders.