Changes:
- Updated and clarified the documentation regarding adding RegexKitLite to an Xcode project.
- Created Xcode 3 DocSet documentation.
This document introduces RegexKitLite for Mac OS X. RegexKitLite enables easy access to regular expressions by providing a number of additions to the standard Foundation NSString class. RegexKitLite acts as a bridge between the NSString class and the regular expression engine in the International Components for Unicode, or ICU, dynamic shared library that is shipped with Mac OS X.
While RegexKitLite is not a descendant of the RegexKit.framework source code, it does provide a small subset of RegexKit's NSString methods for performing various regular expression tasks. These include determining the range that a regular expression matches within a string, and easily creating a new string from a match.
RegexKitLite uses the regular expression engine provided by the ICU library that ships with Mac OS X. The two files, RegexKitLite.h and RegexKitLite.m, and linking against the /usr/lib/libicucore.dylib ICU shared library are all that is required. Adding RegexKitLite to your project only adds a few kilobytes of overhead to your application's size and typically only requires a few kilobytes of memory at runtime. Since a regular expression must first be compiled by the ICU library before it can be used, RegexKitLite keeps a small pseudo Least Recently Used cache of the compiled regular expressions.
The NSString that contains the regular expression must be compiled into an ICU URegularExpression. This can be an expensive, time-consuming step, but the compiled regular expression can be reused in another search, even if the strings to be searched are different. Therefore RegexKitLite keeps a small cache of recently compiled regular expressions.
This cache is a simple hash table, the size of which can be tuned with the pre-processor define RKL_CACHE_SIZE. The default cache size, which should always be a prime number, is set to 23. The NSString regexString is mapped to a cache slot using modular arithmetic: Cache slot ≡ [regexString hash] mod RKL_CACHE_SIZE, i.e. cacheSlot = [regexString hash] % 23;. Since RegexKitLite uses Core Foundation, this is actually coded as cacheSlot = CFHash(regexString) % RKL_CACHE_SIZE;.
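The slot calculation can be sketched in plain C. CFHash() is only available through Core Foundation, so `demo_hash` below is a stand-in (32-bit FNV-1a) introduced purely for this illustration; it is not the hash RegexKitLite uses. Only the final modulo step mirrors the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RKL_CACHE_SIZE 23 /* default from the text; should be a prime number */

/* Hypothetical stand-in for CFHash(): 32-bit FNV-1a over a C string. */
static uint32_t demo_hash(const char *s) {
    uint32_t h = 2166136261u;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h *= 16777619u;
    }
    return h;
}

/* Map a regular expression string to a slot in [0, RKL_CACHE_SIZE). */
static size_t cache_slot(const char *regex_string) {
    return (size_t)(demo_hash(regex_string) % RKL_CACHE_SIZE);
}
```

Two different regular expressions can land in the same slot, which is why the slot index alone is never treated as proof of a cache hit.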
If the cache slot currently contains a compiled URegularExpression, checks are made to ensure that the current regexString is identical to the regular expression used to create the compiled URegularExpression. If they are a match, the cached compiled regular expression is used. If they are not a match, the current compiled regular expression for the selected cache slot is ejected and all of its resources are freed. Then the regexString that caused the ejection is compiled and fills the cache slot. Only one compiled regular expression can reside in a cache slot at a time.
When a regular expression is compiled, an immutable copy of the string is kept. For immutable NSString objects, the copy is usually the same object with its reference count increased by one. Only NSMutableString objects will cause a new, immutable NSString to be created.
If the regular expression being used is stored in a NSMutableString, the cached regular expression will continue to be used as long as the NSMutableString remains unchanged. Once mutated, the changed NSMutableString will no longer be a match for the cached compiled regular expression that was being used by it previously. Even if the newly mutated string's hash is congruent to the previous unmutated string's hash modulo RKL_CACHE_SIZE, that is to say they share the same cache slot (i.e., ([mutatedString hash] % RKL_CACHE_SIZE) == ([unmutatedString hash] % RKL_CACHE_SIZE)), the immutable copy of the regular expression string used to create the compiled regular expression is used to ensure true equality. The newly mutated string will have to go through the whole cache slot entry creation process and be compiled into a URegularExpression.
This means that NSMutableString objects can be safely used as regular expressions, and any mutations to those objects will immediately be detected and reflected in the regular expression used for matching.
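As a rough illustration of the lookup, comparison, and ejection steps described above, here is a minimal C sketch of a one-entry-per-slot cache. Everything in it is an assumption made for the example: `demo_hash` stands in for CFHash(), an integer id stands in for the compiled URegularExpression, and `lookup_or_compile` is a hypothetical name, not part of RegexKitLite.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define RKL_CACHE_SIZE 23

/* One slot: a private copy of the regex text plus a placeholder for the
 * compiled form (a real implementation would hold a URegularExpression *). */
typedef struct {
    char *regex_copy;  /* NULL while the slot is empty */
    int   compiled_id;
} CacheSlot;

static CacheSlot cache[RKL_CACHE_SIZE];
static int compile_count = 0; /* number of simulated compilations */

/* Hypothetical stand-in for CFHash() (djb2). */
static unsigned long demo_hash(const char *s) {
    unsigned long h = 5381;
    for (; *s != '\0'; s++) h = h * 33u + (unsigned char)*s;
    return h;
}

/* Return the compiled id for regex, ejecting the slot's occupant on a miss.
 * Returns -1 if memory for the copy cannot be allocated. */
static int lookup_or_compile(const char *regex) {
    CacheSlot *slot;
    assert(regex != NULL);
    slot = &cache[demo_hash(regex) % RKL_CACHE_SIZE];

    /* Hit only if the stored copy is truly equal, not merely in the same slot. */
    if (slot->regex_copy != NULL && strcmp(slot->regex_copy, regex) == 0)
        return slot->compiled_id;

    /* Miss: free the previous occupant, then "compile" and store the new one. */
    free(slot->regex_copy);
    slot->regex_copy = malloc(strlen(regex) + 1);
    if (slot->regex_copy == NULL) return -1;
    strcpy(slot->regex_copy, regex);
    slot->compiled_id = ++compile_count;
    return slot->compiled_id;
}
```

Because the slot keeps its own copy of the text, mutating the caller's buffer after a lookup simply produces a miss, and a fresh compilation, on the next lookup.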
Unfortunately, the ICU regular expression API requires that the compiled regular expression be "set" to the string to be searched. To search a different string, the compiled regular expression must be "set" to the new string. Therefore, RegexKitLite tracks the last NSString that each compiled regular expression was set to, recording the pointer to the NSString object, its hash, and its length. If any of these parameters are different from the last parameters used for a compiled regular expression, the compiled regular expression is "set" to the new string. Since mutating a string will likely change its hash value, it's generally safe to search NSMutableString objects, and in most cases the mutation will reset the compiled regular expression to the updated contents of the NSMutableString.
When performing a match, the arguments used to perform the match are kept. If those same arguments are used again, the actual matching operation is skipped because the compiled regular expression already contains the results for the given arguments. This is mostly useful when a regular expression contains multiple capture groups, and the results for different capture groups for the same match are needed. This means that there is only a small penalty for iterating over all the capture groups in a regular expression for a match: retrieving each additional capture group essentially becomes the equivalent of calling the ICU regular expression API functions uregex_start() and uregex_end() directly.
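The idea of skipping a repeated match can be sketched as follows. This is only an illustration under assumed names: `MatchState`, `ensure_matched`, and the `engine_runs` counter are hypothetical, and the counter stands in for actually running the regular expression engine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct { size_t location, length; } Range;

typedef struct {
    bool     valid;       /* false until the first match has been run */
    Range    searched;    /* range argument of the last match */
    unsigned options;     /* options argument of the last match */
    int      engine_runs; /* how many times the simulated engine ran */
} MatchState;

/* Run the (simulated) engine only if the arguments differ from last time. */
static void ensure_matched(MatchState *state, Range searched, unsigned options) {
    assert(state != NULL);
    if (state->valid &&
        state->searched.location == searched.location &&
        state->searched.length == searched.length &&
        state->options == options) {
        return; /* results from the previous run are still valid */
    }
    state->engine_runs++; /* stand-in for performing the match */
    state->searched = searched;
    state->options = options;
    state->valid = true;
}
```

Asking for capture 0, 1, and 2 of the same match would call `ensure_matched` three times with identical arguments, yet the engine would run only once.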
RegexKitLite is ideal when the string being matched is a non-ASCII, Unicode string. This is because the regular expression engine used, ICU, can only operate on UTF-16 encoded strings. Since Cocoa keeps essentially all non-ASCII strings encoded in UTF-16 form internally, this means that RegexKitLite can operate directly on the string's buffer without having to make a temporary copy and transcode the string into ICU's required format.
As with all object-oriented programming, the internal representation of an object's information is private. However, the ICU regular expression engine requires that the text to be searched be encoded as a UTF-16 string. For pragmatic purposes, Core Foundation has several public functions that can provide direct access to the buffer used to hold the contents of the string, but such direct access is only available if the private buffer is already encoded in the requested direct access format. As a rough rule of thumb, simple 8-bit strings, such as ASCII, are kept in their 8-bit format, which is essentially UTF-8. Strings that are not simple 8-bit strings are stored as UTF-16. Of course, this is a private implementation detail, so the precise behavior should never be relied upon. It is mentioned because of the tremendous impact it can have on matching performance and efficiency.
For strings in which direct access to the UTF-16 string is available, RegexKitLite uses that buffer. This is the ideal case as no extra work needs to be performed, such as converting the string into a UTF-16 string, and allocating memory to hold the temporary conversion. Of course, direct access is not always available, and occasionally the string to be searched will need to be converted into a UTF-16 string.
RegexKitLite has two conversion buffer caches. Each buffer can only hold the contents of a single NSString at a time. If the selected buffer does not contain the contents of the NSString that is currently being searched, the previous occupant is ejected from the buffer and the current NSString takes its place. The first conversion buffer is fixed in size and set by the C pre-processor define RKL_FIXED_LENGTH, which defaults to 2048. Any string whose length is less than RKL_FIXED_LENGTH will use the fixed size conversion buffer. Strings whose length is longer than RKL_FIXED_LENGTH will use the second, dynamically sized conversion buffer. The memory allocation for the dynamically sized conversion buffer is resized for each conversion with realloc() to the size needed to hold the entire contents of the UTF-16 converted string.
This strategy was chosen for its relative simplicity. Keeping track of dynamically created resources is required to prevent memory leaks. As designed, there is only a single pointer to dynamically allocated memory: the pointer to hold the conversion contents of strings whose length is larger than RKL_FIXED_LENGTH. However, since realloc() is used to manage that memory allocation, it becomes very difficult to accidentally leak the buffer. Having the fixed sized buffer means that the memory allocation system isn't bothered with many small requests, most of which are transient in nature to begin with. The current strategy tries to strike the best balance between performance and simplicity.
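The two-buffer strategy described above might look roughly like the following C sketch. The function name `conversion_buffer` and its `capacity_out` parameter are hypothetical. The text does not say which buffer a string of exactly RKL_FIXED_LENGTH units uses; this sketch assumes the dynamic one.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define RKL_FIXED_LENGTH 2048 /* default from the text */

static uint16_t  fixed_buffer[RKL_FIXED_LENGTH]; /* never allocated or freed */
static uint16_t *dynamic_buffer = NULL;          /* the only heap pointer */

/* Return a buffer able to hold `length` UTF-16 code units, or NULL on failure.
 * The capacity of the returned buffer is written to *capacity_out. */
static uint16_t *conversion_buffer(size_t length, size_t *capacity_out) {
    uint16_t *resized;
    assert(capacity_out != NULL);

    if (length < RKL_FIXED_LENGTH) {
        *capacity_out = RKL_FIXED_LENGTH;
        return fixed_buffer; /* short strings never touch the allocator */
    }

    if (length > SIZE_MAX / sizeof(uint16_t)) return NULL; /* size overflow */

    /* realloc() both grows and shrinks, so one pointer is all there is to track. */
    resized = realloc(dynamic_buffer, length * sizeof(uint16_t));
    if (resized == NULL) return NULL; /* the old block is still valid and owned */
    dynamic_buffer = resized;
    *capacity_out = length;
    return dynamic_buffer;
}
```

Assigning the result of realloc() to a temporary first keeps the old block reachable if the call fails, which is part of what makes a single-pointer scheme hard to leak.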
When converted into a UTF-16 string, the hash of the NSString is recorded, along with the pointer to the NSString object and the string's length. In order for RegexKitLite to use the cached conversion, all of these parameters must be equal to the corresponding values of the NSString to be searched. If there is any difference, the cached conversion is discarded and the current NSString, or NSMutableString as the case may be, is reconverted into a UTF-16 string.
RegexKitLite is also multithreading safe. Access to the compiled regular expression cache and the conversion cache is protected by a single OSSpinLock to ensure that only one thread has access at a time. The lock remains held while the regular expression match is performed since the compiled regular expression returned by the ICU library is not safe to use from multiple threads. Once the match has completed, the lock is released, and another thread is free to lock the cache and perform a match.
The goal of RegexKitLite is not to be a comprehensive Objective-C regular expression framework, but to provide a set of easy to use primitives from which additional functionality can be created. To this end, RegexKitLite provides the following two core primitives from which everything else is built:
There are no additional classes that supply the regular expression matching functionality; everything is accomplished with the two methods above. These methods are added to the existing NSString class via an Objective-C category extension. See NSString RegexKitLite Additions Reference for a complete list of methods.
The real workhorse is the rangeOfRegex:options:inRange:capture:error: method. The receiver of the message is an ordinary NSString class member that you wish to perform a regular expression match on. The parameters of the method are a NSString containing the regular expression regexString, any RKLRegexOptions match options, the NSRange range of the receiver that is to be searched, the capture number from the regular expression regexString that you would like the result for, and an optional error parameter that, if a problem occurs, will contain a NSError object with the details of the error.
A simple example:
In the previous example, the NSRange that capture number 2 matched is {5, 2}, which corresponds to the word is in searchString. Once the NSRange is known, you can create a new string containing just the matching text:
As a practical example of how to use the simple primitives provided by RegexKitLite, consider the common need of having to enumerate all the matches of a regular expression in a target string. The following example creates a simple NSEnumerator based enumerator for all the matches of a regular expression in a target string, returning a NSString of the text matched by the regular expression (capture 0) for each call to nextObject until the end of the string is reached. Each match begins searching where the last match ended.
The match enumerator is divided into two parts. The public part is defined in the header RKLMatchEnumerator.h, below. The second part is a private subclass of NSEnumerator whose interface resides only in the file RKLMatchEnumerator.m. Match enumerators are instantiated by sending a NSString class member the message matchEnumeratorWithRegex:. A NSString with the regular expression is passed as the only argument, and a NSEnumerator is returned.
Next, in RKLMatchEnumerator.m, we define our private sub-class of NSEnumerator. In it we declare three instance variables, string, regex, and location. The string ivar holds the string to search, while regex holds the regular expression string. To guard against mutations to either, an immutable copy is made. The location ivar is used to keep track of the current location from which to begin matching. Finally, we declare our designated initializer which initializes the instantiated RKLMatchEnumerator object with the string to search and the regular expression to use.
The following begins the implementation section of RKLMatchEnumerator and a fairly standard initialization method, initWithString:regex:.
The following implements the heart of any NSEnumerator, the nextObject method. If all of the matches have been enumerated, location will be set to NSNotFound, and the body of the if statement won't be evaluated and NULL will be returned.
If there are still matches to be found, searchRange is created to begin at the value of the location ivar, with the NSRange length set to the remaining length of the string to be searched, or [string length] - location.
Then, the match is performed using the RegexKitLite method rangeOfRegex:inRange: and the result stored in the variable matchedRange.
Next, the location ivar is updated to point to the location at the end of the matchedRange. Since it is possible to have a match with a length of zero, it must handle that special case by adding one, otherwise it will loop endlessly, always matching the same location of zero length. If there was no match, matchedRange.location will be NSNotFound and matchedRange.length will be 0, and the location ivar will be set to NSNotFound.
If the matched range location is not NSNotFound, then a substring of the matched range will be returned. Otherwise, we will exit the if body and return NULL, indicating that the NSEnumerator has no more matches to enumerate.
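The range arithmetic in nextObject can be checked in isolation with a small C sketch. The names `Range`, `NOT_FOUND`, `remaining_range`, and `next_location` are stand-ins invented for this example (for NSRange, NSNotFound, and the inline expressions of the method); the regular expression match itself is not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NOT_FOUND SIZE_MAX /* stand-in for NSNotFound */

typedef struct { size_t location, length; } Range; /* stand-in for NSRange */

/* The part of the string still to be searched, starting at `location`. */
static Range remaining_range(size_t string_length, size_t location) {
    Range r;
    assert(location <= string_length); /* otherwise the subtraction would wrap */
    r.location = location;
    r.length = string_length - location;
    return r;
}

/* Where the next search should begin after `matched`. A zero-length match
 * advances by one so the same empty match is not returned forever. */
static size_t next_location(Range matched) {
    if (matched.location == NOT_FOUND) return NOT_FOUND;
    return matched.location + matched.length + (matched.length == 0 ? 1 : 0);
}
```

A caller of this sketch would also need to stop once the value returned by `next_location` exceeds the string length, since `remaining_range` rejects such a location.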
A standard dealloc, releasing the string and regex ivar objects created during initialization.
And finally, the NSString category addition that returns our match enumerator. This simply creates an instance of our private NSEnumerator sub-class RKLMatchEnumerator, initializes it with the string to match, self, using the regular expression regexString, then sends the instantiated object autorelease, which is finally returned. Since this is a NSString category addition, this message will be sent to an instance of an object that is a member of the NSString class, which includes any objects whose super class is ultimately NSString. Therefore, the string to match is the instance receiving the message, self.
The following piece of code is a simple demonstration of the match enumerator which will use a regular expression to enumerate all the lines in the string to be searched.
The variable searchString contains the string to search. The example string includes several embedded \n, or new-line characters. There are a total of four lines of text, with the third line containing no characters.
The variable regexString contains the regular expression to be used for matching. This regular expression begins with the sequence (?m) which is used to enable the RKLMultiline regular expression option from the text of the regular expression itself. This enables the metacharacters ^ and $ to match the start of and end of a line, respectively. The remaining characters .* will match any character '.' zero or more times '*'. The prose translation would be:
Enable the RKLMultiline option and match all of the characters from the beginning of a line until the end of a line.
The match enumerator is then instantiated and the results are enumerated with a standard while loop, setting matchedString to the object returned by nextObject. For each line that is returned, the current line number, length of the matched string, and the matched string are printed.
The following shell transcript demonstrates compiling the example and executing it. Line number three clearly demonstrates that matches of zero length are possible. Without the additional logic in nextObject to handle this special case, the enumerator would never advance past the match.
For your convenience, the regular expression syntax from the ICU documentation is included below. When in doubt, you should refer to the official ICU User Guide - Regular Expressions documentation page.
Operator | Description |
---|---|
\| | Alternation. A\|B matches either A or B. |
* | Match zero or more times. Match as many times as possible. |
+ | Match one or more times. Match as many times as possible. |
? | Match zero or one times. Prefer one. |
{n} | Match exactly n times. |
{n,} | Match at least n times. Match as many times as possible. |
{n,m} | Match between n and m times. Match as many times as possible, but not more than m. |
*? | Match zero or more times. Match as few times as possible. |
+? | Match one or more times. Match as few times as possible. |
?? | Match zero or one times. Prefer zero. |
{n}? | Match exactly n times. |
{n,}? | Match at least n times, but no more than required for an overall pattern match. |
{n,m}? | Match between n and m times. Match as few times as possible, but not less than n. |
*+ | Match zero or more times. Match as many times as possible when first encountered, do not retry with fewer even if overall match fails. Possessive match. |
++ | Match one or more times. Possessive match. |
?+ | Match zero or one times. Possessive match. |
{n}+ | Match exactly n times. Possessive match. |
{n,}+ | Match at least n times. Possessive match. |
{n,m}+ | Match between n and m times. Possessive match. |
(…) | Capturing parentheses. Range of input that matched the parenthesized subexpression is available after the match. |
(?:…) | Non-capturing parentheses. Groups the included pattern, but does not provide capturing of matching text. Somewhat more efficient than capturing parentheses. |
(?>…) | Atomic-match parentheses. First match of the parenthesized subexpression is the only one tried; if it does not lead to an overall pattern match, back up the search for a match to a position before the (?> . |
(?#…) | Free-format comment (?#comment). |
(?=…) | Look-ahead assertion. True if the parenthesized pattern matches at the current input position, but does not advance the input position. |
(?!…) | Negative look-ahead assertion. True if the parenthesized pattern does not match at the current input position. Does not advance the input position. |
(?<=…) | Look-behind assertion. True if the parenthesized pattern matches text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators). |
(?<!…) | Negative look-behind assertion. True if the parenthesized pattern does not match text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators). |
(?ismwx-ismwx:…) | Flag settings. Evaluate the parenthesized expression with the specified flags enabled or -disabled. |
(?ismwx-ismwx) | Flag settings. Change the flag settings. Changes apply to the portion of the pattern following the setting. For example, (?i) changes to a case insensitive match. See also: Regular Expression Options |
The following outlines the steps required to use RegexKitLite in your project.
First, add the ICU dynamic shared library to your Xcode project. You may choose to add the library to any group in your project; which groups are created by default depends on the template type you chose when you created your project. For a typical Cocoa application project, a good choice is the Frameworks group. To add the ICU dynamic shared library, control/right-click on the Frameworks group and choose
Next, you will need to choose the ICU dynamic shared library file to add. Exactly which file to choose depends on your project, but a fairly safe choice is to select /Developer/SDKs/MacOSX10.5.sdk/usr/lib/libicucore.dylib. You may have installed your developer tools in a different location than the default /Developer directory, and the Mac OS X SDK version should be the one your project is targeting, typically the latest one available.
Then, in the dialog that follows, make sure that Copy items into… is unselected. Select the targets you will be using RegexKitLite in and then click to add the ICU dynamic shared library to your project.
Once the ICU dynamic shared library is added to your project, you will need to add it to the libraries that your executable is linked with. To do so, expand the Targets group, and then expand the executable targets you will be using RegexKitLite in. You will then need to select the libicucore.dylib file that you added in the previous step and drag it into the Link Binary With Libraries group for each executable target that you will be using RegexKitLite in. The order of the files within the Link Binary With Libraries group is not important, and for a typical Cocoa application the group will contain the Cocoa.framework file.
Next, add the RegexKitLite source files to your Xcode project. In the Groups & Files outline view on the left, control/right-click on the group that you would like to add the files to, then select
Select the RegexKitLite.h and / or RegexKitLite.m file from the file chooser dialog.
The next dialog will present you with several options. If you have not already copied the RegexKitLite files into your project's directory, you may want to click on the Copy items into… option. Select the targets that you would like to add the RegexKitLite functionality to.
Finally, you will need to include the RegexKitLite.h header file. The best way to do this is very dependent on your project. If your project consists of only half a dozen source files, you can add:
manually to each source file that makes use of RegexKitLite's features. If your project has grown beyond this, you've probably already organized a common "master" header that collects the headers required by nearly all source files, in which case that header is a natural place for the import.
Using RegexKitLite from the shell is also easy. Again, you need to add the header #import to the appropriate source files. Then, to link to the ICU library, you typically only need to add -licucore, just as you would any other library. Consider the following example:
Compiled and run from the shell:
RegexKitLite is not meant to be a full featured regular expression framework. Because of this, it provides only the basic primitives needed to create additional functionality. It is ideal for developers who:
RegexKitLite consists of only two files, the header file RegexKitLite.h and RegexKitLite.m. The only other requirement is to link with the ICU library that comes with Mac OS X. No new classes are created; all functionality is provided as a category extension to the NSString class.
This documentation is available in the Xcode DocSet format. To add this documentation to Xcode, select
. Then, in the lower left-hand corner of the documentation window, there should be a gear icon with a drop-down menu indicator which you should select and choose and enter the following URL: feed://regexkit.sourceforge.net/RegexKitLiteDocSets.atom
Once you have added the URL, a new group should appear, inside which will be the RegexKitLite documentation with a Get button. Click on the Get button and follow the prompts. Xcode will ask you to enter an administrator's password to install the documentation for the first time, which is explained here.
While RegexKitLite takes steps to ensure that the information it has cached is valid for the strings it searches, there exists the possibility that out of date cached information may be used when searching mutable strings. For each compiled regular expression, RegexKitLite caches the following information about the last NSString that was searched:
An ICU compiled regular expression must be "set" to the text to be searched. Before a compiled regular expression is used, the pointer to the string object to search, its hash, its length, and the pointer to the UTF-16 buffer are compared with the values that the compiled regular expression was last "set" to. If any of these values are different, the compiled regular expression is reset and "set" to the new string.
If a NSMutableString is mutated between two uses of the same compiled regular expression and its hash, length, or UTF-16 buffer changes between uses, RegexKitLite will automatically reset the compiled regular expression with the new values of the mutated string. The results returned will correctly reflect the mutations that have taken place between searches.
It is possible that the mutations to a string can go undetected, however. If the mutation keeps the length the same, then the only way a change can be detected is if the string's hash value changes. For most mutations the hash value will change, but it is possible for two different strings to share the same hash. This is known as a hash collision. Should this happen, the results returned by RegexKitLite may not be correct.
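To make the failure mode concrete, the following C sketch forces a collision with a deliberately weak hash (the sum of the bytes), so that any anagram of the original text collides. The weak hash, `CachedInfo`, `snapshot`, and `change_detected` are assumptions of this example only; a real hash function collides far less often, and the UTF-16 buffer pointer check is omitted for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* What is remembered about the last string searched. */
typedef struct {
    const void   *object; /* pointer to the string object */
    unsigned long hash;
    size_t        length;
} CachedInfo;

/* Deliberately weak hash: the sum of the bytes, so anagrams always collide. */
static unsigned long weak_hash(const char *s) {
    unsigned long h = 0;
    for (; *s != '\0'; s++) h += (unsigned char)*s;
    return h;
}

static CachedInfo snapshot(const char *s) {
    CachedInfo info;
    assert(s != NULL);
    info.object = s;
    info.hash = weak_hash(s);
    info.length = strlen(s);
    return info;
}

/* True if any remembered value differs from the string's current state. */
static bool change_detected(const CachedInfo *cached, const char *s) {
    assert(cached != NULL && s != NULL);
    return cached->object != (const void *)s ||
           cached->hash != weak_hash(s) ||
           cached->length != strlen(s);
}
```

Mutating a buffer in place from "stale" to "least" keeps the pointer, the length, and this weak hash unchanged, so `change_detected` reports false even though the contents differ.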
Therefore, if you are using RegexKitLite to search NSMutableString objects, and those strings may have mutated in such a way that RegexKitLite is unable to detect that the string has changed, you must manually clear the internal cache to ensure that the results accurately reflect the mutations. You can clear the cache by calling the following class method:
Methods will raise an exception if their arguments are invalid, such as passing NULL for a required parameter. An invalid regular expression or RKLRegexOptions parameter will not raise an exception. Instead, a NSError object with information about the error will be created and returned via the address given with the optional error argument. If information about the problem is not required, error may be NULL. Convenience methods that do not have an error argument behave as if error were set to NULL when invoking the primary method.
This method should be used when performing searches on NSMutableString objects and there is the possibility that the string has mutated in between calls to RegexKitLite.
See Regular Expression Options for possible values.
The behavior of a regular expression pattern can be controlled in two ways. When the method supports it, options may be specified by combining RKLRegexOptions flags with the C bitwise OR operator. For example:
The other way is to specify the options within the regular expression itself, of which there are two ways. The first specifies the options for everything following it, and the other sets the options on a per capture group basis. Options are either enabled, or following a -, disabled. The syntax for both is nearly identical:
Option | Example | Description |
---|---|---|
(?ixsmw-ixsmw)… | (?i)… | Enables the RKLCaseless option for everything that follows it. Useful at the beginning of a regular expression to set the desired options. |
(?ixsmw-ixsmw:…) | (?iw-m:…) | Enables the RKLCaseless and RKLUnicodeWordBoundaries options and disables RKLMultiline for the capture group enclosed by the parentheses. |
The following table lists the regular expression pattern option character and its corresponding RKLRegexOptions flag:
Character | Option |
---|---|
i | RKLCaseless |
x | RKLComments |
s | RKLDotAll |
m | RKLMultiline |
w | RKLUnicodeWordBoundaries |
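The correspondence between option characters and flags, and the enable/disable syntax around the - character, can be sketched in C as a small bit-mask exercise. The `DEMO_` constants use illustrative bit values chosen for this example; the actual RKLRegexOptions values are defined in RegexKitLite.h and may differ. `apply_option_letters` is a hypothetical helper, not part of the library.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative bit values only; see RegexKitLite.h for the real constants. */
#define DEMO_NO_OPTIONS              0u
#define DEMO_CASELESS                (1u << 0) /* i */
#define DEMO_COMMENTS                (1u << 1) /* x */
#define DEMO_DOT_ALL                 (1u << 2) /* s */
#define DEMO_MULTILINE               (1u << 3) /* m */
#define DEMO_UNICODE_WORD_BOUNDARIES (1u << 4) /* w */

/* Flag for one option character, or 0 for an unrecognized character. */
static unsigned flag_for(char c) {
    switch (c) {
    case 'i': return DEMO_CASELESS;
    case 'x': return DEMO_COMMENTS;
    case 's': return DEMO_DOT_ALL;
    case 'm': return DEMO_MULTILINE;
    case 'w': return DEMO_UNICODE_WORD_BOUNDARIES;
    default:  return 0u;
    }
}

/* Apply letters such as "iw-m" to `flags`: letters before '-' are enabled,
 * letters after it are disabled. */
static unsigned apply_option_letters(unsigned flags, const char *letters) {
    int disabling = 0;
    assert(letters != NULL);
    for (; *letters != '\0'; letters++) {
        if (*letters == '-') { disabling = 1; continue; }
        if (disabling) flags &= ~flag_for(*letters);
        else           flags |= flag_for(*letters);
    }
    return flags;
}
```

Combining flags by hand with the bitwise OR operator, for instance `DEMO_CASELESS | DEMO_MULTILINE`, gives the same mask as applying the letters "im" to an empty set.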
The RKLICURegexLineErrorKey, RKLICURegexOffsetErrorKey, RKLICURegexPreContextErrorKey, and RKLICURegexPostContextErrorKey error keys may not be present for all errors. For example, errors returned by passing invalid RKLRegexOptions flags will not have the listed keys set.
Initial release.
RegexKitLite is distributed under the terms of the BSD License, as specified below.
Copyright © 2008, John Engelhart
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.