The
AGREP software and document parts of this page are still unchanged since 16. September 1998,
because no further developements took place (yet). If you want to contribute an new port or
implementation, please contact
me (see below).
Some interesting links related to AGREP and newer developments as of 16.12.2007PERL module
String:Approx
Perl extension for approximate matching
(fuzzy matching) by Jarkko Hietaniemi, Finland
String::Approx lets you match and
substitute strings approximately. With this you can emulate errors:
typing errorrs, speling errors, closely related vocabularies (colour
color), genetic mutations (GAG ACT), abbreviations (McScot, MacScot).
Excerpt from the README file:
"The most important change was that the
underlying algorithm waschanged completely. Instead of doing everything
in Perl using regular expressions we now do the matching in C using the
so-called Manber-Wuk-differences algorithm shift-add. You have met this
algorithm if youhave used the agrep utility or the Glimpse indexing
system.Because
this implementation shares no code with agrep, only the well-publicized
algorithms, the use of this software is limited only by my copyright
(file COPYRIGHT). This interpretation has been kindly confirmed by Udi
Manber. More details in the file README.apse."
TRE is
a lightweight, robust, and efficient POSIX compliant regexp
matching library with some exciting features such as approximate
(fuzzy) matching.
At the core of TRE is a new algorithm for regular expression matching with submatch addressing. The algorithm uses linear worst-case time in the length of the text being searched, and quadratic worst-case time in the length of the used regular expression. In other words, the time complexity of the algorithm is O(M2N), where M is the length of the regular expression and N is the length of the text. The used space is also quadratic on the length of the regex, but does not depend on the searched string. This quadratic behaviour occurs only on pathological cases which are probably very rare in practice. TRE is not just yet another regexp matcher. TRE has some features which are not there in most free POSIX compatible implementations. Most of these features are not present in non-free implementations either, for that matter. Approximate pattern matching allows matches to be approximate, that is, allows the matches to be close to the searched pattern under some measure of closeness. TRE uses the edit-distance measure (also known as the Levenshtein distance) where characters can be inserted, deleted, or substituted in the searched text in order to get an exact match. Each insertion, deletion, or substitution adds the distance, or cost, of the match. TRE can report the matches which have a cost lower than some given threshold value. TRE can also be used to search for matches with the lowest cost.
TRE includes a version of the agrep command line tool for
approximate regexp matching in the style of grep. Unlike other agrep
implementations (like the one by Sun Wu and Udi Manber from University
of Arizona available
here) TRE agrep allows full regexps of any length, any number of
errors, and non-uniform costs for insertion, deletion and
substitution. TRE is free; you can redistribute it and/or modify it under the terms of the GNU General Public License version 2 (June 1991) as published by the Free Software Foundation. See the file COPYING, included in the distribution packages, for the complete license. If you cannot accept the GNU GPL, it may be possible that you can persuade me to give or sell you a version which is licensed differently. Please contact the author for more information.
Windows
- C - Cygwin
The bitap library , another new and fresh implementation of the bitap algorithm. "The bitap library is a clean
implementation of
regular expression (regex/grep) string matching
using the bitap algorithm. Approximate (also known as fuzzy) matching
is allowed. This is the same
algorithm as the one used in Glimpse and agrep,
but it is much more complete with regard to
regular expression syntax, and is much cleaner.
"
AGREPY
- A Python port of agrep string matching with errors (by
Michael J. Wise, August 2002)
Wiki-based
Intranet- and Internet Search Assistant and bookmark sharing tool using
auto-type detection and approximate and regular expression pattern
shortcuts (WIBISA).
What is AGREP?From the authors' notes:AGREP is a powerful tool for fast searching a file or many files for a string or regular expression, with approximate matching capabilities and user-definable records. It was developed 1989-1991 by Udi Manber and Sun Wu at the University of Arizona and many others (please read CONTRIB.TXT and MANUAL.DOC). AGREP is similar to egrep (or grep or fgrep), but it is much more general and usually faster. It also supports many kinds of queries including arbitrary wild cards, sets of patterns, and in general, regular expressions. It supports most of the options supported by the GREP family plus several more (but it is not 100% compatible with grep). AGREP is the search engine and part of the GLIMPSE tool for searching and indexing whole file systems. GLIMPSE stands for Global Implicit Search and is part of the HARVEST Information Discovery and Access System. AGREP belongs to the University of Arizona, which licenses it (see copyright). It is not public-domain, but free for non-commercial use. And it comes with source! AGREP ports to OS/2 and DOS and to Windows 95 and NTThe first port of AGREP to OS/2 was done by Robert Mahoney (2Rud Software).Based on the version of AGREP as found in the GLIMPSE 3.0 package, I made this port to OS/2 and DOS using Eberhard Mattes' GNU C-compiler EMX 0.9c. As GLIMPSE is now at version 4.1, the relevant changes 3.x -> 4.x with respect to AGREP will be taken into account. Available are stand-alone versions for OS/2 (agrep2.exe V3.35), for native DOS (agrepdos.exe V3.35) and for 32-bit Windows systems (agrepw32.exe V3.37) Rainhard Schnitker's RSX is a DOS DPMI-compliant runtime package, which makes a fourth version (agrep.exe V3.35) also run in a DOS box of OS/2, DOS box of Win3.x/Win95, under native DOS, and OS/2. But running this, you need one, two, or three additional files, depending on your operating system (see next section). Dave was the first to compile a version for Windows 95 and NT
using the Microsoft Visual C (VC) 5.0 compiler (January 1998).
AGREP files version 3.41 for LINUX)agrep-3.41.tgz for LINUX (type "make" to compile)AGREP files version 3.37 (3.35 for DOS and OS/2)
AGREP: What was new in the versions 3.35 .. 3.37 ?Here is the list of all options.
Example:
Limitation: the version 3.37 only supports nnn=437 or nnn=850. More information on codepages can be found here: codepage
437 (US) and 850 (Latin-1) (list of pointers)
Examples: "Ä" matches "ä" but these do not match "a" or "á" !
Example: "A", "Ä", "À", "Á", "Â",
"Ã", "Å",
"Æ", "a", "ä", "à", "á", "â",
"ã",
"å", "æ,"
option -i# format search The environment variable can be used to preset the most preferred options, but actual parameters as given in the commandline take precedence. Example: (in CONFIG.SYS) SET AGREPOPTS=-i -V4 -V version information only
The following can cause an infinite loop: agrep pattern * > output_file. If the number of matches is high, they may be deposited in output_file before it is completely read leading to more matches of the pattern within output_file (the matches are against the whole directory). It's not clear whether this is a "bug" (grep will do the same), but be warned. This is solved in the new version. Another bug in older versions: This is solved now. Limitations of AGREP 3.37The stand-alone executable AGREP2.EXE for OS/2 cannot be run in a DOS box of Windows or OS/2. Please use AGREP.EXE instead, together with EMX.EXE, RSX.EXE or EMX.DLL for running under OS/2. AGREP2.EXE runs under OS/2 only.The stand-alone executable AGREPDOS.EXE for DOS cannot be run in a DOS box of Windows or OS/2, please use AGREP.EXE instead, together with EMX.EXE, RSX.EXE. AGREPDOS.EXE runs under native DOS only (or with EMX.DLL under OS/2). The stand-alone executable for 32-bit Windows (Win95 and Windows/NT) is AGREPW32.EXE. Since the release of AGREP version 3.35, the complete distribution package also contains a Makefile to be used with the Microsoft Visual C (VC) 5.0 resp. Borland C 5.02 compiler. There are a few restrictions regarding the possible combinations of search options. AGREP will display a message, when it does not support the requested combination of options. As later versions could do, please check regularly this page for amendments. Running AGREP.EXE in a DOS box together with RSX.EXE does not allow to search long file or directory names under Windows 95, Windows NT. Known Bugs of AGREP 3.37When searching user-defined records (with option -d),
How can I use AGREP ?The program itself shows six pages of on-line help - when you call it without any parameters.
There is a list of all options of AGREP and a lot of examples. If you find a problem, please send me your bug report. Research & Development: AGREP's Wishlist (To-Do-List)The following lists ideas, proposals, and extensions for improvements of AGREP, which I intend to implement in later versions. Please send me your proposals.
The Rexx Interface allows the simple call of AGREP as a function from any user-written Rexx program. Recommended literature: OS/2 REXX Program Reference REXXAPI.INF. Return codes: $retc = system('AGREP -sV0 ' . $needle . ' ' . $foo); Partially done. Can be achieved by editing CODEPAGE.C and recompiling A new option -Mn let the user specify the maximum number of matches in order to exit out without going through the entire search: Example: will only return the first 100 matched lines (or records) even if the total number of matches is, e.g. 1000. One of the most powerful features is already the -d option allowing user-definable records. The proposed extension of this option would be very useful when
the target
files do not have a certain record structure. In this case, one would
prefer
to run AGREP line oriented (default), but giving -dn
would allow
a range of ±n target lines to be displayed around the
line with the needle. Preferable in combination with highlight
function: Allow user-definable prefix- and suffix strings to mark the needle in the output record. The strings could be composed of ANSI strings to select/deselect
colours,
or they could be used to generate HTML
links. Preferable with RIC technique. @ all letters % all digits ~ all the rest Examples: search for a car plate number which starts with "ABC"
followed
by 4 digits: "ABC%%%%" or to search for US
patents "US%%%%%%%" or for European
patents "EP%%%%%%%" In its current implementation, AGREP needs sixteen characters from the character set internally. The graphic characters, which are not common to text files, cannot be searched at the moment and must therefore not appear in the needle string. A warning message is shown. It is planned to remove that restriction of AGREP in a later version. Algorithms used in AGREP/* Copyright (c) 1994 Sun Wu, Udi Manber, Burra Gopal. All Rights Reserved. */The implementation of agrep includes the following algorithms. Except for exact matching of simple patterns, for which we use a simple variation of the Boyer-Moore algorithm, all the algorithms (listed below) were designed by Sun Wu and Udi Manber.
The most general algorithm inside agrep. It supports many extensions such as approximate regular expression pattern matching, non-uniform costs, simultaneous matching of multiple patterns, mixed exact/approximate matching, etc. The algorithm is described in agrep1.ps.
A sub-linear expect-time algorithm for matching a set of patterns. It assumes that the set of patterns contains k patterns, and that the shortest pattern is of size m. See agrep2.ps for a brief description
of the
algorithm. A Boyer-Moore style algorithm for approximate pattern matching. Let b = log_c (2*m), where c is the size of alphabet set. In the preprocessing, a table is built to determine whether a given substring of size b is in the pattern. Suppose we are looking for matches with at most k errors. The search is done in two passes: In the first pass (the filtering pass), the areas in the text that have a possibility to contain the matches are marked. The second pass finds the matches in those marked areas. The search in the first pass is done in the following way. Suppose the end position of the pattern is currently aligned with position tx in the text. The algorithm scans backward from tx until either (k+1) blocks that do not occur in the pattern have been scanned, or the scan has passed position (tx-m+k). In the former case, pattern is shifted forward to align the beginning position of the pattern with one character after the position in the text where the scan was stopped. In the latter case, we marked tx-m to tx+m as a candidate area. Combining the mgrep algorithm with a partition technique, we have an algorithm with the same time complexity as amonkey. For ASCII text and pattern, this algorithm is faster than amonkey. The principle of the partition technique is as follows. Let A and B be two strings of size m. If we partition A into (k+1) blocks, then the distance between A and B is > k if none of the blocks of A occur in B. This implies that to match A with no more than k errors, B has to contain a substring that matches exactly one block of A. A brief description can be found in agrep2.ps
Literature:The articles describe techniques for approximate pattern search in general and/or with respect to AGREP.When you are interested in special areas (e.g. searching for multiple string patterns), please follow one of the links given here.
ISBN 981-02-1829-X recommended reading on searching algorithms of all flavours !
PostScript version of the paper in its submitted form: agrep1.ps (agrep1ps.zip 59K) recommended reading to understand AGREP !
pp. 153-162, Berkeley, USA, 1991. PostScript version of the paper in its submitted form: agrep2.ps (agrep2ps.zip 35K) recommended reading to understand AGREP !
Addison-Wesley. See sections Text Searching and Algorithms coded in Pascal and C,
Proceeding of CPM'96, pp. 1-23.
A must: Exact string matching algorithms(C sourcecode; visualization in Java; examples) (Charras/Lecroq)
C program source available: "unzip ct9505.zip (256K) AEHNLICH/*.*" Links:The home page of AGREP (non-UNIX) http://www.tgries.de/agrepThe original home page of AGREP and GLIMPSE for UNIX http://glimpse.cs.arizona.edu The home page of HARVEST http://harvest.transarc.com Bug reports:Please file bug reports here.back to the top of this page Back to my home page. Page designed by Opteryx® |