MultiFast is a fast multiple-string search tool which implements Aho-Corasick Algorithm. It contains a tidy and efficient implementation of the Aho-Corasick algorithm as a C library.
Download
Features:
|
To build ahocorasick library just go to the directory and run make.
$ tar -zxvf multifast-v1.4.2.tar.gz
$ cd multifast-v1.4.2/ahocorasick
$ make
The static library libahocorasick.a will be created in the current directory. you can add it to other projects by gcc's library options: -lahocorasick and -L<path>. you also must declare include path for ahocorasick.h header file with -I option. see the Makefiles in example folders. you can build and run them by following commands:
$ cd ../example0/
$ make
$ build/example0
The library provides following functions:
AC_AUTOMATA_t *
ac_automata_init (void);
AC_STATUS_t ac_automata_add (AC_AUTOMATA_t * thiz,
AC_PATTERN_t * pat);
void ac_automata_finalize
(AC_AUTOMATA_t * thiz);
int
ac_automata_search (AC_AUTOMATA_t * thiz,
AC_TEXT_t * txt,
int * keep,
AC_MATCH_CALBACK_f * callback,
void * param);
void
ac_automata_settext (AC_AUTOMATA_t * thiz, AC_TEXT_t * text, int keep);
AC_MATCH_t *
ac_automata_findnext (AC_AUTOMATA_t * thiz);
void ac_automata_release (AC_AUTOMATA_t * thiz);
void ac_automata_display (AC_AUTOMATA_t * thiz, char repcast);
and the following data types:
It is recommended to read the comments and definitions in actypes.h and ahocorasick.h files.
A typical usage of ahocorasick library would involve following steps:
The ahocorasick library v1.4.2 provides two different methods for search:
The following code show a typical usage of settext/findnext method of the library in detail.
#include <stdio.h> #include <string.h> #include "ahocorasick.h" AC_ALPHABET_t * sample_patterns[] = { "simplicity", "the ease", "city", " and ", "simplicity of", "experience", "multifast", "simp", "good one", "whatever", "ever", "a good one", " of ", "clutter", "utter", "clu", "oneout", }; #define PATTERN_COUNT (sizeof(sample_patterns)/sizeof(AC_ALPHABET_t *)) AC_ALPHABET_t * input_text1 = {"experience the ease and simplicity of multifast"}; AC_ALPHABET_t * input_text2 = {"whatever you are be a good one"}; AC_ALPHABET_t * input_text3 = {"out of clutter, find simplicity"}; int main (int argc, char ** argv) { unsigned int i; // 1. Define AC variables: AC_AUTOMATA_t *atm; AC_PATTERN_t tmp_pattern; AC_TEXT_t tmp_text; // 2. Get a new automata atm = ac_automata_init (); // 3. Add patterns to automata for (i=0; i<PATTERN_COUNT; i++) { tmp_pattern.astring = sample_patterns[i]; tmp_pattern.rep.number = i+1; // optional tmp_pattern.length = strlen(tmp_pattern.astring); ac_automata_add (atm, &tmp_pattern); } // 4. Finalize automata ac_automata_finalize (atm); // after you have finished with adding patterns you must finalize the automata // from now you can not add patterns anymore. printf ("Searching: \"%s\"\n", input_text1); // 5. Set the input text tmp_text.astring = input_text1; tmp_text.length = strlen(tmp_text.astring); ac_automata_settext (atm, &tmp_text, 0); // 6. find AC_MATCH_t * matchp; while ((matchp = ac_automata_findnext(atm))) { unsigned int j; printf ("@%2ld: ", matchp->position); for (j=0; j < matchp->match_num; j++) printf("#%ld (%s), ", matchp->patterns[j].rep.number, matchp->patterns[j].astring); // CAUTION: be careful about using m->matched_patterns[j].astring // if 'astring' has permanent allocation inside your program's // memory area, you can use it. otherwise it will point to // an incorrect memory place. printf ("\n"); } printf ("Searching: \"%s\"\n", input_text2); // you can do more search // just use function pair ac_automata_settext/ac_automata_findnext tmp_text.astring = input_text2; tmp_text.length = strlen(tmp_text.astring); ac_automata_settext (atm, &tmp_text, 0); while ((matchp = ac_automata_findnext(atm))) { unsigned int j; printf ("@%2ld: ", matchp->position); for (j=0; j < matchp->match_num; j++) printf("#%ld (%s), ", matchp->patterns[j].rep.number, matchp->patterns[j].astring); printf ("\n"); } printf ("Searching: \"%s\" with \'keep\' enabled\n", input_text3); // and again tmp_text.astring = input_text3; tmp_text.length = strlen(tmp_text.astring); ac_automata_settext (atm, &tmp_text, 1); // when the keep option (3rd argument) in set, then the automata // considers that the given text is the next chunk of the previous text. // to understand the difference try it with 0 and 1 and compare the result while ((matchp = ac_automata_findnext(atm))) { unsigned int j; printf ("@ %2ld: ", matchp->position); for (j=0; j < matchp->match_num; j++) printf("#%ld (%s), ", matchp->patterns[j].rep.number, matchp->patterns[j].astring); printf ("\n"); } // 7. Release the automata ac_automata_release (atm); // do not forget to release the automata after you have done with it return 0; }
Before building multifast, you must compile ahocorasick library first. If you have not compiled it yet, please go to ../ahocorasick/ and make it. then you should do followings:
$ cd multifast
$ make
$ ./multifast -P ../test/cities.pat -drp
../test/input1.txt
$ ./multifast -P ../test/cities.pat -drp ../test/input1.txt
../test/input2.txt
$ ./multifast -P ../test/cities.pat -ndrp /etc/
$ ./multifast -P ../test/sample.pat -xrp ../test/input2.txt
$ find /var/www/ -type f -print0 | xargs -0 ./multifast -P
../test/cities.pat -xrpf
$ ./multifast -P ../test/cities.pat -nxrpiv ../test/input1.txt
$ ./multifast -P ../test/cities.pat -nxrp ../test/input1.txt
../test/input2.txt
$ ./multifast -P ../test/cities.pat -ndrpfv /var/www/
Input file(s) could be a single file name, multiple file names, standard input (-), or single directory name. in last case only regular file of the directory will be searched. multiple file names must be separated by space. you can generate multiple file names for multifast using find and xargs utilities:
$ find /var/www/ -type f -print0 | xargs -0 ./multifast -P ../test/cities.pat -nxrf
It is strongly suggested that use find and xargs to produce multiple files instead of using directory name. although both has same effect, But find and xargs work faster. the following shows two alternative ways for doing the same thing:
$ ./multifast -P ../test/cities.pat -ndrpf /var/www/
$ find /var/www/ -type f -print0 | xargs -0 ./multifast -P
../test/cities.pat -ndrpf
you cat feed multifast from standard input; to do so you need to write a single dash (-) instead of file name. These are examples:
$ cat ../test/input1.txt | ./multifast -P ../test/cities.pat -dp -
$ cat /etc/services | ./multifast -P ../test/cities.pat -dp -
Pattern file includes patterns. the structure of pattern file is very simple. every pattern is defined by a 3-part expression:
AX (REP) {PAT}
The first part, which we call it AX, only takes two values: 'a' or 'x'. the 'a' stands for ASCII and 'x' stands for hexadecimal. this part is mandatory. the interpretation of the 3rd part, PAT, depends to the value of AX.
The second part (REP) defines a meaningful representative for the pattern. for hex patterns or large patterns it helps to improve the intelligibility of output. this part is optional. for patterns without representative the program will assign an automatic representative. the second part is enclosed by parenthesis and only can take 0-9, a-zA-Z and _ (no space allowed).
The third part is the main part which defines the string of character or bytes. the definition of the string could be defined in two ways: ASCII or HEX. this is determined by the first part (AX). for ASCII mode you must put your string inside brackets. if your string contains brackets you must escape it. e.g. {abc\{dd\}g}. you also would escape backslashes too. e.g. {dro\\des} is equal to dro\des. be careful about initial and final spaces between your string and the brackets. they are taken into account. e.g. { remmg} is equal to " remmg" not "remmg". in ASCII mode everything you put inside the brackets (including line breaks) will be taken into account. for HEX mode, only hex digits (0-9, a-fA-F) are allowed inside the brackets. the number of digits inside the bracket must be even. no other constraints exists. there could be spaces between digits.
NOTE:
Example Pattern File:
# comments
a (hooloo) {I am filling lucky}
a {disclosed }
x (maloos) {56 10 23 Ef EB 1D e9 09 d3 7c a4}
#
# comments
a (firy) {from \{23\} to }
x {20 b3 7e 0a 40 97 79 ff ac 2d 84 2c 0c 3d 60 8d} # comments
x(popy) {50 55 42 5 1 6 c c c 0 a}
x (wood) {00 00 00 fe002345 e3}
See more examples of pattern file in multifast-code/multifast/test folder.
Can I compile the package in windows?
The build scripts are provided only for Linux but the ahocorasick library and example programs are totally compatible with Windows. You can compile them by Microsoft Visual Studio. In order to do that you should build an empty project in Visual Studio and add the .c and .h files in ahocorasick and example1 to it, then you can compile it.
kamiar.kanani@gmail.com
Please report bugs or any complain to the email mentioned in Contact or post it to sorcefore's support page.
Kamiar Kanani