MultiFast 1.4.2

MultiFast is a fast multiple-string search tool which implements Aho-Corasick Algorithm. It contains a tidy and efficient implementation of the Aho-Corasick algorithm as a C library.

Download

Features:

  • Written in C language
  • Efficient implementation of Aho-Corasick algorithm
  • License: LGPLv3
  • Powerful search tool for multiple simple patterns

Aho-Corasick library (ahocorasick v1.6)

Build and Link

To build ahocorasick library just go to the directory and run make.

$ tar -zxvf multifast-v1.4.2.tar.gz
$ cd multifast-v1.4.2/ahocorasick
$ make

The static library libahocorasick.a will be created in the current directory. you can add it to other projects by gcc's library options: -lahocorasick and -L<path>. you also must declare include path for ahocorasick.h header file with -I option. see the Makefiles in example folders. you can build and run them by following commands:

$ cd ../example0/
$ make
$ build/example0

API

The library provides following functions:

AC_AUTOMATA_t * ac_automata_init     (void);
AC_STATUS_t     ac_automata_add      (AC_AUTOMATA_t * thiz, AC_PATTERN_t * pat);
void            ac_automata_finalize (AC_AUTOMATA_t * thiz);
int             ac_automata_search   (AC_AUTOMATA_t * thiz, AC_TEXT_t * txt, int * keep,
                                          AC_MATCH_CALBACK_f * callback, void * param);
void            ac_automata_settext  (AC_AUTOMATA_t * thiz, AC_TEXT_t * text, int keep);
AC_MATCH_t *    ac_automata_findnext (AC_AUTOMATA_t * thiz);
void            ac_automata_release  (AC_AUTOMATA_t * thiz);
void            ac_automata_display  (AC_AUTOMATA_t * thiz, char repcast);

and the following data types:

AC_AUTOMATA_t
The main automata structure that contains all nodes and edges with all pattern data.
AC_PATTERN_t
The pattern data type
AC_TEXT_t
The input text data type
AC_MATCH_t
The match data type that contains information about matched patterns
AC_MATCH_CALBACK_f
A function of type int (*)(AC_MATCH_t *, void *) that is defined by the user. this function will be called back by the library search internal whenever a match is found.

It is recommended to read the comments and definitions in actypes.h and ahocorasick.h files.

Manual

A typical usage of ahocorasick library would involve following steps:

The ahocorasick library v1.4.2 provides two different methods for search:

there in no advantage in one method over another. It's just a design matter. There are examples in the source code which show how to use these two different methods:

The following code show a typical usage of settext/findnext method of the library in detail.

#include <stdio.h>
#include <string.h>

#include "ahocorasick.h"

AC_ALPHABET_t * sample_patterns[] = {
    "simplicity",
    "the ease",
    "city",
    " and ",
    "simplicity of",
    "experience",
    "multifast",
    "simp",
    "good one",
    "whatever",
    "ever",
    "a good one",
    " of ",
    "clutter",
    "utter",
    "clu",
    "oneout",
};
#define PATTERN_COUNT (sizeof(sample_patterns)/sizeof(AC_ALPHABET_t *))

AC_ALPHABET_t * input_text1 = {"experience the ease and simplicity of multifast"};
AC_ALPHABET_t * input_text2 = {"whatever you are be a good one"};
AC_ALPHABET_t * input_text3 = {"out of clutter, find simplicity"};


int main (int argc, char ** argv)
{
    unsigned int i;

    // 1. Define AC variables:
    
    AC_AUTOMATA_t   *atm;
    AC_PATTERN_t    tmp_pattern;
    AC_TEXT_t       tmp_text;

    // 2. Get a new automata
    
    atm = ac_automata_init ();

    // 3. Add patterns to automata
    
    for (i=0; i<PATTERN_COUNT; i++)
    {
        tmp_pattern.astring = sample_patterns[i];
        tmp_pattern.rep.number = i+1; // optional
        tmp_pattern.length = strlen(tmp_pattern.astring);
        ac_automata_add (atm, &tmp_pattern);
    }

    // 4. Finalize automata
    
    ac_automata_finalize (atm);
    // after you have finished with adding patterns you must finalize the automata
    // from now you can not add patterns anymore.
    
    printf ("Searching: \"%s\"\n", input_text1);

    // 5. Set the input text
    
    tmp_text.astring = input_text1;
    tmp_text.length = strlen(tmp_text.astring);
    ac_automata_settext (atm, &tmp_text, 0);
    
    // 6. find
    
    AC_MATCH_t * matchp;
    
    while ((matchp = ac_automata_findnext(atm)))
    {
        unsigned int j;
        
        printf ("@%2ld: ", matchp->position);

        for (j=0; j < matchp->match_num; j++)
            printf("#%ld (%s), ", matchp->patterns[j].rep.number, matchp->patterns[j].astring);
            // CAUTION: be careful about using m->matched_patterns[j].astring
            // if 'astring' has permanent allocation inside your program's
            // memory area, you can use it. otherwise it will point to
            // an incorrect memory place. 
        
        printf ("\n");
    }

    printf ("Searching: \"%s\"\n", input_text2);
    // you can do more search 
    // just use function pair ac_automata_settext/ac_automata_findnext
    
    tmp_text.astring = input_text2;
    tmp_text.length = strlen(tmp_text.astring);
    ac_automata_settext (atm, &tmp_text, 0);
    
    while ((matchp = ac_automata_findnext(atm)))
    {
        unsigned int j;
        
        printf ("@%2ld: ", matchp->position);

        for (j=0; j < matchp->match_num; j++)
            printf("#%ld (%s), ", matchp->patterns[j].rep.number, matchp->patterns[j].astring);
        
        printf ("\n");
    }

    printf ("Searching: \"%s\" with \'keep\' enabled\n", input_text3);
    // and again
    
    tmp_text.astring = input_text3;
    tmp_text.length = strlen(tmp_text.astring);
    ac_automata_settext (atm, &tmp_text, 1);
    // when the keep option (3rd argument) in set, then the automata
    // considers that the given text is the next chunk of the previous text.
    // to understand the difference try it with 0 and 1 and compare the result
    
    while ((matchp = ac_automata_findnext(atm)))
    {
        unsigned int j;
        
        printf ("@ %2ld: ", matchp->position);

        for (j=0; j < matchp->match_num; j++)
            printf("#%ld (%s), ", matchp->patterns[j].rep.number, matchp->patterns[j].astring);
        
        printf ("\n");
    }

    // 7. Release the automata
    
    ac_automata_release (atm);
    // do not forget to release the automata after you have done with it

    return 0;
}
	

multifast Search Utility

Before building multifast, you must compile ahocorasick library first. If you have not compiled it yet, please go to ../ahocorasick/ and make it. then you should do followings:

$ cd multifast
$ make

Manual

Followings are the examples which show the usage of multifast. it uses the sample pattern and input files provided in the package.

$ ./multifast -P ../test/cities.pat -drp ../test/input1.txt
$ ./multifast -P ../test/cities.pat -drp ../test/input1.txt ../test/input2.txt
$ ./multifast -P ../test/cities.pat -ndrp /etc/
$ ./multifast -P ../test/sample.pat -xrp ../test/input2.txt
$ find /var/www/ -type f -print0 | xargs -0 ./multifast -P ../test/cities.pat -xrpf
$ ./multifast -P ../test/cities.pat -nxrpiv ../test/input1.txt
$ ./multifast -P ../test/cities.pat -nxrp ../test/input1.txt ../test/input2.txt
$ ./multifast -P ../test/cities.pat -ndrpfv /var/www/

Usage

multifast -P pattern_file [-ndxrpfivh] file1 [file2 ...]

-P pattern_file
Specifies pattern file name
-n
Show number of match per file
-d
Show start position (decimal)
-x
Show start position (hex)
-r
Show representative string of the pattern
-p
Show the matched pattern
-f
Find first match per file only
-i
Search case insensitive
-v
Show verbose output
-h
Show help

Input file

Input file(s) could be a single file name, multiple file names, standard input (-), or single directory name. in last case only regular file of the directory will be searched. multiple file names must be separated by space. you can generate multiple file names for multifast using find and xargs utilities:

$ find /var/www/ -type f -print0 | xargs -0 ./multifast -P ../test/cities.pat -nxrf

It is strongly suggested that use find and xargs to produce multiple files instead of using directory name. although both has same effect, But find and xargs work faster. the following shows two alternative ways for doing the same thing:

$ ./multifast -P ../test/cities.pat -ndrpf /var/www/
$ find /var/www/ -type f -print0 | xargs -0 ./multifast -P ../test/cities.pat -ndrpf

you cat feed multifast from standard input; to do so you need to write a single dash (-) instead of file name. These are examples:

$ cat ../test/input1.txt | ./multifast -P ../test/cities.pat -dp -
$ cat /etc/services | ./multifast -P ../test/cities.pat -dp -

Pattern file

Pattern file includes patterns. the structure of pattern file is very simple. every pattern is defined by a 3-part expression:

AX (REP) {PAT}

The first part, which we call it AX, only takes two values: 'a' or 'x'. the 'a' stands for ASCII and 'x' stands for hexadecimal. this part is mandatory. the interpretation of the 3rd part, PAT, depends to the value of AX.

The second part (REP) defines a meaningful representative for the pattern. for hex patterns or large patterns it helps to improve the intelligibility of output. this part is optional. for patterns without representative the program will assign an automatic representative. the second part is enclosed by parenthesis and only can take 0-9, a-zA-Z and _ (no space allowed).

The third part is the main part which defines the string of character or bytes. the definition of the string could be defined in two ways: ASCII or HEX. this is determined by the first part (AX). for ASCII mode you must put your string inside brackets. if your string contains brackets you must escape it. e.g. {abc\{dd\}g}. you also would escape backslashes too. e.g. {dro\\des} is equal to dro\des. be careful about initial and final spaces between your string and the brackets. they are taken into account. e.g. { remmg} is equal to " remmg" not "remmg". in ASCII mode everything you put inside the brackets (including line breaks) will be taken into account. for HEX mode, only hex digits (0-9, a-fA-F) are allowed inside the brackets. the number of digits inside the bracket must be even. no other constraints exists. there could be spaces between digits.

NOTE:

Example Pattern File:

# comments
a (hooloo) {I am filling lucky}
a {disclosed }
x (maloos) {56 10 23 Ef EB 1D e9 09 d3 7c a4}
#
# comments
a (firy) {from \{23\} to }
x {20 b3 7e 0a 40 97 79 ff ac 2d 84 2c 0c 3d 60 8d} # comments
x(popy) {50 55 42 5 1 6 c c c 0 a}
x (wood) {00 00 00 fe002345 e3}

See more examples of pattern file in multifast-code/multifast/test folder.

FAQ

Can I compile the package in windows?

The build scripts are provided only for Linux but the ahocorasick library and example programs are totally compatible with Windows. You can compile them by Microsoft Visual Studio. In order to do that you should build an empty project in Visual Studio and add the .c and .h files in ahocorasick and example1 to it, then you can compile it.

Contact

kamiar.kanani@gmail.com

Bug report

Please report bugs or any complain to the email mentioned in Contact or post it to sorcefore's support page.

Author

Kamiar Kanani