Hardcoding Finite State Automata Processing E. KETCHA NGASSAM, BRUCE W. WATSON AND DERRICK G. KOURIE Department of Computer Science University of Pretoria URL: http://fastar.cs.up.ac.za To hardcode and algorithm means to build into it the data that it requires. In this paper, we present various experiments in hardcoding the transition table of a finite state machine directly into string-recognizing code. Experiments are carried out in two phases. The first phase is limited to the analysis of the hardcoded behavior in relation to acceptance or rejection of a single symbol in some arbitrary state of some finite automaton. Then follows a simulation of the analysis of some hardcoded solution for recognizing an entire string. Measurements are provided to show the time efficiency gains by various hardcoded versions over the traditional table-driven approach. Categories and Subject Descriptors: F.1.1 [Theory of Computation]: Computation by Abstract Devices—Automata; F.4.1 [Theory of Computation]: Mathematical Logic —Computability Theory; F.4.2 [Theory of Computation]: Grammars and their rewriting systems —Grammar types, Parsing General Terms: Algorithms, Experimentation, Performance. Additional Key Words and Phrases: Hardcoding, Automata, Pattern Matching.

1.

INTRODUCTION

To hardcode an algorithm means to build into it the specific data that it requires. Even though this means that the algorithm cannot run with alternative data, a hardcoded algorithm may sometimes be more efficient than its softcode counterpart. Implementers of finite state automata (FSAs) often use a table to represent the transition function. The conventional table-driven algorithm to determine whether an FSA recognizes a given string is generic in the sense that the transition table is part of the input data to the algorithm. The time taken by such an algorithm to determine whether an FSA recognizes a string, thus depends inter-alia on the memory load as represented by the size of the transition matrix. When manipulating large automata, the implementer has to be aware of, and indeed avoid, unpredictable behavior such as early program termination caused by memory overflow. This can be done by applying more complex techniques, such as vectorization [Dongarra et al. 1984] or recursion algorithms for efficient cache memory usage [Douglas et al. 1999; Bjarne et al. 1999]. Much of the work that has been done to improve automata implementation efficiency, has been at the modelling level; that is, before the automaton’s transition table has been set up for processing by the standard algorithms. Automata minimization Watson[2002; 1995], and the study of specialized algorithms on problems using an FSA as basic model (Pattern Matching, Parsing, DNA analysis etc...) are among several examples available where the model is optimized before implementation. Hardcoding automata has already proven to be effective in particular automata implementation areas such as parsing [Pennello. 1986; Horspool et al. 1988], and code generation [Achyutram et al. 1995; Fraser et al. 1991]. However, little has been said about hardcoding automata in a more general sense. This work helps to fill the gap. It consists of a performance analysis of various versions of string recognition algorithms based on hardcoded FSA implementation. However, instead of directly analyzing a hardcoded solution for recognizing an entire string, the analysis is first of all limited at the most elementary form possible: the analysis of hardcoded behavior in relation to acceptance or rejection of a single symbol in some arbitrary state of some finite automaton. This is followed by a simulation of the analysis of some hardcoded solution for recognizing an entire string based on the results obtained from previous experiments. We perform various tests using various coding strategies to help capture Author Addresses: E. Ketcha Ngassam, Department of Computer Science, University of Pretoria, Pretoria 0002, South Africa; [email protected] Bruce W. Watson, Department of Computer Science, University of Pretoria, Pretoria 0002, South Africa; [email protected] Derrick G. Kourie, Department of Computer Science, University of Pretoria, Pretoria 0002, South Africa; [email protected] Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, that the copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than SAICSIT or the ACM must be honoured. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. c 2003 SAICSIT

Proceedings of SAICSIT 2003, Pages 111–121

112

•

E. Ketcha Ngassam et al.

the advantage of using hardcode for situations where speed is a major factor. A conclusion of this work is that hardcoding may sometimes yield significant efficiency improvements over the traditional table-driven method, and might therefore be appropriate in particular circumstances where timing is important. Section 2 presents the foundational background to the hardcoding experiments that were carried out. Section 3 explains the various approaches taken with a view to exploring optimization. Two levels of experiments were carried out, the first being where the size of the FA and the associated hardcode code was limited to an absolute minimum. Section 4 describes the data generation and collection processes used in this first round of experiments and section 5 presents a summary of these results. Section 6 describes a further experiment (the second level) induced by these previous results. Here various levels of cache and RAM memory is exercised, based on larger FAs than in the first experiments but relying on the best hardcoded algorithm previously encountered. Section 7 gives the overall conclusion and indicates directions for future research. 2.

FOUNDATIONS

This section reviews a standard table-driven string recognition algorithm. It justifies the decision to restrict the study to an investigation of an automaton recognizing a single symbol and indicates in pseudo code how softcoding and hardcoding for such recognition should take place. It then reviews specific software support structures used for time measurement and random number generation. 2.1

The table-driven string recognition algorithm

A table-driven algorithm is the usual basis for ascertaining whether a string str is a member of the language generated by an FSA, M . Consider a string, str, of length len > 0, and a table transition[i][j] that represents the transition function of M , where 0

1.

INTRODUCTION

To hardcode an algorithm means to build into it the specific data that it requires. Even though this means that the algorithm cannot run with alternative data, a hardcoded algorithm may sometimes be more efficient than its softcode counterpart. Implementers of finite state automata (FSAs) often use a table to represent the transition function. The conventional table-driven algorithm to determine whether an FSA recognizes a given string is generic in the sense that the transition table is part of the input data to the algorithm. The time taken by such an algorithm to determine whether an FSA recognizes a string, thus depends inter-alia on the memory load as represented by the size of the transition matrix. When manipulating large automata, the implementer has to be aware of, and indeed avoid, unpredictable behavior such as early program termination caused by memory overflow. This can be done by applying more complex techniques, such as vectorization [Dongarra et al. 1984] or recursion algorithms for efficient cache memory usage [Douglas et al. 1999; Bjarne et al. 1999]. Much of the work that has been done to improve automata implementation efficiency, has been at the modelling level; that is, before the automaton’s transition table has been set up for processing by the standard algorithms. Automata minimization Watson[2002; 1995], and the study of specialized algorithms on problems using an FSA as basic model (Pattern Matching, Parsing, DNA analysis etc...) are among several examples available where the model is optimized before implementation. Hardcoding automata has already proven to be effective in particular automata implementation areas such as parsing [Pennello. 1986; Horspool et al. 1988], and code generation [Achyutram et al. 1995; Fraser et al. 1991]. However, little has been said about hardcoding automata in a more general sense. This work helps to fill the gap. It consists of a performance analysis of various versions of string recognition algorithms based on hardcoded FSA implementation. However, instead of directly analyzing a hardcoded solution for recognizing an entire string, the analysis is first of all limited at the most elementary form possible: the analysis of hardcoded behavior in relation to acceptance or rejection of a single symbol in some arbitrary state of some finite automaton. This is followed by a simulation of the analysis of some hardcoded solution for recognizing an entire string based on the results obtained from previous experiments. We perform various tests using various coding strategies to help capture Author Addresses: E. Ketcha Ngassam, Department of Computer Science, University of Pretoria, Pretoria 0002, South Africa; [email protected] Bruce W. Watson, Department of Computer Science, University of Pretoria, Pretoria 0002, South Africa; [email protected] Derrick G. Kourie, Department of Computer Science, University of Pretoria, Pretoria 0002, South Africa; [email protected] Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, that the copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than SAICSIT or the ACM must be honoured. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. c 2003 SAICSIT

Proceedings of SAICSIT 2003, Pages 111–121

112

•

E. Ketcha Ngassam et al.

the advantage of using hardcode for situations where speed is a major factor. A conclusion of this work is that hardcoding may sometimes yield significant efficiency improvements over the traditional table-driven method, and might therefore be appropriate in particular circumstances where timing is important. Section 2 presents the foundational background to the hardcoding experiments that were carried out. Section 3 explains the various approaches taken with a view to exploring optimization. Two levels of experiments were carried out, the first being where the size of the FA and the associated hardcode code was limited to an absolute minimum. Section 4 describes the data generation and collection processes used in this first round of experiments and section 5 presents a summary of these results. Section 6 describes a further experiment (the second level) induced by these previous results. Here various levels of cache and RAM memory is exercised, based on larger FAs than in the first experiments but relying on the best hardcoded algorithm previously encountered. Section 7 gives the overall conclusion and indicates directions for future research. 2.

FOUNDATIONS

This section reviews a standard table-driven string recognition algorithm. It justifies the decision to restrict the study to an investigation of an automaton recognizing a single symbol and indicates in pseudo code how softcoding and hardcoding for such recognition should take place. It then reviews specific software support structures used for time measurement and random number generation. 2.1

The table-driven string recognition algorithm

A table-driven algorithm is the usual basis for ascertaining whether a string str is a member of the language generated by an FSA, M . Consider a string, str, of length len > 0, and a table transition[i][j] that represents the transition function of M , where 0