Recurrent Neural Networks
HOME LINKS RESEARCH PUBLICATIONS
Contents
A Recurrent Neural Network that Learns to Count by Rodriguez et al
Representation Beyond Finite States: Alternatives to Pushdown Automata by Wiles et al
Constructing Deterministic Finite-State Automata in Recurrent Neural Networks by Omlin & Giles
Hill Climbing in Recurrent Neural Networks for Learning the anbncn Language by Chalup & Blair
Gradient Calculations for Dynamic Recurrent Neural Networks: A Survey by Pearlmutter
Mathematical Foundations of Simple Recurrent Networks in Language Processing (Thesis Chapter 6: Conclusion) by Rodriguez
Finite-Dimensional Analog Computers: Flows, Maps, and Recurrent Neural Networks by Moore
Connectionism, Artificial Life and Dynamical Systems: New approaches to old questions by Elman
Operators and curried functions: Training and analysis of simple recurrent networks by Wiles & Bloesch
Learning to adapt to changing environments in evolving neural networks by Nolfi & Parisi
Computational Simulations of the Emergence of Grammar by Batali
Encoding of Sequential Translators in Discrete-Time Recurrent Neural Nets by Ñeco et al
Context Free Grammar Representations in Neural Networks by Tabor
Can Recurrent Neural Networks Learn Natural Language Grammars? by Lawrence et al
Neural Networks: Automata and Formal Models of Computation by Forcada
Asynchronous translations with recurrent neural nets by Ñeco and Forcada
Neural Learning of approximate simple regular languages by Forcada et al
Second-Order Recurrent Neural Networks can Learn Regular Grammars from Noisy Strings by Carrasco and Forcada
Design of Artificial Neural Networks with a Stochastic Graph-L-System by Voigt et al
Efficient Evolution of Neural Network Topologies by Stanley & Miikkulainen
Rodriguez, Wiles & Elman – A Recurrent Neural Network that Learns to Count
The authors of this article carried out simulations of recurrent neural networks (RNNs) to determine if an RNN could learn to process a deterministic context free language (DCFL) when it was given a prediction task. The aim of this was to understand more clearly how RNNs perform natural language processing tasks. This simulation was tied in with human language performance by using dynamical systems theory to analyse the simulation results.
During training, the network was presented with a string of symbols from the DCFL, the maximum length of which was 22. The network was expected to correctly predict the next symbol, and also when the string was expected to end. The RNN had two input units, two hidden units, two copy units (provided recurrent connections between hidden units), two output units and one bias unit. The training algorithm was backpropagation through time (BPTT). Out of 50 trials, eight networks were able to learn the training set and generalize. These were analysed using dynamical systems theory, to demonstrate more clearly the behaviour of the networks.
This article demonstrates that the capabilities of an RNN are not constrained by resource differences such as memory. The RNN is able to function as something beyond a finite state machine, by processing a context free language. It does not seem to require a stack or counter mechanism, as a pushdown automata does. This might indicate that it does not fit into the Chomsky hierarchy of languages and the machines that can process them.
Reference details for “A Recurrent Neural Network that Learns to Count”:
Rodriguez, P., Wiles, J., & Elman, J. A Recurrent Neural Network That Learns to Count,Connection Science, Vol. 11, No. 1, 5-40, 1999.
Wiles et al - Representation Beyond Finite States: Alternatives to Pushdown Automata
The focus of this article is to show that Dynamical Recurrent Networks (DRNs) can be taken a step further than being used as a deterministic finite-state automata (DFA). DFAs (and therefore DRNs) can be used to predict the next word in a regular language – the aim is to show that DRNs can predict context-free and context-sensitive languages. This would usually require a machine that has at least a limited amount of memory (such as the pushdown automata). Wiles et al argue that the requirement for a DRN to learn a nonregular language is not the amount of memory of the machine, but the precision constraints that are imposed on it.
Two types of tasks were presented to the DRNs - prediction and recognition – to demonstrate that the DRNs can learn nonregular languages. Three different types of simulations were carried out. The first showed the ability of a DRN without an external stack to perform a relatively simple counting task. The simulation also showed that the DRN was able to generalize on the data given during training. The second simulation showed the ability of a DRN to learn and generalize on a context free language. The final simulation showed that a DRN was also able to predict symbols in a context-sensitive language.
This article demonstrates some of the differences between symbolic and dynamical systems. It also gives some solid evidence that the computational hierarchy of finite-state automaton, pushdown automaton etc based on memory constraints might not be the most effective separation mechanism, as under some conditions a network can function as more than one type of machine.
Reference details for “Representation Beyond Finite States: Alternatives to Pushdown Automata”:
Wiles, J., Blair, A.D. & Boden, M., 2001, Representation Beyond Finite States: Alternatives to Pushdown Automata in Kolen, J.F. & Kremer, S.C. (eds) A Field Guide to Dynamical Recurrent Networks, pp 129-142, New York: IEEE Press.
Activation vector of the state variables of a discrete time dynamical system changes over time according to this equation:

where It is the input pattern at time t. The fully recurrent state space is described by the following two equations:


where wxx is the weight of x, wyy is the weight of y, wxy is the weight of connection from y to x, wyx is the weight of the connection from x to y, and wxθ and wyθ are the biases for x and y respectively.
Omlin & Giles – Constructing Deterministic Finite-State Automata in Recurrent Neural Networks
Previous research has shown that recurrent neural networks (RNNs) with sigmoidal discriminant functions that are trained to behave as deterministic finite-state automata (DFAs) experience performance degradation on long strings. The authors of this article aim to show that a new construction algorithm that creates second order RNNs with sparse connections is able to minimise these problems: it can classify strings with arbitrary length.
The network used in the experiment was a based on a randomly generated 100 state DFA – therefore the network had 101 state neurons (one for each DFA state, as well as an output unit that accepts/rejects the output of the unit after the string has been processed). The network had a continuous sigmoidal discriminant function. Partially recurrent and fully recurrent networks were tested and demonstrated (along with mathematical proof) that a RNN could be constructed from a DFA that had the same behaviour as the DFA.
The author’s method for constructing RNNs is compared with earlier methods. Networks constructed by this method perform comparably or better than all other construction algorithms. The important things to note are: the network has not lost any computational capabilities with the introduction of second order weights, and this article only contains a proof that a RNN can behave like a DFA. There is no proof that the RNN can learn to behave as a DFA.
Go to “Constructing Deterministic Finite-State Automata in Recurrent Neural Networks” at Citeseer:
http://citeseer.nj.nec.com/omlin96constructing.html
Reference details for “Constructing Deterministic Finite-State Automata in Recurrent Neural Networks”:
Omlin, C. & Giles, C. L., 1996, Constructing Deterministic Finite-State Automata in Recurrent Neural Networks, Technical Report No 94-3, Troy: NY.
Each neuron in the first order network computes the following function:
where ai(t) is calculated by:
where Sj(t) represents the output of other neurons and Ik(t) represents the input neurons.
Chalup & Blair – Hill Climbing in Recurrent Neural Networks for Learning the anbncn Language
This article aims to show that a first order recurrent neural network (RNN) can be trained to predict the next symbol in a context-sensitive language (such as anbncn). The difference between this method and many others is that the algorithm that is used to train the networks is evolutionary hill climbing (not the more common backpropagation). The network is expected to be able to predict strings of up to length 12.
The architecture of the RNN contained three units in each layer. The network has three input units, three hidden units, three state units and three output units. The activation function for each unit is the hyperbolic tangent function (as opposed to the more common sigmoid function). The task that the network is presented with is to predict the next symbol in a sequence. Using staged learning (similar to incremental input learning), the network was trained to be able to predict symbols in strings up to depth 10 or 12.
There were two significant aspects to this paper. The first is that the authors claim that they have provided more support for the view that language complexity classes (i.e. the Chomsky hierarchy) may not be as appropriate for dynamical systems as it is for physical symbol systems. The other major claim of this paper is that the evolutionary hill climbing algorithm was able to overcome some of the instability problems that can be encountered when using backpropagation.
Go to “Hill Climbing in Recurrent Neural Networks for Learning the anbncn Language”:
http://www.cs.newcastle.edu.au/~chalup/Publications.html
Reference details for “Hill Climbing in Recurrent Neural Networks for Learning the anbncn Language”:
Chalup, S. & Blair, A., 1999, Hill Climbing in Recurrent Neural Networks for Learning the anbncn Language in Gedeon, T. et al (eds) Proceedings, 6th International Conference on Neural Information Processing (ICONIP'99), Perth, Australia 16-20 November 1999, pp508-513.
Pearlmutter – Gradient Calculations for Dynamic Recurrent Neural Networks: A Survey
This article provides a summary of learning algorithms for recurrent neural networks that have hidden units. Some simulations were carried out to show the comparison in time and resource use between the algorithms (however the focus of this article is not the results of these simulations). An initial distinction is made between Continuous and Discrete Time Networks (this article focuses mainly on Continuous Time Networks).
The algorithms that were examined were divided into three categories: Learning in Networks with Fixed Points, Computing the Gradient without assuming a Fixed Point and some other Non-Fixed Point Techniques. The section on Fixed Point learning examines Recurrent Backpropagation and Deterministic Boltzmann Machines. Techniques that allow a recurrent network without a fixed point to learn include Backpropagation Through Time and Real Time Recurrent Learning (in which time constants and time delays have to be taken into account). Finally the author examines “Elman Nets”, the Moving Targets Method, Feedforward Networks with State and Teacher Forcing in Continuous Time.
The importance of this article lies in the comparison of the algorithms that are used to learn in recurrent neural networks. It also shows that the time and resource use of learning in recurrent networks when compared with non-recurrent networks is not as bad as believed. These algorithms can be optimised and have been shown to allow generalization.
Go to Pearlmutter’s “Gradient Calculations for Dynamic Recurrent Neural Networks: A Survey”:
http://www.cs.unm.edu/~bap/publications.html
Reference details for “Gradient Calculations for Dynamic Recurrent Neural Networks: A Survey”:
Barak A. Pearlmutter. 1995, Gradient calculation for dynamic recurrent neural networks: a survey. IEEE Transactions on Neural Networks, 6(5):1212-1228.
The aim of Rodriguez’s thesis was to investigate questions regarding the ability of a Simple Recurrent Network (SRN) to function as a Context Free Language (CFL) processor. Some of the questions asked were: Does an SRN reflect human processing data on center-embedded sentences? What is the nature of the generalizations in SRNs? Rodriguez looks at SRNs from the traditional viewpoint of Chomsky’s original argument about syntax theory.
This section of Rodriguez’s thesis does not discuss the simulations that were carried out in earlier chapters. Rather, he discusses the abilities of the SRN in relation to the elements of Chomsky’s syntax theory. He also discusses the questions that have been raised and not yet answered by this work on SRNs. These include: Can a network learn unrestricted context-sensitive solutions? Can a network be made more robust in learning counting tasks? What is the relationship between SRNs as Analog Counter Automata and Biological Models of Neurons?
The next progression with this work in SRNs is to attempt training tasks beyond sentence processing and prediction. The factors that most closely affect the capabilities of the network are precision and the ability of the network to develop trajectories in phase space that perform the correct functions. Rodriguez suggests that these factors need to be examined more closely.
Go to Rodriguez’s thesis “Mathematical Foundations of Simple Recurrent Networks in Language Processing”:
http://cogsci.ucsd.edu/cogsci/publications/prodriguez/
Unpublished. However, papers have been developed from this thesis. They can be found at:
http://www.people.virginia.edu/~pr9t/Equations taken from Rodriguez's thesis (chapter 1):
and

where t represents time, X is the vector of the hidden units, I is the input vector, Y is the output vector, WHH is the weight matrix from hidden to hidden units, WHI is the weight matrix from input to hidden units and WOH is the weight matrix from hidden to output units.
Moore – Finite-Dimensional Analog Computers: Flows, Maps, and Recurrent Neural Networks
This article is a review of previous work on the computational power of dynamical systems. The author first explains a number of concepts relating to universal computation (i.e. Turing machines), and then looks at factors that can affect computation (examples given are computing in real time and being bound by resources). Finally, Moore looks at the difficulties of determining the lower bounds of a problem’s complexity.
The author did not carry out any simulations for this article. Instead, it focuses on addressing some questions (and solving some conjectures) about the nature of Universal Turing machines. The first question is whether a Turing machine will ever halt on a given input. The author looks at two different ways to create universal computation in low-dimensional spaces, both of which have drawbacks. For these models of universal computation to be interesting to anyone other than mathematicians, it has to be possible to build them. They must therefore be generic (the strongest definition of which is structural stability).
The interesting thing about this article was that in a couple of well defined conjectures, the author has stated some important properties of Universal Turing machines. He has also shown how a dynamical recognizer could represent a recurrent neural network.
Go to Moore’s article “Finite-Dimensional Analog Computers: Flows, Maps, and Recurrent Neural Networks”:
http://www.santafe.edu/~moore/
Reference details for “Finite-Dimensional Analog Computers: Flows, Maps, and Recurrent Neural Networks”:
Invited talk for the First International Conference on Unconventional Models of Computation in Auckland, New Zealand, 1998.
L & R are integers representing the left and right half of a tape in an arbitrary Turing machine:
and:
where
i is the position of the counter
a is the symbol that is being read
Elman – Connectionism, Artificial Life and Dynamical Systems: New approaches to old questions
The aim of this article was to provide some historical context and background information for some of the relatively new approaches to solving problems in cognition. The article begins by looking into the history of research into cognition in terms of computation, and the assumptions and theories that dominated the area up until the 1970s and 1980s.
Then the author discusses Connectionism (where processing is carried out by a number of simple processing elements). He outlines the key characteristics of connectionist networks, some of the issues that have arisen in the course of studying this area (such as the question of how to account for regularity) and some of the shortcomings of these type of networks. Secondly, Elman discusses Artificial Life (Alife) – where simple systems often produce complex behaviour and frequently exhibit emergentism. He also discusses evolution and goal driven behaviour. Lastly, Elman discusses Cognition as a Dynamical System (providing a formalism that can characterize the changes which occur in the system over time).
A useful aspect of this article was the clear description of how cognition can be seen as a dynamical system. Elman finishes by stating that the three models that he discussed share some things in common but none of them are enough to explain cognition by themselves. However, each of the models complements others in some way, and a combination of the three could produce some very exciting discoveries (Elman, however, is careful to leave his speculation at this point).
Go to Elman’s “Connectionism, Artificial Life and Dynamical Systems: New approaches to old questions”:
Elman, J. L., 1998, Connectionism, artificial life, and dynamical systems: New approaches to old questions. In W. Bechtel and G. Graham (eds.) A Companion to Cognitive Science. Oxford: Basil Blackwood.
The aim of this article was to demonstrate that the network can learn to represent three functions simultaneously. Another aim of the article was to show the different ways that a network can be analysed (Canonical Discriminant Analysis (CDA) and a method of developing curried functions from the input sequence).
The simulations of this article used a Simple Recurrent Network (SRN) with three input, five hidden, five context, one output and five hint units (hint units are extra target units at the output layer). The network was trained and tested on Boolean functions of the current input and previous output, and also AND, OR and XOR functions. The network was trained on three data sets each of which contained 700 randomly generated patterns. Inputs were strings of 0s and 1s, of arbitrary length. The network was also tested for its ability to generalise on data.
An important point that was raised by this article is that the information for computing a Boolean function can be found in the weights of the network. However, the simulations that were carried out demonstrated that this does not always generalise to other networks (even ones that use the same Boolean functions). The long term interest in using operators and curried functions for network is to further explore ways of integrating process and representation in recurrent networks.
Reference details for Wiles & Bloesch’s “Operators and curried functions: Training and analysis of simple recurrent networks”:
Wiles, J. & Bloesch, A., 1992, Operators and curried functions: Training and analysis of simple recurrent networks in Moody et al (eds) Neural Information Processing Systems 4.
The current state is described by the following equation:
where
and S(0) is the initial state of the network.
Nolfi & Parisi – Learning to adapt to changing environments in evolving neural networks
The aim of this article was to demonstrate the learning should not be studied in isolation from evolution. The authors claim that the adaptive functions of learning (helps to guide evolution, allows adaptation to fast environmental changes and makes it possible to overcome size limitations of the genotype) are important from an evolutionary point of view. In this article, they focus on one aspect of learning – that it allows networks to adapt to changes in the environment much faster than evolution alone.
The simulations that were carried out for this paper were a modified version of Todd & Miller’s (1991) aquatic environment. The networks are set up as creatures that have to choose between food and poison. The modifications were made when some of the constraints on the earlier model were removed. The main difference is that agents were moved around the environment, rather than remaining stationery for their entire lifetime. Each network consisted of two input (sensory) units and one output (motor) unit. Part of each network was a sub-network that acted as a self-teacher.
The implications of the simulations that were carried out in this paper demonstrate that learning should not be studied separately from evolution. An interesting conclusion that was presented was the comparison between the abilities of a learning population vs a non-learning population. The learning population were able to perceive more differences in their environment at birth than non-learners, however that difference decreased after the learning population was allowed to learn.
Go to Nolfi & Parisi’s article “Learning to adapt to changing environments in evolving neural networks”:
http://gral.ip.rm.cnr.it/nolfi/nolfipub.html
Reference details for “Learning to adapt to changing environments in evolving neural networks”:
Nolfi S. & Parisi D. (1997). Learning to adapt to changing environments in evolving neural networks. Adaptive Behavior, (5) 1:75-98
Batali – Computational Simulations of the Emergence of Grammar
The aim of this paper was to test the ability of agents (that consisted of recurrent neural networks) to develop co-ordinated communication systems. The agents in the population took turns sending or receiving sequences, to encourage them to learn to interpret sequences that were being sent by other agents. The agents were expected to develop a system of communication that was able to express meaning fully in the minimum number of symbols.
Each agent consisted of a neural network that could behave either as a speaker or a hearer and a meaning vector. When the agent is acting as a hearer, the meaning vector is used by the agent as a target output. When t acting as a speaker, it is used to represent the meaning that the agent sends to other hearers. Initially, the agents in the population are exposed to a number of contradictory meanings for the same sequence. As the number of epochs in a simulation increase, the negotiation process between the agents brings about agreement on the meaning of a string of syllables.
The networks were analysed after the simulations had been run and the results showed that the sequences were significantly shorter than at the beginning of the simulation and that measurably different sequences were sent to convey different meanings. Some of the agents from simulations were tested to see if they could convey “novel” meanings (i.e. if they could generalise) – they could. The main implication from these simulations is that an ability to express structured meaning can be developed without innate language-specific abilities.
Go to Batali’s “Computational Simulations of the Emergence of Grammar”:
http://cogsci.ucsd.edu/~batali/cv.html#Papers
Reference details for “Computational Simulations of the Emergence of Grammar”:
J. Batali, 1998, Computational Simulations of the Emergence of Grammar in Approaches to the Evolution of Language: Social and Cognitive Bases, J. R. Hurford, M. Studdert-Kennedy and C. Knight (eds). Cambridge University Press, pp 405-426.
The main aim of this paper is to provide further proof of claims that were made by the author (and others) in an earlier paper, which have been questioned in (Harvey, 1996 & 1997). Nolfi claimed in the earlier paper (Nolfi, Elman & Parisi, 1994) that learning had a beneficial effect on evolution, even when learning involved acquiring a different ability to the one acquired during evolution. More detailed information is given in this paper regarding the simulations that were carried out as proof of this claim.
The simulations use neural networks that control an “animat”, which is given the task of finding food in order to survive. The input to the neural network is sensory information on the nearest food token and the current planned action. Output is the next planned action and a prediction of the sensory state after the current planned action has been carried out. The networks consisted of four input, seven hidden and two prediction units. Backpropagation was used to train the network.
The claim that is made by Harvey is that the results shown in the first paper by Nolfi et al can be attributed to recovery from the weight perturbations caused by mutations. Nolfi demonstrates that the abilities that are lost through mutation can be restored by learning. However, he claims that this is not re-learning (as claimed by Harvey), but proof that evolution can affect learning.
Go to Nolfi’s “How Learning and Evolution Interact: The Case of a Learning Task which Differs from the Evolutionary Task”:
http://gral.ip.rm.cnr.it/nolfi/nolfipub.html
Reference details for Nolfi’s “How Learning and Evolution Interact: The Case of a Learning Task which Differs from the Evolutionary Task”:
Nolfi S. (1999 copyright 2000). How learning and evolution interact: The case of a learning task which differs from the evolutionary task. Adaptive Behavior, (7) 2:231-236.
Ñeco et al – Encoding of Sequential Translators in Discrete-Time Recurrent Neural Nets
The aim of this article was to demonstrate a strategy for stable encoding of Sequential Finite-State Translators (SFST) in discrete-time recurrent neural networks (DTRNN). The motivation for this, the author claims is that it is possible to view DTRNNs as state machines that are not finite. Previous works have shown that DTRNNs can learn Finite State Machine (FSM) behaviour, although the correct input is only demonstrated for short input strings. Incorrect behaviour for long input strings is called instability. Therefore the aim here is to provide a strategy for a scheme that can encode stability in second-order DTRNNs.
First, the author outlines the conditions for a DTRNN to act as a Mealy machine (more specifically an Extended Mealy Machine (EMM)). These conditions include the partition of the state space, representation of the input and the interpretation of the output. These conditions provide the bounding criteria for the constraints on the strategy presented by the author. The values of the next-state, output and control functions are determined by equations shown on the adjoining Equation page. The experiments that were carried out demonstrated the minimum value of variables that was required to implement the correct encoding of a Mealy machine.
The results of the experiments show that the conditions that were used to derive the constraints on the encoding strategy were valid. The DTRNNs displayed stability under these conditions. However, this strategy only applies to a very narrow group of DTRNNs – those that have continuous, positive, strictly growing and bounded activation functions.
Go to Ñeco’s “Encoding of Sequential Translators in Discrete-Time Recurrent Neural Nets”:
http://www.dlsi.ua.es/~mlf/publ_en.html
Reference details for Ñeco’s “Encoding of Sequential Translators in Discrete-Time Recurrent Neural Nets”:
Ramón P. Ñeco, Mikel L. Forcada, Rafael C. Carrasco, María Ángeles Valdés Muñoz,, "Encoding of sequential translators in discrete-time recurrent neural networks", in Proc. ESANN'99 (Bruges, Belgium, 21-24.04.1999) , pp 375-380.
Values of the weights for the next-state, output and control functions for encoding an Extended Mealy Machine (EMM) in a Discrete-Time Recurrent Neural Network (DTRNN):




Tabor – Context Free Grammar Representations in Neural Networks
The aim of this paper is to examine context free grammar (CFG) computation in Dynamical Automata (DA) that are implemented in neural networks. Previous research has focussed on neural network learning of only very simple context free grammars, and most of the networks have made use of an external stack. These simulations look at CFG computation without taking learning into account.
The Dynamical Automaton is implemented as a neural network using a combination of signalling and gating units. An affine function is used to define the state changes in the DA (for more details on the functions used in this paper, see adjoining Equations page). The grammar used in these simulations was relatively simple, although some interesting conclusions can be drawn from the results. The author was able to map where some of the simplest context free languages resided in the parameter space, which the author considers could be the start of a good navigational tool (as this method would work well with a gradient descent approach, the map of the parameter space could help move the gradient descent mechanism away from local minima).
The important concept in this paper is that fractal sets provide a method for organising recursive computations into a finite bounded state space. These results also demonstrate a parameter-space map and demonstrate its’ relevance to neural network learning.
Go to Tabor’s “Context Free Grammar Representations in Neural Networks”:
http://citeseer.nj.nec.com/189658.html
Reference details for Tabor’s “Context Free Grammar Representations in Neural Networks”:
Tabor, W., 2001, Context Free Grammar Representations in Neural Networks, draft version, submitted to NIPS.
Lawrence, Giles & Fong – Can Recurrent Neural Networks Learn Natural Language Grammars?
The issue that is examined in this paper is whether a recurrent neural network can learn a complex rule system, using a high level of discrimination (i.e. is able to distinguish grammatical strings from ungrammatical strings when the difference is very slight). The aim of the simulations in this paper was to train a neural network from scratch to demonstrate a good level of discriminatory power.
Four different types of neural networks were used in these simulations: FGS (a locally recurrent network not expected to be able to perform the learning task, and used mainly as a control case), N&P (feedback connections from all output to all hidden nodes), Elman (feedback from each hidden to all other hidden nodes) and W&Z (all nodes connected to all other nodes). The paper focuses on the Elman networks, which had two word inputs (i.e. the current and previous word), 10 hidden nodes, one hidden layer, a logistic activation function, sigmoid output functions, and a learning rate schedule taken from Moody and Darkin (see equations).
The authors were able to train the Elman network to give correct classification on the training data and an average of 74% correct generalisation on test data. As expected, the FGS was not able to represent the complexities in the language structure. However, the authors expected that the W&Z network would outperform the Elman network, however the Elman performed better on the tasks in this experiment. The authors claim this is due to the more complex error surface of the W&Z architecture.
Go to Lawrence, Giles & Fong’s “Can Recurrent Neural Networks Learn Natural Language Grammars?” on Citeseer:
http://citeseer.nj.nec.com/lawrence96can.html
Reference details for Lawrence, Giles & Fong’s “Can Recurrent Neural Networks Learn Natural Language Grammars?” :
Lawrence, S., Giles, C.L. & Fong, S., 1996, Can Recurrent Neural Networks Learn Natural Language Grammars? In Proceedings of the IEEE International Conference on Neural Networks, Piscataway, NJ, IEEE Press, pp 1853-1858.
Local output of a hidden node in a locally recurrent network calculated using:
"Search then converge" learning rate schedule takes the form:
where
is the learning
rate at time t,
is
the initial learning rate and
is
a constant.
Matlab code for learning rate schedule
Forcada – Neural Networks: Automata and Formal Models of Computation
This paper is an in depth look at neural networks from a formal computation point of view. The author examines a number of issues that mainly relate to Discrete Time Recurrent Neural Networks (DTRNNs), although the discussion is general enough that it can encompass other types of neural networks. Each major idea that is presented in this paper is completed with references to other, major papers on the topic.
First, Forcada takes a brief look at some of the history of neural networks (such as McCulloch and Pitts neural logical calculus, Mealy and Moore machines and Minsky’s neural automata). He moves onto sequence processing and the most common learning algorithms that are used with DTRNNs (mainly gradient based algorithms, although non-gradient based are also discussed).
The next two sections discuss the computational power of neural networks by looking at the most well-known tasks that DTRNNs are used for – language processing. Forcada looks at the hierarchy of grammars, and the automata that can process them. He then gives examples of DTRNNs that can simulate the different automata (and therefore perform some task on the associated grammar).
The most useful parts of this paper were the authors ability to begin with some of the simple concepts of neural network theory and move through to more complex concepts, and the discussion of the related references, which was concise but thorough.
Go to Forcada’s “Neural Networks: Automata and Formal Models of Computation”:
http://www.dlsi.ua.es/~mlf/nnafmc/
This is an unpublished draft that was downloaded on 19/05/02.
Ñeco and Forcada – Asynchronous translations with recurrent neural nets
The aim of this paper is to extend work that demonstrates the relationship between discrete-time recurrent neural networks (DTRNNs) and finite-state machines. Most of this work has focussed on deterministic finite state machines, which can only perform very limited translations. The authors extend these results on synchronous translations to asynchronous translations using an extension of a Mealy machine. The main complication in attempting to do this is that the time alignment between the input sequence and the target output sequence has to be learned by the DTRNN.
The standard Mealy machine is modified to allow translations to occur without “consuming” input. The authors built a second-order DTRNN, whose input-output pairs represented a series of extended Mealy machine translations. The extension that is made to the DTRNN so that it does not use up the input is the addition of two control lines. The training method used was a form of simulated annealing, as it was difficult to develop a cost function for use in gradient descent training (the alignment between input and output strings being ambiguous).
The next step in this area of research is to study the nature of the representations used by the DTRNNs and improve the ability of the network to generalise. Another area of interest is the ability of these extended DTRNNs to model more complex asynchronous translation tasks.
Go to Ñeco and Forcada’s “Asynchronous translations with recurrent neural nets”:
http://www.dlsi.ua.es/~mlf/publ_en.html
Reference details for Ñeco and Forcada’s “Asynchronous translations with recurrent neural nets”:
Ñeco, R.P. and Forcada, M.L., Asynchronous translations with recurrent neural nets in Proceedings of ICNN'97 (Houston, Texas, 8-12.06.1997) , vol. 4, p. 2535-2540.
The cost function that was used to train the neural network:
W is a noisy string
Es is the substitution term (see paper for description)
El is | length(W) - length(w) |
Eo is the output length
Ei is the input advance term
Forcada, Corbí-Bellot, Gori and Maggini – Neural Learning of approximate simple regular languages
Previous work has shown that discrete algorithmic methods are better than discrete time recurrent neural networks (DTRNNs) at inferring deterministic finite automata (DFA). The aim of this article is to demonstrate a method where the performance of DTRNNs can be improved. The method used changes the task of the DTRNN from learning the exact language to learning an approximate, simpler language. This is implemented by varying the error function.
The basis of the experiments reported in this paper were second order DTRNNs, which were used as string processors. A string was processed, then accepted or rejected according to the acceptance criteria (there were two: tight and loose). The training algorithm used on the DTRNN was real-time recurrent learning (RTRL). The error function was varied when the network classified clearly (even if the classification is incorrect) a pre-determined fraction of the training set. The languages that were used as training sets were Tomita’s simple languages. Two types of experiments were carried out – testing the DTRNN’s ability to recover the original language and its ability to learn an approximate language.
The DTRNNs were able to infer simple, approximate DFA’s for the languages used in the experiments, when the language did not correspond to a simple DFA. The next step is to study the stability of the DFA’s that were inferred during these experiments.
Go to Forcada et al’s “Neural Learning of approximate simple regular languages”:
http://www.dlsi.ua.es/~mlf/publ_en.html
Reference details for Forcada et al’s “Neural Learning of approximate simple regular languages”:
Forcada, M.L., Corbí-Bellot, A.M., Gori, M., Maggini, M., Neural learning of approximate simple regular languages in Proc. ESANN'99 (Bruges, Belgium, 21-24.04.1999) , p. 57-62
Dynamics of the network are defined by:
where i = 1, ..., nX and g is the sigmoid function.
The initial, supervised error function is:
where Tw is the correct target.
The final error function, which is unsupervised is:
where ε ≤ τ (ε is a small error value).
This paper extends on work that demonstrates that second order recurrent neural networks (2ORNNs) are able to infer deterministic finite automata (DFAs) when trained with both positive and negative examples. The authors extend this by attempting to show that 2ORNNs can learn DFAs from samples that also contain noisy input, when it is combined with a value that represents the degree of acceptance of the noisy string.
The 2ORNN used by the authors had A input neurons (where A is the size of the alphabet used in the simulation). The states of the hidden units are dependent on previous states. The goal of the learning process in this simulation is the optimisation of the parameters that characterise the network. The initial value of every parameter is selected randomly. The search is carried out using an adapted version of the real time recurrent learning (RTRL) method. The networks were tested using noisy strings for three regular languages.
The results demonstrated that 2ORNNs were able to learn simple DFAs, from relatively small samples. However, to conclusively prove that the networks could learn languages with noisy input, a larger sample of data was required. One conclusion that can be drawn from this paper is that recurrent networks can be used to infer DFAs from more complex (practical) pattern recognition problems than previously believed.
Go to Carrasco & Forcada’s “Second-Order Recurrent Neural Networks can Learn Regular Grammars from Noisy Strings”:
http://www.dlsi.ua.es/~mlf/publ_en.html
Reference details for Carrasco & Forcada’s “Second-Order Recurrent Neural Networks can Learn Regular Grammars from Noisy Strings”:
Carrasco, R.C., Forcada, M.L., 1995, Second-order recurrent neural networks can learn regular grammars from noisy strings in (eds) José Mira and Francisco Sandoval, From Natural to Artificial Neural Computation (Proceedings of the International Workshop of Artificial Neural Networks) , pp 605-610.
New states of the hidden neurons are determined using the following function, where g is the sigmoid function:
and
where Cija are the real valued weights, and I[t] is the input vector.
The aim of this article was to demonstrate a method used to design Feed-Forward Artificial Neural Networks (FFNs) that would be able to solve some of the more difficult optimisation problems. The model described in the paper is called the Building Blocks in Cascades Algorithms (BBC-Algorithm), which is based on Lindenmayer Systems (L-Systems).
The network structure is represented as an adjacency matrix. The matrix is designed using a stochastic L-System, which avoids the redundancy introduced by some other methods (such as using a deterministic context-free L-System). The networks created using the BBC-Algorithm were developed using an evolutionary algorithm. The fitness of an individual was a function of the hamming-distance between the target-matrix and the matrix of the evolved network. Back-propagation was used as the learning rule. The networks were tested on the 3 Dimensional parity problem.
The work that the authors are currently carrying out in this area is to extend the parity problem into 5, 7 and 10 dimensions. The conclusion the authors reach is that the BBC-Algorithm used in conjunction with an evolutionary algorithm can be used as a universal network generator. However, their results are not extensively presented nor discussed, so it is hard to verify if this claim is substantiated or not.
Reference details for Voigt, Born & Santibáñez-Koref’s “Design of Artificial Neural Networks with a Stochastic Graph-L-System”:
Voigt, H.-M., Born, J., Santibáñez-Koref, I., 1993, Design of Artificial Neural Networks with a Stochastic Graph-L-System, IEEE Colloquium on Grammatical Inference: Theory, Applications and Alternatives, 18, pp 1-8.
Stanley & Miikkulainen – Efficient Evolution of Neural Network Topologies
The aim of this article was to demonstrate the efficiency of an algorithm for evolving neural networks, called NeuroEvolution of Augmenting Topologies (NEAT), which was developed by the authors. The algorithm uses a genetic algorithm to evolve artificial neural networks. Previous results are summarised briefly in this paper, which demonstrate that the algorithm is more effective than other NeuroEvolution algorithms. The focus of this paper is to show that all the component parts of the NEAT algorithm are necessary for it to perform effectively.
The NEAT algorithm was tested on the pole balancing task, considered to be a benchmark reinforcement learning task. The pole balancing task is adjustable from easy to very difficult, which can prove useful in demonstrating the power of a network. According to the results shown, NEAT outperforms all other network systems. The authors then carried out “ablation” testing, which consists of removing parts of NEAT to demonstrate the effective of the algorithm without these components. In all cases, NEAT performed significantly worse with components missing.
The interesting features of NEAT are the use of historical information as the basis for the genetic algorithm, used to minimise the amount of searching that is required, and the use of speciation, which helps to maintain diversity in the population, preventing one strain dominating at the expense of others. If the results reported briefly in this paper are correct, NEAT appears to be a well-constructed, efficient NeuroEvolution algorithm.
Reference details for Stanley & Miikkulainen’s “Efficient Evolution of Neural Network Topologies”:
Stanley, K.O. & Miikkulainen, R., Efficient Evolution of Neural Network Topologies in Proceedings of the 2002 IEEE Congress on Evolutionary Computation (CEC ’02), Honolulu, Hawaii, to appear.
Equations from the article:
The compatibility distance, delta, of different structures is a linear combination of the number of excess (E) and disjoint (D) genes and the average weight of matching genes (W).
Species grow or shrink depending on whether their average adjusted fitness is above or below the population average.
LaMothe (unpublished):
Neural network input activation function:
where X1 - Xn are the inputs, wi are the weights on the inputs and B is the bias node, weighted by b.
Matlab code for input activation function
Weight matrix for a Hopfield network:
Input vector I is bipolar (values are 1 or -1) and contains k elements. The weight matrix is square with dimensions k by k.
