Pattern Matching vs Lexical Grammar Specification
Perm url with updates: http://xahlee.org/cmaci/notation/pattern_matching_vs_pattern_spec.html
Pattern Matching vs Lexical Grammar Specification
Amazon Kindle. Read books under the sun. Review
Xah Lee, 2008-05-01
This page is a personal note on the relation of pattern matching and lexical grammar specification, and their application in formal logic. Pattern matching here can mean textual pattern matching such as regex or list structure and datatype pattern matching such as in Mathematica.
A Need For Lexical Grammar Spec Lang
In computing, there's pattern matching. e.g. those in Mathematica, for pattern matching list structure and datatypes ( Mathematica patterns ) , and those in perl, for pattern matching strings. These pattern matching domain-specific languages are designed to match patterns, but i have realized, that they are rather unfit, when used as a mean to specify textual forms. In short, what i came to understand is that pattern matching facilities (may it be regex for strings or Mathematica's for symbolic expressions), although powerful, but is not capable of specifying formal grammar, even very simple ones. (this seems obvious now, when stated as above, but it was a realization for me.)
Often, when writing doc or in verbal communication, on computer programing or doing math with Mathematica, you are tempted to use a pattern to communicate a particular form a expression may have. For example, you would like to say that email address has the form xyz, where xyz is a perl regex:
Similarly, you might use regex to communicate the form of a url, domain name, file path. In math (especially in formal logic or in computer algebra systems), i often want to say that certain function has the form xyz, where xyz is a Mathematica pattern that indicates the function's parameters, output, and their types. (For example: a computer language expression analogous to traditional notation 「f:ℂ²→ℝ³」). However, in practice, using regex or Mathematica's patterns for the purpose of specifying a form, is often unsatisfactory. Typically, using patterns only gives a general idea, but isn't a correct specification of the form. For the purpose of giving a general idea, verbal English description is almost as good. Almost.
For example, in the official doc for E-mail address (RFC 2822), it does not use a regex to specify email address's syntax. Instead, it uses a variant of BNF mixed with many pages of explanations, for human reading.
Here's the regex for email address as specified in RFC 2822, from Source:
The regex is 426 chars long and not human unreadable.
It is quite desirable to have a simple language for specifying syntax, that can be parsed by a computer, yet designed in a human-readable way, and concise. With such a language, we could use it to verify if a text is a valid form. We could use it for human communication. Conceivably, it could replace regex for string patten matching or much enhance pattern matching as used in Mathematica or other functional langs.
Pattern Matching vs Grammar Specification
Pattern matching such as regex, takes a spect and works on a given text. Grammar spec lang such as BNF goes the other way: it takes a spect to generate all possible texts.
This brings the question: to what degree, pattern matching language and grammar specification language are orthogonal? That is: are these simply two ways of looking at the same idea? Can they be unified (in theory or practice)? This question, can be considered with respect to computer science of formal languages field. And, it can be also considered with respect to practical programing, as in tools to parse languages, regex to match strings, pattenrs to match text structures, or text generator tools to generate possible strings (such as math formula syntax, email syntax, url syntax, programing lang syntax).
One grammar specification lang is BNF. There are various (incompatible) versions and extensions of BNF, and BNF is mostly used for humans reading, in the same way that traditional math notation is a language for human consumption, not for machine to process. The machine readable versions used by parsers each has its own ad hoc incompatible extensions. As a human-to-human lang, it is imprecise and filled with errors. Is there a general purpose, widely used, BNF-like language that is designed to run as executable program, but also has qualities for human reading?
If we look at the top 20 most popular computer languages used today (e.g. Java, Perl, Visual Basic, SQL, ... lisp), most of them do not have a language specification or anything close to a published grammar. For those that have some form of spec, perhaps 95% of them are specified in a form for human to read (as opposed to machine readable parser lang), and are extremely complex and imprecise. For examples: Java with Java Language Specification, Python with its “The Python Language Reference”, Perl, PHP, have no grammar spec. Even the functional lang Haskell, does not have a formal grammar spec.
Take HTML for example, and even the much simpler XML's official specifications of its syntax by W3C at http://www.w3.org/TR/xml11/, are wildly complex with dense and verbose descriptions, in tens of pages. (as far as i know, there is no parser based specification for as simple a language as XML.)
The grammar of XML is conceptually very simple. It seems something is lacking, by the fact that its Lexical grammar is not specified by a computer language. Conceivable it would be within 50 lines of such a language, the essence would probably just 10 lines.
The above issues in computer science are roughly in the fields of parsing, pattern matching, formal language theory.
I'll be spending time studying parsing in the near future to fill my void in this important topic. As of now, my impression of yacc and lex is that they don't seem to be the system i'm looking for. For example, apparently, SGML, XML, do not use them to specify the grammar. Nor, most languages i know of. They seem just to be a very technical tool and not something general enough to be used for human communication as well in wide number of computer languages.
Parsing Expression Grammar!
I have found what i wanted!! See Parsing expression grammar.
There are 2 Emacs Lisp implementations:
- http://www.emacswiki.org/cgi-bin/wiki/ParserCompiler (2008), by Mike Mattie.
- http://www.emacswiki.org/emacs/ParsingExpressionGrammars (2008), by Helmut Eller.
Some Wikipedia learning and notes:
BNFs, CFL, Generative Grammar
The Backus–Naur form has several variants used today. The original BNF discussed in 1959 is hardly used today. Major variants are, perhaps in order of popularity:
- Extended Backus-Naur form (EBNF). Standardized by ISO-14977.
- Augmented Backus-Naur form (ABNF). Used by IETF. Specified in RFC 4234.
- Wirth Syntax Notation (WSN)
BNF and variants are so-called Metasyntax. Metasyntax means roughly a syntax that is used to describe the formal grammar of other langs. In particular, the BNF (and variants, henceforth BNFs) are used to describe a class of language called Context-free languages (CFL).
Context-free langs are those can be described using Context-free grammar, and context-free grammar is roughly those that can be described by a set of certain transformation rules (often called production rules). BNFs are syntaxes for these production rules.
A grammar (formal grammar), is a high-level (typically human-level) spec that describes a language's possible strings (i.e. the source code). There are many classes of grammars. One classification is whether they are context-free. And, of the context-free ones, there are few ways to describe them. One way, by listing several transformation rules, is called Generative grammar. Another way, by mathematical qualifications, is called Analytic grammar. (So, for example, BNFs are used for generative grammar.)
As far as i know, none of the following examples are written for machine to parse. Also, the BNF extention is non-standard.
- Java in BNF (educational example only)
- Lisp in BNF (educational example only)
- Scheme Lisp BNF from R5RS (arbitrary BNF)
- Python grammar from Python Reference Manual (2.5.2; 2008). Arbitrary BNF.
Parsing can be broken down to few steps.
Lexical analysis, which breaks the text stream into tokens. Then, syntactic analysis organize the stream of tokens into a parse tree.
In lexical analysis, it can be separated into 2 steps: scanner and tokenizer.
formal grammar specifies the all possible string of a language.
Lexical grammar specifies the syntax of tokens. For example, XML is specified in lexical grammar. (since here is not one specific language, but a lexical structure that can spawn off new langs)
Once a formal grammar is given, the reverse problem of determining whether a string belongs to that lang, is theoretically studied in a field named Automata theory, which is how regex originated from.
Some links to check out:
- Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...) (January 2007), by Russ Cox. http://swtch.com/~rsc/regexp/regexp1.html.
- “Parsing Expression Grammars: A Recognition-Based Syntactic Foundation” (2004-01-14), by Bryan Ford, MIT. (simple slide based tutorial) http://www.brynosaurus.com/pub/lang/peg-slides.pdf
Origin Of My Interest
PS what lead me to wrote this article, is this: i was study algebraic curves 《elementary geometry of algebraic curves》 amazon , by C G Gibson. While reading the book, in conjunction of working on my Visual Dictionary of Special Plane Curves website, and also my long time interest in formal mathematics (see The Codification of Mathematics), i wanted to express the definition and theorems of algebraic curves as a sequence of symbols. (and hopefully show proofs as derivation of these symbol strings according to some given rules. In other words, my wish is to put the human element out)
In other words, i was foraging into formal systems. This is when i EXPLICITLY realized a long experience, that pattern matching as by regex or Mathematica isn't sufficient to specify even simple forms (i.e. grammar). This got me started to study formal languages, grammar, BNF, CFL, PEG, parsers, automata, etc. I have actually never studied these subjects. (I recall, back in 1997, i was naively trying to devise production rules for the Mathematica language. I was, effectively, devising BNF for Mathematica's FullForm (which is just nested list like lisp).)
Another reason that i dived into the subject, is that for the past 2 or so years i've been wanting to have a lisp source code reformatter in emacs based on a lexical analysis. (See: A Simple Lisp Code Formatter.) Getting acquainted with parser theory will help me.