In previous articles I have talked about using flex & bison
to generate a parser. However, such parsers are known to suffer from a series of problems. The first is that they are almost impossible to debug and test. In practice, we want to test our parser in small units and thus can locate bugs precisely. Another common issue is let parser generator to generate code makes compiling more complex (especially for languages like C++
) and the user is forced to learn their own APIs which can be very tedious.
Parser combinator is a Monad
in a nut shell, which also implies it is an instance of Functor
and Applicative
. It allows us to combine small and simple parsers into large and complex parsers and does not require to write any syntax/grammar files. For example, in order to recognize an identifier with parser combinators, we may do:
charP :: (Char -> Bool) -> Parser Char
charP = undefined -- detail omitted here
repeatP :: Parser a -> Parser [a]
repeatP = undefined --detail omitted here
identP :: Parser String
identP = liftA2 (:) startP $ repeatP bodyP
where bodyP = charP $ not . isSpace
startP = charP isLetter
Like most Monad
s, let us start by defining a Parser
as an wrapped lambda function that turns a String
into a tuple of the remained string and parsed data (of type a
).
Note: we will write a simple BaseParser
that does not handle errors at all. The real Parser
will be defined later.
newtype BaseParser a = BaseParser
{runBaseParser :: String -> (String, a)}
Now let us make it a Functor
.
instance Functor BaseParser where
fmap f p = BaseParser $ \s ->
let (s', a) = runBaseParser p s
in (s', f a)
As you may see, we are now able to transform BaseParser a
into BaseParser b
, yet we still can not combine two parser which is required to parser context free grammar (CFG). And that is why we must make it an Applicative
. Note that although Applicative
is enough to handle any context free grammar (CFG), we may be interested in something more powerful in order to parse context sensitive grammar (CSG). This requires us to make our Parser
a Monad
.
{-# LANGUAGE TupleSections #-}
import Control.Monad (ap)
instance Applicative BaseParser where
pure a = BaseParser (,a)
(<*>) = ap
instance Monad BaseParser where
return = pure
p >>= f = BaseParser $ \s ->
let (s', a) = runBaseParser p s
in runBaseParser (f a) s'
And that is it, you now have a working parser combinator with about 20 lines of code.
Our BaseParser
definition says nothing about error handling, but it doesn't means we can not handle errors outside BaseParser
with a monad transformer.
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
import Control.Monad.Except (ExceptT)
newtype Parser a = Parser (ExceptT String BaseParser a)
deriving (Functor, Applicative, Monad)
And that is all the codes we need.
Applicative
and Monad
allows us to combine several parser together so that we can parse a sequence. But what about "vertically" combine? If we are parsing something like json
, we probably want to try to parse a number or a string or an array together. For such requirement, we would introduce a new operator for Parser
.
import Control.Monad.Except (ExceptT (ExceptT))
import Control.Monad.Trans.Except (runExceptT)
parser :: (String -> (String, Either String a)) -> Parser a
parser = Parser . ExceptT . BaseParser
runParser :: Parser a -> String -> (String, Either String a)
runParser (Parser p)= runBaseParser . runExceptT $ p
(<|>) :: Parser a -> Parser a -> Parser a
p1 <|> p2 = parser $ \s ->
case runParser p1 s of
(_, Left _) -> runParser p2 s
a -> a
Now our parser is ready to parse anything. I here by gives some examples to illustrate how to use it. You can also check out my Seml Github Repo which parse S-Expressions into XML.
Start with a parser that can parse only one character.
charP :: (Char -> Bool) -> Parser Char
charP f = parser $ \s ->
case s of
(x : xs)
| f x -> (xs, Right x)
| otherwise -> (s, Left "Invalid Char"")
[] -> (s, Left "Eof Error")
isP :: Char -> Parser Char
isP c = charP (== c)
Now we would like to parse an identifier. To do this we need to repeat charP
for n
times.
import Control.Applicative (liftA2)
import Data.Char (isAlphaNum, isLetter)
someP :: Parser a -> Parser [a]
someP p = liftA2 (:) p (manyP p)
-- Be careful here we do not want to stuck
-- on an empty string
manyP :: Parser a -> Parser [a]
manyP p = terminateP <|> someP p <|> return []
terminateP :: Parser [a]
terminateP = parser $ \s ->
case s of
[] -> ([], Right [])
_ -> (s, Left "")
-- | Repeat a parser
repeatP :: Parser a -> Parser [a]
repeatP = manyP
-- | Recognize an identifier
identP :: Parser String
identP = liftA2 (:) (charP isLetter) $
repeatP $ charP isAlphaNum
Note that even though we have handled empty string here, repeat a parser that accepts an empty string will still cause an infinite loop. Be aware!
Ok, we can parse an identifier. But what about parse a quoted string?
-- | Parse something between other things
betweenP :: Parser l -> Parser r -> Parser mid -> Parser mid
betweenP l r m = l *> m <* r
-- Simplified escape handling, we just return what ever
-- character that is escaped and do not translate
-- `\n` into newline or etc.
escapeSP :: Parser Char
escapeSP = isP '\\' *> charP (const True)
quotedP :: Parser String
quotedP =
let ssP = repeatP (escapeSP <|> charP (/= '\''))
dsP = repeatP (escapeSP <|> charP (/= '"'))
singleP = betweenP (isP '\'') (isP '\'') ssP
doubleP = betweenP (isP '"') (isP '"') dsP
in singleP <|> doubleP
Say we have a list of words separated by some random spaces. How do we parse such things?
-- | Repeat parser with separator
sepByP :: Parser s -> Parser a -> Parser [a]
sepByP ps pa = liftA2 (:) pa (repeatP (ps *> pa))
spaceP :: Parser String
spaceP = repeatP (charP isSpace)
wordsP :: Parser [String]
wordsP = sepByP spaceP identP