2013-03-07

De-Kuklewiczing Haskell's Experimental Regex Packages

Consider this a study journal about me trying to figure out how to use the 'regex-base' front-end for regular expressions in Haskell. (Backstory: an earlier mid-level dive into Text.Regex.PCRE.ByteString.)

The focus of this endeavour to figure out how to use Haskell now turns to the work of the author of 'regex-base', 'regex-pcre', and 'regex-tdfa'. Unfortunately for Haskell noobs like me, most of his documentation for these packages assumes that the reader is proficient at reading abstract (read: vague) hints about manipulating Haskell's type system. No recipes are provided on an "it just works, with a low POSSIBLE* variance of interpretation" basis.

(* in the modal logical sense, sigh *sic*.)

I'm really trying to figure this out - after all, it seems like he's set up quite a robust system. Below, expect plenty of redundancy with similar Q&A on SO and other sites. Hopefully by the time I'm done with this exploration, we'll be left with some sort of canonical summary for dummies (God forbid the APIs then change again...).

Of special interest is the manipulation of ByteString values, as such manipulations are usually much faster than the equivalent String manipulations (I'm assuming you know the difference between these types in Haskell).

Let me begin by listing all the relevant resources I've run into over the past week of this.

Well, here's a basic working example with Text.Regex.TDFA and String:
import Text.Regex.TDFA
temp = 
  getAllTextMatches ("foo" =~  "o" :: AllTextMatches [] String)
Here's a similar example with Text.Regex.PCRE and String:
import Text.Regex.PCRE
temp = 
  getAllTextMatches ("foo" =~  "o" :: AllTextMatches [] String)
And, adding one module and changing a type hint in each, lets us use ByteString. Here's the TDFA example:
import Text.Regex.TDFA
import Data.ByteString.Char8 (ByteString, pack)
temp = 
  getAllTextMatches ((pack "foo") =~ (pack "o") :: AllTextMatches [] ByteString)
Likewise, the PCRE example:
import Text.Regex.PCRE
import Data.ByteString.Char8 (ByteString, pack)
temp = 
  getAllTextMatches ((pack "foo") =~ (pack "o") :: AllTextMatches [] ByteString)
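The "type hint" is doing all the work in these examples: the same (=~) call returns a different shape of result depending on the type you ask for. Here is a minimal sketch of that, assuming regex-tdfa is installed and that the result-type instances are the ones I believe regex-base provides. The names hasMatch, firstMatch, matchCount and splitAround are made up for illustration.

```haskell
import Text.Regex.TDFA

-- Does the pattern match anywhere in the input?
hasMatch :: Bool
hasMatch = "foo" =~ "o"

-- Text of the first match only.
firstMatch :: String
firstMatch = "foo" =~ "o"

-- Number of non-overlapping matches.
matchCount :: Int
matchCount = "foo" =~ "o"

-- (text before, first match, text after).
splitAround :: (String, String, String)
splitAround = "foo" =~ "o"
```

Loading this in GHCi and evaluating each name is a quick way to see what a given annotation buys you before reaching for the AllTextMatches wrapper.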
That should get us started. I am going to bed now... this research and writing will continue later.

2012-03-08:

Teardown of AllTextMatches usage

.. specifically, along with ByteString and PCRE.
import Text.Regex.PCRE
import Data.Array
import Data.ByteString.Char8 (ByteString, pack)

main = return $
  -- all expressions returned by the functions below are (ByteString)s

  {-
  getAllTextMatches 
  ((pack "abcdebxcfgfbycijk") =~ (pack "(b).*?(c)") 
    :: AllTextMatches (Array Int) (Array Int ByteString))
  --  An Array of: 
  --    Arrays, 
  --    containing all matched expressions, 
  --    and their matched subexpressions -}

  {-
  getAllTextMatches 
  ((pack "abcdebxcfgfbycijk") =~ (pack "(b).*?(c)") 
    :: AllTextMatches [] (Array Int ByteString))
  --  A List of: 
  --    Arrays of:
  --      matched expressions, 
  --      and their matched subexpressions -}

  {-
  getAllTextMatches 
  ((pack "abcdebxcfgfbycijk") =~ (pack "(b).*?(c)") 
    :: AllTextMatches (Array Int) [ByteString])
  --  An Array of: 
  --    Lists of: 
  --      matched expressions, 
  --      and their matched subexpressions -}

  {-
  getAllTextMatches 
  ((pack "abcdebxcfgfbycijk") =~ (pack "(b).*?(c)") 
    :: AllTextMatches (Array Int) ByteString)
  --  An Array of: 
  --    matched expressions -}

  {-
  getAllTextMatches 
  ((pack "abcdebxcfgfbycijk") =~ (pack "b.*?c") 
    :: AllTextMatches [] ByteString)
  --  A List of: 
  --    matched expressions -}

  --{-
  getAllTextMatches 
  ((pack "abcdebxcfgfbycijk") =~ (pack "(b).*?(c)") 
    :: AllTextMatches (Array Int) (MatchText ByteString))
  -- An Array of: 
  --  (MatchText)s, i.e. Arrays of:
  --    matched expressions, 
  --        with their (MatchOffset)s 
  --        and their (MatchLength)s
  --    and their matched subexpressions 
  --        with their (MatchOffset)s 
  --        and their (MatchLength)s -}
It turns out that, as suspected, there's just too much going on in K's giant type signatures. This is complicated further by the use of the Array type, which is printed with round and square brackets, as if it were composed only of tuples and lists, while being subject to further conventions that require a reading of the documentation of (Array). (Actually, arrays are implemented completely differently from mere ordinary tuples and lists.) Furthermore, K exports all sorts of utility variations for formatting the output of each function... lists-of-arrays, arrays-of-arrays, arrays-of-lists, etc., all representing the same data in different structures. Very muddy. Nevertheless, I guess he's done a good deed by writing the general libraries for all of us.
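To make the bracket confusion concrete, here is a small sketch using only Data.Array. The value demo is a hypothetical stand-in for one of the arrays the regex functions hand back; it is not produced by any regex call.

```haskell
import Data.Array

-- A two-element Array indexed from 0, shaped like a match with one group.
demo :: Array Int String
demo = listArray (0, 1) ["bc", "b"]

-- How GHCi prints it: the bounds as a tuple, then a list of (index, value) pairs.
printed :: String
printed = show demo

-- The usual ways to get back to plain values.
lowHigh :: (Int, Int)
lowHigh = bounds demo

groupOne :: String
groupOne = demo ! 1

asList :: [String]
asList = elems demo
```

So the tuples and lists in the printed form are just the Show output (bounds plus associations); the array itself is accessed with (!), bounds and elems.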

And then you've got the types (MatchArray) and (MatchText), which are woefully, arbitrarily named, despite their underlying simplicity and similarity.
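The similarity is easier to see side by side. Below is a sketch under a few assumptions: it uses regex-tdfa rather than PCRE, so the lazy pattern "(b).*?(c)" is swapped for the POSIX-friendly "(b)[^c]*(c)" (as far as I know TDFA has no lazy quantifiers), and the helper names forgetText and groupTexts are mine, not the library's. The two type synonyms in the comment are how I read Text.Regex.Base.RegexLike.

```haskell
import Data.Array (elems)
import Text.Regex.TDFA

-- As I read Text.Regex.Base.RegexLike:
--   type MatchArray       = Array Int (MatchOffset, MatchLength)
--   type MatchText source = Array Int (source, (MatchOffset, MatchLength))
-- In both, index 0 is the whole match and 1.. are the capture groups.

sample :: String
sample = "abcdebxcfgfbycijk"

-- POSIX-style stand-in for the PCRE pattern used earlier in this post.
pat :: String
pat = "(b)[^c]*(c)"

-- Every match, with matched text and position.
withText :: [MatchText String]
withText = matchAllText (makeRegex pat :: Regex) sample

-- Every match, positions only.
positionsOnly :: [MatchArray]
positionsOnly = matchAll (makeRegex pat :: Regex) sample

-- Hypothetical helper: dropping the text from a MatchText leaves a MatchArray.
forgetText :: MatchText source -> MatchArray
forgetText = fmap snd

-- Hypothetical helper: keep only the matched strings.
groupTexts :: MatchText source -> [source]
groupTexts = map fst . elems
```

In other words, a MatchText is a MatchArray with the matched text attached to each entry, which is why `map forgetText withText` and `positionsOnly` should agree.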

The class (Extract) in Text.Regex.Base.RegexLike really shouldn't be exposed at the same layer as the matching functions. :( I'm thinking that it should belong in some (.Internals or .Utilities) module instead.
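For context, this is roughly what (Extract) gives you: slicing a source by an (offset, length) pair, which is the same pair the matching functions report. A small sketch, again assuming regex-tdfa and a POSIX-friendly pattern; the names slice, firstSpan and firstText are only for illustration.

```haskell
import Text.Regex.TDFA

-- extract takes (offset, length) and returns that slice of the source.
slice :: String
slice = extract (1, 3) "abcdef"

-- Asking (=~) for an (offset, length) pair gives the position of the first match.
firstSpan :: (MatchOffset, MatchLength)
firstSpan = "abcdebxcfgfbycijk" =~ "b[^c]*c"

-- Feeding that pair back into extract recovers the matched text.
firstText :: String
firstText = extract firstSpan "abcdebxcfgfbycijk"
```

It is a plain utility for cutting up strings, which is the argument for filing it away from the matching API proper.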


2012-03-11

Done. Figured out the Kuklewicz code, at least at the level of using his utility functions. Will have to tidy this post up later, if ever at all.

WIP

