Fast, Functional Text Mining: Rosie Pattern Language
April 27th 2019 09:00 - 09:45
Regular expressions (regex) have limitations, some well known and some subtle. RPL replaces regex for mining unstructured text, is easy to write and maintain, and has a fast match engine. Under the covers, RPL expressions are combinators; these pure functions compile to instructions for a “matching virtual machine”.
Regular expressions are everywhere, including in the inner loops of most data mining code. But they don’t scale! Almost every implementation uses backtracking, which takes exponential time in the worst case and can stall mining of big data, where input anomalies are likely. Building collections of regex is fraught as well, because they don’t compose. Perhaps most importantly, regex don’t scale to teams of people, because they are famously hard to read, understand, and maintain.
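The backtracking problem described above can be reproduced with a small sketch. It uses Python's standard `re` module, which is a backtracking engine; the pattern `(a+)+$` and the helper name `time_failed_match` are illustrative choices, not taken from the talk. Each extra "a" in the input should roughly double the time needed to report a failed match, so the input sizes are kept small.

```python
import re
import time

# Nested quantifiers: a run of "a"s can be split between the inner and outer "+"
# in exponentially many ways, and a backtracking engine tries them all before
# it can report that the trailing "b" makes the match fail.
PATHOLOGICAL = re.compile(r"(a+)+$")


def time_failed_match(n: int) -> float:
    """Return the seconds spent failing to match n "a"s followed by one "b"."""
    text = "a" * n + "b"
    start = time.perf_counter()
    result = PATHOLOGICAL.match(text)
    elapsed = time.perf_counter() - start
    assert result is None  # the anomalous "b" guarantees failure
    return elapsed


if __name__ == "__main__":
    # Small sizes only: the work grows very quickly with n.
    for n in (12, 16, 20):
        print(f"n={n:2d}  {time_failed_match(n):.4f} s")
```

A single malformed record of this shape in a large input is enough to stall a mining job that otherwise runs quickly.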
The Rosie Pattern Language (RPL) addresses all of these scale challenges: big data is processed in time linear in the input size; packages of composable patterns are easily shared; and the language has a readable syntax, with named patterns, flexible whitespace, and comments, much like a programming language.
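The idea of patterns as composable pure functions can be sketched in a few lines of Python. This is not RPL syntax and not Rosie's implementation: the names `lit`, `char_in`, `seq`, `choice`, `many1`, and `dotted_quad` are hypothetical, and the non-backtracking repetition is an assumption made for the illustration. The sketch only shows how small named patterns combine into larger ones without editing any pattern text.

```python
from typing import Callable, Optional

# A pattern is a pure function: (text, position) -> end position, or None on failure.
Pattern = Callable[[str, int], Optional[int]]


def lit(s: str) -> Pattern:
    """Match the literal string s."""
    return lambda text, i: i + len(s) if text.startswith(s, i) else None


def char_in(chars: str) -> Pattern:
    """Match one character drawn from chars."""
    return lambda text, i: i + 1 if i < len(text) and text[i] in chars else None


def seq(*patterns: Pattern) -> Pattern:
    """Match every pattern in order, each starting where the previous one ended."""
    def run(text: str, i: int) -> Optional[int]:
        pos: Optional[int] = i
        for p in patterns:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return run


def choice(*patterns: Pattern) -> Pattern:
    """Ordered choice: return the result of the first pattern that matches."""
    def run(text: str, i: int) -> Optional[int]:
        for p in patterns:
            end = p(text, i)
            if end is not None:
                return end
        return None
    return run


def many1(p: Pattern) -> Pattern:
    """One or more repetitions of p; consumes as much as it can and never gives any back."""
    def run(text: str, i: int) -> Optional[int]:
        end = p(text, i)
        if end is None:
            return None
        while True:
            nxt = p(text, end)
            if nxt is None or nxt == end:
                return end
            end = nxt
    return run


# Named patterns compose like ordinary values.
digits = many1(char_in("0123456789"))
dot = lit(".")
dotted_quad = seq(digits, dot, digits, dot, digits, dot, digits)
sign = choice(lit("+"), lit("-"))
signed_int = seq(sign, digits)
```

For example, `dotted_quad("10.0.0.1 rest", 0)` returns the position just past the last digit, while `dotted_quad("10.0.0", 0)` returns `None`.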