Extracting key-value pairs using XQuery by mraziz

I have written a function in XQuery that takes a key as an argument and returns that key's value. It can be applied to unformatted text.

I have defined separators (of pairs) as a regular expression pattern that would match all whitespace in addition to commas:

[ \t\r\n,]+
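As a quick illustration, here is a minimal sketch of what splitting on that separator pattern does. The input string is made up for the example, and `&#10;` is just a newline written as a character reference:

```xquery
xquery version "1.0";

(: Hypothetical input: pairs separated by a comma, spaces and a newline. :)
let $text := 'host=example.org, port=8080&#10;user="admin"'

(: Any run of whitespace and/or commas counts as a single separator. :)
return fn:tokenize($text, '[ \t\r\n,]+')

(: yields three items:  host=example.org   port=8080   user="admin" :)
```

Note that if the text starts or ends with a separator, fn:tokenize also returns an empty string at that end, so the items should still be filtered before use.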

And then I applied another regular expression to extract the key and value parts out of each pair:

([a-zA-Z0-9\.]+)=("?[a-zA-Z0-9\.\-=]*"?)

This pattern matches alphanumeric keys (plus dots) and alphanumeric values (plus dots, dashes, and equals signs) that can optionally be quoted.
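To see the two capture groups at work, here is a small sketch that applies the pair pattern to a few made-up items. The sample items and the ' -> ' formatting are assumptions for the example only:

```xquery
xquery version "1.0";

(: Group 1 captures the key, group 2 the (optionally quoted) value. :)
let $pair := '([a-zA-Z0-9\.]+)=("?[a-zA-Z0-9\.\-=]*"?)'

(: Hypothetical items, as they would come out of the tokenizing step. :)
for $item in ('port=8080', 'user="admin"', 'noequals')
where fn:matches($item, $pair)
return fn:concat(fn:replace($item, $pair, '$1'), ' -> ', fn:replace($item, $pair, '$2'))

(: yields two items:  port -> 8080   user -> "admin" :)
```

The item without an equals sign is skipped, and a quoted value keeps its quotes because they sit inside the second group.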

The full function is as follows:

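declare function xf:GetValue($arg as xs:string, $key as xs:string)
    as xs:string* {
        (: '"' and '=' are written without a backslash: they are not valid escapes in XPath regular expressions :)
        let $regexp := '([a-zA-Z0-9\.]+)=("?[a-zA-Z0-9\.\-=]*"?)'
        (: split the text into candidate pairs and drop duplicates :)
        for $item in fn:distinct-values(fn:tokenize($arg, '[ \t\r\n,]+'))
        where fn:matches($item, $regexp)
        (: return the value part only when the key part equals $key; nothing otherwise :)
        return if ($key = fn:replace($item, $regexp, '$1')) then
            fn:replace($item, $regexp, '$2') else ()
};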

It can be changed to adapt to different scenarios. I believe the version above covers the majority of cases. One exception is a format where pairs are glued to each other, something like key1=valueKey2=value. In that case we need one or more additional patterns to tell where one pair ends and the next begins.

Now, my function isn't optimized for performance. Let's say the text we are searching contains n pairs and we want to find k of them. Each call scans all n pairs, so the total complexity is O(kn), which is linear in n only when k is a small constant. What we can do instead is to parse the text once and populate a balanced binary search tree, and then look each key up in the BST. Building the tree costs O(n.log(n)) and each lookup costs O(log(n)), so the complexity in this case will be O((n+k).log(n)).
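The parse-once idea can be sketched as follows, with one substitution: instead of a hand-built BST it uses the built-in map type, which assumes an XQuery 3.1 processor. The helper name local:index and the sample text are hypothetical, and the lookup cost of a map depends on the implementation.

```xquery
xquery version "3.1";
declare namespace map = "http://www.w3.org/2005/xpath-functions/map";

(: Hypothetical helper: one pass over the text, one map entry per key. :)
declare function local:index($arg as xs:string) as map(xs:string, xs:string) {
    let $regexp := '([a-zA-Z0-9\.]+)=("?[a-zA-Z0-9\.\-=]*"?)'
    return map:merge(
        for $item in fn:tokenize($arg, '[ \t\r\n,]+')
        where fn:matches($item, $regexp)
        return map:entry(fn:replace($item, $regexp, '$1'),
                         fn:replace($item, $regexp, '$2')),
        (: if a key occurs more than once, keep its first value :)
        map { 'duplicates': 'use-first' }
    )
};

(: Build the index once, then look up as many keys as needed. :)
let $pairs := local:index('host=example.org, port=8080 user="admin"')
return ($pairs('port'), $pairs('user'))

(: yields two items:  8080   "admin" :)
```

A lookup for a key that is not in the map returns the empty sequence, which matches what the scanning version does when no pair has that key.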