Tuesday, September 7, 2010

Split and join (in Javascript)

I want to take a quick look at splitting and joining text using JavaScript.

Splitting

Suppose you want to split some text. JavaScript (like many other languages) makes this very easy:

  ',,a,,,,b,,c,,'.split(/,/) // case (I)
  => ["", "", "a", "", "", "", "b", "", "c", "", ""]

Or

  ',,a,,,,b,,'.split(/,+/) // case (II)
  => ["", "a", "b", ""]

You can recreate the string for (I) using the built-in join function:

  ',,a,,,,b,,c,,'.split(/,/).join(',')
  => ',,a,,,,b,,c,,'

Case (II) can't be put back the way it was: /,+/ matches a variable number of characters (commas in this case), and split discards the matches, so at any given joining point we no longer know how many commas were there.
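To make the difference concrete, here is a small sketch. It also shows one possible workaround that is not part of the approach in this post: when the separator regex contains a capturing group, split keeps the matched separators in the result, so the string can be rebuilt. The variable names are illustrative only.

```javascript
var text = ',,a,,,,b,,';

// Case (II): the separators are discarded, so their lengths are lost.
var lossy = text.split(/,+/);        // ["", "a", "b", ""]
var rejoined = lossy.join(',');      // ',a,b,' (not the original string)

// With a capturing group, split keeps each matched separator in the result.
var kept = text.split(/(,+)/);       // ["", ",,", "a", ",,,,", "b", ",,", ""]
var restored = kept.join('');        // ',,a,,,,b,,'
```

Some older browsers did not handle capturing groups in split consistently, so treat this as something to test in your target environment rather than a guarantee.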

Joining for case (I)

Sometimes you want to join a split as in case (I), but into something other than a string. When I first tried to do this I ended up writing a horrifically complicated function.

Looking at this case again:

  ',,a,,,,b,,c,,'.split(/,/) // case (I)
  => ["", "", "a", "", "", "", "b", "", "c", "", ""]

The thing to remember is that "" represents the gaps between the commas in ',,a,,,,b,,c,,', including the gap before the very first comma and the gap after the very last comma. "a", "b" etc. are filled-in gaps. This is probably what makes manually joining such a split array confusing: it's easy to fall into thinking that the ""-terms represent the commas rather than the gaps.
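A quick way to convince yourself of this is to count. The sketch below (variable names are illustrative) compares the number of commas with the number of array entries:

```javascript
var text = ',,a,,,,b,,c,,';
var parts = text.split(/,/);

// Count the commas by matching them globally.
var commas = text.match(/,/g).length;   // 10

// One gap before the first comma plus one gap after each comma.
var gaps = parts.length;                // 11

// The non-empty entries are the filled-in gaps.
var filled = parts.filter(function (p) { return p !== ''; });  // ["a", "b", "c"]
```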

Algorithm for manual joining

From an algorithmic point of view we want to map over the array produced in case (I) and process both the "" and non-"" terms.

The commas in the string may signify points where we want to insert something. In my case, the strings I was splitting were text nodes from preformatted text (in pre tags) that contained line feeds (\n or \r\n). I was tokenizing the text and wanted to preserve the line feeds as individual span tokens. So the commas in case (I) would stand for line feeds, e.g. '\n\na\n\n\n\nb\n\nc\n\n' instead of ',,a,,,,b,,c,,'.
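As a sketch of the line-feed variant, assuming the separator regex /\r?\n/ (the post does not show the exact pattern used), both Unix and Windows line endings produce the same shape of array as case (I):

```javascript
var unix = '\n\na\n\n\n\nb\n\nc\n\n';
var windows = unix.replace(/\n/g, '\r\n');   // same text with \r\n line feeds

var unixParts = unix.split(/\r?\n/);
var windowsParts = windows.split(/\r?\n/);
// Both are ["", "", "a", "", "", "", "b", "", "c", "", ""]
```

Note that /\r?\n/ matches either one or two characters, so if a text mixes both kinds of line feed the split alone no longer records which kind appeared where.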

Going back to case (I), the terms (or gaps) are the best indication of where the commas are: if there are n commas, there will be n+1 gaps (including filled-in ones). Keeping this in mind, the rules we could follow as we map over the array might be:

  • when we have a ""-term, we insert a comma
  • when we have a non-"" term, we insert the term followed by a comma
  • at the last position in the array, we don't insert a comma
    • if the last position in the array is a "", then do nothing
    • if the last position in the array is a filled-in gap, process it but don't insert a comma
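These rules can be sketched directly as a loop. The helper name applyRules and its separator argument are hypothetical, introduced only to make the rules checkable; it returns a flat list of tokens rather than a string:

```javascript
// Hypothetical helper: applies the rules above and returns a flat token list.
var applyRules = function (parts, separator) {
    var out = [], i, last = parts.length - 1;
    for (i = 0; i <= last; i++) {
        if (parts[i] !== '') out.push(parts[i]);   // filled-in gap: emit the term
        if (i !== last) out.push(separator);       // every position but the last: emit a comma
    }
    return out;
};

var tokens = applyRules(',,a,,,,b,,c,,'.split(/,/), ',');
// tokens holds 10 commas and 3 terms; tokens.join('') === ',,a,,,,b,,c,,'
```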

Functional approach

There are some nice ways to do this in JavaScript. ECMAScript 5 adds array methods such as forEach, map and reduce that might assist, but here is a manual version that, whilst not overly functional itself, facilitates a functional style when used (using the term 'functional' in a very loose sense):

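  // Join elements that have been split by String.prototype.split(...).
  // unsplit is called once per separator, process once per non-empty term.
  var join = function(arr,unsplit,process) {
      var i,l=arr.length;
      for(i=0;i<l;i++) {
          if(arr[i]!=='') process(arr[i],this); // filled-in gap
          if(i!==l-1) unsplit(this);            // a separator follows every position but the last
      }
  };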

Notes:

  • unsplit is a function that represents the "insert comma" operation
  • process is a function that represents the "insert term" operation which we apply to filled-in gaps like "a"
  • in addition, we pass this to both unsplit and process, as this can facilitate sharing privileged information between them, although this isn't necessary.

We could run join like this:

join(arr,f,g)

for some array arr and functions f and g.
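For instance, here is a minimal sketch with concrete f and g that collect into a shared array through a closure. The '<br>' marker and the upper-casing are arbitrary choices for illustration, and join is repeated from above so the example runs on its own:

```javascript
// join as defined above, repeated so the example is self-contained.
var join = function (arr, unsplit, process) {
    var i, l = arr.length;
    for (i = 0; i < l; i++) {
        if (arr[i] !== '') process(arr[i], this);
        if (i !== l - 1) unsplit(this);
    }
};

var out = [];
var f = function () { out.push('<br>'); };                    // "insert comma"
var g = function (item) { out.push(item.toUpperCase()); };    // "insert term"

join('a,,b,'.split(/,/), f, g);
// out is now ["A", "<br>", "<br>", "B", "<br>"]
```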

But suppose we want to accumulate a result as join maps over arr, or otherwise share privileged information between f and g. That is where this comes in:

var module1 = function() {
  var prog1 = function(text) {
    ...
    var someObj = {};
    ... initialize someObj ...
    var arr = text.split(...);
    join.call(someObj,arr,unsplit,process);
    ...
  }     
  var unsplit = function(obj) {
    ...
  }     
  var process = function(item,obj) {
    ...
  }     
}();

In the above we have a function prog1 inside a module that performs a split on some text. We invoke join using call, passing someObj as the first argument; someObj becomes the this reference within join, which in turn passes it to unsplit and process.
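One way the elided parts might be filled in is sketched below. The token shapes, the split pattern /\r?\n/, the return value of prog1 and the object returned by the module are all assumptions made for the sake of a runnable example; the outline above does not specify them.

```javascript
var join = function (arr, unsplit, process) {
    var i, l = arr.length;
    for (i = 0; i < l; i++) {
        if (arr[i] !== '') process(arr[i], this);
        if (i !== l - 1) unsplit(this);
    }
};

var module1 = function () {
    var prog1 = function (text) {
        var someObj = { tokens: [] };               // accumulator shared via `this`
        var arr = text.split(/\r?\n/);              // assumed separator: line feeds
        join.call(someObj, arr, unsplit, process);
        return someObj.tokens;
    };
    var unsplit = function (obj) {
        obj.tokens.push({ type: 'linefeed' });      // assumed token shape
    };
    var process = function (item, obj) {
        obj.tokens.push({ type: 'text', value: item });
    };
    return { prog1: prog1 };                        // assumed public interface
}();

var tokens = module1.prog1('a\n\nb\n');
// tokens: text "a", linefeed, linefeed, text "b", linefeed
```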

Variations

We could skip using call/this and simply add an extra parameter to join to allow us to pass an object in.
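A minimal sketch of that variation, assuming the extra parameter comes last and is called obj (both choices are mine, not from the original code):

```javascript
// Variation: the shared object is passed explicitly as a fourth argument.
var join = function (arr, unsplit, process, obj) {
    var i, l = arr.length;
    for (i = 0; i < l; i++) {
        if (arr[i] !== '') process(arr[i], obj);
        if (i !== l - 1) unsplit(obj);
    }
};

var acc = { text: '' };
join(',,a,,b'.split(/,/),
     function (o) { o.text += '|'; },          // unsplit: mark each separator
     function (item, o) { o.text += item; },   // process: append each term
     acc);
// acc.text === '||a||b'
```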

Or we could also invoke unsplit and process using call, so that this inside them refers to the shared object. This removes the need to specify the obj parameter in these two functions:

  // Join elements that have been split by String.prototype.split(...).
  var join = function(arr,unsplit,process) {
      var i,l=arr.length;
      for(i=0;i<l;i++) {
          if(arr[i]!=='') process.call(this,arr[i]);
          if(i!==l-1) unsplit.call(this);
      }
  };
  var unsplit = function() {
    ... do something with 'this' ...
  };
  var process = function(item) {
    ... do something with 'this' ...
  };
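A runnable sketch of this variation follows. The callback names countSeparator and collectTerm and the shape of the state object are hypothetical stand-ins for the elided bodies above:

```javascript
var join = function (arr, unsplit, process) {
    var i, l = arr.length;
    for (i = 0; i < l; i++) {
        if (arr[i] !== '') process.call(this, arr[i]);
        if (i !== l - 1) unsplit.call(this);
    }
};

// Hypothetical callbacks: both reach the shared object through `this`.
var countSeparator = function () { this.count += 1; };
var collectTerm = function (item) { this.items.push(item); };

var state = { count: 0, items: [] };
join.call(state, ',,a,,,,b,,c,,'.split(/,/), countSeparator, collectTerm);
// state.count === 10 and state.items is ["a", "b", "c"]
```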

We could also define unsplit and process within prog1, giving these functions privileged access to someObj through the closure. They would be re-created every time prog1 is invoked, but there would be no need to mess about with an extra parameter or this.
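A sketch of the closure variation, assuming a version of join that passes nothing but the term to its callbacks (the '<br>' marker and the return value are again just for illustration):

```javascript
// Assumed simplified join: no `this`, no shared-object parameter.
var join = function (arr, unsplit, process) {
    var i, l = arr.length;
    for (i = 0; i < l; i++) {
        if (arr[i] !== '') process(arr[i]);
        if (i !== l - 1) unsplit();
    }
};

var prog1 = function (text) {
    var someObj = { out: [] };
    // Both callbacks see someObj through the closure.
    var unsplit = function () { someObj.out.push('<br>'); };
    var process = function (item) { someObj.out.push(item); };
    join(text.split(/,/), unsplit, process);
    return someObj.out;
};

var result = prog1('a,,b');
// result is ["a", "<br>", "<br>", "b"]
```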