30 July 2013
When I wrote a couple of weeks ago about writing a way to consume Full Text Indexer data entirely on the client (so that I could recreate my blog's search functionality at productiverage.neocities.org - see The Full Text Indexer goes client-side!), I said that
Yes, the search index files are bigger than I would have liked..
This may have been somewhat of an understatement.
As a little refresher; there is one JSON file responsible for matching posts to search terms ("oh, I see you searched for 'cats'! The posts that you want are, in descending order of match quality: 26, 28, 58, 36, 27, 53, 24"). Then there is a JSON file per-post for all of the source location mappings. These describe precisely where in the posts that tokens may be matched. There's quite a lot of data to consider and it's in a fairly naive JSON structure (I used single letter property names but that was the only real attempt at size-constraining*). On top of this, there's a plain-text representation of each post so that the source location mappings can be used to generate a best match "content excerpt" for each result (à la Google's results, which highlight matched terms). And finally a JSON file for all of the titles of the posts so that the titles can be displayed when the search logic knows which posts match the query but the content excerpt generation is still being worked on in the background.
This results in 5.8MB of JSON and plain text data.
Now, in fairness, this is all gzip'd when it comes down the wire and this sort of data compresses really well. And the detail files for the posts are only requested when the corresponding posts are identified as being results for the current search. So in terms of what the user has to download, it's no big deal.
However, the hosting at neocities.org only offers 10MB at the moment so 5.8MB solely for searching seems a little excessive. To put it into perspective, this is several times more than the html required to render all of the actual posts!
I hadn't actually realised just how much of a problem it was until I published my last post. When I came to re-generate the flattened content to upload to NeoCities, that last post (being something of a beast) pushed the storage requirements past the magic 10MB mark!
* (Very minor diversion: There was actually one potential optimisation that I did consider when first serialising the index data for use on the client. Instead of a simple associative array, I wondered if using a Ternary Search Tree would reduce the space requirements if a lot of the keys had similarities. I used a TST internally in the C# project for performance reasons and talked about it in The .Net Dictionary is FAST!. Alas, when I tried it as a structure for my data, it resulted in slightly larger files).
The biggest contributors to size are the per-post search index files that contain the source locations data. Each is JSON data describing an associative array matching a "normalised" search term (to take into account plurality, case-insensitivity, etc) to a post key, weight and array of source locations. Each source location has the fields FieldIndex, TokenIndex, ContentIndex and ContentLength (read all about it at The Full Text Indexer: Source Locations).
I know that this data can be compressed since, not only does textual or JSON data frequently lend itself to being effectively compressed, but I can see the compression that gzip achieves when the data is delivered to the browser.
My first thought when it comes to compression is that we're dealing with binary data. Certainly, when I see gzip'd responses in Fiddler (before using the "Response is encoded.. Click here to transform" link) it looks like gobbledygook, not like any text I know!
This reminded me of something I read some time ago about a guy creating a PNG file where the pixels are generated from bytes extracted from textual content. This PNG can be read by javascript and the pixels extracted again, and from the pixel values the textual source content can be recreated. The real benefit with PNG here is that it incorporates a lossless compression scheme. Lossless compression is really important here, it means that the decompressed content will be identical to the source content. JPEG is a lossy scheme and can achieve higher compression rates if the quality is reduced sufficiently. But it loses information when it does this. The idea is that the resulting image is either acceptably close visually to the source image or that the discrepancy is evident but thought to be a worthwhile trade-off considering the file size benefits. If we're trying to extract text data that represents javascript than "close-ish, maybe" is not going to cut it!
The original article that I'd remembered was this: Compression using Canvas and PNG-embedded data. The results sound impressive, he got an older version of jQuery (1.2.3) compressed down from 53kb to 17kb. Bear in mind that this is already the minified version of the code! It's quite a short article and interesting, so give it a visit (and while you're there notice that the Mario background is interactive and can be played using the cursor keys! :)
The summary of the article, though, is that it's not suitable for mainstream use. Browser support is restricted (all modern browsers now would work, I'm sure, but I don't know how many versions of IE it would work in). And it concludes with "this is meant only as thing of interest and is not something you should use in most any real life applications, where something like gzip will outperform this".
Ok.
Glad I looked it up again, though, since it was interesting stuff.
On the same blog, there's also an article Read EXIF data with Javascript, which talks about retrieving binary data from a file and extracting the data from it. In this case, the content would have to be compressed when written and then decompressed by the javascript loading it. Unlike, the PNG approach, compression doesn't come for free. From the article, he's written the file binaryajax.js. Unfortunately, browser support apparently is still incomplete. The original plan outlined works for Chrome and Firefox but then some dirty hacks (including rendering VBScript script tags and executing those functions) are required for IE and, at the time at least, Opera wouldn't work at all.
Again, interesting stuff! But not quite what I want.
So I had to fall back to asking google about javascript compression and trying not to end up in article after article about how to minify scripts.
In fairness, it didn't take too long at all until a pattern was emerging where a lot of people were talking about LZ compression (info available at the Wikipedia page). And I finally ended up here lz-string: JavaScript compression, fast!
From that page -
lz-string was designed to fulfill the need of storing large amounts of data in localStorage, specifically on mobile devices. localStorage being usually limited to 5MB, all you can compress is that much more data you can store.
What about other libraries?
All I could find was:
- some LZW implementations which gives you back arrays of numbers (terribly inefficient to store as tokens take 64bits) and don't support any character above 255.
- some other LZW implementations which gives you back a string (less terribly inefficient to store but still, all tokens take 16 bits) and don't support any character above 255.
- an LZMA implementation that is asynchronous and very slow - but hey, it's LZMA, not the implementation that is slow.
- a GZip implementation not really meant for browsers but meant for node.js, which weighted 70kb (with deflate.js and crc32.js on which it depends).
This is sounding really good (and his view of the state of the other libraries available reflects my own experiences when Googling around). And he goes on to say
localStorage can only contain JavaScript strings. Strings in JavaScript are stored internally in UTF-16, meaning every character weight 16 bits. I modified the implementation to work with a 16bit-wide token space.
Now, I'm not going to be using it with localStorage but it's gratifying to see that this guy has really understood the environment in which it's going to be used and how best to use that to his advantage.
Preliminary tests went well; I was compressing this, decompressing that, testing this, testing that. It was all going swimmingly! The only problem now was that this was a clearly a custom (and clever) implementation of the algorithm so I wouldn't be able to use anything standard on the C# side to compress the data if I wanted the javascript to be able to decompress it again. And the whole point of all of this is to "flatten" my primary blog and serialise the search index in one process, such that it can be hosted on NeoCities.
The javascript code is fairly concise, so I translated it into C#. When I'd translated C# classes from my Full Text Indexer into javascript equivalents, it had gone surprisingly painlessly. I'd basically just copied the C# code into an empty file and then removed types and tweaked things to work as javascript. So I thought I'd take a punt and try the opposite - just copy the javascript into an empty C# class and then try to fix it up. Adding appropriate types and replacing javascript methods with C# equivalents. This too seemed to go well, I was compressing some strings to text files, pulling them with ajax requests in javascript, decompressing them and viewing them. Success!
Until..
I gave it a string that didn't work. The decompress method returned null. Fail.
I figured that there's only so much that could be going wrong. If I compressed the same string with the javascript code then the javascript code could decompress it just fine. The data compressed with the C# version refused to be decompressed by the javascript, though. Chance are I made a mistake in the translation.
I got the shortest reproduce string I could (it's from the titles-of-all-posts JSON that the search facility uses) -
{"1":"I lo
and got both the C# and javascript code to print out a list of character codes that were generated when that string was compressed.
These were identical. So maybe my translation wasn't at fault.
Well something must be getting lost somewhere!
This led me on to wondering if it was somehow the encoding. The compressed content is being served up as UTF8 (basically the standard on the web) but the compression algorithm is intended to compress to UTF16. Now, surely this won't make a difference? It means that the bits sent over the wire (in UTF8) will not be the exact same bits as the javascript string (UTF16) is represented by when it's received, but these encoded bits should be describing the same character codes.
So the next step was to intercept the ajax request that the javascript client was making for the data (compressed by C# code, delivered over the wire with UTF8 encoding) and to write out the character codes at that point.
And there was a discrepancy! The character codes received were not the same that I'd generated by the C# code and that I thought I was transmitting!
Thinking still that this must somehow be encoding-related, I started playing around with the encoding options when writing the data to disk. And noticed, innocently hidden away in an alternate constructor signature, the System.Text.UTF8Encoding class has an option to "throwOnInvalidBytes". What is this?
I knew how UTF8 worked, that it uses a variable number of bytes and uses the most-signficant-bits to describe how many bytes are required for the current character (the Wikipedia article explains it nicely) and thought that that was pretty much all there was to it. So how could a byte be invalid?? Well, with this constructor argument set to true, I was getting the error
Unable to translate Unicode character \uD900 at index 6 to specified code page.
so clearly particular bytes can be invalid somehow..
With this error, it didn't actually take much searching. There's a link on www.unicode.com; Are there any 16-bit values that are invalid? that states that
Unpaired surrogates are invalid in UTF8. These include any value in the range D80016 to DBFF16 not followed by a value in the range DC0016 to DFFF16, or any value in the range DC0016 to DFFF16 not preceded by a value in the range D80016 to DBFF16
I spent a little while wandering through various searches on the internet trying to decide what the best way would be to try to address this. I didn't want to have to try to find another compressor for all of the reasons that the author of the one I'm using outlined! Which made me think, maybe there's more information about this on his site.
Lo and behold, in the changelog (at the bottom of that page), there's mention that there's a v1.3.0 available that has additional methods compressToUTF16 and decompressToUTF16 ("version 1.3.0 now stable") that "allow lz-string to produce something that you can store in localStorage on IE and Firefox".
These new methods wrap the methods "compress" and "decompress". But the "compress" and "decompress" methods in this new version of the code look different to those in the code that I had been using (and had translated). But it's no big deal to translate the newer version (and the new methods).
And now it works! I wish that this version had been the file you see when you go to the main lz-string GitHub page rather than being hidden in the "libs" folder. But considering how happy I am that the solution to my problem has been provided to me with little-to-no-effort, I'm really not going to complain! :)
Step 1 was to alter my Blog-to-NeoCities Transformer to write compressed versions of the per-post source location mappings data, along with compressed versions of the plain text post data and the JSON that has the titles for each post.
The C# translation of the LZString code can be seen at: LZStringCompress.cs.
Step 2 was to alter the javascript search code to handle the compressed content. Having included the 1.3.0 version of LZString.js, I needed to replace some of the $.ajax requests with calls to one of
function LoadCompressedData(strUrl, fncSuccess, fncFailure) {
// Note: I've seen this fail when requesting files with extension ".js" but work when the exact
// same file is renamed to ".txt", I'm not sure if this is in IIS that it's failing or if jQuery
// is requesting the data in a slightly different manner (despite the explicit dataType option),
// so best just ensuring that all LZ-compressed data is stored in a file with a ".txt" extension.
$.ajax({
url: strUrl,
dataType: "text",
success: function(strCompressedContent) {
var strContent;
try {
strContent = LZString.DecompressFromUTF16(strCompressedContent);
}
catch(e) {
if (fncFailure) {
fncFailure(e);
}
return;
}
fncSuccess(strContent);
}
});
}
function LoadCompressedJsonData(strUrl, fncSuccess, fncFailure) {
LoadCompressedData(
strUrl,
function(strContent) {
var objData;
try {
eval("objData = " + strContent);
}
catch(e) {
if (fncFailure) {
fncFailure(e);
}
return;
}
fncSuccess(objData);
},
function() {
if (fncFailure) {
fncFailure(arguments);
}
}
);
}
Step 3 is.. there is no step 3! Everything was now working but taking up less space on the hosting.
When compressed, the detailed source location mappings data is reduced from a combined 4.7MB to 1.5MB. The plain text content was reduced from 664kb to 490kb (not as much of a saving as I'd expected, to be honest). The titles JSON shrank marginally from 2.58kb to 2.36kb. The summary data JSON wasn't compressed so that the first stage of the search could be performed as quickly as possible and one non-compressed file on the server was no problem (it's still gzip'd when passed to the client, though, so there's no bandwidth cost). In total, 5.3MB of data was condensed into requiring less than 2MB on the server. Which I am happily marking down as a win :)
So here's to me hopefully fitting many more posts (along with all of the related javascript searching data) into the NeoCities hosting limitations! I'm sure that if I ever start pushing that 10MB point again, by then the 10MB limit will have been raised - 20MB is already on the way!
Posted at 23:22
Dan is a big geek who likes making stuff with computers! He can be quite outspoken so clearly needs a blog :)
In the last few minutes he seems to have taken to referring to himself in the third person. He's quite enjoying it.