Wednesday, November 5, 2014

Attaching ShareJS to <select> element

One thing I found missing in the ShareJS library was the possibility to attach live concurrent editing to an HTML <select> element. Out of the box, it works only with text fields - <input> and <textarea> - via the doc.attachTextarea(elem) function.

Working around that deficiency wasn't trivial. ShareJS is built on Operational Transformation, which extracts each logical change to the text (an addition or a removal) and sends only that change information over the wire. That is great for textual elements, but for <select>, whose value is always replaced in one shot, it makes little sense.

Unfortunately, there is no "replace" operation we could use on a <select> value change - the modus operandi we have to live with is constrained to insertions and removals. It means we have to mimic a "replace" operation with a removal and an insertion. The problem with this approach is that when the operations get reversed - so that the client receives the insertion of the new value first and then the removal of the previous value - the intermediate value in-between is not a valid <option>. It is a concatenation of the old value and the new value. The DOM API doesn't like that and rejects the change, setting the <select> value to an empty one. The removal operation that comes next is then unable to fix the value, as it tries to remove something from an already empty string in the DOM.

I worked around that by wrapping my DOM element with a tiny wrapper that keeps the raw value and exposes it to ShareJS transformations, while still updating the original element's DOM:

var rawValue = innerElem.value;
var elem = {
    get value () {
        return rawValue;
    },
    set value (v) {
        rawValue = v;
        innerElem.value = v;
    }
};

ShareJS also doesn't attach itself to the change event, typical for the <select> element - it specializes in keyboard events. So I had to attach a listener on my own and relay the event to the underlying ShareJS implementation, faking an event of a type the library handles - I've chosen the mysterious textInput event.

Here is the full code as a Gist: ShareJS attachSelect. It adds a new function to the Doc prototype, allowing it to be called the same way as ShareJS's native attachTextarea:

if (elem.tagName.toLowerCase() === 'select') {
    doc.attachSelect(elem);
} else {
    doc.attachTextarea(elem);
}

Feel free to use the code - I hope someone finds it useful.

This post was originally posted on my company blog.

Thursday, October 30, 2014

ShareJS 0.7.3 working example

I'm experimenting with the ShareJS library, which is intended to enable live concurrent editing like in Google Docs. The demo on their website looks incredibly easy, even though later on the same page they are quite cruel to themselves: "ShareJS is mostly working, but it's still a bit shit." I wouldn't be so harsh, as I was able to have it up and running in less than a few hours. But the fact is it wasn't as easy as it seemed.

It looks like the main problem with the current state of ShareJS is something pretty common in the wild and uncontrolled open-source world - the lack of proper documentation. Here the problem is even worse. There are some docs and examples, but most of them are either incomplete or outdated. The ShareJS.org website runs on ShareJS 0.5, while the most recent release is 0.7.3, with no backward compatibility between those releases. I think it would be less harmful if there were no examples at all - right now they are more misleading than helpful. It was a bit frustrating when even the shortest and simplest snippet from their website didn't work, failing on calls to non-existent functions.

Anyway, I was able to figure out what I needed to change to get the simple demo running, both server- and client-side. Here it is, in case you are having the same struggle.

On the server side, I'm running a CoffeeScript WebSocket server, almost like in the original sample. I just needed a few changes to have it running with Connect 3 - the logging and static-serving middlewares are no longer included in Connect out of the box, so I used morgan and serve-static, respectively. Here is the only changed part, around the Connect middleware initialization:

app = connect()
app.use morgan()
app.use '/srv', serveStatic sharejs.scriptsDir
app.use serveStatic "#{__dirname}/app"

Go here for full Gist: ShareJS 0.7.3 server-side code.

I'm exposing the client JavaScript libraries provided with ShareJS under the /srv path, and the client-facing web application files, physically located in /app on my filesystem, are exposed directly under the root path.

The client side was a bit harder. Running the original code from the main ShareJS.org website wasn't successful:

sharejs.open('blag', 'text', function(error, doc) {
  var elem = document.getElementById('pad');
  doc.attach_textarea(elem);
});

It tries to call the sharejs.open function, which yields a "TypeError: undefined is not a function" error for a simple reason - there is no longer an "open" function on the sharejs global variable. Fiddling around, I found an example using a more verbose call like this:

var ws = new WebSocket('ws://127.0.0.1:7007');
var share = new sharejs.Connection(ws);
var doc = share.get('blag', 'doc');

if (!doc.type) {
    doc.create('text');
}
       
doc.whenReady(function () {
    var elem = document.getElementById('pad');
    doc.attachTextarea(elem);
});

That seemed legitimate and didn't fail immediately, but I was getting an "Operation was rejected (Document already exists). Trying to rollback change locally." error message every time except the first. The code was calling doc.create('text') on every run, and that was clearly wrong - doc.type should have come pre-populated somehow. The solution is to subscribe to the document first and move the type check and document creation into the callback invoked once the document is ready - like this:

var ws = new WebSocket('ws://127.0.0.1:7007');
var share = new sharejs.Connection(ws);
var doc = share.get('blag', 'doc');
doc.subscribe();

doc.whenReady(function () {
    if (!doc.type) {
        doc.create('text');
    }

    var elem = document.getElementById('pad');
    doc.attachTextarea(elem);
});

See the full Gist: ShareJS 0.7.3 client-side code.

This post is cross-posted with my company blog.

Thursday, July 10, 2014

The Switch

A little more personal than usual today. At the end of June I left a job on a long-term, 40+ developer project in a large international company. I didn't feel underpaid, and my job was neither mundane nor exhausting. But I felt it was high time to move on.

I'm now working for a company that employs around a thousand times fewer people and focuses on projects that are much smaller in scope and much shorter in time than those at my previous job. I can see the good and the bad sides of that change. I have more opportunities to try out and learn new things, and I can stay away from the office politics I hate, but it comes at the cost of probably reduced job stability and the lack of long-term project maintenance challenges that I really enjoy.

The biggest motivation I had was to test myself against my own belief that software engineering is much more a mindset than pure knowledge. I've always wanted to stay open to programming languages and technology stacks other than the one I was currently working in. I think the language or the platform is just a tool, and a good software developer should not be constrained to one particular toolset. I believe switching from one to another should be just a matter of getting familiar with the practices and conventions of the given platform, plus some practice. And actually, the experience gained on other platforms can be used to get the best of both worlds.

So here I am - I no longer consider myself a database-inclined .NET guy. Now I'll be working with a variety of technologies centered around mobile and front-end web development - from iOS to Android, from Node.js to AngularJS etc. The key is that I have virtually no experience in any of those stacks. Right now I feel a bit like a toddler - I'm learning hundreds of new basic things every day. And that's the fun!

Wednesday, June 18, 2014

StructureMap: hot swap cache

In the previous post I've shown how to cache objects in StructureMap for a given period of time. As I mentioned there, the approach presented has one possibly serious downside - the penalty of cache rebuilding kicks one unlucky user every caching period. If it takes more than several seconds for the cached object to be built, we probably don't want this to happen in-process, unless we're showing our users something like XKCD strips while they wait.

Ideally, we would rebuild our cache in some kind of off-process mechanism and, when it's ready, just replace the old cache object with the fresh one - like disk hot swapping. Is that also possible with StructureMap? Probably not with lifecycles - lifecycles do not control object creation, they just provide the proper cache.

What we can do instead is pre-build the cache object and inject it into the working container. But we can't use the container to prepare that cache object for us this time - the container would happily fulfill our request with the previously cached object. Although delegating the object creation process is actually one of the purposes we use IoC containers for, I can't see any neat way to delegate the responsibility for object creation for the whole application lifetime except for the cache pre-building.

So I've chosen the less neat way. I've created a cache factory that just news the cache up manually, while itself being created by StructureMap. That way, whenever the application asks for IDependency, it quickly gets the cached instance. But when the cache rebuilding task runs, it grabs the DependencyFactory and creates a new object - the future cache.

Let's see the code. First, here is a base class for all the cache factories - CacheFactory. It smells a bit like a conforming container, but I find it not really harmful. It is not intended to be used in any context other than cache pre-building, and it is specialized to create a single type of object. Cache consumers should not know about it and should just take the ICache dependency through constructor injection or any other legitimate way.

public abstract class CacheFactory
{
    public abstract object InternalBuild();
    public abstract Type PluginType { get; }
}

public abstract class CacheFactory<T> : CacheFactory
{
    public T Build()
    {
        return (T)InternalBuild();
    }

    public override Type PluginType
    {
        get { return typeof(T); }
    }
}

The non-generic class is the core here. It defines a method responsible for returning the actual cache instance. The generic class is just there to keep the API nice and to allow defining strongly-typed constraints.

The second brick in the puzzle is the code that handles the actual cache hot swap. It spawns a new thread that wakes up every 600 seconds and traverses all the CacheFactories registered in the container, creating new cache instances and injecting them into the working container. This way, up until the Inject call, StructureMap serves all requests with the previously cached instance, and after the Inject call it serves the new object, ready to be used without any further delays.

public class BackgroundCacheRefresher
{
    private readonly IContainer _container;
    private readonly ILog _log;

    public BackgroundCacheRefresher(IContainer container, ILog log)
    {
        _container = container;
        _log = log;
    }

    private class Worker
    {
        private readonly IContainer _container;
        private readonly IEnumerable<CacheFactory> _cacheFactories;
        private readonly ILog _log;

        public Worker(IContainer container, IEnumerable<CacheFactory> cacheFactories, ILog log)
        {
            _container = container;
            _cacheFactories = cacheFactories;
            _log = log;
        }

        public void RefreshAll()
        {
            foreach (var cacheFactory in _cacheFactories)
            {
                try
                {
                    _container.Inject(cacheFactory.PluginType, cacheFactory.InternalBuild());
                    _log.InfoFormat("Replaced instance of '{0}'.", cacheFactory.PluginType.Name);
                }
                catch (Exception e)
                {
                    _log.Error(String.Format("Failed to replace instance of '{0}' due to exception,"
                        + " will continue to use previously cached instance.", 
                        cacheFactory.PluginType.Name), e);
                }
            }
        }
    }

    private void RunLoop()
    {
        while (true)
        {
            var lifetime = 600; // seconds
            _log.InfoFormat("Will now go to sleep for {0} s.", lifetime);
            Thread.Sleep(TimeSpan.FromSeconds(lifetime));

            _log.Info("Woke up, starting refresh cycle.");
            _container.GetInstance<Worker>().RefreshAll();
        }
    }

    public void Execute()
    {
        new Thread(RunLoop).Start();
    }
}

I'm creating BackgroundCacheRefresher and calling its Execute method at application startup. It starts with sleeping - the first cache is built "traditionally", as registered below.
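The startup part itself is nothing fancy - a minimal sketch (the registry class name here is hypothetical, standing in for a registry with the registrations shown below):

var container = new Container(new HotSwapCacheRegistry());
container.GetInstance<BackgroundCacheRefresher>().Execute();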

Now we just need to wire things up in the Registry. I've created an extension method for the cache registration to keep it clean and encapsulated. It registers both the cache object (as a singleton, to keep it in memory - but we'll replace it periodically with the code above) and its corresponding CacheFactory implementation.

public static class RegistryExtensions
{
    public static CacheBuilderDSL<T> UseHotSwapCache<T>(this CreatePluginFamilyExpression<T> expression)
    {
        return new CacheBuilderDSL<T>(expression);
    }

    public class CacheBuilderDSL<T>
    {
        private readonly CreatePluginFamilyExpression<T> _expression;

        public CacheBuilderDSL(CreatePluginFamilyExpression<T> expression)
        {
            _expression = expression;
        }

        public SmartInstance<TConcrete, T> With<TConcrete, TFactory>(Registry registry)
            where TConcrete : T
            where TFactory : CacheFactory<T>
        {
            registry.For<CacheFactory>().Use<TFactory>();
            return _expression.Singleton().Use<TConcrete>();
        }
    }
}

And here is how to use it:

For<IDependency>().UseHotSwapCache().With<ExpensiveDependency, ExpensiveDependencyFactory>(this);

The last thing is the factory - it just news up the cache object. Note that its dependencies can be provided in the typical, constructor-injected way.

public class ExpensiveDependencyFactory : CacheFactory<IDependency>
{
    private readonly IDependencyDependency _otherDependency;

    public ExpensiveDependencyFactory(IDependencyDependency otherDependency)
    {
        _otherDependency = otherDependency;
    }

    public override object InternalBuild()
    {
        return new ExpensiveDependency(_otherDependency);
    }
}

Whoa, quite a bit of code here. Maybe there is something simpler available - if so, please drop me a line! Otherwise, feel free to use it.

Tuesday, June 17, 2014

StructureMap: time expiring objects cache

StructureMap is my favorite .NET IoC container. It has a very nice API and is quite extensible. One of the things I use its extensibility points for is to have my expensive objects cached for some time. Not a singleton, as the cached values change from time to time and I want to see those changes eventually. Also not a transient or per-request instance, as filling the cache is expensive - let's say it's a web service call that takes several seconds to complete. There is no such object lifecycle provided by StructureMap. Let's fix that!

What I need is a custom lifecycle object, so that I can configure my dependencies almost as usual - instead of, for example:

For<IDependency>().HybridHttpOrThreadLocalScoped()
    .Use<NotSoExpensiveDependency>();

I'll use my own lifecycle, via the more generic LifecycleIs DSL method:

For<IDependency>().LifecycleIs(new TimeExpiringLifecycle(secondsToExpire: 600))
    .Use<DependencyFromWebService>();

LifecycleIs expects me to pass an ILifecycle implementation in. That interface is responsible for keeping a cache for the objects - it decides where that cache lives and how long it lives. In our case, all we need to do is use a "singleton-like" cache (MainObjectCache) and make sure it is invalidated after the given period of time. Easy as that!

This is how it looks for the StructureMap 2.6 family:

public class TimeExpiringLifecycle : ILifecycle
{
    private readonly long _secondsToExpire;
    private readonly IObjectCache _cache = new MainObjectCache();

    private DateTime _lastExpired;

    public TimeExpiringLifecycle(long secondsToExpire)
    {
        _secondsToExpire = secondsToExpire;
        Expire();
    }

    private void Expire()
    {
        _lastExpired = DateTime.Now;
        _cache.DisposeAndClear();
    }

    public void EjectAll()
    {
        _cache.DisposeAndClear();
    }

    public IObjectCache FindCache()
    {
        if (DateTime.Now.AddSeconds(-_secondsToExpire) >= _lastExpired)
            Expire();

        return _cache;
    }

    public string Scope
    {
        get { return GetType().Name; }
    }
}

And here is the same for StructureMap 3.0 (there were some breaking name changes etc.):

public class TimeExpiringLifecycle : ILifecycle
{
    private readonly long _secondsToExpire;
    private readonly IObjectCache _cache = new LifecycleObjectCache();

    private DateTime _lastExpired;

    public TimeExpiringLifecycle(long secondsToExpire)
    {
        _secondsToExpire = secondsToExpire;
        Expire();
    }

    private void Expire()
    {
        _lastExpired = DateTime.Now;
        _cache.DisposeAndClear();
    }

    public void EjectAll(ILifecycleContext context)
    {
        _cache.DisposeAndClear();
    }

    public IObjectCache FindCache(ILifecycleContext context)
    {
        if (DateTime.Now.AddSeconds(-_secondsToExpire) >= _lastExpired)
            Expire();

        return _cache;
    }

    public string Description
    {
        get 
        {
            return "Lifecycle for StructureMap that keeps the objects for the period of given seconds."; 
        }
    }
}

StructureMap is responsible for reading and writing the cache, constructing the objects etc. - we don't need to care about that stuff at all. The only thing we should remember is that although all the requests within the 600 seconds will be served with the cached object, after that time one of the requests will finally encounter a cache miss and will need to rebuild that expensive cache, bearing the cost within that request.
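Worth noting: the consuming code stays completely oblivious to all of this caching - it just asks for its dependency as usual. A sketch (the consumer class is hypothetical):

public class PricingService
{
    private readonly IDependency _dependency;

    // StructureMap injects the time-cached instance transparently;
    // the consumer cannot tell it apart from a transient or a singleton
    public PricingService(IDependency dependency)
    {
        _dependency = dependency;
    }
}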

Friday, April 18, 2014

Computer Science fundamentals still hold true

There are discussions out there about what a software developer should really know nowadays. Arguments are raised that most contemporary software development work is no longer computer science - which meant creating new stuff out of nothing - but is "just" engineering work - using already built and verified components and approaches, gluing and mixing them together to create stuff out of other stuff. Questions are raised whether a deep understanding of algorithms and data structures or other fundamentals of computer science is crucial for putting together existing libraries and frameworks, which is what most of us basically do every day.

The need for a good software developer to hold a Computer Science degree is often questioned, and there is plenty of advice out there on how to be successful in the field without one. Indeed, I've never actually implemented sorting or tree search on my own for professional needs, only as an academic exercise. But I think the knowledge I gained during my CS studies gave me a lot of insight into how things work and gives me a certain confidence in what I do. Without the knowledge of time and memory complexity, I'd be wandering in the dark.

Right now, working in a self-sufficient Agile team with a strong desire to avoid knowledge and responsibility silos, everyone is encouraged to take on whatever task is needed to reach the goal. Of course there are still (and always will be) people more skilled in database stuff and others more skilled in HTML, and that's fine. But with no code ownership, it's also fine when more front-end-inclined developers do some back-end tasks, and conversely.

But unless someone is doing something purely declarative in nature, like plain HTML or CSS, code is still code, regardless of whether it's low-level C or JavaScript at the client. That means understanding the mechanics of how things work and knowing the fundamentals of data structures and algorithms is still crucial for all software developers, no matter where in the stack they fit best. Of course, not having a degree cannot disqualify anyone from being a good software engineer, but the theoretical gaps need to be filled properly.

Recently I stumbled upon a simple piece of code we already had in production, working well enough that it hadn't drawn any attention while the data quantity was small. The code goes like this:

foreach (var foo in foos)
{
    var matchingBars = bars.Where(x => x.Foo == foo);
    foreach (var bar in matchingBars)
    {
        DoSomethingWith(foo, bar);
    }
}

Simple enough, isn't it? But there was more and more data. When we reached more than 20k foos and more than 20k bars, the execution time of this loop exploded to over a minute.

That simple piece of code hit us badly with a quite obvious O(n^2) number of comparisons. Foos and bars are both plain lists, so finding the matching elements requires traversing the whole bars collection for each and every foo element - the Where clause is lazily re-evaluated on every outer iteration. Each single comparison is insignificantly small, but doing it 425 million times takes more than a minute!

I changed the code to use a Lookup, which is a basic hashed structure allowing quick access to elements by key. The code now looks like this:

var barsLookup = bars.ToLookup(x => x.Foo);
foreach (var foo in foos)
{
    foreach (var bar in barsLookup[foo])
    {
        DoSomethingWith(foo, bar);
    }
}

That simple change replaced over 400 million comparisons with just the ~20k operations needed to build up the lookup, plus 20k cheap lookup reads. The result? Total execution time fell to just 115 ms.

That's 538 times faster, just by one simple data structure change.

I've found a great algorithms and data structures complexity cheat sheet. I think one may not call himself a software developer if he doesn't understand what at least the basic entries in those tables mean.

Thursday, April 10, 2014

Waiting screen - doing bad things right - is it even possible?

Sometimes in large web applications there is a necessity to make the client wait before the server is able to provide any content. There may be heavy calculations to perform, caches to refresh etc. In most cases it probably can - and should - be avoided, using background workers outside the actual web request or some kind of asynchronous AJAX calls. Those approaches make it possible to have either a completely undisturbed user experience or at least to reduce the fuss and eliminate the blocking wait.

But there's a chance the task we are to accomplish cannot be offloaded onto a background thread at the server nor performed asynchronously with AJAX during normal site operation. I had that kind of situation in a project where authorization was handled by an external, awfully enterprisey component, available through a web service that was slow as a snail and completely out of our control. It was not possible to show anything more than a waiting screen on the first log-on, as everything of any value had to be authorized first.

In that hopeless situation, we decided we needed a waiting screen, so that users at least see something that lets them know the service is being prepared for them. You know, if it's slow, it's a clear indication there's a lot of value inside and it's worth paying big money for. If it loaded quickly, there would seem to be no real value (in case my English is not fluent enough to express my thoughts accurately - yes, this was meant to be sarcastic).

Anyway, I was looking for a waiting screen solution that ideally holds all of the features listed here, in my subjective order of importance:

  1. it needs to be available not just as a JS throbber to be put somewhere on the site - it needs to be a response for the initial request to the website;
  2. it needs to draw something on screen while waiting (obvious to say, but not so obvious to implement);
  3. it should be HTTP-compliant in terms of status codes etc. so that it doesn't confuse any browsers or web crawlers;
  4. it should not break "back" button;
  5. while waiting, it should indicate "waiting" status in a browser's status bar, to convince the user something is really going on.

Serving a simple wait screen with a 302 Redirect to the target request that does the actual work is not an option, as it fails on requirement no. 2 - the browser will issue the redirected request without rendering our wait screen. But in order to fulfill points 3 and 4, we need the 302 Redirect - serving 200 OK without the actual content harms the protocol compliance and the browser history badly, as the two are tightly coupled. So there's a big contradiction here. We can try going the unwanted HTML Redirect route or use a JavaScript redirect - that satisfies point 2, but is no better in terms of protocol compliance or browser history obedience.

Well, I got stuck here. I've asked for help on StackOverflow, but there was no interest at all. Let me know if I'm missing something.

For now, I've sacrificed protocol compliance and proper "back" button behavior in order to have anything shown on screen - points 1 and 2 are the crucial ones for the general user experience. It's still a rather weak experience, but without those there is no experience at all. I've chosen the JavaScript redirect route, invoked on the window.load jQuery event (note that the document.ready event is raised when the DOM is ready but possibly not yet rendered - a bit too early for us to be sure something is already drawn).

I can't see a way to have the waiting screen done right. Well, maybe there is no right way to do bad things?

Monday, January 27, 2014

Introducing NOtherLookup - even better lookups

Recently I've been discussing the awesome ILookup data structure provided by the .NET Framework as part of LINQ. I've looked through some of its key features, but I've also mentioned a few imperfections - like no way to create it directly and no easy way to do any operations on a Lookup as a whole - promising to take a look at how to fix them.

So, how to fix it? By using the NOtherLookup library I've just pushed to NuGet. It is a set of extension methods and utility classes designed to make Lookup-based operations as easy and natural as in plain LINQ-to-objects.

Consider, for example, merging two Lookup instances into a single one that has the keys from both instances, where for each key present in both Lookups the union of the value sets is produced. When doing it in plain LINQ, we probably need to convert both Lookups to IDictionary<TKey, List<TValue>>, then merge the corresponding keys and finally produce a new Lookup manually.
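Just to show the scale of the pain, here is a rough sketch of that manual merge (assuming ILookup<string, int>, as in the example below):

var dict = firstLookup.ToDictionary(g => g.Key, g => g.ToList());
foreach (var group in secondLookup)
{
    // for keys present in both lookups, produce the union of their values
    List<int> existing;
    dict[group.Key] = dict.TryGetValue(group.Key, out existing)
        ? existing.Union(group).ToList()
        : group.ToList();
}
var merged = dict
    .SelectMany(kv => kv.Value, (kv, v) => new { kv.Key, Value = v })
    .ToLookup(x => x.Key, x => x.Value);

With NOtherLookup, we can instead just use LINQ in the same way as we do on IEnumerable: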

var result = firstLookup.Union(secondLookup);

In this example, result is still typed as ILookup<TKey, TValue>, with no further conversions needed. And here is what it does:

firstLookup:
"a" => 1,2
"b" => 3,4

secondLookup:
"b" => 4,5
"c" => 6,7

result:
"a" => 1,2
"b" => 3,4,5
"c" => 6,7

NOtherLookup offers support for several operators known from LINQ. Some combine two Lookups - besides the Union shown above, there are Concat, Except, Intersect, Join and Zip. Some manipulate a single Lookup - Select and Where. There is also the quite universal OnEachKey method that allows running an arbitrary LINQ query on each lookup element while maintaining the ILookup type:

ILookup<int, string> transformed = lookup.OnEachKey(g => g.Select(x => x + g.Key).Reverse());

Example:

lookup:
"a" => 1,2
"b" => 3,4

transformed:
"a" => "2a", "1a"
"b" => "4b", "3b"

NOtherLookup also contains a few features that make obtaining ILookups easier. The most important one is Lookup.Builder - a class that allows creating a lookup from scratch (as a solution for the lack of a public Lookup constructor), with support for custom comparers if needed:

ILookup<int, string> lookup = Lookup.Builder
    .WithKey(1, new[] { "a", "b" })
    .WithKey(2, new[] { "c", "d" }) 
    .WithComparer(new CustomIntComparer()) // can be omitted
    .Build();

There is also an empty lookup available for your convenience.

Last but not least, there are easy conversions between ILookup and IDictionary available - a simple method call in both directions:

ILookup<int, string> lookup = dict.ToLookup();
Dictionary<int, List<string>> backToDict = lookup.ToDictionary();

For detailed descriptions with examples, see the README on the GitHub project page, where the code lives. NOtherLookup is licensed under the very liberal MIT licence, so feel free to grab the code and use it any way you want.

Sunday, January 12, 2014

The downsides of .NET's Lookups

In the previous post I was admiring the wonderful features of .NET's Lookup - its immutability, null-safety and clean, readable presence. I also promised to look for its downsides and the pitfalls one may encounter when using it.

Are there any? Well, to be honest, I can't see any serious problems with using Lookups. If you need a collection to be used as a lookup table indexed by some identifier, with no modifications support and with multiple values per key, you won't find anything better suited than ILookup.

There are just two things I find to be slight impediments in the otherwise ideal Lookup world.

The first is that the only way to create an ILookup instance is via LINQ's ToLookup method. Although the framework provides a default public ILookup implementation - the Lookup class - it has no public constructor available. It means that in order to create our lookup, we need to prepopulate some other collection just to convert it later using ToLookup().
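For example, even a trivial three-element lookup requires a throwaway intermediate collection (a quick sketch):

var pairs = new List<KeyValuePair<string, int>>
{
    new KeyValuePair<string, int>("a", 1),
    new KeyValuePair<string, int>("a", 2),
    new KeyValuePair<string, int>("b", 3)
};

// the list above exists only to feed ToLookup and is discarded afterwards
ILookup<string, int> lookup = pairs.ToLookup(p => p.Key, p => p.Value);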

The second thing is that the beautifully clean API provided by ILookup<TKey, TValue> goes away immediately once we need to do anything other than just read our lookup. If we need a lookup created from an existing one but with one more value, or our lookup filtered, or joined with another, or whatever - it decays into IEnumerable<IGrouping<TKey, TValue>>, and readability hurts a lot. IGrouping represents a single lookup item - a values collection with its associated key. It is still nice for accessing the data, but not nice to operate on. There's no public implementation of IGrouping available in the framework - the implementation is internal to LINQ. There's no way to modify a grouping, which is understandable as it is part of an immutable ILookup, but there's also no way to create a new grouping based on an existing one. The workaround is to convert our whole lookup back to some mutable dictionary of collections, mutate it and then convert it to a new lookup using convoluted LINQ projections, something like this:

var temporaryDictionary = lookup.ToDictionary(x => x.Key, x => x.ToArray());
temporaryDictionary.Add("new key", new[] { "new value" });
var newLookup = temporaryDictionary
    .SelectMany(kv => kv.Value, (kv, v) => new { kv.Key, Value = v })
    .ToLookup(x => x.Key, x => x.Value);

Alternatively, if we don't mind losing the existing key-value mappings and are able to recreate them again, we can flatten our lookup to a List<TValue>, modify the list and then use the simplest ToLookup() overload.

var temporaryList = lookup.SelectMany(x => x).ToList();
temporaryList.Add("new value");
var newLookup = temporaryList.ToLookup(x => SomethingThatRecreatesKey(x));

No matter which way we choose, it's quite a lot of work for such a simple thing.

Next time I'll try to propose a solution for both issues. Stay tuned!

Monday, January 6, 2014

Lookup - .NET's hidden gem

.NET Framework has a lot to offer when it comes to generic data structures, especially collections. We have several kinds of lists, arrays, sets and dictionaries - each with its own performance and functional characteristics, and scenarios where it fits much better than the others. As I'm observing in the project I work on, one of the most often forgotten yet powerful types of collections is the Lookup. I think it deserves more attention.

ILookup was introduced to the framework as part of LINQ and is used exclusively in the LINQ context, as the result of ToLookup calls. Logically, it is a key-value collection like IDictionary, but it is designed to hold multiple values under a single key. Also, unlike Dictionary<K, V>, it is immutable, which means that after creating the lookup we can only read it - no modifications are possible.

Generally one may say that ILookup<K, V> is no better than IDictionary<K, IEnumerable<V>> and as such is redundant. But I'd strongly disagree. First of all, let's think about how dictionaries of collections are used. From my experience, they are most often used to keep predefined, static "dictionaries" that are loaded once and live long, providing easy access to some rarely changing values - like what counties there are in each country etc. Well, these are even often called "lookup tables". Note that in most cases we don't ever change the stored values - the data is immutable. We replace it as a whole, periodically or when something changes. Doesn't that sound exactly like the ILookup<K, V> key features discussed above? We should always use the tools best suited for the job, to avoid reinventing the wheel and avoid the problems others have already solved.
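To illustrate with the counties example - assuming a hypothetical County type with a CountryCode property - building such a lookup table is a one-liner, and the result is immutable from then on:

// one-shot creation; the lookup cannot be modified later, only replaced as a whole
ILookup<string, County> countiesByCountry = allCounties.ToLookup(c => c.CountryCode);
IEnumerable<County> polishCounties = countiesByCountry["PL"];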

Moreover, ILookup's definition is clearly more readable than the nested generics of its dictionary-based equivalent. When I look at code I've never seen before and I see ILookup<K, V>, I instantly know that it is a lookup table, and I can feel its read-only, one-way nature. When I see IDictionary<K, IEnumerable<V>>, I need to consider the possibility that it is used as a read-write store for some values and that I, as a user of this code, can (or should?) modify that collection. Using Lookup for lookups just makes sense.

IDictionary-based lookups, as mutable data structures, may kick us badly. I once experienced quite an ugly error caused by passing around a plain List<T> read from what was intended to be a static lookup shared between users, based on Dictionary<K, List<V>>. The intention was to pass around immutable lookup data, but the receiving code actually got a mutable list - the exact same reference that was sitting in the global lookup. The value passed was eventually customized for each user, and miracles started happening to the global lookup. All that would be just impossible with Lookups.
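A condensed sketch of that failure mode (types simplified):

var globalLookup = new Dictionary<string, List<string>>
{
    { "PL", new List<string> { "Mazowieckie", "Pomorskie" } }
};

// the caller receives the very same List instance that sits in the shared
// "lookup", so a per-user customization silently corrupts the global state:
List<string> regions = globalLookup["PL"];
regions.Add("user-specific entry");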

With dictionaries, which are open for modification and not thread-safe by default, we are also subject to multiple kinds of failures in multi-threaded scenarios. All of that is gone with an immutable data structure like ILookup.

And last but not least - null checks. Who likes to litter their code with all those boring null checks? When using IDictionary as a lookup table, every piece of code that consumes its data needs to check for key existence first or have a conditional statement with the TryGetValue method, because calling dictionary[nonExistingKey] would throw the dreadful KeyNotFoundException. Moreover, even if the key exists, the value may still legally be null, so to read the collection contained under the key we need two checks:

if (dictionary.ContainsKey(theKey))
{
     var collectionOrPossiblyNull = dictionary[theKey];
     if (collectionOrPossiblyNull != null)
     {
          DoSomethingWithValues(collectionOrPossiblyNull);
     }
}

or, more compactly but with the same complexity:

IEnumerable<V> collection;
if (dictionary.TryGetValue(theKey, out collection) && collection != null)
     DoSomethingWithValues(collection);

ILookup<K, V> was designed totally differently around key existence and null values. It just returns an empty collection for non-existent keys. And thanks to its strictly controlled way of being created (by LINQ's ToLookup method only), we can also be sure that no null collection is possible. The code above becomes just what it should be:

DoSomethingWithValues(lookup[theKey]);

Isn't it beautiful?

Next time I'll try to find some pitfalls or inconveniences of using lookups.