An Async Html cache – Part I - Writing the cache - Luca Bolognese

Luca -

In the process of converting a financial VBA Excel Addin to .NET (more on that in later posts), I found myself in dire need of an HTML cache that can be called from multiple threads without blocking them. Visualize it as a glorified dictionary where each entry is (url, cachedHtml). The only difference is that when you get the page, you pass a callback to be invoked when the HTML has been loaded (which could be immediately if the HTML had already been retrieved by someone else).

In essence, I want this:

    Public Sub GetHtmlAsync(ByVal url As String, ByVal callback As Action(Of String))

I’m not a big expert in the .NET Parallel Extensions, but I’ve got help. Stephen Toub helped so much with this that he could have blogged about it himself. And, by the way, this code runs on Visual Studio 2010, which we haven’t shipped yet. I believe that, with some modifications, it can be run in 2008 + .NET Parallel Extensions CTP, but you’ll have to change a bunch of names.

In any case, here it comes. First, let’s add some imports.

Imports System.Collections.Concurrent
Imports System.Threading.Tasks
Imports System.Threading
Imports System.Net

Then, let’s define an asynchronous cache.

Public Class AsyncCache(Of TKey, TValue)

This thing needs to store the (url, html) pairs somewhere and, luckily enough, there is a handy ConcurrentDictionary that I can use. Also, the cache needs to know how to load a TValue given a TKey. In ‘programmingese’, that means:

    Private _loader As Func(Of TKey, TValue)
    Private _map As New ConcurrentDictionary(Of TKey, Task(Of TValue))

I’ll need a way to create it.

    Public Sub New(ByVal l As Func(Of TKey, TValue))
        _loader = l
    End Sub

Notice in the above code the use of Task(Of TValue), instead of plain TValue, as the value type of my dictionary. Task is a very good abstraction for “do some work asynchronously and call me when you are done”. It’s easy to initialize and it’s easy to attach callbacks to it. Indeed, this is what we’ll do next:

    Public Sub GetValueAsync(ByVal key As TKey, ByVal callback As Action(Of TValue))
        Dim task As Task(Of TValue) = Nothing
        If Not _map.TryGetValue(key, task) Then
            task = New Task(Of TValue)(Function() _loader(key), TaskCreationOptions.DetachedFromParent)
            If _map.TryAdd(key, task) Then
                task.Start()
            Else
                task.Cancel()
                _map.TryGetValue(key, task)
            End If
        End If
        task.ContinueWith(Sub(t) callback(t.Result))
    End Sub

Wow. OK, let me explain. This method is divided into two parts. The first part is just a thread-safe way to say “give me the task corresponding to this key or, if the task hasn’t been inserted in the cache yet, create it and insert it”. The second part just says “add callback to the list of functions to be called when the task has finished running”.
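To see the second part in isolation, here is a minimal sketch of creating a task and attaching a callback to it. It is only an illustration, not part of the cache: the module name `TaskSketch` and the variables `received` and `done` are made up for the example, and the lambda parameter is typed explicitly just to keep overload resolution unambiguous.

```vbnet
Imports System
Imports System.Threading
Imports System.Threading.Tasks

Module TaskSketch
    Sub Main()
        ' Illustrative names only; nothing here comes from the cache itself.
        Dim done As New ManualResetEventSlim(False)
        Dim received As Integer = 0

        ' Describe the work; nothing runs until Start() is called.
        Dim t As New Task(Of Integer)(Function() 6 * 7)

        ' Attach a callback that runs once the task has completed.
        t.ContinueWith(Sub(antecedent As Task(Of Integer))
                           received = antecedent.Result
                           done.Set()
                       End Sub)

        t.Start()
        done.Wait() ' Block the main thread only so the console demo can print.
        Console.WriteLine(received)
    End Sub
End Module
```

A continuation attached to a task that has already finished is still scheduled to run, which is why the cache can hand the same task to late callers.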

The first part needs some more explanation. What is TaskCreationOptions.DetachedFromParent? It essentially says that the created task is not going to prevent the parent task from terminating. In essence, the task that created the child task won’t wait for its conclusion. The rest is better explained in comments.

        If Not _map.TryGetValue(key, task) Then ' Is the task in the cache? (Loc. X)
            task = New Task(Of TValue)(Function() _loader(key), TaskCreationOptions.DetachedFromParent) ' No, create it
            If _map.TryAdd(key, task) Then ' Try to add it
                task.Start() ' I succeeded. I’m the one who added this task. I can safely start it.
            Else
                task.Cancel() ' I failed, someone inserted the task after I checked in (Loc. X). Cancel it.
                _map.TryGetValue(key, task) ' And get the one that someone inserted
            End If
        End If

Got it? Well, I admit I trust Stephen that this is what I should do …
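A side note for readers on the released .NET Framework 4: as far as I know, neither TaskCreationOptions.DetachedFromParent nor Task.Cancel() survived to the final API (tasks are detached by default there, and cancellation goes through CancellationToken). The following is a hedged sketch of the same “get or create, then start only if mine won” idea for that API, using ConcurrentDictionary.GetOrAdd. The names `AsyncCacheSketch`, `entry`, `created` and the `Demo` module are assumptions made for the example, not the code from this post.

```vbnet
Imports System
Imports System.Collections.Concurrent
Imports System.Threading
Imports System.Threading.Tasks

' Hypothetical variant of the cache for the released .NET 4 API.
Public Class AsyncCacheSketch(Of TKey, TValue)
    Private ReadOnly _loader As Func(Of TKey, TValue)
    Private ReadOnly _map As New ConcurrentDictionary(Of TKey, Task(Of TValue))

    Public Sub New(ByVal loader As Func(Of TKey, TValue))
        _loader = loader
    End Sub

    Public Sub GetValueAsync(ByVal key As TKey, ByVal callback As Action(Of TValue))
        Dim entry As Task(Of TValue) = Nothing
        If Not _map.TryGetValue(key, entry) Then
            ' Create the task unstarted, so a losing thread never runs the loader.
            Dim created As New Task(Of TValue)(Function() _loader(key))
            ' GetOrAdd returns whichever task actually ended up in the dictionary.
            entry = _map.GetOrAdd(key, created)
            ' Only the thread whose task was stored starts it.
            If entry Is created Then created.Start()
        End If
        entry.ContinueWith(Sub(t As Task(Of TValue)) callback(t.Result))
    End Sub
End Class

Module Demo
    Sub Main()
        Dim loads As Integer = 0
        Dim cache As New AsyncCacheSketch(Of String, Integer)(
            Function(k As String)
                Interlocked.Increment(loads) ' Count how often the loader runs.
                Return k.Length
            End Function)

        Dim pending As New CountdownEvent(2)
        For i As Integer = 1 To 2
            cache.GetValueAsync("hello", Sub(v As Integer)
                                             Console.WriteLine(v)
                                             pending.Signal()
                                         End Sub)
        Next
        pending.Wait()
        Console.WriteLine(loads) ' The loader should have run once for the shared key.
    End Sub
End Module
```

The unstarted task that loses the race is simply dropped, so no explicit cancellation is needed in this variant.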

I can then create my little HTML cache by using the AsyncCache class, as in:

Public Class HtmlCache

    Public Sub GetHtmlAsync(ByVal url As String, ByVal callback As Action(Of String))
        _asyncCache.GetValueAsync(url, callback)
    End Sub
    Private Function LoadWebPage(ByVal url As String) As String
        Using client As New WebClient()
            'Test.PrintThread("Downloading on thread {0} ...")
            Return client.DownloadString(url)
        End Using
    End Function
    Private _asyncCache As New AsyncCache(Of String, String)(AddressOf LoadWebPage)
End Class

I have no idea why coloring got disabled when I copy/paste. It doesn’t matter, this is trivial. I just create an AsyncCache and initialize it with a method that knows how to load a web page. I then simply implement GetHtmlAsync by delegating to the underlying GetValueAsync on AsyncCache.

It is somewhat bizarre to call WebClient.DownloadString, which blocks a thread for the whole download, when the design could be revised to take advantage of its asynchronous version. Maybe I’ll do it in another post. Next time, I’ll write code to use this thing.
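For the curious, here is one way the asynchronous version could be wrapped in a Task, sketched under assumptions rather than offered as the revised design: WebClient.DownloadStringAsync raises DownloadStringCompleted when it finishes, and a TaskCompletionSource turns that event into a Task(Of String). The module and function names (`AsyncDownloadSketch`, `DownloadStringTask`) are hypothetical.

```vbnet
Imports System
Imports System.Net
Imports System.Threading.Tasks

Public Module AsyncDownloadSketch
    ' Hypothetical helper: returns a task that completes when the download does,
    ' without dedicating a thread to waiting on the network.
    Public Function DownloadStringTask(ByVal url As String) As Task(Of String)
        Dim tcs As New TaskCompletionSource(Of String)()
        Dim client As New WebClient()

        AddHandler client.DownloadStringCompleted,
            Sub(sender As Object, e As DownloadStringCompletedEventArgs)
                ' Translate the event outcome into the task's final state.
                If e.Error IsNot Nothing Then
                    tcs.TrySetException(e.Error)
                ElseIf e.Cancelled Then
                    tcs.TrySetCanceled()
                Else
                    tcs.TrySetResult(e.Result)
                End If
                client.Dispose()
            End Sub

        client.DownloadStringAsync(New Uri(url))
        Return tcs.Task
    End Function
End Module
```

A cache built on this would take a loader of type Func(Of TKey, Task(Of TValue)) and store the returned task directly, instead of wrapping a blocking call in a new Task.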

Comments

It would be much easier to use a normal thread safe collection class.
Each element would have:
 key
 url (string)
 status (loaded, failed, waiting to load, partially loaded)
 last status change (date/time)
 html_loaded (string)
 last_referenced (date/time)
 can_timeout_and_be_deleted(boolean)
Class methods
  Get HTML from URL(boolean lookup_only = false, int max_block_seconds = 0 /* -1 block forever, 0 - don't block, otherwise block for X seconds*/)
  Get HTML from KEY(boolean lookup_only = false)
  Delete_entry(URL)
  Delete_entry(KEY)
A thread or threads internal to the class would load the html asynchronously and be invoked via a clock timer with ticks a few seconds apart.
Attaching a callback for each request is much harder to implement.  It is up to the method requesting the URL to decide whether it blocks, needs an asynchronous callback/interrupt or polls for data.
The idea is that for nearly all cases, no new threads should be created and no new callbacks should be hooked up.  This keeps your code easier to understand and debug.  Common faults and scenarios are handled easily:
 - requesting thread terminates
 - asynchronous load times out
 - error loading html
 - html hasn't been used for 5 minutes and can be removed (a tunable cache parameter)
 - memory limit of cache reached and unreferenced html strings can be removed (a tunable cache parameter)
 - duplicate request for a URL/KEY from more than one thread
 - html can be loaded from multiple sources (web, file, network share, ftp, database, etc.).
 - html load failed as html string exceeds the size limit on loaded string (e.g., a tunable cache parameter)
 - The common problem with attempting a callback for a method that is terminated is avoided.  That's a problem when the callback requires the cache to build a complex packet of data to pass in the callback.
This is quite similar to the basic page handling algorithm in a virtual memory system (circa 1980).  It's how one handled this in systems lacking real threading or with non-reentrant GUI message handling (VB6 GUI/MFC GUI posting a message to the current winform indicating that the asynchronous request completed).

Thanks Greg, these are good comments.
We have a different design goal though. Both solutions are valid. I want the method requesting the URL to have the flexibility of deciding what to do (aka have a callback). I do want the exposed API to be async.
The rest of your comments talk to the difference between writing production code and a conceptual example. I'm doing the latter here.

The idea of wrapping the asynchronous cache handler in a class is to reduce or eliminate the need for callers to be asynchronous.  This makes coding the caller's class much easier.
The other aspect is that the amount of work done in an asynchronous callback should be minimal since you don't know when it will be executed.  For example, you get a callback call with the HTML you need whilst you are destroying the caller's object.  This is more important when dealing with large amounts of data in each cache entry (e.g., large xml strings) since processing each cache entry may take considerable time.

