Welcome

This is the generic homepage (aka Aggregate Blog) for a Subtext community website. It aggregates posts from every blog installed on this server. To modify this page, look for the Aggregate skin folder in the Skins directory.

To learn more about the application, check out the Subtext Project Website.

Powered by Subtext

Blog Stats

  • Blogs - 3
  • Posts - 70
  • Articles - 0
  • Comments - 35
  • Trackbacks - 0

Welcome All

It's a wonderful conceit, isn't it? Scan in some paper form sent to you, and have the computer automatically read it and process it.

Unfortunately, that's all it is: a conceit. The reality is that reading handwritten text is hard. Typewritten text, on the other hand, is largely a solved problem, with up to 99% accuracy rates. Handwriting, well, not so much. Oh, it's possible to jimmy pretty good numbers out of the software, and with enough training it can get pretty accurate. Another way of improving accuracy rates is by being able to set a lexicon: a list of allowed words.

So, given that you can limit the words that people use, and given that you can force them to print (cursive is often still a problem), then yes, you can get some pretty impressive accuracy: 95% per word in some cases. However, this is usually not good enough for business; consider that a 95% per word accuracy translates to the software getting more than 3 of the words in this paragraph alone wrong.

For making decisions involving money, or people's health, that's just not an acceptable level of accuracy. So, it seems we are still wedded to manual capture where humans type out what other humans have written in. Or are we? Well, sure, you can't use OCR for the truly critical data on the scanned form, but you can use it for less critical data.

You may not want to use the OCR data for making business decisions, but you can use it as an aid to finding your documents. In many cases, your computerised business processes need only a subset of data on a form, with humans assessing the remainder. In such a case the process is simple: capture the critical data manually, index the rest using OCR, and use both sets of data to find your documents.

This gives you the best of all worlds: the accuracy that only human capture can currently provide, human processing limited to the bare minimum, and the ability to find the document based on all the data on the form.
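
To make the idea concrete, here's a minimal sketch of that kind of lookup over an in-memory index; IndexedDocument and the field names are illustrative, not from any particular product:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Hypothetical index entry: the critical field is captured by hand, the rest comes from OCR.
    public class IndexedDocument
    {
        public string AccountNumber { get; set; }  // manually captured - trusted for decisions
        public string OcrText { get; set; }        // OCR'd full text - used only for finding things
        public string FilePath { get; set; }
    }

    public static class DocumentSearch
    {
        // Exact match on the human-captured field, loose "contains" match on the OCR text.
        public static IEnumerable<IndexedDocument> Find(
            IEnumerable<IndexedDocument> index, string accountNumber, string keyword)
        {
            return index.Where(d =>
                d.AccountNumber == accountNumber &&
                d.OcrText != null &&
                d.OcrText.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0);
        }
    }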

Looking for a Document Management System? Signate 2010 is powerful, secure and easy to use.

Latest Posts

Great posts on Reflector

posted @ 1/24/2013 10:18 AM by Sean Hederman

Build vs Buy in Agile

I'm wondering if any Agile gurus out there have any insights. At one of my clients, they're doing an Agile proof of concept; they've selected a project, and an Agile partner (a big international Agile consultancy, one of the thought leaders in Agile). They've completed the Envisioning phase and are moving on to Inception; they're planning out stories, and all seems to be going well.

The problem is that, to my mind and to many others on the team, the project is an ideal candidate for the purchase of an off-the-shelf application. The Agile consultancy say that we'll only know that once we've written some code; that there are too many unknowns, and that we need some spikes.

So my worry is simple: this is one part of a much bigger project, and the ONLY way we'll ever have enough information is either by analyzing the whole thing (big design up front) or by analyzing it as we go until completion, which implies we've finished the project. I don't think it's a particularly good scenario to complete a large project and then go "hey, we should have bought this", which I personally believe is a 90% likelihood.

How do people out there handle the Build vs Buy decision in an Agile manner? Am I being paranoid in thinking that each line of code will be used to justify continuing down the path of building the software? Doesn't writing "spikes" and a working but simple implementation of the initial system imply that the Build vs Buy debate has been won by the very design of the Agile project? Are Agile projects, in fact, DESIGNED to answer "Build" to this question, and how does one fairly answer such a question without prejudicing either option?

posted @ 10/31/2012 6:40 AM by Sean Hederman

Storing Passwords

I cannot believe in this day and age that people still store passwords. There are two things you should NEVER store: passwords and credit card numbers. Ever. I mean it. Never ever. It is so simple to either not store them at all (credit cards), offload the security to a trusted third party (passwords and credit cards), or to hash and salt them so that you can recognize them if you see them again, but you cannot reverse what you have to get back to the original.

I've heard so many justifications of this storing of passwords, and they are all, without fail, crap justifications. There is NO good reason to store a password, and if your technology has something where you need to, then you have a massive security hole in your technology, and you should patch it immediately. Don't make laziness the reason you decided to put all your customers at risk.
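
To show how little work the right way actually is, here's a minimal sketch using .NET's built-in PBKDF2 implementation (Rfc2898DeriveBytes); the sizes and iteration count are illustrative:

    using System;
    using System.Security.Cryptography;

    public static class PasswordHasher
    {
        private const int SaltSize = 16;
        private const int HashSize = 32;
        private const int Iterations = 10000;

        // Derive a one-way hash from the password plus a random salt.
        // Persist only the salt and the hash; the password itself is never stored.
        public static void Hash(string password, out byte[] salt, out byte[] hash)
        {
            using (var derive = new Rfc2898DeriveBytes(password, SaltSize, Iterations))
            {
                salt = derive.Salt;
                hash = derive.GetBytes(HashSize);
            }
        }

        // Re-derive the hash with the stored salt and compare; the original can never be recovered.
        public static bool Verify(string password, byte[] salt, byte[] expectedHash)
        {
            using (var derive = new Rfc2898DeriveBytes(password, salt, Iterations))
            {
                byte[] actual = derive.GetBytes(expectedHash.Length);
                int difference = 0;
                for (int i = 0; i < actual.Length; i++)
                    difference |= actual[i] ^ expectedHash[i];
                return difference == 0;
            }
        }
    }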

Have a look at this article by Mike O'Brien on Guild Wars "security". He makes a number of very good and interesting points, but the whole article boils down to "we know what your passwords are" and he has NO justification for that. Having a corpus of "bad" passwords is a great idea. Creating them from your users' old passwords is very, very bad because it means that you read them!

He claims that they have not been breached, because they have "a team constantly monitoring for any signs of intrusion". Well that's cool. Except, clearly this team doesn't have the security chops to point out that he shouldn't be storing the passwords in the first place. Ooh, okay, let's rephrase that to a "team of amateurs" monitoring. Hmm, doesn't look so good now, does it? I've had real white-hat teams penetration test and rate my systems, and passwords stored in reversible format would ALWAYS be a "Critical" issue.

At the end he hand-waves away the importance of database breaches, because he's asked his users to use a unique password. However, he has no way of knowing if the user has actually complied. Worse, the user might have a system like "guildwz-password", where they just change a portion of the password for a given site (a good practice by the way). A password breach would still open that user up to attack, and for what benefit? What fantastic piece of usability has been enabled by storing passwords in a reversible format? Is it something worth putting your users' financial future at risk? Something worth risking the release of sensitive health data?

Because, make no mistake, if a hack happens and the user's password is taken and used maliciously, it is Guild Wars' fault, right down the line. No matter that they've asked users to use unique passwords. The reality is that they're taking a secret, and instead of hashing and salting it like they should, they have chosen to store it in a way which makes it available. For no good reason.

You CANNOT claim to have good security if you can find out, in any way, what your users' passwords are. If a breach happens you WILL be liable, because security experts will stand up in court, point fingers at you and accuse you of willfully ignoring security best practice. Best practice which is so easy to do that a junior dev could implement it in hours.

That said, linking to the XKCD comic is very cool, and I really like some of their innovations, but think of those as bright and shiny padlocks on a rusted and rickety gate. Every other innovation means nothing if you cannot get the most basic principles right.

Disclosure: I am responsible for application security at a stock exchange, but am not a security engineer, merely a development lead who is passionate about good security.

posted @ 10/6/2012 8:34 AM by Sean Hederman

How to report bugs to other developers

There are all sorts of techniques you can use to report bugs to another developer:
- Sit down with them and demonstrate your problem.
- Write a detailed bug report with your investigations.
- Fix it yourself.

Meeting
However, each of these techniques has issues. Meeting with the other developer can be tricky. You may have conflicting schedules or time zones. You might have to go through corporate bureaucracy, and there could be language or understanding issues.

Bug reports
Writing a detailed bug report has the advantage that your schedules are no longer an issue. But what if he wants more clarification? What if he disagrees? Must he now write an equally detailed report in return? The next thing you know, you're filling the world with memos, and that never ends well.

Fix it yourself
Fixing it yourself (AKA the Open Source Tao) is great in theory. Of course, it assumes that your deadlines can move out while you learn a new, highly complex system; it also assumes that the community will assist you in gaining this knowledge.

None of these techniques have the simple power of my preferred solution:

Write a failing test (or tests)
It doesn't require that your schedules are in sync. It is clear about what the problem is, and will greatly assist the other developer in finding the issue. It is unambiguous; in fact, the only ambiguity is around WHY you would need to do this.
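
For what it's worth, a failing-test report can be as small as this MSTest sketch; Invoice and CalculateTotal are hypothetical names standing in for whatever the defect actually touches in the library being reported against:

    using Microsoft.VisualStudio.TestTools.UnitTesting;

    [TestClass]
    public class InvoiceTotalDefect
    {
        // Fails today, and states exactly what the reporter expects to happen.
        // Once it passes, the bug is fixed and the test stays behind as a regression guard.
        [TestMethod]
        public void Total_Includes_Tax_For_Foreign_Invoices()
        {
            var invoice = new Invoice { Amount = 100m, TaxRate = 0.14m, IsForeign = true };

            decimal total = invoice.CalculateTotal();

            Assert.AreEqual(114m, total, "Tax should also be applied to foreign invoices");
        }
    }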

Look, sometimes the WHY is important. Sometimes it indicates a deeper misunderstanding rather than an issue. In most cases however, spending time arguing about the WHY is futile, and the best solution is to focus on the WHAT.

There are other added advantages too. The tests become part of the corpus of tests, and thus you're assured that your use case is protected. If they have a good tracking system around tests, they might even be able to notify you when a breaking change is coming down that might affect you.

Finally, since your tests constitute the acceptance criteria for the defect, they'll know when it’s resolved. With no ambiguity. It does mean that you need to ensure your tests are sufficient as acceptance tests, but isn't that a LOT easier than writing reams of documentation?

Also, the acceptance tests don't become outdated. They're either in the test corpus or not. Sure, they may be removed, they may be modified, but there's no doubt about whether they're still covered by the current version of the solution. If they're still in the test run, they're still part of the solution. Compare that to reading a bug report about a defect fixed 3 years ago. Is the fix still there? Has it been superseded by new functionality?

Oh, and don't forget that if it's an understanding issue, all the fixing developer needs to do is fix the test, and return it. A crystal clear demonstration of the mistake made. And what if you want to fix it yourself? Well, your tests are your route out of your rabbit hole. Once they pass, you know you can extricate yourself (although you may not want to).

In fact, I strongly believe that ALL defects should be treated in this manner. The first step of the triage process is to write the failing test(s) and link them to the bug report. Once the tests pass, the bug is marked as fixed, and once that build is deployed, the reporter is notified of the fix. All automatically.

posted @ 9/13/2012 7:57 AM by Sean Hederman

Storing Millions of files in a File Structure

A recent question on HighScalability.com was “How Do I Organize Millions Of Images?”. The asker had found that storing files in a database was inefficient, and wanted to know what scheme he should use to structure the files. I started writing this as a comment, but decided to do it as a blog post instead. The questioner is moving in the right direction; databases are a very poor place to put large numbers of files from a performance perspective – although don’t discount the convenience of this.

So, the question to ask is this:

How many file system entries can a folder efficiently store?

I did tests on this a couple of years back, and on Windows at that time the answer was “about a thousand”. Okay, so that implies that we must go for a tree structure, and each node should have no more than about a thousand child nodes. This implies that we want to keep the tree nice and balanced. Having a huge tree structure with 90% of the files distributed into 5% of the nodes is not going to be hugely helpful.

So, for a million files, with a perfectly even distribution, a single folder level is sufficient. So, that’s one “root folder” containing 1000 child folders, each containing 1000 files. For simplicity’s sake, I’m going to assume that only “leaf” folders will store files. Okay, so that will efficiently store about 1,000,000 files. Except, he wants to store millions of files. Okay, so that implies that either we accept more entries per node or we increase the tree depth. I’d suggest more entries as the starting point to consider; my “1000 entry” testing is a bit outdated.

So, a 2-level structure (a “root folder” with 1000 folders, each containing 1000 folders, each containing 1000 files) gives us a nice even billion, 1000^3, assuming an even distribution. That last part is the tricky part. How do we assure even distribution? Well, the simplest method would be to generate the folder names randomly using a pseudo-random number generator with even distribution, so probably a cryptographically secure one. Some of the schemes suggested in the comments ranged from generating GUIDs to generating SHA-1 hashes of the files. Some of them may work well; I’ve personally used the GUID one myself to good effect. But a GUID does not guarantee good distribution, and it might bite you, badly.

Using a hash function is cute, though it limits you to a folder size of 256 nodes, which implies a deeper folder structure. Additionally, it means you must hash the file as part of generating the file location. But, um, if you’re looking for the file, how do you hash it? I assume you store the hash somewhere; this is good for detecting tampering, and if you are doing this or plan on doing this, then this seems like a good approach. Unfortunately it is inefficient compared to our “ideal” 1000 nodes per folder. As the commenter points out, one other benefit is that if the same image is uploaded multiple times, the same file path will be generated. The problem with this approach is that the commenter is incorrect when he says that SHA-1 does not have collisions; there is in fact a theoretical approach to generate collisions for SHA-1, and NIST suggests that its use for name collision avoidance should be stopped by Federal agencies. So, maybe SHA-2? Well, it is built on a similar base to SHA-1, so it’s possible a collision attack could be found – although one hasn’t been found yet. Oh, and why should we worry about a collision attack? Because person A uploads a photo of her wedding and person B uploads some porn – and person B overwrites person A’s photo.

The technique I’ve used many times is the GUID one, and it works well in most cases. The random number generator approach I’ve used for larger systems, using random numbers for folders, and a GUID for the file name. The hashing approach is very interesting. I think I might have to give it a try in a year or two when I have some spare time. I’d want to modify it to have a few thousand nodes per level, rather than just 256; and I’d want to handle collisions – but it has some really nice emergent features; and it makes good use of the hash I always store for file verification.
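
A minimal sketch of that random-folders-plus-GUID scheme; the folder width, format strings and file extension are illustrative:

    using System;
    using System.IO;
    using System.Security.Cryptography;

    public static class StoragePathGenerator
    {
        private const int FoldersPerLevel = 1000;
        private static readonly RandomNumberGenerator Rng = RandomNumberGenerator.Create();

        // Two evenly-distributed random folder levels plus a GUID file name,
        // e.g. "0482\0917\3f2504e04f8911d39a0c0305e82c3301.bin".
        // The generated path is stored against the document record and simply read back on lookup.
        public static string GeneratePath()
        {
            string level1 = NextFolder().ToString("D4");
            string level2 = NextFolder().ToString("D4");
            string fileName = Guid.NewGuid().ToString("N") + ".bin";
            return Path.Combine(Path.Combine(level1, level2), fileName);
        }

        private static int NextFolder()
        {
            var buffer = new byte[4];
            Rng.GetBytes(buffer);
            // Reduce the random value to the folder range; the bias at this scale is negligible.
            return (int)(BitConverter.ToUInt32(buffer, 0) % FoldersPerLevel);
        }
    }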

I haven’t touched on the approaches of segregating based on user ID and similar; since in my case where I need to store millions of files for a single company, this doesn’t apply. It may well apply quite nicely to your needs however.

Here are some simple rules to live by:

  • DO Compress stored documents – this will save you huge amounts of storage space
  • DO Compress transmissions – this will save you huge amounts of bandwidth, and speed up downloads
  • DO Support HTTP Resume – this will save you huge amounts of bandwidth
  • DO NOT Store large numbers of BLOBs in a database. If anyone tells you to do this, then they haven’t handled large numbers of binary documents. This always seems like a good idea at the time, and never is. Seriously. NEVER.
  • DO Separate your path generation logic from your path lookup. In other words, don’t replicate your path generation on lookups. Rather store the generated path, and just read it. This allows you to move files around if you need to, rebase them, change your algorithm – a whole bunch of things.
  • DO NOT use MD5 for anything. Ever. No, not even path generation.

posted @ 6/22/2012 7:16 AM by Sean Hederman

I just don’t know what to say

Microsoft have confirmed that they will be pulling the ability to make non-Metro apps from Visual Studio 11 Express, and are pulling the free compilers from the .NET SDK. So, even if Metro was definitely the way of the future, and that is by no means certain, this would be a stupid, an appallingly stupid decision. Did all Windows XP, Windows 7, and XBox machines just disappear off the Earth because some nitwits in MS came up with the brilliant idea to throw out the Windows metaphor? Hell, console apps are still around, and still useful.

With this move Microsoft is continuing its evolution from “Developers, developers, developers” to “screw developers, screw developers, screw developers”, which has always been Apple’s mantra. Of course, Microsoft is a stagnant company run by the worst CEO in the world; and Apple is the most valuable company on Earth; so when Apple say “screw developers” many devs don’t have a choice but to stay with them. With Microsoft we do. We’ve seen the way this game is playing out: a sane company, when confronted with a massive competitor whose sole advantage is their number of apps, would focus on encouraging developers to join their platform. Microsoft are so focused on getting customers and developers to Metro that they’re willing to sacrifice all of them to get 10% of them to Metro. This has been clear for some time. They’ve given up on the corporate market, now they’ve given up on the indie market – all in the vain hope that a few percent of the customers and developers they’re pushing away from Windows will move to Metro.

My company, Palantir, and the clients we consult at are all beginning a hard re-evaluation of our commitment to Microsoft technologies. The biggest client I consult at flat out told me that they would like to investigate moving away from the Microsoft space as they don’t see a future in it; they’ve already said that they’ll not be rolling out Windows 8 and they’ve now allowed Android and Apple tablets when they found out Windows 8 ARM won’t support Active Directory or Group Policies. Note that it was not any of Microsoft's competitors that caused this re-evaluation – it was the non-support for management policies in ARM Windows, and the poor support for the (I guess in MS’s mind outdated) mouse and keyboard combination.

But worst of all: one of my friends has a young nephew who wants to make games. Talented kid, really talented. Right now he can, using Visual Studio 2010 Express and XNA, but are his parents really going to shell out $500 for a hobby? No, he’ll switch to writing games for iOS and OSX of course, just like any sane person would do.

I am getting sick and tired of watching Microsoft slit their throats again and again. Because when they do that, they’re slitting my throat too – and I’m getting bored of appallingly stupid, fearful, and short-sighted decisions. Gosh, I hate Objective-C. Compared to C# it’s a nightmare. But the business case gets more and more compelling with every misstep Microsoft make.

FIRE STEVE BALLMER NOW, before he destroys even more of a once vibrant company.

posted @ 5/26/2012 6:24 AM by Sean Hederman

New Toys

I’ve been hectically busy the past few days, weeks, months. Working at my current client is proving to be a bit of a singularity to my personal life. I made the joke the other day that “work at <client name> isn’t a job, it’s a lifestyle”. That said, I have been getting to play around with some pretty cool new toys. Microsoft StreamInsight and Reactive Extensions probably top the list in the direction of new capabilities. You know that you’re playing with big boys when you’re discussing the latency requirements on processing streams of thousands of events a second. Oh and throw in a few million rows in your database (and, no, eventual consistency is not good enough here), and things start getting exciting. And scary.

Anyway, one of the other toys I started playing around with is a new SSD hard drive for my new (work allocated) HP Core i7.

Oooh, shiny.

Booting up and starting Visual Studio complete so fast that I don’t have time to even think about getting coffee. I’m now loaded down with pretty much every extension there is, and Visual Studio is still snappy and responsive. I find myself pressing Ctrl-S more than once because the save shows no impact or hourglass cursor, none, and I get paranoid that it didn’t do anything. I’d now add an SSD to every developer’s “must-have list”. I calculated that it just needs to save you 3 minutes a day to pay for itself. My guess is that it saves me about 30 minutes to an hour a day; so its ROI is about, umm, 1400%. Not a bad purchase then…

Some more things to add to the Christmas stocking:

  • .NET Demon – I’m an unashamed fan of Red Gate, and have had at least one of my clients fork out for their .NET Developer Bundle. ANTS Performance Profiler may not be cheap; but just consider it the entry fee to effective profiling. Anyway, .NET Demon: downloaded it, installed it, love it. I no longer press Build for anything, and my tests run silently the whole time while I’m working, giving me instant feedback. The biggest problem with MSTest is the huge amount of time it takes to execute the tests – that really slows down the Red-Green loop for unit testing. .NET Demon makes all that go away. If I have one complaint: I’m a bit lazy and it’s a mild annoyance to have to go and enable a new test after I’ve written it. Oh, the humanity! Get it. Get it now. I’m not sure of its effect on a slower machine, but on the laptop (with CPU cores and IO to spare), I don’t even notice it doing its thing in the background.
  • QUnit – For a project for Palantir I’ve been wanting some unit testing on my Javascript. I played around with a few options, mostly trying to be able to build unit tests into MSTest with things like Jint, and trying to avoid browser-based test suites. Big mistake. The other options just plain didn’t work, and QUnit worked first time, no issues. One nice effect of browser-based tests is that I can test for individual browser quirks. Get Chutzpah to make your QUnit integrate seamlessly into Visual Studio. I’m still working on having the QUnit tests run as part of my MSTest run. Shouldn’t be tricky, just shell the Chutzpah console app, I figure.
  • JSCoverage – Old but good. Allows me to instrument and run my Javascript code through QUnit and get code coverage results. Have no idea how I’m going to integrate this into MSTest though.
  • Internet Explorer 9 – Yeah. I feel a bit sick for even mentioning this; I’ve hated IE for more years than I care to remember. However, I found one thing it does better than Google Chrome: debugging QUnit tests. Chrome, for some reason, doesn’t want to show Scripts for files displayed through the file:// URL; IE is happy to. I can start up a little server and debug through there, but I’ve actually been pleasantly surprised with the Developer Tools in IE9; they seem to work just fine. So far, he mutters darkly to himself. Anyway, kudos to Microsoft for supplying a positive surprise for a change.
  • iPad 2 – When I first got mine (as part of FNB’s special offer), I stuck to my belief that its killer app was reading web pages whilst on the toilet. Women, don’t laugh, that is a powerful and compelling feature. Then I got excited about all the apps and bought tons (because Apple are scum, you have to pretend you live in the USA and buy vouchers in order to get decent apps). My favourite apps are Tweetbot, Alien Blue, Flipboard, Mindjet, Flixster, Kindle and Remember the Milk. After a month or two, I realized that most apps aren’t nearly as useful as you think they are. The apps I use in decreasing order are: Safari, Mail, Calendar, Memo, all the others. With Safari being used more than 70% of the time. It’s a glorified portable web browser. But it does it really, really well. It’d be really nice if it could act as a WiFi hotspot but it can’t. Because of the aforementioned “Apple are scum” thing. They might make nice products, but you just know that if they could get away with it, you’d be in a bathtub missing your kidneys.
  • Kindle – Another device I was somewhat dismissive of before I got one. Love it. Simple as that. No caveats. When I first got my iPad I was quite clear that if my house was burning down and I had to choose between the iPad or the (much cheaper) Kindle, I’d pick the Kindle. Now, I’m not as certain. Considering it costs about a quarter as much, that’s still a powerful argument. Not everyone loves reading as much as I do, so YMMV.

Can I just say again how surprised I am that Internet Explorer made it on to a “good things” list that I compiled? I think the guys at my client will fall off their chairs. They’re used to my refrain that “there’s never any reason to run IE”. Of course, one argument might be “too little, too late”, but still.

Impressed.

Now, just make Metro stop sucking so bad, add Active Directory support to the ARM version, and you might still be in this game.

posted @ 5/22/2012 6:14 AM by Sean Hederman

ORMs a leaky abstraction?

I'm starting to come to the conclusion that ORMs, well, specifically the Microsoft ORMs (if you can call them that), are a leaky abstraction. Most tend to have some or all of the following underlying (incorrect) assumptions:

  • The data being dealt with has one and only one representation for each entity
  • Data is read in row by row, or set based; but is only ever updated, inserted, or deleted row by row
  • Updates are to the entire object, with no mechanism for partial updates, e.g. UPDATE Invoice SET Status = Cancelled WHERE CustomerID = X (see the sketch after this list)
  • Entity objects are internal to the data access layer; thus there is no necessity to allow customization of the attributes or data types used.
  • Entity objects are always connected, and this can be change tracked ("Internets, you say? Speak up sonny, I'm not understanding your point!")
  • It is okay to accept shoddy performance on the grounds that "it's only an ORM" and you should be able to drop down to "raw" ADO.NET for REAL work. Query hints? Isolation levels? Schemas? Don't be silly, no-one uses those things.
  • Stored procedures and functions are something disconnected from an ORM, and to be treated as an add-on, not something first-class.
  • Logging, Instrumentation and sometimes even security are not something to consider.
  • There's no need to consider extensibility, people must just wait for new database features to get given support in the ORM, if ever.
  • It's okay to support a huge ugly designer working off one file, because all REAL projects have the database schema controlled by ONE person.
  • It's okay to require usage of your specified collection class and/or require virtual properties in your "POCO" option, because nothing says "you're in control" like forcing stupid design decisions.
  • There's no need to support a "code-first" model where the DB is generated from the DAL entities, and needs to handle change scripts.
  • Constraints? Indexes? Why would our query language need to consider those?

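To illustrate the partial, set-based update point from the list above, here's a rough sketch of the gap, assuming a hypothetical Entity-Framework-style context; BillingContext, Invoices and InvoiceStatus are illustrative names, not a real API:

    using System.Linq;

    public class InvoiceCanceller
    {
        // What the database can do in one set-based statement:
        //   UPDATE Invoice SET Status = 'Cancelled' WHERE CustomerID = @customerId

        // What the ORM typically forces: load every entity, change it, save each one back.
        public void CancelInvoices(int customerId)
        {
            using (var context = new BillingContext())              // hypothetical DbContext
            {
                var invoices = context.Invoices
                                      .Where(i => i.CustomerID == customerId)
                                      .ToList();                    // materialises every matching row

                foreach (var invoice in invoices)
                    invoice.Status = InvoiceStatus.Cancelled;       // change tracked per entity

                context.SaveChanges();                              // one UPDATE per row, not one statement
            }
        }
    }
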
When are we going to get a proper ORM? I've given up on the MS ADO.NET Team ever delivering, they're too busy writing tools for data development the way we did back in 2005.

Where are the products that provide the requirements MS don't even understand?

posted @ 3/11/2012 8:33 AM by Sean Hederman

Performance Myth: Managed languages are slow

"[When comparing language performance] what's really evaluated is the skill of the compiler writers, not the languages themselves"
- Fahad Gilani

There are a lot of reasons for this myth, but basically they all boil down to two main misconceptions. The first is that managed languages are somehow interpreted, which they're not. The second misconception is a little more accurate, and has to do with the impact of the garbage collector.

Managed code (.NET and Java at any rate) runs on the CPU as native code, just like your handcrafted C++. Now, generally there is a step before that happens, called the JIT (Just In Time Compile) where the .NET MSIL or Java bytecode is compiled to machine code. This compile has to be blazingly fast, in order to ensure that the user isn't faced with stalled user interfaces as the compiler runs around doing its job. Because of this, the compiler cannot perform some of the more advanced optimisations that C and C++ compilers do.

However this is offset by the fact that the JIT compiler is running on the actual machine being used. Normally, a compile runs on a similar machine to the target computer, not the exact one. This means that theoretically, the JIT compiler can perform optimizations that most other compilers can't, at least not without massively limiting the potential install base. In addition, it's theoretically possible for the JIT compiler to recompile sections based on observed usage patterns in the application, providing even more performance. Of course, these two possibilities are currently just theoretical, but it's important to see ways in which a JIT-compiled system could approach or exceed static compiled code in performance.

The .NET runtime does perform a great many checks in order to ensure your application runs predictably. Array bounds checking and overflow trapping are just two of the safety features provided. These features consume processor cycles. However, I'm not convinced that this amounts to poor performance. If C++ code were written completely robustly, it would have many of these checks too. In any case (in C# at least) you can run in unsafe mode, with many of these checks avoided. Beware.

The garbage collector allows us to pretend that memory is easy to manage, and to willy nilly create tons of small objects in the (almost) certain knowledge that they won't waste memory and they won't cause heap fragmentation. This is probably the most difficult concept for C and C++ programmers to get used to. They mutter about deterministic finalization, and the various pointer mechanisms they use, and miss the most important and interesting facts from a performance perspective: no fragmentation and fast allocations.

In traditional memory management, the allocation routines maintain a linked list of available chunks of memory. When you need some memory, you walk that list looking for the smallest piece that will satisfy your requirements. This is not a cheap operation, which is why many low latency systems allocate chunks of memory up front for more controlled and performance critical allocations.

In .NET (and I assume Java too), allocating memory boils down to a very simple operation: move the heap pointer N bytes further along. It's marginally more complicated than that, but not by much. This is one of the reasons managed programmers are so free with object allocations; they're cheap by comparison.
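
If you want to see how cheap that pointer-bump allocation is for yourself, a rough micro-benchmark along these lines will do; the object shape and the count are arbitrary:

    using System;
    using System.Diagnostics;

    class AllocationDemo
    {
        private sealed class Node { public int Value; public Node Next; }

        static void Main()
        {
            const int Count = 10000000;   // ten million small objects
            var timer = Stopwatch.StartNew();

            Node last = null;
            for (int i = 0; i < Count; i++)
                last = new Node { Value = i };   // each allocation is essentially a pointer bump;
                                                 // the previous Node becomes gen-0 garbage almost immediately

            timer.Stop();
            Console.WriteLine("Allocated {0:N0} objects in {1} ms", Count, timer.ElapsedMilliseconds);
            GC.KeepAlive(last);
        }
    }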

The second interesting aspect is the defragmentation the garbage collector performs. In a C++ application if you allocate a bunch of objects, some of which are deallocated quickly and the rest surviving, you will find that the surviving objects are scattered around the heap. In .NET this is not the case; after a garbage collection the longer lived objects will all tend to be close together (closer than they were, at any rate).

This leads to some interesting side effects. Theoretically it means that .NET apps should have better cache coherency than C++ apps do, at least by default.

Of course, they also have the impact of the garbage collector on their execution times. Mostly, they're no longer "stop the world" collections, but their performance impact is undeniable. Of course, you need to deallocate memory in non-managed languages too, and often the deallocations occur even when the machine has plenty of memory available. One of the strengths of the garbage collector is that it deallocates when it makes sense from a memory pressure and CPU perspective. Unfortunately, that's also its weakness, as you never really know when that will be; it could be just as a critically important action needs to be taken. In most cases, the impact is small and infrequent enough to not matter, but for some use cases, it really, really does.

So, in conclusion, we've seen that managed languages run "raw" on the CPU, that some can reduce their runtime checks, that JIT could theoretically provide similar performance benefits to static compilation by being more machine-targeted, that GC memory is fast to allocate, and should tend towards improved cache coherency, although it has small, but unpredictable impacts on execution.

Does this mean it's possible to write truly high performance managed code? Yes, it does. Fast enough to compete with C and C++? Well, fast enough to make them break out the advanced compiler options, yes, certainly. Over the next several months I'll be exploring some of the things we can do to improve the performance of .NET applications.

posted @ 2/13/2012 10:51 PM by Sean Hederman

5 Reasons to Unit Test now

  1. If you're not testing it now, you won't have time to add tests later
  2. If you do have time to add tests later, you won't remember all the context
  3. If you don't remember the context, you'll just test code paths instead of actual use cases
  4. If you think about testing as you write code, the code tends to be more decoupled and you test more "non-obvious" scenarios
  5. If you unit test everything, and there's code still not being hit, then maybe you can get rid of it

On a Top Secret Project we’re working on, we’re aiming for 100% code coverage. Does that mean that all code gets tested? No.

What it means is that when we find code we can’t test, and we’re happy with that, we mark it with the ExcludeFromCodeCoverageAttribute. So it’s easy for us to see how we’re doing on our goal (currently at about 80%).
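
For anyone who hasn't used it, marking the exempt code is a one-liner; the class and method here are just an illustrative example of the sort of thin wrapper we tend to exclude:

    using System.Diagnostics.CodeAnalysis;

    public static class ProcessLauncher
    {
        // Reviewed and deliberately excluded: launching the real OS shell isn't something
        // we can exercise meaningfully in a unit test.
        [ExcludeFromCodeCoverage]
        public static void OpenInBrowser(string url)
        {
            System.Diagnostics.Process.Start(url);
        }
    }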

BUT, we do go back and review those ExcludeFromCodeCoverageAttribute decisions from time to time. There’ve been a few cases where we’ve decided to revoke that decision.

So you have:

  • Code that is covered by unit tests (known as “good code”)
  • Code that is excluded from code coverage (known as “exempt code”)
  • Code that is not covered (known as “bad code”)

I guarantee that 90% of your bugs come from bad code, and most of the remainder come from exempt code. Exempt code has at least been evaluated, bad code has not.

posted @ 2/13/2012 3:09 PM by Sean Hederman

Reflector.NET Add-Ins Gallery Active

The guys at Red Gate have set up an Add-Ins gallery where you can browse the active Reflector Add-Ins and tools.

Check it out, and add tools and add-ins you’d like to see highlighted.

My Reflector.Diff 2 add-in is highlighted there as well.

posted @ 10/25/2011 2:34 PM by Sean Hederman

IoC, extension methods and logging

One thing I always need is nicely instrumented and logged code. However, I don’t want to be setting up performance counters and log files in my unit tests. So, how do I make logging statements which are nice and injectable? Well, clearly to start, we need an interface that can be injected into our class that needs logging:

    public interface ILogging {
        void Write(TraceLevel severity, string message);
        bool IsEnabled(TraceLevel severity);
    }

Simple enough. Now, the point of all this is that if the Logging isn’t injected, then we shouldn’t log. Okay, so our code (assuming the ILogging property is called Log) looks something like this:

    if (Log != null && Log.IsEnabled(TraceLevel.Error))
        Log.Write(TraceLevel.Error, string.Format("Security '{0}' has been suspended", ticker));

Meh. Ugly. Imagine having to do THAT everywhere. But we can make an extension for it. In fact, one of the nice things about extension methods is that you can call them on null objects, because they’re not really instance methods; they just look that way. So, given the extension method:

    public static void Error(this ILogging log, string format, params object[] args) {
        if ((log != null) && (log.IsEnabled(TraceLevel.Error)))
            log.Write(TraceLevel.Error, string.Format(format, args));
    }

We can now write the MUCH simpler:

    Log.Error("Security '{0}' has been suspended", ticker);

And if Log is null, nothing will happen, courtesy of the log != null check. Needless to say the implementation of ILogging could be using log4net or System.Diagnostics or your own scheme. It doesn’t actually matter. If you need context passed in to the logging, then one way would be to pass it in when the ILogging instance is constructed by your IoC container. I’m sure you could figure out other ways to slip context in to it.

Now, one of the places I use this mechanism is on a single stock futures pricing system with massive volumes and required latencies in the millisecond range. So we really don’t want to incur any costs we don’t absolutely have to. Some logging has somewhat expensive operations to determine the parameter lists; consider, for example:

    Log.Verbose("Stock '{0}' listed on {1:d} and has a market cap of {2:c}",
        ticker, GetListingDate(ticker), GetMarketCap(ticker));

This is a somewhat contrived example, but you can see where I’m going. I don’t want the costs of pulling the listing date and market cap every time I pass the Verbose logging call, which most of the time probably isn’t enabled. We could maybe cache the listing date up front; but market cap can change second by second so it can’t be cached. So what are we to do? Well, it’s easy enough: add an extension which lazy-loads the format arguments:

    public static void Verbose(this ILogging log, string format, Func<object[]> args) {
        if ((log != null) && (log.IsEnabled(TraceLevel.Verbose)))
            log.Write(TraceLevel.Verbose, string.Format(format, args()));
    }

So, now our contrived example becomes:

    Log.Verbose("Stock '{0}' listed on {1:d} and has a market cap of {2:c}",
        () => new object[] { ticker, GetListingDate(ticker), GetMarketCap(ticker) });

Slightly more complicated, but it now will only execute the expensive operations when absolutely necessary.

So there we have it, a nice injectable logging wrapper that can handle being completely disabled, can be unit tested against and mocked out, and can handle all logging scenarios I’ve needed since I came up with it about 6 months ago.
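
And because ILogging is just an interface, testing around it is trivial. A rough sketch with a hand-rolled fake, assuming the Error extension above lives in a static class that's in scope; FakeLogging and the ticker value are illustrative, and any mocking framework would do just as well:

    using System.Collections.Generic;
    using System.Diagnostics;
    using Microsoft.VisualStudio.TestTools.UnitTesting;

    public class FakeLogging : ILogging
    {
        public readonly List<string> Messages = new List<string>();
        public bool IsEnabled(TraceLevel severity) { return true; }
        public void Write(TraceLevel severity, string message) { Messages.Add(message); }
    }

    [TestClass]
    public class LoggingExtensionTests
    {
        [TestMethod]
        public void Error_Formats_And_Writes_The_Message()
        {
            var log = new FakeLogging();

            log.Error("Security '{0}' has been suspended", "AGL");

            Assert.AreEqual("Security 'AGL' has been suspended", log.Messages[0]);
        }

        [TestMethod]
        public void Error_On_A_Null_Logger_Does_Nothing()
        {
            ILogging log = null;

            log.Error("Security '{0}' has been suspended", "AGL");   // no NullReferenceException
        }
    }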

posted @ 10/14/2011 6:06 AM by Sean Hederman

Apologies for being so quiet

Okay, the last year or so has been insane. On the personal side I got married last month, and anyone will tell you that doing that can suck up 6 months plus of your life. Couple that with some exciting work that’s been sucking up most of my waking hours and you can see why there’s been little time for blogging or much else.

Anyway, I’ve resolved to sort that out ASAP. So, here’s my plan: I’m going to carry on blogging and stick to strictly technology based posts from now on. My screeds about strategy vis a vis Microsoft, Apple etc seem to be popular based on page views, but to be honest they’re not why I started blogging.

posted @ 10/11/2011 9:16 PM by Sean Hederman

Windows 8 preview: Yawn

So Microsoft are releasing little tidbits of info about Windows 8. As I anticipated, there’s nothing ridiculous like dropping support for .NET. Instead, they’re raving about their funky new tile-based UI, and how great it’ll be to write apps for it in HTML5 and Javascript. So, .NET isn’t dead; it’s an equal partner, like how VB.NET is an equal partner to C#. Great in theory – absolute drek in reality.

So, the new tile-based UI is called Metro, and apparently we’re all supposed to switch to it. Except…why? I mean, it’s pretty; and it seems to work well by all accounts, but I can’t see why I’d rewrite my current “legacy” (or “Desktop”) Windows applications to use it, unless I wanted them to use a touch interface.

To be honest, as I mentioned before, I see no reason to write HTML5 and Javascript applications for Windows. If I’m using those; I’m going to do it as a nice cross-platform web application. I mean; why would I target Windows only? In what mad world would I take the pain of Javascript for the dubious advantage of targeting only Windows?

I think we all need to stand up and give the Windows Team a round of applause. They’ve managed to accomplish what Linux, Google, Apple and the rest haven’t managed to do yet. They’ve managed to make Windows irrelevant and simultaneously annoy and frustrate a large percentage of the developers who build the apps that make Windows so popular.

This is a strategic blunder the likes of which will be studied in business schools for decades.

posted @ 9/13/2011 8:57 PM by Sean Hederman

Metadata Changes & Versioning

Daniel Antion has an interesting and well-thought-out article called “Can Records Change” at the Association for Information and Image Management. His question concerns what we do about changes in the data about a document, or metadata. I’m thrilled about him bringing up this topic, because it’s one I’m passionate about. Let’s think about some of the reasons this information might change, and maybe we can shed some light on his question:

  • The underlying document changed. This is probably one of the most common reasons for metadata changing; people make changes to documents all the time. The contents may have been modified; the subject could have been modified; authors added; review information changed and so on.
  • Linked information changed. This is less common, and many document management systems don’t handle it correctly or at all. Consider a situation where we link to a Person record on our line of business system. We may store some of the fields from that record in the document management system, such as Surname or City – things that may make it easier to find the document down the line. So, we capture an Application form for a “Ms. Jones”, but 6 months later we find out that she’s got married and her new name is “Mrs. Smith”. Do we leave the original record data as it is? Curse ourselves for storing Line of Business data in our DM system? Change the data, accepting that a search for “Ms. Jones” now won’t find a document that plainly says “Ms. Jones” on it?
  • Information captured incorrectly. Depressingly common; we obviously want the correct information. However, our auditors and lawyers will possibly also want the original metadata; especially if processing or business decisions were made using that information.
  • Extra information added. Our processing workflow might well add metadata to the document; storing information about the processing steps undertaken; approvals gained; signatures affixed and so on. This doesn’t change the original document or metadata but must be accessible as well.
  • Our metadata schema changes. This is also depressingly common, where we change what fields can/must be captured against a document type. Much as we all like to think we can plan perfectly, and much as our clients love to believe they understand their requirements fully, the truth is different. Think about a scenario where we’ve been in operation for 3 months when the client comes in and tells us that they need a “Category” field added to the document type. Great; we can add it, but what about the existing documents that don’t have it? Does this mean that we have to add it as an optional field? In too many systems the answer is yes. Now, a couple of months later they change their mind. “Get rid of it”, the client commands. What happens to the documents captured with the data? If we restored the field sometime in the future, would their data have been lost? Again, too many systems have “yes” as the answer to that question.

Okay, so now we’ve had a look at some of the reasons that the document can change, we can see some requirements coming out. Our hypothetical metadata system must keep a version history; and must keep it in such a way that previous versions data is still accessible in searches. Needless to say audit information about who, what, when, why must be stored against each metadata change. The system must be flexible to schema changes, allowing fields to be added later - even if mandatory, as well as allowing them to be removed and even restored.

Additionally, when we keep a version history, we must also consider whether we want a bitemporal system: a system which not only stores what did happen, but also what should have happened. For example, we only updated “Ms. Jones” to “Mrs. Smith” yesterday, but she sent us the documentation 2 months ago and we should have done it then. A bitemporal system caters for such a situation, allowing you to see both the “Operational Truth” of how events actually occurred and the “Business Truth” of how events were supposed to happen.
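
As a rough sketch of what one bitemporal metadata version might carry; the property names are illustrative, not from any particular product:

    using System;

    // One version of one metadata field on one document.
    public class MetadataVersion
    {
        public Guid DocumentId { get; set; }
        public string FieldName { get; set; }       // e.g. "Surname"
        public string Value { get; set; }           // e.g. "Jones", later superseded by "Smith"

        // Business truth: the period the value was supposed to apply to.
        public DateTime ValidFrom { get; set; }
        public DateTime? ValidTo { get; set; }      // null while still current

        // Operational truth plus audit: when we actually recorded the change, who made it, and why.
        public DateTime RecordedAt { get; set; }
        public string RecordedBy { get; set; }
        public string Reason { get; set; }
    }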

As you can see, what seems like a simple topic of changing information becomes complicated very quickly. It’s important that your document management system handle these complexities in an intuitive manner. Almost every system I’ve ever seen falls over when it comes to metadata. The most usual reason is that most systems are designed around their underlying database; and that database doesn’t handle one or more of the scenarios I’ve outlined above. For example, a relational database like SQL Server can’t cater for schema changes correctly without a great deal of work that frankly isn’t worth the effort. Other systems use a more hierarchical store which handles the schema changes nicely, but struggles with efficient bitemporal access and most importantly tend to have rotten performance.

Do you know of other systems that can efficiently handle all of the above reasons for metadata changing? What about scenarios I’ve left out?

Want to change your metadata reliably, accurately and quickly? Signate 2010 handles all of the above scenarios well due to its unique and innovative design.

posted @ 7/19/2011 8:34 PM by Sean Hederman

Apple and the Commoditization of Developers

You have to hand it to Apple; they certainly know marketing. They can sell millions of units of a product with dubious value; not only that, but they can sell the same useless product to the same person more than once. There are certain people at one of my clients who all rushed out to buy the iPad when it came out; they used it with great gusto and excitement and showed off all its nifty features. You know, all the things it could do that my iPhone could also do, while being less portable and unable to make calls. Anyway, I noticed that these wonderful iPads started being left around more and more. These guys are die-hard Apple fanatics so they never discarded them entirely, they just … used them less. Eventually, the iPad was pretty much being used for its one true killer app: reading web pages whilst sitting on the toilet.

Nonetheless, when Apple announced the iPad 2, most of these Apple-istas announced their intention to “upgrade”. It’s not really an upgrade though; they’re discarding the old one and buying a completely new one. Apple have reduced the prices of the iPad 1, so if they were to try and sell their old one, Apple have ensured that they would be hugely out of pocket. Plus, the new one has not got many capabilities beyond the iPad 1. Nonetheless, a True Apple Fanatic™ must have the latest Apple kit. So, here we see a lemming rush of people determined to throw money at an all but useless product that they already have. I’m not saying the iPad doesn’t have some value, I’m just saying that in most cases people are buying it without any need or place for it, and thus it is useless to them.

Anyway, that’s not really the point of this article.

The other great marketing trick that Apple have done is to begin the complete destruction of the livelihood of all software developers on Earth, despite a pressing shortage. This is a quite amazing trick when you think about it. Apple’s hostility to the average developer is well known, and with the Apple App Store™ their revenge is nearing its completion. They have decided to equate developing applications with singing songs. They encourage a price of $0.99 for a song, and for an application. Except, you see, the song is normally produced for the mass market by an artist who also has recording contracts and live shows and so on; whilst the application is normally specific to a niche and is the sole livelihood of one or more developers. Oh, I agree that some applications have a broad mass-market appeal, and those successful apps are the ones that Apple holds out to the rest of us as a carrot in order to keep us churning out applications for below market value. Most of the successful apps are games and other things with limited real utility; so the other people who are suffering are the users. Fart apps are plentiful, accounting apps non-existent.

What we are seeing is a massive transfer of wealth; away from developers and into Apple. They don’t care that your app doesn’t make enough money for you to survive; it helps them push more units of product and that is their only concern. They also take a 30% cut of your sales, but leave you holding the bag on returns; even if it’s their fault. That massive bump in Apple’s share price you saw recently (there’s always a bump in their stock recently), that came straight from your sweat and blood. What did you get for it? Nothing, zip, nada, maybe some aggravation about how your app didn’t meet their high standards for $0.99 applications. A dismissive, sneering, cold shoulder from a dismissive, sneering company.

The issue is that due to Apple’s perceived dominance there is now a halo effect; the whole industry seems to be climbing on this bandwagon of cheap apps given away in perpetuity by software developers whose talents are in dire shortage, in the desperate hope that there is indeed a pot of gold under the App Store rainbow. The pot of gold is too frequently an illusion; you’d be better off buying lotto tickets than developing for Apple’s consumer products. Plus, it’s only a matter of time before Apple make their iAds platform mandatory; and will probably make you refund them for mistakes they make with it.

So where to for software development as a profession? Well, I believe it’s increasingly going to result in a massive income disparity; successful App Store developers and top-end corporate developers will be the small elite commanding higher and higher salaries due to their rarity; and the broad mass of other developers will find their income eroded from the top by the App Store successes and from the bottom by even more desperate developers giving their wares away for free. In other words, the industry will become more like music – with a few rich swimming at the top, and masses of wanna-bes struggling at the bottom. We’re already seeing this trend: despite insourcing growing in popularity (due to the clear disadvantages of outsourcing) and growing demand, salaries are stagnant or declining, defying economic truths. Apple have realized that they can turn their marketing machine to reducing their software development costs in addition to increasing sales. The Reality Distortion Field can extend to an entire industry leaping off a cliff like lemmings.

So say no. Say no to companies like Apple which hate developers. Say no to working your butt off for the chance of an adequate return. Demand respect, and demand that your talents, bought at a high price by years of work and study, are respected and paid for. Stop allowing companies like Apple to steal from you and expecting you to be grateful for it. Price your software at a reasonable rate, not one encouraged by a company which does not have your interests at heart. Take the cost of development and divide it by the number of likely sales, not potential sales. If you don’t know the number of likely sales, do some market research. Admittedly, by closing off the App Store, Apple make this very difficult; your app must be full featured before you can find out how popular it might be. So, develop it as a web app first; see if it’s popular; see if people are clamouring for an iPhone version. Only then should you consider writing an iPhone version. Ideally, rather just make your Web App iPhone compatible. That way you side-step Apple’s control completely, and retain ownership of your work rather than gifting it away.

Just Say No to closed platforms; and you don’t get any more closed than the Orwellian App Store.

1984, now brought to you by Apple.

* – Disclaimer, I am not an Apple developer and have no desire to be one

posted @ 7/12/2011 1:04 PM by Sean Hederman

The Truth About Folders: a rebuttal

In AIIM, Laurence Hart makes a number of comments about the Search vs Folders debate.

Claim 1: People are used to folders, Rebuttal: They are used to search as well

Not a good enough reason to stick with folders. By that logic we shouldn’t use search engines to access the Web; we should organise it into folders instead, just like Yahoo used to do. You see, people are also used to search engines, they use them every day of the week. In fact, more and more people are using search based idioms rather than folder based idioms to access their systems. Look at the search box built into the Windows 7 menu, and every Windows Explorer window.

Claim 2: Search Engines fail, Rebuttal: So does everything else

He claims that we should have folders as a fallback position in case the search engine doesn’t work. Well, if your technology is based around an unreliable bolt-on search engine (looks meaningfully at SharePoint), then yes, this is a valid concern. If your entire system is designed around search, then the search engine is the core and any folder-based view would be the bolt-on, and thus would be the one more likely to fail. All systems can fail from time to time, but that is not a good reason to not use an entire class of technology.

“Cars sometimes break down, so we should all use horses.”

This is a classic example of the logical fallacy of the excluded middle.

Claim 3: Folders help you organise, Rebuttal: Why manually organise?

I’m actually not sure what point he is trying to make here exactly. He goes into taxonomies, and how folders help users create a “well-executed taxonomy”, and how creating a taxonomy without folders sacrifices performance and simplicity. Not one person I’ve ever spoken to about their requirements from a document management system has ever mentioned the word taxonomy. Not one. Ever.

I will not deny that a proper taxonomy is easier to do with folders than without. I will even admit that should a system have a bolt-on taxonomy system, this will likely be less performant and simple than a system designed around taxonomies. But I deny the need for taxonomies at all. Search and metadata are all that is required to search billions of documents, and require zero extra effort.

He then admits that these taxonomies change, and systems must be put in place to manage these transitions. I’ve never once had to redesign search; it’s search, for goodness’ sake, and if your data changes, just reindex it! Want to add a metadata field? Reindex. No manual effort, let the computer do it for you.

Claim 4: Not using folders cripples systems, Rebuttal: Only if the developers were idiots

This claim boggles my mind. Let me quote: “One of the problems that you get when you don’t use folders is that you can cripple most systems. While few systems claim a limit to the number of documents that can reside in one location, there is a practical limit”. I’m pretty sure that what he’s talking about here is the well-known reality that operating systems struggle when directories fill up with more than a few thousand files.

He seems to be conflating the experience of the system from the outside (i.e. the user sees no folders) with the implementation details of the inside (i.e. whether the system stores every document in one huge directory). This is utter rubbish. Signate, as an example, creates an internal directory structure which documents are routed to in a balanced fashion, ensuring that no directory winds up with too many documents. This structure is internal to the system and is a performance and management implementation detail. It is not exposed outside the system at all.

In fact, this argument of his is an excellent example of why folder-based systems don’t work as well as search-based ones. While Signate automatically balances files across a directory structure designed to allow billions of documents per node, no such balancing can be applied when humans are involved. Every folder-based system I’ve ever seen winds up with a “dump” location, sometimes more than one, where documents which don’t fit the taxonomy neatly are placed. This can swiftly grow to thousands of documents, resulting in the very problem that Laurence claims search-based systems suffer from. Sure, if the taxonomy were perfect this would not arise; but that is also a sign that the taxonomy may need to change, resulting in a great deal of manual work. In a search-based system with balancing, this situation never arises. This is not just a punt for Signate; I’ve never seen a search-based document store without balancing, and I struggle to comprehend that anyone would ever conceive of designing such a system.
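
For readers who want to see the general technique, here is a minimal Python sketch of hash-based balancing. It is not Signate’s actual routing scheme, which isn’t public; the two-level layout and the names are assumptions chosen purely to illustrate the idea.

  # Hypothetical hash-sharded document store layout, for illustration only.
  import hashlib
  from pathlib import Path

  def internal_path(store_root: Path, doc_id: str) -> Path:
      """Derive a two-level shard path from a hash of the document id.

      256 * 256 = 65,536 leaf directories, so even a billion documents
      averages roughly 15,000 files per directory.
      """
      digest = hashlib.sha256(doc_id.encode("utf-8")).hexdigest()
      return store_root / digest[:2] / digest[2:4] / f"{doc_id}.bin"

  def store(store_root: Path, doc_id: str, payload: bytes) -> Path:
      path = internal_path(store_root, doc_id)
      path.parent.mkdir(parents=True, exist_ok=True)
      path.write_bytes(payload)
      return path   # an internal detail; never shown to users

  print(internal_path(Path("/var/docstore"), "INV-2011-000123"))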

He then claims that “You can swear that nobody will ever browse to [the internal storage] location, but unless you remove that capability, someone will do it”. Well, of course we remove it! I consider it a massive security breach if people are able to access the internal document location of the system without passing through the interface to the system. Does he allow users to access his internal company databases directly? Of course not.
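
As a sketch of that point, access would go through something like the service below, which checks permissions and returns document bytes without ever revealing the internal path. The interface is hypothetical, not Signate’s API.

  # Hypothetical service layer: callers fetch by document id, permissions are
  # checked, and the internal storage path never leaves this class.
  from pathlib import Path

  class DocumentService:
      def __init__(self, store_root: Path, authoriser):
          self._store_root = store_root
          self._authoriser = authoriser          # callable(user, doc_id) -> bool

      def _internal_path(self, doc_id: str) -> Path:
          # However the internal tree is laid out (see the sharding sketch
          # above), it is resolved here and nowhere else.
          return self._store_root / f"{doc_id}.bin"

      def fetch(self, user: str, doc_id: str) -> bytes:
          if not self._authoriser(user, doc_id):
              raise PermissionError(f"{user} may not read document {doc_id}")
          return self._internal_path(doc_id).read_bytes()   # bytes out; the path stays inside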

Claim 5: Search Engines can’t read your mind reliably, Rebuttal: Nothing can

Neither can folders. Search engines help you find what you’re looking for; folders just let you know where you’re looking. Which would you rather have? We don’t need perfect reliability; you can refine your search terms based on the results you see. Too many results? Add search terms. Too few? Remove some. Make some more approximate, tighten up others. Signate allows an enormous range of searching options, including approximate search, where words similar to the specified word are found.
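
To illustrate what approximate matching does, here is a toy example using Python’s standard-library difflib. Signate’s actual approximate-search algorithm isn’t described here, so treat this only as the general idea.

  # Toy approximate matching: query words match indexed words even when the
  # spelling differs slightly. Illustrative only.
  import difflib

  indexed_words = ["hederman", "invoice", "specification", "bradbury"]

  def approximate(word, vocabulary, cutoff=0.8):
      """Return vocabulary words similar to `word` (cutoff 1.0 = identical)."""
      return difflib.get_close_matches(word.lower(), vocabulary, n=3, cutoff=cutoff)

  print(approximate("Hedermann", indexed_words))   # ['hederman']
  print(approximate("invioce", indexed_words))     # ['invoice']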

Conclusion

Clearly, I’m biased. I’m so convinced of the value of search-based document management that we created one ourselves. Laurence is a specialist in Documentum, a prominent folder-based document management system. So, we’re both biased. But read his article, read mine, and then ask yourself which approach:

  • Will get the benefits of document management into my users’ hands faster?
  • Will result in the lowest ongoing administration whilst delivering excellent results?
  • Will adapt to my changing business needs?

Folder-based systems are great for rigorously defining the information content your organisation needs. If you’re working in a top-down company with an IT department that can easily define a data dictionary for your entire business and enforce its consistent usage, then I’d strongly suggest you look at systems that support such an approach. If, however, you work in the remaining 99% of businesses, where change is constant, time is precious, and flexibility and turnaround are more important than rigour, then look at systems that support that approach.

posted @ 7/6/2011 9:23 AM by Sean Hederman

Windows 8, wpf, silverlight and HTML5

Okay, so there’s a great deal of wailing and gnashing of teeth about Microsoft’s announcement of Windows 8, and especially the comment:

Windows 8 apps use the power of HTML5, tapping into the native capabilities of Windows using standard JavaScript and HTML to deliver new kinds of experiences

As usual, Tim Anderson has an insightful post where he tries to calm people down and introduce some sanity to the discussion. Scott Barnes (the former head of Silverlight) seems to believe that WPF is dead, and Silverlight is on its way out.

HTML5 + JS Windows Apps? Good Idea

HTML5 + Javascript applications running directly in the OS? Kudos to them, it’s a great idea. It makes Windows programming available to a massive number of devs who currently only do web work. Of course, they’re not too likely to switch, but hey, some might. It also allows us to treat the OS as a commodity, with one app working on all operating systems. Oooh, Good Idea for devs, BAD IDEA for Microsoft.

OS-Specific Hooks for JS + HTML5? Meh Idea

Your code can detect the OS and take advantage of these OS-specific features if they’re available, much like the flicking functionality in iOS Safari. Happy days. Will this make devs write HTML5 apps only for Windows 8? No. Not at all, and if MS think it will, they’re kidding themselves.

Foot off the pedal on WPF? Meh Idea

Personally, I couldn’t care less. I’ve hated WPF since it came out, but then I’m not a graphics designer (although they apparently hate it too). I grok its power and capabilities, but I just don’t care. I either want a simple form-based application, for which I can use HTML and/or WinForms, or a pretty high-graphics app, in which case I use HTML. WPF sits nowhere in that spectrum for me, although I understand and appreciate that it does for some. They could see some developer defections if they go through with this.

Foot off the pedal on Silverlight? BAD IDEA

This would just be idiotic in so many ways it’s not funny. It would make MS a laughing stock in the web dev community (okay, more of a laughing stock). It would wipe out their fallback plan for delivering rich applications if the OS gets commoditized. I understand that the Windows OS team don’t expect that to happen, but to remove your Plan B for the thousands of other applications you provide is just thick. Plus, this would probably result in a massive developer defection.

Foot off the pedal on .NET entirely? BAD IDEA

Nobody is saying this directly yet, but it seems a logical (if extreme) conclusion from where MS seem to be thinking about going. It’s hard to believe that anyone at Microsoft could be so stupid as to be considering this, but we never know with the Redmond creature. This would be a company-killing move. An enormous number of Windows devs hate HTML+JS with a passion. They like writing Windows applications. They don’t want to write web apps, and they don’t want to use C++ because, umm, it sucks to develop in. Hello Java, and suddenly Mr. Larry Ellison is very happy he bought it.

Conclusion

I don’t really have any, except to say that you should take what you read with a grain of salt. A lot of it seems to be internal politics playing out in public, and marketing drones putting spin on things. It’s entirely possible that the Windows 8 team told marketing about the HTML5 + JS thing as an aside, like “hey, we’re also allowing you to do this”, and the marketdroids turned it into “this is the way you will do things”.

Personally, I’m kind of uninterested in everything except the direction of .NET in general. I currently use Silverlight a little, for things like showing a graph or video. I gave up on the whole Windows UX team a long time ago. I cut my teeth on DOS and then WINAPI, MFC, WTL, VB, and WinForms (1 & 2); I was underwhelmed when WPF came out, and mildly interested in Silverlight. In almost all that time, a simple web page with a few simple tags would do everything that the latest UI fad from Redmond would do. I’m just tired of new UI mechanisms and widgets and thingies.

So, am I going to write Windows 8 only applications? No.

Am I going to use WPF? No.

Silverlight? Less and less.

HTML5+JS on the desktop? No.

HTML+JS in the browser, served over the web? Yes.

Off the treadmill, and not particularly interested in the remaining hamsters attacking each other, as long as they don’t destroy the cage.

More and more devs are coming to the same opinion, and Windows is becoming more and more marginalized and commoditized, ironically by the very Windows team that thinks they’re doing the exact opposite.

Final Score

              For Microsoft    For Devs
Good Idea           0              1
Meh Idea            2              2
BAD IDEA            3              2

posted @ 6/3/2011 1:15 PM by Sean Hederman

Why Search is the #1 Feature

The main means of accessing documents in document management systems is via folders. This makes sense because it’s what people are used to. Before they get a document management system, they normally arrange their shared documents in a shared location, organized via folders. Folders are intuitive, hierarchical and familiar, and so people tend to look for document systems which are focused around folders as well. This also makes the migration to the new system easier.

THIS IS A MISTAKE!

“Why are you switching to a document management system at all?”

If your file share is perfect for you, why spend a lot of money on a system which is just going to replicate it? You’re trading a cheap, convenient, and reliable system for one that is much dearer, requires retraining, and is all too often less reliable. If you need to share your documents in a folder structure across the Internet, use Dropbox; it’s a fantastic service and it’s very reasonably priced. If your company just needs a glorified (and expensive) folder share, I don’t want to con you out of your money while adding zero value. With Signate, we want to add real value, and we don’t believe that is done by a web-based file share.

Case Study in failure

At a large financial company where I consult on some electronic accounting issues, they had a massive shared drive where all documents were kept. They spent millions of rands implementing a company-wide Content Management System (CMS); money thrown at the software, hardware and the numerous consultants involved. Most CMS systems make it easy to arrange your documents in folders, and so they reorganized the layout, designed it better and set it all up. It was decreed that all new projects were to use the new system. They did, but only as a file storage medium. The more advanced features like wikis, monitoring, calendars, workflows and so on went almost unused. It was also decided that the existing file share was too large to migrate, so it coexists side by side with the CMS, and there is often confusion about where to find documents, and about which version is the “current” one: the file share version or the CMS one. Net result: significant capital and operational spend, massively increased storage requirements (due to duplication), confusion, and little or no improvement. They have also been unable to get the CMS search working, which is a critical failure in my opinion (as we will see later).

Is this the fault of the CMS? Not at all. This particular CMS is a very powerful tool in the right hands. Its configurability allows it to really shine when well implemented. Unfortunately, it is all too rarely implemented well, and doing it well usually requires hordes of very expensive consultants.

Failure of Design

The underlying problem is that the CMS, along with all too many document management systems, caters to people’s first instinct: the desire to keep things the same as they were. Let’s cast our minds back to 1998. The Web was a growing phenomenon, and the most popular portal was Yahoo!. Their web site was built around a directory, a folder structure exactly like that in your shared drive, except that it consisted of links to web pages. People submitted their pages to Yahoo! and they would be placed in a category.

They had search, of a sort, but the focus was clearly on the directory structure; that was how you ensured that you found what you were looking for. You would browse through folders, hunting for the right category. Sometimes the categories were arranged somewhat haphazardly, so it could take a while to find the right one. Meanwhile, the task of maintaining this directory grew larger and larger, and the directory fell further and further behind.

I remember that my primary source of new pages started to be friends’ emails rather than the directories. All the while, Google was making search its primary focus. We all know how that story played out; search became the dominant means of finding pages on the web. Why does search trump directories? For a few simple reasons:

  1. A directory imposes the directory organiser’s priorities on the consumer – If the organiser arranges things in a way that the consumer finds counter-intuitive, it can be difficult or impossible for the consumer to find content that is present.
  2. A directory requires constant work to ensure relevance – Entries (or documents) can become stale or corrupted, and newer locations may become popular, causing duplication as work occurs in both locations.
  3. Search puts the consumer’s priorities first – You type what you’re looking for and the search engine finds it; what could be simpler than that? There is no organiser other than the content, so you don’t have to put up with odd filing hierarchies.
  4. Search ensures relevant content is found immediately – No hunting through folders and opening documents; the best matching results are returned first.
  5. Search allows for powerful search terms – You can use advanced features such as ranges for dates and numbers, exact matching, wildcards and so on very quickly and easily (see the sketch after this list).
  6. Directories are categorised by perception, search by reality – When we decide to place a document under the “Technical Specifications” folder, we’re doing so based upon our idea of what that document contains. Normally this would be done by the content author, so it’s generally pretty accurate, but there might be a better location, or the categoriser may be mistaken in their assessment. Search categorises documents based on their content.
  7. Directories are static – Related to the above, documents change, and your system must cater for that. A directory structure tends not to change, even when it should. People are used to accessing a particular document in a particular place, and if you move the document they won’t find it at all. You’ll go from 100% accuracy to 0% in one swift go. With a search system, by contrast, the document simply moves up and down in the search results for a particular set of search terms as its content changes.
  8. Directories take effort – You need policies and procedures, and people who monitor and enforce them. None of this is productive work.
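
The sketch promised in item 5: a toy Python filter combining a date range with a prefix wildcard over document metadata. The field names and query shape are assumptions made for illustration; a real search engine exposes far richer syntax.

  # Toy search over document metadata: a prefix wildcard plus an open-ended
  # date range. All field names here are hypothetical.
  from datetime import date

  documents = [
      {"id": 1, "title": "Invoice March", "author": "Ray Bradbury",     "created": date(2011, 3, 14)},
      {"id": 2, "title": "Invoice April", "author": "Orson Scott Card", "created": date(2011, 4, 2)},
      {"id": 3, "title": "Spec draft",    "author": "Ray Bradbury",     "created": date(2010, 11, 30)},
  ]

  def search(docs, title_prefix=None, created_from=None, created_to=None):
      """Keep documents matching every supplied criterion; omitted criteria are open-ended."""
      results = []
      for doc in docs:
          if title_prefix and not doc["title"].lower().startswith(title_prefix.lower()):
              continue
          if created_from and doc["created"] < created_from:
              continue
          if created_to and doc["created"] > created_to:
              continue
          results.append(doc["id"])
      return results

  print(search(documents, title_prefix="invoice", created_from=date(2011, 1, 1)))   # [1, 2]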

The Road Ahead

The future of document management lies in search. In my many years in the document management field, across industries as diverse as logistics, healthcare, insurance, financial services, travel and many others, I have seen finding documents become the pain point for project after project. This is why we created Signate: as a response to the appalling inefficiencies of products spanning from the cheapest of the cheap to high-end enterprise servers. Signate puts search front and center, and whilst we are ahead of the game right now, I am under no illusions as to how long that advantage will last. Search is such a compelling feature that all document management systems will have to become search-centred or they’ll fail.

The question you have to ask yourself is where you need your company to be. Do you want an easy transition to document management but very little added value, or are you willing to learn a new way of finding your documents? It isn’t even that new if you’re used to searching the Web.

Search Quality

So now that I’ve made my case for search over folders in document management systems, let’s look at the quality of that search. Have a look at the screenshot to the left. It’s from another document management system’s search screen, and it exemplifies pretty much everything I dislike about search in the document management space.

Each field you can search on is listed, each with its own box. Worse yet, there are dropdowns for “Contains”, “Exact Match” and so on. Whilst I hate the dropdowns for the date fields, at least they actually have a date range search, as opposed to forcing you to pick one date at a time. But now, what if I knew that the document was created after a certain date, and that the revision I was looking for was before another one? How would I enter that? Would I be able to leave the To date empty for “creation date”, and the From date empty for “revision date”? Possibly; it’s not clear. How would I search for documents where the author is either “Ray Bradbury” OR “Orson Scott Card”? I know it’s one of them, but I’m not sure which. I’d probably have to do two searches.

Now consider the search screen on the right. This, strangely enough, is much clearer about what you need to do, which is counter-intuitive if you think about it. You’d expect that the screen which spells everything out explicitly would be the easiest and most compelling, but it’s just not. An empty search box invites; a complex search screen repels. How would you search for both authors, as above? Well, I’d type: "Ray Bradbury" OR "Orson Scott Card"
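
Here is a toy Python illustration of how such a single-box query could be evaluated against document metadata and content. The parsing is deliberately naive and this is not Signate’s actual query syntax; it is only meant to show that one text box can express what the form-per-field screen cannot.

  # Toy evaluation of the single-box query: quoted phrases are OR'd together
  # and checked against every field of each document. Not Signate's syntax.
  import re

  documents = [
      {"id": 1, "author": "Ray Bradbury",     "content": "Short story collection"},
      {"id": 2, "author": "Orson Scott Card", "content": "Novel draft"},
      {"id": 3, "author": "Arthur C. Clarke", "content": "Essay on space"},
  ]

  def matches(query, doc):
      """True if any quoted phrase in the query appears in any field of the document."""
      phrases = re.findall(r'"([^"]+)"', query)
      haystack = " ".join(str(value) for value in doc.values()).lower()
      return any(phrase.lower() in haystack for phrase in phrases)

  query = '"Ray Bradbury" OR "Orson Scott Card"'
  print([d["id"] for d in documents if matches(query, d)])   # [1, 2]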

The normal reason document management companies use search forms like the left-hand one is that their search form is a thin wrapper over their underlying database. This limits them, as databases are designed to be very specific and cannot search across fields easily. Not only that, but if they don’t get the search form exactly right, it is possible for a user to run updates and malicious scripts on the database. The database that’s storing all your document data. With Signate we use a completely separate search engine, which not only is designed to search, and search well, but also cannot affect your underlying data. Oh yeah, and it’s fast. Blazingly fast. Much faster than a complex query run against a database. Plus it can easily scale up to billions of documents, which database-driven searches struggle with.
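
For anyone wondering what “malicious scripts” means here, the classic failure mode is query injection: user input concatenated straight into SQL. The sketch below uses Python’s standard-library sqlite3 purely to illustrate the contrast; it is not a claim about how any particular document management system is built.

  # Why a thin database wrapper is risky: concatenated input becomes SQL, while
  # a parameterised query (or a separate read-only search index) treats it as data.
  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT)")
  conn.execute("INSERT INTO docs (title) VALUES ('Invoice March')")

  user_input = "anything'; DELETE FROM docs; --"

  # Unsafe: the user's text becomes part of the SQL statement itself.
  unsafe_sql = f"SELECT id FROM docs WHERE title LIKE '%{user_input}%'"
  print(unsafe_sql)   # on a driver that allows batched statements, this wipes the table

  # Safe: the driver passes the text purely as data, never as SQL.
  rows = conn.execute(
      "SELECT id FROM docs WHERE title LIKE ?", (f"%{user_input}%",)
  ).fetchall()
  print(rows)   # [] - no match, and no way to alter the table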

Conclusion

If you need a document management system, please, please, please choose one that puts search at the forefront. Ensure that, before you buy, you really kick the tires on the search system and that it’s quick and easy to use. It should not take your staff longer to find an internal document than to find a web page via Google. If it does, then you have a suboptimal system. A swift and powerful document management system should pay you dividends across the board. You should have happier and more productive staff, faster processes, easy findability for your documents, and vastly improved turnaround times.

Use our Document Costs Calculator to work out the amount you’re probably wasting right now on document handling, and thus the amount you can save every year. That financial company I discussed earlier? Our calculator shows that a well-designed and well-run document management system should be saving them between 12 and 70 million rands a year, and between 180,000 and 490,000 staff hours annually. These figures are based on research, calculations and figures from Gartner, Cap Ventures and the Arbeitsgemeinschaft für wirtschaftliche Verwaltung.

Plug in your company’s figures and see the impact a good document management system could be having on your bottom line and client satisfaction. Plus, it’s good for the environment too.
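
For the curious, the arithmetic behind such a calculator has roughly the following shape. Every input below is a made-up placeholder; the real calculator uses the published research figures mentioned above, which are not reproduced here.

  # Shape of the savings calculation only; all inputs are hypothetical.
  docs_retrieved_per_year = 500_000        # hypothetical retrieval volume
  minutes_saved_per_retrieval = 4          # hypothetical: search vs. hunting through folders
  loaded_cost_per_hour_rand = 250          # hypothetical fully-loaded staff cost

  hours_saved = docs_retrieved_per_year * minutes_saved_per_retrieval / 60
  rand_saved = hours_saved * loaded_cost_per_hour_rand

  print(f"{hours_saved:,.0f} staff hours, R{rand_saved:,.0f} per year")
  # 33,333 staff hours, R8,333,333 per year (with these made-up inputs)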

Palantir (Pty) Ltd

Sean Hederman is Director of Palantir (Pty) Ltd, and Software Architect for the Signate Document Management System. He also writes the popular programming blog Codingsanity.

Signate 2010 Document Management System

posted @ 2/17/2011 1:25 PM by Sean Hederman

Reflector No Longer Free

Red Gate have announced that they will no longer be offering Reflector for free, and that from May you will need to pay $35 for it. This has caused a predictable community backlash, especially on reddit. Here are some of my thoughts on the situation.

Firstly, I must say I’m disappointed. I personally was hoping that Red Gate would make enough money off the (very, very useful) Premium version, and off the “halo effect”, to keep the basic version free. That said, I’m not a dyed-in-the-wool Open Source zealot, and I understand why they’ve been forced to make this switch. They’re being very careful to keep the price as low as possible, and I think it’s an incredibly reasonable price for what the tool offers.

Let’s get a few myths out of the way:

Red Gate have added in a timebomb to kill the old version of Reflector.

No, the timebomb has always been there; it was put in place by Lutz. It used to annoy me a lot. Anyone using this argument does not have more than a passing familiarity with the tool.

Reflector will be easy to replace with an OSS version

Cool, good luck with that. There are FOSS competitors. One problem: they suck. Okay, that’s a bit strong, but they are nowhere near Reflector in usability, features, extensibility and so on. I looked into doing a bit of what Reflector does myself, in order to make my Diff tool standalone. Let me just say that after a few weeks at it I’d barely scratched the surface. Anyone using this argument does not know very much about programming.

Red Gate are screwing over the contributors

The idea here is that by making Reflector paid-for, I’m somehow getting screwed over. I can’t speak for the other contributors, but I wrote Reflector.Diff firstly for myself, and have kept working on it because it’s fun. I write Windows programs too; does that mean Microsoft are screwing me over by charging for Windows? No. The people using this argument do not appear to be contributors themselves.

Red Gate should give Reflector away to the FOSS community

Ummm, why? I assume they paid a fair whack of cash for it; who’s going to refund that? They’ve added new features (some very, very cool); who’s going to refund them for those? I see that not one of the self-righteous “give it away” crowd contributed to the tool even when it was free, yet now they expect to be given stuff for free despite their lack of participation.

In conclusion, it appears that most of the people who use the tool seem quite happy with the idea of paying $35 for something that has given them great value over almost a decade and will likely continue to do so for ages to come.

The ones who don’t use the tool are the ones screaming the loudest. The ones who didn’t contribute and add to the product are the ones complaining about how contributors are getting shafted.

I have a loop in my head of Bill Hicks screaming “THINK OF THE CHILDREN!!!”.

posted @ 2/3/2011 7:43 AM by Sean Hederman

Latest Images