Welcome

This is the generic homepage (aka Aggregate Blog) for a Subtext community website. It aggregates posts from every blog installed in this server. To modify this page, look for the Aggregate skin folder in the Skins directory.

To learn more about the application, check out the Subtext Project Website.


Welcome All

It's a wonderful conceit isn't it? Scan in some paper form sent to you, and have the computer automatically read it and process it.

Unfortunately, that's all it is: a conceit. The reality is that reading handwritten text is hard. Typewritten text, on the other hand, is largely a solved problem, with accuracy rates of up to 99%. Handwriting, well, not so much. Oh, it's possible to jimmy pretty good numbers out of the software, and with enough training it can get pretty accurate. Another way of improving accuracy is to set a lexicon: a list of allowed words.

So, given that you can limit the words that people use, and given that you can force them to print (cursive is often still a problem), then yes, you can get some pretty impressive accuracy: 95% per word in some cases. However, this is usually not good enough for business; consider that a 95% per word accuracy translates to the software getting more than 3 of the words in this paragraph alone wrong.

For making decisions involving money, or people's health, that's just not an acceptable level of accuracy. So, it seems we are still wedded to manual capture where humans type out what other humans have written in. Or are we? Well, sure, you can't use OCR for the truly critical data on the scanned form, but you can use it for less critical data.

You may not want to use the OCR data for making business decisions, but you can use it as an aid to finding your documents. In many cases, your computerised business processes need only a subset of data on a form, with humans assessing the remainder. In such a case the process is simple: capture the critical data manually, index the rest using OCR, and use both sets of data to find your documents.

This gives you the best of all worlds: the accuracy that only human capture can currently provide, limiting your human processing to the bare minimum, and being able to find the document based on all the data on the form.
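To make the "use both sets of data" idea concrete, here is a minimal sketch of an index that treats manually captured fields and OCR text as equally searchable. Everything in it (the DocumentIndex class, its method names, the whitespace-only tokenising) is an assumption made for illustration, not a description of any particular product.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical index: maps a search term to the ids of documents containing it.
public sealed class DocumentIndex
{
    private readonly Dictionary<string, HashSet<int>> _terms =
        new Dictionary<string, HashSet<int>>(StringComparer.OrdinalIgnoreCase);

    // Index the manually captured values and the OCR text side by side.
    public void Add(int documentId, IEnumerable<string> manualValues, string ocrText)
    {
        foreach (var value in manualValues)
            AddTerm(value, documentId);

        var words = ocrText.Split(new[] { ' ', '\n', '\t' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (var word in words)
            AddTerm(word, documentId);
    }

    // Returns the ids of matching documents, or an empty set if there are none.
    public ICollection<int> Find(string term) =>
        _terms.TryGetValue(term, out var ids) ? ids : new HashSet<int>();

    private void AddTerm(string term, int documentId)
    {
        if (!_terms.TryGetValue(term, out var ids))
            _terms[term] = ids = new HashSet<int>();
        ids.Add(documentId);
    }
}
```

The business process would only ever trust the manually captured values; the OCR words are there purely so that a search has more to match against.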

Looking for a Document Management System? Signate 2010 is powerful, secure and easy to use.

Latest Posts

ORMs a leaky abstraction?

I'm starting to come to the conclusion that ORMs, or more specifically the Microsoft ORMs (if you can call them that), are a leaky abstraction. Most tend to have some or all of the following underlying (incorrect) assumptions:

  • The data being dealt with has one and only one representation for each entity
  • Data is read in row by row, or set based; but is only ever updated, inserted, or deleted row by row
  • Updates are to the entire object, with no mechanism for partial updates, e.g. UPDATE Invoice SET Status = 'Cancelled' WHERE CustomerID = X
  • Entity objects are internal to the data access layer; thus there is no need to allow customization of the attributes or data types used.
  • Entity objects are always connected, and thus can be change tracked ("Internets, you say? Speak up sonny, I'm not understanding your point!")
  • It is okay to accept shoddy performance on the grounds that "it's only an ORM" and you should be able to drop down to "raw" ADO.NET for REAL work. Query hints? Isolation levels? Schemas? Don't be silly, no-one uses those things.
  • Stored procedures and functions are something disconnected from an ORM, and to be treated as an add-on, not something first-class.
  • Logging, Instrumentation and sometimes even security are not something to consider.
  • There's no need to consider extensibility; people must just wait for new database features to be given support in the ORM, if ever.
  • It's okay to support a huge ugly designer working off one file, because all REAL projects have the database schema controlled by ONE person.
  • It's okay to require usage of your specified collection class and/or require virtual properties in your "POCO" option, because nothing says "you're in control" like forcing stupid design decisions.
  • There's no need to support a "code-first" model where the DB is generated from the DAL entities, and needs to handle change scripts.
  • Constraints? Indexes? Why would our query language need to consider those?

When are we going to get a proper ORM? I've given up on the MS ADO.NET Team ever delivering, they're too busy writing tools for data development the way we did back in 2005.

Where are the products that provide the requirements MS don't even understand?

posted @ 3/11/2012 8:33 AM by Sean Hederman

Performance Myth: Managed languages are slow

"[When comparing language performance] what's really evaluated is the skill of the compiler writers, not the languages themselves"
- Fahad Gilani

There are a lot of reasons for this myth, but basically they all boil down to two main misconceptions. The first is that managed languages are somehow interpreted, which they're not. The second misconception is a little more accurate, and has to do with the impact of the garbage collector.

Managed code (.NET and Java at any rate) runs on the CPU as native code, just like your handcrafted C++. Now, generally there is a step before that happens, called the JIT (Just In Time Compile) where the .NET MSIL or Java bytecode is compiled to machine code. This compile has to be blazingly fast, in order to ensure that the user isn't faced with stalled user interfaces as the compiler runs around doing its job. Because of this, the compiler cannot perform some of the more advanced optimisations that C and C++ compilers do.

However this is offset by the fact that the JIT compiler is running on the actual machine being used. Normally, a compile runs on a similar machine to the target computer, not the exact one. This means that theoretically, the JIT compiler can perform optimizations that most other compilers can't, at least not without massively limiting the potential install base. In addition, it's theoretically possible for the JIT compiler to recompile sections based on observed usage patterns in the application, providing even more performance. Of course, these two possibilities are currently just theoretical, but it's important to see ways in which a JIT-compiled system could approach or exceed static compiled code in performance.

The .NET runtime does perform a great many checks in order to ensure your application runs predictably. Array bounds checking and overflow trapping are just two of the safety features provided. These features consume processor cycles. However, I'm not convinced that this is poor performance. If C++ code were written completely robustly, it would have many of these checks too. In any case (in C# at least) you can run in unsafe mode, with many of these checks avoided. Beware.
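As a small illustration of the overflow trapping mentioned above, here is a sketch using C#'s checked and unchecked keywords. The keywords are part of the language; the OverflowDemo class and its method names are made up for this example.

```csharp
using System;

public static class OverflowDemo
{
    // Overflow trapping on: throws OverflowException when value is int.MaxValue.
    public static int IncrementChecked(int value)
    {
        checked { return value + 1; }
    }

    // Overflow trapping off: silently wraps around to int.MinValue instead.
    public static int IncrementUnchecked(int value)
    {
        unchecked { return value + 1; }
    }

    // Returns true if the checked increment trapped an overflow.
    public static bool TrapsOverflow(int value)
    {
        try
        {
            IncrementChecked(value);
            return false;
        }
        catch (OverflowException)
        {
            return true;
        }
    }
}
```

Worth noting: in C# integer overflow checking is opt-in (via checked or a compiler switch), whereas array bounds checks always apply in safe code, so how many cycles these features cost depends on which of them you have switched on.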

The garbage collector allows us to pretend that memory is easy to manage, and to willy nilly create tons of small objects in the (almost) certain knowledge that they won't waste memory and they won't cause heap fragmentation. This is probably the most difficult concept for C and C++ programmers to get used to. They mutter about deterministic finalization, and the various pointer mechanisms they use, and miss the most important and interesting facts from a performance perspective: no fragmentation and fast allocations.

In traditional memory management, the allocation routines maintain a linked list of available chunks of memory. When you need some memory, you walk the list looking for the smallest piece that will satisfy your requirements. This is not a cheap operation, which is why many low latency systems allocate chunks of memory up front for more controlled and performance critical allocations.

In .NET (and I assume Java too), the operation to allocate memory boils down to a very simple operation: move the heap pointer N bytes further along. It's marginally more complicated than that, but not by much. This is one of the reasons managed programmers are so free with object allocations, they're cheap by comparison.
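The difference between the two allocation styles can be sketched with a toy model that hands out offsets rather than real memory. Both classes below are assumptions for illustration only; neither is how the CLR or any C runtime is actually implemented, and the free list does none of the block merging a real allocator needs.

```csharp
using System;
using System.Collections.Generic;

// Toy best-fit free-list allocator: every allocation walks the whole list.
public sealed class FreeListAllocator
{
    private readonly LinkedList<(int Offset, int Size)> _free =
        new LinkedList<(int Offset, int Size)>();

    public FreeListAllocator(int capacity)
    {
        _free.AddLast((0, capacity));
    }

    // Returns the offset of the allocated block, or -1 if nothing fits.
    public int Allocate(int size)
    {
        LinkedListNode<(int Offset, int Size)> best = null;
        for (var node = _free.First; node != null; node = node.Next)
        {
            bool fits = node.Value.Size >= size;
            if (fits && (best == null || node.Value.Size < best.Value.Size))
                best = node;
        }
        if (best == null) return -1;

        int offset = best.Value.Offset;
        if (best.Value.Size == size)
            _free.Remove(best);                                   // exact fit: drop the chunk
        else
            best.Value = (offset + size, best.Value.Size - size); // shrink the chunk
        return offset;
    }

    // Hands a block back to the list (no merging of neighbours in this sketch).
    public void Free(int offset, int size)
    {
        _free.AddLast((offset, size));
    }
}

// Toy GC-style bump allocator: allocation is a bounds check and an addition.
public sealed class BumpAllocator
{
    private readonly int _capacity;
    private int _next;

    public BumpAllocator(int capacity)
    {
        _capacity = capacity;
    }

    // Returns the offset of the allocated block, or -1 when the heap is full.
    public int Allocate(int size)
    {
        if (_next + size > _capacity) return -1; // a real collector would run here
        int offset = _next;
        _next += size;
        return offset;
    }
}
```

The bump allocator has no Free method at all, which is the point: reclaiming space is left to the collector, so the allocation path stays trivially short.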

The second interesting aspect is the defragmentation the garbage collector performs. In a C++ application if you allocate a bunch of objects, some of which are deallocated quickly and the rest surviving, you will find that the surviving objects are scattered around the heap. In .NET this is not the case; after a garbage collection the longer lived objects will all tend to be close together (closer than they were, at any rate).

This leads to some interesting side effects. Theoretically it means that .NET apps should have better cache locality than C++ apps do, at least by default.

Of course, they also have the impact of the garbage collector on their execution times. Mostly, they're no longer "stop the world" collections, but their performance impact is undeniable. Of course, you need to deallocate memory in non-managed languages too, and often the deallocations occur even when the machine has plenty of memory available. One of the strengths of the garbage collector is that it deallocates when it makes sense from a memory pressure and CPU perspective. Unfortunately, that's also its weakness, as you never really know when that will be; it could be just as a critically important action needs to be taken. In most cases, the impact is small and infrequent enough to not matter, but for some use cases, it really, really does.

So, in conclusion, we've seen that managed languages run "raw" on the CPU, that some can reduce their runtime checks, that JIT could theoretically provide similar performance benefits to static compilation by being more machine-targeted, and that GC memory is fast to allocate and should tend towards improved cache locality, although the collector has a small but unpredictable impact on execution.

Does this mean it's possible to write truly high performance managed code? Yes, it does. Fast enough to compete with C and C++? Well, fast enough to make them break out the advanced compiler options, yes, certainly. Over the next several months I'll be exploring some of the things we can do to improve the performance of .NET applications.

posted @ 2/13/2012 10:51 PM by Sean Hederman

5 Reasons to Unit Test now

  1. If you're not testing it now, you won't have time to add tests later
  2. If you do have time to add tests later, you won't remember all the context
  3. If you don't remember the context, you'll just test code paths instead of actual use cases
  4. If you think about testing as you write code, the code tends to be more decoupled and you test more "non-obvious" scenarios
  5. If you unit test everything, and there's code still not being hit, then maybe you can get rid of it

On a Top Secret Project we’re working on, we’re aiming for 100% code coverage. Does that mean that all code gets tested? No.

What it means is that when we find code we can’t test, and we’re happy with that, we mark it with the ExcludeFromCodeCoverageAttribute. So it’s easy for us to see how we’re doing on our goal (currently at about 80%).
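For anyone who hasn't used it, here is a minimal sketch of how that marking looks. ExcludeFromCodeCoverageAttribute is the real attribute from System.Diagnostics.CodeAnalysis; the PriceFormatter class and its methods are hypothetical and exist only for this example.

```csharp
using System;
using System.Diagnostics.CodeAnalysis;
using System.Globalization;

public static class PriceFormatter
{
    // "Good code": pure logic that is easy to cover with a unit test.
    public static string Format(decimal price)
    {
        return price.ToString("0.00", CultureInfo.InvariantCulture);
    }

    // "Exempt code": writes to the console, which we have decided not to test.
    [ExcludeFromCodeCoverage]
    public static void Print(decimal price)
    {
        Console.WriteLine(Format(price));
    }
}
```

Because the exemption is an attribute in the source, it is easy to search for later when reviewing whether each exemption is still justified.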

BUT, we will go back and review those ExcludeFromCodeCoverageAttribute decisions from time to time. There’ve been a few cases where we’ve decided to revoke that decision.

So you have:

  • Code that is covered by unit tests (known as “good code”)
  • Code that is excluded from code coverage (known as “exempt code”)
  • Code that is not covered (known as “bad code”)

I guarantee that 90% of your bugs come from bad code, and most of the remainder come from exempt code. Exempt code has at least been evaluated, bad code has not.

posted @ 2/13/2012 3:09 PM by Sean Hederman

Reflector.NET Add-Ins Gallery Active

The guys at Red Gate have set up an Add-Ins gallery where you can browse the active Reflector Add-Ins and tools.

Check it out, and add tools and add-ins you’d like to see highlighted.

My Reflector.Diff 2 add-in is highlighted there as well.

posted @ 10/25/2011 2:34 PM by Sean Hederman

IoC, extension methods and logging

One thing I always need is nicely instrumented and logged code. However, I don’t want to be setting up performance counters and log files in my unit tests. So, how do I make logging statements which are nice and injectable? Well, clearly to start, we need an interface that can be injected into our class that needs logging:

    // TraceLevel comes from System.Diagnostics.
    public interface ILogging {
        void Write(TraceLevel severity, string message);
        bool IsEnabled(TraceLevel severity);
    }
 

Simple enough. Now, the point of all this is that if the Logging isn’t injected, then we shouldn’t log. Okay, so our code (assuming the ILogging property is called Log) looks something like this:

    if (Log != null && Log.IsEnabled(TraceLevel.Error))
        Log.Write(TraceLevel.Error, string.Format("Security '{0}' has been suspended", ticker));
 

Meh. Ugly. Imagine having to do THAT everywhere. But, we can make an extension for it. In fact one of the nice things about extension methods is that you can call them on null objects, because they’re not really instance methods; they just look that way. So given the extension method:

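    // Extension methods must live in a static class; the class name is arbitrary.
    public static class LoggingExtensions {
        public static void Error(this ILogging log, string format, params object[] args) {
            if ((log != null) && (log.IsEnabled(TraceLevel.Error)))
                log.Write(TraceLevel.Error, string.Format(format, args));
        }
    }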
 

We can now write the MUCH simpler:

    Log.Error("Security '{0}' has been suspended", ticker);
 

And if Log is null, nothing will happen, courtesy of the log != null check. Needless to say the implementation of ILogging could be using log4net or System.Diagnostics or your own scheme. It doesn’t actually matter. If you need context passed in to the logging, then one way would be to pass it in when the ILogging instance is constructed by your IoC container. I’m sure you could figure out other ways to slip context in to it.
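To show how this plays out in a unit test, here is a sketch of an in-memory ILogging that needs no log files or performance counters. The InMemoryLogging class and its threshold parameter are assumptions for illustration; the interface and the Error extension are repeated from above so the block stands on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public interface ILogging
{
    void Write(TraceLevel severity, string message);
    bool IsEnabled(TraceLevel severity);
}

// Hypothetical test double: keeps messages in a list instead of writing them anywhere.
public sealed class InMemoryLogging : ILogging
{
    private readonly TraceLevel _threshold;

    public List<string> Messages { get; } = new List<string>();

    public InMemoryLogging(TraceLevel threshold)
    {
        _threshold = threshold;
    }

    // TraceLevel runs Off < Error < Warning < Info < Verbose,
    // so a severity is enabled when it is at or below the threshold.
    public bool IsEnabled(TraceLevel severity)
    {
        return severity != TraceLevel.Off && severity <= _threshold;
    }

    public void Write(TraceLevel severity, string message)
    {
        Messages.Add(severity + ": " + message);
    }
}

public static class LoggingExtensions
{
    // Safe on a null log: an extension method call is really a static call.
    public static void Error(this ILogging log, string format, params object[] args)
    {
        if (log != null && log.IsEnabled(TraceLevel.Error))
            log.Write(TraceLevel.Error, string.Format(format, args));
    }
}
```

A test can then inject an InMemoryLogging and assert on Messages, or inject nothing at all and rely on the null check.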

Now, one of the places I use this mechanism is on a single stock future pricing system with massive volumes and required latencies in the millisecond range. So we really don’t want to incur any costs we don’t absolutely have to. Some logging has somewhat expensive operations to determine the parameter lists, consider for example:

    Log.Verbose("Stock '{0}' listed on {1:d} and has a market cap of {2:c}",
        ticker, GetListingDate(ticker), GetMarketCap(ticker));
 

This is a somewhat contrived example, but you can see where I’m going. I don’t want the costs of pulling the listing date and market cap every time I pass the Verbose logging call, which most of the time probably isn’t enabled. We could maybe cache the listing date up front; but market cap can change second by second so it can’t be cached. So what are we to do? Well, it’s easy enough, add an extension which lazy loads the format arguments:

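    // Belongs in a static class like any extension method; Func needs "using System;".
    public static void Verbose(this ILogging log, string format, Func<object[]> args) {
        // args() is only invoked once we know Verbose logging is enabled.
        if ((log != null) && (log.IsEnabled(TraceLevel.Verbose)))
            log.Write(TraceLevel.Verbose, string.Format(format, args()));
    }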
 

So, now our contrived example becomes:

    Log.Verbose("Stock '{0}' listed on {1:d} and has a market cap of {2:c}",
        () => new object[] { ticker, GetListingDate(ticker), GetMarketCap(ticker) });
 

Slightly more complicated, but it now will only execute the expensive operations when absolutely necessary.

So there we have it, a nice injectable logging wrapper that can handle being completely disabled, can be unit tested against and mocked out, and can handle all logging scenarios I’ve needed since I came up with it about 6 months ago.
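If you want to convince yourself that the lazy version really does skip the expensive calls, a sketch along these lines will do it. SwitchableLogging, LazyLoggingDemo and the Lookups counter are hypothetical names invented for the demonstration, and GetMarketCap here is a stand-in that just counts how often it runs.

```csharp
using System;
using System.Diagnostics;

public interface ILogging
{
    void Write(TraceLevel severity, string message);
    bool IsEnabled(TraceLevel severity);
}

// Hypothetical logger that can simply be switched on or off for the demo.
public sealed class SwitchableLogging : ILogging
{
    public bool Enabled;
    public string LastMessage;

    public bool IsEnabled(TraceLevel severity) { return Enabled; }
    public void Write(TraceLevel severity, string message) { LastMessage = message; }
}

public static class LazyLoggingExtensions
{
    public static void Verbose(this ILogging log, string format, Func<object[]> args)
    {
        if (log != null && log.IsEnabled(TraceLevel.Verbose))
            log.Write(TraceLevel.Verbose, string.Format(format, args()));
    }
}

public static class LazyLoggingDemo
{
    // Counts how many times the "expensive" lookup actually executed.
    public static int Lookups;

    // Stand-in for an expensive call such as a live market cap lookup.
    public static int GetMarketCap(string ticker)
    {
        Lookups++;
        return 42;
    }

    public static void LogCap(ILogging log, string ticker)
    {
        log.Verbose("Stock '{0}' has a market cap of {1}",
            () => new object[] { ticker, GetMarketCap(ticker) });
    }
}
```

With the logger disabled (or null) the counter stays at zero; the cost that remains is allocating the delegate, which is small next to the lookups it avoids.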

posted @ 10/14/2011 6:06 AM by Sean Hederman

Apologies for being so quiet

Okay, the last year or so has been insane. On the personal side I got married last month and anyone will tell you that doing that can suck up 6 months plus of your life. Couple that to some exciting work that’s been sucking up most of my waking hours and you see little time for blogging or otherwise.

Anyway, I’ve resolved to sort that out ASAP. So, here’s my plan: I’m going to carry on blogging and stick to strictly technology based posts from now on. My screeds about strategy vis a vis Microsoft, Apple etc seem to be popular based on page views, but to be honest they’re not why I started blogging.

posted @ 10/11/2011 9:16 PM by Sean Hederman

Windows 8 preview: Yawn

So Microsoft are releasing little tidbits of info about Windows 8. As I anticipated there’s nothing ridiculous like dropping support for .NET. Instead, they’re raving about their funky new tile-based UI, and how great it’ll be to write apps for it in HTML5 and JavaScript. So, .NET isn’t dead, in the same way that VB.NET is an equal partner to C#. Great in theory – absolute drek in reality.

So, the new tile-based UI is called Metro, and apparently we’re all supposed to switch to it. Except…why? I mean, it’s pretty; and it seems to work well by all accounts, but I can’t see why I’d rewrite my current “legacy” (or “Desktop”) Windows applications to use it, unless I wanted them to use a touch interface.

To be honest, as I mentioned before, I see no reason to write HTML5 and JavaScript applications for Windows. If I’m using those, I’m going to do it as a nice cross-platform web application. I mean, why would I target Windows only? In what mad world would I take the pain of JavaScript for the dubious advantage of targeting only Windows?

I think we all need to stand up and give the Windows Team a round of applause. They’ve managed to accomplish what Linux, Google, Apple and the rest haven’t managed to do yet. They’ve managed to make Windows irrelevant and simultaneously annoy and frustrate a large percentage of the developers who build the apps that make Windows so popular.

This is a strategic blunder the likes of which will be studied in business schools for decades.

posted @ 9/13/2011 8:57 PM by Sean Hederman

Metadata Changes & Versioning

Daniel Antion has an interesting and well thought out article called “Can Records Change” at the Association for Information and Image Management. His question details what we do about changes in data about a document, or metadata. I’m thrilled about him bringing up this topic, because it’s one I’m passionate about. Let’s think about some reasons for this information changing and maybe we can shed some light on his question:

  • The underlying document changed. This is probably one of the most common reasons for metadata changing; people make changes to documents all the time. The contents may have been modified; the subject could have been modified; authors added; review information changed and so on.
  • Linked information changed. This is less common, and many document management systems don’t handle it correctly or at all. Consider a situation where we link to a Person record on our line of business system. We may store some of the fields from that record in the document management system, such as Surname or City: things that may make it easier to find the document down the line. So we capture an Application form for a “Ms. Jones”, but 6 months later we find out that she’s got married and her new name is “Mrs. Smith”. Do we leave the original record data as it is? Curse ourselves for storing Line of Business data in our DM system? Change the data, accepting that a search for “Ms. Jones” now won’t find a document that plainly says “Ms. Jones” on it?
  • Information captured incorrectly. Depressingly common; we obviously want the correct information. However, our auditors and lawyers will possibly also want the original metadata; especially if processing or business decisions were made using that information.
  • Extra information added. Our processing workflow might well add metadata to the document; storing information about the processing steps undertaken; approvals gained; signatures affixed and so on. This doesn’t change the original document or metadata but must be accessible as well.
  • Our metadata schema changes. This is also depressingly common, where we change what fields can/must be captured against a document type. Much as we all like to think we can plan perfectly, and much as our clients love to believe they understand their requirements fully, the truth is different. Think about a scenario where we’ve been in operation for 3 months when the client comes in and tells us that they need a “Category” field added to the document type. Great; we can add it, but what about the existing documents that don’t have it? Does this mean that we have to add it as an optional field? In too many systems the answer is yes. Now, a couple months later they change their mind. “Get rid of it”, the client commands. What happens to the documents captured with the data? If we restored the field sometime in the future would their data have been lost? Again, too many systems have “yes” as the answer to that question.

Okay, so now we’ve had a look at some of the reasons that the document can change, we can see some requirements coming out. Our hypothetical metadata system must keep a version history, and must keep it in such a way that previous versions’ data is still accessible in searches. Needless to say, audit information about who, what, when, why must be stored against each metadata change. The system must be flexible to schema changes, allowing fields to be added later - even if mandatory, as well as allowing them to be removed and even restored.

Additionally when we keep a version history, we must also consider whether we want a bitemporal system; a system which not only stored what did happen; but also what should have happened, e.g. we only updated “Ms. Jones” to “Mrs. Smith” yesterday; but she sent us the documentation 2 months ago and we should have done it then. A bitemporal system caters for such a situation; allowing you to see both the “Operational Truth” of how events actually occurred and the “Business Truth” of how events were supposed to happen.
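A bitemporal field is easier to picture in code. The sketch below is one possible shape for it, under the assumption that each version carries two timestamps; FieldVersion, BitemporalField and their members are hypothetical names, and this is not how any particular product stores its metadata.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// One version of a metadata value, stamped on two separate time axes.
public sealed class FieldVersion
{
    public string Value;
    public DateTime ValidFrom;   // "Business Truth": when the change should have applied
    public DateTime RecordedAt;  // "Operational Truth": when we actually captured it
    public string ChangedBy;     // audit information
}

public sealed class BitemporalField
{
    private readonly List<FieldVersion> _versions = new List<FieldVersion>();

    // Versions are only ever appended, never overwritten.
    public void Record(string value, DateTime validFrom, DateTime recordedAt, string changedBy)
    {
        _versions.Add(new FieldVersion
        {
            Value = value,
            ValidFrom = validFrom,
            RecordedAt = recordedAt,
            ChangedBy = changedBy
        });
    }

    // Answers: given what we knew at 'knownAt', what was the value as of 'asOf'?
    // Returns null if no version qualifies.
    public string ValueAt(DateTime asOf, DateTime knownAt)
    {
        return _versions
            .Where(v => v.RecordedAt <= knownAt && v.ValidFrom <= asOf)
            .OrderBy(v => v.ValidFrom)
            .ThenBy(v => v.RecordedAt)
            .Select(v => v.Value)
            .LastOrDefault();
    }

    // Every value the field has ever held, so searches on old values still match.
    public IEnumerable<string> AllValues()
    {
        return _versions.Select(v => v.Value).Distinct();
    }
}
```

In the “Ms. Jones” example, the update to “Mrs. Smith” would be recorded with yesterday as RecordedAt and the date two months ago as ValidFrom, so both views of history remain answerable, and a search for either name can still find the document.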

As you can see, what seems like a simple topic of changing information becomes complicated very quickly. It’s important that your document management system handle these complexities in an intuitive manner. Almost every system I’ve ever seen falls over when it comes to metadata. The most usual reason is that most systems are designed around their underlying database, and that database doesn’t handle one or more of the scenarios I’ve outlined above. For example, a relational database like SQL Server can’t cater for schema changes correctly without a great deal of work that frankly isn’t worth the effort. Other systems use a more hierarchical store which handles the schema changes nicely, but struggle with efficient bitemporal access and, most importantly, tend to have rotten performance.

Do you know of other systems that can efficiently handle all of the above reasons for metadata changing? What about scenarios I’ve left out?

Want to change your metadata reliably, accurately and quickly? Signate 2010 handles all of the above scenarios well due to its unique and innovative design.

posted @ 7/19/2011 8:34 PM by Sean Hederman

Apple and the Commoditization of Developers

You have to hand it to Apple; they certainly know marketing. They can sell millions of units of a product with dubious value. Not only that, but they can sell the same useless product to the same person more than once. There are certain people at one of my clients who all rushed out to buy the iPad when it came out; they used it with great gusto and excitement and showed off all its nifty features. You know, all the things it could do that my iPhone could also do, while being less portable and unable to make calls. Anyway, I noticed that these wonderful iPads started being left around more and more. These guys are die-hard Apple fanatics so they never discarded them entirely, they just … used them less. Eventually, the iPad was pretty much being used for its one true killer app: reading web pages whilst sitting on the toilet.

Nonetheless, when Apple announced the iPad 2, most of these Apple-istas announced their intention to “upgrade”. It’s not really an upgrade though; they’re discarding the old one and buying a completely new one. Apple have reduced the prices of the iPad 1, so if they were to try and sell their old one, Apple have ensured that they would be hugely out of pocket. Plus, the new one has very few capabilities better than the iPad 1. Nonetheless, a True Apple Fanatic™ must have the latest Apple kit. So, here we see a lemming rush of people determined to throw money at an all but useless product that they already have. I’m not saying the iPad doesn’t have some value, I’m just saying that in most cases people are buying it without any need or place for it, and thus it is useless to them.

Anyway, that’s not really the point of this article.

The other great marketing trick that Apple have done is to begin the complete destruction of the livelihood of all software developers on Earth, despite a pressing shortage. This is a quite amazing trick when you think about it. Apple’s hostility to the average developer is well known, and with the Apple App Store™ their revenge is nearing its completion. They have decided to equate developing applications with singing songs. They encourage a price of $0.99 for a song, and for an application. Except, you see, the song is normally produced for the mass market by an artist who also has recording contracts and live shows and so on; whilst the application is normally specific to a niche and is the sole livelihood of one or more developers. Oh I agree that some applications have a broad mass-market appeal, and those successful apps are the ones that Apple holds out to the rest of us as a carrot in order to keep us churning out applications for below market value. Most of the successful apps are games and other things with limited real utility; so the other people who are suffering are the users. Fart apps are plentiful, accounting apps non-existent.

What we are seeing is a massive transfer of wealth; away from developers and into Apple. They don’t care that your app doesn’t make enough money for you to survive; it helps them push more units of product and that is their only concern. They also take a 30% cut of your sales, but leave you holding the bag on returns; even if it’s their fault. That massive bump in Apple’s share price you saw recently (there’s always a bump in their stock recently), that came straight from your sweat and blood. What did you get for it? Nothing, zip, nada, maybe some aggravation about how your app didn’t meet their high standards for $0.99 applications. A dismissive, sneering, cold shoulder from a dismissive, sneering company.

The issue is that due to Apple’s perceived dominance there is now a halo effect; the whole industry seems to be climbing on this bandwagon of cheap apps given away in perpetuity by software developers whose talents are in dire shortage, in the desperate hope that there is indeed a pot of gold under the App Store rainbow. The pot of gold is too frequently an illusion; you’d be better off buying lotto tickets than in developing for Apple’s consumer products. Plus, it’s only a matter of time before Apple make their iAds platform mandatory; and will probably make you refund them for mistakes they make with it.

So where to for software development as a profession? Well, I believe it’s increasingly going to result in a massive income disparity; successful App Store developers and top-end corporate developers will be the small elite commanding higher and higher salaries due to their rarity; and the broad mass of other developers will find their income eroded from the top by the App Store successes and from the bottom by even more desperate developers giving their wares away for free. In other words, the industry will become more like music – with a few rich swimming at the top, and masses of wannabes struggling at the bottom. We’re already seeing this trend: despite insourcing growing in popularity (due to the clear disadvantages of outsourcing) and despite growing demand, salaries are stagnant or declining, defying economic truths. Apple have realized that they can turn their marketing machine to reducing their software development costs in addition to increasing sales. The Reality Distortion Field can extend to an entire industry leaping off a cliff like lemmings.

So say no. Say no to companies like Apple which hate developers. Say no to working your butt off for the chance of an adequate return. Demand respect and demand that your talents, bought at high price by years of work and study, are respected and paid for. Stop allowing companies like Apple to steal from you and expecting you to be grateful for it. Price your software at a reasonable rate, not one encouraged by a company which does not have your interests at heart. Take the cost of development and divide it by the number of likely sales, not potential sales. If you don’t know the number of likely sales, do some market research. Admittedly, by closing off the App Store, Apple make this very difficult; your app must be full featured before you can find out how popular it might be. So, develop it as a web app first; see if it’s popular; see if people are clamouring for an iPhone version. Only then should you consider writing an iPhone version. Ideally, rather just make your Web App iPhone compatible. That way you side-step Apple’s control completely, and retain ownership of your work rather than gifting it away.

Just Say No to closed platforms; and you don’t get any more closed than the Orwellian App Store.

1984, now brought to you by Apple.

* – Disclaimer, I am not an Apple developer and have no desire to be one

posted @ 7/12/2011 1:04 PM by Sean Hederman

The Truth About Folders: a rebuttal

In AIIM, Laurence Hart makes a number of comments about the Search vs Folders debate.

Claim 1: People are used to folders, Rebuttal: They are used to search as well

Not a good enough reason to stick with folders. By that logic we shouldn’t use search engines to access the Web, we should organise it into folders instead, just like Yahoo used to do. You see, people are also used to search engines; they use them every day of the week. In fact, more and more people are using search based idioms rather than folder based idioms to access their systems. Look at the search box built into the Windows 7 Start menu, and every Windows Explorer window.

Claim 2: Search Engines fail, Rebuttal: So does everything else

He claims that we should have folders as a fallback position in case the search engine doesn’t work. Well, if your technology is based around an unreliable bolt-on search engine (looks meaningfully at SharePoint), then yes, this is a valid concern. If your entire system is designed around search, then the search engine is the core and any folder-based view would be the bolt-on, and thus would be the one more likely to fail. All systems can fail from time to time, but that is not a good reason to not use an entire class of technology.

“Cars sometimes break down, so we should all use horses.”

This is a classic example of the logical fallacy of the excluded middle.

Claim 3: Folders help you organise, Rebuttal: Why manually organise?

I’m actually not sure what point he is trying to make here exactly. He goes into taxonomies, and how folders help users create a “well-executed taxonomy”, and how creating a taxonomy without folders sacrifices performance and simplicity. Not one person I’ve ever spoken to about their requirements from a document management system has ever mentioned the word taxonomy. Not one. Ever.

I will not deny that a proper taxonomy is easier to do with folders than without. I will even admit that should a system have a bolt-on taxonomy system this will likely be less performant and simple than a system designed around taxonomies. What I deny is the need for taxonomies at all. Search and metadata are all that is required to search billions of documents, and they require zero extra effort.

He then admits that these taxonomies change, and systems must be put in place to manage these transitions. I’ve never once had to redesign search, it’s search for goodness sake; and if your data changes, just reindex it! Want to add a metadata field? Reindex. No manual effort, let the computer do it for you.
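To make the "just reindex it" point concrete, here is a minimal sketch of a field-aware inverted index. It is not how Signate or any particular product does it; the `build_index` function and the sample documents are hypothetical, and a real search engine would add tokenisation, ranking and incremental updates.

```python
from collections import defaultdict


def build_index(documents):
    """Return {(field, token): set of document ids} for every field of every document."""
    index = defaultdict(set)
    for doc_id, fields in documents.items():
        for field, value in fields.items():
            # Naive tokenisation: lower-case and split on whitespace.
            for token in str(value).lower().split():
                index[(field, token)].add(doc_id)
    return index


# Hypothetical sample data.
documents = {
    1: {"title": "Quarterly report", "author": "Alice"},
    2: {"title": "Supplier contract", "author": "Bob"},
}
index = build_index(documents)

# Adding a metadata field: update the records, then rebuild the index.
documents[1]["department"] = "Finance"
documents[2]["department"] = "Legal"
index = build_index(documents)
```

The new field becomes searchable as soon as the rebuild finishes, and nobody had to move or refile a document.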

Claim 4: Not using folders cripples systems, Rebuttal: Only if the developers were idiots

This claim boggles my mind. Let me quote: “One of the problems that you get when you don’t use folders is that you can cripple most systems. While few systems claim a limit to the number of documents that can reside in one location, there is a practical limit”. I’m pretty sure that what he’s talking about here is the well-known reality that operating systems struggle when directories fill up with more than a few thousand files.

He seems to be conflating the experience of the system from the outside (i.e. the user sees no folders), with the implementation details of the inside (i.e. does that mean the system stores every document in one huge directory). This is utter rubbish. Signate, as an example, creates an internal directory structure to which documents are routed in a balanced fashion, ensuring that no directory winds up with too many documents. This structure is internal to the system and is a performance and management implementation detail. It is not exposed outside the system at all.
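One common way to get this kind of balancing is to derive the storage directory from a hash of the document's identifier. The sketch below illustrates the general technique only; it is not Signate's actual scheme, and the `storage_path` function, the `/var/docstore` root and the two-level layout are all assumptions made for the example.

```python
import hashlib
from pathlib import PurePosixPath


def storage_path(document_id, root="/var/docstore", levels=2):
    """Map a document id to a nested directory derived from its hash."""
    digest = hashlib.sha256(document_id.encode("utf-8")).hexdigest()
    # Two hex characters per level gives 256 sub-directories at each level.
    parts = [digest[i * 2:(i + 1) * 2] for i in range(levels)]
    return PurePosixPath(root, *parts, document_id)


path = storage_path("invoice-0001")
```

Because the hash spreads identifiers evenly, two levels give 65,536 leaf directories that fill at roughly the same rate, and the user never sees any of them.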

In fact this argument of his is an excellent example of why folder based systems don’t work as well as search based ones. While Signate automatically balances files across a directory structure designed to allow billions of documents per node, no such balancing can be applied when humans are involved. Every folder-based system I’ve ever seen winds up with a “dump” location, sometimes more than one, where documents which don’t fit the taxonomy neatly are placed. This can swiftly grow to thousands of documents, resulting in the very problem that Laurence claims search based systems suffer from. Sure, if the taxonomy was perfect, this would not arise; and this is also a sign that the taxonomy may need to change, resulting in a great deal of manual work. In a search-based system with balancing, this situation never arises. This is not just a punt for Signate; I’ve never seen a search-based document storage without balancing, and I struggle to comprehend that anyone would ever conceive of designing such a system.

He then claims that “You can swear that nobody will ever browse to [the internal storage] location, but unless you remove that capability, someone will do it”. Well, of course we remove it! I consider it a massive security breach if people are able to access the internal document location of the system without passing through the interface to the system. Does he allow users to access his internal company databases directly? Of course not.

Claim 5: Search Engines can’t read your mind reliably, Rebuttal: Nothing can

Neither can folders. Search engines help you find what you’re looking for; folders let you know where you’re looking. Which would you rather have? We don’t need perfect reliability; you can refine your search terms based on the results you see. Too many results? Add search terms. Too few? Remove some. Make some more approximate, tighten up others. Signate allows an enormous range of searching options, including approximate search where words similar to the specified word are found.
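As a rough illustration of approximate search, the sketch below uses Python's standard `difflib` module to find indexed words that are similar to a misspelt search term. It says nothing about how Signate implements the feature; the `approximate_matches` helper, the sample vocabulary and the 0.8 cutoff are assumptions for the example.

```python
from difflib import get_close_matches


def approximate_matches(term, vocabulary, cutoff=0.8):
    """Return up to five indexed words whose similarity to the term is at least `cutoff`."""
    words = [word.lower() for word in vocabulary]
    return get_close_matches(term.lower(), words, n=5, cutoff=cutoff)


# Hypothetical vocabulary taken from an index.
vocabulary = ["invoice", "contract", "receipt"]
matches = approximate_matches("invoce", vocabulary)
```

Lowering the cutoff makes the search more approximate and raising it tightens it up, which is the same refine-as-you-go loop described above.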

Conclusion

Clearly, I’m biased. I’m so convinced of the value of search-based document management that we created one ourselves. Laurence is a specialist in Documentum, a prominent folder-based document management system. So, we’re both biased. But read his article, read mine, and then ask yourself which approach:

  • Will get the benefits of document management into my users’ hands faster?
  • Will result in the lowest ongoing administration whilst delivering excellent results?
  • Will adapt to my changing business needs?

Folder-based systems are great for rigorously defining the information content your organisation needs; and if you’re working in a top-down company that has an IT department that can easily define a data dictionary for your entire business and enforce its consistent usage, then I’d strongly suggest you look at systems that support such an approach. If, however, you work in the remaining 99% of businesses where change is constant, time is precious, and flexibility and turnaround are more important than rigor, then look at systems that support that approach.

posted @ 7/6/2011 9:23 AM by Sean Hederman

Windows 8, WPF, Silverlight and HTML5

Okay, so there’s a great deal of wailing and gnashing of teeth about Microsoft’s announcement of Windows 8, and especially the comment:

Windows 8 apps use the power of HTML5, tapping into the native capabilities of Windows using standard JavaScript and HTML to deliver new kinds of experiences

As usual, Tim Anderson has an insightful post where he tries to calm people down and introduce some sanity to the discussion. Scott Barnes (the former head of Silverlight) seems to believe that WPF is dead, and Silverlight is on its way out.

HTML5 + JS Windows Apps? Good Idea

HTML5 + JavaScript applications running directly in the OS? Kudos to them, it’s a great idea. It makes Windows programming available to a massive number of devs who currently only do web work. Of course, they’re not too likely to switch, but hey, some might. It also allows us to treat the OS as a commodity with one app working on all operating systems. Oooh, Good Idea for devs, BAD IDEA for Microsoft.

OS-Specific Hooks for JS + HTML5? Meh Idea

Your code can detect the OS and take advantage of these OS-specific features if they’re available, much like the flicking functionality in iOS Safari. Happy days. Will this make devs write HTML5 apps only for Windows 8? No. Not at all, and if MS think this they must be retarded.

Foot off the pedal on WPF? Meh Idea

Personally, I couldn’t care less. I’ve hated WPF since it came out, but then I’m not a graphics designer (although they apparently hate it too). I grok its power and capabilities, but I just don’t care. I either want a simple form-based application, for which I can use HTML and/or WinForms, or a pretty high-graphics app, in which case I use HTML. WPF sits nowhere in that spectrum for me, although I understand and appreciate that it does for some. They could see some developer defections if they go through with this.

Foot off the pedal on Silverlight? BAD IDEA

This would just be idiotic in so many ways it’s not funny. It would make MS a laughing stock in the Web Dev community (okay, more of a laughing stock). It would wipe out their fallback plan for delivering rich applications if the OS gets commoditized. I understand that the Windows OS team don’t expect that to happen, but to remove your Plan B for the thousands of other applications you provide is just thick. Plus this would probably result in a massive developer defection.

Foot off the pedal on .NET entirely? BAD IDEA

Nobody is saying this directly yet, but it seems a logical (if extreme) conclusion from where MS seem to be thinking about going. It’s hard to believe that anyone at Microsoft could be so stupid as to be considering this, but you never know with the Redmond creature. This would be a company-killing move. An enormous number of Windows devs hate HTML+JS with a passion. They like writing Windows applications. They don’t want to write web apps, and they don’t want to use C++ because, umm, it sucks to develop in. Hello Java, and Mr. Larry Ellison is suddenly very happy he bought Java.

Conclusion

I don’t really have any, except to say that you should take what you read with a grain of salt. A lot of it seems to be internal politics playing out in public and marketing drones putting spin on things. It’s entirely possible that the Windows 8 team told marketing about the HTML5 + JS thing as an aside; like “hey we also are allowing you to do this”, and the marketdroids turned it into “this is the way you will do things”.

Personally, I’m kind of uninterested in everything except the direction of .NET in general. I currently use Silverlight a little, for things like showing a graph or video. I gave up on the whole Windows UX team a long time ago. I cut my teeth on DOS and then WINAPI, MFC, WTL, VB, WinForms (1 & 2), was underwhelmed when WPF came out, and mildly interested in Silverlight. In almost all that time, a simple web page with a few simple tags would do everything that the latest UI fad from Redmond would do. I’m just tired of new UI mechanisms and widgets and thingies.

So, am I going to write Windows 8 only applications? No.

Am I going to use WPF? No.

Silverlight? Less and less.

HTML5+JS on the desktop? No.

HTML+JS in the browser, served over the web? Yes.

Off the treadmill, and not particularly interested in the remaining hamsters attacking each other, as long as they don’t destroy the cage.

More and more devs are coming to the same opinion, and Windows is becoming more and more marginalized and commoditized, ironically by the very Windows team that thinks they’re doing the exact opposite.

Final Score

            For Microsoft   For Devs
Good Idea   0               1
Meh Idea    2               2
BAD IDEA    3               2

posted @ 6/3/2011 1:15 PM by Sean Hederman

Why Search is the #1 Feature

The main means of accessing documents in document management systems is via folders. This makes sense because it’s what people are used to. Before they get a document management system they normally arrange their shared documents in a shared location, organized via folders. They’re intuitive, hierarchical and familiar; and thus people tend to look for document systems which are focused around folders as well. This makes the migration to the new system easier as well.

THIS IS A MISTAKE!

“Why are you switching to a document management system at all?”

If your file share is perfect for you why spend a lot of money on a system which is just going to replicate it? You’re trading a cheap, convenient, and reliable system for one that is much dearer, requires retraining and is all too often less reliable. If you need to share your documents in a folder structure across the Internet, use Dropbox, it’s a fantastic service and it’s very reasonable. If your company just needs a glorified (and expensive) folder share, I don’t want to con you out of your money and add zero value. With Signate, we want to add real value, and we don’t believe that is done by a web-based share.

Case Study in failure

At a large financial company where I consult on some electronic accounting issues, they had a massive shared drive where all documents were kept. They spent millions of rands implementing a company-wide Content Management System (CMS); money thrown at the software, hardware and numerous consultants involved. Most CMS systems make it easy to arrange your documents in folders, and so they reorganized the layout, designed it better and set it all up. It was decreed that all new projects were to use the new system. They did, but only as a file storage medium. The more advanced features like wikis, monitoring, calendar, workflows and so on went almost unused. It was also decided that the existing file share was too large to migrate, so it coexists side by side with the CMS, and there is often confusion about where to find documents, and which version is the “current” one: the file share version or the CMS one. Net result: significant capital and operational spend, massively increased storage requirements (due to duplication), confusion, and little or no improvement. They have also been unable to get the CMS search working, which is a critical failure in my opinion (as we will see later).

Is this the fault of the CMS? Not at all. This particular CMS is a very powerful tool in the right hands. Its configurability allows it to really shine when well-implemented. Unfortunately it is all too rarely implemented well, and this usually requires hordes of very expensive consultants.

Failure of Design

The underlying problem is that the CMS, along with all too many document management systems, caters to people’s first instincts: the desire to keep things the same as they were. Let’s cast our minds back to 1998. The Web was a growing phenomenon, and the most popular portal was Yahoo!. Their web site was built around a directory, a folder structure exactly like that in your shared drive, except it consisted of links to web pages. People submitted their pages to Yahoo! and each would be placed in a category.

They had search, of a sort, but the focus was clearly on the directory structure; that was how you ensured that you found what you were looking for. You would browse through folders, hunting for the right category. Sometimes the categories were arranged somewhat haphazardly, so it could take a while to find the right one. However the task of maintaining this directory grew larger and larger, and the directory fell further and further behind.

I remember that my primary source of new pages started to be from friends’ emails rather than finding them in directories. All the while Google was making search their primary focus. We all know how that story played out; search became the dominant means of finding pages in the web. Why does search trump directories? For a few simple reasons:

  1. A directory imposes the directory organiser’s priorities on the consumer – If the organiser arranges things in a way that the consumer finds counter-intuitive it can be difficult or impossible for the consumer to find content that is present.
  2. A directory requires constant work to ensure relevance – Entries (or documents) can become stale or corrupted, newer locations may become popular causing duplication with work occurring in both locations.
  3. Search puts the consumer’s priorities first – You type what you’re looking for and the search engine finds it, what could be simpler than that? There is no organiser other than the content, so you don’t have to put up with odd filing hierarchies.
  4. Search ensures relevant content is found immediately – No hunting through folders and opening documents; the best matching results are returned first.
  5. Search allows for powerful search terms – You can use advanced features such as ranges for dates and numbers, exact matching, wildcards and so on very quickly and easily.
  6. Directories are categorised by perception, search by reality – When we decide to place a document under the “Technical Specifications” folder we’re doing so based upon our idea of what that document contains. Normally this would be done by the content author; so they’re generally pretty accurate, but there might be a better location or the categoriser may be mistaken in their assessment. Search categorises documents based on their content.
  7. Directories are static – Related to the above, documents change, and your system must cater for that. A directory structure tends not to change, even when it should. People are used to accessing a particular document in a particular place, and if you move the document they won’t find it at all. You’ll go from 100% accuracy to 0% in one swift go. Whereas with a search system, the document will move up and down in the search results for a particular set of search terms as its content changes.
  8. Directories take effort – You need policies and procedures and people who monitor them and control them. All of this is not productive work.

The Road Ahead

The future of document management lies in search. In my many years in the Document Management field, across industries as diverse as logistics, healthcare, insurance, financial, travel and many others I have seen finding documents again and again become the pain point for project after project. This is why we created Signate: as a response to the appalling inefficiencies of products spanning from the cheapest of the cheap to high-end enterprise servers. Signate puts search front and center, and whilst we are ahead of the game right now, I am under no illusions as to how long that advantage will last. Search is such a compelling feature that all document management systems will have to become search-centred or they’ll fail.

The question you have to ask yourself is where you need your company to be. Do you want an easy transition to document management but very little added value, or are you willing to learn a new way of finding your documents? It isn’t even that new if you’re used to searching the Web.

Search Quality

So now that I’ve made my case for search over folders in document management systems, let’s look at the quality of that search. Have a look at the screenshot to the left. It’s from another document management system’s search screen and exemplifies pretty much everything I dislike about search in the document management space.

Each field you can search on is listed, each with its own box. Worse yet there are drop downs for “Contains”, “Exact Match” and so on. Whilst I hate the dropdowns for the date fields, at least they actually have a date range search as opposed to forcing you to pick one date at a time. But now, what if I knew that the document was created after a certain date, and the revision I was looking for was before another one? How would I enter that? Would I be able to leave the “to” date empty for “creation date”, and the “from” date empty for “revision date”? Possibly, it’s not clear. How would I search for where the author is either “Ray Bradbury” OR “Orson Scott Card”? I know it’s one, but am not sure which one. I’d probably have to do two searches.

Now consider the search screen on the right. This, strangely enough, is much clearer as to what you need to do, which is counter-intuitive if you think about it. You’d expect that the screen which spells everything out explicitly would be the easiest and most compelling, but it’s just not. An empty search box invites, a complex search screen repels. How would you search for both authors as above? Well, I’d type: “Ray Bradbury” OR “Orson Scott Card”
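To show why a single box can handle this, here is a deliberately tiny sketch of an OR query over quoted phrases, matched against every field of every document. It is an illustration under stated assumptions rather than any product's query language: the `parse_or_query` and `search` functions and the sample documents are hypothetical, it expects straight double quotes, and it treats every quoted phrase as an OR alternative.

```python
import re


def parse_or_query(query):
    """Extract the quoted phrases from a query such as '"A B" OR "C D"'."""
    return [phrase.lower() for phrase in re.findall(r'"([^"]+)"', query)]


def search(query, documents):
    """Return ids of documents where any field contains any of the phrases."""
    phrases = parse_or_query(query)
    hits = []
    for doc_id, fields in documents.items():
        # Search across all fields at once rather than one box per field.
        text = " ".join(str(value) for value in fields.values()).lower()
        if any(phrase in text for phrase in phrases):
            hits.append(doc_id)
    return hits


# Hypothetical sample data.
documents = {
    1: {"title": "Fahrenheit 451", "author": "Ray Bradbury"},
    2: {"title": "Ender's Game", "author": "Orson Scott Card"},
    3: {"title": "Dune", "author": "Frank Herbert"},
}
results = search('"Ray Bradbury" OR "Orson Scott Card"', documents)
```

One query replaces the two separate searches the form-based screen would need.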

The normal reason document management companies use search forms like the left-hand one is because their search form is a thin wrapper over their underlying database. This limits them, as databases are designed to be very specific, and cannot search across fields easily. Not only that, if they don’t get the search form exactly right, it is possible for a user to run updates and malicious scripts on the database. The database that’s storing all your document data. With Signate we use a completely separate search engine, which not only is designed to search and search well, but also cannot affect your underlying data. Oh yeah, and it’s fast. Blazingly fast. Much faster than a complex query run against a database. Plus it can easily scale up to billions of documents, which database-driven searches struggle with.
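For readers wondering what getting a database-backed search form "exactly right" involves, the usual defence is a parameterised query, sketched here with Python's built-in `sqlite3` module. This shows the general mitigation for database-driven forms, not how Signate works; the `documents` table, the `find_by_author` function and the hostile input are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, author TEXT)")
conn.executemany(
    "INSERT INTO documents (author) VALUES (?)",
    [("Ray Bradbury",), ("Orson Scott Card",)],
)


def find_by_author(conn, author):
    """Look up document ids by author using a bound parameter."""
    # The ? placeholder passes the value separately from the SQL text,
    # so user input is never interpreted as SQL.
    rows = conn.execute("SELECT id FROM documents WHERE author = ?", (author,))
    return [row[0] for row in rows]


# Hypothetical hostile input typed into a search form.
hostile = "x'; DROP TABLE documents; --"
safe_result = find_by_author(conn, hostile)
```

The hostile string is treated as an author name that matches nothing, and the table survives. A form that builds its SQL by string concatenation gets no such protection.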

Conclusion

If you need a document management system, please, please, please choose one that puts search in the forefront. Ensure that before you buy you really kick the tires on the search system; that it’s quick and easy to use. It should not take your staff longer to find an internal document than to find a web page via Google. If it does, then you have a suboptimal system. A swift and powerful document management system should pay you dividends across the board. You should have happier and more productive staff, faster processes, easy findability for your documents, and vastly improved turnaround times.

Use our Document Costs Calculator to work out the amount you’re probably wasting right now on your document costs, and thus the amount you can save every year. That financial company I discussed earlier? Our calculator shows that a well-designed and well-run document management system should be saving them between 12 and 70 million rands a year, and between 180,000 and 490,000 staff hours annually. These figures are based on research, calculations and figures from Gartner, Cap Ventures and the Arbeitsgemeinschaft für wirtschaftliche Verwaltung.

Plug in your company’s figures and see the impact a good document management system could be having on your bottom line and client satisfaction. Plus it’s good for the environment too.

Palantir (Pty) Ltd

Sean Hederman is Director of Palantir (Pty) Ltd, and Software Architect for the Signate Document Management System. He also writes the popular programming blog Codingsanity.


posted @ 2/17/2011 1:25 PM by Sean Hederman

Reflector No Longer Free

Red Gate have announced that they will no longer be offering Reflector for free, and that from May, you will need to pay $35 for it. This has caused a predictable community backlash, especially on reddit. Here are some of my thoughts on the situation.

Firstly, I must say I’m disappointed. I personally was hoping that Red Gate would make enough money off the (very, very useful) Premium version, and “halo effect” to keep it free. That said, I’m not a dyed in the wool Open Source zealot, and I understand why they’ve been forced to make this switch. They’re being very careful to keep the price as low as possible, and I think it’s an incredibly reasonable price for what the tool offers.

Let’s get a few myths out of the way:

Red Gate have added in a timebomb to kill the old version of Reflector.

No, the timebomb has always been there, it was put in place by Lutz. Used to annoy me a lot. Anyone using this argument does not have more than a passing familiarity with the tool.

Reflector will be easy to replace with an OSS version

Cool, good luck with that. There are FOSS competitors. One problem: they suck. Okay, a bit strong, but they are nowhere near Reflector in usability, features, extensibility etc. I looked into doing a bit of what Reflector did myself in order to make my Diff tool standalone. Let me just say that after a few weeks at it I’d barely scratched the surface. Anyone using this argument does not know very much about programming.

Red Gate are screwing over the contributors

The idea here is that by making Reflector paid-for, Red Gate are somehow screwing me over. I can’t speak for the other contributors, but I wrote Reflector.Diff firstly for myself, and have kept working on it because it’s fun. I write Windows programs too, does that mean Microsoft are screwing me over by charging for Windows? No. The people using this argument do not appear to be contributors themselves.

Red Gate should give Reflector away to the FOSS community

Ummm, why? I assume they paid a fair whack of cash for it, who’s going to refund that? They’ve added new features (some very, very cool), who’s going to refund them those? I see not one of the self-righteous “give it away” crowd contributed to the tool even when it was free, now they expect to be given stuff for free for their lack of participation?

 

 

In conclusion, it appears that most of the people who use the tool are quite happy with the idea of paying $35 for something that has given them great value for almost a decade and will likely continue to do so for ages to come.

The ones who don’t use the tool are the ones screaming the loudest. The ones who didn’t contribute and add to the product are the ones complaining about how contributors are getting shafted.

I have a loop in my head of Bill Hicks screaming “THINK OF THE CHILDREN!!!”.

posted @ 2/3/2011 7:43 AM by Sean Hederman

Bing caught cheating

Google has caught Microsoft copying its search results. Google noticed recently that Bing’s search results were becoming more and more similar to its own, and came up with a hypothesis for why: they suspected that IE9 was “dialling back” to Redmond with Google’s search results, allowing Microsoft to copy Google indirectly. Microsoft have admitted that they have been using “clickstream data we get from some of our customers” to improve their search results, and accused Google of using “spy-novelesque stunts”.

There are quite a few issues here, so let’s unpick them.

Microsoft have done nothing illegal here. They’ve collected data that their customers have said they can collect and used that to improve their search results. That said, their toolbar by default opts users in to allowing Microsoft to collect this data, so that leaves a bad taste in the mouth. Microsoft have invited themselves into your living room and are peering over your shoulder watching your searches, and using that to compete more effectively. Sure, you can ask them to leave, but it’s not a nice situation to discover.

One has to wonder about Microsoft’s claim to be an innovator now. It seems that this innovation is driven at least partially by just copying Google. They differ from scraping sites that republish others’ data only in degree, not in principle.

Whilst their actions are not illegal, they are by almost any definition unethical. A very good definition of ethical behaviour is that you would not be upset if your actions were on the front page of a newspaper. Microsoft’s now are, and they are being defensive and running for the hills. They’re trying to spin it that this data is just one of thousands of data streams they use to improve their results. Well that’s nice, that makes it okay then.

When a schoolchild sitting for a major test says that they only copied a few of the questions from the guy in front, does that make it acceptable? No, it does not. Firstly, they were caught cheating, which is unacceptable behaviour, and it leads inevitably to questions about where the other answers came from. Hidden notes? Also, what about other tests? Their entire history and character is now tainted with the epithet “cheat”.

This is where Microsoft find themselves. They have been caught cribbing off the other kid’s results and they’re trying to make out that it’s somehow honourable.

It absolutely is not. They would not be happy with similar behaviour from their kids.

Somebody needs to get fired for this, because if not, it would be an acceptance by Microsoft that “innovation” is just copying others; something we’ve seen them do way too often.

Companies that hold themselves to a lower ethical standard than cheating schoolchildren need to get smacked.

posted @ 2/2/2011 6:30 AM by Sean Hederman

Why doesn’t MS use Android?

John C. Dvorak asks why MS doesn’t give up on the faltering Windows Phone 7 and instead use an Open Source OS for their phones to allow them to compete with Apple and Google. He suggests maybe using Android, and then goes into a tizz about how MS and other tech companies seem so scared of Open Source. Normally I quite like Dvorak, but he’s being a bit silly here:

  1. If they use Android, then they’re not competing very effectively with Google.
  2. MS aren’t completely scared of Open Source. They’ve released more code into the community than any other tech company, IIRC [citation needed]
  3. Switching to another supplier’s phone OS (even if said supplier is ‘the community’) would be an admission of defeat; an acceptance that MS can’t do a decent phone OS. This would be a marketing nightmare as their server and desktop OS customers would start saying “Well if they can’t do that, maybe they’re not so good at my stuff either”.

FOSS advocates love to believe in their turgid world that Open Source is a mantra that will cure all ills.

It’s not.

posted @ 2/1/2011 7:29 AM by Sean Hederman

New Feature: Send Email Activity

We've just added the "Send Email Activity" to Signate. This allows you to easily drop templated emails into your document process.

 

SendEmail

This powerful, and much asked for feature comes standard and free with all Signate editions. If you already have a Signate license, you will automatically get this feature on the next major upgrade as per your support agreement.

The ease of use of this activity is incredible. Drop it on to the Workflow Designer, set the properties, and edit the Body as indicated in the screenshot to the left.

Even dedicated workflow tools such as K2 have difficult procedures for accomplishing this same task, which even a novice can deal with in seconds in Signate.

posted @ 11/24/2010 7:19 AM by Sean Hederman

New Feature: Approval Workflow

This is just a quick announcement of our new Approval Workflow functionality which will be released soon. The workflow designer now adds the Approval Workflow task:

Approval Workflow

As you can see from the above there are two legs, Approve and Reject, and document capture only happens in the Approve leg. You can add more legs and/or rename the "standard" legs. The approval activity has an Allocated To property, like Capture Document which specifies which group the Approval will be routed to. When an approval step is opened by a user, the screen below is displayed:

Approval Screen

Each leg which was added in the design appears in the Approval selection box. Using the example above, clicking on Submit Decision would route this document to Capture Document for capturing of document metadata.

posted @ 10/24/2010 5:43 PM by Sean Hederman

Windows Phone 7 doesn’t support Sockets

Because Smartphones don’t communicate with servers apparently.

Although it’s “a pretty high priority for inclusion in a follow-on release”.

It seems Microsoft are continuing their brilliant strategy of competing against where the market used to be 3 years ago. You know, the one where they lose billions of dollars in shareholder value as compared to their competitors.

WP7 also cannot connect to non-broadcasting access points, so I can kiss my company’s WIFI network goodbye.

From the reddit thread:

But you look at something like sockets connection in a mobile phone API, and it's like the number 1 thing I would list as a feature. Hell, I probably wouldn't even list it, taking it for granted that it would f*****g be there.

To me, this would be like writing the API for a television set gadget and not including any way to write to the screen.

How is this not in the box?

Apparently, this decision is not a popular one.

For what it's worth, just about everybody on the wp7 dev mailing list inside MS is up in arms about this

Now, MS lost me as a mobile customer some years back. You know, when they pretty much had the whole field to themselves and did absolutely nothing with it. Then Apple came along and in one year, beat the market share MS had painstakingly built up over 5. Then Google came along and did the same thing, except this time as a component supplier, just like MS was, and doing it even faster.

MS have been utterly humiliated in the mobile space again and again, and you’d think that this embarrassment would make them ensure that WP7 had the features it needed to compete. Clearly not. When I see MS tripping and stumbling about these days I’m reminded of nothing so much as IBM in the 80’s.

Please, please, please, fire Ballmer, don’t lose Ozzie and focus on just creating good products instead of the marketing-led drivel you’ve been pumping out for the last few years.

posted @ 10/20/2010 6:55 AM by Sean Hederman

Estimating Software Development

Creating an estimate for a software development project is hard, really hard. There are books and articles and speeches and academic papers, and you know none of them has got it completely right, because projects are still badly estimated. So what am I going to add to the mix? Nothing, really. Just a set of tools and techniques you can use to help you improve your estimations. But first off let’s dispel some myths.

Myth 1: Agile doesn’t need estimation

Well, if you work in a company where they’re happy that the date of delivery and feature set be vague, then good for you. In the rest of the world we often have to present a quote to a customer that specifies what they’re going to get and how long it’s going to take. This means you’re going to have to estimate. Happily most estimation techniques work fine with Agile, they just require that you get an idea of your scope up front.

Even without a requirement for a full estimate, you still need to estimate the scope of the user stories to see which will likely make it into the sprint and which won’t.

Myth 2: Developers make good estimators

Sorry, but we really don’t. Look at our track record. We tend to underestimate tasks we consider “fun”, and overestimate those we consider “boring”. It would be great if that added up to the right amount, but it turns out that we suck even at overestimating, and generally forget about huge swathes of requirements when asked to “thumbsuck”. Oh, about giving a thumbsuck estimate: don’t, ever. It is about as reliable as throwing a dart at a board, so you might as well just do that instead. If you’re being pressured for one, consider a parable:

There was a pilot, who often flew from Paris to London
He was asked how much fuel he’d need for a trip between Madrid and New York
He didn’t ask in which direction, nor the make of plane, nor the number of passengers
He just pulled a number out of the air
And the plane ran out of fuel mid-ocean and everyone died

If you think people’s lives are not affected by poor estimating for software projects, think again. I’ve helped rescue some projects where poor estimating had badly affected code quality on systems affecting people’s medical aids, finances, and yes, the possibility of making an airplane fall out of the sky.

It does not take long to do a proper estimate, a few hours at most, so do it properly.

Myth 3: Estimates aren’t important

A proper estimate serves a broad array of functions:

  • It lets you know when something is harder than expected, critically important knowledge in most projects
  • It allows you to synchronize timelines with other departments. For our Signate 2010 Document Management System we delivered the first version 4 months before the web site and marketing were ready. Ooops! Admittedly my software delivery estimate was spot on, but I hugely underestimated how long the marketing would take.
  • It allows you to determine what your profitability would be on a project, which can mean the difference between costly failure and profitable success. You can walk away from projects that wouldn’t make you money. Otherwise you just don’t know. In the early days of Palantir, before I joined, there was a project like that, and it very nearly sank the company.

Okay, so let’s move on to some of the techniques:

Technique 1: Always keep records

This comes from Joel Spolsky, I believe: always record your initial estimate, as well as the final result. Keep those records and use them in future estimation techniques. They allow you to build up a pattern of how various people estimate. Make sure you break up your estimates based on who gave which one. It is not embarrassing being off in your estimates; what is embarrassing is being off consistently and not factoring that into your future estimates.

Project 1: Estimated 50% below actual (ouch)
Project 2: Estimated 45% below actual, 50% below quoted (yay!)

We learn from our mistakes. Not doing so is insanity. You can build up an estimation factor to apply to people’s estimates to get a more accurate feel. Just make sure you update your estimation factor constantly, as their skills will change. Make sure to keep it isolated on project type. My estimation factor on Windows Services may be 80%, but on ASP.NET WebForms it could be 120%.
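As a rough sketch of how such a factor could be built from the records in Technique 1: the code below assumes the factor is simply total actual hours divided by total estimated hours, kept separately per person and per project type. That definition, the function names and the sample history are all my own illustration, not a standard formula.

```python
from collections import defaultdict


def estimation_factors(records):
    """Build one factor per (person, project type) from past estimates.

    Each record is (person, project_type, estimated_hours, actual_hours).
    Assumption: factor = total actual hours / total estimated hours.
    """
    totals = defaultdict(lambda: [0.0, 0.0])  # key -> [estimated, actual]
    for person, project_type, estimated, actual in records:
        totals[(person, project_type)][0] += estimated
        totals[(person, project_type)][1] += actual
    return {
        key: actual / estimated
        for key, (estimated, actual) in totals.items()
        if estimated > 0
    }


def adjusted_estimate(raw_hours, person, project_type, factors):
    """Scale a raw estimate by the matching factor (1.0 if no history yet)."""
    return raw_hours * factors.get((person, project_type), 1.0)


# Made-up history, chosen so the factors come out at 80% and 120%.
history = [
    ("Sean", "Windows Services", 10, 8),
    ("Sean", "Windows Services", 20, 16),
    ("Sean", "ASP.NET WebForms", 10, 12),
]
factors = estimation_factors(history)
print(adjusted_estimate(5, "Sean", "Windows Services", factors))
print(adjusted_estimate(5, "Sean", "ASP.NET WebForms", factors))
```

Re-run estimation_factors whenever a project closes, so the factor keeps up with people’s changing skills.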

Technique 2: Multi-estimate

Have multiple people independently give estimates for the same item. One advantage of this technique is that you get a broader base of data for Technique 1. Probably the most important advantage, however, is that it provides a level of confidence.

The examples below assume estimation factors have already been applied.

Item 1: Sean estimates 6 hours, Graeme estimates 7 hours, Craig estimates 5 hours. Standard deviation is 1. So, we can say the estimate is 6±1 hours with high confidence.

Item 2: Sean estimates 3 hours, Graeme estimates 7 hours, Craig estimates 14 hours. Standard deviation is 5.57. So, we can say the estimate is 8±6 hours with low confidence. With such wildly differing estimates they could all be way off.
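The arithmetic behind those two items can be sketched in a few lines of Python using the sample standard deviation from the standard library. The 25% cut-off between “high” and “low” confidence is a number I picked purely for illustration; choose whatever threshold suits your projects.

```python
from statistics import mean, stdev


def combine_estimates(hours, max_relative_spread=0.25):
    """Combine independent estimates (at least two) for one item.

    Returns (mean, sample standard deviation, confidence label).
    The confidence threshold is an illustrative assumption.
    """
    centre = mean(hours)
    spread = stdev(hours)
    confidence = "high" if spread / centre <= max_relative_spread else "low"
    return centre, spread, confidence


item_1 = combine_estimates([6, 7, 5])   # Sean, Graeme, Craig
item_2 = combine_estimates([3, 7, 14])  # same three people, second item
print(item_1)
print(item_2)
```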

This is a hugely important technique, and the least often applied. It offers huge benefits in accuracy as well as feeding nicely into Technique 1.

Technique 3: Constantly compare progress against the estimate

This is often done as part of standard project planning, but the real-world data is not fed back into your estimation. As you go, you can start calculating a new estimation factor which is specific to this project. You can then apply that back into your original estimates to get an updated idea of how your estimation might differ. This would mean you now have the following:

  • Original Estimates
  • Final Estimate – Calculated from Originals with factors applied
  • Committed Timelines - Hopefully somewhere north of the Final Estimate
  • “Current” Estimate – Recalculated from new estimation factors generated from progress

If your current estimate is creeping upwards towards the committed timelines, you need to raise that as a problem, before it becomes a problem. You might also find that one person’s estimates for the project seem to be more in accord with reality, and give their estimates more weight, but be careful to keep the uncertainty values in place.
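One possible way to calculate that “current” estimate, as a sketch: take the ratio of actual to estimated hours on the tasks finished so far as the project-specific factor, and apply it to whatever work remains. The task layout, the function name and the 115-hour commitment are invented for the example.

```python
def current_estimate(tasks):
    """Re-project the total from progress so far.

    tasks: list of (estimated_hours, actual_hours), with actual_hours
    set to None for tasks that are not finished yet.
    Returns (projected_total_hours, project_factor).
    """
    done = [(est, act) for est, act in tasks if act is not None]
    estimated_done = sum(est for est, _ in done)
    actual_done = sum(act for _, act in done)
    # With no finished work there is no project-specific evidence yet.
    factor = actual_done / estimated_done if estimated_done else 1.0
    remaining = sum(est for est, act in tasks if act is None)
    return actual_done + remaining * factor, factor


tasks = [(10, 12), (20, 24), (30, None), (40, None)]
committed_hours = 115  # hypothetical committed timeline

projected, factor = current_estimate(tasks)
if projected > committed_hours:
    print(f"Raise it now: projecting {projected:.0f}h against {committed_hours}h committed")
```

Here the final estimate was 100 hours, the finished tasks ran 20% over, and so the projection lands above the commitment while there is still time to do something about it.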

Technique 4: Be detailed

I get very suspicious of “2 week” items in an estimate. Sounds like a thumbsuck to me. Joel says you should break everything down to 16 hours at most. I prefer shorter even than that, but could live with 16 hours.

So why break it down? Well, it turns out that we’re really bad at estimating all the pieces that go into a big task, but pretty good at estimating small tasks. So, by forcing us to break it down into smaller items we’re required to think more about the makeup of each task. How much of a difference can this make to your timelines?

Oh, about 50%.

Oh, and make sure you include everything in your estimate, including wasted time waiting for third party vendors, holidays, sick leave, maternity leave, scope creep, project start delays, document signing delays, testing, debugging, the likelihood of having to rewrite a module or two, the lot. You can use the data you accumulate to feed back into future projects, improving their accuracy.

Other Techniques

Take those variances you got from each developer in Technique 2, plug in their historical variances, and use Monte Carlo simulation to generate probability distributions. Now, you can confidently go along and say, “we have a 90% chance of hitting X months”.
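A bare-bones sketch of such a simulation follows. It assumes each task’s duration is normally distributed around its mean estimate with the standard deviation from Technique 2, clipped at zero. That is a simplification (real task durations tend to be skewed towards overruns), and the task numbers, function names and fixed seed are only for illustration.

```python
import random


def simulate_totals(tasks, runs=10_000, seed=1):
    """Simulate total project hours many times.

    tasks: list of (mean_hours, std_dev_hours) per task.
    Assumption: normally distributed task durations, never below zero.
    Returns the simulated totals, sorted ascending.
    """
    rng = random.Random(seed)  # fixed seed so reruns give the same answer
    totals = [
        sum(max(0.0, rng.gauss(mu, sigma)) for mu, sigma in tasks)
        for _ in range(runs)
    ]
    totals.sort()
    return totals


def percentile(sorted_totals, p):
    """Hours that p% of the simulated runs finished within."""
    index = min(len(sorted_totals) - 1, int(p / 100 * len(sorted_totals)))
    return sorted_totals[index]


# The two items from Technique 2 as (mean, standard deviation).
totals = simulate_totals([(6, 1.0), (8, 5.57)])
for p in (50, 80, 90, 95, 100):
    print(f"{p}% chance of finishing within {percentile(totals, p):.1f} hours")
```

The 100% line is simply the worst run the simulation happened to produce, which is why it sits so far above the 90% figure.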

If you find a great deal of variance on a task, it likely has not been scoped well. Consider investing a little more time in nailing down the requirements, and then re-estimate. Yes, still keep the original estimation data for historical reference.

If you find that some staff are more accurate, don’t use them in preference to everyone else. They could always have a bad day after all. Rather ask them to share the techniques that they use that make them so accurate.

Estimation must be seen as a high priority item, one of the highest. It’s more important than the project plan, more important than the specifications, more important than the actual development work. How can I say this? With a badly estimated project, you can do the development work perfectly and still make a loss.

Estimates are also not improved just by padding them. Rather give management the real numbers, as accurately as you can. They can then use that as input on their decisions about whether to go for a project and how much to charge for it. If you pad your estimates too much, you could lose out on very lucrative business opportunities.

Do not get pressured into removing the error bars. Make sure you include the ± variance, or if you’re using Monte Carlo (hopefully), the percentage probabilities. The nice thing about Monte Carlo is that you can give a 100% number if pressured for it, but it’s usually way more than the 90% figure. On some projects we go with 80%, on others 90%, on a few 95%. 100% is usually not worth using, unless millions ride on the delivery date.

Any others? Share them here.

posted @ 10/12/2010 8:02 AM by Sean Hederman

Obfuscating Code

For my Reflector.Diff v2, I decided to obfuscate it for two main reasons:

  1. It made use of our high-performance Palantir.Diff library, which is based on a popular algorithm, but with some very sneaky tweaks to get the optimum speed out of it. We’d like to protect that sneakiness.
  2. I wanted to use this as a dry run for our trial and download versions of Signate that we’re working on.

Now, the nice folks at Red Gate gave me access to their {smartassembly} obfuscation tool a while ago, and I’d never gotten around to trying it out. I really like their software, especially the ANTS Performance Profiler which has consistently kicked the snot out of the performance tools in Visual Studio.

Anyway, I started up {smartassembly} and created a new project within minutes and happily obfuscated it. I quickly learned that obfuscating everything to the max doesn’t work with something like Reflector.Diff, since you can’t obfuscate it so much that Reflector can’t load it. So, note to self: scrambling the metadata can cause problems with dynamic loading. Fair enough.

So, I backed off on some of the options and obfuscated again. Worked a charm. It linked the Palantir.Diff assembly directly into the Reflector.Diff library, so that even the public types of Palantir.Diff weren’t visible. The code was obfuscated more than I’ve ever seen in a commercial product, and after a few desultory attempts to crack it, I gave up. It then took me just a short while to integrate it directly into my build process (they have an MSBuild task for this).

If there’s a single feature that I like most it’s the linking. I like to have nice assemblies in our projects with public types nicely visible to all and sundry, but I don’t want that getting out in our final product, and the linking takes care of that nicely. Not only that, it allows us to hide exactly what third party tools we’re using in our product, making it even more difficult for someone to figure out what’s cutting.

It is not cheap: at $795 it would be difficult to justify in a lot of scenarios. But if you are going to obfuscate, and you want it to be easy as pie and reliable and good, then you can’t go far wrong in giving Red Gate your money.

I must admit I’d like more control over what {smartassembly} does in the build process. Currently it just runs the project you manually create, but it’d be nice to be able to control the obfuscation settings within MSBuild/Team Builds. Other than that, no complaints so far.

posted @ 10/10/2010 10:30 AM by Sean Hederman

Latest Images