Tuesday, November 30, 2010

Will The Real Programmers Please Stand Up?

A rousing post on the RethinkDB blog.

In the interest of openness, we'll post the smoke test that makes us turn away 19 out of 20 candidates within half an hour of a phone conversation (and that's after screening the resumes). We don't ask people to code a solution to a complex algorithms problem. We don't ask them to solve tricky puzzle questions. We don't ask them to do complex pointer arithmetic or manipulation. Here is the question that the vast majority of candidates are unable to successfully solve, even in half an hour, even with a lot of nudging in the right direction:

Write a C function that reverses a singly-linked list.

That’s it. We’ve turned away people with incredibly impressive resumes (including kernel developers, compiler designers, and many a Ph.D. candidate) because they were unable to code a solution to this problem in any reasonable amount of time.
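For reference, one standard iterative solution (my own sketch, not part of the original post) looks like this:

#include <stddef.h>

struct node {
    int value;
    struct node *next;
};

/* Reverses the list in place and returns the new head.
   Iterative, O(n) time, O(1) extra space. */
struct node *reverse(struct node *head)
{
    struct node *prev = NULL;
    while (head != NULL) {
        struct node *next = head->next;  /* remember the remainder of the list */
        head->next = prev;               /* re-point the current node backwards */
        prev = head;                     /* advance */
        head = next;
    }
    return prev;                         /* prev now points to the former tail */
}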

Thursday, November 25, 2010

Using Database Snapshots For Fast Test Automation DB Restore

The textbooks basically suggest the following alternatives for restoring databases to their original state after individual automated test runs:

(1) Let each test insert its own test data on initialization and, when done, revert those changes as well as any updates the test made.
(2) Embed the whole test run in a transaction, and roll back the transaction at the end.
(3) Restore the database to its original state before each test.

Unfortunately, none of those options worked for us. (1) is not practical for complex database operations. (2) is not possible when there is no access to the database connection / transaction used by the tested component. Plus, we use a different data access layer for verifying test results (as the same data access layer could camouflage errors), so there is no way to work within the same transaction scope. Option (3) - when done conventionally (with a traditional backup/restore) - was just too slow: for several reasons, that test database is about 150 MB in size, and we have over 2000 automated tests to run.

What helped me out was SqlServer's wonderful Database Snapshot feature (available in the Developer and Enterprise Editions). Snapshots are read-only views of the database at a certain point in time. Their copy-on-write approach allows for much faster database restores, as the database keeps track of which data pages have actually changed.

So what I basically did was to create a snapshot at the beginning, then run the test on the original database, and then restore that database from the snapshot. Lightning fast!
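
In T-SQL that boils down to something like this (a sketch with made-up database and file names; the logical file name has to match the test database's data file):

-- create the snapshot once, before the test run
create database TestDb_Snapshot
on (name = TestDb_Data, filename = 'C:\Snapshots\TestDb_Snapshot.ss')
as snapshot of TestDb;

-- ... run a test against TestDb ...

-- revert TestDb to the snapshot state (requires exclusive access to TestDb)
restore database TestDb
from database_snapshot = 'TestDb_Snapshot';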

Saturday, October 16, 2010

Hibernate: Joins Without Associations

I think I have seen this topic come up dozens of times on the Hibernate discussion forums:

How can I join a table in Hibernate when there is no association in the object model?


And the usual answer is:

Use theta-style joins in HQL; it can't be done with the Criteria API.


While this is basically true, I would like to elaborate a little bit more on that.

First of all, we should distinguish between joining and join fetching in Hibernate: are we using the join in order to fetch a child collection or a single entity within the same query, or not? If we are not, chances are we don't need a join at all, and a simple Exists subquery will do, because the joined entity only appears in the from/where clauses, not in the select clause. And in a correlated subquery, we can connect pretty much everything to the main query. This has other advantages too, e.g. avoiding unwanted join duplicates and achieving better performance.
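
Using the same hypothetical FooEntity/BarEntity as in the join examples below, such a correlated Exists subquery might look roughly like this in HQL:

select foo
from FooEntity as foo
where foo.someothercol = 'foo'
  and exists (
      select bar
      from BarEntity as bar
      where bar.somecol = foo.somecol
  )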

Thinking about join fetching might also give us an idea (at least by my interpretation) of why theta-joins are required in HQL. They basically look like this:

select foo, bar
from FooEntity as foo, BarEntity as bar
where foo.someothercol = 'foo'
  and foo.somecol = bar.somecol


So we will end up with a result set containing a list, each list element being an object array of size 2, with obj[0] representing a FooEntity and obj[1] representing a BarEntity.

If we had, by contrast, an association between FooEntity and BarEntity, we could write something like:

select foo
from FooEntity foo
inner join fetch foo.Bars
where foo.someothercol = 'foo'


Looking at the query alone makes it pretty clear that there must be a distinct definition of what the association between FooEntity and BarEntity looks like. This definition, usually representing a foreign key relation, is applied on join fetches, on lazy loading and - maybe most importantly - when updating associations. That is what Hibernate associations are all about. FooEntity.Bars will always contain child rows following that association definition (either all child rows, or a subset). This is why we cannot choose some other arbitrary join criterion in order to fetch other BarEntities into FooEntity.Bars. It just does not make sense.

"But I want to use Criteria API", I can hear you say. OK, I'll come to that in a moment.

First, I should mention at this point that Hibernate also offers nice support for legacy databases in this context: associations can be defined even when they reference non-primary-key columns. Unfortunately, the NHibernate guys have not ported that feature yet.

And I want to note that the restriction on associated rows can be narrowed down even further by providing an additional "with" expression in the join clause, or within the where clause in HQL.

But I digress. So we have a solution for HQL, we know why associations are needed for join fetching, and when there is no join fetching but only joining, we might be better off with an Exists subquery anyway. So what about the Criteria API?

Fact is: joining requires an association under the Criteria API. Fact is also: there are join criteria that cannot be expressed by an association. So the solution I came up with was to use a database view as a kind of mediator. This view, let's call it FooBar, contains the join criteria in its from/where clause and selects the primary keys of the two associated tables Foo and Bar:

create view FooBar as
select Foo.Id as FooId,
       Bar.Id as BarId
from Foo
inner join Bar on Foo.somecol = Bar.somecol


The view can also contain further columns, and even a primary key (constructed on the fly), or it might be materialized for performance reasons.

It is mapped into the Hibernate object model just as any table would be, either as a 1:1 or 1:N relation (and we can then define associations both to Foo and to Bar), or as an N:M mapping table. Hibernate won't even know (or care) that it's a view. In our object model, we make those associations read-only (as we'd rather not try to update the view), simply by providing public access only to an enumerator/iterator, or to a getter but no setter in case of a single associated entity.

We can now join-fetch formerly non-associated entities under the Criteria API as well, or project some of their columns into a flat result set (AKA DTO fetching). And we are able to provide further criteria in our queries' where clauses. Works like a charm!
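
To give a rough idea, a Criteria query over the view-based association might look like this (the property names "fooBars" and "bar" are my assumptions about how the FooBar view and its associations were mapped):

import java.util.List;

import org.hibernate.Session;
import org.hibernate.criterion.Restrictions;

public class FooBarQueries {
    @SuppressWarnings("unchecked")
    public static List<FooEntity> findMatchingFoos(Session session) {
        return session.createCriteria(FooEntity.class)
                // join via the view mapping: FooEntity -> FooBar -> BarEntity
                .createAlias("fooBars", "fooBar")
                .createAlias("fooBar.bar", "bar")
                .add(Restrictions.eq("someothercol", "foo"))
                .add(Restrictions.eq("bar.somecol", "xyz"))
                .list();
    }
}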

From a design viewpoint, we should not pollute our domain model with too many of those view associations. What we can do is provide an additional table mapping for those cases, e.g. by subclassing our main entities.

With all those options at hand, there is very little we can NOT do with Joins in Hibernate. And in that case, there is always native SQL support in Hibernate as well.

Friday, October 15, 2010

HashMap Vs. TreeMap

This is one of my favorite programming interview questions: "What's the difference between Hash Tables and Binary Search Trees?" (or: "HashMap vs. TreeMap"). Let's see...


Hash Tables (HashMap) vs. Binary Search Trees (TreeMap)

Algorithm
Hash tables: Keys are mapped to values by using hash functions. A hash function transforms the key to a numeric index (usually by calculating an integer value from the key first, and then applying a "modulo array size" operation). This index identifies an array element ("bucket"), where all corresponding values reside (values which all stem from keys with equal hash value). The key must still be looked up within the bucket, but as hashing algorithms are supposed to ensure a good hash value distribution, and hence small buckets, this lookup is very fast. The bucket array may have to be resized dynamically when a certain load factor is reached.
Binary search trees: A node-based tree data structure, where every node contains one record (key only, or key and value) and has a left subtree (with smaller keys) and a right subtree (with larger keys); hence keys must be comparable. Self-balancing trees, like red-black trees, keep their height as small as possible for fast searching and inserting.

Java Implementations
Hash tables: Hashtable, HashMap, HashSet
Binary search trees: TreeMap

.NET Implementations
Hash tables: Hashtable, Dictionary, HashSet
Binary search trees: SortedDictionary, SortedList

C++ STL Implementations
Hash tables: std::unordered_map, std::unordered_set
Binary search trees: std::map

Sorted
Hash tables: No
Binary search trees: Yes

Range Search
Hash tables: No
Binary search trees: Yes

Runtime Complexity
Hash tables: Inserting O(1) (normal case), O(N) (worst case, only with a bad hash algorithm); Searching O(1) (normal case), O(N) (worst case, only with a bad hash algorithm)
Binary search trees: Inserting O(log n) (when balanced); Searching O(log n) (when balanced)

Implementation Notes
Hash tables: Equal objects must produce the same hash code; unequal objects, however, need not produce distinct hash codes. Equals() implementations must be reflexive, symmetric, transitive and consistent. .NET's ValueType.Equals() and GetHashCode() methods are based on reflection, hence slow; you should provide your own implementation in structs.
Binary search trees: CompareTo() implementations must be reflexive, symmetric, transitive and consistent.
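
A small Java snippet (my own illustration, not part of the original comparison) shows the two practical differences, sort order and range search:

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class MapComparison {
    public static void main(String[] args) {
        Map<String, Integer> hashMap = new HashMap<String, Integer>();
        TreeMap<String, Integer> treeMap = new TreeMap<String, Integer>();
        for (String key : new String[] { "pear", "apple", "orange" }) {
            hashMap.put(key, key.length());
            treeMap.put(key, key.length());
        }

        System.out.println(hashMap.keySet()); // no defined order, e.g. [orange, apple, pear]
        System.out.println(treeMap.keySet()); // sorted: [apple, orange, pear]

        // range search is only possible on the sorted map
        System.out.println(treeMap.subMap("a", "o")); // keys in ["a", "o"): {apple=5}
    }
}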

Thursday, October 14, 2010

The Story Behind Balsamiq Mockups

Peldi Guilizzoni writes on the creation of Balsamiq Mockups. They have indeed built a wonderful tool, elegant in its design, simple in its usage, yet extremely powerful in what can be achieved by applying it. Our UI folks have been using Balsamiq Mockups since summer 2008, but I just happened to read the history of its creation today. What a truly inspiring story - sometimes one has to take risks to fulfill his dream.

Saturday, October 02, 2010

The Best Programmers...

The best programmers are not marginally better than merely good ones. They are an order-of-magnitude better, measured by whatever standard: conceptual creativity, speed, ingenuity of design, or problem-solving ability.
(Randall E. Stross)

Friday, September 03, 2010

SqlServer: Favor Inline UDFs Over Table Variables

T-SQL's Table Variables surely do have their benefits, but several drawbacks as well. We used to apply them in the past within larger T-SQL batches, in order to save and reuse some smaller query results, and to avoid code duplication:

declare @MyTable table (...)

insert into @MyTable(...)
select ...

select *
from @MyTable MyTable
inner join ...
where ...


But some databases grew far beyond their originally expected size, and table variables that used to hold 100 rows now hold 10,000 rows or more, spill over into TempDB, and hence perform badly.

So I replaced them with Inline User Defined Functions (these are table-valued UDFs, which only consist of a single Select statement):

create function dbo.MyUdf(...)
returns table
as
return (
select ...
)
go


They can then be inlined like this:

select *
from dbo.MyUdf(someparam) as MyVirtualTable
inner join ...
where ...


This gives us high performance (because the UDF query is inlined, the query optimizer can do quite some magic) AND avoids code duplication. Nice!

Thursday, September 02, 2010

SqlServer Standard Edition DOES Support Indexed Views

Why on earth do nine out of ten articles on SqlServer Indexed Views state that they are only available in the Enterprise and Developer Editions? This is blatantly wrong, and it prevents developers from using a great performance tuning feature.

Standard Edition DOES support Indexed Views (and from what I have heard, but have not tried: so does Express Edition). The only difference is that one has to provide a "with (noexpand)" query hint for the Indexed View to be used in Standard Edition (I checked the execution plans to be sure). Enterprise Edition, in addition, can choose to use an Indexed View for optimization even if it does not appear in the original query.
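
In other words, on Standard Edition the query has to reference the view directly and include the hint, roughly like this (the view and column names are made up):

select OrderYear, TotalAmount
from dbo.OrderTotalsView with (noexpand)
where OrderYear = 2010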

I think this whole confusion might be caused by Microsoft's SqlServer feature matrices, which are pretty unclear on that topic, too.

Wednesday, September 01, 2010

.NET ValueType Equals()/GetHashCode() Performance

As most people are aware, the .NET Framework's built-in implementations of Equals() and GetHashCode() for ValueTypes are based on value equality (as opposed to many ReferenceTypes, where those methods are based on reference equality). The problem is that those ValueType.Equals() and GetHashCode() implementations are quite slow for user-defined structs - which is not surprising, as they must apply reflection to obtain field and property values.

That's why it's better to code your own Equals() and GetHashCode() methods in these cases. The task is pretty much straightforward - just compare those fields and properties that define equality in Equals(), and XOR their shifted values (or numeric representations of their values) in GetHashCode():

public struct MyType {
    public int X;
    public int Y;

    public override bool Equals(object obj) {
        // "as" with a nullable value type yields null for foreign types instead of throwing
        Nullable<MyType> other = obj as Nullable<MyType>;
        return other != null &&
               this.X == other.Value.X &&
               this.Y == other.Value.Y;
    }

    public override int GetHashCode() {
        // combine the fields that define equality
        return (this.X * 31) ^ this.Y;
    }
}


I was recently asked to tune some code, and replaced several nested loops with dictionary lookups - only to find out that this was not as fast as I had expected, because structs were being used as dictionary keys. I provided my own Equals() and GetHashCode() implementations, and voila - a ten-fold speedup!
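
Just to illustrate the scenario (a sketch, reusing the hypothetical MyType from above as a dictionary key):

using System;
using System.Collections.Generic;

class LookupDemo {
    static void Main() {
        // with custom Equals()/GetHashCode(), these lookups no longer go through reflection
        var lookup = new Dictionary<MyType, string>();
        lookup[new MyType { X = 1, Y = 2 }] = "some payload";

        string payload;
        if (lookup.TryGetValue(new MyType { X = 1, Y = 2 }, out payload)) {
            Console.WriteLine(payload); // prints "some payload"
        }
    }
}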

Monday, August 02, 2010

Developer Candidate Interviews

Uncle Bob / Robert Martin writes in Developer Certification WTF?:

So what if we… interviewed ... candidate developers? What if we asked them what accomplishments they’d achieved in previous employment. What if we asked them who they used to work with. And… ok, this is even wilder ... what if we called those references and checked them out?


Yeah, and what if we additionally let them write some code during the interview? I know this sounds crazy to many software development companies, with some notable exceptions (Microsoft, Google and Apple, for example - hmmm, maybe that should make us think?).

Sunday, May 02, 2010

How to improve Visual Studio .NET Build Performance

Funny how slow Visual Studio .NET compile times are often seen as being carved in stone.

"Building our solution takes 3 minutes, so each time I change a line of code in a base component and run the app, I must wait 3 minutes. Period."

Is this really how software development should work these days?

As projects grow over time, compilation gets slower and slower. Productivity goes south when people are watching the compiler do its work half the time. The decline creeps in and people think this is just normal, until a new developer arrives and throws his hands up in horror.

Also, Visual Studio seems to lag behind other IDEs (most notably Eclipse) on this topic. Eclipse has had built-in support for REAL incremental background compilation for ages now (it compiles while the developer types, or during idle time, and hence can start the application with little or no delay once you hit <Run>). Visual Studio still doesn't.

So let's see what we can do to speed up things in Visual Studio.


(1) Decrease File IO / Speed Up Remaining File IO

Use the same output path for every project in your solution, e.g. "..\Output\Debug" for "Debug" configuration (Project / Properties / Build / Output Path). All assemblies will be created there, so no need to copy files back and forth between project folders.

If there is only one output path, project references can be set to "Copy Local = False" (References / Properties / Copy Local). This furthermore decreases File IO.

I have experienced up to three-fold buildtime improvements by applying that simple configuration change.

In addition, you might want to relax virus scanner policies regarding scans on devenv.exe file access, as well as use a separate physical drive for source files and compile output, maybe even an SSD drive.


(2) Prevent non-dependent project compilation

If a solution contains projects that the current startup project does not depend upon (e.g. test projects), compiling those projects can be avoided by selecting the option "Only build startup projects and dependencies on Run" under Tools / Options / Projects and Solutions / Build and Run.


(3) Prevent project dependency tree compilation

Visual Studio, or rather MSBuild, compiles all projects with changes, plus the complete project dependency tree underneath those projects. This is all good and fine, but it also implies that changing a single line of code in a commonly used base class can trigger a complete solution recompilation. But why should we recompile everything when, say, only the interior of a single method implementation has changed?

When you know that you are going to change something deep down in your dependency tree, but won't be altering non-private method signatures, here is what you can do:

Option 3.1: Manually avoid dependent project recompile
Under Tools / Options / Projects and Solutions / Build and Run, choose "On run, when projects are out of date: Never build". Then manually build only the project where source code has been edited (Build <projectname>, or Shift-F6), and finally run the startup project. This, too, only works when all projects use the same output folder (see (1)).

Option 3.2: Additional solution configurations
Create additional solution configurations for partial builds, and apply identical output paths as defined in (1) (Solution / Properties / Configuration Properties / Configuration Manager / Active Solution Configuration / New).

Solution configurations allow for defining which projects should be built and which should not; this can lead to much smaller build dependency trees. We might provide configurations for framework class libraries, for functional modules, for data access projects, and so on. And once we know we are only going to change the inner workings of certain components, we switch to the suitable solution configuration first.

Other dependent projects are not going to be compiled if they are not part of the active configuration. What does that imply at runtime? As there are no newly built dependent assemblies, the previously built assemblies (still residing in "..\Output\Debug") are going to be loaded, which is just what we want.

Option 3.3: Decouple projects, use assembly references instead of project references
Decoupling is not only recommended for software architecture in general, it can also boost compile times. When application layers are separated (e.g. by coding against interfaces and acquiring implementation classes only at runtime, using factories, dependency injection and the like), code changes other than interface modifications will no longer trigger dependent project compilations.
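
As a minimal sketch of that idea (all names are made up): the consuming project references only the assembly containing the interface and binds to the implementation at runtime, so edits inside the implementation project do not force the consumer to recompile.

using System;

// lives in a small, rarely changing "contracts" assembly
public interface IOrderService {
    void PlaceOrder(int productId, int quantity);
}

// lives in the consuming assembly; only the interface assembly is referenced at compile time
public static class ServiceFactory {
    public static IOrderService Create() {
        // resolve the implementation at runtime (a DI container would do the same job)
        Type implType = Type.GetType("MyCompany.Orders.OrderService, MyCompany.Orders.Impl");
        return (IOrderService)Activator.CreateInstance(implType);
    }
}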

Also, choose your solution structure carefully. You don't want one project containing 80% of all code, and the rest spread across ten other projects. Or consider having Team A work on project set #1, and Team B work on project set #2, which depends on what Team A has produced. Provide separate solution files for each team, let Team A deliver libraries to Team B, and integrate them into Team B's solution by using assembly references (instead of project references).

Take those options with a grain of salt, though. Once the signatures of methods invoked by other assemblies are actually altered, the compiler will not be able to pick that up, and runtime errors will occur. In addition, these approaches require stable assembly version numbers.

Wednesday, March 31, 2010

And I shall call it "Foreign Key Cascade Delete"

I know it sounds kind of stupid to implement cascade delete in T-SQL, when all you have to do is activate the "cascade delete" option on each foreign key within the association graph. About as stupid as implementing referential integrity by hand.

But imagine a case where you generally don't want cascade delete in normal production mode (because no user should be able to delete a parent row which still has child rows referencing it). And yet there are those rare occasions when, just for the moment and just for one single parent row, cascade delete would make sense again, because you really, exceptionally, have to get rid of it.

Of course one could simply activate cascade deletes for this single transaction. But I don't like to change database schemas in production.

The guys at sqlteam.com came up with a different solution: A stored procedure, invoked recursively, traversing over all foreign key relations and deleting child rows bottom-up.

Sounded nice, but it did not work in our case. The approach was based on nested subqueries, and from about 8 nesting levels on, SqlServer refused to even calculate a query execution plan (I do have some sympathy for SqlServer here; I wouldn't enjoy building an execution plan for queries like this either, and even with a plan, performance would most likely be extremely poor). Plus, the script could not handle cycles in foreign key relations (I mean cycles on a table basis, not on a row basis; the latter can't be solved anyway), which caused a stored procedure "stack overflow" (the limit is 32 nested calls) on our database.

I then started to adapt the sqlteam.com script in order to make it work, and was about halfway done when I noticed that commenter GreySky had already provided a solution following the same approach I had planned. He solved the execution plan / performance issue by storing parent primary key values in a temp table, together with information about the current call level, so that the delete criteria would be ONE simple IN-subselect instead of many nested subselects.

So GreySky's script worked fine, with the exception that it still did not handle cycles in foreign key relations. But at this point, with a temp table holding all parent primary keys at hand, this was a simple task. I only had to add an extra column storing the table name that the primary keys belonged to. Thus it was possible to extend the statement's where-criteria to "visit" only those child rows that had not been marked for deletion before. (Nearly) like Dijkstra's algorithm, you know ;-) Works like a charm!
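
To make the idea a bit more tangible, here is a stripped-down sketch for a single, fixed Parent/Child pair (table and column names are made up; the real script walks the foreign key metadata recursively and handles arbitrary levels):

create table #ToDelete (TableName sysname, KeyValue int, CallLevel int);

-- level 0: the one parent row we really want to get rid of
insert into #ToDelete (TableName, KeyValue, CallLevel)
values ('Parent', 42, 0);

-- level 1: child rows referencing any collected parent row,
-- skipping rows already marked for deletion (the cycle guard)
insert into #ToDelete (TableName, KeyValue, CallLevel)
select 'Child', c.Id, 1
from Child c
where c.ParentId in (select KeyValue from #ToDelete where TableName = 'Parent')
  and not exists (select 1 from #ToDelete d where d.TableName = 'Child' and d.KeyValue = c.Id);

-- finally delete bottom-up: deepest call level first
delete from Child  where Id in (select KeyValue from #ToDelete where TableName = 'Child');
delete from Parent where Id in (select KeyValue from #ToDelete where TableName = 'Parent');

drop table #ToDelete;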