Developing Linq to LLBLGen Pro, part 12

Tuesday, January 29, 2008

(Updated Wednesday 30-jan-2008). It was mentioned that we would implement 'Skip' as well, although we already had a paging method added, TakePage(). After carefull analysis, we decided not to implement Skip for now. The reason is that it can lead to confusing queries, while paging is what the developer wants. We believe TakePage() serves the developer better than a combination of Skip / Take (Take is still supported separately) which won't work in a lot of cases if Skip is specified alone.

(This is part of an on-going series of articles, started here)

Cast again
The last episode in this series contained a remark about Queryable.Cast, and that it can be ignored. I've to correct myself there, this isn't entirely correct. Let's look at the example query at hand:

// LLBLGen Pro Linq example
var q = from e in metaData.Employee.Where(e => e is BoardMemberEntity).Cast<BoardMemberEntity>()
	select e.CompanyCarId;

Here, the Cast is actually irrelevant because the Where already filters on the BoardMemberEntity type. However in the following small query, it's not:

var q = from e in metaData.Employee.Cast<BoardMemberEntity>()
	select e.CompanyCarId;

Here, the Cast is actually a type filter. Well... sort of. The thing is: there's no real 'good' way to handle this, similar to 'as', which is discussed below. Imagine the situation where the employee instance is of type ManagerEntity (supertype of BoardMemberEntity), and not of type BoardMemberEntity. What should the query above do in that case? The instance 'e' in that case doesn't have a CompanyCarId field, as that's a field only available in the BoardMemberEntity type. The specification of Queryable.Cast doesn't reveal any help here: it doesn't say what should be done when the Cast can't be performed. So, I decided it emits a type filter instead. This means that the query above always succeeds if there are BoardMemberEntity instances in the database, it simply filters out any types which aren't of the specified type, in this case BoardMemberEntity.

OfType
Queryable.OfType is a bit of a weird method, in that it actually does more or less what Cast does in that it filters out any elements which aren't of the type specified. Of course, 'Cast' by definition isn't a filter, but in database queries, you have little choice here: SQL isn't imperative, at least not in the way SELECT queries work. OfType therefore is implemented similar to Cast: it emits a type filter into the query.

The 'as' keyword
I already had implemented support for 'is', which resulted also in a typefilter (if you now have the feeling there are a couple of redundant ways to specify the same thing in Linq, you're correct), but 'as' is a little different. It again is more or less a keyword which is actually not really usable in database queries. Let's look at a query which illustrates this:

var q = from e in metaData.Employee
           where (e as BoardMemberEntity).CompanyCarId==1
           select e;

This in itself looks pretty straight forward, but look closely to what it represents: what if 'e' is a ClerkEntity (not a supertype of BoardMemberEntity) ? The 'as' keyword should then result in null/Nothing, so accessing the property CompanyCarId on it should result in an exception, at least in .NET code it does. But how are you going to convert this into SQL? One could argue, and I agree with that, that the query above is pretty poor code and constructions like that should be discouraged. Though the thing is that just because it's possible, it will cause some people to write code just like that.

Fortunately, the code above is translatable to SQL in general form, if the 'as' keyword is seen as a type conversion, so 'e' is simply seen as a BoardMemberEntity and the fact that a 'null' can occur is ignored as the filter on CompanyCarId will weed out any types without that field in the query anyway (no matter what inheritance mapping strategy is used). The only difference is that if you expect an error, you won't get one. Let's look at a more nastier version of the one above which actually is more correct:

var q = from e in metaData.Employee
          let x = e as BoardMemberEntity
          where (x!=null) && (x.CompanyCarId == 1)
          select x;

Here, across two statements, a type filter is written: first the 'as' conversion in the let statement, and then the test for null on the result. You can generally write this as a handling of type specification equal/not equal null and by adding a special handler for that in the general binary expression handler, you can convert that situation to a type filter!

There's something not really great about the query above though. Linq to Sql has no problems, because it only supports single-table inheritance. If the O/R mapper supports multi-table/view inheritance, like LLBLGen Pro does, you have a bit of a problem with field names. Consider a hierarchy where the root has three branches of subtypes, and in two branches a field Foo is present, though not in the third branch. If you now fetch all entities of the root type, you will end up with multiple times the same field in the projection. The fix for that is field aliasing, namely you give every field an artificial alias, e.g. Fx_n, where x is some index to specify the entity and n is the index of the field in the entity.

The problem begins when 'let' entered the room: a 'let' is effectively a wrapping SELECT statement. (You can rewrite some queries by moving the query to other parts of the query but that's not always possible). The query wraps the original query in its FROM clause as a Derived Table and simply selects from that source, making the projection required for let. Now, imagine you have your fields aliased with artificial aliases inside that source. Your Derived Table fields now have names like F0_1, F3_4 etc., but not 'CompanyCarId'. This means that the filter on CompanyCarId, which is outside the Derived Table, as let wrapped what's before 'let' in the query, will fail if it targets 'CompanyCarId'. Normally it would work, as the JOIN statements which joined every subtype's Table/View would be accessable for the WHERE clause, but in this case that's not possible because the joins are inside the Derived Table.

As elements are processed in different areas of the Linq provider (you're not going to write a big routine which handles every situation), the handler for Where has no clue if the field should be re-aliased, at least not at that point. So I wrote a traverser, a class which crawls over all LLBLGen Pro query api objects and finds the elements sought. Basicly it's a base class which simply visits all object elements and if you in a derived class override a given method you can tap into this process and do what you need to do, e.g. collect Derived Table definitions, fields targeting derived tables etc. I wrote a couple of derived classes which collected information for me this way and with that I could easily fix every field alias and target in the entire query, so CompanyCarId references in filters outside the Derived Table would then be fixed to target the Fx_n field which represented CompanyCarId. The crawlers are also handy to find places where inheritance relations have to be injected into the query, for example because the query is folded into a subquery inside a Derived Table. The classes will be public in the upcoming runtime library of LLBLGen Pro v2.6 so our customers can use them as well for their own fine-grained query voodoo magic .

Except and Intersect
Except and Intersect are two methods which are actually each other complement: you can implement both with a single construct and just a flag to add 'NOT' to the query fragment. Except and Intersect can be implemented as an EXISTS query construct, similar to the work done for Contains: a lot of code can be re-used for these two methods.

Except and Intersect can have an IEnumerable as argument. Although Linq to Sql doesn't support it (I don't know why) it's perfectly doable, and in Linq to LLBLGen Pro you can pass an in-memory set of elements to Except or Intersect. What's particularly weird with Linq to Sql not supporting it is that the code to support in-memory sets as argument for these two methods is also usable for Contains. Oh well... .

There's some pitfall to be noticed with these two methods however. Consider this query:

// won't work
var q = from c in metaData.Customers
          where someCollection.Except(c.Orders).First().EmployeeId==3
          select c;

This query might look like a bit far-fetched, but the general idea is this: it's not possible to use an in-memory collection in a database-targeting query where Except is called on that in-memory collection with a database set as argument. Do you see why? It took me 3 days to realize this, so don't worry if you don't see it right away. The thing is: Except filters the set it is called on. But that set isn't in the database, you've to pass it to the database in the query. With PK fields, that might be possible, but not with complete entities, that's undoable. With Intersect it could be done though, however I haven't implemented that for now, as it's easy to work around it (swap the operands of Intersect ).

Single
Now here's some method I have no idea why it is in the API: Queryable.Single. Let me first quote the specification of the method:

Returns the only element of a sequence, and throws an exception if there is not exactly one element in the sequence.

There's an overload which accepts a predicate and which simply means that Single(predicate) will return the only element in the sequence which matches this predicate. What I find odd is the remark about throwing an exception: why would anyone call such a method? There is a rule about exceptions: "Don't ever use exceptions for control flow in your application". Exceptions aren't expressions for if/else constructs, they're serious business: they mean something was definitely wrong and needs handling to avoid a total crash. In database scenario's it's even weirder: which exception should you expect? And more importantly: what does it mean? If you don't know what the true meaning of an exception is, you'll never be able to handle it.

The example in the MSDN to illustrate Queryable.Single() is pretty bad actually, because it uses exceptions as control flow. Not only that, the example fails to illustrate a valid case where the exception would be something you would want. This is important, because... I can't think of a use case for Queryable.Single() where you would want the exception. After all, exceptions aren't meant for control flow, so I don't expect an exception to drive my code as if an if-statement resolved to false: it's not meant to be a test if a set contains more than one element.

The thing with Single is that it's redundant, at least for database queries. You can also use First(). First() also returns a single element, but it doesn't throw an exception when there are more than one element in the sequence. With database queries, using Single() has the same behavior as First(). Sure, I could add code which flags the resulting QueryExpression object that the result should be checked if it has more than one element and if so, an exception should be thrown, as Linq to Sql does with this query:

// Linq to Sql code
var q = (from c in nw.Customers select c).Single();

However, do I get the same exception that there are more than one element in the sequence with this one:

// Linq to Sql code
var q = from c in nw.Customers
          where c.Orders.Single().EmployeeId==3
          select c;

No. With this query I get a hardcore severity level 16 error from SqlServer that the query is wrong because the subquery returns more than one element which can't be used with the operator (=) specified. This isn't the fault of Linq to Sql, what else can it do? Call RAISERROR (someone at Sybase still feels ashamed about that typo I bet ) with a Count check? Why? Would that help the caller of the query? I'm not convinced it necessary will help. The 'Single' method is simply not useful in database queries: for the Single() overload, use First(), for the Single(predicate) overload use Where(predicate).First(). Though, as the requirement of the method is that it should throw an (not specified which one) exception, these usable synonym statements aren't completely covering what Single() represents. Though in my opinion, the exception requirement is a big mistake: if you need behavior to be called when a set has more than one element, you should test on that specification and call the behavior if the test succeeds, not by issuing a query which in the end fails with whatever error takes place, and count on that to handle things further.

I'll add support for it, though under protest. The reason I'll add support for it is to be compatible with queries which target other O/R mapper frameworks.

Linq to Sql issue(s)
During development of the Linq to LLBLGen Pro provider I often check what Linq to Sql does with a given query I use for testing and to see if for example my SQL is more efficient or falls flat on its face. Sometimes you run into unexpected things. When I tried an Except or Intersect query, I saw that 'DISTINCT' was emitted into the query. Everyone who knows SQL knows that DISTINCT is a keyword you have to be careful with, not all databases support every type of field with DISTINCT. In SqlServer for example, (n)text and image fields aren't supported in a DISTINCT projection. Not sure if this was a small glitch or that there was logic which would prevent DISTINCT in queries where it's not allowed I tried:

// Linq to Sql, gives crash with NotSupportedException. 
var q = (from e in nw.Employees select e).Distinct();

It too emitted DISTINCT into the query, which was caught by its validation checker. However what's worse: Except and Intersect therefore also aren't usable with Linq to Sql and any image/(n)text containing entity type: DISTINCT is always emitted into the query. This query for example gives the same exception, though no DISTINCT was specified in the Linq query:

var q = from e in nw.Employees.Except(
                  from e in nw.Employees where e.Country == "USA" select e)
           select e;

One could argue: "But if DISTINCT isn't possible, how to weed out the duplicates?". Well, you do that on the client in the routine which consumes the datareader and constructs the objects to return. You keep hashtables with hashes calculated from identifying attributes like PK fields and with that you filter out duplicates. At least with entity fetches like this one. Not supporting situations where DISTINCT can't be emitted into a SQL query is a typical error one could make in a v1 O/R mapper, it's only a bit sad for Linq to Sql users that Microsoft is so generous with releasing fixes for their framework on a regular basis.

Writing a Linq provider is a lot about true software engineering, I've written about that several times before: some things are known but the biggest part is unknown territory: what is constructed in which order, when to expect what, is it safe to ignore this or that? Unclear things one can only find out when it is used, i.e. by trial/error approaches. This is actually a bit sad, because it's now easy to overlook mistakes or miss corner cases, as the bigger picture isn't always clear. For example: what does VB.NET emit into the expression tree for string concatenations if C# emits 'Add' operations ?

With the DISTINCT keyword popping up in the Linq to Sql query for Except I was immediately alarmed: why is it there? Linq to Sql never emits DISTINCT when it's not told to do so. You then start thinking about it: is it something which is a left-over from their tree manipulations? Or is it hardcoded set to be emitted? It turns out it is. In QueryConverter.VisitExcept and QueryConverter.VisitIntersect it sets 'IsDistinct' to true. I couldn't think of a reason to have it default to DISTINCT, so I didn't add a requirement for that to our tree for Except and Intersect, also because it uses an EXISTS query, which doesn't care if DISTINCT is there or not.

Both routines are also a clone of eachother. Clones are easily created and often overlooked, however this one is particularly obvious, especially because the behavior of the two routines is closely related, so the implementation of the handlers is then too closely related.

What's next?
Implementing 'Single', probably tomorrow, then on to implementing the database function framework I have in mind: it has to be a framework where developers can add their own function mappings to the provider so they can map their own extension methods to database functions easily. When that's done, Queryable.Convert gets another look as some scenario's require it not to be stripped off but handled instead, though that relies on the function-mapping framework, so it has to wait till then. After that, hierarchical projections and prefetch path support are on the menu (prefetch paths are more a small addition as the core functionality is already in the runtime for quite some time, hierarchical projections require the prefetch path merge code already in the framework to be opened up to the Linq provider) and then I'm done with the Linq part. Finally. Stay tuned.

With regard to Single():

I can't speak for the architects of Linq but I have written an O/R layer of my own and I've come across a need for exactly the equivalent to that method.

The point is not to use the exception for control flow - you won't be writing

try {
x = whatever.Single();
} catch {
x = something else;
}

The idea is that a lot of the time you KNOW that your query should return only one item, and if it returns multiple (or none) it represents a bug somewhere else.

Thus, the difference between:

x = whatever.Single();

and

x = whatever.First();

is nothing. EXCEPT if there's a bug somewhere else in your code.

If you use .First() your code will silently succeed - and potentially do the entirely wrong thing until much later when some bad consequence crops up.

If you use .Single() your code will fail the moment the multiple values are detected, giving you the opportunity to debug and find the problem sooner.

Think of it as .First() plus an "assert" and you get the purpose of it, I think.

Stuart Ballard - Tuesday, January 29, 2008 7:10:10 PM

Stuart: interesting point.

How you explain it, it makes sense. If I then re-apply this to the rest of the api, why isn't there a method which allows me to test if there are n elements? I do agree that in your case it makes sense, but it's the only method in the API which has this spec to throw an exception if it fails. What's also a problem is that if you use Single in other parts of the query, you're a bit out of luck: the query might turn into valid SQL and not throw an exception.

FransBouma - Tuesday, January 29, 2008 7:59:19 PM

As far as why there isn't a method that allows you to test if there are N elements: again, I can't speak for the designers of Linq but it seems to me it's a much rarer case, plus being less obvious how to express it in the API. query.TakeExactly(n)? I can't even think of a use case for it - Take() is generally for paging, but you don't want to throw an exception getting the last page, do you?

There's a much bigger difference, conceptually speaking, between "one" and "several", than between n and n+1.

As far as using Single in other parts of the query: I've always been somewhat unhappy with the way that Linq providers are expected to translate to SQL, so I have less insight into the architects' frame of mind on that one, because it doesn't match my own.

My own opinion is that it should be considered a provider bug if the query does not produce the *exact* same result as it would if run purely in C# over real objects - even if that results in making the provider more complicated, or requires client-side joins in more cases. If a method can't be mapped to something on the server side - I'd love if there were a way to make that a compile time warning (I recognize that in practice that's impossible) but the query should still succeed, just slowly.

I don't know the right way to handle Single - I realize there are a LOT of nuances to trying to translate expressions to SQL and I haven't thought through more than a tiny fraction of them. I recognize too that I'm probably in a minority with my extreme view on how faithful the translation should be to the C# semantics. But for what it's worth, that's how I feel about it.

Stuart Ballard - Wednesday, January 30, 2008 3:54:38 PM

I've Single now working. It turns out that if 0 elements are returned, it also has to throw an exception. Linq to Sql throws 2 different exceptions, InvalidOperation if there are 0 and NotSupportedException if there are 2 or more. Odd, but alas... it obeys the contract ;)

I agree with your point that the difference, conceptually, between 1 and n vs. n and n+1 is indeed bigger with 1 and n. If only for the fact that 2 or more require a collection to be stored in, a single entity instance is just an object.

"My own opinion is that it should be considered a provider bug if the query does not produce the *exact* same result as it would if run purely in C# over real objects - even if that results in making the provider more complicated, or requires client-side joins in more cases. If a method can't be mapped to something on the server side - I'd love if there were a way to make that a compile time warning (I recognize that in practice that's impossible) but the query should still succeed, just slowly."

That's never going to work. Typical example: grouping on a result of a lambda which returns a bool. This is possible in Linq to Objects, but not in databases. Also, reverse of a query isn't always possible:

var q = from c in metaData.Customer

       join o in metaData.Order on c.CustomerId equals o.CustomerId

       orderby o.EmployeeId ascending

       select c;

you can't reverse this set in the database. At least not with deep analysis of the sorting and reversing that, OR by inserting the complete (!) set into a temptable with an extra identity ID field and then reversing the fetch by a reverse sorting on the extra identity field. I think it can be done in oracle though: order by descending on rownum with a wrapping query...

I'm not sure if all that is so useful.

I understand your point though. It's just that Linq itself is too focussed on Linq to objects, so it has operations which have no meaning in a database query, only on a materialized sequence. The main difference is is that Sql works with sets and set operations and Linq is designed to work on sequences. A sequence is a set but a set isn't a sequence, it lacks a sequential enumeration operation, as it has no defined order.

There's a principle which is called a monad comprehension which dictates that a set is actually always interpreted as a sequence, even in set operations, so you can take advantage of that. The sad thing at the moment is however, that all we can do with an expression tree which represents a query for a database is to translate it into SQL. So it's not possible to convert it to the inner engine commands of an RDBMS (yet). So until that happens, we're not really going to see a true Linq implementation on a database engine with 100% coverage of all statements.

At the moment, VB.NET and C# even generate different expression trees. One example is string concatenation, though there are differences in handling NULL values as well. This is particularly bad, because the actual meaning of a linq query is language independent, and so is the expression tree.

The typical example of how weird it is is the '+' operator on a string. In the C# tree, it is formulated as an Add expression between two string typed operands. However, this ONLY makes sense if you assume that the language used was C#. If the language was VB.NET, it apparently has no or a different meaning, who knows, you don't know the language when you get the expression tree. I haven't tested yet what VB makes of it, but I guess it will be an Or expression between two string based types. Which has no meaning in C#.

So I in theory agree with you, however in practise it's impossible to do. I also think that in practise we should try to achieve it. We'll end up with a model similar to the original DirectX, where things which weren't doable by hardware were emulated in software. This was horrible to program for as a small detail could completely ruin your performance.

FransBouma - Wednesday, January 30, 2008 4:19:22 PM

Hi Frans

Just wanted to let you know that we are really looking forward to the LINQ provider release for LLBLGen.

We actually conducted an internal study of DLINQ versus LLBLGen and LLBLGen was hands down winner, amazingly better at DLinq for .NET data binding and UI programming. This holds true especially for use with 3rd party UI control sets. Another area where LLBLGen wins is distributed programming and I for one am not ready to part with the dependency injection mechanisms even though these can be achieved with the enterprise library.

However my general feeling is that team members are going to want to use LINQ even though they can't use DLinq. They are going to want to write Linq queries rather than LLBLGen predicate expressions. This is a matter of keeping up with the times and the market. So LLBLGen's LINQ provider is something we are really looking forward to!

Gavin Sinai - Saturday, February 16, 2008 7:16:42 AM

Thanks Gavin for the kind words! :)

We hope to release the Linq to LLBLGen Pro code as a CTP to customers in 2 weeks. We've managed to write a FULL linq provider (not a toy one like most others) and I'm sure you'll like it :) Stay tuned :)

FransBouma - Saturday, February 16, 2008 10:46:05 AM

6 Comments