Developing Linq to LLBLGen Pro, day 2
(This is part of an on-going series of articles, started here)
Adding Linq support to an O/R mapper like LLBLGen Pro is a matter of choice: either you implement new SQL engines or you convert the expression trees to native query language components. The former is a lot more work and the latter will probably cause problems here and there. We decided to create a converter and I think that's the only real option for any O/R mapper developer out there who wants to support Linq queries. It's not without problems though. During my second day of working on the project I already hit a major roadblock. I'll describe that in a bit.
In my previous posts I've explained that the source of the Linq query is the most important part, as that's where it all starts and ends. To be able to specify the source in the context of LLBLGen Pro, I generate a class which simply returns for each known entity a DataSource<T> instance, where T is the type of the entity, e.g. CustomerEntity. The Linq oriented classes like the QueryProvider etc. are placed in a separate assembly. If it's not possible to keep things separated, they will be merged into our runtime library at release (the .NET 3.5 build that is); otherwise we'll release a separate dll. With the roadblock I hit this morning, things look a bit complicated, so I think a separate assembly alone isn't going to cut it: runtime library changes will be needed.
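To make the idea of a generated source class a bit more concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the class name EntitySources, the shape of CustomerEntity and the DataSource<T> implementation are hypothetical stand-ins, not the generated code or the runtime library. This DataSource<T> only wraps an in-memory list and delegates to the LINQ to Objects provider, where a real one would hand the expression tree to a provider which converts it to native query elements.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

// Hypothetical stand-in for a generated entity class.
public class CustomerEntity
{
    public string CustomerId { get; set; }
    public string ContactName { get; set; }
}

// Hypothetical stand-in for DataSource<T>. It only wraps an in-memory
// sequence; a real source would pass the expression tree to a query
// provider that converts it to native query elements.
public class DataSource<T> : IQueryable<T>
{
    private readonly IQueryable<T> _inner;

    public DataSource(IEnumerable<T> rows)
    {
        _inner = rows.AsQueryable();
    }

    // Delegate the three IQueryable members to the wrapped queryable.
    public Type ElementType { get { return _inner.ElementType; } }
    public Expression Expression { get { return _inner.Expression; } }
    public IQueryProvider Provider { get { return _inner.Provider; } }

    public IEnumerator<T> GetEnumerator() { return _inner.GetEnumerator(); }
    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}

// Hypothetical generated class: one property per known entity type.
public class EntitySources
{
    private readonly IEnumerable<CustomerEntity> _customers;

    public EntitySources(IEnumerable<CustomerEntity> customers)
    {
        _customers = customers;
    }

    public DataSource<CustomerEntity> Customer
    {
        get { return new DataSource<CustomerEntity>(_customers); }
    }
}
```

With such a class in place, a query can start from the property, for example: from c in sources.Customer where c.CustomerId == "ALFKI" select c.ContactName. The point of the sketch is only that the source object is what ties the query to a provider; the conversion work itself is not shown.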
So which problem did I run into? It's actually based on a little Linq to Sql query posted in the C# newsgroup this morning. The query looked something like this:
// C#
var q = from c in nw.Customers
        select new
        {
            ContactName = c.ContactName,
            TotalOrders = c.Orders.Sum(o => o.Order_Details.Sum(
                od => (double?)od.Quantity * (double?)od.UnitPrice))
        };
This query retrieves, for every Northwind customer, the overall total of all that customer's orders. I looked at it for a while and wondered: is that 'Sum' executed in memory or in the database? So I added it to a little Linq to Sql project and it resulted in this query:
-- T-SQL
SELECT [t0].[ContactName], (
    SELECT SUM([t4].[value])
    FROM (
        SELECT (
            SELECT SUM([t3].[value])
            FROM (
                SELECT (CONVERT(Float,[t2].[Quantity])) * (CONVERT(Float,[t2].[UnitPrice])) AS [value], [t2].[OrderID]
                FROM [dbo].[Order Details] AS [t2]
                ) AS [t3]
            WHERE [t3].[OrderID] = [t1].[OrderID]
            ) AS [value], [t1].[CustomerID]
        FROM [dbo].[Orders] AS [t1]
        ) AS [t4]
    WHERE [t4].[CustomerID] = [t0].[CustomerID]
    ) AS [value]
FROM [dbo].[Customers] AS [t0]
Yes, I was also surprised to see that at first, but of course the nested Sum() aggregates leave little room to improve on this. However, if I had to write this in plain SQL, I'd do it like this:
-- T-SQL
SELECT c.contactname, SUM(quantity * unitprice) AS TotalOrders
FROM customers c
    INNER JOIN orders o ON c.customerid = o.customerid
    INNER JOIN [order details] od ON o.orderid = od.orderid
GROUP BY c.contactname
A completely different approach. The reason I picked this one and not the nested selects is that it's very efficient (check the execution plans if you want). So I wondered, how can I rewrite my Linq to Sql query to get the group by query over the joined set? But... I don't know! I tried everything I could think of but I couldn't get a compilable query which gave me the right results!
The thing is that LLBLGen Pro doesn't support derived tables in query specifications (the FROM ( SELECT ... ) constructs), because it determines the FROM clause of a select from the elements passed to the fetch method. This is easier for the developer using the API and it leads to fewer mistakes. For example, the query with the group by posted earlier can be written in LLBLGen Pro (using Adapter) as:
// C#
ResultsetFields fields = new ResultsetFields(2);
fields.DefineField(CustomerFields.ContactName, 0);
fields.DefineField(new EntityField2("Total",
    (OrderDetailsFields.Quantity * OrderDetailsFields.UnitPrice),
    AggregateFunction.Sum), 1);
RelationPredicateBucket filter = new RelationPredicateBucket();
filter.Relations.Add(CustomerEntity.Relations.OrderEntityUsingCustomerId);
filter.Relations.Add(OrderEntity.Relations.OrderDetailsEntityUsingOrderId);
GroupByCollection groupBy = new GroupByCollection();
groupBy.Add(fields[0]);
DataTable table = new DataTable();
using(DataAccessAdapter adapter = new DataAccessAdapter())
{
    adapter.FetchTypedList(fields, table, filter, 0, null, true, groupBy);
}
As it fetches the set into a datatable, it's of course not really 'typed', but for the rest it gets the data out as expected, with the query I expected (I can also fetch it as a datareader and project it onto a class if I want to; the query given is just for illustrative purposes, so I used a datatable). The result of this is that I have a Linq query I can't convert to LLBLGen Pro constructs and I have a query which is specifiable in LLBLGen Pro constructs but which I can't formulate in Linq. And it's not even Monday!
Clash of the Paradigms
So I wondered... why is it so hard to write the SQL query in Linq (or why am I so stupid not to understand the C# 3.0 spec)? And then it hit me: Linq isn't set oriented, but SQL is. At least, that's my conclusion. The group by approach is logical from a math / set oriented viewpoint, but it's an odd approach if you look at it from an imperative / functional viewpoint. In a piece of code which is executed in an imperative / functional way, you want to specify what has to be done, and at the end of your set of statements you arrive at your result.
I have no idea if my Group By SQL query is even possible with Linq. Possible or not, the conclusion can only be: developers using the Linq extensions to C# and VB.NET will think in an imperative way. They don't bother with the fact that they're now supposed to switch their minds into Mode.SetOriented (poke &H6EF8, &H0E to be exact); they just want to write the query as if they were writing it in C#, as that's what they're actually doing. For the people who think Linq will therefore have a small learning curve: I'm afraid I have to disappoint you. The same problems arise as with every other O/R mapper query language: 'how do I formulate this [insert complex SQL written by a 70 year old Oracle DBA here] SQL statement in [insert O/R mapper query language constructs here]?'. That won't go away, simply because there's no 1:1 mapping between Linq and SQL.
So the problem then arises for me: what to do? It's not as if the query presented is one only some bloke in South East Alaska would run once a year. However, it's also not the end of the world. The developer won't be able to avoid all O/R mapper constructs anyway: there will be O/R mapper specific language elements in the final Linq query or around it, and all CUD (Create, Update, Delete) operations aren't even mentioned. So if a given required set of data isn't specifiable in Linq, it's likely the native O/R mapper query language constructs will offer a way to obtain the data in that form using that particular query. So the option to not support given Linq constructs isn't as big a deal as it seems.
However, the one true reason we're doing this Linq implementation in the first place is marketing and strategy: to be able to provide an upgrade path for users of Linq to Sql towards our framework, and to be able to offer people who have enjoyed Linq courses or have experience with Linq a way to leverage that experience on our framework. It doesn't add any functionality: the framework already supports almost every way you'd want to fetch data, in whatever wacky format you can think of. So the only really satisfying solution is to solve the problem of the derived tables and be able to support as many Linq queries as possible.
My initial research shows that of our supported databases only Firebird 1.5 doesn't support derived tables (even MySql does, who would have thought!). I can live with that, as there's a v2.0 for Firebird which does support derived tables. The only thing I have to solve is: how do I add this to our API elegantly and also without breaking a lot of code already out there (and of course also in such a way that the amount of code to change is minimal)? This is a root aspect of maintainable software, and it's one none of us can avoid sooner or later. So my first stop is to alter our runtime library API in such a way that the derived table specification is possible, so I can formulate the Linq constructs in our own query language elements without much conversion. It's of course (in theory) also possible to transform the group of nested selects into a join set with group by; after all they lead to the same result, and the nested selects actually represent a join + group by. However, my head already starts to hurt when I think about that, so I leave that to the set-theory junkies over at MIT.
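The claim that the nested selects and the join + group by lead to the same result can be checked on a small scale. The following is a minimal sketch, assuming hypothetical in-memory stand-ins for the three Northwind tables (the Customer, Order and OrderDetail classes and the OrderTotals helper are made up for this illustration). It runs on LINQ to Objects only, so it says nothing about which SQL a provider like Linq to Sql would emit for either shape, and nothing about converting one shape into the other.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory stand-ins for the Northwind tables used above.
public class Customer
{
    public string CustomerID { get; set; }
    public string ContactName { get; set; }
}

public class Order
{
    public int OrderID { get; set; }
    public string CustomerID { get; set; }
}

public class OrderDetail
{
    public int OrderID { get; set; }
    public int Quantity { get; set; }
    public decimal UnitPrice { get; set; }
}

public static class OrderTotals
{
    // Nested-aggregate shape: a sum per order, summed again per customer.
    // Assumes contact names are unique, as they are used as dictionary keys.
    public static Dictionary<string, double> Nested(
        IEnumerable<Customer> customers,
        IEnumerable<Order> orders,
        IEnumerable<OrderDetail> details)
    {
        return customers.ToDictionary(
            c => c.ContactName,
            c => orders.Where(o => o.CustomerID == c.CustomerID)
                       .Sum(o => details.Where(d => d.OrderID == o.OrderID)
                                        .Sum(d => (double)d.Quantity * (double)d.UnitPrice)));
    }

    // Set-oriented shape: join the three sets, group on the contact name
    // and aggregate once per group.
    public static Dictionary<string, double> JoinGroupBy(
        IEnumerable<Customer> customers,
        IEnumerable<Order> orders,
        IEnumerable<OrderDetail> details)
    {
        var q = from c in customers
                join o in orders on c.CustomerID equals o.CustomerID
                join d in details on o.OrderID equals d.OrderID
                group (double)d.Quantity * (double)d.UnitPrice by c.ContactName into g
                select new { ContactName = g.Key, Total = g.Sum() };

        return q.ToDictionary(x => x.ContactName, x => x.Total);
    }
}
```

Two differences are worth keeping in mind when comparing the shapes. The inner joins drop customers without any orders, while the nested version keeps them with a total of 0 in memory. Grouping on the contact name also merges customers who happen to share a name. For customers with orders and unique names, both methods return the same totals.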