Saturday, May 20, 2017

Microsoft 70-487: Choose data access technologies

Exam Objectives

Choose a technology (ADO.NET, Entity Framework, WCF Data Services, Azure storage) based on application requirements



Quick Overview of Training Materials

[MSDN] Summary of Data Access Guidance
[StackOverflow] Entity Framework VS LINQ to SQL VS ADO.NET
[StackOverflow] Are there good reasons not to use an ORM?
[StackExchange] Is Entity Framework Suitable For High-Traffic Websites?
[StackOverflow] When to use WCF data services and when not?
[StackOverflow] Why would you not use WCF Data Services for querying data?
[StackOverflow] What is the need of OData when I have JSON?
[MSDN] WCF Data Services 4.5
Azure Storage Service landing page
Introduction to Microsoft Azure Storage
Azure Storage 5 Minute Overview (video)
ASP.NET Data Access - Recommended Resources
[Channel9] All Data/All Day Dive into .NET Data Access
[Channel9] OData Overview (2015)
Why Entity Framework Can Be Your Best Friend
[MSDN] Storage queues vs Service Bus queues

Pluralsight Courses:
ADO.NET Fundamentals, Introduction to OData, Anything by Julie Lerman


In order to get a handle on which data access technologies are appropriate in which scenarios, the first task is to understand what each of these technologies does and the patterns by which they are implemented.  In many cases, these technologies are not mutually exclusive and should, in fact, be used together.



ADO.NET


ADO.NET (I've seen it called "ADO.NET Core") is the oldest of the data access technologies considered here.  It is a low-level abstraction over the data source, with classes representing connections, commands, and datasets (essentially in-memory representations of database tables).  Using ADO.NET requires writing SQL queries by hand, either as stored procedures ("sprocs") that live in the database, or as "in-line" SQL strings in the code.
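
To make that concrete, here is a rough sketch of what "classic" ADO.NET code looks like: open a connection, issue a hand-written, parameterized SQL command, and walk the results with a DataReader.  The connection string and the Orders table are just placeholders.

    using System;
    using System.Data;
    using System.Data.SqlClient;

    class AdoNetSketch
    {
        // Hypothetical example: read recent orders with hand-written SQL.
        static void PrintRecentOrders(string connectionString)
        {
            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand(
                "SELECT OrderId, CustomerName, Total FROM Orders WHERE OrderDate > @since",
                connection))
            {
                command.Parameters.Add("@since", SqlDbType.DateTime).Value = DateTime.UtcNow.AddDays(-7);
                connection.Open();

                using (var reader = command.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        // Mapping columns to values (and eventually to objects) is on you.
                        Console.WriteLine($"{reader.GetInt32(0)}  {reader.GetString(1)}  {reader.GetDecimal(2)}");
                    }
                }
            }
        }
    }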

ADO.NET gives you much finer-grained, lower-level control than an ORM-based solution.  In scenarios where your application requires that kind of control over database access, it may be the right choice.  Though to be honest, I've had a hard time finding scenarios where this applies (not saying they don't exist, they just seem to be uncommon).  EF supports the use of views and stored procedures, so that by itself shouldn't be a factor.  This post on Coderwall seems to suggest that EF has problems with transactions, which would be one of the "low level control" kinds of topics that might only apply in a very limited number of situations.
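
For what it's worth, EF6 does give you explicit transaction control through the DbContext, so even that gap has narrowed.  A rough sketch (the context and entities here are made up):

    using System.Data.Entity;

    public class Order { public int Id { get; set; } public string CustomerName { get; set; } public decimal Total { get; set; } }
    public class AuditEntry { public int Id { get; set; } public string Message { get; set; } }

    public class OrderContext : DbContext
    {
        public DbSet<Order> Orders { get; set; }
        public DbSet<AuditEntry> AuditEntries { get; set; }
    }

    class TransactionSketch
    {
        static void CreateOrderWithAudit()
        {
            using (var context = new OrderContext())
            using (var transaction = context.Database.BeginTransaction())
            {
                try
                {
                    context.Orders.Add(new Order { CustomerName = "Contoso", Total = 42m });
                    context.SaveChanges();

                    context.AuditEntries.Add(new AuditEntry { Message = "Order created" });
                    context.SaveChanges();

                    transaction.Commit();   // both inserts succeed or fail together
                }
                catch
                {
                    transaction.Rollback();
                    throw;
                }
            }
        }
    }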

The other big advantage with ADO.NET is performance.  I've seen performance test numbers (generally in the context of looking at Dapper) that suggest ADO.NET is substantially faster than Entity Framework.


The drawback to using ADO.NET directly in lieu of an ORM is that you are on the hook to write all the plumbing.  EF gives you a lot for free in this regard.  One drawback of Entity Framework, though, is that because it is such a high-level abstraction, it can be difficult for developers to wrap their heads around what it's doing and to debug problems when they arise.  I'm not saying "developer ignorance" should be a major technology selection factor, but it is a reality in some environments.  Because ADO.NET is a lower-level abstraction, it is easier to see how all the parts fit together, and it may make sense on a given application to use a technology that is simpler rather than pay for the learning curve.  Again, a short-sighted strategy, but in a budget-constrained environment...

While it may be short-sighted to avoid learning EF on new projects, on legacy applications with a lot of existing ADO.NET code, maintaining a consistent pattern may genuinely make sense.  Some of the materials advocate for incrementally introducing EF into existing projects; I can't say whether this is always good or always bad.  In tiered applications where the business logic is separated from the data access code by a level of abstraction, this is probably a much more viable strategy than in cases where there is tighter coupling.

One quick word on the Exam Ref.  The author proclaims that he's going to keep everything "high level", then goes on to include unit tests comparing the performance of a DataAdapter and a DataReader, noting they differ by 1ms... I'm already seeing why the reviews were so poor.  The authors spend a lot of time talking about the architecture of ADO.NET, but I think I'll leave that to the implementation post.

I will share one example application I worked on where using an ADO.NET data provider made sense.  The application was a simple utility intended to upload the contents of an Excel spreadsheet into a database.  That was it.  I simply connected to the spreadsheet using an OleDbConnection (a subclass of DbConnection), then used the SqlBulkCopy class to push the data to the SQL Server database.  The core code is about 80 lines and could probably be made even cleaner if I'd had more than a few hours to put the app together.
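
The core of it looked roughly like this (condensed from memory; the ACE OLE DB provider, the sheet name, and the destination table are assumptions):

    using System.Data.OleDb;
    using System.Data.SqlClient;

    class SpreadsheetUploader
    {
        static void Upload(string excelPath, string sqlConnectionString)
        {
            // Connect to the spreadsheet itself as if it were a database.
            string excelConnectionString =
                "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" + excelPath +
                ";Extended Properties=\"Excel 12.0 Xml;HDR=YES\"";

            using (var excelConnection = new OleDbConnection(excelConnectionString))
            using (var command = new OleDbCommand("SELECT * FROM [Sheet1$]", excelConnection))
            {
                excelConnection.Open();

                using (var reader = command.ExecuteReader())
                using (var bulkCopy = new SqlBulkCopy(sqlConnectionString))
                {
                    bulkCopy.DestinationTableName = "dbo.ImportedRows";
                    bulkCopy.WriteToServer(reader);   // streams rows straight from Excel into SQL Server
                }
            }
        }
    }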



Entity Framework



As early as 2010, Microsoft was recommending Entity Framework as the default data access technology for new projects (based on the 2010 TechEd talks).  For some reason, the Exam Ref spends a lot of time on the history and mechanics of EF.  I think the workings of EF are better addressed in the topics specifically geared to that (query using EF, query using the EF data provider, query using LINQ to Entities, building an EF data model), and the history storytelling felt like fluff.  I haven't been too impressed with the Exam Ref, so I'm not surprised. (I wrote this section like six months after writing the bit about ADO.NET, so it amused me to reread my take on the Exam Ref and reach the same conclusion...)

I think the two biggest advantages that come with Entity Framework compared to hand-rolled ADO.NET are increased developer productivity and abstracting away the low-level data access code, allowing developers to focus on a higher level of abstraction.  The 5 reasons article hits on these at a more granular level, but it still boils down to these two ideas.

Entity Framework makes programmers more productive both when they are writing the code and later when they have to come back to it.  The code generation and object mapping facilities in EF save developers from having to write myriad boilerplate CRUD queries in raw SQL and then manually parse domain objects out of DataReaders in ADO.NET.  Instead, they can compose their data models from the objects relevant to their domain and let EF handle the persistence.  Or, if the database is already there, EF gives them the flexibility to leverage the existing database structure, views, sprocs, etc.
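
As a (hypothetical) illustration of how little plumbing that leaves you with, a context class and an entity class are basically all you need before you're writing LINQ instead of SQL strings and DataReader parsing:

    using System.Data.Entity;
    using System.Linq;

    public class Post
    {
        public int Id { get; set; }
        public string Title { get; set; }
        public bool IsPublished { get; set; }
    }

    public class BlogContext : DbContext
    {
        public DbSet<Post> Posts { get; set; }
    }

    class ProductivitySketch
    {
        static void Demo()
        {
            using (var db = new BlogContext())
            {
                // Insert: no INSERT statement, no parameter plumbing.
                db.Posts.Add(new Post { Title = "Choosing a data access technology" });
                db.SaveChanges();

                // Query: EF generates the SQL and materializes the objects.
                var published = db.Posts
                    .Where(p => p.IsPublished)
                    .OrderBy(p => p.Title)
                    .ToList();
            }
        }
    }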

Entity Framework also works efficiently with patterns such as Repository and Unit of Work, because of the way the data provider constructs queries from the IQueryable expression tree.  While ADO.NET can certainly be abstracted behind a repository that provides an IEnumerable or IQueryable list of domain objects, all the optimization must happen within the hand-written SQL.  At best, interesting queries can be parameterized; with Entity Framework, robust, efficient queries can be composed using the readable, fluent syntax of LINQ.
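
A sketch of that repository idea (the interface and the usage are illustrative only): because the repository hands back an IQueryable, the Where/OrderBy/Take that callers tack on gets folded into the single SQL statement EF sends to the database, rather than being filtered in memory.

    using System.Data.Entity;
    using System.Linq;

    public interface IRepository<T> where T : class
    {
        IQueryable<T> Query();
    }

    public class EfRepository<T> : IRepository<T> where T : class
    {
        private readonly DbContext _context;

        public EfRepository(DbContext context) { _context = context; }

        // Callers compose LINQ on top of this; EF translates the whole expression tree to SQL.
        public IQueryable<T> Query() => _context.Set<T>();
    }

    // Hypothetical usage: only the top 10 matching rows ever leave the database.
    // var topCustomers = customerRepository.Query()
    //     .Where(c => c.Region == "West")
    //     .OrderByDescending(c => c.TotalSales)
    //     .Take(10)
    //     .ToList();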

Entity Framework maps well to the ideas in Domain-Driven Design, which is all about focusing on the business problem.  Julie Lerman gave a talk at TechEd 2015 titled Domain-Driven Design for the Database Driven Mind that goes deep into the benefits and challenges of using EF in DDD.  She says that in many cases, EF can act as the "anti-corruption" or "mapping" layer between the bounded context and the persistence layer.  Entity Framework allows you to map DbContexts to different schemas in the database (or even to different databases), so each bounded context can have its own isolated data store.  She goes much deeper in her DDD Fundamentals Pluralsight course, and also does another EF+DDD TechEd talk (Entity Framework Model Partitioning in Domain-Driven Design Bounded Contexts).
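
A rough sketch of that partitioning (the contexts, entities, and schema names are made up): each bounded context gets its own slim DbContext, pointed at its own schema.

    using System.Data.Entity;

    public class Invoice { public int Id { get; set; } }
    public class Shipment { public int Id { get; set; } }

    public class BillingContext : DbContext
    {
        public DbSet<Invoice> Invoices { get; set; }

        protected override void OnModelCreating(DbModelBuilder modelBuilder)
        {
            modelBuilder.HasDefaultSchema("billing");   // this context's tables live in their own schema
        }
    }

    public class ShippingContext : DbContext
    {
        public DbSet<Shipment> Shipments { get; set; }

        protected override void OnModelCreating(DbModelBuilder modelBuilder)
        {
            modelBuilder.HasDefaultSchema("shipping");
        }
    }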

There are many good tools for Entity Framework, both built into Visual Studio and as add-ons, such as Entity Framework Power Tools.  There are also pitfalls to avoid when using Entity Framework.  Performance is a popular reason cited for not using it, but there are ways to improve EF's performance (usually by not making bone-headed mistakes).  Ben Emmett's article Entity Framework Performance and What You Can Do About It has some excellent tips for addressing performance issues in EF.
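
Two of the usual "don't shoot yourself in the foot" fixes, sketched against the hypothetical BlogContext from earlier (and assuming Post also has a Comments navigation property): turn off change tracking for read-only queries, and eager-load related data instead of triggering an N+1 cascade of lazy loads.

    using System.Data.Entity;   // brings in the Include() and AsNoTracking() extensions
    using System.Linq;

    class PerformanceSketch
    {
        static void Demo()
        {
            using (var db = new BlogContext())
            {
                // Read-only query: AsNoTracking skips the change-tracking overhead.
                var titles = db.Posts.AsNoTracking()
                    .Where(p => p.IsPublished)
                    .Select(p => p.Title)
                    .ToList();

                // Eager loading: one joined query instead of one query per post's comments.
                var postsWithComments = db.Posts
                    .Include(p => p.Comments)
                    .ToList();
            }
        }
    }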



WCF Data Services


While Entity Framework and ADO.NET are more or less mutually exclusive (you generally do one or the other), WCF Data Services is meant to be used with another data provider (commonly, but not necessarily, Entity Framework).  The purpose of WCF Data Services is to make data queryable in a RESTful way, returning data as JSON, Atom, or XML.  By following the OData standard, data exposed through WCF Data Services can be queried in a consistent and predictable way.
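
Exposing an existing EF model this way takes surprisingly little code.  A minimal sketch (BloggingEntities is a hypothetical EF context; the service itself is hosted behind a .svc endpoint):

    using System.Data.Services;
    using System.Data.Services.Common;

    // Exposes the EF model's entity sets as a read-only OData feed.
    public class BloggingDataService : DataService<BloggingEntities>
    {
        public static void InitializeService(DataServiceConfiguration config)
        {
            // Read access to every entity set; no write rights are granted.
            config.SetEntitySetAccessRule("*", EntitySetRights.AllRead);
            config.DataServiceBehavior.MaxProtocolVersion = DataServiceProtocolVersion.V3;
        }
    }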

So what, right?  Well, one thing that is cool about offering up your data feed like this is that consumers can use your data right out of the box.  One example is Excel, which can pull an OData feed straight into a workbook through its "From OData Data Feed" import option.


I did find a blog post by Ben Morris about Netflix dropping support for their OData feed, and he makes some interesting points about the weaknesses of the approach.  He argues that an API should have a specific purpose, whereas the OData approach is much like exposing a generic repository of all your internal data implementation details.

Searching for "OData feed" you find the highest ranked results are all sample feeds.  I did find a page for the City of Scottsdale, AZ data feeds, which are super simple data models for public data. Looking at the "Producers" and "Live Services" pages of the odata.org page, it's clear that OData is still an ecosystem closely tied to Microsoft (though there are a few public entities in there as well).

OData is well suited to situations in which you want to make your data open and discoverable.  While Ben Morris compares the metadata functionality to a WSDL-like interface (he wasn't being complimentary), it does provide an explicit description of the data and how to navigate the data set.  Lack of discoverability is one of the main drawbacks of REST-style APIs, something OData is trying to address.
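
In practice, "discoverable" means any HTTP client can pull the $metadata document and then shape queries with the standard query options.  A sketch (the feed address is a placeholder):

    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    class ODataClientSketch
    {
        static async Task Main()
        {
            var baseUri = "https://example.com/odata/";   // placeholder for any OData feed

            using (var http = new HttpClient())
            {
                // Self-description: every OData service publishes its model here.
                string metadata = await http.GetStringAsync(baseUri + "$metadata");

                // Standard query conventions that any OData-aware client understands.
                string cheapProducts = await http.GetStringAsync(
                    baseUri + "Products?$filter=Price lt 20&$select=Name,Price&$orderby=Price&$top=10");

                Console.WriteLine($"Metadata: {metadata.Length} bytes");
                Console.WriteLine(cheapProducts);
            }
        }
    }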

Mike Pizzo's 2015 talk at Open Specification Plugfest really covered the motivations for OData.  He gave a definition of OData as this:
A set of conventions that encapsulate best practices and foster interoperability for any RESTful service
He highlights a much more diverse ecosystem than the Google search seems to imply, part of which could be because many producers exposing their data through OData don't necessarily make a big deal about the OData bit.  The Office 365 APIs are actually built on OData, but Microsoft doesn't really make a big deal of it on the documentation pages.



Azure Storage


Scenarios that require high availability, scalability, and durability are increasingly being solved by cloud storage solutions. Azure Storage offers several services to meet different needs: blobs, tables, queues, disks, and files.  All of these storage types live within an Azure Storage account, which basically acts as a namespace.  A single subscription can have up to 200 storage accounts, each with a capacity limit of 500TB.  Overall the limits on storage accounts are pretty generous, and you only pay for what you use.

The first, and arguably most basic, of these services is Azure Blob storage.  Blob storage allows you to upload very large text and binary files that can be exposed for public consumption or used internally by your application.  Blob storage is unique among the Azure storage services in that it is the only one that allows anonymous public access through a URL.  Blob storage is well suited to serving images and videos, storing backups, or hosting large datasets for analysis.

Blobs come in three varieties: block, append, and page.  Block blobs are well suited for documents and media files; a single block blob can contain up to 50,000 blocks of up to 100MB each.  Append blobs are similar to block blobs, but optimized for append operations, making them ideal for logs.  Page blobs can be much larger (up to 1TB) and are optimized for frequent IO operations (Azure VMs use these blobs as OS and data disks).  (Full disclosure... I'm pulling most of this straight off the documentation.)
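
A minimal block blob upload with the classic WindowsAzure.Storage SDK looks something like this (the connection string, container name, and file path are placeholders):

    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Blob;

    class BlobSketch
    {
        static void UploadBackup(string storageConnectionString)
        {
            CloudStorageAccount account = CloudStorageAccount.Parse(storageConnectionString);
            CloudBlobClient blobClient = account.CreateCloudBlobClient();

            CloudBlobContainer container = blobClient.GetContainerReference("backups");
            container.CreateIfNotExists();   // no-op if the container already exists

            CloudBlockBlob blob = container.GetBlockBlobReference("db-2017-05-20.bak");
            blob.UploadFromFile(@"C:\backups\db-2017-05-20.bak");
        }
    }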

Table storage is a NoSQL, key-attribute data store.  Unlike blobs, table entities are relatively small, at 1MB each. The overview video demos a method of combining table storage and blob storage that leverages the strengths of both to provide a fast, scalable storage solution.  While intended for structured data, limitations on indexing require implementers to rethink how data is structured compared to traditional relational systems.  Data in table storage can also be accessed automatically through an OData interface.  Table storage can serve application data at web scale.
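
A quick sketch of the key-attribute model with the same classic SDK (the names are made up): PartitionKey plus RowKey is effectively the only index, and everything else is just attributes hanging off the entity.

    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Table;

    public class CustomerEntity : TableEntity
    {
        public CustomerEntity() { }

        public CustomerEntity(string region, string customerId)
        {
            PartitionKey = region;      // groups related entities together
            RowKey = customerId;        // unique within the partition
        }

        public string Email { get; set; }
    }

    class TableSketch
    {
        static void InsertCustomer(string storageConnectionString)
        {
            CloudTable table = CloudStorageAccount.Parse(storageConnectionString)
                .CreateCloudTableClient()
                .GetTableReference("customers");
            table.CreateIfNotExists();

            var customer = new CustomerEntity("west", "cust-001") { Email = "someone@example.com" };
            table.Execute(TableOperation.Insert(customer));
        }
    }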

Queue storage offers many of the expected benefits of a message queuing system.  It is important to note that storage queues are not the same thing as Service Bus queues: storage queues offer a simple REST-based Get/Put/Peek interface (much like an abstract queue), whereas Service Bus queues offer more advanced patterns such as pub-sub.  The use of message queues allows modules in a system to remain decoupled.  Individual messages are small (up to 64KB), but a queue can contain millions of messages (up to the storage limit of the account, 500TB).
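
The storage queue API really is about as simple as Get/Put/Peek suggests; a sketch with the classic SDK (the queue name and message are placeholders):

    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Queue;

    class QueueSketch
    {
        static void Demo(string storageConnectionString)
        {
            CloudQueue queue = CloudStorageAccount.Parse(storageConnectionString)
                .CreateCloudQueueClient()
                .GetQueueReference("orders");
            queue.CreateIfNotExists();

            // Producer side: drop a message on the queue and move on.
            queue.AddMessage(new CloudQueueMessage("order:12345"));

            // Consumer side: the message is hidden while it's being processed,
            // and only removed once processing succeeds.
            CloudQueueMessage message = queue.GetMessage();
            if (message != null)
            {
                // ... do the work ...
                queue.DeleteMessage(message);
            }
        }
    }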

File storage essentially creates a file system share in the cloud using the SMB protocol (anyone who's used Samba on Linux has dealt with this); there is also a REST API available. Azure VMs and on-prem clients alike can mount these file shares just as they would any other file share.  File storage can be used to store shared application data (config files), diagnostic data (logs, crash dumps, etc.), and tools and utilities needed for system administration.
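
The REST side is wrapped by the same classic SDK; a sketch of dropping a config file onto a share (the share, directory, and file names are placeholders):

    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.File;

    class FileShareSketch
    {
        static void UploadConfig(string storageConnectionString)
        {
            CloudFileShare share = CloudStorageAccount.Parse(storageConnectionString)
                .CreateCloudFileClient()
                .GetShareReference("appconfig");
            share.CreateIfNotExists();

            CloudFileDirectory root = share.GetRootDirectoryReference();
            CloudFile file = root.GetFileReference("settings.json");
            file.UploadText("{ \"retryCount\": 3 }");   // the same share can also be mounted over SMB
        }
    }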

Disk storage allows you to create virtual hard disks for your Azure VMs.  Azure VMs have two virtual disks by default, one for the OS and another for temporary data.  Optionally, you can create data disks that are persistent and attach them to VMs.  Managed Disks are a related service that abstracts away some of the gritty details, allowing you to specify a type and size and letting Azure manage the rest.



As seen in the architecture overview, the various storage offerings all share some underlying infrastructure.  This infrastructure is what allows data to be offered with low latency worldwide and high durability, with multiple redundant copies in geographically diverse locations (as long as the appropriate options are selected). Each of the services addresses a fairly specific problem space and is likely to complement the other data access technologies in our apps.

