The definitive Concurrency Management in ADO.NET 2.0 article!

Posted on 6/3/2007 @ 3:10 PM in #Bak2Basics

Have you ever been to a restaurant with a friend? What do you do when the waiter shows up? You give your order, right? But you do it one by one. Either your friend goes first, or you go first. You may even interrupt one another, but if you try ordering together, you will only confuse the waiter, and you probably won't be happy with the food that eventually arrives.

It is rather surprising, if you consider it, how many instances there are in everyday life where concurrency and transaction management are second nature to us. When we drive, ask questions in a classroom, place an order at a drive-through, or wait in a queue for service, we wait for each other, so we can all use a common and scarce resource cordially. And when some of us don't, we usually see accidents, a frustrated teacher, mistaken orders, or a service counter that just cannot serve anyone.

Yet when designing computer systems, this concept escapes many of us. A database in any system is a common, scarce resource. Usually there are one or a few of them, serving the needs of many requests – be it reading data or changing data. Thus, in any well designed application, you have to consider concurrency and transaction management scenarios when designing the system. In this article, I will explain the basic concepts behind concurrency management in data driven applications in ADO.NET 2.0.

The two halves of data access

ADO.NET classes can be broadly classified into two major categories.

On one hand, you have classes that require you to be connected to the data source in order to operate. Examples of such classes are DbConnection, DbCommand, DbParameter, etc. While you could create instances of these classes without being physically connected to the underlying data source, in order to actually use them you have to remain connected to the underlying database. This mechanism of data access constitutes the “connected paradigm” of data access.

The connected world is a very logical way of thinking about data access. If I wish to work with the database, I will have to connect. However, most online transaction processing applications involve the user reading some data out of the database, working with it, and then persisting their changes. The largest chunk of time in this operation is usually where the user is digesting and acting upon the data. Thus, it is frequently advisable to break the connection after reading the data, and re-establish the connection when the user is ready to persist their changes. This generally results in a much more scalable system, at the cost of somewhat greater complexity.

This mechanism of data access constitutes the “disconnected paradigm” of data access. The various classes that exist within ADO.NET 2.0 to support the disconnected paradigm are usually in the System.Data namespace, and start with the word “Data”. Examples of such classes are DataSet, DataTable, DataRow, DataRelation etc.

There is also a sentry object between the connected and disconnected worlds, which is the DbDataAdapter object.
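To make the two paradigms concrete, here is a minimal C# sketch of each. The connection string and the Pets table are purely illustrative (they are not part of this article's schema). The connected read keeps the connection open while the reader is consumed; the disconnected read lets the data adapter open and close the connection around a DataSet fill.

using System;
using System.Data;
using System.Data.SqlClient;

class DataAccessSketch
{
    // Illustrative connection string; substitute your own.
    const string connStr = "Data Source=.;Initial Catalog=PetStore;Integrated Security=True";

    // Connected paradigm: the connection must stay open while the reader is consumed.
    static void ConnectedRead()
    {
        using (SqlConnection conn = new SqlConnection(connStr))
        using (SqlCommand cmd = new SqlCommand("SELECT AnimalType, Weight FROM Pets", conn))
        {
            conn.Open();
            using (SqlDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine("{0}: {1}", reader["AnimalType"], reader["Weight"]);
            }
        }
    }

    // Disconnected paradigm: the adapter opens the connection, fills the DataSet,
    // and closes the connection; the caller then works with the DataSet offline.
    static DataSet DisconnectedRead()
    {
        DataSet ds = new DataSet();
        using (SqlDataAdapter adapter = new SqlDataAdapter("SELECT * FROM Pets", new SqlConnection(connStr)))
        {
            adapter.Fill(ds, "Pets");
        }
        return ds;
    }
}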

Given that both of these worlds are so different, concurrency management schemes for both these worlds are quite different as well.

Concurrency Management in a connected world

When using ADO.NET 2.0 with SQL Server, you would generally use objects under the System.Data.SqlClient namespace. In order to hold commands, you would use the SqlCommand object. SqlCommand has various methods that start with “Execute…”, which allow you to execute the command in one manner or another. These methods need an open connection assigned to the SqlCommand.Connection property.

When a command is executed on SQL Server, it automatically wraps that command in an implicit transaction. An implicit transaction guarantees consistency of data during the execution lifetime of the command. Alternatively, if you need to execute multiple commands in one batch, you can use the SqlTransaction object, or System.Transactions.Transaction to wrap multiple operations in a single explicit transaction.
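As a sketch of the explicit flavor, the following C# wraps two updates in a single SqlTransaction. The Account table and its column names are hypothetical; System.Transactions.TransactionScope could achieve the same effect with less plumbing.

using System.Data.SqlClient;

class ExplicitTransactionSketch
{
    static void Transfer(string connStr, int fromId, int toId, decimal amount)
    {
        using (SqlConnection conn = new SqlConnection(connStr))
        {
            conn.Open();
            using (SqlTransaction tx = conn.BeginTransaction())
            {
                try
                {
                    SqlCommand debit = new SqlCommand(
                        "UPDATE Account SET Balance = Balance - @amt WHERE AccountId = @id", conn, tx);
                    debit.Parameters.AddWithValue("@amt", amount);
                    debit.Parameters.AddWithValue("@id", fromId);
                    debit.ExecuteNonQuery();

                    SqlCommand credit = new SqlCommand(
                        "UPDATE Account SET Balance = Balance + @amt WHERE AccountId = @id", conn, tx);
                    credit.Parameters.AddWithValue("@amt", amount);
                    credit.Parameters.AddWithValue("@id", toId);
                    credit.ExecuteNonQuery();

                    // Both updates commit together, or neither does.
                    tx.Commit();
                }
                catch
                {
                    tx.Rollback();
                    throw;
                }
            }
        }
    }
}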

Thus, while your commands are executing, SQL Server guarantees the consistency of the data per the nature of the transaction those commands are running under. If your transaction requires exclusive locks on the data being read or written, any parallel attempt to meddle with that data will be blocked. But not every transaction requires exclusive locks for every kind of operation, and in many instances this isolation behavior may need to be “broken into” by another parallel request that absolutely must gain access to the data right this moment.

This isolation behavior of a transaction is governed by the isolation level of a running transaction. SQL Server has the following isolation levels:

1. Read Uncommitted: Transactions running at the read uncommitted isolation level do not issue shared locks to prevent other transactions from modifying data read by the current transaction. Such transactions are also not blocked by exclusive locks that would otherwise prevent the current transaction from reading rows that have been modified but not committed by other transactions. This means that, in concurrent scenarios, other transactions can happily change data your transaction has already read, and your transaction, in turn, can happily read other transactions’ uncommitted work (dirty reads).

In other words, in concurrent scenarios, this isolation level will offer you good performance, but no data consistency. Concurrent requests are allowed to go through, without honoring or placing locks.

2. Read Committed: This is the default isolation level. It specifies that a statement attempting to read data that is locked by another transaction will be blocked until that other transaction either rolls back or commits its changes. However, if you have merely read some data under Read Committed, another concurrent transaction can still change that data after you have read it, even though your own transaction has not yet finished.

In other words, in concurrent scenarios, this isolation level offers some level of blocking for data that is being modified, but hardly any protection for data that is merely being read. It is also important to point out that this is an oversimplified picture – the data being locked isn’t just the data that you have modified. Depending upon the indexing scheme being used, or the current I/O the database is handling, you are probably locking a bit (or a lot) more than the rows you have explicitly modified in your transaction.

3. Repeatable Read: Just like Read Committed, transactions running under this isolation level cannot read data that has been modified but not yet committed by other transactions. In addition, other transactions cannot modify data that the current transaction has read – even data it has only read, not modified – until the current transaction completes.

This means that, in concurrent scenarios, this isolation level offers data consistency for both reads and writes between multiple transactions. If both reads and writes are protected, what could go wrong? What this isolation level does not protect you from are phantom rows: new rows, inserted by other transactions, that match your query but had not yet appeared when you read. That is a job for the Serializable isolation level.

4. Serializable: Just like Repeatable Read, the Serializable isolation level prevents the current transaction from reading data that has been modified but not committed by other concurrent transactions, and prevents other transactions from meddling with data that the current transaction has read or modified. In addition, other transactions cannot insert new rows that would match the “where” clause of reads done under the current transaction until the current transaction completes.

This means, in concurrent scenarios, this isolation level offers perfect data consistency, but awful concurrent performance.

5. Snapshot: It seems like all the isolation levels so far make you choose between data consistency and good performance. The snapshot isolation level falls right in the middle by offering non-blocking reads. It does so by keeping a copy of the previous, committed version of modified data and serving that consistent state to transactions that ask to read the data while a modification is still in flight. This row-versioning behavior can also be applied to the read committed isolation level (read committed snapshot). Since this article is more about concurrency than transactions, I would advise you to read the details of this isolation level, and transactions in general, in Chapter 11 of Microsoft SQL Server 2005 by MS Press (Andrew Brust, Stephen Forte).
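In ADO.NET, the isolation level is something you pick when you begin the transaction. Here is a minimal sketch, again assuming a hypothetical Account table; note that IsolationLevel.Snapshot additionally requires ALLOW_SNAPSHOT_ISOLATION to be enabled on the database.

using System.Data;
using System.Data.SqlClient;

class IsolationLevelSketch
{
    static decimal ReadBalance(string connStr, int accountId)
    {
        using (SqlConnection conn = new SqlConnection(connStr))
        {
            conn.Open();
            // Pick the isolation level explicitly instead of relying on the
            // default Read Committed; Serializable is used here for illustration.
            using (SqlTransaction tx = conn.BeginTransaction(IsolationLevel.Serializable))
            {
                SqlCommand cmd = new SqlCommand(
                    "SELECT Balance FROM Account WHERE AccountId = @id", conn, tx);
                cmd.Parameters.AddWithValue("@id", accountId);
                decimal balance = (decimal)cmd.ExecuteScalar();
                tx.Commit();
                return balance;
            }
        }
    }
}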

A practical example

Imagine a banking system. An account holder is trying to perform two operations: withdraw $100 from an ATM, and transfer $75 from the account to an external account. The problem is, the user is performing these two operations concurrently, and the starting account balance is only $150. As humans, it is obvious to us that one of these operations should fail. Let us see how a computer system deals with this problem.

Unfortunately, the architect of this system decided to implement it as a connected system, and didn’t understand the nuances of isolation levels when implementing it. He did wrap the commands in a transaction, but left them at the default “Read Committed” isolation level.
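In code, the flawed pattern might look something like the sketch below (the Account table and column names are illustrative). The balance check lives in application code, between the SELECT and the UPDATE – which is exactly the window the tables that follow exploit.

using System.Data;
using System.Data.SqlClient;

class NaiveWithdrawalSketch
{
    static void Withdraw(string connStr, int accountId, decimal amount)
    {
        using (SqlConnection conn = new SqlConnection(connStr))
        {
            conn.Open();
            using (SqlTransaction tx = conn.BeginTransaction(IsolationLevel.ReadCommitted))
            {
                SqlCommand read = new SqlCommand(
                    "SELECT Balance FROM Account WHERE AccountId = @id", conn, tx);
                read.Parameters.AddWithValue("@id", accountId);
                decimal balance = (decimal)read.ExecuteScalar();

                if (balance >= amount) // the check happens in application code...
                {
                    SqlCommand write = new SqlCommand(
                        "UPDATE Account SET Balance = @newBalance WHERE AccountId = @id", conn, tx);
                    write.Parameters.AddWithValue("@newBalance", balance - amount);
                    write.Parameters.AddWithValue("@id", accountId);
                    // ...so a concurrent transaction can read the same starting
                    // balance before this update commits.
                    write.ExecuteNonQuery();
                }
                tx.Commit();
            }
        }
    }
}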

Here is how Operation 1 (withdraw $100) would look if it were executing in its own world (say, on a developer’s desktop) –

Command Executed | Data Read | Account Balance
Get Current Balance | $150 | $150
Do I have enough balance to subtract $100? | Yes I do | $150
Subtract $100 | | $150 - $100 = $50
Commit | | $50
Read Updated Balance | $50 | $50

Here is how Operation 2 (transfer $75) would look if it were executing in its own world (say, on a developer’s desktop) –

Command Executed | Data Read | Account Balance
Get Current Balance | $150 | $150
Do I have enough balance to transfer $75? | Yes I do | $150
Subtract $75 | | $150 - $75 = $75
Commit | | $75
Read Updated Balance | $75 | $75

At this time, the system architect was convinced that he did a good job, and put this system into production – without thinking of concurrency. But here is what happened under the default read committed isolation level –

Operation 1 Command | Operation 2 Command | Data Read | Account Balance | Description
Get Current Balance | | $150 | $150 | Since the transaction is running under Read Committed, locks are not placed yet.
 | Get Current Balance | $150 | $150 | Unfortunately, this read will succeed because the data isn’t locked yet.
Can I withdraw $100? | | | $150 | Operation 1 thinks that the balance is $150, so it will issue an update for -$100.
 | Can I transfer $75? | | $150 | Operation 2 thinks that the balance is $150, so it will issue an update for -$75.
Subtract $100 | | | $50 | The data isn’t locked yet, so this will succeed.
Commit | | | $50 | So far so good.
 | Subtract $75 | | -$25 | Since no current transaction holds a lock on the data (Operation 1 has already committed), this will succeed. OUCH!
 | Commit | | -$25 | OUCH!
Read current balance | | -$25 | -$25 | OUCH!
 | Read current balance | -$25 | -$25 | OUCH!

Have we all heard the phrase, “It works on my machine!” before?

The above is a poster child for such a scenario – where the system architect ignored the effect of isolation levels in a concurrent scenario. There are two solutions to the above problem:

1. Run the transactions at the Repeatable Read isolation level.

2. Send in only the changes – let the database do the subtraction. During the update, ensure that the data hasn’t changed since the last time you read it (a sketch of this approach follows below).
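Here is a minimal sketch of that second solution, again with a hypothetical Account table: the subtraction and the balance check travel to the database in one atomic statement, so two concurrent withdrawals cannot both succeed when there is only enough money for one of them.

using System.Data.SqlClient;

class AtomicWithdrawalSketch
{
    static bool Withdraw(string connStr, int accountId, decimal amount)
    {
        using (SqlConnection conn = new SqlConnection(connStr))
        using (SqlCommand cmd = new SqlCommand(
            "UPDATE Account SET Balance = Balance - @amt " +
            "WHERE AccountId = @id AND Balance >= @amt", conn))
        {
            cmd.Parameters.AddWithValue("@amt", amount);
            cmd.Parameters.AddWithValue("@id", accountId);
            conn.Open();
            // Zero rows affected means the balance was insufficient (or the
            // account does not exist) - the caller can then fail gracefully.
            return cmd.ExecuteNonQuery() == 1;
        }
    }
}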

The second solution mentioned above brings me to the second half of data access – the disconnected paradigm.

Concurrency Management in a disconnected world

In the practical problem presented above, the second solution is so good that I am going to repeat it:

Send in only the changes – let the database do the subtraction. During the update, ensure that the data hasn’t changed since the last time you read it.

This has profound implications for your architecture, as it touches two major tenets of data driven application architecture.

a) Design your system with traffic lights, not ambulances. Sending in only the changes and letting the database do the math is one example of how better database design can prevent many concurrency issues. Other such examples are journaling databases, GUID keys, keys that do not have a business meaning, specific kinds of database design for hierarchical updates, and many other such practices that a system architect can choose from. These act as traffic lights that prevent concurrency conflicts. You should design a system that has traffic lights – not ambulances.

b) But the sad reality is that accidents will still occur, and that is when you will need ambulances. Even if the database is doing the math, a different user could have updated the field in the meanwhile. At this point you need some way to detect that somebody else has changed the data since you last read it. If the data was indeed changed, you can fail the update and ask the user to retry with a fresh read of the data.

What is interesting about being able to detect whether somebody else has changed the data is this: someone could only have changed it if you hadn’t placed a lock on it while reading it in the first place.

This presents a significant opportunity to the application architect, because if your architecture doesn’t need to place locks during reads, it probably doesn’t need to maintain an open database connection either – not for concurrency reasons, at least. In other words, this mechanism of detecting someone else’s changes works perfectly well in disconnected scenarios.

There are 4 ways of detecting if someone else has changed data since the last time you read it. The best way to understand these is to look at a practical example.

A practical example

Imagine a pet store. There is a table called Pets with the following data in it.

Key | AnimalType | Weight
1 | Puppy | 5lbs

Now there are two users, Bill and Frans, who wish to update this data in a disconnected environment. Both of them read the data, disconnect, change some of it, and then try to save their changes by reconnecting at a later time.

Here are the four possible strategies for detecting and acting upon their concurrency conflicts in this disconnected paradigm.

Strategy #1: Check for only Primary Keys

In this mechanism, the update query’s where clause checks for only the primary key. Thus, the following would be the sequence of data between Bill and Frans.

Bill and Frans both read the following –

Key | AnimalType | Weight
1 | Puppy | 5lbs

Bill issues an update that looks like this –

Update Pets Set AnimalType = 'Dog', Weight = '50lbs' where Key = 1

The data now looks like this –

Key | AnimalType | Weight
1 | Dog | 50lbs

Next, Frans issues an update that looks like this –

Update Pets Set AnimalType = 'Monkey', Weight = '20lbs' where Key = 1

The data now looks like this –

Key | AnimalType | Weight
1 | Monkey | 20lbs

Thus this is a “Last in wins” approach. Effectively, you are not checking for concurrency conflicts, the performance is usually blazing fast, and the approach is cross-database (i.e. not specific to SQL Server).
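In the disconnected ADO.NET world, this “last in wins” behavior corresponds, to the best of my knowledge, to the OverwriteChanges conflict option on the command builder, which generates a WHERE clause containing only the primary key. A brief sketch:

using System.Data;
using System.Data.SqlClient;

class LastInWinsSketch
{
    static SqlCommandBuilder BuildLastInWins(SqlDataAdapter adapter)
    {
        SqlCommandBuilder builder = new SqlCommandBuilder(adapter);
        // OverwriteChanges: the generated UPDATE/DELETE statements filter on
        // the primary key only, so whoever saves last silently wins.
        builder.ConflictOption = ConflictOption.OverwriteChanges;
        return builder;
    }
}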

Strategy #2: Check for Primary Keys and all other columns

In this mechanism, the update query’s where clause checks for the primary key and all other columns. Thus, the following would be the sequence of data between Bill and Frans.

Bill and Frans both read the following –

Key | AnimalType | Weight
1 | Puppy | 5lbs

Bill issues an update that looks like this –

Update Pets
   Set AnimalType = 'Dog', Weight = '50lbs'
where
   Key = 1 and AnimalType = 'Puppy' and Weight = '5lbs'

The data now looks like this –

Key | AnimalType | Weight
1 | Dog | 50lbs

Next, Frans issues an update that looks like this –

Update Pets
   Set AnimalType = 'Monkey', Weight = '20lbs'
where
   Key = 1 and AnimalType = 'Puppy' and Weight = '5lbs'

At this time, the above update query will end up affecting zero rows, because there are no puppies in the database – only dogs.

Thus this is a “First in wins” approach. In this scenario, you are checking for concurrency conflicts, the performance is usually terrible because of the complicated where clause, and the approach is cross-database (i.e. not specific to SQL Server).
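Strategy #2 is, in fact, what SqlCommandBuilder generates by default in the disconnected world: the UPDATE it produces compares the original value of every searchable column, and SqlDataAdapter.Update throws a DBConcurrencyException when that UPDATE ends up affecting zero rows. A sketch, reusing the illustrative Pets table:

using System.Data;
using System.Data.SqlClient;

class AllColumnsConcurrencySketch
{
    static void UpdatePet(string connStr)
    {
        using (SqlConnection conn = new SqlConnection(connStr))
        using (SqlDataAdapter adapter = new SqlDataAdapter("SELECT [Key], AnimalType, Weight FROM Pets", conn))
        using (SqlCommandBuilder builder = new SqlCommandBuilder(adapter))
        {
            DataTable pets = new DataTable();
            adapter.Fill(pets);                 // read, then work disconnected

            pets.Rows[0]["AnimalType"] = "Dog"; // edit the disconnected copy
            pets.Rows[0]["Weight"] = "50lbs";

            try
            {
                adapter.Update(pets);           // reconnect and persist
            }
            catch (DBConcurrencyException)
            {
                // Someone else changed (or deleted) the row since we read it:
                // re-read the data and ask the user to retry.
            }
        }
    }
}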

Strategy #3: Check for Primary Keys and only changed columns

In this mechanism, the update query’s where clause checks for the primary key and all other columns that have changed. Thus, the following would be the sequence of data between Bill and Frans.

Bill and Frans both read the following –

Key | AnimalType | Weight
1 | Puppy | 5lbs

Bill issues an update that looks like this –

Update Pets
   Set AnimalType = 'Dog'
where
   Key = 1 and AnimalType = 'Puppy' and Weight = '5lbs'

The data now looks like this –

Key | AnimalType | Weight
1 | Dog | 5lbs

Next, Frans issues an update that looks like this –

Update Pets
   Set AnimalType = 'Monkey', Weight = '20lbs'
where
   Key = 1 and AnimalType = 'Puppy' and Weight = '5lbs'

At this time, the above update query will end up affecting zero rows, because there are no puppies in the database – only dogs.

Thus this is somewhat of a “First in wins” approach – with a sinister hole.

So what is the sinister hole?

Now that Frans’ query has failed, what does he do next? He probably gets a fresh read out of the database that looks like this –

Key | AnimalType | Weight
1 | Dog | 5lbs

Frans, at this time, smiles and agrees that the AnimalType should indeed be Dog, but the Weight should be 50lbs. However, before Frans can issue his update, another user called Tobin reads the above data and issues the following update statement.

Update Pets
   Set AnimalType = 'Monkey'
where
   Key = 1 and AnimalType = 'Dog'

Next, Frans issues his update statement updating only the weight of the pet –

Update Pets
   Set Weight = '50lbs'
where
   Key = 1 and Weight = '5lbs'

So at the end of the above sequence of commands,

·  Bill thinks he has a Dog that weighs 5lbs

·  Frans thinks that he has a Dog that weighs 50lbs

·  Tobin thinks that he has a Monkey that weighs 5lbs

·  But Sahil, who queries the actual data in the end, knows that the actual data is a Monkey that weighs 50lbs

So this scenario isn’t foolproof, you do check for concurrency conflicts, and the performance is usually bad. It is arguable if the performance is any better than checking for all columns in a where clause, since while there is lesser data to check for, the structure of the query keeps changing – and thus cached query plans are not used effectively.

Strategy #4: Check for Primary Keys and the Timestamp column

In this strategy, we take advantage of the TimeStamp (RowVersion) column in SQL Server. In Oracle, you could achieve the same thing using ORA_ROWSCN. The TimeStamp column changes automatically every time an update is issued on a given row. Thus Bill, Frans, Tobin and Sahil are never confused about each other’s updates, as long as they validate the timestamp they read against the timestamp on the row they are updating.

In this mechanism, you are checking for concurrency conflicts, the performance is quite good, but the approach is not database independent.
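Here is a sketch of strategy #4, assuming the Pets table carries a rowversion/timestamp column named RowVersion (an assumption for illustration): the byte array read during the SELECT is replayed in the WHERE clause of the UPDATE, and zero affected rows means somebody else got there first.

using System.Data.SqlClient;

class RowVersionSketch
{
    static bool UpdateWeight(string connStr, int key, string newWeight, byte[] originalRowVersion)
    {
        using (SqlConnection conn = new SqlConnection(connStr))
        using (SqlCommand cmd = new SqlCommand(
            "UPDATE Pets SET Weight = @weight " +
            "WHERE [Key] = @key AND RowVersion = @rowVersion", conn))
        {
            cmd.Parameters.AddWithValue("@weight", newWeight);
            cmd.Parameters.AddWithValue("@key", key);
            cmd.Parameters.AddWithValue("@rowVersion", originalRowVersion);
            conn.Open();
            // false => concurrency conflict: re-read the row (and its new
            // RowVersion) and ask the user to retry.
            return cmd.ExecuteNonQuery() == 1;
        }
    }
}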

Conclusion

Unless you make your living writing complicated computer systems that are used only by a single user, you cannot ignore the importance of concurrency. This article presented a quick overview of concurrency strategies and concerns in both connected and disconnected scenarios. For a more in-depth treatment on this topic, please read Chapters 9, 10 and 11 in my book – Pro ADO.NET 2.0.


Older comments..


On 6/5/2007 5:23:39 PM Colin Blair said ..
I love the article but you kind of wimp out at the end when you get to the RowVersion strategy. It is like you are assuming that everyone is actually familiar with that approach and just glossed over it. I would love to point people I am trying to help to this entry but not with the "correct" answer being glossed over like that. You can even make it database independent if you manually increment the rowversion yourself on each update. For example,

Update Pets
   Set Weight = '50lbs', RowVersion = RowVersion + 1
where
   Key = 1 and Weight = '5lbs' and RowVersion = 25

The other thing is that it should be RowVersion (TimeStamp), not the other way around. Calling it a timestamp is confusing, that is why that naming is deprecated.


On 6/5/2007 11:11:50 PM Sahil Malik said ..
Hey Colin - wow being picky are we? :).

Thanks for the inputs, glad you liked the article. Yeah the database independent rowversion is a great idea!

SM