Search
  Home
  Articles
  Backup
  Books
  Certification
  FAQ
  Products
  Replication
  Scripts
  Seminars
  Training
  TSQL

MSDN Fourms
Philippine SSUG

  Resume

MHS Enterprises
FilAm Software
AcrylicAcetate.com
Bargain Humidors
Western Humidor



Version: SQL Server 2005 RTM through SP2. Feature: Merge replication

SQL Server 2005 has the new capability to allow you to propagate DDL changes through the replication engine. While this is a very welcome addition, you must be extremely careful with the way that you deploy schema changes. A perfectly valid DDL script, which executes without an error on the publisher, can cause the merge replication engine to completely and irrevocably explode. Fixing this wonderful problem, sarcasm intended, means that you have to completely blow away your subscribers and start from scratch. While this might not be an issue if you have one or two subscribers, just consider having to do this with dozens, hundreds, or thousands of subscribers.

You might be wondering how you manage to explode the merge replication engine with a very simple and completely valid DDL script. The steps below will give you an exact reproduction of the problem.

The steps below require a single instance of SQL Server 2005 SP2 or lower and assume that replication is already enabled.

  1. Create a database named test1 - this will be your publication database
  2. Create a database named test2 - this will be your subscription database
  3. Execute the following command in test1:
    CREATE TABLE dbo.Table1 (Column1 int NULL)
  4. Now create a merge publication that includes dbo.Table1 as the only article, with no filtering
  5. Once the publication has been created, push a new subscription to test2
  6. Verify that test2 now contains your table named dbo.Table1, which means everything is synchronized
  7. Execute the following script against test1:
    USE test1
    GO

    CREATE TABLE dbo.Table2
    (Column2 int NOT NULL,
    CONSTRAINT pk_table2 PRIMARY KEY CLUSTERED (Column2))

    ALTER TABLE dbo.Table1
    ADD Column2 int NULL

    ALTER TABLE dbo.Table1 WITH CHECK ADD CONSTRAINT fk_table2totable1 FOREIGN KEY(Column2) REFERENCES dbo.Table2 (Column2)
    GO
  8. Go to the properties of your publication, select articles, and add dbo.Table2 to the publication.
  9. Regenerate the snapshot
  10. Synchronize the subscriber
  11. Observe that you are getting an error message that says replication can not create the foreign key

At this point, merge replication is completely broken and nothing is going to get it working again short of completely blowing away the subscriber and starting over

There is someone reading this who is thinking that you could just reinitialize the subscriber. True, but consider that for a moment. This was a simple demo. Real life is not so simple. You have changes sitting on the subscriber that you are getting ready to blow away with a reinit. You might have hundreds or thousands of these subscribers sitting on the other end of very slow WAN links or you might be dealing with a rather large data set. All of these situations dictate that reinitializing all of the subscribers means you are effectively taking your business completely offline until you are done.

Let us consider that you actually have the luxury of reinitializing. What about the subscribers with changes that have not uploaded? Well, there is a new option in SQL Server 2005 that allows you to upload changes first and then reinit. The undocumented fine print for this feature does NOT tell you that in order to upload changes first, the schema must match between the publisher and subscriber. Therefore, it is impossible to upload changes first and then blast the subscriber away with a reinit.

The only way out of this predicament is to perform the following steps:

  1. Connect subscriber #1 to your network
  2. Connect to subscriber #1 and backup the local database
  3. Reinitialize without uploading any changes
  4. Transfer the local backup to the publisher and restore
  5. Manually compare every row between the publisher and subscriber to identify any differences
  6. Generate a script containing those differences (these last two steps can be done using the tablediff command line utility - as long as you don't have a table that is too large for tablediff)
  7. Stuff the changes back into the publisher and allow to propagate back down to the subscriber
  8. Repeat the steps above for subscribers 2 - N

The steps above might look simple, but just consider doing this on a subscriber that might have hundreds of tables and millions of rows of data. Additionally, how are you going to reconcile changes that have occurred to rows on multiple subscribers. Add to that, the fact that if you are using auto-identity range management, the SET IDENITY_INSERT command is not allowed, meaning that you now have to generate a new identity value for a parent row and relink all child rows to it.

As you can tell, this is an extremely bad situation. There is exactly 1 workaround that does not entail reinitializing, but it does require a significant amount of manual effort. In order to get everything running again, you have to connect to each subscriber and manually create Table2 on the subscriber. All of the DDL changes along with the incremental snapshot will then execute. Those managing small implementations might think the workaround is pretty simple and straightforward, you are the lucky ones. This situation happened at one of my customers where we have 42 active subscribers, some of which might not connect for 10 - 15 days at a time. So, you have a broken system for weeks before you can get everything repiared and that means a LOT of sleepless nights while people are screaming at you because they can't work. One of my other customers was prepping a script very similar to the one above to deploy this weekend, except they have 17,352 subscribers that would have been exploded had this happened and fixing the problem would have meant literally millions of dollars.

So, the bottom line is that if you are using merge replication to deploy schema changes, you must place any table creations in their own script, add them to replication, and immediately generate a snapshot. Only then can you go back and add the columns to the child tables and create the foreign keys. Running all of this in a single script will explode your replication system.

Some of you are quickly reading through this to find out the reason why this happens, after all you keep contacting me, going to conference sessions that I'm at, and attending training to find out the gory details of all of this merge metadata stuff. So, this next section is for you. For those who do not want to warp their brains, you can stop reading now.

Many think that DDL changes are picked up from the transaction log by the replication engine. This is not true. DDL changes are propagated by means of a set of DDL triggers. These triggers are named: MSmerge_tr_alterschemaonly, MSmerge_tr_altertable, MSmerge_tr_altertrigger, and MSmerge_tr_alterview. Each database participating in merge replication will have these 4 DDL triggers. The purpose of the DDL triggers is to pick up any ALTER commands and place them in dbo.sysmergeschemachange.

Every time you change the schema for an object that is participating in merge replication, a new schema version will be generated. A new schema version will be generated for EVERY change that is made. For example, say you want to add 6 columns to a table that is replicated, You can do this either by creating an ALTER TABLE command that adds all 6 columns. You can also do this by creating 6 separate ALTER TABLE commands that add 1 column each. If you wrap all 6 columns into a single ALTER TABLE command, the schema version will increment by 1. If you execute 6 separate ALTER TABLE commands, the schema version will increment by 6 - one schema version for each ALTER TABLE command. Additionally, every time you add a new article to a publication, the schema version is incremented. Think about this very carefully before going on and you will start to see the problem.

Where this behavior creates a collision with merge replication is in the interaction of the merge DDL triggers and the sysmergeschemachange table. When we execute our demo script which explodes merge replication, the following actions occur underneath the covers:

  1. Table2 is created, but since it is not part of replication, does not cause the schema version to increment
  2. A column is added to Table1 and since it is part of a publication, MSmerge_tr_altertable is fired
  3. MSmerge_tr_altertable increments the schema version and writes a row into sysmergeschemachange containing the ALTER TABLE script
  4. The foreign key is added successfully on the publisher, since the parent table was previously created, causing MSmerge_tr_altertable to fire again
  5. MSmerge_tr_altertable increments the schema version and writes a row into sysmergeschemachange containing the ALTER TABLE script to create the foreign key
  6. You now add Table2 to the publication, causing sp_add_mergearticle to be executed
  7. Since a new table has been added, a new snapshot has to be generated in order for the incremental change to be sent to the subscriber
  8. When sp_add_mergearticle is executed, the schema version is incremented and an entry for the incremental snapshot is added to sysmergeschemachange
  9. You now try to synchronize the subscriber and it explodes, because the merge engine is applying the schema changes in the order of the schema version contained within sysmergeschemachange, which causes the merge engine to attempt to create the foreign key before the incremental snapshot can create the parent table that is referenced.

You can verify the actions for the above steps by selecting from sysmergeschemachange. The ALTER TABLE commands will have a schematype of 300. The incremental snapshot will have a schematype of 2. You will find that the artid (article ID - found in sysmergearticles) corresponds to the table being altered. The artid for the snapshot will correspond to the newly created table, the parent.

The problem is that the replication engine is sequential. It has to iterate through each schema version and apply it in order without having any intelligence whatsoever. All of the data exists in the table for the replication engine to be able to figure out that it is trying to create a foreign key to a table which is created within an incremental snapshot and therefore to add the parent table first before it attempts to create the foreign key.

The sad part about this major oversight, is that the merge engine doesn't even have to create the foreign key in this case. If you were to manually stuff Table2 into the test2 database, the merge engine would happily execute down through the sysmergeschemachange table. It would perform the following actions:

  1. ALTER dbo.Table1 and add the column
  2. ALTER dbo.Table1 and add the foreign key
  3. Execute the incremental snapshot which SHOULD do the following:
    • Drop the foreign key just created
    • Drop the table that you stuff into test2
    • Recreate dbo.Table2
    • Recreate the foreign key on dbo.Table1

Step 3 does not work properly in this case, because of the incremental snapshot. Since only Table2 is sent down, the application of the snapshot will explode because it cannot drop a table with a foreign key constraint. So, you have to manually shove Table2 into the test2 database. Then synch, which will result in a failure on the incremental snapshot. You then have to manually drop the foreign key on the subscriber. Run the synch again to get through the incremental snapshot. Then drop and recreate the foreign key on the publisher in order to get it sent down.

So, you now have one more restriction on applying DDL changes with replication that is not documented by Microsoft at this time. It is illegal to execute a single script that creates a foreign key to a table already participating in replication if the parent table has not already been replicated to the subscribers. In order to make DDL changes propagate properly, you have to manipulate your DDL scripts so they work with the DDL triggers and are cognizant of the way the schema version is incremented. Therefore, all scripts that you create must ensure that the parent table has been added to the publication and a new snapshot generated which creates the entry in sysmergeschemachange for the parent table. Once the entry for the parent table is in sysmergeschemachange, you can then perform the remainder of your actions necessary to add the foreign key.

One of the claims for replication in SQL Server 2005 was, "it just works". Well, after opening a half dozen support cases on the merge engine blowing up due to the state of the metadata (you will see several more of these articles posted over the next several days), I would amend this as follows: "it just works, provided you understand the precise order in which metadata entries are made into the tables and you ensure that everything you do obeys all of the restrictions, just because a script is legal to execute does not mean that it won't completely blow up the replication engine".

Michael R. Hotek

All content on this site, except where noted, represents an original work of Michael R. Hotek and is protected by applicable copyright laws. The SQL Server FAQ is the sole work of Neil Pike. No page, portion of a page, or download may be used for commercial purposes in whole or in part without the express, written permission of the applicable author.