Thursday, October 7, 2010

Deleting Duplicate Rows in SQL Server Based on a Single Column

An import from a client's product list resulted in some incorrectly duplicated products. Normally this would be easy to sort out, except that the duplicates differed by a category column, so the good ol' DISTINCT keyword was of no help. This would also have been easy to do if working with MySQL by using a GROUP BY clause to select a row by distinct instances of a particular column. SQL Server, however, is another story. As I found, if you try the GROUP BY trick, SQL Server will whine.
Server: Msg 8120, Level 16, State 1, Line 1
Column 'ItemNumber' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

I found some reasoning for this online, which is all well and good, except that MySQL does this like a champ and makes life easier as a result. So off I went to find a solution for SQL Server... I eventually stumbled on a page from the SQL Authority, by Pinal Dave (a very humble guy, judging by the title of his site). The gem, however, wasn't the post itself but a comment by Madhivanan further down the page. Thanks to Madhivanan, I came up with the following solution.

DELETE t FROM

(

select (row_number() over (partition by ItemNumber

order by ItemID)) as cnt, * from Item

) as t

WHERE t.cnt > 1

This snippet will give you a count of how many times a particular item number appears in my database's Item table. We isolate only the ones that appear more than once and delete them from the table. Not as simple as using a GROUP BY, but it gets the job done.