Find all duplicate documents in SharePoint 2010

When we perform some searches on our SharePoint instance, we see the "View Duplicates" link in the search results for a few files.

Is there a way to report on all of these duplicates?

I've seen this SQL for finding duplicates based on their MD5 hash: http://social.technet.microsoft.com/forums/en-US/sharepointsearch/thread/8a8b25d9-a3ac-45df-86de-2a3a7838a534 and I have adapted it for SharePoint 2010 compatibility:

-- Step 1: get all files with short names, MD5 signatures, and sizes
SELECT  md5 ,
        RIGHT(DisplayURL, CHARINDEX('/', REVERSE(DisplayURL)) - 1) AS ShortFileName ,
        DisplayURL AS Url ,
        llVal / 1024 AS FileSizeKb
INTO    #listingFilesMd5Size
FROM    SearchServiceApplication_CrawlStore.dbo.MSSCrawlURL y
        INNER JOIN SearchServiceApplication_PropertyStore.dbo.MSSDocProps dp ON ( y.DocID = dp.DocID )
WHERE   dp.pid = 58 -- File size
        AND llVal > 1024 * 10 -- 10 KB minimum size
        AND md5 <> 0
        AND CHARINDEX('/', REVERSE(DisplayURL)) > 1

-- Step 2: Filter duplicated items

SELECT  COUNT(*) AS NbDuplicates ,
        md5 ,
        ShortFileName ,
        FileSizeKb
INTO    #duplicates
FROM    #listingFilesMd5Size
GROUP BY md5 ,
        ShortFileName ,
        FileSizeKb
HAVING  COUNT(*) > 1
ORDER BY COUNT(*) DESC

DROP TABLE #listingFilesMd5Size

-- Step 3: show the report with search URLs

SELECT  *,
        NbDuplicates * FileSizeKb AS TotalSpaceKb ,
        'http://srv-moss/SearchCenter/Pages/results.aspx?k=' + ShortFileName AS SearchUrl
FROM    #duplicates
--ORDER BY NbDuplicates * FileSizeKb DESC

DROP TABLE #duplicates

But this only matches exact duplicates, whereas I'm interested in the ones SharePoint thinks are duplicates based on the "View Duplicates" link in the search results.
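
For reference, the grouping that the three SQL steps perform can be sketched outside the database. This is only an illustration in Python, assuming the rows (hash, URL, size in bytes) have already been obtained by some other means; `duplicate_report`, `short_name`, and the row layout are hypothetical names, not part of SharePoint.

```python
from collections import defaultdict


def short_name(url):
    """Text after the last '/' in the URL, like the RIGHT/CHARINDEX expression."""
    return url.rsplit("/", 1)[-1]


def duplicate_report(rows, min_bytes=10 * 1024):
    """Group (md5, url, size_bytes) rows; report groups holding more than one file."""
    groups = defaultdict(list)
    for md5, url, size_bytes in rows:
        # Mirror the WHERE clause: skip small files, zero hashes,
        # and URLs that have no file name after the last slash.
        if size_bytes <= min_bytes or md5 == 0:
            continue
        if "/" not in url or url.endswith("/"):
            continue
        key = (md5, short_name(url), size_bytes // 1024)
        groups[key].append(url)

    report = []
    for (md5, name, size_kb), urls in groups.items():
        if len(urls) > 1:  # same as HAVING COUNT(*) > 1
            report.append({
                "nb_duplicates": len(urls),
                "md5": md5,
                "short_file_name": name,
                "file_size_kb": size_kb,
                "total_space_kb": len(urls) * size_kb,
                "urls": urls,
            })
    # Largest total space first, like the commented-out ORDER BY in step 3.
    report.sort(key=lambda item: item["total_space_kb"], reverse=True)
    return report
```

As with the SQL, this only finds files whose hash, file name, and size all match exactly.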

I've seen that there's the managed property "DuplicateHash" but this is not documented anywhere and I cannot find a way of accessing it through the object model.

Thanks


You should not query the database directly; doing so may put you in an unsupported state.

About the duplicates: search duplicates have nothing to do with the hash. They are handled by the search engine index, which compares document vectors (mostly the terms and the number of terms).

You may try to find an FQL query (if you are using FAST; otherwise a query in the standard search query language) that gives you the result, but I'm not sure this is possible.
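
To make the idea of comparing document vectors concrete, here is a minimal sketch in Python. It is not SharePoint's actual algorithm, which is not documented in this thread; it assumes plain term counts compared with cosine similarity, and the function names and the 0.9 threshold are made up for illustration.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Count lowercase word occurrences; a stand-in for a document vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 to 1.0)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def near_duplicates(docs, threshold=0.9):
    """Return (url_a, url_b, score) for every pair at or above the threshold."""
    vectors = {url: term_vector(text) for url, text in docs.items()}
    urls = sorted(vectors)
    pairs = []
    for i, first in enumerate(urls):
        for second in urls[i + 1:]:
            score = cosine_similarity(vectors[first], vectors[second])
            if score >= threshold:
                pairs.append((first, second, score))
    return pairs
```

Two files with the same text but different metadata or formatting would have different MD5 hashes, yet score close to 1.0 here, which is why a hash-based query and the "View Duplicates" link can disagree.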


Use the PowerShell scripts mentioned here:

https://www.nothingbutsharepoint.com/sites/devwiki/articles/pages/finding-duplicate-documents-in-sharepoint-2010.aspx

http://blog.pointbeyond.com/2011/08/24/finding-duplicate-documents-in-sharepoint-using-powershell/
