Searching a folder (recursively) for duplicate photos using ColdFusion?

Posted: 2023-03-13 18:24

After moving and backing up my photo collection a few times I have several duplicate photos, with different filenames in various folders scattered across my PC. So I thought I would write a quick CF (9) page to find the duplicates (and can then add code later to allow me to delete them).

I have a few questions:

At the moment I am just using file size to match image files, but I presume matching EXIF data or a hash of the image file's binary content would be more reliable?
The code I lashed together sort of works, but how could this be done to search outside the web root?
Is there a better way?

<cfdirectory
name="myfiles" 
directory="C:\ColdFusion9\wwwroot\images\photos" 
filter="*.jpg"
recurse="true"
sort="size DESC"
type="file" >


<cfset matchingCount=0>
<cfset duplicatesFound=0>
<table border=1>
<cfloop query="myFiles" endrow="#myfiles.recordcount#-1">

    <cfif myfiles.size is myfiles.size[currentrow + 1]>
        <!---this file is the same size as the next row--->
        <cfset matchingCount = matchingCount + 1>
        <cfset duplicatesFound=1>
    <cfelse>
        <!--- the next file is a different size --->

        <!--- if there have been matches, display them now ---> 
        <cfif matchingCount gt 0>   

            <cfset sRow=#currentrow#-#matchingCount#>
            <cfoutput><tr>
            <cfloop index="i" from="#sRow#" to="#currentrow#"> 
                    <cfset imgURL=#replace(directory[i], "C:\ColdFusion9\wwwroot\", "http://localhost:8500/")#>
                    <td><a href="#imgURL#\#name[i]#"><img height=200 width=200 src="#imgURL#\#name[i]#"></a></td>
            </cfloop></tr><tr>
            <cfloop index="i" from="#sRow#" to="#currentrow#"> 
                <td width=200>#name[i]#<br>#directory[i]#</td>
            </cfloop>
            </tr>
            </cfoutput>

            <cfset matchingCount = 0>

        </cfif> 
    </cfif>
</cfloop>
</table>
<cfif duplicatesFound is 0><cfoutput>No duplicate jpgs found</cfoutput></cfif>

This is a pretty fun task, so I decided to give it a try.

First, some test results on my laptop with 4GB RAM, a 2x2.26GHz CPU and an SSD: 1,143 images, 263.8MB in total.

ACF9: 8 duplicates, took ~2.3 s

Railo 3.3: 8 duplicates, took ~2.0 s (yay!)

I used a great tip from this SO answer to pick the best hashing option.

So, here is what I did:

<cfsetting enablecfoutputonly="true" />

<cfset ticks = getTickCount() />

<!--- this is great set of utils from Apache --->
<cfset digestUtils = CreateObject("java","org.apache.commons.codec.digest.DigestUtils") />

<!--- cache containers --->
<cfset checksums = {} />
<cfset duplicates = {} />

<cfdirectory
    action="list"
    name="images"
    directory="/home/trovich/images/"
    filter="*.png|*.jpg|*.jpeg|*.gif"
    recurse="true" />

<cfloop query="images">

    <!--- change delimiter to \ if you're on windoze --->
    <cfset ipath = images.directory & "/" & images.name />

    <cffile action="readbinary" file="#ipath#" variable="binimage" />

    <!---
        This is slow as hell with any encoding!
        <cfset checksum = BinaryEncode(binimage, "Base64") />
     --->

    <cfset checksum = digestUtils.md5Hex(binimage) />

    <cfif StructKeyExists(checksums, checksum)>

        <!--- init cache using original on 1st position when duplicate found --->
        <cfif NOT StructKeyExists(duplicates, checksum)>
            <cfset duplicates[checksum] = [] />
            <cfset ArrayAppend(duplicates[checksum], checksums[checksum]) />
        </cfif>

        <!--- append current duplicate --->
        <cfset ArrayAppend(duplicates[checksum], ipath) />

    <cfelse>

        <!--- save originals only into the cache --->
        <cfset checksums[checksum] = ipath />

    </cfif>

</cfloop>

<cfset time = NumberFormat((getTickcount()-ticks)/1000, "._") />


<!--- render duplicates without resizing (see options of cfimage for this) --->

<cfoutput>

<h1>Found #StructCount(duplicates)# duplicates, took ~#time# s</h1>

<cfloop collection="#duplicates#" item="checksum">
<p>
    <!--- display all found paths of duplicate --->
    <cfloop array="#duplicates[checksum]#" index="path">
        #HTMLEditFormat(path)#<br/>
    </cfloop>
    <!--- render only the last duplicate, they are all the same image anyway --->
    <cfimage action="writeToBrowser" source="#path#" />
</p>
</cfloop>

</cfoutput>

Obviously, you can easily use the duplicates struct (one array of paths per checksum) to review the results and/or run a cleanup job.
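As a rough sketch of such a cleanup job, assuming the duplicates struct built by the code above is still in scope: it keeps the first path in each group and treats the rest as deletion candidates. The dryRun flag and the group variable are names made up for this example, and the dry run is deliberately the default so nothing is deleted until you have reviewed the list.

```cfml
<!--- dryRun is a hypothetical safety switch: true only lists what would be removed --->
<cfset dryRun = true />

<cfoutput>
<cfloop collection="#duplicates#" item="checksum">
    <cfset group = duplicates[checksum] />
    <!--- position 1 holds the first file seen with this checksum, so keep it --->
    <p>Keeping #HTMLEditFormat(group[1])#</p>
    <cfloop from="2" to="#ArrayLen(group)#" index="i">
        <cfif dryRun>
            <p>Would delete #HTMLEditFormat(group[i])#</p>
        <cfelse>
            <cffile action="delete" file="#group[i]#" />
            <p>Deleted #HTMLEditFormat(group[i])#</p>
        </cfif>
    </cfloop>
</cfloop>
</cfoutput>
```

Which copy counts as the "original" depends only on the order cfdirectory returned the files, so you may want to pick the keeper by folder or date instead.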

Have fun!

I would recommend splitting the checking code out into a function that accepts only a filename.

Then use a global struct to check for duplicates: the key would be "size" or "size_hash", and the value would be an array containing all filenames that match this key.

Run the function on all JPEG files in all the different directories, then scan the struct and report every entry that has more than one file in its array.
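The steps above could be sketched roughly as follows. This is only an illustration: registerFile, filesByKey and the C:\photos directory are made-up names, and the hashing borrows the Apache Commons Codec DigestUtils class used in the other answer, assuming it is available on your server's classpath.

```cfml
<!--- "global" struct: key = "size_hash", value = array of file paths with that key --->
<cfset variables.filesByKey = {} />
<cfset variables.digestUtils = CreateObject("java", "org.apache.commons.codec.digest.DigestUtils") />
<!--- platform path separator, so the same page works on Windows and *nix --->
<cfset variables.sep = CreateObject("java", "java.io.File").separator />

<!--- hypothetical helper: files one path under its size_hash key --->
<cffunction name="registerFile" returntype="void" output="false">
    <cfargument name="filename" type="string" required="true" />
    <cfset var bytes = "" />
    <cfset var info = GetFileInfo(arguments.filename) />
    <cfset var key = "" />

    <cffile action="readbinary" file="#arguments.filename#" variable="bytes" />
    <cfset key = info.size & "_" & variables.digestUtils.md5Hex(bytes) />

    <cfif NOT StructKeyExists(variables.filesByKey, key)>
        <cfset variables.filesByKey[key] = [] />
    </cfif>
    <cfset ArrayAppend(variables.filesByKey[key], arguments.filename) />
</cffunction>

<!--- C:\photos is a placeholder; the directory does not need to be under the web root --->
<cfdirectory action="list" name="photos" directory="C:\photos"
    filter="*.jpg|*.jpeg" recurse="true" type="file" />

<cfloop query="photos">
    <cfset registerFile(photos.directory & variables.sep & photos.name) />
</cfloop>

<!--- report every key that collected more than one file --->
<cfoutput>
<cfloop collection="#variables.filesByKey#" item="key">
    <cfif ArrayLen(variables.filesByKey[key]) GT 1>
        <p>#HTMLEditFormat(ArrayToList(variables.filesByKey[key], ", "))#</p>
    </cfif>
</cfloop>
</cfoutput>
```

This version reads and hashes every file. If that turns out to be slow, one option is to group by size alone first and only hash the files whose size collides.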

If you want to show an image that lives outside your web root, you can serve it via <cfcontent file="#filename#" type="image/jpeg">
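For example, a small page along these lines could stream the file. Treat it as a sketch under assumptions: showPhoto.cfm, the url.file parameter and the C:\photos\ root are all invented for illustration. The canonical-path check is there so that a request containing ..\ cannot reach files outside the photo folder.

```cfml
<!--- showPhoto.cfm (hypothetical page name): streams one JPEG from outside the web root --->
<cfparam name="url.file" type="string" />

<!--- photoRoot is a placeholder; only files below it may be served --->
<cfset photoRoot = "C:\photos\" />

<!--- resolve any ..\ segments before comparing against the allowed root --->
<cfset requested = CreateObject("java", "java.io.File").init(photoRoot & url.file).getCanonicalPath() />

<cfif Left(requested, Len(photoRoot)) EQ photoRoot AND FileExists(requested)>
    <cfcontent file="#requested#" type="image/jpeg" />
<cfelse>
    <cfheader statuscode="404" statustext="Not Found" />
</cfif>
```

The results page would then point its img tags at something like showPhoto.cfm?file=#URLEncodedFormat(relativePath)#, where relativePath is the file's path relative to photoRoot.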
