Discover new files
I have a network storage device that contains a few hundred thousand MP3 files, organized in an [artist]/[album] hierarchy. I need to identify newly added artist folders and/or newly added album folders programmatically on demand (not monitoring, but by request).
Our dev server is Windows-based, the production server will be FreeBSD. A cross-platform solution is optimal because the production server may not always be *nix, and I'd like to spend as little time on reconciling the (unavoidable) differences between the dev and production server as possible.
I have a working proof-of-concept that is Windows platform-dependent: using a Scripting.FileSystemObject COM object, I iterate through all top-level (artist) directories and check the size of each one. If the size has changed, the directory is explored further to find new album folders. As the directories are iterated, each path and size is collected into an array, which I serialize and write to a file for next time. This array is used on the subsequent call both to identify changed artist directories (a new album was added) and to identify completely new artist directories.
This feels convoluted, and as I mentioned it is platform-dependent. To boil it down, my goals are:
- Identify new top-tier directories
- Identify new second-tier directories
- Identify new loose files within the top-tier directories
Execution time is not a concern here, and security is not an obstacle: this is an internal-only project using only intranet assets, so we can do whatever has to be done to facilitate the desired end result.
Here's my working proof-of-concept:
// read the cached list of artist folders (the file is absent on the first run)
$folder_list_cache_file = 'seartistfolderlist.pctf';
$folder_list_cache = array();
if (is_readable($folder_list_cache_file)) {
    $cached = unserialize(file_get_contents($folder_list_cache_file));
    // ignore an empty or corrupt cache file
    if (is_array($cached))
        $folder_list_cache = $cached;
}
// container arrays
$found_artist_folders = array();
$newly_found_artist_folders = array();
$changed_artist_folders = array();
$filesystem = new COM('Scripting.FileSystemObject');
$dir = "//network_path_to_folders/";
if ($handle = opendir($dir)) {
    // loop the directories
    while (false !== ($file = readdir($handle))) {
        // skip non-entities
        if ($file == '.' || $file == '..')
            continue;
        // make a key-friendly version of the artist name, skip invalids
        // ie 10000-maniacs
        $file_t = trim(post_slug($file));
        if (strlen($file_t) < 1)
            continue;
        // build the full path
        $pth = $dir.$file;
        // skip loose top-level files
        if (!is_dir($pth))
            continue;
        // attempt to get the size of the directory
        $size = 'ERR';
        try {
            $f = $filesystem->getfolder($pth);
            $size = $f->Size();
        } catch (Exception $e) {
            /* failed to get size */
        }
        // if the artist is not known, they are newly added
        if (!array_key_exists($file_t, $folder_list_cache)) {
            $newly_found_artist_folders[$file_t] = $file;
        } elseif (array_key_exists($file_t, $folder_list_cache) && $size != $folder_list_cache[$file_t]['size']) {
            // if the artist is known but the size is different, a new album is added
            $changed_artist_folders[] = $file;
        }
        // build a list of everything, along with file size to write into the cache file
        $found_artist_folders[$file_t] = array (
            'path'=>$file,
            'size'=>$size
        );
    }
    closedir($handle);
}
// write the list to a file for next time
$fh = fopen($folder_list_cache_file, 'w') or die("can't open file");
fwrite($fh, serialize($found_artist_folders));
fclose($fh);
// deal with discovered additions and changes....
Another thing to mention: because these are MP3s, the sizes I'm dealing with are big. So big, in fact, that I have to watch out for PHP's integer size limit (on builds with 32-bit integers, values overflow at about 2 GB). The drive is currently at 90% utilization of 1.7TB (yes, SATA in RAID); a new set of multi-TB drives will be added soon, only to be filled up in short order.
EDIT
I did not mention the database because I thought it would be a needless detail, but there IS a database. This script is seeking new additions to the digital portion of our music library; at the end of the code where it says "deal with discovered additions and changes", it is reading ID3 tags and doing Amazon lookups, then adding the new stuff to a database table. Someone will come along to review the new additions and screen the data, and then it will be added to the "official" database of albums available for play. Many of the songs we're dealing with are by local artists, so the ID3 and Amazon lookups don't give the track titles, album name, etc. In that case, human intervention is critical to fill in the missing data.
The simplest thing on the BSD side is a find script that looks for inodes with a ctime newer than the last time it ran.
Leave a sentinel file somewhere to 'store' the last run time, which you can do with a simple
touch /tmp/find_sentinel
and then
find /top/of/mp3/tree -cnewer /tmp/find_sentinel
which will produce a list of files and directories that have changed since the find_sentinel file was touched. Touch the sentinel again on each run so that the next run only reports later changes. Running this via cron will get you regular updates, and the script doing the find can then digest the returned file data into your database for processing.
You could accomplish something similar on the Windows side with Cygwin, which provides a compatible 'find' command.
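If depending on find or Cygwin is not desirable, the same sentinel idea can be sketched directly in PHP. This is only a rough sketch under some assumptions: the function name find_changed_since and its parameters are made up for illustration, timestamps have one-second resolution, and on Windows PHP reports the creation time as ctime rather than the inode change time, so behavior may differ slightly between the dev and production servers.

```php
<?php
// Hypothetical helper: list every entry under $root whose ctime is newer
// than the sentinel file's mtime, then move the sentinel forward.
function find_changed_since(string $root, string $sentinel): array
{
    clearstatcache();
    // First run: there is no sentinel yet, so everything counts as new.
    $since = is_file($sentinel) ? filemtime($sentinel) : 0;
    // Remember when this scan started; it becomes the next cutoff.
    $started = time();

    $changed = array();
    $walker = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS),
        RecursiveIteratorIterator::SELF_FIRST // report directories too
    );
    foreach ($walker as $entry) {
        if ($entry->getCTime() > $since) {
            $changed[] = $entry->getPathname();
        }
    }

    // Equivalent of "touch /tmp/find_sentinel", stamped with the start time.
    touch($sentinel, $started);
    return $changed;
}
```

A call such as find_changed_since('/top/of/mp3/tree', '/tmp/find_sentinel') would return the changed paths as an array. Because the comparison is strict and works in whole seconds, an entry changed in the same second the previous scan started could be missed; that trade-off is part of this sketch, not of find itself.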
DirectoryIterator will help you walk the filesystem. You should consider putting the information in a database though.
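As a minimal sketch of how DirectoryIterator could cover the three goals from the question (new top-tier directories, new second-tier directories, new loose files inside top-tier directories), the code below takes a snapshot of names two levels deep and diffs it against the previous snapshot. The function names and the serialized cache file are assumptions for illustration; a database table could replace the cache file without changing the comparison.

```php
<?php
// Hypothetical helper: return relative paths for every top-tier directory
// and for every entry (album folder or loose file) directly inside one.
function snapshot_two_levels(string $root): array
{
    $paths = array();
    foreach (new DirectoryIterator($root) as $artist) {
        // skip "." and ".." as well as loose top-level files
        if ($artist->isDot() || !$artist->isDir()) {
            continue;
        }
        $name = $artist->getFilename();
        $paths[] = $name;
        foreach (new DirectoryIterator($artist->getPathname()) as $child) {
            if (!$child->isDot()) {
                $paths[] = $name . '/' . $child->getFilename();
            }
        }
    }
    sort($paths, SORT_STRING);
    return $paths;
}

// Hypothetical helper: compare a fresh snapshot with the one stored in
// $cache_file, store the fresh one, and return only the additions.
function find_new_entries(string $root, string $cache_file): array
{
    $previous = array();
    if (is_readable($cache_file)) {
        $cached = unserialize(file_get_contents($cache_file));
        if (is_array($cached)) {
            $previous = $cached;
        }
    }
    $current = snapshot_two_levels($root);
    file_put_contents($cache_file, serialize($current));
    // names present now but absent last time
    return array_values(array_diff($current, $previous));
}
```

Comparing names instead of directory sizes avoids the COM object entirely and sidesteps the large-integer concern, at the cost of not noticing files that change in place.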
I'd go with a solution that enumerates the contents of each folder in a MySQL database; your scan can quickly check against the contents listed in the database, and add entries that aren't already there. This gives you nice enumeration and searchability of the contents, and should be plenty fast for your needs.