Importing, resizing and uploading millions of images to Amazon S3
We are using PHP with CodeIgniter to import millions of images from hundreds of sources, resizing them locally and then uploading the resized versions to Amazon S3. The process is taking much longer than expected, however, and we're looking for ways to speed things up. The process in more detail:
- A lookup is made in our MySQL database table for images which have not yet been resized. The result is a set of images.
- Each image is imported individually using cURL and temporarily stored on our server during processing. Images are imported locally because the library doesn't allow resizing/cropping of external images. In our tests (200 images per test), the time for the entire process varied between 80 and 140 seconds depending on the external source, so the external source can definitely slow things down.
- The current image is resized using the image_moo library, which creates a copy of the image
- The resized image is uploaded to Amazon S3 using a CodeIgniter S3 library
- The S3 URL for the new resized image is then saved in the database table, before starting with the next image
The process is taking 0.5-1 second per image, meaning all current images would take about a month to resize and upload to S3. The major problem is that we are constantly adding new image sources, and we expect to have at least 30-50 million images before the end of 2011, compared to the current 4 million at the start of May.
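To make those numbers concrete, here is a rough back-of-the-envelope sketch. The function name `estimateDays` is made up for illustration, and it assumes a constant per-image time and perfect scaling across workers, which a real run will not quite reach.

```php
<?php
/**
 * Rough estimate of total processing time in days.
 * Assumes constant time per image and ideal scaling across workers.
 */
function estimateDays(int $images, float $secondsPerImage, int $workers = 1): float
{
    $totalSeconds = $images * $secondsPerImage / $workers;
    return $totalSeconds / 86400; // 86,400 seconds per day
}

// Current backlog of 4 million images, single process
printf("%.1f to %.1f days\n", estimateDays(4000000, 0.5), estimateDays(4000000, 1.0));
// prints "23.1 to 46.3 days"

// Upper end of the expected 30-50 million images, single process
printf("%.1f to %.1f days\n", estimateDays(50000000, 0.5), estimateDays(50000000, 1.0));
// prints "289.4 to 578.7 days"

// Current backlog again, split across 4 parallel workers
printf("%.1f to %.1f days\n", estimateDays(4000000, 0.5, 4), estimateDays(4000000, 1.0, 4));
// prints "5.8 to 11.6 days"
```

So "a month" matches the 23-46 day range for a single process, and the projected volume is out of reach without some form of parallelism.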
I have noticed one answer on Stack Overflow which might be a good complement to our solution, where images are resized and uploaded on the fly. However, since we don't want any unnecessary delay when people visit pages, we need to make certain that as many images as possible are already uploaded. Besides this, we want multiple size formats of the images, and currently only upload the most important one because of this speed issue. Ideally, we would have at least three size formats (for example one thumbnail, one normal and one large) for each imported image.
A few days ago someone suggested making bulk uploads to S3. Any experience of how much time this could save would be helpful.
Replies to any part of the question would be helpful if you have some experience of a similar process.
Part of the code (simplified):
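As a small sketch of the three-format idea: since the download is paid once per image, the extra formats only add resize and upload time. The pixel dimensions and the names `SIZE_FORMATS` and `resizedPaths` below are hypothetical; the filename scheme is the one used in the code further down.

```php
<?php
// Hypothetical dimensions; the question only names "thumbnail, normal, large".
const SIZE_FORMATS = [
    'thumbnail' => [100, 100],
    'normal'    => [400, 400],
    'large'     => [1000, 1000],
];

/**
 * Build one output path per size format, following the
 * "<picloc>-<width>x<height>.jpg" naming scheme.
 */
function resizedPaths(string $picloc, array $formats = SIZE_FORMATS): array
{
    $paths = [];
    foreach ($formats as $name => [$width, $height]) {
        $paths[$name] = $picloc . '-' . $width . 'x' . $height . '.jpg';
    }
    return $paths;
}

// The source image is downloaded once; each path is then produced from the
// same local file, e.g. with the existing image_moo call per format:
//   $this->image_moo->load($picloc.'.jpg')->resize($width, $height, TRUE)->save($path, 'jpg');
print_r(resizedPaths('./upload/example'));
```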
    // Target filename for the resized copy
    $newpic = $picloc.'-'.$width.'x'.$height.'.jpg';
    $pic = $this->image_moo
        ->load($picloc.'.jpg')
        ->resize($width, $height, TRUE)
        ->save($newpic, 'jpg');
    if ($this->image_moo->errors) {
        // Do stuff if something goes wrong, for example if the image no longer exists.
        // This doesn't happen very often, so it is not a great concern.
    }
    else {
        if (S3::putObject(
            S3::inputFile($newpic),
            'someplace',
            str_replace('./upload/', '', $newpic),
            S3::ACL_PUBLIC_READ,
            array(),
            array(
                "Content-Type" => "image/jpeg",
            )))
        {
            // Save URL to resized image in database, unlink files etc., then start next image
        }
    }
Why not add some wrapping logic that lets you define ranges or groups of images, and then run the script several times on the server? If you can have four of these processes running at the same time on different sets of images, it should finish roughly four times faster, as long as the server's CPU and bandwidth can keep up.
If you're stuck trying to get through a really big backlog at the moment you could look at spinning up some Amazon EC2 instances and using them to further parallelize the process.
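A minimal sketch of the "ranges or groups" idea, splitting work by ID modulo the number of workers. The table and column names (`images`, `source_url`, `resized_url`) and the script name are assumptions, not taken from the question; the same split works whether the workers run on one server or on several EC2 instances.

```php
<?php
/**
 * Decide whether this worker owns a given image ID.
 * Worker k of n handles the IDs where id % n == k.
 */
function ownsImage(int $imageId, int $workerIndex, int $workerCount): bool
{
    return $imageId % $workerCount === $workerIndex;
}

/**
 * The same rule as a MySQL condition, so each worker only selects its own rows.
 * Table and column names are hypothetical.
 */
function shardQuery(int $workerIndex, int $workerCount, int $limit = 200): string
{
    return sprintf(
        'SELECT id, source_url FROM images'
        . ' WHERE resized_url IS NULL AND MOD(id, %d) = %d LIMIT %d',
        $workerCount,
        $workerIndex,
        $limit
    );
}

// Usage from the command line: php resize_worker.php 0 4   (worker 0 of 4)
$workerIndex = (int) ($argv[1] ?? 0);
$workerCount = max(1, (int) ($argv[2] ?? 1));
echo shardQuery($workerIndex, $workerCount), "\n";
```

Because every ID maps to exactly one worker, no two processes pick up the same image and no locking is needed between them.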
I suggest you split your script into 2 scripts which run concurrently. One would fetch remote images to a local directory, doing so for all images that have not yet been processed or cached locally. Since the remote sources add a fair bit of delay to your requests, you will benefit from fetching remote images constantly rather than only as you process each one.
Concurrently, you use a second script to resize any locally cached images and upload them to Amazon S3. Alternatively, you can split this part of the process as well, using one script for resizing to a local file and another for uploading any resized files to S3.
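One way the second script could pick up locally cached files is to treat a directory as a queue and claim each file by renaming it. This is only a sketch: the function names and directory layout are made up, and it relies on `rename()` being atomic within a single filesystem (true on POSIX systems), so two workers cannot both claim the same file.

```php
<?php
/**
 * Claim the next cached image by moving it into this worker's directory.
 * Returns the claimed path, or null when the queue is empty.
 */
function claimNext(string $queueDir, string $workDir): ?string
{
    foreach (glob($queueDir . '/*.jpg') ?: [] as $path) {
        $claimed = $workDir . '/' . basename($path);
        if (@rename($path, $claimed)) {
            return $claimed;
        }
        // Another worker claimed this file first; try the next one.
    }
    return null;
}

/**
 * Process every file currently in the queue.
 * $process stands in for the resize + S3 upload + database update step.
 * Returns the number of files handled.
 */
function drainQueue(string $queueDir, string $workDir, callable $process): int
{
    $done = 0;
    while (($path = claimNext($queueDir, $workDir)) !== null) {
        $process($path);
        unlink($path); // remove the local copy once it has been handled
        $done++;
    }
    return $done;
}
```

The fetch script should download to a temporary name and rename the file into the queue directory only when the download is complete, so the worker never sees a half-written image.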
The first part (fetch remote source image) would greatly benefit from running multiple concurrent instances like James C suggests above.
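As a complement to running several instances, a single fetch script can also download a batch of images in parallel with PHP's `curl_multi_*` functions. A hedged sketch, where `fetchConcurrently` and the shape of `$jobs` are assumptions for illustration:

```php
<?php
/**
 * Download several URLs concurrently.
 * $jobs maps a local destination path to a source URL.
 * Returns the destination paths that were fetched successfully (HTTP 200).
 */
function fetchConcurrently(array $jobs, int $timeoutSeconds = 30): array
{
    $multi = curl_multi_init();
    $handles = [];
    foreach ($jobs as $destPath => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
        curl_multi_add_handle($multi, $ch);
        $handles[$destPath] = $ch;
    }

    // Drive all transfers until none are still running.
    $running = 0;
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running && curl_multi_select($multi, 1.0) === -1) {
            usleep(10000); // select failed; back off briefly to avoid a busy loop
        }
    } while ($running && $status === CURLM_OK);

    // Collect the handles whose transfer ended with a cURL error.
    $failed = [];
    while (($info = curl_multi_info_read($multi)) !== false) {
        if ($info['result'] !== CURLE_OK) {
            $failed[] = $info['handle'];
        }
    }

    $saved = [];
    foreach ($handles as $destPath => $ch) {
        $ok = !in_array($ch, $failed, true)
            && curl_getinfo($ch, CURLINFO_HTTP_CODE) === 200;
        if ($ok) {
            file_put_contents($destPath, curl_multi_getcontent($ch));
            $saved[] = $destPath;
        }
        curl_multi_remove_handle($multi, $ch);
    }
    return $saved; // handles are released when they go out of scope
}
```

Each response body is held in memory until it is written, so it is probably best to call this with modest batches (a few dozen URLs at a time) rather than thousands.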