Fetch an HTML page and store it in MySQL: how to

  • What's the best way to store a formatted HTML page with CSS in a MySQL database? Is it possible?
  • What should the column type be? How do I retrieve the stored formatted HTML and display it correctly using PHP?

  • What if the page I would like to fetch has pictures and videos? Should I store the page as a blob?

  • What's the best way to fetch a page using PHP: cURL, fopen, ...?

Many questions, guys, but I really need your help to put me on the right track.

Thanks a lot.


Quite simple; try this code I wrote for you.

It covers the basics of grabbing the source and saving it in a DB.

I didn't include error handling or anything else, to keep it simple for the moment...

I didn't write a function to show the result, but you can print $source to view it.

Hope this helps.

<?php

function GetPage($URL)
{
    #Get the source content of the URL
    $source = file_get_contents($URL);

    #Extract the base URL (scheme and host) from the current one
    $scheme = parse_url($URL, PHP_URL_SCHEME); //Ex: http
    $host = parse_url($URL, PHP_URL_HOST); //Ex: www.google.com
    $raw_url = $scheme . '://' . $host; //Ex: http://www.google.com

    #Replace root-relative links with absolute ones
    $relative = array();
    $absolute = array();

    #Patterns to search for; (?!\/) skips protocol-relative URLs such as src="//cdn.example.com/x.js"
    $relative[0] = '/src="\/(?!\/)/';
    $relative[1] = '/href="\/(?!\/)/';

    #Strings to replace them with
    $absolute[0] = 'src="' . $raw_url . '/';
    $absolute[1] = 'href="' . $raw_url . '/';

    $source = preg_replace($relative, $absolute, $source); //Ex: src="/image/google.png" to src="http://www.google.com/image/google.png"

    return $source;
}

function SaveToDB($source)
{
    #Connect to the DB and select the database name
    #(the old mysql_* functions were removed in PHP 7, so this uses mysqli)
    $db = mysqli_connect('localhost', 'root', '', 'test');

    #Ask for UTF-8 encoding (utf8mb4 covers the full Unicode range)
    mysqli_set_charset($db, 'utf8mb4');

    #Prepare the query; the bound parameter needs no manual escaping
    $stmt = mysqli_prepare($db, 'INSERT INTO website (source) VALUES (?)'); //Save it in a text column, that's it...
    mysqli_stmt_bind_param($stmt, 's', $source);

    #Run the query
    mysqli_stmt_execute($stmt);

    #Close the statement and the connection
    mysqli_stmt_close($stmt);
    mysqli_close($db);
}

$source = GetPage('http://www.google.com');

SaveToDB($source);

?>
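
The code above leaves out the function that shows the result. A minimal sketch of what it might look like follows. It assumes the `website` table also has an auto-increment `id` column (not part of the original code) and that `source` is a MEDIUMTEXT column, since a plain TEXT column holds at most 65,535 bytes and many pages are larger. `LoadFromDB` is a hypothetical name.

```php
<?php

# Hypothetical schema assumed by this sketch:
#   CREATE TABLE website (
#       id INT AUTO_INCREMENT PRIMARY KEY,
#       source MEDIUMTEXT NOT NULL
#   ) CHARACTER SET utf8mb4;

# Fetch one stored page by id; returns the HTML string, or null if no row matches.
function LoadFromDB(mysqli $db, int $id): ?string
{
    $stmt = mysqli_prepare($db, 'SELECT source FROM website WHERE id = ?');
    mysqli_stmt_bind_param($stmt, 'i', $id);
    mysqli_stmt_execute($stmt);

    # Bind the result column to $source, then fetch the single row
    mysqli_stmt_bind_result($stmt, $source);
    $found = mysqli_stmt_fetch($stmt);
    mysqli_stmt_close($stmt);

    return $found ? $source : null;
}

# Usage (connection details are placeholders):
#   $db = mysqli_connect('localhost', 'root', '', 'test');
#   mysqli_set_charset($db, 'utf8mb4');
#   echo LoadFromDB($db, 1); // Output the HTML unescaped so the browser renders it
```

Echoing the stored HTML unescaped is what makes the browser render it, but it also runs any scripts the saved page contains, so it is only appropriate for pages you trust.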


Pull down the whole page using fopen and parse out any URLs (such as images and CSS). You'll want to run a loop to grab each of the files that the page depends on. Store these as well, and replace the URLs that pointed to the other site's files with links to your stored copies. (This avoids any issues if the files change or are removed in the future.)
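
The "parse out any URLs" step could be sketched as follows, using PHP's DOM extension rather than string matching. `ExtractAssetUrls` is a hypothetical helper name, and limiting the search to `img`, `script`, and stylesheet `link` tags is an assumption; a real page may reference files in other ways (CSS `url(...)`, `srcset`, and so on).

```php
<?php

# Hypothetical helper: collect the URLs of images, scripts, and stylesheets a page references.
function ExtractAssetUrls(string $html): array
{
    if ($html === '') {
        return array(); # DOMDocument::loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();
    # Real-world HTML is rarely valid; collect parser warnings instead of printing them
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = array();
    # Tag name => attribute that holds the file URL
    $targets = array('img' => 'src', 'script' => 'src', 'link' => 'href');
    foreach ($targets as $tag => $attr) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            # Keep only stylesheet links, not icons or canonical links
            if ($tag === 'link' && strtolower($node->getAttribute('rel')) !== 'stylesheet') {
                continue;
            }
            $url = $node->getAttribute($attr);
            if ($url !== '') {
                $urls[] = $url;
            }
        }
    }

    # Drop duplicates and reindex from 0
    return array_values(array_unique($urls));
}

$html = '<html><head><link rel="stylesheet" href="/a.css"></head>'
      . '<body><img src="/i.png"><img src="/i.png"></body></html>';
print_r(ExtractAssetUrls($html)); # lists /i.png and /a.css once each
```

Each returned URL can then be fetched and stored in the loop the answer describes.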

I'd recommend using a blob datatype, simply because it would allow you to store all the files in one table, but you could use one table with a text datatype for the pages and another with a blob datatype for images and other files.

Edit: If you are storing as a blob datatype, look into base64_encode(). It will increase the storage footprint on the server by roughly a third, but you'll avoid any issues with quotes and special characters.
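
To make the base64 trade-off concrete, here is a small sketch. The 300 random bytes are only a stand-in for a downloaded image.

```php
<?php

# Stand-in for the bytes of a downloaded image; any binary string works
$binary = random_bytes(300);

# base64 output uses only A-Z, a-z, 0-9, '+', '/' and '=', so no quotes or control bytes remain
$encoded = base64_encode($binary);

# Every 3 input bytes become 4 output characters
printf("%d bytes -> %d characters\n", strlen($binary), strlen($encoded)); # 300 bytes -> 400 characters

# Decoding in strict mode restores the original bytes exactly
$decoded = base64_decode($encoded, true);
var_dump($decoded === $binary); # bool(true)
```

If the insert goes through a prepared statement with bound parameters, the driver takes care of quoting, so the encoding step becomes optional rather than required.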


Don't use a relational database to store files. Use a filesystem or a NoSQL solution.

You might want to look into the various open source spiders that are available (htdig and httrack come to mind).


I'd store the URLs in a database and set up a cron job to wget the pages regularly, storing them in their own keyed local directories. Using wget will allow you to cache the page and, optionally, its images, scripts, etc. You can also have your wget command rewrite the embedded URLs so that you don't have to cache everything.

Here is the man page for wget; you may also consider searching for "wget backup website" or similar.

(By "keyed directories" I mean that your database table would have two fields, a 'key' and a 'url'. The [unique] 'key' would then be the path that wget archives the website to.)
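
As a rough sketch of the keyed-directory idea, the cron script could build one wget command per table row. `BuildWgetCommand`, the `/var/archive` base directory, and the sample row are all hypothetical; `--page-requisites` (also fetch images, scripts, and stylesheets), `--convert-links` (rewrite embedded URLs), and `--directory-prefix` are standard wget options.

```php
<?php

# Hypothetical helper: build the wget command that archives $url into $baseDir/$key.
function BuildWgetCommand(string $url, string $key, string $baseDir): string
{
    # Restrict keys to a safe character set so a key can never escape $baseDir
    if (!preg_match('/^[A-Za-z0-9_-]+$/', $key)) {
        throw new InvalidArgumentException('Invalid key: ' . $key);
    }

    $target = rtrim($baseDir, '/') . '/' . $key;

    # escapeshellarg() quotes each value so it reaches wget as a single argument
    return 'wget --page-requisites --convert-links'
        . ' --directory-prefix=' . escapeshellarg($target)
        . ' ' . escapeshellarg($url);
}

# Rows as they might come back from the (key, url) table
$sites = array(
    array('key' => 'google', 'url' => 'http://www.google.com'),
);

foreach ($sites as $site) {
    # A cron script would pass this string to exec(); here it is only printed
    echo BuildWgetCommand($site['url'], $site['key'], '/var/archive'), "\n";
}
```

Validating the key before using it as a path component matters here because the key ends up on a shell command line and in the filesystem.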


You can store the data in a text datatype column in MySQL, but you have to escape or encode the data first, because the page may contain many quotes and special characters.
See this question: it's not an exact match for yours, but it will help when you store the data in the database.
As for the images and videos: if you are storing the page content, it will contain only the paths to those images and videos, so storing it in the database won't cause any problems.
