
How to create a Python script that grabs text from one site and reposts it to another?

I would like to create a Python script that grabs digits of Pi from this site: http://www.piday.org/million.php and reposts them to this site: http://www.freelove-forum.com/index.php. I am NOT spamming or playing a prank; it is an inside joke with the creator and webmaster, a belated Pi Day celebration, if you will.


Import urllib2 and BeautifulSoup

import urllib2
from BeautifulSoup import BeautifulSoup

Specify the URL and fetch the page using urllib2:

url = 'http://www.piday.org/million.php'
response = urllib2.urlopen(url)

Then use BeautifulSoup, which parses the tags in the page into a tree. You can query that tree by tag name to extract the data you want.

soup = BeautifulSoup(response)

pi = soup.findAll('TAG')

where 'TAG' is the tag that encloses the digits of Pi. findAll returns a list of the matching tags.

Build the HTML you want to output:

out = '<html><body>' + ''.join(str(tag) for tag in pi) + '</body></html>'

You can then write this to an HTML file that you serve, using Python's built-in file operations.

f = open('file.html', 'w')
f.write(out)
f.close()

You then serve the file 'file.html' using your webserver.

If you don't want to use BeautifulSoup, you could use re and urllib, but the result is not as 'pretty' as with BeautifulSoup.
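As a rough sketch of the re route: this assumes Python 3 (where urllib2's urlopen lives in urllib.request) and assumes the digits show up in the page as a single run starting with "3.". The helper names extract_pi and fetch_pi and the sample string are made up for illustration.

```python
import re
from urllib.request import urlopen  # Python 3 home of urllib2.urlopen


def extract_pi(html):
    """Return the first '3.' followed by digits found in html, or None."""
    # Drop tags such as <br /> and all whitespace so the digits form one run.
    text = re.sub(r'<[^>]+>|\s+', '', html)
    match = re.search(r'3\.\d+', text)
    return match.group(0) if match else None


def fetch_pi(url):
    """Download url and extract the digits (needs network access)."""
    with urlopen(url) as response:
        html = response.read().decode('utf-8', errors='replace')
    return extract_pi(html)


# Made-up sample standing in for the real page.
sample = '<html><body><p>3.14159<br />\n26535</p></body></html>'
print(extract_pi(sample))
```

A regex like this is brittle: any other "3." followed by digits earlier in the page would match first, which is part of why a real parser is the nicer option.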


When you submit a post, the browser sends a POST request to the server. Look at the form on your site:

<form action="enter.php" method="post">
  <textarea name="post">Enter text here</textarea> 
</form>

You are going to send a POST request with a parameter named post (bad naming, IMO) whose value is your text.
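To make the request body concrete, here is a minimal sketch of how that post parameter gets encoded. It assumes Python 3 (urllib.parse and urllib.request) rather than the Python 2 modules, and the helper name build_post_body is hypothetical. The target URL is just the form's enter.php action resolved against the forum's address. Nothing is sent over the network here.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_post_body(text):
    """Encode text as the form's 'post' field, ready to send as bytes."""
    # 'post' is the name attribute of the <textarea> in the form.
    return urlencode({'post': text}).encode('ascii')


body = build_post_body('3.14159 26535')
print(body)

# Attaching data to a Request turns it into a POST; this only builds
# the request object and does not contact the server.
request = Request('http://www.freelove-forum.com/enter.php', data=body)
print(request.get_method())
```

Passing the request to urllib.request.urlopen would actually submit it, so leave that step out until you are sure the body looks right.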

As for the site you are grabbing from, if you look at the source code, the Pi is actually inside of an <iframe> with this URL:

 http://www.piday.org/includes/pi_to_1million_digits_v2.html

Looking at that source code, you can see that the page is just a single <p> tag directly descending from a <body> tag (the site doesn't have the <!DOCTYPE>, but I'll include one):

<!DOCTYPE html>

<html>
  <head>
    ...
  </head>

  <body>
    <p>3.1415926535897932384...</p>
  </body>
</html>

HTML is a markup language closely related to XML, so you will need a parser to pull the text out of the webpage. I use BeautifulSoup, as it copes very well with malformed or invalid markup, and even better with perfectly valid HTML.

To download the actual page, which you would feed into the parser, you can use Python's built-in urllib (urllib2 works the same way here). For the POST request, I'd use Python's standard httplib.
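If BeautifulSoup is not installed, the same "take the text of the first <p>" idea can be sketched with the standard library alone. This is not the approach used in this answer: it assumes Python 3 and its html.parser module, and the class name FirstParagraph, the function first_paragraph_text, and the sample markup are invented for illustration.

```python
from html.parser import HTMLParser


class FirstParagraph(HTMLParser):
    """Collects the text inside the first <p> element only."""

    def __init__(self):
        super().__init__()
        self.in_p = False   # currently inside the first <p>
        self.done = False   # first <p> has already been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'p' and not self.done:
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == 'p' and self.in_p:
            self.in_p = False
            self.done = True

    def handle_data(self, data):
        if self.in_p:
            self.chunks.append(data)


def first_paragraph_text(html):
    parser = FirstParagraph()
    parser.feed(html)
    parser.close()
    # Join the chunks and strip line breaks and other whitespace.
    return ''.join(''.join(parser.chunks).split())


# Made-up sample shaped like the page described above.
sample = '<html><body><p>3.14159<br />\n26535</p><p>ignored</p></body></html>'
print(first_paragraph_text(sample))
```

Tags nested in the paragraph, such as <br />, are skipped because only text data is collected.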

So a complete example would be this:

import urllib, httplib
from BeautifulSoup import BeautifulSoup

# Downloads and parses the webpage with Pi
page = urllib.urlopen('http://www.piday.org/includes/pi_to_1million_digits_v2.html')
soup = BeautifulSoup(page)

# Extracts the Pi. There's only one <p> tag, so just select the first one
pi_list = soup.findAll('p')[0].contents
pi = ''.join(str(s).replace('\n', '') for s in pi_list).replace('<br />', '')

# Creates the POST request's body. Still bad object naming on the creator's part...
parameters = urllib.urlencode({'post':      pi, 
                               'name':      'spammer',
                               'post_type': 'confession',
                               'school':    'all'})

# Crafts the POST request's header.
headers = {'Content-type': 'application/x-www-form-urlencoded',
           'Accept':       'text/plain'}

# Creates the connection to the website
connection = httplib.HTTPConnection('freelove-forum.com:80')
connection.request('POST', '/enter.php', parameters, headers)

# Sends it out and gets the response
response = connection.getresponse()
print response.status, response.reason

# Reads the response body and closes the connection
data = response.read()
connection.close()

But if you are using this for a malicious purpose, do know that the server logs all IP addresses.


You could use the urllib2 module, which comes with any Python distribution.

It lets you open a URL as if you were opening a file on the filesystem, so you can fetch the Pi data with:

pi_million_file = urllib2.urlopen("http://www.piday.org/million.php")

Then parse the resulting file, which contains the HTML code of the webpage you see in your browser.

Finally, POST the digits of Pi to the right URL on your website.
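To illustrate the "like opening a file" point, here is a small sketch that reads from any binary file-like object. The object urlopen returns supports read() in the same way. The function name read_in_chunks is made up, the code assumes Python 3, and an io.BytesIO object stands in for the network response so the example runs offline.

```python
import io


def read_in_chunks(fileobj, chunk_size=4096):
    """Read a binary file-like object to the end and return decoded text."""
    parts = []
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:  # an empty bytes object means end of data
            break
        parts.append(chunk)
    return b''.join(parts).decode('utf-8', errors='replace')


# Stand-in for the object urlopen() would return.
fake_response = io.BytesIO(b'<p>3.14159</p>')
print(read_in_chunks(fake_response, chunk_size=4))
```

Reading in chunks is optional for a page this size; a single read() call would also work.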
