How to create a Python script that grabs text from one site and reposts it to another?

2023-03-08 02:22 Q&A author:

I would like to create a Python script that grabs digits of Pi from this site: http://www.piday.org/million.php and reposts them to this site: http://www.freelove-forum.com/index.php. I am NOT spamming or playing a prank; it is an inside joke with the creator and webmaster, a belated Pi Day celebration, if you will.

Import urllib2 and BeautifulSoup

import urllib2
from BeautifulSoup import BeautifulSoup

Specify the URL and fetch it using urllib2

url = 'http://www.piday.org/million.php'
response = urllib2.urlopen(url)

Then use BeautifulSoup, which parses the page's tags into a tree that you can query by tag name to extract the data you want.

soup = BeautifulSoup(response)

pi = soup.findAll('TAG')

where 'TAG' is the relevant tag you want to find that identifies where pi is.

Specify what you want to print out

# findAll returns a list of tags, so join them into one string
out = '<html><body>' + ''.join(str(tag) for tag in pi) + '</body></html>'

You can then write this to an HTML file that you serve, using Python's built-in file operations.

f = open('file.html', 'w')
f.write(out)
f.close()

You then serve the file 'file.html' using your webserver.

If you don't want to use BeautifulSoup you could use re and urllib, but it is not as 'pretty' as BeautifulSoup.
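
As a rough sketch of the re approach, assuming Python 3 and assuming the digits appear in the page source as one run starting with "3.", possibly broken up by whitespace or <br /> tags. The helper name extract_pi_digits and the sample string are made up for illustration:

```python
import re

def extract_pi_digits(html):
    """Return the first run of digits starting with '3.' found in html, or None."""
    # Drop <br> tags and all whitespace so the digits form one unbroken run.
    compact = re.sub(r'<br\s*/?>|\s+', '', html, flags=re.IGNORECASE)
    match = re.search(r'3\.\d+', compact)
    return match.group(0) if match else None

# Illustrative input; the real page source would come from urlopen().
sample = '<p>3.14159<br />\n26535</p>'
print(extract_pi_digits(sample))
```

This is cruder than a real parser: it would also match any other "3." followed by digits that happens to come first in the page, so it only suits pages as simple as this one.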

When you submit a post on the forum, the browser sends a POST request to the server. Look at the form in your site's source:

<form action="enter.php" method="post">
  <textarea name="post">Enter text here</textarea> 
</form>

You are going to send a POST request with a parameter named post (bad naming, IMO) that holds your text.
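
To illustrate, here is a minimal sketch of how such a request could be prepared, assuming Python 3's urllib. The field name post and the path enter.php come from the form above; the host in FORUM_URL is taken from the question, and build_post_request is a hypothetical helper. Nothing is sent until the request object is passed to urllib.request.urlopen.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: enter.php lives at the root of the forum named in the question.
FORUM_URL = 'http://www.freelove-forum.com/enter.php'

def build_post_request(text, url=FORUM_URL):
    """Build (but do not send) a form-encoded POST carrying text in the 'post' field."""
    body = urlencode({'post': text}).encode('ascii')
    headers = {'Content-Type': 'application/x-www-form-urlencoded'}
    return Request(url, data=body, headers=headers, method='POST')

request = build_post_request('3.14159')
print(request.get_method(), request.data)
```

If the form has other required fields, they would go into the same dictionary before encoding.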

As for the site you are grabbing from, if you look at the source code, the Pi is actually inside of an <iframe> with this URL:

 http://www.piday.org/includes/pi_to_1million_digits_v2.html

Looking at that source code, you can see that the page is just a single <p> tag directly descending from a <body> tag (the site doesn't have the <!DOCTYPE>, but I'll include one):

<!DOCTYPE html>

<html>
  <head>
    ...
  </head>

  <body>
    <p>3.1415926535897932384...</p>
  </body>
</html>
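
For a page shaped like the one above, the text of the first <p> could also be pulled out with the standard library alone. This is only a sketch, assuming Python 3's html.parser in place of BeautifulSoup; the names FirstParagraph and first_paragraph_text, and the sample string, are illustrative.

```python
from html.parser import HTMLParser

class FirstParagraph(HTMLParser):
    """Collects the text inside the first <p> element it encounters."""

    def __init__(self):
        super().__init__()
        self.inside = False   # True while between <p> and </p>
        self.done = False     # True once the first <p> has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'p' and not self.done:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == 'p' and self.inside:
            self.inside = False
            self.done = True

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)

def first_paragraph_text(html):
    """Return the text of the first <p> in html with all whitespace removed."""
    parser = FirstParagraph()
    parser.feed(html)
    parser.close()
    # Remove whitespace so line breaks in the source do not split the digits.
    return ''.join(''.join(parser.chunks).split())

sample = '<html><body><p>3.1415\n9265<br />3589</p><p>ignored</p></body></html>'
print(first_paragraph_text(sample))
```

Tags nested inside the paragraph, such as <br />, are skipped and only their surrounding text is kept.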

HTML is a markup language closely related to XML (though not strictly a form of it), so you need a parser to extract data from the page. I use BeautifulSoup, as it copes very well with malformed or invalid markup, and works even better with perfectly valid HTML.

To download the actual page, which you would feed into the parser, you can use Python's built-in urllib. For the POST request, I'd use Python's standard httplib.

So a complete example would be this:

import urllib, httplib
from BeautifulSoup import BeautifulSoup

# Downloads and parses the webpage with Pi
page = urllib.urlopen('http://www.piday.org/includes/pi_to_1million_digits_v2.html')
soup = BeautifulSoup(page)

# Extracts the Pi. There's only one <p> tag, so just select the first one
pi_list = soup.findAll('p')[0].contents
pi = ''.join(str(s).replace('\n', '') for s in pi_list).replace('<br />', '')

# Creates the POST request's body. Still bad object naming on the creator's part...
parameters = urllib.urlencode({'post':      pi, 
                               'name':      'spammer',
                               'post_type': 'confession',
                               'school':    'all'})

# Crafts the POST request's header.
headers = {'Content-type': 'application/x-www-form-urlencoded',
           'Accept':       'text/plain'}

# Opens the connection to the website and sends the POST request
connection = httplib.HTTPConnection('freelove-forum.com:80')
connection.request('POST', '/enter.php', parameters, headers)

# Gets the server's response
response = connection.getresponse()
print response.status, response.reason

# Reads the response body and closes the connection
data = response.read()
connection.close()

But if you are using this for a malicious purpose, do know that the server logs all IP addresses.

You could use the urllib2 module, which comes with any Python distribution.

It allows you to open a URL as if you were opening a file on the filesystem. So you can fetch the Pi data with

pi_million_file = urllib2.urlopen("http://www.piday.org/million.php")

Then parse the resulting file object, which holds the HTML source of the webpage you see in your browser.

Finally, POST the Pi digits to the correct URL on your website.
