Inexpensive ways to add seek to a filetype object

2022-12-27 07:32

PdfFileReader reads the content from a pdf file to create an object.

I am fetching the PDF from a CDN via urllib.urlopen(), which gives me a file-like object that has no seek(). PdfFileReader, however, uses seek().

What is a simple way to create a PdfFileReader object from a PDF downloaded via URL?

In particular, how can I avoid writing it to disk and reading it back in via file()?

Thanks in advance.

There isn't really an inexpensive, ready-to-use way to do this. The simplest way is to read all the data and put it into a StringIO object (io.BytesIO in Python 3). That does, however, require that you read everything first, which may or may not be what you want.

If you want something that only reads as necessary, and then stores what was read (or perhaps just a portion of what was read) then you will have to write it yourself. You may want to see the source for the StringIO module (or the io module, in Python 2.6) for some examples.
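A minimal sketch of such a read-as-needed wrapper, written for Python 3, might look like the following. The class name LazySeekableReader and its methods are hypothetical (not from any library), and it assumes the wrapped stream only offers read(n). It caches everything it has pulled so far, so memory use still grows with how far you read.

```python
import io


class LazySeekableReader:
    """Add seek()/tell() to a read-only stream by caching bytes as they are read.

    Hypothetical helper for illustration; not part of any library.
    """

    def __init__(self, raw):
        self._raw = raw             # underlying stream; only .read(n) is required
        self._cache = io.BytesIO()  # every byte pulled from raw so far
        self._pos = 0               # logical position seen by the caller
        self._eof = False           # True once raw has been exhausted

    def _fill_to(self, target):
        """Pull from raw until the cache holds `target` bytes (None means all)."""
        self._cache.seek(0, io.SEEK_END)
        while not self._eof:
            have = self._cache.tell()
            if target is not None and have >= target:
                break
            want = 64 * 1024 if target is None else target - have
            chunk = self._raw.read(want)
            if chunk:
                self._cache.write(chunk)
            else:
                self._eof = True

    def read(self, size=-1):
        read_all = size is None or size < 0
        self._fill_to(None if read_all else self._pos + size)
        self._cache.seek(self._pos)
        data = self._cache.read() if read_all else self._cache.read(size)
        self._pos += len(data)
        return data

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            target = offset
        elif whence == io.SEEK_CUR:
            target = self._pos + offset
        elif whence == io.SEEK_END:
            # The end is unknown until the whole stream has been read.
            self._fill_to(None)
            target = self._cache.seek(0, io.SEEK_END) + offset
        else:
            raise ValueError("invalid whence: %r" % (whence,))
        if target < 0:
            raise ValueError("negative seek position")
        self._pos = target
        return self._pos

    def tell(self):
        return self._pos
```

Note that seeking relative to the end forces the whole stream to be read. PDF parsers generally start from the end of the file, so for this particular use case the lazy wrapper will probably end up reading everything anyway.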

You could use the .read() method to read in the entire contents of the file, and then create your own file-like object (most likely via StringIO) to provide access to it.

I suspect you may be optimising prematurely here.

Most modern systems will cache files in memory for a significant period of time before they flush them to disk, so if you write the data to a temporary file, read it back in, then close and delete the file, you may find that there's no significant disk traffic (unless it really is 100MB).

You might want to look at using tempfile.TemporaryFile() which creates a temporary file that is automatically deleted when closed, or else tempfile.SpooledTemporaryFile() which explicitly holds it all in memory until it exceeds a particular size.
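As a rough sketch of the SpooledTemporaryFile approach in Python 3: the helper name spool_to_seekable and the 10 MB threshold are assumptions chosen for illustration, not anything the tempfile module prescribes.

```python
import shutil
import tempfile


def spool_to_seekable(stream, max_memory=10 * 1024 * 1024):
    """Copy a non-seekable stream into a SpooledTemporaryFile and rewind it.

    Hypothetical helper; max_memory is an arbitrary threshold in bytes.
    Data stays in memory until it exceeds max_memory, then spills to disk.
    """
    spooled = tempfile.SpooledTemporaryFile(max_size=max_memory)
    shutil.copyfileobj(stream, spooled)  # reads the stream in chunks
    spooled.seek(0)                      # rewind so the caller reads from the start
    return spooled


# Possible usage (the URL is a placeholder):
#
#   import urllib.request
#   with urllib.request.urlopen("https://example.com/some.pdf") as response:
#       pdf_file = spool_to_seekable(response)
#   # pdf_file now supports read(), seek() and tell(); close it when done.
```

The returned file is deleted automatically when it is closed, so nothing needs to be cleaned up by hand.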

Use io.BytesIO as shown in the docs. Slightly adapted:

from io import BytesIO
import urllib.request

import pypdf

url = "https://wiso.uni-hohenheim.de/fileadmin/einrichtungen/wiso/PDF/Lehre/Anleitung_zum_OEffnen_von_PDF-Formularen.pdf"
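# Download the whole PDF into memory; the response is closed afterwards
with urllib.request.urlopen(url) as response:
    data = response.read()

# Create a PDF reader object from a seekable in-memory buffer
reader = pypdf.PdfReader(BytesIO(data))

# Print the number of pages in the PDF file
print(len(reader.pages))

# Get the first page
page = reader.pages[0]

# Extract and print the text of that page
print(page.extract_text())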
