
How do I split an HTML string into shorter HTML strings in Python? (added some interesting stuff)

I'm having a really hard time with this one.

EDIT: I'm putting this edit at the top; if anyone wants to read the full problem below, you are very welcome. I have started to solve this really hard issue but have run into a new problem. My approach is to return the whole long HTML page divided into paragraphs ("p" tags). Up to this point everything works: when I put assert False in the view, I get everything as I want it. Then, in the template, I loop over the list I sent in the response, and for each value (a paragraph) I create a div (a page in the book), for now. Here is the problem: I am getting every paragraph three times! Code below...

Output of the assert (part of it):
<p style="text-align: center;">
<span style="font-size:24px;"><strong><u>The Ten Foot Stop</u></strong></span></p>,
<p  style="margin-bottom: 0.2in; text-align: center;">
<span style="font-size:18px;"><font style="font-size: 7pt;">NEWS AND OCCASIONAL ITEMS 
ABOUT THE MEDICAL ASPECTS OF SCUBA DIVING.<br />
POSTED BY ERN CAMPBELL, MD</font></span></p>

template:
{% for article_page in article_pages %}
    {% if article_page %} <!-- don't show an empty paragraph -->
       {{ article_page|safe }}
    {% endif %}
{% endfor %}

This is what shows in the page:
[The Ten Foot Stop, The Ten Foot Stop, The Ten Foot Stop]
<!-- first paragraph has: The Ten Foot Stop -->

From here on is my original post with the full description of the issue: I have a very long HTML-like string (no head or body, but it has tags, styles, img tags and everything else in it), and I need to split it into smaller strings by number of words. The strings need to fit into divs of a certain size: say every 165 words, more or less, or, even better, fit a certain height so they fit the div size, but I think the second is much more complicated.

The problem I am having, after trying everything including BeautifulSoup and other methods, is that I can't find a way to split the string while keeping the tags safe. If I have a style tag, for example, and the tag starts at character 160 and runs to character 170, the second page (div) will treat the styles as a regular string. As far as I saw, BeautifulSoup only closes "bad" tags; it doesn't open the tags for the "bad" text in the second, third and following divs.

I thought about using truncate_html_words from text.py, but as the name implies, it only truncates words and doesn't save the rest of the text for the next page (or am I wrong?).

Does anyone have an idea how to do this?

OK, I'm starting to figure this out slowly; I will publish it when it is done, since I think people need this kind of thing. Next step: I broke the HTML string up by tags (in my case, every HTML "p" tag). Now how do I count the text, and only the text, in the tag? (P.S. the tag might have child tags that wrap the text, and might have multiple child tags, e.g.:

  • a
  • bcd

need to return a count of only 2, two words in the tag)?

10x, Erez
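
For the "count only the text" step, here is a minimal sketch using only the standard library's html.parser rather than BeautifulSoup. The names TextCollector and count_words are made up for illustration. It assumes a "word" is any run of non-whitespace characters in the text nodes, and that text in different child tags belongs to different words.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects only text nodes; tags and attributes are ignored."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def count_words(html_fragment):
    """Return the number of whitespace-separated words in the text nodes."""
    collector = TextCollector()
    collector.feed(html_fragment)
    collector.close()
    # Join with spaces so text from sibling tags is not glued together.
    return len(" ".join(collector.chunks).split())


print(count_words("<p><ul><li>a</li><li>bcd</li></ul></p>"))  # 2
```

The space join is a trade-off: it keeps "a" and "bcd" apart in the list case, but it would also count a word split by an inline tag (such as <b>wo</b>rd) as two words. The contents of script and style elements are also counted, since the parser reports them as text.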


Try starting small: define for yourself some sane, limited number of cases that you want to handle (like break on <p> tags, just show alt strings in place of images, and no divs), and see how that works. Then see if you want to tackle image sizing, or just show a hotspot for the user to select to see the image. Then the biggie is detecting divs. Start with just unnested divs, and get things working so that as you break up <p>s, you carry forward the current div's formatting. Then add nesting with a stack of formatting directives, pushing and popping off the stack as you encounter <div> and </div> tags.
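
A rough sketch of that stack idea, using only the standard library's html.parser. This is one possible shape, not a finished paginator: Paginator, paginate and max_words are names invented here. It assumes the input is well nested, treats every open tag (not just divs) as formatting to carry forward, and drops comments. When a page fills up, the open tags are closed innermost first, then re-opened with their original attributes at the top of the next page.

```python
import html
import re
from html.parser import HTMLParser

# Elements that never take a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Paginator(HTMLParser):
    def __init__(self, max_words):
        super().__init__(convert_charrefs=True)
        self.max_words = max_words
        self.pages = []
        self.current = []  # pieces of the page being built
        self.words = 0     # words on the current page so far
        self.stack = []    # (tag name, raw start tag) for each open element

    def handle_starttag(self, tag, attrs):
        raw = self.get_starttag_text()
        self.current.append(raw)
        if tag not in VOID_TAGS:
            self.stack.append((tag, raw))

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br />: emit it, nothing to track.
        self.current.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Only well-nested input is handled; stray end tags are dropped.
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()
            self.current.append(f"</{tag}>")

    def handle_data(self, data):
        # The capture group keeps the whitespace tokens between words.
        for token in re.split(r"(\s+)", data):
            if not token:
                continue
            if token.isspace():
                self.current.append(token)
                continue
            if self.words >= self.max_words:
                self.break_page()
            # Text arrives unescaped, so escape it again on the way out.
            self.current.append(html.escape(token, quote=False))
            self.words += 1

    def break_page(self):
        # Close open tags innermost first, then re-open them outermost first.
        closing = "".join(f"</{tag}>" for tag, _ in reversed(self.stack))
        self.pages.append("".join(self.current) + closing)
        self.current = [raw for _, raw in self.stack]
        self.words = 0


def paginate(html_text, max_words):
    """Split html_text into pages of at most max_words words each."""
    parser = Paginator(max_words)
    parser.feed(html_text)
    parser.close()
    if parser.current:
        parser.break_page()  # flush whatever is left as the last page
    return parser.pages


for page in paginate('<p style="color:red">one two <b>three four</b> five</p>', 2):
    print(page)
```

Each page comes out as a balanced fragment, though a page can end with an empty element (a tag that was opened just before the break). Tables, images that need sizing, and anything height-based are outside what this handles.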

But while your beginnings are simple, I would not be surprised if before long you find you are on the way to developing a complete browser.

  • repagination of text within screen size constraints
  • must handle modal style and formatting tags
  • must handle embedded images of varying size, presumably wrapping text around them

You didn't mention needing support for tables. If anchor tags with hrefs are defined, are these supposed to act as clickable hotspots? And God help you if you have to do something meaningful with JavaScript.

While you are carving off your simple starting point, see just how broad the end product requirements/expectations will have to be. If you start adding tables, frames, fonts, complex style directives, then you are essentially reinventing the web browser. At that point, try to inject some sanity back into the discussion - you are just one person and writing a browser is not a weekend task. Try to get the requirements down to a constrained set of supported tags. Alternatively, look into publicly available/open source browser engines (such as Chromium), which you might be able to adapt, especially in light of your simplified subset of features.


I see you are trying to split while keeping the HTML tags intact. I was simply looking for a way to split a very long HTML string every n characters, creating smaller strings and writing each one to a new line of a .txt file. In my application I then use these smaller strings to send chunks of a web page from server to client. I have posted my working script here: https://stackoverflow.com/a/70287092/13795525
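
For readers who don't want to follow the link, a minimal sketch of that kind of fixed-width split might look like the following. This is not the linked script; split_every and write_chunks are hypothetical names, and flattening newlines to spaces is an assumption made so that each chunk really does occupy one line of the file.

```python
def split_every(text, n):
    """Split text into consecutive pieces of at most n characters."""
    if n <= 0:
        raise ValueError("n must be positive")
    return [text[i:i + n] for i in range(0, len(text), n)]


def write_chunks(text, n, path):
    """Write the chunks of text to path, one chunk per line."""
    # Newlines inside a chunk would break the one-chunk-per-line layout,
    # so this sketch replaces them with spaces first (an assumption).
    flat = text.replace("\r\n", " ").replace("\n", " ")
    with open(path, "w", encoding="utf-8") as fh:
        for chunk in split_every(flat, n):
            fh.write(chunk + "\n")


print(split_every("<p>hello</p>", 5))  # ['<p>he', 'llo</', 'p>']
```

Note that the chunks cut straight through tags, so each piece is only meaningful once the client has joined them back together in order.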
