Speed up requests processing on a large pandas Series
I have a function that takes a list of URLs from a DataFrame column, finds all of the shortened URLs, and sends a request for each one to get the full URL, replacing the old value in the column. However, it is very, very slow; most of my program's run time is spent waiting for this bit to finish. Are any Python wizards able to help me increase the speed of this function? Please excuse my terrible use of exceptions; I learned Python from savages.
import requests


def expand_moreover(dataframe):
    """Expand the Moreover-based links to the full URL."""
    with requests.Session() as session:
        headers = {
            'Access-Control-Allow-Origin': '*',
            'Access-Control-Allow-Methods': 'GET',
            'Access-Control-Allow-Headers': 'Content-Type',
            'Access-Control-Max-Age': '3600',
            'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
            'Cache-Control': 'max-age=0',
        }
        try:
            print("Expanding the Moreover URLs, please wait, this can take a few minutes...")
            # Collect the unique short links that need expanding.
            shortlinks = dataframe[dataframe['ContentUrl'].str.contains('moreover', na=False, case=False)].copy()
            shortlinks.drop_duplicates(subset=["ContentUrl"], keep="first", inplace=True)
            expand = shortlinks['ContentUrl'].tolist()
            errors = 0
            success = 0
            for shortlink in expand:  # renamed from `shortlinks` to avoid shadowing the DataFrame above
                try:
                    # Follow redirects and take the final URL, then substitute it back into the column.
                    r = session.get(shortlink, timeout=2, headers=headers).url
                    dataframe['ContentUrl'] = dataframe['ContentUrl'].str.replace(shortlink, r, regex=False)
                    success += 1
                except Exception:
                    errors += 1
            print(f"Done: {success} links expanded and {errors} links failed to expand")
        except Exception:
            print("Something went wrong with expanding the Moreover URLs - SKIPPING")
    return dataframe
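Since nearly all of the time here is network wait, the usual fix is to issue the requests concurrently and then apply the results to the column in one vectorized pass instead of one str.replace per link. Below is a minimal sketch of that approach using concurrent.futures.ThreadPoolExecutor; the function name expand_moreover_concurrent and the max_workers=20 value are illustrative assumptions, not part of the original code, and sharing one requests.Session across threads is common practice but not formally guaranteed thread-safe by the requests docs.

import concurrent.futures as cf

import pandas as pd
import requests

HEADERS = {
    # Trimmed to the header that usually matters for redirect services.
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
}


def expand_moreover_concurrent(dataframe: pd.DataFrame, max_workers: int = 20) -> pd.DataFrame:
    """Resolve the Moreover short links in parallel, then remap the column once."""
    mask = dataframe['ContentUrl'].str.contains('moreover', na=False, case=False)
    short_urls = dataframe.loc[mask, 'ContentUrl'].unique().tolist()

    resolved = {}  # short URL -> final URL after redirects
    errors = 0

    def resolve(session, url):
        # .url on the response is the final address after any redirects.
        return url, session.get(url, timeout=2, headers=HEADERS).url

    # One shared Session keeps connection pooling; the pool runs the requests concurrently.
    with requests.Session() as session, cf.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(resolve, session, url) for url in short_urls]
        for future in cf.as_completed(futures):
            try:
                short, final = future.result()
                resolved[short] = final
            except Exception:
                errors += 1

    # One vectorized remap instead of a full-column str.replace per link.
    dataframe['ContentUrl'] = dataframe['ContentUrl'].map(lambda u: resolved.get(u, u))
    print(f"Done: {len(resolved)} links expanded and {errors} links failed to expand")
    return dataframe

The dict-based remap also removes a hidden cost in the original loop: str.replace scans the whole column once per link, so the loop is O(links × rows), while a single map is one pass over the column.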