Multithreaded screen scraping help needed
I'm relatively new to Python, and I'm working on a screen-scraping
application that gathers data from multiple financial sites. I have four
procedures for now. Two run in just a couple of minutes; the other two take
hours each. Those two look up information on particular stock symbols that
I have in a csv file, and there are 4,000+ symbols that I'm using. I know
enough to know that the vast majority of the time is spent on IO over the
wire. For this to be of any practical use to me, it's essential that I get
each of those two procedures down to half an hour or better (is that too
ambitious?). I'm using Python 3 and BeautifulSoup.
The general structure of what I'm doing is below, with the conceptually
non-essential sections abbreviated. I've been reading many threads on
making multiple calls/threads at once to speed things up, and it seems like
there are a lot of options. Based on the structure of what I have so far,
can anyone point me in the right direction? It would be a huge help. It's
probably obvious, but this procedure gets called along with the other
data-download procs from a main driver module. Thanks in advance...
from bs4 import BeautifulSoup
import csv
import datetime
import urllib.error
import urllib.request
# ...other misc modules (abbreviated)...

class StockOption:
    def __init__(self, DateDownloaded, OptionData):
        self.DateDownloaded = DateDownloaded
        self.OptionData = OptionData

    def ForCsv(self):
        return [self.DateDownloaded, self.OptionData]

def extract_options(TableRowsFromBeautifulSoup, symb, putCall, expDate):
    optionsList = []
    for opt in range(0, len(TableRowsFromBeautifulSoup)):
        # ...build a StockOption from the data parsed out of this row (abbreviated)...
        optionsList.append(StockOption(DateDownloaded, OptionData))
    return optionsList

def run_proc():
    symbolList = []  # ...read in csv file of tickers (abbreviated)...
    optionsRows = []
    for symb in symbolList:
        webStr = ''  # ...build the connection string for this symbol (abbreviated)...
        try:
            with urllib.request.urlopen(webStr) as url:
                page = url.read()
            soup = BeautifulSoup(page, 'html.parser')
            if soup.text.find('There are no All Markets results for') == -1:
                tbls = soup.findAll('table')
                if len(tbls[9]) > 1:
                    expStrings = soup.findAll('td', text=True,
                        attrs={'align': 'right'})[0].contents[0].split()
                    # currMonth is parsed further up (abbreviated)
                    expDate = datetime.date(int(expStrings[6]),
                        int(currMonth), int(expStrings[5].replace(',', '')))
                    calls = extract_options(tbls[9], symb, 'Call', expDate)
                    puts = extract_options(tbls[13], symb, 'Put', expDate)
                    optionsRows = optionsRows + calls
                    optionsRows = optionsRows + puts
        except urllib.error.HTTPError as err:
            if err.code == 404:
                pass  # no options page for this symbol; move on
            else:
                raise
    opts = [option.ForCsv() for option in optionsRows]
    # Write to the csv file.
    with open('C:/OptionsChains.csv', 'a', newline='') as fp:
        a = csv.writer(fp, delimiter=',')
        a.writerows(opts)

if __name__ == '__main__':
    run_proc()
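To make the question concrete, here is the kind of thread-pool approach I
keep seeing suggested for IO-bound work, using concurrent.futures from the
standard library. This is only a minimal sketch of the idea, not code from
my app: URL_TEMPLATE, fetch_symbol, and the pool size are placeholders I
made up, and the BeautifulSoup table-walking from run_proc would move into
the worker function.

from bs4 import BeautifulSoup
import concurrent.futures
import urllib.error
import urllib.request

URL_TEMPLATE = 'https://example.com/options?s={}'  # placeholder URL

def fetch_symbol(symb):
    # Hypothetical per-symbol worker: download one page and hand back
    # its parsed soup. The table parsing from run_proc would go here.
    try:
        with urllib.request.urlopen(URL_TEMPLATE.format(symb)) as url:
            page = url.read()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # symbol not found; skip it
        raise
    return BeautifulSoup(page, 'html.parser')

def fetch_all(symbolList, max_workers=20):
    soups = {}
    # IO-bound work overlaps well in threads: while one thread waits on
    # the network, the others keep downloading.
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_symbol, symb): symb for symb in symbolList}
        for future in concurrent.futures.as_completed(futures):
            result = future.result()
            if result is not None:
                soups[futures[future]] = result
    return soups

If I went this way, I assume the csv writing should stay in the main
thread (collect all the rows first, then write once), and that the pool
size should be kept modest so I'm not hammering the sites. Is this the
right direction, or is there a better option for my structure?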