Program Structure. miscfuncs.toppath is the name of the directory all the data is built into. It can be c:\pwdata ************************************** Lords scraping: -------------------------------------- lordscreatehansardindex.py goes to 'http://www.publications.parliament.uk/pa/ ld199900/ldhansrd/pdvn/allddays.htm' and makes the index file in lordindex.xml Where each line is of the form: (the page of the index for that day) -------------------------------------- lordspullgluepages.py loops through all the lines in lordindex.xml and iterates through the "Next Page" links from the first page of the day, and glues them together, removing header and footer, into single files of the form: lordspages/daylord2004-01-28.html -------------------------------------- lordsfilters.py loops through all the files lordspages/*.html and writes them into files of the form scrapedxml/lords/daylord2004-01-28.xml It does this by passing the text through a series of filters, each of which takes a text string of the whole file, and creates output into an output-stream. The intermediate filters write to a stringstream, and the final filter writes to a filestream. -------------------- LordsFilterColtime.py takes lines of the form 19 Nov 2003 : Column WS1926

and converts to... It hopes to extract all the spurious paragraph marker inserts in the process. Todo. get the names extracted filters the times extracted. The chunking up thing running.