The easiest way to tackle this is to use an appropriate data structure for the parameters. In most cases this is a dict. Here I used a dict of dicts, but you could also use a list of tuples, etc.; it's a balance between simplicity and extensibility.
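For instance, a dict of dicts describing the fields to extract could look like this (the field names and keys here are purely illustrative, not from the original code):

```python
# One entry per field to extract; each inner dict holds the parameters
# needed to find that field in the HTML. All names are hypothetical.
FIELDS = {
    "title":  {"tag": "h1",   "attr": None},
    "author": {"tag": "span", "attr": "author"},
    "date":   {"tag": "time", "attr": "datetime"},
}
```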
There are a few things that need to happen:

- look for the zip files in the data directory
- look for the HTML files inside each zip file
- parse each HTML file
- write the results to a CSV
If you delineate each of these parts into its own function, adding multithreading, multiprocessing or async later will be a lot simpler (see the sketch below).
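A minimal sketch of that split; all names are illustrative, and `parse_html` is stubbed because the real extraction depends on your HTML:

```python
import csv
import zipfile
from pathlib import Path


def find_zip_files(data_dir):
    """Yield every zip file in the data directory."""
    yield from Path(data_dir).glob("*.zip")


def iter_html_files(zip_path):
    """Yield (member name, html string) for each HTML file inside one zip."""
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith(".html"):
                yield name, archive.read(name).decode("utf-8")


def parse_html(name, html_string):
    """Turn one HTML string into a row dict; the real field extraction
    depends on the files, so this is just a stand-in."""
    return {"file": name, "length": len(html_string)}


def write_csv(rows, out_path, fieldnames):
    """Write the parsed rows to a CSV file."""
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def main(data_dir, out_path):
    rows = [
        parse_html(name, html)
        for zip_path in find_zip_files(data_dir)
        for name, html in iter_html_files(zip_path)
    ]
    if rows:
        write_csv(rows, out_path, fieldnames=list(rows[0]))
```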
This uses a new parser for each HTML string. If you ever want to multiprocess each individual HTML file, only small changes are needed in this function. It uses pathlib.Path for the files, which makes handling the extension and opening the file a bit easier, and a context manager (`with`) to prevent problems when something throws an exception. If you want to reuse the parser instead of creating a new one per file, something like the following can work.
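A minimal sketch of parser reuse, assuming a custom `HTMLParser` subclass (the class name and collected fields are illustrative):

```python
from html.parser import HTMLParser


class SummaryParser(HTMLParser):
    """Meant to be reused across files: per-file state lives in reset()."""

    def reset(self):
        super().reset()
        self.fields = {}

    def handle_data(self, data):
        # Collect whatever your real parser needs; this just sums text length.
        self.fields["text_length"] = self.fields.get("text_length", 0) + len(data)


parser = SummaryParser()
for html_string in ["<p>first file</p>", "<p>second file</p>"]:
    parser.reset()   # clear state instead of constructing a new parser
    parser.feed(html_string)
    parser.close()
    print(parser.fields)
```

Overriding `reset()` works because `HTMLParser.__init__` calls it, so the same method initialises the state on construction and clears it between files.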
Now that you have separated the reading, parsing and writing of the results, profiling will be easier, and which step to tackle first will depend on the results of the profiling. If the bottleneck is the IO, the physical reading of the files, throwing more threads at it will not speed up the process, but loading each zip file into memory in one go might.
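A minimal sketch of that, assuming the zip files fit comfortably in RAM:

```python
import io
import zipfile


def open_zip_in_memory(zip_path):
    """Read the whole zip from disk in one go, then work on it from RAM."""
    with open(zip_path, "rb") as f:
        data = f.read()                       # single sequential disk read
    return zipfile.ZipFile(io.BytesIO(data))  # members decompress from memory
```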
The upload's very useful, thanks. So it looks like the files aren't that messy; as was already said, an approach based on regular expressions might be sufficient, and if there are no line breaks or similar complications it could certainly be pretty fast (see the sketch after this paragraph). Parser-wise, the only other option isn't really going to be quicker, and again, if you're already going for regex this won't matter. Edit: Never mind, I was going to suggest skipping the parsing as soon as there's nothing more interesting left in the file, but clearly the data is spread all over. Lastly, this is Python: you could check whether PyPy improves speed, but with CPython alone I wouldn't expect high performance, to be honest.
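For illustration, a regex-based extraction could look like this (the pattern and field are hypothetical; the real ones depend on how uniform the files actually are):

```python
import re

# Precompile once; adjust the pattern to the actual markup in the files.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html_string):
    """Pull the title out of one HTML string, or None if it's missing."""
    match = TITLE_RE.search(html_string)
    return match.group(1).strip() if match else None
```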
Edit: I tried the SAX approach, and now that I look at it more closely I'm noticing some bugs: there are multiple tags with similar names, and the if statements are overwriting some of the data.
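One way to avoid such collisions, sketched with Python's event-driven `HTMLParser` (the handler logic is illustrative, not the original code): key extracted data on the full tag path instead of the bare tag name.

```python
from html.parser import HTMLParser


class PathAwareParser(HTMLParser):
    """Keys data on the full open-tag path, so equally named tags in
    different contexts don't overwrite each other."""

    def reset(self):
        super().reset()
        self.stack = []
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            # e.g. "html/body/div/span" instead of just "span"
            self.fields.setdefault("/".join(self.stack), text)
```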
Edit: Oh yeah, and if you write CSV, make sure that you fix the order of the keys, otherwise what you get from a dict can be effectively random, which makes diffing output files difficult.
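A minimal sketch, assuming all rows share the same keys:

```python
import csv


def write_rows(rows, out_path):
    # sorted() pins the column order; otherwise it depends on the order in
    # which keys were inserted into the dict, which can vary between runs.
    fieldnames = sorted(rows[0])
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```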
Original question: Parsing contents of a large zip file with an HTML parser into a CSV. My code reads the files without extracting them, passes the resulting HTML string into a custom HTMLParser object, and then writes a summary of all the files into a CSV for that particular zip file via csv.DictWriter.
Comment: You can try the zipfile standard library.