[Computing Power] Does the University have Access to Computing Power?
Dr. Hart mentioned back when I took MSDS610 that there was a high-performance workstation in the department available for optional use with that course. I would like to know whether that workstation could be used for this class. I would need Python 2.7 and the pandas library to run my script.
I have a computing horsepower problem with a data assembly script. Part of it is that I chose a simple, easy-to-code but inefficient algorithm; part of it is the size of the problem. Currently the data assembly script looks like it will need around 100 hours to complete.
My project is modeling home sales in Boulder County, CO using data from the county assessor's office. I have 16k+ sales records in my main table, covering four years' worth of sales. For each year there is a features table with 130k+ records. A home may have zero, one or more entries in the features table, which covers things like decks, patios, porches, sheds, etc. I need to match each home with its features and create appropriate variables in the main table for the existence of each feature and its size in square feet.
Because of the zero-to-many relationship between the tables, and because I am amalgamating some of the less common features together, I don't know how to vectorize the problem. So I am currently using 'for' loops. The data has a number of issues that make vectorization difficult for me. For example, a home may have 2 or more decks. I aggregate these into a deck existence variable and a deck total square footage variable. The same is true for the other categories that I am extracting (deck, enclosed porch, patio, porch and a general 'other' category). I am creating five categories from the eleven or so categories in the features table because some things like sheds, shop areas and pools are rare, so I put them in the 'other' category.
In the main table I pre-allocate all five categorical feature variables and their five matching square footage variables to zero. This runs fast and sets all homes without a given feature to the correct value.
Currently I am brute-force searching a sorted features table and stopping when I find the current home, but the brute-force code is too slow. I am going to try putting the features table into a Python dictionary for faster lookup. This is a little tricky because each dictionary key may map to multiple items, so I will need to write a class or something similar to serve as the dictionary value. This should speed up the code. I could also look into multi-threading, but I have not done this in Python, so that learning curve is risky for me in terms of class time.
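For what it is worth, the one-key-to-many-items case may not need a custom class: a dictionary whose values are plain lists is usually enough. Below is a minimal sketch of that idea, not the actual script. The row layout, the home IDs, the function names and the contents of CATEGORY_MAP are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical feature rows: (home_id, feature_type, square_feet).
feature_rows = [
    ("R001", "deck", 120),
    ("R001", "deck", 80),
    ("R001", "shed", 60),
    ("R002", "patio", 200),
]

# Hypothetical mapping from raw feature types to the reduced categories.
# Any type not listed here falls into "other".
CATEGORY_MAP = {
    "deck": "deck",
    "enclosed porch": "enclosed_porch",
    "patio": "patio",
    "porch": "porch",
}


def build_feature_index(rows):
    """Group feature rows by home ID so each lookup is one dict access."""
    index = defaultdict(list)
    for home_id, feature_type, sqft in rows:
        category = CATEGORY_MAP.get(feature_type, "other")
        index[home_id].append((category, sqft))
    return index


def summarize_home(index, home_id):
    """Return {category: total_sqft} for one home (empty if no features)."""
    totals = defaultdict(int)
    # .get() avoids creating an empty entry for homes with no features.
    for category, sqft in index.get(home_id, []):
        totals[category] += sqft
    return dict(totals)


index = build_feature_index(feature_rows)
```

The index is built in a single pass over the features table, and each home is then found in roughly constant time instead of by scanning the sorted table. A home with no features simply returns an empty dictionary, which matches the pre-allocated zeros in the main table.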
If there are any suggestions as to how I can vectorize a merge this complex to speed it up, I am open to ideas. I am also open to other ideas for speeding up the code.
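One possible way to vectorize this kind of zero-to-many merge in pandas is to aggregate the features table first (one row per home, one column per category) and then left-join the result onto the main table. The sketch below is only an illustration under assumptions: the column names (home_id, feature_type, sqft), the sample rows and CATEGORY_MAP are hypothetical, it handles a single year, and it has not been tested against Python 2.7.

```python
import pandas as pd

# Hypothetical stand-ins for the main sales table and one year's features.
sales = pd.DataFrame({
    "home_id": ["R001", "R002", "R003"],
    "price": [450000, 380000, 520000],
})
features = pd.DataFrame({
    "home_id": ["R001", "R001", "R001", "R002"],
    "feature_type": ["deck", "deck", "shed", "patio"],
    "sqft": [120, 80, 60, 200],
})

CATEGORIES = ["deck", "enclosed_porch", "patio", "porch", "other"]
# Hypothetical mapping; unlisted feature types become "other".
CATEGORY_MAP = {
    "deck": "deck",
    "enclosed porch": "enclosed_porch",
    "patio": "patio",
    "porch": "porch",
}

features["category"] = features["feature_type"].map(CATEGORY_MAP).fillna("other")

# Total square footage per home and category, one row per home.
sqft = (
    features.groupby(["home_id", "category"])["sqft"].sum()
    .unstack(fill_value=0)
    .reindex(columns=CATEGORIES, fill_value=0)  # keep all five columns
    .add_suffix("_sqft")
)

# Left join keeps homes with no features; their totals become 0.
merged = sales.merge(sqft, how="left", left_on="home_id", right_index=True)
sqft_cols = [c + "_sqft" for c in CATEGORIES]
merged[sqft_cols] = merged[sqft_cols].fillna(0)

# Existence flags. Assumption: a recorded feature always has sqft > 0.
for c in CATEGORIES:
    merged["has_" + c] = (merged[c + "_sqft"] > 0).astype(int)
```

Because the features are collapsed to one row per home before the join, multiple decks on one home do not duplicate rows in the main table. With four yearly features tables, the year would presumably need to be part of the grouping and join keys as well.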
While I think I can speed up my script with the dictionary approach, I need something like a 6-10x improvement in speed just to get it to finish overnight. So while I need a better algorithm, I also think that I need a faster computer. Hence the question: does the department have a fast computer that I could run the script on?
We have a machine, cobalt.ccis.site, that students may use. It has two 6-core Xeon processors. With Intel hyper-threading this gives you 24 execution threads. The machine also has 128 GB of RAM and a little over 2 TB of RAID 5 disk. It runs CentOS 7.
The machine can be accessed via ssh and via a remote desktop product, NoMachine.
Contact me by email if you want credentials to access the machine.
Thanks for your answers. I solved the problem for now by using a Python dictionary - it is so much faster for searching than the brute force approach that it is almost unbelievable. The improvement in speed was far beyond my expectations or hopes - the data setup script now runs in 15 minutes or so (instead of ~100 hours).
I will post more when I have more time.