A Python tool for extracting data from the ThermoML archive (https://trc.nist.gov/ThermoML) into Excel files for further processing. This tool focuses only on extracting PureOrMixture data, as labeled by the ThermoML archive, and has no other functionality.
This tool was developed with the sole intent of transforming the data into a more readable format.
Note: I do not make any claim to the data obtained from the ThermoML Data archive, so if you do use this tool, make sure to reference the archive as per https://doi.org/10.18434/mds2-2422
An environment.yml file is included to easily create a conda environment for Python. The versions listed are those used in the development of this tool; they may be relaxed, but this has not been tested.
Download both environment.yml and ThermoML_Data_Extraction_Tool.py and place them wherever convenient. Navigate to that directory and create the conda environment by executing:
conda env create --file environment.yml
Activate the environment once it has been created, and run the script using Python:
conda activate thermoml-extraction
python ThermoML_Data_Extraction_Tool.py
The interface has a few prompts:
- The first is the number of CPU cores to use. The default is half the number of available CPU cores, which may not suit your purposes, so adjust it accordingly. These cores are used to extract and save the data in parallel.
- The path where you want to save the extracted Excel files. If the directory does not exist, it will be created. A path that is not a full path is treated as a relative path.
- Whether to download and extract the ThermoML data. If you select "n" for this, you will be prompted for a valid path where you have extracted the ThermoML data (i.e. at this point we assume you have downloaded and extracted it yourself). The program only checks the first level of subdirectories for .json files, so if the path you give is incorrect but still contains .json files, the program will fail with an error. We expect the directory structure below, and you should provide the top-level directory as input:
  - Top level
    - Lower level (10.1021, 10.1016 and 10.1007 as folders)
      - Contains .json and .xml files
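The path handling described in the prompts above can be sketched as follows. This is only an illustration, not the tool's actual code: the helper names `prepare_output_dir` and `looks_like_thermoml_root` are hypothetical, and the check assumes that "first subdirectories" means the immediate child folders of the path you provide.

```python
from pathlib import Path


def prepare_output_dir(path_text):
    """Create the output directory if it is missing and return its full path.

    Hypothetical helper: a relative path is resolved against the current
    working directory, mirroring the behaviour described for the prompt.
    """
    out_dir = Path(path_text).expanduser()
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir.resolve()


def looks_like_thermoml_root(path_text):
    """Return True if any immediate subdirectory contains a .json file.

    Hypothetical helper: this is a shallow check only, so a wrong directory
    that happens to hold .json files one level down would still pass.
    """
    root = Path(path_text).expanduser()
    if not root.is_dir():
        return False
    for sub in root.iterdir():
        if sub.is_dir() and any(sub.glob("*.json")):
            return True
    return False
```

Because the check is shallow, passing it does not guarantee that the files are valid ThermoML records; it only catches a clearly wrong path early.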
After these prompts, the extraction process should proceed. Note: We do not check memory requirements. When using multiple CPU cores, you may run into a memory error when saving the Excel files (some files contain large amounts of data, and writing them in parallel may exhaust the available memory).
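If memory errors occur, choosing fewer cores at the first prompt is the simplest remedy. As a rough sketch of how a "half of the available cores" default and a user answer could be combined, assuming hypothetical helpers `default_core_count` and `parse_core_count` that are not taken from the tool itself:

```python
import os


def default_core_count(available=None):
    """Return half of the available cores, but never fewer than one."""
    if available is None:
        available = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, available // 2)


def parse_core_count(answer, available=None):
    """Turn the text typed at the prompt into a usable core count.

    An empty answer selects the default; any other answer is clamped to the
    range 1..available. Non-numeric input raises ValueError.
    """
    if available is None:
        available = os.cpu_count() or 1
    answer = answer.strip()
    if not answer:
        return default_core_count(available)
    return min(max(1, int(answer)), available)
```

For example, on an 8-core machine an empty answer would give 4 cores, and an answer of 16 would be clamped to 8.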
Output during extraction:
Enter the number of CPU cores to use for processing (default: 4): 4
Current number of CPU cores to be used: 4
You sure? This choice can't be changed (y/n): y
Enter the path to the directory where you want to save the processed data: Downloads\ThermoMLData
Downloading the files:
Do you want to download the ThermoML files? [y/n] y
Enter the path to the directory where you want to save the downloaded files: Downloads
Self downloaded and extracted data files:
Do you want to download the ThermoML files? [y/n] n
Enter the path to the directory containing the ThermoML files: Downloads\ThermoML.v2020-09-30
Output after downloading:
Processing citations first...
Extracting citations (process 1): 100%|████████████████████████████████████████████████████████████████████████| 2980/2980 [00:59<00:00, 49.49it/s]
Extracting citations (process 2): 100%|████████████████████████████████████████████████████████████████████████| 2980/2980 [00:46<00:00, 53.10it/s]
Extracting citations (process 3): 100%|████████████████████████████████████████████████████████████████████████| 2980/2980 [00:45<00:00, 71.38it/s]
Extracting citations (process 4): 100%|████████████████████████████████████████████████████████████████████████| 2983/2983 [00:38<00:00, 75.46it/s]
Citations written to Excel file: Downloads\ThermoMLData\Citations.xlsx
Extracting data from all files...
This may take a while depending on the number of files and their sizes...
Merging all data in process 1: 100%|███████████████████████████████████████████████████████████████████████████| 2968/2968 [00:56<00:00, 41.11it/s]
Merging all data in process 2: 100%|███████████████████████████████████████████████████████████████████████████| 2917/2917 [01:18<00:00, 31.10it/s]
Merging all data in process 3: 100%|███████████████████████████████████████████████████████████████████████████| 2963/2963 [01:18<00:00, 28.25it/s]
Merging all data in process 4: 100%|███████████████████████████████████████████████████████████████████████████| 2980/2980 [01:25<00:00, 25.50it/s]
Merging data across different processes...: 100%|████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00, 2.33it/s]
Writing data to Excel files...
Note larger files may take a while to write...
Writing data to Excel files...: 28%|█████████████████████▋ | 27/97 [02:12<07:17, 6.25s/it]
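The per-process counts in the citation step above (2980, 2980, 2980 and 2983) are consistent with the file list being cut into equal slices, with the remainder going to the last process. The sketch below shows one way to do that; it is an assumption based on the log, not the tool's confirmed implementation, and `split_into_chunks` is a hypothetical name.

```python
def split_into_chunks(items, n_chunks):
    """Split items into n_chunks slices; the last slice takes the remainder."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    size = len(items) // n_chunks
    # The first n_chunks - 1 slices all have the same length.
    chunks = [items[i * size:(i + 1) * size] for i in range(n_chunks - 1)]
    # The final slice picks up whatever is left over.
    chunks.append(items[(n_chunks - 1) * size:])
    return chunks


# 11923 files over 4 processes reproduces the counts seen in the log.
sizes = [len(chunk) for chunk in split_into_chunks(list(range(11923)), 4)]
print(sizes)  # [2980, 2980, 2980, 2983]
```

Each slice could then be handed to a separate worker process, for example through `multiprocessing.Pool`.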
Additional note: I have written this tool so that it can easily be converted into an executable using PyInstaller.