

Garren-H/ThermoML_Data_Extraction_Tool

ThermoML_Data_Extraction_Tool

A Python tool for extracting data from the ThermoML archive (https://trc.nist.gov/ThermoML) into Excel files for further processing. This tool focuses only on the extraction of PureOrMixture Data, as labeled by the ThermoML archive, and does not have any other functionality.

This tool was developed for the sole intent of transforming the data into a more readable format.

Note: I do not make any claim to the data obtained from the ThermoML Data archive, so if you do use this tool, make sure to reference the archive as per https://doi.org/10.18434/mds2-2422

Instructions

An environment.yml file is included to make it easy to create a conda environment for Python. The versions listed are those used during development of this tool; they may be relaxed, but this has not been tested.

Download both environment.yml and ThermoML_Data_Extraction_Tool.py and place them in a convenient directory. Navigate to that directory and create the conda environment by executing

conda env create --file environment.yml

Activate the environment once created, and run the script using python

conda activate thermoml-extraction
python ThermoML_Data_Extraction_Tool.py

Interface

The interface has a few prompts:

  1. The number of CPU cores to use. The default is half the number of available CPU cores, which may not suit your purposes, so adjust it accordingly. The cores are used to run the extraction and saving of data in parallel.
  2. The path where you want to save the extracted Excel files. If the directory does not exist, it will be created. If a full path is not provided, the path is treated as relative.
  3. Whether to download and extract the ThermoML data. If you select "n", you will be prompted for a valid path to ThermoML data that you have already downloaded and extracted yourself. The program only checks the first-level subdirectories for .json files, so if the path you give is incorrect but still contains .json files, the program will error. We expect the directory structure
    Top Level
       Lower level (10.1021, 10.1016 and 10.1007 as folders)
           Contains .json and .xml files
    
    You should provide the Top Level directory as input.
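
The behaviour described by the three prompts can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the tool's actual code: the function names `default_core_count`, `prepare_output_dir` and `looks_like_thermoml_root` are hypothetical, and the real script may implement these checks differently.

```python
import os
from pathlib import Path


def default_core_count() -> int:
    """Half of the available CPU cores, but never fewer than one."""
    # os.cpu_count() can return None when the count cannot be determined.
    return max(1, (os.cpu_count() or 1) // 2)


def prepare_output_dir(path_text: str) -> Path:
    """Create the output directory if it is missing.

    A relative path is resolved against the current working directory.
    """
    out_dir = Path(path_text)
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir


def looks_like_thermoml_root(top_level: str) -> bool:
    """True if at least one first-level subdirectory holds a .json file.

    This mirrors the shallow check described above: it does not verify
    that the .json files are actually ThermoML data.
    """
    root = Path(top_level)
    if not root.is_dir():
        return False
    return any(
        child.is_dir() and any(child.glob("*.json"))
        for child in root.iterdir()
    )
```

Because the check only looks one level down for .json files, a directory that merely happens to contain such files passes it, which is why the Top Level directory of the extracted archive should be given.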

After these prompts, the extraction process should proceed. Note: We do not check memory requirements. When using multiple CPU cores, you may run into a memory error when saving the Excel files (some files contain large amounts of data, and writing them in parallel may exhaust the available memory).

Output during extraction:

Enter the number of CPU cores to use for processing (default: 4): 4
Current number of CPU cores to be used:  4
You sure? This choice can't be changed (y/n): y


Enter the path to the directory where you want to save the processed data: Downloads\ThermoMLData

Downloading the files:

Do you want to download the ThermoML files? [y/n] y
Enter the path to the directory where you want to save the downloaded files: Downloads

Self downloaded and extracted data files:

Do you want to download the ThermoML files? [y/n] n
Enter the path to the directory containing the ThermoML files: Downloads\ThermoML.v2020-09-30

Output after downloading:

Processing citations first...
Extracting citations (process 1): 100%|████████████████████████████████████████████████████████████████████████| 2980/2980 [00:59<00:00, 49.49it/s]
Extracting citations (process 2): 100%|████████████████████████████████████████████████████████████████████████| 2980/2980 [00:46<00:00, 53.10it/s]
Extracting citations (process 3): 100%|████████████████████████████████████████████████████████████████████████| 2980/2980 [00:45<00:00, 71.38it/s]
Extracting citations (process 4): 100%|████████████████████████████████████████████████████████████████████████| 2983/2983 [00:38<00:00, 75.46it/s]
Citations written to Excel file:  Downloads\ThermoMLData\Citations.xlsx


Extracting data from all files...
This may take a while depending on the number of files and their sizes...
Merging all data in process 1: 100%|███████████████████████████████████████████████████████████████████████████| 2968/2968 [00:56<00:00, 41.11it/s]
Merging all data in process 2: 100%|███████████████████████████████████████████████████████████████████████████| 2917/2917 [01:18<00:00, 31.10it/s]
Merging all data in process 3: 100%|███████████████████████████████████████████████████████████████████████████| 2963/2963 [01:18<00:00, 28.25it/s]
Merging all data in process 4: 100%|███████████████████████████████████████████████████████████████████████████| 2980/2980 [01:25<00:00, 25.50it/s]
Merging data across different processes...: 100%|████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00,  2.33it/s]


Writing data to Excel files...
Note larger files may take a while to write...
Writing data to Excel files...:  28%|█████████████████████▋                                                        | 27/97 [02:12<07:17,  6.25s/it]
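
The per-process totals in the citation step above (2980, 2980, 2980 and 2983 files) are consistent with the file list being cut into equal slices, with the remainder going to the last process. A minimal sketch of such a split, assuming that is how the work is divided (the helper name `split_evenly` is hypothetical and the tool's own code may differ):

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_evenly(items: Sequence[T], n_parts: int) -> List[List[T]]:
    """Cut items into n_parts slices; the last slice absorbs the remainder."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    size = len(items) // n_parts
    # The first n_parts - 1 slices all have exactly `size` items.
    parts = [list(items[i * size:(i + 1) * size]) for i in range(n_parts - 1)]
    # The final slice takes everything that is left over.
    parts.append(list(items[(n_parts - 1) * size:]))
    return parts


if __name__ == "__main__":
    chunks = split_evenly(range(11923), 4)
    print([len(chunk) for chunk in chunks])  # [2980, 2980, 2980, 2983]
```

Each slice would then be handed to one worker process, which is why the number of cores chosen at the first prompt directly sets how many progress bars appear.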

Additional note: I have written this tool so that it can easily be converted into an executable using PyInstaller.
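
For readers adapting the script, one general point about freezing parallel Python programs: code that uses the standard `multiprocessing` module needs to call `multiprocessing.freeze_support()` right after the `if __name__ == "__main__":` line so that worker processes start correctly in a frozen Windows executable. Whether this script uses `multiprocessing` or already makes that call is not stated here, so the following is only a generic sketch with a hypothetical `square` worker:

```python
import multiprocessing as mp


def square(x: int) -> int:
    """Stand-in for the real per-file work done in each process."""
    return x * x


def main() -> None:
    # A small pool, purely for illustration.
    with mp.Pool(processes=2) as pool:
        print(pool.map(square, [1, 2, 3]))


if __name__ == "__main__":
    # Has no effect in a normal Python run; needed when frozen on Windows.
    mp.freeze_support()
    main()
```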
