Post extraction analysis
This section describes the default post-extraction analysis tool. Its purpose is to perform qualitative and quantitative analysis of the structure, fragment, contact and hydrogen bond data for a selected group of structures. The tool creates scatter plots for pairs of parameters and histograms for the parameters extracted during the data extraction process.
For the scatter plots, the algorithm calculates the correlation coefficients for the selected set of variables. For the histograms, it offers the option to fit distributions to the selected data and to report the characteristics of the fitted curve.
The Data Analysis Input File
The first step is to modify the input_data_analysis.txt file based on the required criteria. The general format of the file and descriptions of each parameter are as follows:
Input File Format
The configuration should be specified in JSON format as shown below:
{
    "plots_directory": "../csd_db_analysis/visualize/",
    "data_directory": "../csd_db_analysis/db_data/",
    "data_prefix": "homomolecular",
    "folder": "contacts_carboxylic-acid_carboxylic-acid_OH_hb_Zprime_1",
    "figure_size": [5,3.75],
    "save_figs": false,
    "data_filters": {
        "space_group": {
            "is_active": false,
            "type": "single",
            "values": ["P21/c","P21/n"],
            "operator": "or",
            "refine_data": false
        },
        "z_crystal": {
            "is_active": false,
            "type": "single",
            "values": [4,8],
            "operator": "or",
            "refine_data": false
        },
        "z_prime": {
            "is_active": true,
            "type": "single",
            "values": [1],
            "operator": "or",
            "refine_data": false
        },
        "species": {
            "is_active": false,
            "type": "multiple",
            "values": ["C","H","N","O"],
            "operator": "or",
            "refine_data": false
        },
        "fragments": {
            "is_active": true,
            "type": "multiple",
            "values": [
                "carboxylic_acid",
                // ...
            ],
            "operator": "and",
            "refine_data": true
        },
        "contact_pairs": {
            "is_active": true,
            "type": "multiple_lists",
            "values": [
                ["O","H","hbond",true],
                // ...
            ],
            "operator": "or",
            "refine_data": true
        },
        "contact_central_fragments": {
            "is_active": true,
            "type": "multiple_lists",
            "values": [
                ["carboxylic_acid","hbond",true]
                // ...
            ],
            "operator": "or",
            "refine_data": true
        },
        "contact_fragment_pairs": {
            "is_active": true,
            "type": "multiple_lists",
            "values": [
                ["carboxylic_acid","carboxylic_acid","hbond",true],
                // ...
            ],
            "operator": "and",
            "refine_data": true
        }
    },
    "plot_data_options": {
        "individual_space_groups_plots": true,
        "interactive": true,
        "percentiles": [[10,25,50,75,90],true,true,true],
        "2D_scatter": [
            ["cell_length_b_sc","cell_length_c_sc",null],
            // ...
        ],
        "2D_scatter_marker": "o",
        "2D_scatter_facecolor": "whitesmoke",
        "2D_scatter_edgecolor": "black",
        "2D_scatter_opacity": 1.0,
        "3D_scatter": [
            ["cc_contact_atom_ref_bv_x","cc_contact_atom_ref_bv_y","cc_contact_atom_ref_bv_z",null],
            // ...
        ],
        "3D_scatter_marker": "o",
        "3D_scatter_facecolor": "whitesmoke",
        "3D_scatter_edgecolor": "black",
        "3D_scatter_opacity": 1.0,
        "histogram": [
            ["cc_length",null,false],
            // ...
        ],
        "histogram_density": false,
        "titles": false
    }
}
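As a quick sanity check before running the analysis, the configuration can be loaded and the active filters listed. The sketch below is illustrative only and is not part of the tool: the helper name `active_filters` is hypothetical, and it assumes the configuration is strict JSON (the `// ...` placeholders in the listing above stand for omitted entries and would have to be removed or filled in before parsing).

```python
import json


def active_filters(config):
    """Return the names of the filters whose "is_active" flag is true."""
    return [
        name
        for name, options in config.get("data_filters", {}).items()
        if options.get("is_active", False)
    ]


# A trimmed-down configuration that reuses keys from the example above.
example = json.loads("""
{
    "data_prefix": "homomolecular",
    "figure_size": [5, 3.75],
    "data_filters": {
        "space_group": {"is_active": false, "type": "single",
                        "values": ["P21/c", "P21/n"], "operator": "or",
                        "refine_data": false},
        "z_prime": {"is_active": true, "type": "single",
                    "values": [1], "operator": "or",
                    "refine_data": false}
    }
}
""")

print(active_filters(example))
```

For a real run, the same helper could be applied to the dictionary returned by `json.load` on the analysis input file, again assuming the file contains valid JSON.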
Key Descriptions
plots_directory: Specifies the directory where plots will be saved. Using the default option is recommended.
data_directory: The directory where the extracted data is stored. It must match the "save_directory" specified in the input_data_extraction.json file.
data_prefix: A prefix applied to output files to facilitate their identification. This must be consistent with the "data_prefix" in the input_data_extraction.json file.
figure_size: Defines the dimensions of exported figures in inches, formatted as \((W \times H)\). The default Matplotlib size is \((6.4 \times 4.8)\). To place two figures side by side in a 12-inch wide document using an 11pt font, the optimal size is \((5.0 \times 3.75)\). Adjust dimensions according to your document’s specific requirements.
data_filters: Details for filtering structures for the analysis. Structures can be filtered based on:
- Space group
The space group of the structure.
- \(Z\) value
The total number of molecules in the unit cell, equal to (number of symmetry operations) \(\times\) (number of molecules in the asymmetric unit).
- \(Z^{\prime}\) value
The number of molecules in the asymmetric unit.
- Atomic species
The different atomic species found in the structure.
- Fragments
The different fragments found in the structure.
- Contact atomic pairs
The different atomic pairs found for the contacts in the structure.
- Contact central fragments
The different central fragments for the contacts in the structure.
- Contact fragment pairs
The different fragment pairs found for the contacts in the structure.
Each filter has 5 options:
is_active: Set to true to activate the filter. Setting it to false deactivates the filter.
type: The type of the filter. The available options are:
- single: A structure is characterized by a single value for the variable (for example, the space group).
- multiple: A structure is characterized by a list of values for the variable (for example, the atomic species in the structure).
- multiple_lists: A structure is characterized by a list of values for the variable, where each value is itself a list (for example, the contact pairs in the structure, where each contact pair is described by the species of the central atom, the species of the contact atom, the type of the contact and a boolean stating whether the contact is in line of sight).
values: A list (or a list of lists) of the allowed values.
operator: The available options are:
- "or": The filter selects structures that have any of the declared values.
- "and": The filter selects structures that have all of the declared values.
refine_data: Set to true to refine the data for all the components in the structure based on the values of the filter.
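One possible reading of how the type and operator options combine is sketched below. This is an assumption-laden illustration, not the tool's implementation: `passes_filter` is a hypothetical helper, it treats every filter type other than single as a list of values, and it does not model refine_data.

```python
def passes_filter(structure_values, filter_options):
    """Check one structure against one filter.

    structure_values is a single value for "single" filters; otherwise it is
    a list of values (each value may itself be a list, e.g. a contact pair).
    """
    if not filter_options["is_active"]:
        return True  # an inactive filter accepts every structure

    if filter_options["type"] == "single":
        present = [structure_values]
    else:
        present = list(structure_values)

    # One flag per declared value: is it found in the structure?
    matches = [value in present for value in filter_options["values"]]
    if filter_options["operator"] == "and":
        return all(matches)
    return any(matches)


z_prime_filter = {"is_active": True, "type": "single",
                  "values": [1], "operator": "or"}
species_filter = {"is_active": True, "type": "multiple",
                  "values": ["C", "H", "N", "O"], "operator": "and"}
contact_filter = {"is_active": True, "type": "multiple_lists",
                  "values": [["O", "H", "hbond", True]], "operator": "or"}

# A structure with Z' = 1.
print(passes_filter(1, z_prime_filter))
# A structure containing only C, H and O; "and" also requires N.
print(passes_filter(["C", "H", "O"], species_filter))
# A structure with one O...H hydrogen bond in line of sight.
print(passes_filter([["O", "H", "hbond", True]], contact_filter))
```

Under this reading, "or" keeps a structure as soon as one declared value is found, while "and" requires every declared value to be present.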
plot_data_options: Details the plotting options:
individual_space_groups_plots: Set to true to create plots across all space groups and for each space group separately.
interactive: Set to true to create interactive *.html plots with the plotly package. (Currently this is the only supported option; a routine to generate publication-ready *.png plots is under development.)
percentiles: The options for calculating the KDE density for the 2D and 3D scatter plots. The value is a list of integers (or floats) representing the desired percentiles, followed by 3 booleans. The booleans activate, in order, the lowest percentile (10% in the example), the middle percentiles (25%, 50%, 75%) and the top percentile (90%). For the interactive *.html plots, it is recommended to set all three to true, since the interactive plots allow the different percentiles to be toggled on and off. For static *.png images, the booleans should be adjusted to include only the desired percentiles in the plots.
2D_scatter/3D_scatter: A list of the requested 2D/3D scatter plots. Each entry has the format [variable_1, variable_2, group_variable] / [variable_1, variable_2, variable_3, group_variable]. variable_1, variable_2 and variable_3 are the variables used for the scatter plot. The group_variable entry declares a variable used to group the data, which are then plotted separately for each value of the group variable. Setting group_variable to null generates a single plot for the full set of selected data. The group variable can take different values depending on the nature of variable_1, variable_2 and variable_3.
2D_scatter_marker/3D_scatter_marker: The marker for the data points (static images only). For the available options please refer to the official matplotlib documentation.
2D_scatter_facecolor/3D_scatter_facecolor: The marker face color for the data points (static images only). For the available options please refer to the official matplotlib documentation.
2D_scatter_edgecolor/3D_scatter_edgecolor: The marker edge color for the data points (static images only). For the available options please refer to the official matplotlib documentation.
2D_scatter_opacity/3D_scatter_opacity: The marker opacity for the data points (static images only). Can take a value in the range \([0,1]\).
histogram: A list of the requested histograms. Each entry has the format [variable, group_variable, fit_kde_curve]. The group_variable works in the same way as for the 2D/3D scatter plots. fit_kde_curve can be set to true to fit a KDE curve to the histogram data.
histogram_density: Setting to false plots the occurrences on the y axis. Setting to true plots the frequency.
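The percentiles option described above packs the percentile levels and three switches into one list. A minimal sketch of how such a value might be unpacked is given below. It is not the tool's code: `select_percentiles` is a hypothetical helper, and it assumes that "lowest" and "top" refer to the smallest and largest levels in the list.

```python
def select_percentiles(option):
    """Return the percentile levels that should be drawn.

    option has the form [levels, show_lowest, show_middle, show_top].
    """
    levels, show_lowest, show_middle, show_top = option
    levels = sorted(levels)

    selected = []
    if show_lowest:
        selected.append(levels[0])
    if show_middle:
        selected.extend(levels[1:-1])
    # Avoid adding the same level twice when only one level is given.
    if show_top and len(levels) > 1:
        selected.append(levels[-1])
    return selected


# All percentiles, as recommended for interactive plots.
print(select_percentiles([[10, 25, 50, 75, 90], True, True, True]))
# Only the middle percentiles, e.g. for a static image.
print(select_percentiles([[10, 25, 50, 75, 90], False, True, False]))
```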
List of available variables
The available variables are listed in the file variables.json located in the source_data folder. Currently, the algorithm supports 127 different variables grouped into 5 families (see details below). Details for each variable can be found in the Data Extraction Procedure section. Each variable is described by a dictionary entry in the following format:
"variable_name": {
    "latex_name": string,
    "html_name": string,
    "family": string,
    "path": [list of strings],
    "position_symmetry": [boolean,boolean,boolean,integer]
}
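To make the format concrete, the sketch below builds one entry and uses its path to look up a value. Everything in it is illustrative: the entry values, the layout of the structure dictionary and the helper `get_by_path` are hypothetical, and it assumes that path lists the nested dictionary keys leading to the value.

```python
from functools import reduce


def get_by_path(structure, path):
    """Follow a list of keys into a nested dictionary and return the value."""
    return reduce(lambda node, key: node[key], path, structure)


# Hypothetical variable entry (values chosen for illustration only).
variable = {
    "latex_name": r"$Z^{\prime}$",
    "html_name": "Z'",
    "family": "structure",
    "path": ["crystal", "z_prime"],
    "position_symmetry": [False, False, False, -1],
}

# Hypothetical structure dictionary with a matching layout.
structure = {"crystal": {"z_prime": 1, "space_group": "P21/c"}}

print(get_by_path(structure, variable["path"]))
```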
Key Descriptions
variable_name: The name of the variable. Currently 127 variables are supported.
latex_name: The name of the variable in LaTeX format, used to render static *.png images.
html_name: The name of the variable in html format, used to render interactive *.html plots.
family: The family of the variable. Currently the available variables are grouped into 5 different families based on the nature of the variable:
- structure variable family (27 variables): Includes all the variables related to the general characteristics of the structure.
str_id, space_group, z_crystal, z_prime, formula, species, cell_length_a, cell_length_b, cell_length_c, cell_length_a_sc, cell_length_b_sc, cell_length_c_sc, cell_angle_alpha, cell_angle_beta, cell_angle_gamma, cell_volume, cell_density, vdWFV, SAS, E_tot, E_el, E_vdW, E_vdW_at, E_vdW_rep, E_hb, E_hb_at, E_hb_rep
- fragment variable family (52 variables): Includes all the variables related to the general characteristics of the fragments in the structure.
fragment, fragment_x, fragment_y, fragment_z, fragment_u, fragment_v, fragment_w, fragment_e1_x, fragment_e1_y, fragment_e1_z, fragment_e1_u, fragment_e1_v, fragment_e1_w, fragment_w11_u, fragment_w11_v, fragment_w11_w, fragment_w12_u, fragment_w12_v, fragment_w12_w, fragment_w1_angle_1, fragment_w1_angle_2, fragment_e1_d_min, fragment_e2_x, fragment_e2_y, fragment_e2_z, fragment_e2_u, fragment_e2_v, fragment_e2_w, fragment_w21_u, fragment_w21_v, fragment_w21_w, fragment_w22_u, fragment_w22_v, fragment_w22_w, fragment_w2_angle_1, fragment_w2_angle_2, fragment_e2_d_min, fragment_e3_x, fragment_e3_y, fragment_e3_z, fragment_e3_u, fragment_e3_v, fragment_e3_w, fragment_w31_u, fragment_w31_v, fragment_w31_w, fragment_w32_u, fragment_w32_v, fragment_w32_w, fragment_w3_angle_1, fragment_w3_angle_2, fragment_e3_d_min
- fragment_atom variable family (14 variables): Includes all the variables related to the characteristics of the atoms in each fragment.
fragment_atom_species, fragment_atom_x, fragment_atom_y, fragment_atom_z, fragment_atom_u, fragment_atom_v, fragment_atom_w, fragment_atom_bv_x, fragment_atom_bv_y, fragment_atom_bv_z, fragment_atom_bv_u, fragment_atom_bv_v, fragment_atom_bv_w, fragment_atom_dzzp_min
- contact variable family (3 variables): Includes all the variables related to the general characteristics of the close contacts in the structure.
cc_length, cc_type, cc_is_in_los
- contact_atom variable family (31 variables): Includes all the variables related to the atoms forming the close contacts in the structure.
cc_central_atom_species, cc_central_atom_fragment, cc_central_atom_x, cc_central_atom_y, cc_central_atom_z, cc_central_atom_u, cc_central_atom_v, cc_central_atom_w, cc_central_atom_bv_x, cc_central_atom_bv_y, cc_central_atom_bv_z, cc_central_atom_ref_bv_x, cc_central_atom_ref_bv_y, cc_central_atom_ref_bv_z, cc_contact_atom_species, cc_contact_atom_fragment, cc_contact_atom_x, cc_contact_atom_y, cc_contact_atom_z, cc_contact_atom_u, cc_contact_atom_v, cc_contact_atom_w, cc_contact_atom_bv_x, cc_contact_atom_bv_y, cc_contact_atom_bv_z, cc_contact_atom_ref_bv_x, cc_contact_atom_ref_bv_y, cc_contact_atom_ref_bv_z, cc_contact_atom_ref_bv_r, cc_contact_atom_ref_bv_theta, cc_contact_atom_ref_bv_phi
path: List of strings pointing to the location of the value for each variable within each structure dictionary.
position_symmetry: The symmetry operations that are applied to obtain the complete set of values for a crystal. The first boolean declares whether a rotation operation is applied to the variable and is true only for \((x,y,z)\) or \((u,v,w)\) related coordinates. The second boolean is true when translational symmetry is applied, and the third is true for variables that are restricted to the limits of the unit cell (such as the fractional atomic coordinates). The fourth entry in the list is an integer declaring the group ID of the variable. If set to -1, the variable is not part of a group. If set to 0, the variable is a member of the structure geometry variables \((a,b,c,\alpha,\beta,\gamma,\Omega)\) that are required to apply coordinate transformations to any positional variable. If set to an integer \(>0\), the variable is part of a specific group of connected positional variables, such as the coordinates of an atom. There are 24 groups of variables:
1. ['cc_central_atom_x', 'cc_central_atom_y', 'cc_central_atom_z']
2. ['cc_central_atom_u', 'cc_central_atom_v', 'cc_central_atom_w']
3. ['cc_central_atom_bv_x', 'cc_central_atom_bv_y', 'cc_central_atom_bv_z']
4. ['cc_contact_atom_x', 'cc_contact_atom_y', 'cc_contact_atom_z']
5. ['cc_contact_atom_u', 'cc_contact_atom_v', 'cc_contact_atom_w']
6. ['cc_contact_atom_bv_x', 'cc_contact_atom_bv_y', 'cc_contact_atom_bv_z']
7. ['fragment_x', 'fragment_y', 'fragment_z']
8. ['fragment_u', 'fragment_v', 'fragment_w']
9. ['fragment_e1_x', 'fragment_e1_y', 'fragment_e1_z']
10. ['fragment_e1_u', 'fragment_e1_v', 'fragment_e1_w']
11. ['fragment_w11_u', 'fragment_w11_v', 'fragment_w11_w']
12. ['fragment_w12_u', 'fragment_w12_v', 'fragment_w12_w']
13. ['fragment_e2_x', 'fragment_e2_y', 'fragment_e2_z']
14. ['fragment_e2_u', 'fragment_e2_v', 'fragment_e2_w']
15. ['fragment_w21_u', 'fragment_w21_v', 'fragment_w21_w']
16. ['fragment_w22_u', 'fragment_w22_v', 'fragment_w22_w']
17. ['fragment_e3_x', 'fragment_e3_y', 'fragment_e3_z']
18. ['fragment_e3_u', 'fragment_e3_v', 'fragment_e3_w']
19. ['fragment_w31_u', 'fragment_w31_v', 'fragment_w31_w']
20. ['fragment_w32_u', 'fragment_w32_v', 'fragment_w32_w']
21. ['fragment_atom_x', 'fragment_atom_y', 'fragment_atom_z']
22. ['fragment_atom_u', 'fragment_atom_v', 'fragment_atom_w']
23. ['fragment_atom_bv_x', 'fragment_atom_bv_y', 'fragment_atom_bv_z']
24. ['fragment_atom_bv_u', 'fragment_atom_bv_v', 'fragment_atom_bv_w']
If a positional variable from the above lists is selected for display in any 2D/3D scatter plot, the algorithm adds the values of all the variables in the same group, as well as the variables in group 0, to the analysis data so that the necessary coordinate transformations can be performed.
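The group expansion described above can be sketched as follows. This is a simplified illustration rather than the tool's code: `expand_variables` is a hypothetical helper, the `group_ids` mapping stands in for the fourth entry of each variable's position_symmetry list, and the assignment of the cell lengths and angles to group 0 is an assumption based on the description of the structure geometry variables.

```python
def expand_variables(requested, group_ids):
    """Add group partners and group-0 variables for positional variables.

    group_ids maps each variable name to its group ID
    (-1: no group, 0: structure geometry, >0: positional group).
    """
    needed_groups = {group_ids[name] for name in requested if group_ids[name] > 0}
    if not needed_groups:
        return list(requested)  # no positional variable was requested

    needed_groups.add(0)  # geometry variables needed for transformations
    extra = [
        name
        for name, group in group_ids.items()
        if group in needed_groups and name not in requested
    ]
    return list(requested) + extra


# Small excerpt of group IDs; group 7 is taken from the list above,
# the group-0 members are assumed for illustration.
group_ids = {
    "cc_length": -1,
    "fragment_x": 7,
    "fragment_y": 7,
    "fragment_z": 7,
    "cell_length_a": 0,
    "cell_angle_alpha": 0,
}

print(expand_variables(["fragment_x", "cc_length"], group_ids))
print(expand_variables(["cc_length"], group_ids))
```

Requesting fragment_x pulls in fragment_y and fragment_z together with the group-0 variables, whereas requesting only the non-positional cc_length leaves the selection unchanged.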
Example usage of the filters
The filters are designed to facilitate detailed analysis of any of the available variables in refined sets of data, consistent with the needs of each user. Combining the filters correctly is crucial for analyzing the intended set of data. Below we provide examples of how to use the filters in different scenarios: