data_reader module
Module: data_reader.py
Provides RawDataReader for extracting raw NumPy arrays from an input HDF5 file generated by the StructureDataExtractor. Supports reading crystal parameters, atom data, bond data, and both inter‐ and intra‐molecular contacts/H‐bonds, with zero‐padding to fixed maximum dimensions.
Dependencies
h5py, numpy
- class data_reader.RawDataReader(h5_in)[source]
Bases: object
Read and pad raw per‐structure data from an HDF5 file for batch processing.
- h5_in
Open HDF5 file containing raw structure datasets under ‘/structures/<refcode>’.
- Type: h5py.File
- read_crystal_parameters(batch: List[str]) Dict[str, np.ndarray][source]
Read unit‐cell lengths, angles, and scalar crystal properties.
- read_atoms(batch: List[str], N_max: int) Dict[str, Any][source]
Read and pad per‐atom labels, symbols, coordinates, weights, charges, SYBYL types, neighbor lists, and mask.
- read_bonds(batch: List[str], max_bonds: int) Dict[str, Any][source]
Read and pad per‐bond endpoint indices, bond‐type strings, rotatability flags, cyclic flags, and bond lengths.
- read_intermolecular_contacts(batch: List[str], max_contacts: int) Dict[str, Any][source]
Read and pad raw inter‐molecular contact labels, symmetry ops, Cartesian/fractional coords, lengths, strengths, and in‐LOS flags.
- read_intermolecular_hbonds(batch: List[str], max_hbonds: int) Dict[str, Any][source]
Read and pad raw inter‐molecular H‐bond labels, symmetry ops, Cartesian/fractional coords, lengths, angles, and in‐LOS flags.
- read_intramolecular_contacts(batch: List[str], max_contacts: int) Dict[str, Any][source]
Read and pad intra‐molecular contact data.
- read_intramolecular_hbonds(batch: List[str], max_hbonds: int) Dict[str, Any][source]
Read and pad intra‐molecular H‐bond data.
- __init__(h5_in)[source]
Initialize RawDataReader.
- Parameters:
h5_in (h5py.File) – Open HDF5 file handle containing ‘/structures’ groups.
- read_crystal_parameters(batch)[source]
Read unit‐cell lengths, angles, and scalar crystal metrics for a batch.
- Parameters:
batch (List[str]) – Refcode strings for structures to read.
- Returns:
result –
{
  'cell_lengths': np.ndarray, shape (B,3), dtype float32,
  'cell_angles': np.ndarray, shape (B,3), dtype float32,
  'z_value': np.ndarray, shape (B,), dtype float32,
  'z_prime': np.ndarray, shape (B,), dtype float32,
  'cell_volume': np.ndarray, shape (B,), dtype float32,
  'cell_density': np.ndarray, shape (B,), dtype float32,
  'packing_coefficient': np.ndarray, shape (B,), dtype float32,
  'identifier': np.ndarray, shape (B,), dtype object,
  'space_group': np.ndarray, shape (B,), dtype object
}
- Return type: Dict[str, np.ndarray]
- read_atoms(batch, N_max)[source]
Read and pad per‐atom data for a batch of structures.
- Parameters:
batch (List[str]) – Refcode strings for structures to read.
N_max (int) – Maximum number of atoms for padding.
- Returns:
result –
{
  'n_atoms': List[int],
  'atom_label': List[List[str]],
  'atom_symbol': List[List[str]],
  'atom_number': np.ndarray, shape (B, N_max), dtype int32,
  'atom_coords': np.ndarray, shape (B, N_max, 3), dtype float32,
  'atom_frac_coords': np.ndarray, shape (B, N_max, 3), dtype float32,
  'atom_weight': np.ndarray, shape (B, N_max), dtype float32,
  'atom_charge': np.ndarray, shape (B, N_max), dtype float32,
  'atom_sybyl_type': List[List[str]],
  'atom_neighbour_list': List[List[str]],
  'atom_mask': np.ndarray, shape (B, N_max), dtype bool
}
- Return type: Dict[str, Any]
- read_bonds(batch, max_bonds)[source]
Read and pad per‐bond data for a batch of structures.
- Parameters:
batch (List[str]) – Refcode strings for structures to read.
max_bonds (int) – Maximum number of bonds for padding.
- Returns:
result –
{
  'n_bonds': List[int],
  'bond_id': List[List[str]],
  'bond_atom1': List[List[str]],
  'bond_atom2': List[List[str]],
  'bond_atom1_idx': np.ndarray, shape (B, max_bonds), dtype int32,
  'bond_atom2_idx': np.ndarray, shape (B, max_bonds), dtype int32,
  'bond_type': List[List[str]],
  'bond_is_rotatable_raw': List[List[bool]],
  'bond_is_cyclic': np.ndarray, shape (B, max_bonds), dtype bool,
  'bond_length': np.ndarray, shape (B, max_bonds), dtype float32,
  'bond_mask': np.ndarray, shape (B, max_bonds), dtype bool
}
- Return type: Dict[str, Any]
- read_intermolecular_contacts(batch, max_contacts)[source]
Read and pad raw intermolecular contact data.
- Parameters:
batch (List[str]) – Refcode strings for structures to read.
max_contacts (int) – Maximum number of contacts for padding.
- Returns:
result –
{
  'n_inter_cc': List[int],
  'inter_cc_id': List[List[str]],
  'inter_cc_central_atom': List[List[str]],
  'inter_cc_contact_atom': List[List[str]],
  'inter_cc_central_atom_idx': np.ndarray, shape (B, max_contacts), dtype int32,
  'inter_cc_contact_atom_idx': np.ndarray, shape (B, max_contacts), dtype int32,
  'inter_cc_symmetry': List[List[str]],
  'inter_cc_central_atom_coords': np.ndarray, shape (B, max_contacts, 3), dtype float32,
  'inter_cc_contact_atom_coords': np.ndarray, shape (B, max_contacts, 3), dtype float32,
  'inter_cc_central_atom_frac_coords': np.ndarray, shape (B, max_contacts, 3), dtype float32,
  'inter_cc_contact_atom_frac_coords': np.ndarray, shape (B, max_contacts, 3), dtype float32,
  'inter_cc_length': np.ndarray, shape (B, max_contacts), dtype float32,
  'inter_cc_strength': np.ndarray, shape (B, max_contacts), dtype float32,
  'inter_cc_in_los': np.ndarray, shape (B, max_contacts), dtype bool
}
- Return type: Dict[str, Any]
- read_intermolecular_hbonds(batch, max_hbonds)[source]
Read and pad raw intermolecular H‐bond data.
- Parameters:
batch (List[str]) – Refcode strings for structures to read.
max_hbonds (int) – Maximum number of hydrogen bonds for padding.
- Returns:
result –
{
  'n_inter_hb': List[int],
  'inter_hb_id': List[List[str]],
  'inter_hb_central_atom': List[List[str]],
  'inter_hb_hydrogen_atom': List[List[str]],
  'inter_hb_contact_atom': List[List[str]],
  'inter_hb_central_atom_idx': np.ndarray, shape (B, max_hbonds), dtype int32,
  'inter_hb_hydrogen_atom_idx': np.ndarray, shape (B, max_hbonds), dtype int32,
  'inter_hb_contact_atom_idx': np.ndarray, shape (B, max_hbonds), dtype int32,
  'inter_hb_symmetry': List[List[str]],
  'inter_hb_central_atom_coords': np.ndarray, shape (B, max_hbonds, 3), dtype float32,
  'inter_hb_hydrogen_atom_coords': np.ndarray, shape (B, max_hbonds, 3), dtype float32,
  'inter_hb_contact_atom_coords': np.ndarray, shape (B, max_hbonds, 3), dtype float32,
  'inter_hb_central_atom_frac_coords': np.ndarray, shape (B, max_hbonds, 3), dtype float32,
  'inter_hb_hydrogen_atom_frac_coords': np.ndarray, shape (B, max_hbonds, 3), dtype float32,
  'inter_hb_contact_atom_frac_coords': np.ndarray, shape (B, max_hbonds, 3), dtype float32,
  'inter_hb_length': np.ndarray, shape (B, max_hbonds), dtype float32,
  'inter_hb_angle': np.ndarray, shape (B, max_hbonds), dtype float32,
  'inter_hb_in_los': np.ndarray, shape (B, max_hbonds), dtype bool
}
- Return type: Dict[str, Any]
- read_intramolecular_contacts(batch, max_contacts)[source]
Read and pad raw intramolecular contact data.
- Parameters:
batch (List[str]) – Refcode strings for structures to read.
max_contacts (int) – Maximum number of contacts for padding.
- Returns:
result –
{
  'n_intra_cc': List[int],
  'intra_cc_id': List[List[str]],
  'intra_cc_central_atom': List[List[str]],
  'intra_cc_contact_atom': List[List[str]],
  'intra_cc_central_atom_idx': np.ndarray, shape (B, max_contacts), dtype int32,
  'intra_cc_contact_atom_idx': np.ndarray, shape (B, max_contacts), dtype int32,
  'intra_cc_central_atom_coords': np.ndarray, shape (B, max_contacts, 3), dtype float32,
  'intra_cc_contact_atom_coords': np.ndarray, shape (B, max_contacts, 3), dtype float32,
  'intra_cc_central_atom_frac_coords': np.ndarray, shape (B, max_contacts, 3), dtype float32,
  'intra_cc_contact_atom_frac_coords': np.ndarray, shape (B, max_contacts, 3), dtype float32,
  'intra_cc_length': np.ndarray, shape (B, max_contacts), dtype float32,
  'intra_cc_strength': np.ndarray, shape (B, max_contacts), dtype float32,
  'intra_cc_in_los': np.ndarray, shape (B, max_contacts), dtype bool
}
- Return type: Dict[str, Any]
- read_intramolecular_hbonds(batch, max_hbonds)[source]
Read and pad raw intramolecular H‐bond data.
- Parameters:
batch (List[str]) – Refcode strings for structures to read.
max_hbonds (int) – Maximum number of hydrogen bonds for padding.
- Returns:
result –
{
  'n_intra_hb': List[int],
  'intra_hb_id': List[List[str]],
  'intra_hb_central_atom': List[List[str]],
  'intra_hb_hydrogen_atom': List[List[str]],
  'intra_hb_contact_atom': List[List[str]],
  'intra_hb_central_atom_idx': np.ndarray, shape (B, max_hbonds), dtype int32,
  'intra_hb_hydrogen_atom_idx': np.ndarray, shape (B, max_hbonds), dtype int32,
  'intra_hb_contact_atom_idx': np.ndarray, shape (B, max_hbonds), dtype int32,
  'intra_hb_central_atom_coords': np.ndarray, shape (B, max_hbonds, 3), dtype float32,
  'intra_hb_hydrogen_atom_coords': np.ndarray, shape (B, max_hbonds, 3), dtype float32,
  'intra_hb_contact_atom_coords': np.ndarray, shape (B, max_hbonds, 3), dtype float32,
  'intra_hb_central_atom_frac_coords': np.ndarray, shape (B, max_hbonds, 3), dtype float32,
  'intra_hb_hydrogen_atom_frac_coords': np.ndarray, shape (B, max_hbonds, 3), dtype float32,
  'intra_hb_contact_atom_frac_coords': np.ndarray, shape (B, max_hbonds, 3), dtype float32,
  'intra_hb_length': np.ndarray, shape (B, max_hbonds), dtype float32,
  'intra_hb_angle': np.ndarray, shape (B, max_hbonds), dtype float32,
  'intra_hb_in_los': np.ndarray, shape (B, max_hbonds), dtype bool
}
- Return type: Dict[str, Any]
HDF5 Raw Data Reading and Batch Loading
The data_reader module provides efficient interfaces for reading raw crystallographic structure data from HDF5 files into memory for batch processing. It handles padding, type conversion, and organization of heterogeneous structural data into uniform arrays suitable for GPU processing.
Key Features:
Batch data loading - Read multiple structures simultaneously for efficient processing
Automatic padding - Uniform array dimensions for ragged data (atoms, bonds, contacts)
Type conversion - Convert HDF5 datasets to appropriate NumPy/PyTorch formats
Memory optimization - Efficient data loading patterns for large datasets
Error handling - Robust handling of missing data and format inconsistencies
Flexible interfaces - Support for different data types and structure organizations
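Batch loading presumes that the list of refcodes has been split into fixed-size groups before each group is handed to a reader method. The following is a minimal sketch of such a split; `iter_batches` is a hypothetical helper introduced here for illustration and is not part of data_reader.

```python
from typing import Iterator, List, Sequence


def iter_batches(refcodes: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `batch_size` refcodes (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(refcodes), batch_size):
        yield list(refcodes[start:start + batch_size])


# Three refcodes split into batches of two; the last batch is shorter.
batches = list(iter_batches(["AABBCC", "DDEEGG", "HHIIJJ"], batch_size=2))
```

Each yielded list can then be passed as the `batch` argument of the reader methods.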
RawDataReader Class
Primary Interface for Raw HDF5 Data Access
The RawDataReader class provides methods to read and organize raw structural data from HDF5 files created by the StructureDataExtractor. It handles the complexities of variable-length datasets and converts them into uniform arrays for batch processing.
Attributes:
h5_in (h5py.File) - Open HDF5 file handle containing structure data under ‘/structures/<refcode>’
Data Organization:
The reader expects HDF5 files with the following structure:
/structures/
├── <refcode1>/
│   ├── identifier
│   ├── cell_lengths, cell_angles
│   ├── atom_label, atom_symbol, atom_coords, ...
│   ├── bond_atom1_idx, bond_atom2_idx, ...
│   ├── inter_cc_*, inter_hb_*
│   └── intra_cc_*, intra_hb_*
├── <refcode2>/
└── ...
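To make the layout concrete, the sketch below builds a tiny in-memory HDF5 file with one structure group and reads it back. It assumes h5py 3.x (for `Dataset.asstr()`); the dataset values, the file name, and the choice to store `identifier` as a scalar string are illustrative assumptions, not a description of what StructureDataExtractor actually writes.

```python
import h5py
import numpy as np

# In-memory file: driver="core" with backing_store=False writes nothing to disk.
with h5py.File("example_raw.h5", "w", driver="core", backing_store=False) as h5:
    # Intermediate groups ("structures") are created automatically.
    grp = h5.create_group("structures/AABBCC")
    grp.create_dataset("identifier", data="AABBCC")  # made-up value
    grp.create_dataset("cell_lengths", data=np.array([5.0, 6.0, 7.0], dtype=np.float32))
    grp.create_dataset("cell_angles", data=np.array([90.0, 90.0, 90.0], dtype=np.float32))

    # Refcodes are the names of the child groups under /structures.
    refcodes = sorted(h5["structures"].keys())
    identifier = h5["structures/AABBCC/identifier"].asstr()[()]
    cell_lengths = h5["structures/AABBCC/cell_lengths"][()]
```

Listing the keys of `/structures` in this way is one option for obtaining the refcodes that make up a batch.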
- __init__(h5_in)[source]
Initialize RawDataReader.
- Parameters:
h5_in (h5py.File) – Open HDF5 file handle containing ‘/structures’ groups.
Initialize Raw Data Reader
- Parameters:
h5_in (h5py.File) - Open HDF5 file handle with ‘/structures’ groups
Usage Example:
import h5py
from data_reader import RawDataReader

# Open HDF5 file
with h5py.File('structures_raw.h5', 'r') as h5_file:
    reader = RawDataReader(h5_file)

    # Use reader methods...
    batch_refcodes = ['AABBCC', 'DDEEGG', 'HHIIJJ']
    crystal_data = reader.read_crystal_parameters(batch_refcodes)
Crystal Parameter Reading
- RawDataReader.read_crystal_parameters(batch)[source]
Load Unit Cell and Crystal Properties
Reads crystallographic unit cell parameters and derived properties for a batch of structures.
Parameters:
batch (List[str]) - Refcode strings for structures to read
Returns:
dict - Dictionary containing crystal-level data:
cell_lengths (np.ndarray, shape (B, 3)) - Unit cell lengths [a, b, c] in Ångstroms
cell_angles (np.ndarray, shape (B, 3)) - Unit cell angles [α, β, γ] in degrees
z_value (np.ndarray, shape (B,)) - Number of formula units per unit cell
z_prime (np.ndarray, shape (B,)) - Number of symmetry-independent molecules
cell_volume (np.ndarray, shape (B,)) - Unit cell volume in Å³
cell_density (np.ndarray, shape (B,)) - Crystal density in g/cm³
packing_coefficient (np.ndarray, shape (B,)) - Space-filling efficiency
identifier (np.ndarray, shape (B,)) - Structure identifiers
space_group (np.ndarray, shape (B,)) - Space group symbols
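The cell lengths and angles determine the cell volume through the standard triclinic formula V = abc·sqrt(1 − cos²α − cos²β − cos²γ + 2·cosα·cosβ·cosγ), which gives a quick consistency check on cell_volume. The sketch below applies it to arrays shaped like cell_lengths and cell_angles; `cell_volume` is a hypothetical helper, not part of data_reader, and the input values are made up.

```python
import numpy as np


def cell_volume(cell_lengths: np.ndarray, cell_angles_deg: np.ndarray) -> np.ndarray:
    """Triclinic cell volume for (B, 3) lengths and (B, 3) angles in degrees."""
    a, b, c = cell_lengths[:, 0], cell_lengths[:, 1], cell_lengths[:, 2]
    # Transpose so that each row holds cos(alpha), cos(beta) or cos(gamma) for the batch.
    cos_a, cos_b, cos_g = np.cos(np.radians(cell_angles_deg)).T
    factor = 1.0 - cos_a**2 - cos_b**2 - cos_g**2 + 2.0 * cos_a * cos_b * cos_g
    return a * b * c * np.sqrt(factor)


# Made-up cells: an orthorhombic 2 x 3 x 4 cell and a unit rhombic cell with gamma = 60 degrees.
lengths = np.array([[2.0, 3.0, 4.0], [1.0, 1.0, 1.0]])
angles = np.array([[90.0, 90.0, 90.0], [90.0, 90.0, 60.0]])
volumes = cell_volume(lengths, angles)
```

Comparing `volumes` against the stored cell_volume with `np.allclose` and a loose tolerance would flag structures whose cell parameters were read in the wrong order or units.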
Usage Example:
# Read crystal parameters for a batch
batch_refcodes = ['ALANIN', 'GLYCIN', 'BENZEN']
crystal_data = reader.read_crystal_parameters(batch_refcodes)

print(f"Unit cell volumes: {crystal_data['cell_volume']}")
print(f"Crystal densities: {crystal_data['cell_density']}")
print(f"Space groups: {crystal_data['space_group']}")

# Access individual structure data
for i, refcode in enumerate(batch_refcodes):
    print(f"{refcode}:")
    print(f"  a={crystal_data['cell_lengths'][i, 0]:.2f} Å")
    print(f"  b={crystal_data['cell_lengths'][i, 1]:.2f} Å")
    print(f"  c={crystal_data['cell_lengths'][i, 2]:.2f} Å")
    print(f"  α={crystal_data['cell_angles'][i, 0]:.1f}°")
    print(f"  β={crystal_data['cell_angles'][i, 1]:.1f}°")
    print(f"  γ={crystal_data['cell_angles'][i, 2]:.1f}°")
    print(f"  Volume: {crystal_data['cell_volume'][i]:.1f} Å³")
    print(f"  Density: {crystal_data['cell_density'][i]:.2f} g/cm³")
Atomic Data Reading
- RawDataReader.read_atoms(batch, N_max)[source]
Load Padded Atomic Data Arrays
Reads atomic coordinates, properties, and connectivity information with automatic padding to uniform dimensions.
Parameters:
batch (List[str]) - Refcode strings for structures to read
N_max (int) - Maximum number of atoms for padding
Returns:
dict - Dictionary containing atom-level data:
n_atoms (List[int]) - Number of atoms per structure
atom_label (List[List[str]]) - Atom labels per structure
atom_symbol (List[List[str]]) - Element symbols per structure
atom_number (np.ndarray, shape (B, N_max)) - Atomic numbers
atom_coords (np.ndarray, shape (B, N_max, 3)) - Cartesian coordinates
atom_frac_coords (np.ndarray, shape (B, N_max, 3)) - Fractional coordinates
atom_weight (np.ndarray, shape (B, N_max)) - Atomic masses
atom_charge (np.ndarray, shape (B, N_max)) - Formal charges
atom_sybyl_type (List[List[str]]) - SYBYL atom types
atom_neighbour_list (List[List[str]]) - Neighbour (connectivity) lists
atom_mask (np.ndarray, shape (B, N_max)) - Valid atom indicators
Padding Behavior:
Real atoms are placed at the beginning of each array
Padding slots are filled with zeros/empty strings
atom_mask indicates valid vs. padded positions
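The padding behavior described above can be sketched with plain NumPy. This is an illustration under stated assumptions, not the reader's actual implementation: `pad_batch` is a hypothetical helper, and truncating structures longer than `n_max` is an assumption made here.

```python
import numpy as np


def pad_batch(arrays, n_max):
    """Stack variable-length (n_i, 3) arrays into (B, n_max, 3) plus a (B, n_max) mask."""
    batch = np.zeros((len(arrays), n_max, 3), dtype=np.float32)
    mask = np.zeros((len(arrays), n_max), dtype=bool)
    for i, arr in enumerate(arrays):
        n = min(len(arr), n_max)  # assumption: inputs longer than n_max are truncated
        batch[i, :n] = arr[:n]    # real entries go at the beginning
        mask[i, :n] = True        # remaining slots stay zero / False
    return batch, mask


# Two made-up structures with 2 and 4 atoms, padded to 5 slots each.
coords = [np.ones((2, 3)), np.full((4, 3), 2.0)]
padded, mask = pad_batch(coords, n_max=5)
```

Downstream code can then restrict any reduction to real atoms with `padded[mask]` or by multiplying with the mask.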
Usage Example:
from collections import Counter

# Read atomic data with padding
max_atoms = 50  # Determined from dimension scanning
atom_data = reader.read_atoms(batch_refcodes, max_atoms)

print("Atomic data shapes:")
print(f"  Coordinates: {atom_data['atom_coords'].shape}")
print(f"  Masses: {atom_data['atom_weight'].shape}")
print(f"  Mask: {atom_data['atom_mask'].shape}")

# Analyze atomic composition
for i, refcode in enumerate(batch_refcodes):
    mask = atom_data['atom_mask'][i]
    n_atoms = int(mask.sum())
    symbols = atom_data['atom_symbol'][i][:n_atoms]
    print(f"{refcode}: {n_atoms} atoms")

    # Count elements
    element_counts = Counter(symbols)
    formula = ''.join(
        f"{elem}{count}" if count > 1 else elem
        for elem, count in sorted(element_counts.items())
    )
    print(f"  Formula: {formula}")
Bond Data Reading
- RawDataReader.read_bonds(batch, max_bonds)[source]
Load Molecular Bond Information
Reads bond connectivity, types, and properties with padding for batch processing.
Parameters:
batch (List[str]) - Refcode strings for structures to read
max_bonds (int) - Maximum number of bonds for padding
Returns:
dict - Dictionary containing bond-level data:
bond_atom1_idx (np.ndarray, shape (B, max_bonds)) - First atom indices
bond_atom2_idx (np.ndarray, shape (B, max_bonds)) - Second atom indices
bond_type (List[List[str]]) - Bond type strings (‘single’, ‘double’, etc.)
bond_is_rotatable_raw (List[List[bool]]) - Raw rotatability flags
bond_is_cyclic (np.ndarray, shape (B, max_bonds)) - Ring membership flags
bond_length (np.ndarray, shape (B, max_bonds)) - Bond lengths in Ångstroms
bond_mask (np.ndarray, shape (B, max_bonds)) - Valid bond indicators
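Because the bond index arrays point into the padded atom arrays, bond lengths can be recomputed from atom coordinates as a cross-check. The sketch below assumes that padded index slots hold an in-range value such as 0, which is not confirmed by the source; `bond_lengths_from_coords` is a hypothetical helper and the inputs are made up.

```python
import numpy as np


def bond_lengths_from_coords(atom_coords, idx1, idx2, bond_mask):
    """Recompute (B, max_bonds) bond lengths; padded bond slots are returned as 0."""
    # Gather endpoint coordinates: (B, N_max, 3) indexed by (B, max_bonds, 1).
    p1 = np.take_along_axis(atom_coords, idx1[..., None], axis=1)
    p2 = np.take_along_axis(atom_coords, idx2[..., None], axis=1)
    lengths = np.linalg.norm(p1 - p2, axis=-1)
    return np.where(bond_mask, lengths, 0.0)


# One made-up structure: atoms at the origin and at (3, 4, 0), one real bond, one padded slot.
atom_coords = np.array([[[0.0, 0.0, 0.0], [3.0, 4.0, 0.0], [0.0, 0.0, 0.0]]], dtype=np.float32)
idx1 = np.array([[0, 0]], dtype=np.int32)
idx2 = np.array([[1, 0]], dtype=np.int32)
bond_mask = np.array([[True, False]])
lengths = bond_lengths_from_coords(atom_coords, idx1, idx2, bond_mask)
```

With real data, the arguments would be `atom_coords` from read_atoms and `bond_atom1_idx`, `bond_atom2_idx`, and `bond_mask` from read_bonds for the same batch.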
Usage Example:
# Read bond data
max_bonds = 80  # From dimension scanning
bond_data = reader.read_bonds(batch_refcodes, max_bonds)

# Analyze bond statistics
for i, refcode in enumerate(batch_refcodes):
    mask = bond_data['bond_mask'][i]
    n_bonds = int(mask.sum())
    if n_bonds > 0:
        bond_types = bond_data['bond_type'][i][:n_bonds]
        bond_lengths = bond_data['bond_length'][i][mask]
        # bond_is_rotatable_raw is documented as a list of lists, so it cannot
        # be indexed with a boolean mask; slice it and count with built-in sum
        rotatable_flags = bond_data['bond_is_rotatable_raw'][i][:n_bonds]
        rotatable_bonds = sum(bool(flag) for flag in rotatable_flags)
        print(f"{refcode}: {n_bonds} bonds")
        print(f"  Types: {set(bond_types)}")
        print(f"  Length range: {bond_lengths.min():.2f}-{bond_lengths.max():.2f} Å")
        print(f"  Rotatable bonds: {rotatable_bonds}")
Contact Data Reading
- RawDataReader.read_intermolecular_contacts(batch, max_contacts)[source]
Load Intermolecular Contact Data
Reads intermolecular atomic contacts with symmetry operations and geometric properties.
Parameters:
batch (List[str]) - Refcode strings for structures to read
max_contacts (int) - Maximum number of contacts for padding
Returns:
dict - Dictionary containing intermolecular contact data:
n_inter_cc (List[int]) - Number of contacts per structure
inter_cc_id (List[List[str]]) - Contact identifiers
inter_cc_central_atom (List[List[str]]) - Central atom labels
inter_cc_contact_atom (List[List[str]]) - Contact atom labels
inter_cc_central_atom_idx (np.ndarray) - Central atom indices
inter_cc_contact_atom_idx (np.ndarray) - Contact atom indices
inter_cc_*_coords (np.ndarray) - Cartesian coordinates
inter_cc_*_frac_coords (np.ndarray) - Fractional coordinates
inter_cc_length (np.ndarray) - Contact distances
inter_cc_strength (np.ndarray) - Contact strength metrics
inter_cc_symmetry (List[List[str]]) - Symmetry operator strings
inter_cc_in_los (np.ndarray) - Line-of-sight flags
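The documented return dictionary for contacts lists per-structure counts (n_inter_cc) alongside the padded arrays. Where a boolean validity mask is needed, it can be derived from those counts, assuming real entries occupy the leading slots as they do for atoms. `mask_from_counts` is a hypothetical helper and the counts below are made up.

```python
import numpy as np


def mask_from_counts(counts, max_len):
    """Boolean (B, max_len) mask with the first counts[i] entries of row i set."""
    counts = np.asarray(counts)
    # Compare each slot position against the per-row count via broadcasting.
    return np.arange(max_len)[None, :] < counts[:, None]


n_inter_cc = [2, 0, 3]  # made-up contact counts for three structures
mask = mask_from_counts(n_inter_cc, max_len=4)
```

The resulting mask can be used like atom_mask or bond_mask, for example `contact_lengths[mask]` to select only real contacts.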
- RawDataReader.read_intermolecular_hbonds(batch, max_hbonds)[source]
Load Intermolecular Hydrogen Bond Data
Reads hydrogen bond interactions with donor-hydrogen-acceptor triplet information.
Parameters:
batch (List[str]) - Refcode strings for structures to read
max_hbonds (int) - Maximum number of hydrogen bonds for padding
Returns:
dict - Dictionary containing hydrogen bond data:
n_inter_hb (List[int]) - Number of hydrogen bonds per structure
inter_hb_id (List[List[str]]) - H-bond identifiers
inter_hb_central_atom (List[List[str]]) - Donor atom labels
inter_hb_hydrogen_atom (List[List[str]]) - Hydrogen atom labels
inter_hb_contact_atom (List[List[str]]) - Acceptor atom labels
inter_hb_*_idx (np.ndarray) - Atom indices for donor/H/acceptor
inter_hb_*_coords (np.ndarray) - Coordinates for all three atoms
inter_hb_length (np.ndarray) - Donor-acceptor distances
inter_hb_angle (np.ndarray) - Donor-H-acceptor angles
inter_hb_symmetry (List[List[str]]) - Symmetry operations
inter_hb_in_los (np.ndarray) - Line-of-sight flags
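The donor-H-acceptor angle can be recomputed from the three coordinate arrays as the angle at the hydrogen atom between the H→donor and H→acceptor vectors. The sketch below assumes that this is the convention behind inter_hb_angle, which the source does not state explicitly; `dha_angle_deg` is a hypothetical helper and the coordinates are made up.

```python
import numpy as np


def dha_angle_deg(donor, hydrogen, acceptor):
    """Angle at the hydrogen atom, in degrees, for (..., 3) coordinate arrays."""
    v1 = donor - hydrogen
    v2 = acceptor - hydrogen
    cos_theta = np.sum(v1 * v2, axis=-1) / (
        np.linalg.norm(v1, axis=-1) * np.linalg.norm(v2, axis=-1)
    )
    # Clip guards against values marginally outside [-1, 1] from rounding.
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))


# Two made-up triplets: a linear arrangement and a right angle.
donor = np.array([[-1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
hydrogen = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
acceptor = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
angles = dha_angle_deg(donor, hydrogen, acceptor)
```

Padded slots, where all three coordinates are zero, would produce a division by zero, so the function should be applied only to the valid entries of each structure.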
Usage Example:
# Read intermolecular interactions
max_contacts = 200
max_hbonds = 50
contact_data = reader.read_intermolecular_contacts(batch_refcodes, max_contacts)
hbond_data = reader.read_intermolecular_hbonds(batch_refcodes, max_hbonds)

# Analyze interaction patterns
for i, refcode in enumerate(batch_refcodes):
    # The documented return dictionaries provide per-structure counts
    # (n_inter_cc, n_inter_hb) rather than mask arrays
    n_contacts = contact_data['n_inter_cc'][i]
    n_hbonds = hbond_data['n_inter_hb'][i]
    print(f"{refcode}:")
    print(f"  Intermolecular contacts: {n_contacts}")
    print(f"  Hydrogen bonds: {n_hbonds}")
    if n_hbonds > 0:
        hb_lengths = hbond_data['inter_hb_length'][i][:n_hbonds]
        hb_angles = hbond_data['inter_hb_angle'][i][:n_hbonds]
        print(f"  H-bond lengths: {hb_lengths.mean():.2f} ± {hb_lengths.std():.2f} Å")
        print(f"  H-bond angles: {hb_angles.mean():.1f} ± {hb_angles.std():.1f}°")
Intramolecular Data Reading
- RawDataReader.read_intramolecular_contacts(batch, max_contacts)[source]
Read and pad raw intramolecular contact data.
- Parameters:
- Returns:
result –
- {
‘n_intra_cc’: List[int], ‘intra_cc_id’: List[List[str]], ‘intra_cc_central_atom’: List[List[str]], ‘intra_cc_contact_atom’: List[List[str]], ‘intra_cc_central_atom_idx’: np.ndarray, shape (B, max_contacts), dtype int32, ‘intra_cc_contact_atom_idx’: np.ndarray, shape (B, max_contacts), dtype int32, ‘intra_cc_central_atom_coords’: np.ndarray, shape (B, max_contacts, 3), dtype float32, ‘intra_cc_contact_atom_coords’: np.ndarray, shape (B, max_contacts, 3), dtype float32, ‘intra_cc_central_atom_frac_coords’: np.ndarray, shape (B, max_contacts, 3), dtype float32, ‘intra_cc_contact_atom_frac_coords’: np.ndarray, shape (B, max_contacts, 3), dtype float32, ‘intra_cc_length’: np.ndarray, shape (B, max_contacts), dtype float32, ‘intra_cc_strength’: np.ndarray, shape (B, max_contacts), dtype float32, ‘intra_cc_in_los’: np.ndarray, shape (B, max_contacts), dtype bool
}
- Return type:
Dict[str, Any]
Load Intramolecular Contact Data
Reads contacts within individual molecules, typically for conformational analysis.
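The return specification above lists per-structure counts ('n_intra_cc') rather than a mask array, so a boolean mask may need to be derived before computing statistics on the padded arrays. The following is a minimal sketch, assuming that valid entries occupy the leading slots of each row and that padding is trailing; the helper names counts_to_mask and masked_mean are hypothetical and not part of RawDataReader, and the data is synthetic.

```python
import numpy as np

def counts_to_mask(counts, max_len):
    """Return a (B, max_len) boolean mask with counts[i] leading True values in row i."""
    counts = np.asarray(counts, dtype=np.int64)
    # Clip in case a structure had more entries than the padded width.
    counts = np.clip(counts, 0, max_len)
    return np.arange(max_len)[None, :] < counts[:, None]

def masked_mean(values, mask):
    """Row-wise mean over valid entries; rows without valid entries give NaN."""
    n_valid = mask.sum(axis=1)
    totals = np.where(mask, values, 0.0).sum(axis=1)
    return np.where(n_valid > 0, totals / np.maximum(n_valid, 1), np.nan)

# Synthetic stand-in for a reader result with B=2 and max_contacts=4.
result = {
    "n_intra_cc": [3, 0],
    "intra_cc_length": np.array([[2.9, 3.1, 3.3, 0.0],
                                 [0.0, 0.0, 0.0, 0.0]], dtype=np.float32),
}
mask = counts_to_mask(result["n_intra_cc"], max_len=4)
mean_length = masked_mean(result["intra_cc_length"], mask)
```

Averaging the padded rows directly would pull the mean toward zero, which is why the mask is applied before the reduction.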
- RawDataReader.read_intramolecular_hbonds(batch, max_hbonds)[source]
Read and pad raw intramolecular H‐bond data.
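- Parameters:
batch (List[str]) – Refcodes of the structures to read.
max_hbonds (int) – Fixed number of H-bond slots per structure; shorter H-bond lists are zero-padded to this length.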
- Returns:
result –
- {
'n_intra_hb': List[int],
'intra_hb_id': List[List[str]],
'intra_hb_central_atom': List[List[str]],
'intra_hb_hydrogen_atom': List[List[str]],
'intra_hb_contact_atom': List[List[str]],
'intra_hb_central_atom_idx': np.ndarray, shape (B, max_hbonds), dtype int32,
'intra_hb_hydrogen_atom_idx': np.ndarray, shape (B, max_hbonds), dtype int32,
'intra_hb_contact_atom_idx': np.ndarray, shape (B, max_hbonds), dtype int32,
'intra_hb_central_atom_coords': np.ndarray, shape (B, max_hbonds, 3), dtype float32,
'intra_hb_hydrogen_atom_coords': np.ndarray, shape (B, max_hbonds, 3), dtype float32,
'intra_hb_contact_atom_coords': np.ndarray, shape (B, max_hbonds, 3), dtype float32,
'intra_hb_central_atom_frac_coords': np.ndarray, shape (B, max_hbonds, 3), dtype float32,
'intra_hb_hydrogen_atom_frac_coords': np.ndarray, shape (B, max_hbonds, 3), dtype float32,
'intra_hb_contact_atom_frac_coords': np.ndarray, shape (B, max_hbonds, 3), dtype float32,
'intra_hb_length': np.ndarray, shape (B, max_hbonds), dtype float32,
'intra_hb_angle': np.ndarray, shape (B, max_hbonds), dtype float32,
'intra_hb_in_los': np.ndarray, shape (B, max_hbonds), dtype bool
}
- Return type:
Dict[str, Any]
Load Intramolecular Hydrogen Bond Data
Reads hydrogen bonds within individual molecules for internal structure analysis.
Usage Pattern:
# Read intramolecular interactions
intra_contacts = reader.read_intramolecular_contacts(batch_refcodes, max_intra_contacts)
intra_hbonds = reader.read_intramolecular_hbonds(batch_refcodes, max_intra_hbonds)
# These follow the same data structure as intermolecular versions
# but analyze internal molecular geometry
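Once the padded H-bond arrays are loaded, a common next step is to keep only geometrically plausible bonds. The sketch below is one way this might look; it assumes lengths are in Å, angles are in degrees, and valid entries occupy the leading slots of each row. The function select_hbonds and its default thresholds are hypothetical illustrations, not values taken from the module, and the arrays are synthetic.

```python
import numpy as np

def select_hbonds(lengths, angles, in_los, counts, max_length=2.5, min_angle=120.0):
    """Boolean (B, max_hbonds) selection of valid, in-line-of-sight H-bonds
    that are short enough and sufficiently linear. Thresholds are illustrative."""
    lengths = np.asarray(lengths)
    angles = np.asarray(angles)
    in_los = np.asarray(in_los, dtype=bool)
    max_hbonds = lengths.shape[1]
    # Rebuild validity from the per-structure counts (padding assumed trailing).
    valid = np.arange(max_hbonds)[None, :] < np.asarray(counts)[:, None]
    return valid & in_los & (lengths <= max_length) & (angles >= min_angle)

# Synthetic batch: B=2 structures, max_hbonds=3.
intra_hbonds = {
    "n_intra_hb": [2, 1],
    "intra_hb_length": np.array([[1.9, 2.8, 0.0], [2.1, 0.0, 0.0]], dtype=np.float32),
    "intra_hb_angle": np.array([[165.0, 150.0, 0.0], [110.0, 0.0, 0.0]], dtype=np.float32),
    "intra_hb_in_los": np.array([[True, True, False], [True, False, False]]),
}
selected = select_hbonds(
    intra_hbonds["intra_hb_length"],
    intra_hbonds["intra_hb_angle"],
    intra_hbonds["intra_hb_in_los"],
    intra_hbonds["n_intra_hb"],
)
n_selected = selected.sum(axis=1)
```

Including the validity mask in the selection matters because zero-padded slots have length 0.0 and would otherwise pass a pure length cutoff.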
Advanced Usage Patterns
Efficient Batch Processing
import h5py
from data_reader import RawDataReader

def process_large_dataset_efficiently(h5_file_path, batch_size=32):
    """Process large datasets with memory-efficient batching."""
    with h5py.File(h5_file_path, 'r') as h5_file:
        reader = RawDataReader(h5_file)
        # Get all refcodes
        all_refcodes = list(h5_file['structures'].keys())
        n_structures = len(all_refcodes)
        print(f"Processing {n_structures} structures in batches of {batch_size}")
        # Process in batches
        results = []
        for start in range(0, n_structures, batch_size):
            end = min(start + batch_size, n_structures)
            batch_refcodes = all_refcodes[start:end]
            print(f"Processing batch {start//batch_size + 1}: structures {start+1}-{end}")
            # Read data for this batch (read_atoms takes N_max, not max_atoms)
            crystal_data = reader.read_crystal_parameters(batch_refcodes)
            atom_data = reader.read_atoms(batch_refcodes, N_max=100)
            # analyze_batch is a placeholder for the caller's own analysis function
            batch_results = analyze_batch(crystal_data, atom_data)
            results.extend(batch_results)
    return results
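The slicing logic in the loop above can be factored into a small generator so that it can be reused and checked without an HDF5 file. This is a minimal sketch; iter_batches is a hypothetical helper name and the refcodes are made up.

```python
def iter_batches(refcodes, batch_size):
    """Yield (batch_number, refcode_slice) pairs; the last batch may be shorter."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(refcodes), batch_size):
        yield start // batch_size + 1, refcodes[start:start + batch_size]

# Example with five made-up refcodes and a batch size of two.
batches = list(iter_batches(["AAA", "BBB", "CCC", "DDD", "EEE"], 2))
```

Each yielded slice can then be passed to the reader methods in place of batch_refcodes.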
Data Quality Assessment
def assess_data_quality(reader, batch_refcodes, max_atoms, max_bonds):
    """Assess quality and completeness of structural data."""
    crystal_data = reader.read_crystal_parameters(batch_refcodes)
    atom_data = reader.read_atoms(batch_refcodes, max_atoms)
    bond_data = reader.read_bonds(batch_refcodes, max_bonds)
    quality_report = {}
    for i, refcode in enumerate(batch_refcodes):
        # Check basic completeness
        n_atoms = atom_data['atom_mask'][i].sum()
        n_bonds = bond_data['bond_mask'][i].sum()
        # Check for reasonable values
        cell_volume = crystal_data['cell_volume'][i]
        density = crystal_data['cell_density'][i]
        # Assess quality flags
        quality_flags = []
        if n_atoms < 5:
            quality_flags.append("too_few_atoms")
        if n_atoms > max_atoms * 0.9:
            quality_flags.append("near_padding_limit")
        if cell_volume < 100 or cell_volume > 10000:
            quality_flags.append("unusual_volume")
        if density < 0.5 or density > 5.0:
            quality_flags.append("unusual_density")
        if n_bonds < n_atoms - 1:
            quality_flags.append("disconnected_structure")
        quality_report[refcode] = {
            'n_atoms': n_atoms,
            'n_bonds': n_bonds,
            'cell_volume': cell_volume,
            'density': density,
            'quality_flags': quality_flags,
            'quality_score': len(quality_flags)  # Lower is better
        }
    return quality_report
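A per-structure report like the one returned above is often easier to act on once it is aggregated across the batch. The following sketch, which assumes only the 'quality_flags' key shown above, counts how often each flag occurs and lists the structures with no flags; summarize_quality is a hypothetical helper and the sample report is made up.

```python
from collections import Counter

def summarize_quality(quality_report):
    """Return (flag frequency Counter, sorted refcodes without any flags)."""
    flag_counts = Counter(
        flag
        for entry in quality_report.values()
        for flag in entry["quality_flags"]
    )
    clean = sorted(
        refcode for refcode, entry in quality_report.items()
        if not entry["quality_flags"]
    )
    return flag_counts, clean

# Made-up report with the same layout as the output of assess_data_quality.
sample_report = {
    "AAA": {"quality_flags": []},
    "BBB": {"quality_flags": ["unusual_volume", "near_padding_limit"]},
    "CCC": {"quality_flags": ["unusual_volume"]},
}
flag_counts, clean_refcodes = summarize_quality(sample_report)
```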
Integration with Processing Pipeline
import h5py
import numpy as np
import torch
from data_reader import RawDataReader

def integrated_data_loading(h5_file_path, batch_refcodes, dimensions):
    """Complete data loading for processing pipeline."""
    with h5py.File(h5_file_path, 'r') as h5_file:
        reader = RawDataReader(h5_file)
        # Load all required data types
        crystal_data = reader.read_crystal_parameters(batch_refcodes)
        atom_data = reader.read_atoms(batch_refcodes, dimensions['atoms'])
        bond_data = reader.read_bonds(batch_refcodes, dimensions['bonds'])
        contact_data = reader.read_intermolecular_contacts(
            batch_refcodes, dimensions['contacts_inter']
        )
        hbond_data = reader.read_intermolecular_hbonds(
            batch_refcodes, dimensions['hbonds_inter']
        )
        # Convert to tensors for GPU processing
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        # Convert numeric arrays to tensors; string lists stay as Python lists
        for key, value in crystal_data.items():
            if isinstance(value, np.ndarray) and value.dtype.kind in 'if':
                crystal_data[key] = torch.from_numpy(value).to(device)
        for key, value in atom_data.items():
            if isinstance(value, np.ndarray) and value.dtype.kind in 'ifb':
                atom_data[key] = torch.from_numpy(value).to(device)
        # Similar conversion for other data types...
        return {
            'crystal': crystal_data,
            'atoms': atom_data,
            'bonds': bond_data,
            'contacts': contact_data,
            'hbonds': hbond_data
        }
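The dtype check used in the conversion loops above can be isolated into a helper that separates convertible arrays from label lists, which makes the "similar conversion for other data types" step uniform. This is a sketch using NumPy only; split_numeric is a hypothetical name, and the 'atom_label' key in the sample is an assumed example of a non-numeric field.

```python
import numpy as np

def split_numeric(data, kinds="ifb"):
    """Split a reader result into (numeric arrays, everything else).

    kinds lists the NumPy dtype.kind codes treated as convertible:
    'i' signed integer, 'f' floating point, 'b' boolean.
    """
    numeric, other = {}, {}
    for key, value in data.items():
        if isinstance(value, np.ndarray) and value.dtype.kind in kinds:
            numeric[key] = value
        else:
            other[key] = value
    return numeric, other

# Synthetic stand-in for a read_atoms result (B=2, N_max=4).
sample = {
    "atom_coords": np.zeros((2, 4, 3), dtype=np.float32),
    "atom_mask": np.zeros((2, 4), dtype=bool),
    "atom_label": [["C1", "O1"], ["N1"]],  # assumed key holding string labels
}
numeric, other = split_numeric(sample)
```

The numeric dictionary can then be passed through torch.from_numpy in a single loop, while the other dictionary is carried along unchanged. Unsigned integer arrays (kind 'u') would need to be added to kinds explicitly.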
Error Handling and Diagnostics
Missing Data Handling
def robust_data_reading(reader, batch_refcodes, dimensions):
    """Robust data reading with error handling."""
    successful_reads = []
    failed_reads = []
    for refcode in batch_refcodes:
        try:
            # Try to read each structure individually first
            crystal_data = reader.read_crystal_parameters([refcode])
            atom_data = reader.read_atoms([refcode], dimensions['atoms'])
            # Check for required fields
            required_fields = ['cell_lengths', 'cell_angles', 'atom_coords']
            for field in required_fields:
                if field not in crystal_data and field not in atom_data:
                    raise KeyError(f"Missing required field: {field}")
            successful_reads.append(refcode)
        except Exception as e:
            print(f"Warning: Failed to read {refcode}: {e}")
            failed_reads.append((refcode, str(e)))
    if failed_reads:
        print(f"Failed to read {len(failed_reads)} structures:")
        for refcode, error in failed_reads:
            print(f"  {refcode}: {error}")
    # Process only successful reads
    # (load_successful_structures is a placeholder for the caller's own loader)
    if successful_reads:
        return load_successful_structures(reader, successful_reads, dimensions)
    else:
        raise ValueError("No structures could be read successfully")
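Before attempting any reads, it can be cheaper to check which refcodes exist at all. The sketch below partitions a batch by membership; partition_refcodes is a hypothetical helper. It works with any container that supports the `in` operator, which includes h5py groups, so `h5_file['structures']` could be passed as the first argument. A plain dictionary stands in for the group here.

```python
def partition_refcodes(structures, batch):
    """Split batch into (present, missing) by membership in structures."""
    present = [refcode for refcode in batch if refcode in structures]
    missing = [refcode for refcode in batch if refcode not in structures]
    return present, missing

# A dict standing in for the '/structures' group of an open HDF5 file.
fake_structures = {"AAA": None, "CCC": None}
present, missing = partition_refcodes(fake_structures, ["AAA", "BBB", "CCC"])
```

Only the present list then needs to go through the per-structure try/except loop, and the missing list can be reported separately.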
Data Validation
import numpy as np

def validate_loaded_data(data_dict):
    """Validate loaded data for consistency and completeness."""
    print("Data Validation Report:")
    # Check crystal data
    crystal_data = data_dict['crystal']
    n_structures = len(crystal_data['cell_lengths'])
    print(f"  Loaded {n_structures} structures")
    # Validate cell parameters
    lengths = crystal_data['cell_lengths']
    angles = crystal_data['cell_angles']
    if np.any(lengths <= 0):
        print("  Warning: Non-positive cell lengths detected")
    if np.any((angles <= 0) | (angles >= 180)):
        print("  Warning: Invalid cell angles detected")
    # Check atomic data consistency
    atom_data = data_dict['atoms']
    atom_coords = atom_data['atom_coords']
    atom_mask = atom_data['atom_mask']
    # Verify mask consistency
    for i in range(n_structures):
        # Cast to bool so the mask selects rows instead of indexing by value
        mask = np.asarray(atom_mask[i]).astype(bool)
        valid_coords = atom_coords[i][mask]
        if np.any(np.isnan(valid_coords)):
            print(f"  Warning: NaN coordinates in structure {i}")
        if np.any(np.abs(valid_coords) > 1000):
            print(f"  Warning: Unusually large coordinates in structure {i}")
    print("  Validation complete")
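A further consistency check is to recompute the unit-cell volume from the cell lengths and angles and compare it with the stored value. The sketch below uses the standard triclinic volume formula, V = abc·sqrt(1 − cos²α − cos²β − cos²γ + 2·cosα·cosβ·cosγ). It assumes 'cell_lengths' and 'cell_angles' are arrays of shape (B, 3) with angles in degrees, which the source does not state explicitly; cell_volume_from_parameters is a hypothetical helper name.

```python
import numpy as np

def cell_volume_from_parameters(lengths, angles_deg):
    """Triclinic cell volume for (B, 3) lengths and (B, 3) angles in degrees."""
    a, b, c = np.asarray(lengths, dtype=np.float64).T
    cos_a, cos_b, cos_g = np.cos(np.radians(np.asarray(angles_deg, dtype=np.float64))).T
    term = 1.0 - cos_a**2 - cos_b**2 - cos_g**2 + 2.0 * cos_a * cos_b * cos_g
    # Clip guards against tiny negative values from rounding in degenerate cells.
    return a * b * c * np.sqrt(np.clip(term, 0.0, None))

# Synthetic cells: a 10 Å cube and a monoclinic cell with beta = 100°.
lengths = np.array([[10.0, 10.0, 10.0], [5.0, 6.0, 7.0]])
angles = np.array([[90.0, 90.0, 90.0], [90.0, 100.0, 90.0]])
volumes = cell_volume_from_parameters(lengths, angles)
```

A large relative difference between the recomputed and stored volumes for a structure would suggest inconsistent cell parameters in the input file.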
Performance Optimization
Memory-Efficient Loading
import gc
import h5py
import psutil
from data_reader import RawDataReader

def memory_efficient_data_loading(h5_file_path, total_structures, batch_size=32):
    """Load data with careful memory management."""
    def get_memory_usage():
        return psutil.Process().memory_info().rss / 1024 / 1024  # MB
    initial_memory = get_memory_usage()
    print(f"Initial memory usage: {initial_memory:.1f} MB")
    with h5py.File(h5_file_path, 'r') as h5_file:
        reader = RawDataReader(h5_file)
        all_refcodes = list(h5_file['structures'].keys())[:total_structures]
        results = []
        for start in range(0, len(all_refcodes), batch_size):
            batch_refcodes = all_refcodes[start:start+batch_size]
            # Load batch data
            crystal_data = reader.read_crystal_parameters(batch_refcodes)
            # Process immediately to avoid memory accumulation
            # (process_crystal_data is a placeholder for the caller's own function)
            batch_results = process_crystal_data(crystal_data)
            results.extend(batch_results)
            # Clear references and force garbage collection
            del crystal_data
            gc.collect()
            current_memory = get_memory_usage()
            print(f"Batch {start//batch_size + 1}: {current_memory:.1f} MB "
                  f"(+{current_memory - initial_memory:.1f} MB)")
    return results
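Because every array is padded to a fixed width, the memory footprint of a batch can be estimated before any data is read, which helps when choosing batch_size. The following is a rough sketch: padded_nbytes is a hypothetical helper, the shapes and dtypes are taken from the intramolecular contact return specification, and max_contacts=200 is just an example value. It ignores the Python lists of strings and any allocator overhead.

```python
import numpy as np

def padded_nbytes(batch_size, field_specs):
    """Sum the bytes of padded arrays given {name: (per_structure_shape, dtype)}."""
    total = 0
    for per_structure_shape, dtype in field_specs.values():
        n_items = int(np.prod(per_structure_shape, dtype=np.int64))
        total += batch_size * n_items * np.dtype(dtype).itemsize
    return total

# Three of the documented contact fields with max_contacts = 200.
specs = {
    "intra_cc_central_atom_coords": ((200, 3), np.float32),
    "intra_cc_length": ((200,), np.float32),
    "intra_cc_in_los": ((200,), np.bool_),
}
estimate = padded_nbytes(32, specs)  # bytes for a batch of 32 structures
```

The estimate scales linearly with both batch_size and the padding width, so halving either one halves the array memory for that field.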
Cross-References
Related CSA Modules:
data_writer module - Writing processed data back to HDF5
dimension_scanner module - Determining optimal padding dimensions
structure_data_extractor module - Creating raw HDF5 files
structure_post_extraction_processor module - Using loaded data for processing
External Dependencies:
HDF5 - Hierarchical data format
h5py - Python interface to HDF5
NumPy - Array operations and data structures
File Format References:
HDF5 User Guide: https://docs.hdfgroup.org/hdf5/develop/_u_g.html
h5py Documentation: https://docs.h5py.org/en/stable/