data_reader module

Module: data_reader.py

Provides RawDataReader for extracting raw NumPy arrays from an input HDF5 file generated by the StructureDataExtractor. Supports reading crystal parameters, atom data, bond data, and both inter‐ and intra‐molecular contacts/H‐bonds, with zero‐padding to fixed maximum dimensions.

Dependencies

h5py
numpy

class data_reader.RawDataReader(h5_in)

Bases: object

Read and pad raw per‐structure data from an HDF5 file for batch processing.

h5_in

Open HDF5 file containing raw structure datasets under ‘/structures/<refcode>’.

Type: h5py.File

read_crystal_parameters(batch: List[str]) → Dict[str, np.ndarray]

Read unit‐cell lengths, angles, and scalar crystal properties.

read_atoms(batch: List[str], N_max: int) → Dict[str, Any]

Read and pad per‐atom labels, symbols, coordinates, weights, charges, SYBYL types, neighbor lists, and mask.

read_bonds(batch: List[str], max_bonds: int) → Dict[str, Any]

Read and pad per‐bond endpoint indices, bond‐type strings, rotatability flags, cyclic flags, and bond lengths.

read_intermolecular_contacts(batch: List[str], max_contacts: int) → Dict[str, Any]

Read and pad raw inter‐molecular contact labels, symmetry ops, Cartesian/fractional coords, lengths, strengths, and in‐LOS flags.

read_intermolecular_hbonds(batch: List[str], max_hbonds: int) → Dict[str, Any]

Read and pad raw inter‐molecular H‐bond labels, symmetry ops, Cartesian/fractional coords, lengths, angles, and in‐LOS flags.

read_intramolecular_contacts(batch: List[str], max_contacts: int) → Dict[str, Any]

Read and pad intra‐molecular contact data.

read_intramolecular_hbonds(batch: List[str], max_hbonds: int) → Dict[str, Any]

Read and pad intra‐molecular H‐bond data.

__init__(h5_in)

Initialize RawDataReader.

Parameters:

h5_in (h5py.File) – Open HDF5 file handle containing ‘/structures’ groups.

read_crystal_parameters(batch)

Read unit‐cell lengths, angles, and scalar crystal metrics for a batch.

Parameters:

batch (List[str]) – Refcode strings for structures to read.

Returns:

result – Dictionary of batch arrays, where B is the number of refcodes in batch:

{
    'cell_lengths': np.ndarray, shape (B, 3), dtype float32,
    'cell_angles': np.ndarray, shape (B, 3), dtype float32,
    'z_value': np.ndarray, shape (B,), dtype float32,
    'z_prime': np.ndarray, shape (B,), dtype float32,
    'cell_volume': np.ndarray, shape (B,), dtype float32,
    'cell_density': np.ndarray, shape (B,), dtype float32,
    'packing_coefficient': np.ndarray, shape (B,), dtype float32,
    'identifier': np.ndarray, shape (B,), dtype object,
    'space_group': np.ndarray, shape (B,), dtype object
}

Return type:

dict
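As a rough illustration of this return layout, the sketch below stacks per-structure values into arrays whose first axis is the batch. The `records` dict, its field names, the refcodes, and the helper `stack_crystal_parameters` are all hypothetical stand-ins for the HDF5 groups; this is not the module's implementation, and only a subset of the keys is shown.

```python
import numpy as np

# Hypothetical per-structure records standing in for '/structures/<refcode>' groups.
records = {
    "ABCDEF": {"cell_lengths": [5.0, 6.0, 7.0], "cell_angles": [90.0, 95.0, 90.0],
               "z_value": 4, "space_group": "P21/c"},
    "GHIJKL": {"cell_lengths": [8.0, 8.0, 12.0], "cell_angles": [90.0, 90.0, 90.0],
               "z_value": 2, "space_group": "P-1"},
}

def stack_crystal_parameters(records, batch):
    """Stack per-structure values into arrays of shape (B, ...)."""
    B = len(batch)
    out = {
        "cell_lengths": np.zeros((B, 3), dtype=np.float32),
        "cell_angles": np.zeros((B, 3), dtype=np.float32),
        "z_value": np.zeros((B,), dtype=np.float32),
        "identifier": np.empty((B,), dtype=object),
        "space_group": np.empty((B,), dtype=object),
    }
    for i, refcode in enumerate(batch):
        rec = records[refcode]
        out["cell_lengths"][i] = rec["cell_lengths"]
        out["cell_angles"][i] = rec["cell_angles"]
        out["z_value"][i] = rec["z_value"]
        out["identifier"][i] = refcode           # strings are kept as Python objects
        out["space_group"][i] = rec["space_group"]
    return out

params = stack_crystal_parameters(records, ["ABCDEF", "GHIJKL"])
```

Row i of every array corresponds to batch[i], so the string arrays can be used to look up which structure a numeric row belongs to.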

read_atoms(batch, N_max)

Read and pad per‐atom data for a batch of structures.

Parameters:

batch (List[str]) – Refcode strings to read from.
N_max (int) – Maximum atom count for padding.

Returns:

result –

{
    'n_atoms': List[int],
    'atom_label': List[List[str]],
    'atom_symbol': List[List[str]],
    'atom_number': np.ndarray, shape (B, N_max), dtype int32,
    'atom_coords': np.ndarray, shape (B, N_max, 3), dtype float32,
    'atom_frac_coords': np.ndarray, shape (B, N_max, 3), dtype float32,
    'atom_weight': np.ndarray, shape (B, N_max), dtype float32,
    'atom_charge': np.ndarray, shape (B, N_max), dtype float32,
    'atom_sybyl_type': List[List[str]],
    'atom_neighbour_list': List[List[str]],
    'atom_mask': np.ndarray, shape (B, N_max), dtype bool
}

Return type:

dict
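The zero-padding and mask described above can be sketched as follows. This is a minimal illustration, not the module's code: the helper name `pad_atom_coords` is hypothetical, and truncating structures with more than N_max atoms is an assumption, since the behaviour for oversized structures is not documented here.

```python
import numpy as np

def pad_atom_coords(coords_per_structure, N_max):
    """Zero-pad ragged (n_i, 3) coordinate arrays to (B, N_max, 3) and build a mask."""
    B = len(coords_per_structure)
    atom_coords = np.zeros((B, N_max, 3), dtype=np.float32)
    atom_mask = np.zeros((B, N_max), dtype=bool)
    n_atoms = []
    for i, coords in enumerate(coords_per_structure):
        coords = np.asarray(coords, dtype=np.float32).reshape(-1, 3)
        n = min(len(coords), N_max)  # assumption: extra atoms are truncated
        atom_coords[i, :n] = coords[:n]
        atom_mask[i, :n] = True      # True marks real atoms, False marks padding
        n_atoms.append(n)
    return {"n_atoms": n_atoms, "atom_coords": atom_coords, "atom_mask": atom_mask}

# Two toy structures with 2 and 4 atoms, padded to N_max = 5.
ragged = [np.ones((2, 3)), np.full((4, 3), 2.0)]
padded = pad_atom_coords(ragged, N_max=5)

# Mask-aware centroid: padded rows contribute nothing to the sum or the count.
m = padded["atom_mask"][..., None]
centroids = (padded["atom_coords"] * m).sum(axis=1) / m.sum(axis=1)
```

Any reduction over the atom axis should go through the mask in this way, because padded rows hold zeros that would otherwise bias means and sums.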

read_bonds(batch, max_bonds)

Read and pad per‐bond data for a batch of structures.

Parameters:

batch (List[str]) – Refcode strings to read bonds from.
max_bonds (int) – Maximum bond count for padding.

Returns:

result –

{
    'n_bonds': List[int],
    'bond_id': List[List[str]],
    'bond_atom1': List[List[str]],
    'bond_atom2': List[List[str]],
    'bond_atom1_idx': np.ndarray, shape (B, max_bonds), dtype int32,
    'bond_atom2_idx': np.ndarray, shape (B, max_bonds), dtype int32,
    'bond_type': List[List[str]],
    'bond_is_rotatable_raw': List[List[bool]],
    'bond_is_cyclic': np.ndarray, shape (B, max_bonds), dtype bool,
    'bond_length': np.ndarray, shape (B, max_bonds), dtype float32,
    'bond_mask': np.ndarray, shape (B, max_bonds), dtype bool
}

Return type:

dict
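One plausible relationship between the label lists and the index arrays is sketched below for a single structure. It assumes that the *_idx values are positions in that structure's atom_label list, which is not confirmed by this page; the helper `bond_indices` and the toy labels are hypothetical.

```python
import numpy as np

def bond_indices(atom_labels, bond_atom1, bond_atom2, max_bonds):
    """Map bond endpoint labels to atom positions and zero-pad to max_bonds."""
    label_to_idx = {label: i for i, label in enumerate(atom_labels)}
    idx1 = np.zeros(max_bonds, dtype=np.int32)
    idx2 = np.zeros(max_bonds, dtype=np.int32)
    mask = np.zeros(max_bonds, dtype=bool)
    n = min(len(bond_atom1), max_bonds)
    for k in range(n):
        idx1[k] = label_to_idx[bond_atom1[k]]
        idx2[k] = label_to_idx[bond_atom2[k]]
        mask[k] = True
    return idx1, idx2, mask

atom_labels = ["C1", "C2", "O1", "H1"]
idx1, idx2, mask = bond_indices(
    atom_labels, ["C1", "C2", "C1"], ["C2", "O1", "H1"], max_bonds=5
)

# Padded slots hold 0, which is also a valid atom index, so filter with the mask.
valid_pairs = np.stack([idx1[mask], idx2[mask]], axis=1)
```

With zero-padding, a padded slot is indistinguishable from a bond to atom 0 unless the mask is applied, which is why the mask accompanies the index arrays.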

read_intermolecular_contacts(batch, max_contacts)

Read and pad raw intermolecular contact data.

Parameters:

batch (List[str]) – Refcode strings to read contacts from.
max_contacts (int) – Maximum contact count for padding.

Returns:

result –

{
    'n_inter_cc': List[int],
    'inter_cc_id': List[List[str]],
    'inter_cc_central_atom': List[List[str]],
    'inter_cc_contact_atom': List[List[str]],
    'inter_cc_central_atom_idx': np.ndarray, shape (B, max_contacts), dtype int32,
    'inter_cc_contact_atom_idx': np.ndarray, shape (B, max_contacts), dtype int32,
    'inter_cc_symmetry': List[List[str]],
    'inter_cc_central_atom_coords': np.ndarray, shape (B, max_contacts, 3), dtype float32,
    'inter_cc_contact_atom_coords': np.ndarray, shape (B, max_contacts, 3), dtype float32,
    'inter_cc_central_atom_frac_coords': np.ndarray, shape (B, max_contacts, 3), dtype float32,
    'inter_cc_contact_atom_frac_coords': np.ndarray, shape (B, max_contacts, 3), dtype float32,
    'inter_cc_length': np.ndarray, shape (B, max_contacts), dtype float32,
    'inter_cc_strength': np.ndarray, shape (B, max_contacts), dtype float32,
    'inter_cc_in_los': np.ndarray, shape (B, max_contacts), dtype bool
}

Return type:

dict
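The contact result lists no mask key, so a consumer may want to derive one from the per-structure counts. The sketch below shows one way to do that on toy data; `mask_from_counts` is a hypothetical helper and the numbers are invented for illustration.

```python
import numpy as np

def mask_from_counts(counts, max_len):
    """Boolean (B, max_len) mask that is True for the first counts[i] slots of row i."""
    counts = np.asarray(counts, dtype=np.int64)
    return np.arange(max_len)[None, :] < counts[:, None]

# Toy stand-ins for 'n_inter_cc' and 'inter_cc_length' with max_contacts = 4.
n_inter_cc = [2, 0, 3]
inter_cc_length = np.array(
    [[3.1, 3.4, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0],
     [2.9, 3.0, 3.3, 0.0]],
    dtype=np.float32,
)

valid = mask_from_counts(n_inter_cc, inter_cc_length.shape[1])

# Mean contact length per structure; structures without contacts get 0.
counts = valid.sum(axis=1)
total = (inter_cc_length * valid).sum(axis=1, dtype=np.float64)
mean_length = np.divide(total, counts, out=np.zeros_like(total), where=counts > 0)
```

Using `where=counts > 0` avoids a division by zero for structures that have no contacts at all.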

read_intermolecular_hbonds(batch, max_hbonds)

Read and pad raw intermolecular H‐bond data.

Parameters:

batch (List[str]) – Refcode strings to read H‐bonds from.
max_hbonds (int) – Maximum H‐bond count for padding.

Returns:

result –

{
    'n_inter_hb': List[int],
    'inter_hb_id': List[List[str]],
    'inter_hb_central_atom': List[List[str]],
    'inter_hb_hydrogen_atom': List[List[str]],
    'inter_hb_contact_atom': List[List[str]],
    'inter_hb_central_atom_idx': np.ndarray, shape (B, max_hbonds), dtype int32,
    'inter_hb_hydrogen_atom_idx': np.ndarray, shape (B, max_hbonds), dtype int32,
    'inter_hb_contact_atom_idx': np.ndarray, shape (B, max_hbonds), dtype int32,
    'inter_hb_symmetry': List[List[str]],
    'inter_hb_central_atom_coords': np.ndarray, shape (B, max_hbonds, 3), dtype float32,
    'inter_hb_hydrogen_atom_coords': np.ndarray, shape (B, max_hbonds, 3), dtype float32,
    'inter_hb_contact_atom_coords': np.ndarray, shape (B, max_hbonds, 3), dtype float32,
    'inter_hb_central_atom_frac_coords': np.ndarray, shape (B, max_hbonds, 3), dtype float32,
    'inter_hb_hydrogen_atom_frac_coords': np.ndarray, shape (B, max_hbonds, 3), dtype float32,
    'inter_hb_contact_atom_frac_coords': np.ndarray, shape (B, max_hbonds, 3), dtype float32,
    'inter_hb_length': np.ndarray, shape (B, max_hbonds), dtype float32,
    'inter_hb_angle': np.ndarray, shape (B, max_hbonds), dtype float32,
    'inter_hb_in_los': np.ndarray, shape (B, max_hbonds), dtype bool
}

Return type:

dict
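Because the result carries coordinates for all three atoms of each H-bond, an angle can be recomputed from them. The sketch below computes the angle at the hydrogen in degrees for arrays of shape (B, max_hbonds, 3). Whether 'inter_hb_angle' follows this convention, and whether the stored contact coordinates are already symmetry-transformed, are not stated on this page, so treat both as assumptions; `angle_at_hydrogen` is a hypothetical helper.

```python
import numpy as np

def angle_at_hydrogen(central, hydrogen, contact):
    """Angle in degrees at the hydrogen between the H->central and H->contact vectors."""
    v1 = central - hydrogen
    v2 = contact - hydrogen
    dot = np.sum(v1 * v2, axis=-1)
    norm = np.linalg.norm(v1, axis=-1) * np.linalg.norm(v2, axis=-1)
    # Padded slots have zero-length vectors; leave their cosine at 0 instead of dividing.
    cos = np.divide(dot, norm, out=np.zeros_like(dot), where=norm > 0)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# One structure (B = 1) with two toy H-bonds.
central = np.array([[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]])
hydrogen = np.array([[[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]])
contact = np.array([[[3.0, 0.0, 0.0], [1.0, 2.0, 0.0]]])

angles = angle_at_hydrogen(central, hydrogen, contact)
```

Values computed for padded slots are meaningless and should be discarded using the counts in 'n_inter_hb'.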

read_intramolecular_contacts(batch, max_contacts)

Read and pad raw intramolecular contact data.

Parameters:

batch (List[str]) – Structure identifiers to read intra‐contacts from.
max_contacts (int) – Maximum intra‐contact count for padding.

Returns:

result –

{
    'n_intra_cc': List[int],
    'intra_cc_id': List[List[str]],
    'intra_cc_central_atom': List[List[str]],
    'intra_cc_contact_atom': List[List[str]],
    'intra_cc_central_atom_idx': np.ndarray, shape (B, max_contacts), dtype int32,
    'intra_cc_contact_atom_idx': np.ndarray, shape (B, max_contacts), dtype int32,
    'intra_cc_central_atom_coords': np.ndarray, shape (B, max_contacts, 3), dtype float32,
    'intra_cc_contact_atom_coords': np.ndarray, shape (B, max_contacts, 3), dtype float32,
    'intra_cc_central_atom_frac_coords': np.ndarray, shape (B, max_contacts, 3), dtype float32,
    'intra_cc_contact_atom_frac_coords': np.ndarray, shape (B, max_contacts, 3), dtype float32,
    'intra_cc_length': np.ndarray, shape (B, max_contacts), dtype float32,
    'intra_cc_strength': np.ndarray, shape (B, max_contacts), dtype float32,
    'intra_cc_in_los': np.ndarray, shape (B, max_contacts), dtype bool
}

Return type:

dict
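A simple sanity check on data of this shape is to recompute contact lengths from the Cartesian coordinates and to select only line-of-sight contacts. The sketch below does this on invented toy arrays; it assumes 'intra_cc_length' is the Euclidean distance between the two stored coordinate sets, which this page does not confirm.

```python
import numpy as np

# Toy stand-ins for one structure (B = 1) padded to max_contacts = 3.
central = np.array([[[0, 0, 0], [1, 1, 1], [0, 0, 0]]], dtype=np.float32)
contact = np.array([[[3, 4, 0], [1, 1, 3], [0, 0, 0]]], dtype=np.float32)
length = np.array([[5.0, 2.0, 0.0]], dtype=np.float32)
in_los = np.array([[True, False, False]])
n_intra_cc = [2]

# Valid slots are the first n_intra_cc[i] entries of each row.
valid = np.arange(length.shape[1])[None, :] < np.asarray(n_intra_cc)[:, None]

# Recompute lengths from coordinates and compare on valid slots only.
recomputed = np.linalg.norm(contact - central, axis=-1)
consistent = np.allclose(recomputed[valid], length[valid], atol=1e-3)

# Lengths of contacts that are both real (not padding) and in line of sight.
los_lengths = length[valid & in_los]
```

Combining the validity mask with the flag array matters because a padded flag slot is False by construction and carries no information about a real contact.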

read_intramolecular_hbonds(batch, max_hbonds)

Read and pad raw intramolecular H‐bond data.

Parameters:

batch (List[str]) – Structure identifiers to read intra‐H‐bonds from.
max_hbonds (int) – Maximum intra‐H‐bond count for padding.

Returns:

result –

{
    'n_intra_hb': List[int],
    'intra_hb_id': List[List[str]],
    'intra_hb_central_atom': List[List[str]],
    'intra_hb_hydrogen_atom': List[List[str]],
    'intra_hb_contact_atom': List[List[str]],
    'intra_hb_central_atom_idx': np.ndarray, shape (B, max_hbonds), dtype int32,
    'intra_hb_hydrogen_atom_idx': np.ndarray, shape (B, max_hbonds), dtype int32,
    'intra_hb_contact_atom_idx': np.ndarray, shape (B, max_hbonds), dtype int32,
    'intra_hb_central_atom_coords': np.ndarray, shape (B, max_hbonds, 3), dtype float32,
    'intra_hb_hydrogen_atom_coords': np.ndarray, shape (B, max_hbonds, 3), dtype float32,
    'intra_hb_contact_atom_coords': np.ndarray, shape (B, max_hbonds, 3), dtype float32,
    'intra_hb_central_atom_frac_coords': np.ndarray, shape (B, max_hbonds, 3), dtype float32,
    'intra_hb_hydrogen_atom_frac_coords': np.ndarray, shape (B, max_hbonds, 3), dtype float32,
    'intra_hb_contact_atom_frac_coords': np.ndarray, shape (B, max_hbonds, 3), dtype float32,
    'intra_hb_length': np.ndarray, shape (B, max_hbonds), dtype float32,
    'intra_hb_angle': np.ndarray, shape (B, max_hbonds), dtype float32,
    'intra_hb_in_los': np.ndarray, shape (B, max_hbonds), dtype bool
}

Return type:

dict

HDF5 Raw Data Reading and Batch Loading

The data_reader module provides efficient interfaces for reading raw crystallographic structure data from HDF5 files into memory for batch processing. It handles padding, type conversion, and organization of heterogeneous structural data into uniform arrays suitable for GPU processing.

Key Features:

Batch data loading - Read multiple structures simultaneously for efficient processing
Automatic padding - Uniform array dimensions for ragged data (atoms, bonds, contacts)
Type conversion - Convert HDF5 datasets to appropriate NumPy/PyTorch formats
Memory optimization - Efficient data loading patterns for large datasets
Error handling - Robust handling of missing data and format inconsistencies
Flexible interfaces - Support for different data types and structure organizations
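Batch loading as described above needs the refcodes split into fixed-size groups before each read call. A minimal sketch of such a splitter follows; `iter_batches` is a hypothetical helper and is not part of the module.

```python
from typing import Iterator, List, Sequence

def iter_batches(refcodes: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of refcodes; the last batch may be shorter."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(refcodes), batch_size):
        yield list(refcodes[start:start + batch_size])

# Toy refcodes; real ones would be the group names under '/structures'.
batches = list(iter_batches(["A", "B", "C", "D", "E"], batch_size=2))
```

Each yielded list can then be passed as the batch argument to the reader methods, for example read_atoms(batch, N_max), so that only one batch of padded arrays is held in memory at a time.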

RawDataReader Class

class data_reader.RawDataReader(h5_in)

Read and pad raw per‐structure data from an HDF5 file for batch processing.

h5_in

Open HDF5 file containing raw structure datasets under ‘/structures/<refcode>’.

Type: h5py.File