data_writer module

Module: data_writer.py

Provides RawDataWriter and ComputedDataWriter to write raw and computed datasets into a processed HDF5 file. All per‐structure data is written slice‐by‐slice, with variable‐length (vlen) datasets for atoms, bonds, contacts, and H‐bonds.

Dependencies

h5py numpy torch

class data_writer.RawDataWriter(h5_out)[source]

Bases: object

Write raw crystal, atom, bond, intramolecular and intermolecular contact, and H-bond data into the output HDF5 file, slice-by-slice.

h5_out

Open HDF5 file for writing processed data.

Type:

h5py.File

write_raw_crystal_data(start, crystal_parameters)[source]

Write raw crystal parameters into the HDF5 datasets.

write_raw_atom_data(start, atom_parameters)[source]

Write raw per-atom data into the HDF5 datasets.

write_raw_bond_data(start, bond_parameters)[source]

Write raw per-bond data into the HDF5 datasets.

write_raw_intramolecular_contact_data(start, intra_cc_parameters)[source]

Write raw intra-molecular contact data into the HDF5 datasets.

write_raw_intramolecular_hbond_data(start, intra_hb_parameters)[source]

Write raw intra-molecular H-bond data into the HDF5 datasets.

__init__(h5_out)[source]
Parameters:

h5_out (h5py.File) – Open HDF5 file for writing processed data.

write_raw_crystal_data(start, crystal_parameters)[source]

Write raw crystal parameters into the HDF5 datasets.

Parameters:
  • start (int) – Index offset in the output datasets corresponding to this batch.

  • crystal_parameters (Dict[str, Any]) – Dictionary of raw crystal-level arrays or tensors: ‘cell_lengths’, ‘cell_angles’, and any scalar metrics.

Return type:

None

write_raw_atom_data(start, atom_parameters)[source]

Write raw per‐atom data into the HDF5 datasets.

Parameters:
  • start (int) – Index offset in the output datasets corresponding to this batch.

  • atom_parameters (Dict[str, Any]) – Dictionary containing per-atom raw data: ‘atom_label’, ‘atom_symbol’, ‘atom_number’, ‘atom_coords’, ‘atom_frac_coords’, ‘atom_weight’, ‘atom_charge’, ‘atom_sybyl_type’, ‘atom_neighbour_list’, and ‘atom_mask’.

Return type:

None

write_raw_bond_data(start, bond_parameters)[source]

Write raw per‐bond data into the HDF5 datasets.

Parameters:
  • start (int) – Index offset in the output datasets corresponding to this batch.

  • bond_parameters (Dict[str, Any]) – Dictionary containing per-bond raw data: ‘n_bonds’, ‘bond_atom1_idx’, ‘bond_atom2_idx’, ‘bond_atom1’, ‘bond_atom2’, ‘bond_type’, ‘bond_is_rotatable_raw’, ‘bond_is_cyclic’, and ‘bond_length’.

Return type:

None

write_raw_intramolecular_contact_data(start, intra_cc_parameters)[source]

Write raw intramolecular contact data into the HDF5 datasets.

Parameters:
  • start (int) – Index offset in the output datasets corresponding to this batch.

  • intra_cc_parameters (Dict[str, Any]) – Dictionary containing raw intra-molecular contact data: ‘intra_cc_id’, ‘intra_cc_central_atom’, ‘intra_cc_contact_atom’, ‘intra_cc_central_atom_idx’, ‘intra_cc_contact_atom_idx’, ‘intra_cc_central_atom_coords’, ‘intra_cc_contact_atom_coords’, ‘intra_cc_central_atom_frac_coords’, ‘intra_cc_contact_atom_frac_coords’, ‘intra_cc_length’, ‘intra_cc_strength’, and ‘intra_cc_in_los’.

Return type:

None

write_raw_intramolecular_hbond_data(start, intra_hb_parameters)[source]

Write raw intramolecular H‐bond data into the HDF5 datasets.

Parameters:
  • start (int) – Index offset in the output datasets corresponding to this batch.

  • intra_hb_parameters (Dict[str, Any]) – Dictionary containing raw intra-molecular H-bond data: ‘intra_hb_id’, ‘intra_hb_central_atom’, ‘intra_hb_hydrogen_atom’, ‘intra_hb_contact_atom’, ‘intra_hb_central_atom_idx’, ‘intra_hb_hydrogen_atom_idx’, ‘intra_hb_contact_atom_idx’, ‘intra_hb_central_atom_coords’, ‘intra_hb_hydrogen_atom_coords’, ‘intra_hb_contact_atom_coords’, ‘intra_hb_central_atom_frac_coords’, ‘intra_hb_hydrogen_atom_frac_coords’, ‘intra_hb_contact_atom_frac_coords’, ‘intra_hb_length’, ‘intra_hb_angle’, and ‘intra_hb_in_los’.

Return type:

None

class data_writer.ComputedDataWriter(h5_out)[source]

Bases: object

Write computed crystal, atom, bond, molecule, and contact/H-bond features into the output HDF5 file.

h5_out

Open HDF5 file for writing processed data.

Type:

h5py.File

write_computed_crystal_data(start: int, crystal_parameters: Dict[str, Any]) None[source]

Write computed crystal parameters (‘scaled_cell’, ‘cell_matrix’) into the HDF5 datasets.

write_computed_atom_data(start: int, atom_parameters: Dict[str, Any]) None[source]

Write computed atom-level features (‘atom_fragment_id’, ‘atom_dist_to_special_planes’) into the HDF5 datasets.

write_computed_bond_data(start: int, bond_parameters: Dict[str, Any]) None[source]

Write computed bond-level features (‘bond_is_rotatable’, ‘bond_vector_angles_to_special_planes’) into the HDF5 datasets.

write_computed_molecule_data(start: int, molecule_parameters: Dict[str, Any]) None[source]

Write computed intra-molecular bond angles and torsion features into the HDF5 datasets.

write_computed_intermolecular_contact_data(start: int, inter_cc_parameters: Dict[str, Any]) None[source]

Write computed intermolecular contact features (IDs, indices, coords, lengths, strengths, h-bond flags, fragment mappings, vectors) into the HDF5 datasets.

write_computed_intermolecular_hbond_data(start: int, inter_hb_parameters: Dict[str, Any]) None[source]

Write computed intermolecular H-bond features (IDs, donor/acceptor labels, indices, coords, lengths, angles, masks, symmetry ops) into the HDF5 datasets.

__init__(h5_out)[source]

Initialize ComputedDataWriter.

Parameters:

h5_out (h5py.File) – Open HDF5 file for writing processed data.

write_computed_crystal_data(start, crystal_parameters)[source]

Write computed crystal parameters into the HDF5 datasets.

Parameters:
  • start (int) – Index offset in the output datasets corresponding to this batch.

  • crystal_parameters (Dict[str, Any]) – Dictionary containing computed crystal data: ‘scaled_cell’ (shape (B, 6)) and ‘cell_matrix’ (shape (B, 3, 3)).

Return type:

None

write_computed_atom_data(start, atom_parameters)[source]

Write computed atom-level features into the HDF5 datasets.

Parameters:
  • start (int) – Index offset in the output datasets corresponding to this batch.

  • atom_parameters (Dict[str, Any]) – Dictionary containing computed atom data: ‘atom_fragment_id’ (list or array of shape (B, N)), ‘atom_dist_to_special_planes’ (shape (B, N, P)).

Return type:

None

write_computed_bond_data(start, bond_parameters)[source]

Write computed bond-level features into the HDF5 datasets.

Parameters:
  • start (int) – Index offset in the output datasets corresponding to this batch.

  • bond_parameters (Dict[str, Any]) – Dictionary containing computed bond data: ‘bond_is_rotatable’ (shape (B, M)) and ‘bond_vector_angles_to_special_planes’ (shape (B, M, K)).

Return type:

None

write_computed_molecule_data(start, molecule_parameters)[source]

Write computed intra-molecular bond angles and torsion features.

Parameters:
  • start (int) – Index offset in the output datasets corresponding to this batch.

  • molecule_parameters (Dict[str, Any]) – Dictionary containing computed molecule data: ‘bond_angle_id’, ‘bond_angle’, ‘bond_angle_mask’, ‘bond_angle_atom_idx’, ‘torsion_id’, ‘torsion’, ‘torsion_mask’, and ‘torsion_atom_idx’.

Return type:

None

write_computed_intermolecular_contact_data(start, inter_cc_parameters)[source]

Write computed intermolecular contact features into the HDF5 datasets.

Parameters:
  • start (int) – Index offset in the output datasets corresponding to this batch.

  • inter_cc_parameters (Dict[str, Any]) – Dictionary containing computed intermolecular contact data, including IDs, atom labels, indices, coords, frac_coords, lengths, strengths, masks, symmetry ops, hbond flags, and fragment‐mapping and vector fields.

Return type:

None

write_computed_intermolecular_hbond_data(start, inter_hb_parameters)[source]

Write computed intermolecular H‐bond features into the HDF5 datasets.

Parameters:
  • start (int) – Index offset in the output datasets corresponding to this batch.

  • inter_hb_parameters (Dict[str, Any]) – Dictionary containing computed intermolecular H‐bond data, including IDs, donor/acceptor labels, indices, coords, frac_coords, lengths, angles, masks, symmetry ops, etc.

Return type:

None

write_computed_fragment_data(start, fragment_parameters)[source]

Write computed fragment-level properties into the HDF5 datasets.

Parameters:
  • start (int) – Index offset in the output datasets corresponding to this batch.

  • fragment_parameters (Dict[str, Any]) –

    Dictionary containing computed fragment data: - ‘n_fragments’ : (B,) or list of ints - ‘fragment_local_id’ : (B, nF) or list of lists of ints - ‘fragment_formula’ : list of lists of str - ‘fragment_n_atoms’ : (B, nF) or list of lists of ints - all other keys as numpy/Tensor arrays:

    ’fragment_com_coords’, ‘fragment_com_frac_coords’, ‘fragment_cen_coords’, ‘fragment_cen_frac_coords’, ‘fragment_inertia_tensors’, ‘fragment_inertia_eigvals’, ‘fragment_inertia_eigvecs’, ‘fragment_inertia_quaternions’, ‘fragment_quadrupole_tensors’, ‘fragment_quadrupole_eigvals’, ‘fragment_quadrupole_eigvecs’, ‘fragment_quadrupole_quaternions’, ‘fragment_atom_to_com_dist’, ‘fragment_atom_to_com_frac_dist’, ‘fragment_atom_to_com_vec’, ‘fragment_atom_to_com_frac_vec’, ‘fragment_Ql’, ‘fragment_plane_centroid’, ‘fragment_plane_normal’, ‘fragment_planarity_rmsd’, ‘fragment_planarity_max_dev’, ‘fragment_planarity_score’

Return type:

None

HDF5 Data Writing and Storage Management

The data_writer module provides efficient interfaces for writing both raw and computed structural data to HDF5 files. It handles the complexities of variable-length datasets, batch writing operations, and proper data type management for crystallographic structure analysis pipelines.

Key Features:

  • Dual writer classes - Separate interfaces for raw and computed data

  • Batch writing operations - Efficient slice-by-slice data storage

  • Variable-length datasets - Proper handling of ragged arrays (atoms, bonds, contacts)

  • Type management - Automatic conversion between PyTorch tensors and NumPy arrays

  • Memory optimization - Efficient writing patterns for large datasets

  • Data integrity - Consistent formatting and validation during writes

Writer Classes Overview

The module provides two main writer classes:

  • RawDataWriter - Handles raw structural data from extraction pipeline

  • ComputedDataWriter - Handles processed features and computed descriptors

Both classes follow similar patterns but handle different data types and formatting requirements.

RawDataWriter Class

class data_writer.RawDataWriter(h5_out)[source]

Write raw crystal, atom, bond, intramolecular and intermolecular contact, and H-bond data into the output HDF5 file, slice-by-slice.

h5_out

Open HDF5 file for writing processed data.

Type:

h5py.File

write_raw_crystal_data(start, crystal_parameters)[source]

Write raw crystal parameters into the HDF5 datasets.

write_raw_atom_data(start, atom_parameters)[source]

Write raw per-atom data into the HDF5 datasets.

write_raw_bond_data(start, bond_parameters)[source]

Write raw per-bond data into the HDF5 datasets.

write_raw_intramolecular_contact_data(start, intra_cc_parameters)[source]

Write raw intra-molecular contact data into the HDF5 datasets.

write_raw_intramolecular_hbond_data(start, intra_hb_parameters)[source]

Write raw intra-molecular H-bond data into the HDF5 datasets.

Raw Structural Data Writing Interface

The RawDataWriter class handles writing raw crystallographic data extracted from CSD structures into HDF5 files with proper formatting and organization for downstream processing.

Attributes:

  • h5_out (h5py.File) - Open HDF5 file handle for writing processed data

Data Organization Pattern:

Raw data is organized into distinct categories: - Crystal-level parameters (fixed-size arrays) - Atomic data (variable-length with padding) - Bond connectivity (variable-length with padding) - Intermolecular interactions (contacts and H-bonds) - Intramolecular interactions (contacts and H-bonds)

__init__(h5_out)[source]
Parameters:

h5_out (h5py.File) – Open HDF5 file for writing processed data.

Initialize Raw Data Writer

Parameters:
  • h5_out (h5py.File) - Open HDF5 file for writing processed data

Usage Example:

import h5py
from data_writer import RawDataWriter

# Create or open output file
with h5py.File('structures_processed.h5', 'w') as h5_out:
    writer = RawDataWriter(h5_out)

    # Write data in batches...
    writer.write_raw_crystal_data(start_index, crystal_parameters)
__init__(h5_out)[source]
Parameters:

h5_out (h5py.File) – Open HDF5 file for writing processed data.

write_raw_crystal_data(start, crystal_parameters)[source]

Write raw crystal parameters into the HDF5 datasets.

Parameters:
  • start (int) – Index offset in the output datasets corresponding to this batch.

  • crystal_parameters (Dict[str, Any]) – Dictionary of raw crystal-level arrays or tensors: ‘cell_lengths’, ‘cell_angles’, and any scalar metrics.

Return type:

None

write_raw_atom_data(start, atom_parameters)[source]

Write raw per‐atom data into the HDF5 datasets.

Parameters:
  • start (int) – Index offset in the output datasets corresponding to this batch.

  • atom_parameters (Dict[str, Any]) – Dictionary containing per-atom raw data: ‘atom_label’, ‘atom_symbol’, ‘atom_number’, ‘atom_coords’, ‘atom_frac_coords’, ‘atom_weight’, ‘atom_charge’, ‘atom_sybyl_type’, ‘atom_neighbour_list’, and ‘atom_mask’.

Return type:

None

write_raw_bond_data(start, bond_parameters)[source]

Write raw per‐bond data into the HDF5 datasets.

Parameters:
  • start (int) – Index offset in the output datasets corresponding to this batch.

  • bond_parameters (Dict[str, Any]) – Dictionary containing per-bond raw data: ‘n_bonds’, ‘bond_atom1_idx’, ‘bond_atom2_idx’, ‘bond_atom1’, ‘bond_atom2’, ‘bond_type’, ‘bond_is_rotatable_raw’, ‘bond_is_cyclic’, and ‘bond_length’.

Return type:

None

write_raw_intramolecular_contact_data(start, intra_cc_parameters)[source]

Write raw intramolecular contact data into the HDF5 datasets.

Parameters:
  • start (int) – Index offset in the output datasets corresponding to this batch.

  • intra_cc_parameters (Dict[str, Any]) – Dictionary containing raw intra-molecular contact data: ‘intra_cc_id’, ‘intra_cc_central_atom’, ‘intra_cc_contact_atom’, ‘intra_cc_central_atom_idx’, ‘intra_cc_contact_atom_idx’, ‘intra_cc_central_atom_coords’, ‘intra_cc_contact_atom_coords’, ‘intra_cc_central_atom_frac_coords’, ‘intra_cc_contact_atom_frac_coords’, ‘intra_cc_length’, ‘intra_cc_strength’, and ‘intra_cc_in_los’.

Return type:

None

write_raw_intramolecular_hbond_data(start, intra_hb_parameters)[source]

Write raw intramolecular H‐bond data into the HDF5 datasets.

Parameters:
  • start (int) – Index offset in the output datasets corresponding to this batch.

  • intra_hb_parameters (Dict[str, Any]) – Dictionary containing raw intra-molecular H-bond data: ‘intra_hb_id’, ‘intra_hb_central_atom’, ‘intra_hb_hydrogen_atom’, ‘intra_hb_contact_atom’, ‘intra_hb_central_atom_idx’, ‘intra_hb_hydrogen_atom_idx’, ‘intra_hb_contact_atom_idx’, ‘intra_hb_central_atom_coords’, ‘intra_hb_hydrogen_atom_coords’, ‘intra_hb_contact_atom_coords’, ‘intra_hb_central_atom_frac_coords’, ‘intra_hb_hydrogen_atom_frac_coords’, ‘intra_hb_contact_atom_frac_coords’, ‘intra_hb_length’, ‘intra_hb_angle’, and ‘intra_hb_in_los’.

Return type:

None

Crystal Data Writing

RawDataWriter.write_raw_crystal_data(start, crystal_parameters)[source]

Write raw crystal parameters into the HDF5 datasets.

Parameters:
  • start (int) – Index offset in the output datasets corresponding to this batch.

  • crystal_parameters (Dict[str, Any]) – Dictionary of raw crystal-level arrays or tensors: ‘cell_lengths’, ‘cell_angles’, and any scalar metrics.

Return type:

None

Write Crystal-Level Parameters

Stores unit cell parameters and crystal properties in fixed-size datasets.

Parameters:

  • start (int) - Index offset in output datasets for this batch

  • crystal_parameters (Dict[str, Any]) - Dictionary of crystal-level data including:

    • cell_lengths - Unit cell parameters [a, b, c]

    • cell_angles - Unit cell angles [α, β, γ]

    • z_value - Number of formula units per unit cell

    • z_prime - Number of symmetry-independent molecules

    • cell_volume - Unit cell volume

    • cell_density - Crystal density

    • space_group - Space group symbol

    • identifier - Structure identifier

Data Types and Conversion:

The method automatically handles conversion between different data formats:

# Automatic tensor to array conversion
if torch.is_tensor(value):
    array_data = value.detach().cpu().numpy()
else:
    array_data = np.asarray(value)

# Write to appropriate dataset slice
h5_dataset[start:start+batch_size] = array_data

Usage Example:

# Prepare crystal parameter data
crystal_params = {
    'cell_lengths': np.array([[5.0, 5.0, 5.0], [6.0, 8.0, 10.0]]),
    'cell_angles': np.array([[90.0, 90.0, 90.0], [75.0, 85.0, 95.0]]),
    'cell_volume': np.array([125.0, 450.2]),
    'cell_density': np.array([1.45, 1.32]),
    'space_group': ['P1', 'P-1']
}

# Write to HDF5 starting at structure index 10
writer.write_raw_crystal_data(start=10, crystal_parameters=crystal_params)

Atomic Data Writing

RawDataWriter.write_raw_atom_data(start, atom_parameters)[source]

Write raw per‐atom data into the HDF5 datasets.

Parameters:
  • start (int) – Index offset in the output datasets corresponding to this batch.

  • atom_parameters (Dict[str, Any]) – Dictionary containing per-atom raw data: ‘atom_label’, ‘atom_symbol’, ‘atom_number’, ‘atom_coords’, ‘atom_frac_coords’, ‘atom_weight’, ‘atom_charge’, ‘atom_sybyl_type’, ‘atom_neighbour_list’, and ‘atom_mask’.

Return type:

None

Write Padded Atomic Information

Stores atomic coordinates, properties, and connectivity in variable-length datasets with proper padding management.

Parameters:

  • start (int) - Index offset in output datasets for this batch

  • atom_parameters (Dict[str, Any]) - Dictionary containing atomic data:

    • atom_label - Atom labels per structure

    • atom_symbol - Element symbols

    • atom_coords - Cartesian coordinates (Å)

    • atom_frac_coords - Fractional coordinates

    • atom_weight - Atomic masses (Da)

    • atom_charge - Formal charges

    • atom_sybyl_type - SYBYL atom types

    • atom_neighbors - Connectivity information

    • atom_mask - Valid atom indicators

Variable-Length Data Handling:

The method handles variable-length datasets using HDF5’s VLen (Variable Length) data types:

# For each structure in the batch
for i, refcode in enumerate(batch):
    # Get number of real atoms
    n_atoms = atom_mask[i].sum()

    # Extract valid data only
    valid_coords = atom_coords[i, :n_atoms, :].flatten()  # (n_atoms * 3,)
    valid_symbols = atom_symbols[i][:n_atoms]

    # Write to variable-length datasets
    h5_dataset['atom_coords'][structure_index] = valid_coords
    h5_dataset['atom_symbol'][structure_index] = valid_symbols

Usage Example:

# Prepare atomic data with proper padding
batch_size = 3
max_atoms = 50

atom_params = {
    'atom_coords': np.random.rand(batch_size, max_atoms, 3),
    'atom_symbol': [['C', 'C', 'H', 'H'], ['N', 'O', 'H'], ['C', 'O']],
    'atom_weight': np.random.rand(batch_size, max_atoms),
    'atom_mask': np.array([[True, True, True, True] + [False]*46,
                          [True, True, True] + [False]*47,
                          [True, True] + [False]*48])
}

writer.write_raw_atom_data(start=0, atom_parameters=atom_params)

Bond Data Writing

RawDataWriter.write_raw_bond_data(start, bond_parameters)[source]

Write raw per‐bond data into the HDF5 datasets.

Parameters:
  • start (int) – Index offset in the output datasets corresponding to this batch.

  • bond_parameters (Dict[str, Any]) – Dictionary containing per-bond raw data: ‘n_bonds’, ‘bond_atom1_idx’, ‘bond_atom2_idx’, ‘bond_atom1’, ‘bond_atom2’, ‘bond_type’, ‘bond_is_rotatable_raw’, ‘bond_is_cyclic’, and ‘bond_length’.

Return type:

None

Write Molecular Bond Information

Stores bond connectivity, types, and properties with variable-length formatting.

Parameters:

  • start (int) - Index offset for this batch

  • bond_parameters (Dict[str, Any]) - Dictionary containing bond data:

    • bond_atom1_idx - First atom indices

    • bond_atom2_idx - Second atom indices

    • bond_type - Bond type strings

    • bond_is_rotatable_raw - Rotatability flags

    • bond_is_cyclic - Ring membership

    • bond_length - Bond lengths

    • bond_mask - Valid bond indicators

Usage Example:

bond_params = {
    'bond_atom1_idx': np.array([[0, 1, 2], [0, 1, 2, 3]]),
    'bond_atom2_idx': np.array([[1, 2, 3], [1, 2, 3, 4]]),
    'bond_type': [['single', 'double', 'single'], ['single', 'single', 'aromatic', 'single']],
    'bond_length': np.array([[1.54, 1.34, 1.45], [1.47, 1.42, 1.39, 1.51]]),
    'bond_mask': np.array([[True, True, True, False], [True, True, True, True]])
}

writer.write_raw_bond_data(start=0, bond_parameters=bond_params)

Contact Data Writing

RawDataWriter.write_raw_intramolecular_contact_data(start, intra_cc_parameters)[source]

Write raw intramolecular contact data into the HDF5 datasets.

Parameters:
  • start (int) – Index offset in the output datasets corresponding to this batch.

  • intra_cc_parameters (Dict[str, Any]) – Dictionary containing raw intra-molecular contact data: ‘intra_cc_id’, ‘intra_cc_central_atom’, ‘intra_cc_contact_atom’, ‘intra_cc_central_atom_idx’, ‘intra_cc_contact_atom_idx’, ‘intra_cc_central_atom_coords’, ‘intra_cc_contact_atom_coords’, ‘intra_cc_central_atom_frac_coords’, ‘intra_cc_contact_atom_frac_coords’, ‘intra_cc_length’, ‘intra_cc_strength’, and ‘intra_cc_in_los’.

Return type:

None

Write Intramolecular Contact Information

Stores contacts within individual molecules.

RawDataWriter.write_raw_intramolecular_hbond_data(start, intra_hb_parameters)[source]

Write raw intramolecular H‐bond data into the HDF5 datasets.

Parameters:
  • start (int) – Index offset in the output datasets corresponding to this batch.

  • intra_hb_parameters (Dict[str, Any]) – Dictionary containing raw intra-molecular H-bond data: ‘intra_hb_id’, ‘intra_hb_central_atom’, ‘intra_hb_hydrogen_atom’, ‘intra_hb_contact_atom’, ‘intra_hb_central_atom_idx’, ‘intra_hb_hydrogen_atom_idx’, ‘intra_hb_contact_atom_idx’, ‘intra_hb_central_atom_coords’, ‘intra_hb_hydrogen_atom_coords’, ‘intra_hb_contact_atom_coords’, ‘intra_hb_central_atom_frac_coords’, ‘intra_hb_hydrogen_atom_frac_coords’, ‘intra_hb_contact_atom_frac_coords’, ‘intra_hb_length’, ‘intra_hb_angle’, and ‘intra_hb_in_los’.

Return type:

None

Write Intramolecular Hydrogen Bond Data

Stores internal molecular hydrogen bonds.

ComputedDataWriter Class

class data_writer.ComputedDataWriter(h5_out)[source]

Write computed crystal, atom, bond, molecule, and contact/H-bond features into the output HDF5 file.

h5_out

Open HDF5 file for writing processed data.

Type:

h5py.File

write_computed_crystal_data(start: int, crystal_parameters: Dict[str, Any]) None[source]

Write computed crystal parameters (‘scaled_cell’, ‘cell_matrix’) into the HDF5 datasets.

write_computed_atom_data(start: int, atom_parameters: Dict[str, Any]) None[source]

Write computed atom-level features (‘atom_fragment_id’, ‘atom_dist_to_special_planes’) into the HDF5 datasets.

write_computed_bond_data(start: int, bond_parameters: Dict[str, Any]) None[source]

Write computed bond-level features (‘bond_is_rotatable’, ‘bond_vector_angles_to_special_planes’) into the HDF5 datasets.

write_computed_molecule_data(start: int, molecule_parameters: Dict[str, Any]) None[source]

Write computed intra-molecular bond angles and torsion features into the HDF5 datasets.

write_computed_intermolecular_contact_data(start: int, inter_cc_parameters: Dict[str, Any]) None[source]

Write computed intermolecular contact features (IDs, indices, coords, lengths, strengths, h-bond flags, fragment mappings, vectors) into the HDF5 datasets.

write_computed_intermolecular_hbond_data(start: int, inter_hb_parameters: Dict[str, Any]) None[source]

Write computed intermolecular H-bond features (IDs, donor/acceptor labels, indices, coords, lengths, angles, masks, symmetry ops) into the HDF5 datasets.

Computed Feature Data Writing Interface

The ComputedDataWriter class handles writing processed features and computed descriptors derived from raw structural data.

Key Differences from RawDataWriter:

  • Handles computed tensors and derived features

  • Different data organization patterns

  • Optimized for machine learning-ready formats

  • Includes fragment-level and interaction-level features

__init__(h5_out)[source]

Initialize ComputedDataWriter.

Parameters:

h5_out (h5py.File) – Open HDF5 file for writing processed data.

Initialize Computed Data Writer

Parameters:
  • h5_out (h5py.File) - Open HDF5 file for writing processed data

__init__(h5_out)[source]

Initialize ComputedDataWriter.

Parameters:

h5_out (h5py.File) – Open HDF5 file for writing processed data.

write_computed_crystal_data(start, crystal_parameters)[source]

Write computed crystal parameters into the HDF5 datasets.

Parameters:
  • start (int) – Index offset in the output datasets corresponding to this batch.

  • crystal_parameters (Dict[str, Any]) – Dictionary containing computed crystal data: ‘scaled_cell’ (shape (B, 6)) and ‘cell_matrix’ (shape (B, 3, 3)).

Return type:

None

write_computed_atom_data(start, atom_parameters)[source]

Write computed atom-level features into the HDF5 datasets.

Parameters:
  • start (int) – Index offset in the output datasets corresponding to this batch.

  • atom_parameters (Dict[str, Any]) – Dictionary containing computed atom data: ‘atom_fragment_id’ (list or array of shape (B, N)), ‘atom_dist_to_special_planes’ (shape (B, N, P)).

Return type:

None

write_computed_bond_data(start, bond_parameters)[source]

Write computed bond-level features into the HDF5 datasets.

Parameters:
  • start (int) – Index offset in the output datasets corresponding to this batch.

  • bond_parameters (Dict[str, Any]) – Dictionary containing computed bond data: ‘bond_is_rotatable’ (shape (B, M)) and ‘bond_vector_angles_to_special_planes’ (shape (B, M, K)).

Return type:

None

write_computed_molecule_data(start, molecule_parameters)[source]

Write computed intra-molecular bond angles and torsion features.

Parameters:
  • start (int) – Index offset in the output datasets corresponding to this batch.

  • molecule_parameters (Dict[str, Any]) – Dictionary containing computed molecule data: ‘bond_angle_id’, ‘bond_angle’, ‘bond_angle_mask’, ‘bond_angle_atom_idx’, ‘torsion_id’, ‘torsion’, ‘torsion_mask’, and ‘torsion_atom_idx’.

Return type:

None

write_computed_intermolecular_contact_data(start, inter_cc_parameters)[source]

Write computed intermolecular contact features into the HDF5 datasets.

Parameters:
  • start (int) – Index offset in the output datasets corresponding to this batch.

  • inter_cc_parameters (Dict[str, Any]) – Dictionary containing computed intermolecular contact data, including IDs, atom labels, indices, coords, frac_coords, lengths, strengths, masks, symmetry ops, hbond flags, and fragment‐mapping and vector fields.

Return type:

None

write_computed_intermolecular_hbond_data(start, inter_hb_parameters)[source]

Write computed intermolecular H‐bond features into the HDF5 datasets.

Parameters:
  • start (int) – Index offset in the output datasets corresponding to this batch.

  • inter_hb_parameters (Dict[str, Any]) – Dictionary containing computed intermolecular H‐bond data, including IDs, donor/acceptor labels, indices, coords, frac_coords, lengths, angles, masks, symmetry ops, etc.

Return type:

None

write_computed_fragment_data(start, fragment_parameters)[source]

Write computed fragment-level properties into the HDF5 datasets.

Parameters:
  • start (int) – Index offset in the output datasets corresponding to this batch.

  • fragment_parameters (Dict[str, Any]) –

    Dictionary containing computed fragment data: - ‘n_fragments’ : (B,) or list of ints - ‘fragment_local_id’ : (B, nF) or list of lists of ints - ‘fragment_formula’ : list of lists of str - ‘fragment_n_atoms’ : (B, nF) or list of lists of ints - all other keys as numpy/Tensor arrays:

    ’fragment_com_coords’, ‘fragment_com_frac_coords’, ‘fragment_cen_coords’, ‘fragment_cen_frac_coords’, ‘fragment_inertia_tensors’, ‘fragment_inertia_eigvals’, ‘fragment_inertia_eigvecs’, ‘fragment_inertia_quaternions’, ‘fragment_quadrupole_tensors’, ‘fragment_quadrupole_eigvals’, ‘fragment_quadrupole_eigvecs’, ‘fragment_quadrupole_quaternions’, ‘fragment_atom_to_com_dist’, ‘fragment_atom_to_com_frac_dist’, ‘fragment_atom_to_com_vec’, ‘fragment_atom_to_com_frac_vec’, ‘fragment_Ql’, ‘fragment_plane_centroid’, ‘fragment_plane_normal’, ‘fragment_planarity_rmsd’, ‘fragment_planarity_max_dev’, ‘fragment_planarity_score’

Return type:

None

Computed Data Writing Methods

ComputedDataWriter.write_computed_crystal_data(start, crystal_parameters)[source]

Write computed crystal parameters into the HDF5 datasets.

Parameters:
  • start (int) – Index offset in the output datasets corresponding to this batch.

  • crystal_parameters (Dict[str, Any]) – Dictionary containing computed crystal data: ‘scaled_cell’ (shape (B, 6)) and ‘cell_matrix’ (shape (B, 3, 3)).

Return type:

None

Write Computed Crystal Features

Stores derived crystal properties and transformation matrices.

Parameters:

  • start (int) - Index offset for this batch

  • crystal_parameters (Dict[str, Any]) - Dictionary containing:

    • scaled_cell (torch.Tensor, shape (B, 6)) - Normalized cell parameters

    • cell_matrix (torch.Tensor, shape (B, 3, 3)) - Transformation matrices

Tensor Conversion:

# Automatic PyTorch tensor conversion
for key, vals in crystal_parameters.items():
    if torch.is_tensor(vals):
        arrays[key] = vals.detach().cpu().numpy()
    else:
        arrays[key] = np.asarray(vals)

# Write different dimensionalities appropriately
if array.ndim == 2:
    dataset[start:start+B, :] = array
elif array.ndim == 3:
    dataset[start:start+B, :, :] = array

Usage Example:

import torch

# Computed crystal features
computed_crystal = {
    'scaled_cell': torch.tensor([[1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
                                [1.0, 1.33, 1.67, 0.83, 0.94, 1.06]]),
    'cell_matrix': torch.eye(3).unsqueeze(0).repeat(2, 1, 1)
}

writer.write_computed_crystal_data(start=0, crystal_parameters=computed_crystal)
ComputedDataWriter.write_computed_atom_data(start, atom_parameters)[source]

Write computed atom-level features into the HDF5 datasets.

Parameters:
  • start (int) – Index offset in the output datasets corresponding to this batch.

  • atom_parameters (Dict[str, Any]) – Dictionary containing computed atom data: ‘atom_fragment_id’ (list or array of shape (B, N)), ‘atom_dist_to_special_planes’ (shape (B, N, P)).

Return type:

None

Write Computed Atomic Features

Stores atom-level computed properties and descriptors.

Parameters:

  • start (int) - Index offset for this batch

  • atom_parameters (Dict[str, Any]) - Dictionary containing:

    • atom_fragment_id - Fragment assignments

    • atom_dist_to_special_planes - Distances to crystallographic planes

ComputedDataWriter.write_computed_bond_data(start, bond_parameters)[source]

Write computed bond-level features into the HDF5 datasets.

Parameters:
  • start (int) – Index offset in the output datasets corresponding to this batch.

  • bond_parameters (Dict[str, Any]) – Dictionary containing computed bond data: ‘bond_is_rotatable’ (shape (B, M)) and ‘bond_vector_angles_to_special_planes’ (shape (B, M, K)).

Return type:

None

Write Computed Bond Features

Stores bond-level computed properties and geometric descriptors.

ComputedDataWriter.write_computed_molecule_data(start, molecule_parameters)[source]

Write computed intra-molecular bond angles and torsion features.

Parameters:
  • start (int) – Index offset in the output datasets corresponding to this batch.

  • molecule_parameters (Dict[str, Any]) – Dictionary containing computed molecule data: ‘bond_angle_id’, ‘bond_angle’, ‘bond_angle_mask’, ‘bond_angle_atom_idx’, ‘torsion_id’, ‘torsion’, ‘torsion_mask’, and ‘torsion_atom_idx’.

Return type:

None

Write Computed Molecular Features

Stores intramolecular geometric features like bond angles and torsions.

ComputedDataWriter.write_computed_fragment_data(start, fragment_parameters)[source]

Write computed fragment-level properties into the HDF5 datasets.

Parameters:
  • start (int) – Index offset in the output datasets corresponding to this batch.

  • fragment_parameters (Dict[str, Any]) –

    Dictionary containing computed fragment data: - ‘n_fragments’ : (B,) or list of ints - ‘fragment_local_id’ : (B, nF) or list of lists of ints - ‘fragment_formula’ : list of lists of str - ‘fragment_n_atoms’ : (B, nF) or list of lists of ints - all other keys as numpy/Tensor arrays:

    ’fragment_com_coords’, ‘fragment_com_frac_coords’, ‘fragment_cen_coords’, ‘fragment_cen_frac_coords’, ‘fragment_inertia_tensors’, ‘fragment_inertia_eigvals’, ‘fragment_inertia_eigvecs’, ‘fragment_inertia_quaternions’, ‘fragment_quadrupole_tensors’, ‘fragment_quadrupole_eigvals’, ‘fragment_quadrupole_eigvecs’, ‘fragment_quadrupole_quaternions’, ‘fragment_atom_to_com_dist’, ‘fragment_atom_to_com_frac_dist’, ‘fragment_atom_to_com_vec’, ‘fragment_atom_to_com_frac_vec’, ‘fragment_Ql’, ‘fragment_plane_centroid’, ‘fragment_plane_normal’, ‘fragment_planarity_rmsd’, ‘fragment_planarity_max_dev’, ‘fragment_planarity_score’

Return type:

None

Write Fragment-Level Features

Stores molecular fragment properties and descriptors.

Parameters:

  • start (int) - Index offset for this batch

  • fragment_parameters (Dict[str, Any]) - Dictionary containing extensive fragment data:

    • n_fragments - Number of fragments per structure

    • fragment_local_id - Local fragment identifiers

    • fragment_formula - Molecular formulas

    • fragment_n_atoms - Atom counts per fragment

    • fragment_com_coords - Centers of mass

    • fragment_inertia_tensors - Moment of inertia tensors

    • fragment_inertia_eigvals - Principal moments

    • fragment_inertia_eigvecs - Principal axes

    • fragment_quaternions - Orientation quaternions

    • fragment_Ql - Steinhardt order parameters

    • fragment_planarity_metrics - Shape descriptors

Fragment Data Organization:

# Example fragment parameter structure
fragment_params = {
    'n_fragments': [2, 3, 1],  # Number of fragments per structure
    'fragment_formula': [['C6H6', 'H2O'], ['C2H4', 'NH3', 'CO2'], ['C12H10']],
    'fragment_com_coords': np.array([...]),  # Centers of mass
    'fragment_inertia_eigvals': np.array([...]),  # Principal moments
    'fragment_planarity_rmsd': np.array([...])  # Planarity metrics
}

Advanced Writing Patterns

Batch Processing Workflow

def complete_data_writing_workflow(h5_out_path, processed_data_batches):
    """Complete workflow for writing processed structural data."""

    with h5py.File(h5_out_path, 'w') as h5_out:
        raw_writer = RawDataWriter(h5_out)
        computed_writer = ComputedDataWriter(h5_out)

        total_written = 0

        for batch_idx, batch_data in enumerate(processed_data_batches):
            batch_size = len(batch_data['refcodes'])
            start_idx = total_written

            print(f"Writing batch {batch_idx + 1}: structures {start_idx + 1}-{start_idx + batch_size}")

            # Write raw data
            raw_writer.write_raw_crystal_data(start_idx, batch_data['raw_crystal'])
            raw_writer.write_raw_atom_data(start_idx, batch_data['raw_atoms'])
            raw_writer.write_raw_bond_data(start_idx, batch_data['raw_bonds'])

            # Write computed features
            computed_writer.write_computed_crystal_data(start_idx, batch_data['computed_crystal'])
            computed_writer.write_computed_atom_data(start_idx, batch_data['computed_atoms'])
            computed_writer.write_computed_fragment_data(start_idx, batch_data['computed_fragments'])

            total_written += batch_size

        print(f"Successfully wrote {total_written} structures to {h5_out_path}")

Memory-Efficient Writing

def memory_efficient_writing(data_generator, h5_out_path, batch_size=32):
    """Write data efficiently without accumulating large arrays in memory."""

    with h5py.File(h5_out_path, 'w') as h5_out:
        raw_writer = RawDataWriter(h5_out)
        computed_writer = ComputedDataWriter(h5_out)

        current_batch = []
        start_idx = 0

        for structure_data in data_generator:
            current_batch.append(structure_data)

            # Write when batch is full
            if len(current_batch) >= batch_size:
                batch_data = organize_batch_data(current_batch)

                # Write and immediately clear batch
                write_batch(raw_writer, computed_writer, start_idx, batch_data)

                start_idx += len(current_batch)
                current_batch.clear()

                # Force garbage collection
                import gc
                gc.collect()

        # Write remaining partial batch
        if current_batch:
            batch_data = organize_batch_data(current_batch)
            write_batch(raw_writer, computed_writer, start_idx, batch_data)

Error Handling and Validation

def robust_data_writing(h5_out_path, data_batches):
    """Write data with comprehensive error handling."""

    with h5py.File(h5_out_path, 'w') as h5_out:
        raw_writer = RawDataWriter(h5_out)
        computed_writer = ComputedDataWriter(h5_out)

        successful_writes = 0
        failed_writes = []

        for batch_idx, batch_data in enumerate(data_batches):
            try:
                # Validate data before writing
                validate_batch_data(batch_data)

                # Attempt to write
                start_idx = successful_writes
                raw_writer.write_raw_crystal_data(start_idx, batch_data['raw_crystal'])
                computed_writer.write_computed_crystal_data(start_idx, batch_data['computed_crystal'])

                successful_writes += len(batch_data['refcodes'])
                print(f"✓ Batch {batch_idx + 1}: {len(batch_data['refcodes'])} structures written")

            except Exception as e:
                error_msg = f"Failed to write batch {batch_idx + 1}: {str(e)}"
                print(f"✗ {error_msg}")
                failed_writes.append((batch_idx, error_msg))

                # Continue with next batch
                continue

        print(f"\nWriting summary:")
        print(f"  Successful: {successful_writes} structures")
        print(f"  Failed: {len(failed_writes)} batches")

        if failed_writes:
            print("Failed batches:")
            for batch_idx, error in failed_writes:
                print(f"  Batch {batch_idx + 1}: {error}")

def validate_batch_data(batch_data):
    """Validate batch data before writing."""

    # Check required keys
    required_keys = ['refcodes', 'raw_crystal', 'computed_crystal']
    for key in required_keys:
        if key not in batch_data:
            raise ValueError(f"Missing required key: {key}")

    # Check data consistency
    n_structures = len(batch_data['refcodes'])

    # Validate crystal data shapes
    crystal_data = batch_data['raw_crystal']
    if 'cell_lengths' in crystal_data:
        if crystal_data['cell_lengths'].shape[0] != n_structures:
            raise ValueError("Crystal data length mismatch")

    # Check for NaN values
    for key, value in crystal_data.items():
        if isinstance(value, np.ndarray) and np.any(np.isnan(value)):
            raise ValueError(f"NaN values detected in {key}")

Data Type Management

Automatic Type Conversion

def convert_data_types(data_dict):
    """Convert data types for optimal HDF5 storage."""

    converted = {}

    for key, value in data_dict.items():
        if torch.is_tensor(value):
            # PyTorch tensor to NumPy array
            converted[key] = value.detach().cpu().numpy()

        elif isinstance(value, list):
            # Handle nested lists (e.g., atom labels)
            if all(isinstance(item, list) for item in value):
                converted[key] = value  # Keep as nested list
            else:
                converted[key] = np.array(value)

        elif isinstance(value, np.ndarray):
            # Ensure appropriate dtype
            if value.dtype == np.float64:
                converted[key] = value.astype(np.float32)  # Reduce precision
            elif value.dtype == np.int64:
                converted[key] = value.astype(np.int32)    # Reduce size
            else:
                converted[key] = value

        else:
            converted[key] = value

    return converted

HDF5 Dataset Configuration

def configure_hdf5_datasets(h5_file, n_structures, dimensions):
    """Configure HDF5 datasets with optimal settings."""

    # Compression and chunking settings
    compression_opts = {
        'compression': 'gzip',
        'compression_opts': 6,
        'shuffle': True,
        'fletcher32': True
    }

    # Create datasets with appropriate shapes and types
    crystal_datasets = {
        'cell_lengths': (n_structures, 3, np.float32),
        'cell_angles': (n_structures, 3, np.float32),
        'cell_volume': (n_structures, np.float32),
        'scaled_cell': (n_structures, 6, np.float32),
        'cell_matrix': (n_structures, 3, 3, np.float32)
    }

    for name, (shape, dtype) in crystal_datasets.items():
        h5_file.create_dataset(
            name,
            shape=shape if isinstance(shape, tuple) else (shape,),
            dtype=dtype,
            chunks=True,
            **compression_opts
        )

    # Variable-length datasets for ragged arrays
    vlen_datasets = {
        'atom_coords': h5py.vlen_dtype(np.float32),
        'atom_symbol': h5py.string_dtype('utf-8'),
        'bond_type': h5py.string_dtype('utf-8')
    }

    for name, dtype in vlen_datasets.items():
        h5_file.create_dataset(
            name,
            shape=(n_structures,),
            dtype=dtype,
            **compression_opts
        )

Performance Optimization

Chunking and Compression

def optimize_hdf5_performance(h5_file, dataset_name, data_shape):
    """Optimize HDF5 dataset for read/write performance."""

    # Calculate optimal chunk size
    if len(data_shape) == 1:
        # 1D datasets: chunk by batch size
        chunk_shape = (min(1000, data_shape[0]),)
    elif len(data_shape) == 2:
        # 2D datasets: chunk by row
        chunk_shape = (1, data_shape[1])
    elif len(data_shape) == 3:
        # 3D datasets: chunk by individual matrices
        chunk_shape = (1, data_shape[1], data_shape[2])
    else:
        chunk_shape = None

    # Configure dataset with optimal settings
    dataset = h5_file.create_dataset(
        dataset_name,
        shape=data_shape,
        chunks=chunk_shape,
        compression='gzip',
        compression_opts=6,
        shuffle=True,
        fletcher32=True,
        scaleoffset=2 if data_shape[-1] > 100 else None
    )

    return dataset

Parallel Writing

def parallel_data_writing(data_batches, h5_out_path, n_workers=4):
    """Write data using parallel workers (requires careful HDF5 handling)."""

    from multiprocessing import Pool
    import tempfile
    import os

    # Create temporary files for each worker
    temp_files = []
    for i in range(n_workers):
        temp_file = tempfile.NamedTemporaryFile(delete=False, suffix='.h5')
        temp_files.append(temp_file.name)
        temp_file.close()

    # Distribute batches among workers
    batch_chunks = [data_batches[i::n_workers] for i in range(n_workers)]

    def write_worker_data(args):
        worker_id, worker_batches, temp_path = args

        with h5py.File(temp_path, 'w') as h5_temp:
            writer = RawDataWriter(h5_temp)

            for batch_idx, batch_data in enumerate(worker_batches):
                start_idx = batch_idx * len(batch_data['refcodes'])
                writer.write_raw_crystal_data(start_idx, batch_data['raw_crystal'])

        return worker_id, temp_path

    # Execute parallel writing
    with Pool(n_workers) as pool:
        worker_args = [(i, batch_chunks[i], temp_files[i]) for i in range(n_workers)]
        results = pool.map(write_worker_data, worker_args)

    # Merge temporary files into final output
    merge_hdf5_files([result[1] for result in results], h5_out_path)

    # Clean up temporary files
    for temp_file in temp_files:
        os.unlink(temp_file)

Cross-References

Related CSA Modules:

External Dependencies:

  • HDF5 - Hierarchical data format

  • h5py - Python interface to HDF5

  • NumPy - Array operations and data structures

  • PyTorch - Tensor operations and GPU computing

Performance References: