Data Model
CSA organizes crystal structure data in a hierarchical, efficient format optimized for computational analysis. This guide explains how CSA represents molecular crystals, stores variable-length data, and enables fast access to structural information.
Note
Understanding the data model is essential for writing analysis scripts and accessing CSA results programmatically.
Overview
CSA uses HDF5 (Hierarchical Data Format) to store crystal structure data in a scientifically robust, cross-platform format. The data model is designed to handle the inherent complexity of crystallographic data while enabling efficient computation on modern hardware.
Key Design Principles:
Variable-Length Storage: Crystals have different numbers of atoms, bonds, and contacts
Batch Processing: Data organized for efficient GPU tensor operations
Type Safety: Strict data types prevent computational errors
Metadata Rich: Complete provenance and experimental information
Compression: Efficient storage with lossless compression
Data Hierarchy
CSA organizes data at four hierarchical levels:
Dataset Level
├── Crystal Level (per structure)
│ ├── Molecule Level (per molecule in asymmetric unit)
│ │ ├── Fragment Level (per rigid fragment)
│ │ └── Atom Level (per atom)
│ └── Interaction Level (intermolecular contacts)
└── Metadata Level (experimental and computational details)
Dataset Level
The top level contains dataset-wide information and structure lists:
# Dataset structure overview
with h5py.File('structures_processed.h5', 'r') as f:
print("Dataset-level information:")
print(f"Total structures: {len(f['refcode_list'])}")
print(f"Processing date: {f.attrs.get('creation_date', 'N/A')}")
print(f"CSA version: {f.attrs.get('csa_version', 'N/A')}")
Key Datasets:
Dataset Name |
Data Type |
Description |
|---|---|---|
|
|
CSD reference codes for all structures |
|
|
Success/failure flags for each structure |
|
|
Processing time per structure (seconds) |
Crystal Level
Each crystal structure has associated properties describing the unit cell, space group, and overall characteristics:
# Crystal-level data access
def get_crystal_properties(f, structure_index):
"""Extract crystal-level properties for a specific structure."""
properties = {
'refcode': f['refcode_list'][structure_index].decode(),
'space_group': f['space_group'][structure_index].decode(),
'z_prime': f['z_prime'][structure_index],
'cell_volume': f['cell_volume'][structure_index],
'cell_density': f['cell_density'][structure_index],
'temperature': f['temperature'][structure_index],
'pressure': f['pressure'][structure_index]
}
return properties
Crystal Property Categories:
Unit Cell Parameters:
- cell_a, cell_b, cell_c: Lattice parameters (Å)
- cell_alpha, cell_beta, cell_gamma: Lattice angles (degrees)
- cell_volume: Unit cell volume (ų)
Crystallographic Properties:
- space_group: Space group symbol (e.g., “P21/c”)
- space_group_number: International space group number
- z_prime: Number of independent molecules in asymmetric unit
- crystal_system: Crystal system classification
Physical Properties:
- cell_density: Calculated density (g/cm³)
- packing_coefficient: Fraction of space occupied by molecules
- temperature: Measurement temperature (K)
- pressure: Measurement pressure (typically 1 atm)
Experimental Details:
- resolution: Diffraction resolution (Å)
- r_factor: Crystallographic R-factor
- data_completeness: Fraction of measured reflections
- radiation_type: X-ray source (e.g., “Mo Kα”)
Molecule Level
Within each crystal, CSA tracks individual molecules in the asymmetric unit:
# Molecule-level data access
def get_molecule_data(f, structure_index):
"""Extract molecular information for a structure."""
n_molecules = f['n_molecules'][structure_index]
molecules = []
for mol_idx in range(n_molecules):
# Molecular formula and properties
formula = f['molecule_formula'][structure_index][mol_idx].decode()
weight = f['molecule_weight'][structure_index][mol_idx]
charge = f['molecule_formal_charge'][structure_index][mol_idx]
molecules.append({
'formula': formula,
'molecular_weight': weight,
'formal_charge': charge,
'molecule_index': mol_idx
})
return molecules
Molecular Properties:
Chemical Identity:
- molecule_formula: Molecular formula (e.g., “C8H10N2O”)
- molecule_weight: Molecular weight (g/mol)
- molecule_formal_charge: Net formal charge
- molecule_multiplicity: Spin multiplicity
Connectivity:
- n_atoms: Number of atoms in molecule
- n_bonds: Number of covalent bonds
- n_fragments: Number of rigid fragments
- n_rings: Number of ring systems
Atom Level
Individual atoms are the fundamental units with coordinates, properties, and connectivity:
# Atom-level data access
def get_atomic_data(f, structure_index):
"""Extract atomic coordinates and properties."""
n_atoms = f['n_atoms'][structure_index]
# Atomic coordinates (Cartesian, Å)
coords = f['atom_coords'][structure_index].reshape(n_atoms, 3)
# Atomic properties
labels = f['atom_label'][structure_index].astype(str)
elements = f['atom_element'][structure_index].astype(str)
# Bond connectivity
bond_indices = f['bond_atom_indices'][structure_index]
bond_orders = f['bond_order'][structure_index]
return {
'coordinates': coords,
'labels': labels,
'elements': elements,
'bond_indices': bond_indices,
'bond_orders': bond_orders
}
Atomic Properties:
Positional Information:
- atom_coords: Cartesian coordinates (x, y, z) in Å
- atom_frac_coords: Fractional coordinates relative to unit cell
- atom_displacement: Atomic displacement parameters
Chemical Information:
- atom_element: Element symbol (e.g., “C”, “N”, “O”)
- atom_label: Unique atom identifier (e.g., “C1”, “N2”)
- atom_type: Atom type including hybridization
- atom_formal_charge: Formal charge on atom
Crystallographic Details:
- atom_site_occupancy: Site occupancy factor (0-1)
- atom_thermal_factor: Isotropic thermal parameter
- atom_symmetry_equivalent: Symmetry operation indices
Fragment Level
CSA automatically identifies rigid molecular fragments - groups of atoms connected by non-rotatable bonds:
# Fragment-level analysis
def analyze_fragments(f, structure_index):
"""Analyze rigid molecular fragments."""
n_fragments = f['n_fragments'][structure_index]
fragments = []
for frag_idx in range(n_fragments):
# Fragment identification
formula = f['fragment_formula'][structure_index][frag_idx].decode()
atom_indices = f['fragment_atom_indices'][structure_index][frag_idx]
# Geometric properties
com_coords = f['fragment_com_coords'][structure_index][frag_idx * 3:(frag_idx + 1) * 3]
inertia_eigvals = f['fragment_inertia_eigvals'][structure_index][frag_idx * 3:(frag_idx + 1) * 3]
# Shape descriptors
asphericity = inertia_eigvals[2] - 0.5 * (inertia_eigvals[0] + inertia_eigvals[1])
acylindricity = inertia_eigvals[1] - inertia_eigvals[0]
fragments.append({
'formula': formula,
'atom_indices': atom_indices,
'center_of_mass': com_coords,
'inertia_eigenvalues': inertia_eigvals,
'asphericity': asphericity,
'acylindricity': acylindricity
})
return fragments
Fragment Properties:
Identity and Composition:
- fragment_formula: Chemical formula of fragment
- fragment_atom_indices: Indices of atoms in fragment
- fragment_n_atoms: Number of atoms in fragment
- fragment_molecular_weight: Fragment molecular weight
Geometric Properties:
- fragment_com_coords: Center of mass coordinates (x, y, z)
- fragment_inertia_tensor: 3×3 inertia tensor
- fragment_inertia_eigvals: Principal moments of inertia
- fragment_inertia_eigvecs: Principal axes of inertia
Shape Descriptors:
- fragment_radius_gyration: Radius of gyration
- fragment_asphericity: Deviation from spherical shape
- fragment_acylindricity: Deviation from cylindrical shape
- fragment_relative_anisotropy: Overall shape anisotropy
Interaction Level
CSA identifies and characterizes intermolecular interactions:
# Intermolecular contact analysis
def analyze_contacts(f, structure_index):
"""Analyze intermolecular contacts and hydrogen bonds."""
# Intermolecular contacts
n_contacts = f['inter_cc_n_contacts'][structure_index]
contacts = []
for contact_idx in range(n_contacts):
# Contact atoms
central_atom = f['inter_cc_central_atom'][structure_index][contact_idx].decode()
contact_atom = f['inter_cc_contact_atom'][structure_index][contact_idx].decode()
# Geometric properties
length = f['inter_cc_length'][structure_index][contact_idx]
is_hbond = f['inter_cc_is_hbond'][structure_index][contact_idx]
contacts.append({
'central_atom': central_atom,
'contact_atom': contact_atom,
'length': length,
'is_hydrogen_bond': bool(is_hbond)
})
return contacts
Interaction Types:
Close Contacts:
- inter_cc_central_atom: Central atom in contact
- inter_cc_contact_atom: Contacting atom
- inter_cc_length: Contact distance (Å)
- inter_cc_angle: Contact angle (degrees)
Hydrogen Bonds:
- inter_cc_is_hbond: Boolean flag for H-bond classification
- hbond_donor_atom: H-bond donor atom
- hbond_acceptor_atom: H-bond acceptor atom
- hbond_dha_angle: Donor-H-acceptor angle (degrees)
π-π Interactions:
- pi_pi_distance: Distance between ring centroids (Å)
- pi_pi_angle: Angle between ring planes (degrees)
- pi_pi_offset: Lateral offset between rings (Å)
Data Types and Storage
HDF5 Data Organization
CSA uses HDF5’s advanced features for efficient data storage:
Fixed-Length Arrays: For properties that are the same size across all structures:
# Fixed-length datasets
f['z_prime'] # Shape: (n_structures,)
f['cell_volume'] # Shape: (n_structures,)
f['cell_density'] # Shape: (n_structures,)
Variable-Length Arrays: For properties that vary by structure:
# Variable-length datasets
f['atom_coords'] # Shape: varies per structure
f['fragment_formula'] # Shape: varies per structure
f['inter_cc_length'] # Shape: varies per structure
String Handling: Efficient storage of text data:
# String datasets with UTF-8 encoding
refcodes = f['refcode_list'][...].astype(str)
space_groups = [f['space_group'][i].decode() for i in range(len(f['refcode_list']))]
Data Access Patterns
Sequential Access: For processing all structures:
def process_all_structures(hdf5_file):
"""Process structures sequentially."""
with h5py.File(hdf5_file, 'r') as f:
n_structures = len(f['refcode_list'])
for i in range(n_structures):
refcode = f['refcode_list'][i].decode()
cell_volume = f['cell_volume'][i]
# Process structure...
yield refcode, cell_volume
Random Access: For accessing specific structures:
def get_structure_by_refcode(hdf5_file, target_refcode):
"""Find and return data for a specific refcode."""
with h5py.File(hdf5_file, 'r') as f:
refcodes = f['refcode_list'][...].astype(str)
# Find matching index
indices = np.where(refcodes == target_refcode)[0]
if len(indices) == 0:
return None
idx = indices[0]
# Extract structure data
structure_data = {
'refcode': target_refcode,
'space_group': f['space_group'][idx].decode(),
'cell_volume': f['cell_volume'][idx],
'n_atoms': f['n_atoms'][idx]
}
return structure_data
Batch Access: For efficient processing of multiple structures:
def get_batch_properties(hdf5_file, start_idx, batch_size):
"""Get properties for a batch of structures."""
with h5py.File(hdf5_file, 'r') as f:
end_idx = min(start_idx + batch_size, len(f['refcode_list']))
batch_data = {
'refcodes': f['refcode_list'][start_idx:end_idx].astype(str),
'volumes': f['cell_volume'][start_idx:end_idx],
'densities': f['cell_density'][start_idx:end_idx],
'n_atoms': f['n_atoms'][start_idx:end_idx]
}
return batch_data
Memory Management
Efficient Data Loading
CSA’s data model is designed for memory-efficient processing:
Lazy Loading: Only load data when needed:
def lazy_structure_iterator(hdf5_file):
"""Iterate through structures without loading all data."""
with h5py.File(hdf5_file, 'r') as f:
n_structures = len(f['refcode_list'])
for i in range(n_structures):
# Load only needed data for this structure
yield {
'index': i,
'refcode': f['refcode_list'][i].decode(),
'loader': lambda idx=i: load_structure_data(f, idx)
}
Chunked Processing: Process data in manageable chunks:
def process_in_chunks(hdf5_file, chunk_size=1000):
"""Process structures in memory-efficient chunks."""
with h5py.File(hdf5_file, 'r') as f:
n_structures = len(f['refcode_list'])
for start in range(0, n_structures, chunk_size):
end = min(start + chunk_size, n_structures)
# Load chunk data
chunk_data = {
'refcodes': f['refcode_list'][start:end].astype(str),
'volumes': f['cell_volume'][start:end],
'densities': f['cell_density'][start:end]
}
# Process chunk
yield chunk_data
Selective Loading: Load only required datasets:
def load_specific_properties(hdf5_file, properties):
"""Load only specified properties to save memory."""
data = {}
with h5py.File(hdf5_file, 'r') as f:
for prop in properties:
if prop in f:
data[prop] = f[prop][...]
else:
print(f"Warning: Property '{prop}' not found in dataset")
return data
Compression and Storage Efficiency
CSA uses compression to minimize storage requirements:
Compression Settings:
# HDF5 compression configuration used by CSA
compression_settings = {
'compression': 'lz4', # Fast compression/decompression
'compression_opts': 4, # Moderate compression level
'shuffle': True, # Byte shuffling for better compression
'fletcher32': True # Checksums for data integrity
}
Storage Efficiency Statistics:
Data Type |
Uncompressed Size |
Compressed Size |
Compression Ratio |
|---|---|---|---|
Atomic coordinates |
12 MB/1000 structures |
3.2 MB |
3.8:1 |
Fragment properties |
8 MB/1000 structures |
2.1 MB |
3.8:1 |
Contact data |
15 MB/1000 structures |
4.2 MB |
3.6:1 |
String data |
2 MB/1000 structures |
0.8 MB |
2.5:1 |
Data Validation and Integrity
Built-in Validation
CSA includes comprehensive data validation:
Data Type Validation:
def validate_data_types(hdf5_file):
"""Validate data types in HDF5 file."""
expected_types = {
'refcode_list': 'string',
'z_prime': 'int8',
'cell_volume': 'float32',
'cell_density': 'float32',
'n_atoms': 'int16',
'atom_coords': 'float32'
}
issues = []
with h5py.File(hdf5_file, 'r') as f:
for dataset_name, expected_type in expected_types.items():
if dataset_name in f:
actual_type = str(f[dataset_name].dtype)
if expected_type not in actual_type:
issues.append(f"{dataset_name}: expected {expected_type}, got {actual_type}")
return issues
Range Validation:
def validate_data_ranges(hdf5_file):
"""Validate that data values are within reasonable ranges."""
validation_rules = {
'cell_volume': (10.0, 50000.0), # ų
'cell_density': (0.5, 5.0), # g/cm³
'z_prime': (1, 8), # Typical Z' values
'temperature': (4.0, 1000.0), # K
'resolution': (0.5, 5.0) # Å
}
issues = []
with h5py.File(hdf5_file, 'r') as f:
for property_name, (min_val, max_val) in validation_rules.items():
if property_name in f:
data = f[property_name][...]
if np.any(data < min_val) or np.any(data > max_val):
out_of_range = np.sum((data < min_val) | (data > max_val))
issues.append(f"{property_name}: {out_of_range} values out of range [{min_val}, {max_val}]")
return issues
Consistency Checks:
def validate_data_consistency(hdf5_file):
"""Check for internal consistency in the data."""
issues = []
with h5py.File(hdf5_file, 'r') as f:
n_structures = len(f['refcode_list'])
# Check that all fixed-length arrays have correct size
fixed_length_datasets = ['z_prime', 'cell_volume', 'cell_density', 'n_atoms']
for dataset_name in fixed_length_datasets:
if dataset_name in f:
if len(f[dataset_name]) != n_structures:
issues.append(f"{dataset_name}: length {len(f[dataset_name])} != {n_structures}")
# Check atom count consistency
for i in range(min(100, n_structures)): # Check first 100 structures
expected_atoms = f['n_atoms'][i]
actual_coords = len(f['atom_coords'][i]) // 3 # 3 coordinates per atom
if expected_atoms != actual_coords:
refcode = f['refcode_list'][i].decode()
issues.append(f"{refcode}: n_atoms={expected_atoms} but {actual_coords} coordinates")
return issues
Working with CSA Data
Common Analysis Patterns
Property Extraction and Analysis:
import pandas as pd
import numpy as np
def extract_crystal_properties(hdf5_file):
"""Extract crystal properties into a pandas DataFrame."""
with h5py.File(hdf5_file, 'r') as f:
data = {
'refcode': f['refcode_list'][...].astype(str),
'space_group': [f['space_group'][i].decode() for i in range(len(f['refcode_list']))],
'z_prime': f['z_prime'][...],
'cell_volume': f['cell_volume'][...],
'cell_density': f['cell_density'][...],
'n_atoms': f['n_atoms'][...],
'n_fragments': f['n_fragments'][...],
'temperature': f['temperature'][...]
}
return pd.DataFrame(data)
Fragment Analysis Workflow:
def analyze_fragment_shapes(hdf5_file):
"""Analyze molecular fragment shapes across dataset."""
fragment_data = []
with h5py.File(hdf5_file, 'r') as f:
n_structures = len(f['refcode_list'])
for i in range(n_structures):
refcode = f['refcode_list'][i].decode()
n_frags = f['n_fragments'][i]
for j in range(n_frags):
# Extract fragment properties
formula = f['fragment_formula'][i][j].decode()
# Inertia eigenvalues for shape analysis
inertia_start = j * 3
inertia_end = (j + 1) * 3
eigvals = f['fragment_inertia_eigvals'][i][inertia_start:inertia_end]
# Calculate shape descriptors
asphericity = eigvals[2] - 0.5 * (eigvals[0] + eigvals[1])
acylindricity = eigvals[1] - eigvals[0]
fragment_data.append({
'refcode': refcode,
'fragment_formula': formula,
'asphericity': asphericity,
'acylindricity': acylindricity
})
return pd.DataFrame(fragment_data)
Contact Network Analysis:
def build_contact_network(hdf5_file, distance_cutoff=3.5):
"""Build intermolecular contact networks."""
networks = {}
with h5py.File(hdf5_file, 'r') as f:
n_structures = len(f['refcode_list'])
for i in range(n_structures):
refcode = f['refcode_list'][i].decode()
n_contacts = f['inter_cc_n_contacts'][i]
# Build contact list for this structure
contacts = []
for j in range(n_contacts):
central = f['inter_cc_central_atom'][i][j].decode()
contact = f['inter_cc_contact_atom'][i][j].decode()
length = f['inter_cc_length'][i][j]
is_hbond = f['inter_cc_is_hbond'][i][j]
if length <= distance_cutoff:
contacts.append({
'central': central,
'contact': contact,
'length': length,
'is_hbond': bool(is_hbond)
})
networks[refcode] = contacts
return networks
Data Export and Conversion
Export to Common Formats:
def export_to_csv(hdf5_file, output_prefix):
"""Export CSA data to CSV files for external analysis."""
# Crystal properties
crystal_df = extract_crystal_properties(hdf5_file)
crystal_df.to_csv(f'{output_prefix}_crystals.csv', index=False)
# Fragment properties
fragment_df = analyze_fragment_shapes(hdf5_file)
fragment_df.to_csv(f'{output_prefix}_fragments.csv', index=False)
# Contact summary
contact_summary = []
with h5py.File(hdf5_file, 'r') as f:
for i in range(len(f['refcode_list'])):
refcode = f['refcode_list'][i].decode()
n_contacts = f['inter_cc_n_contacts'][i]
n_hbonds = np.sum(f['inter_cc_is_hbond'][i][:n_contacts])
contact_summary.append({
'refcode': refcode,
'total_contacts': n_contacts,
'hydrogen_bonds': n_hbonds,
'other_contacts': n_contacts - n_hbonds
})
pd.DataFrame(contact_summary).to_csv(f'{output_prefix}_contacts.csv', index=False)
Integration with Analysis Libraries:
def prepare_for_sklearn(hdf5_file, properties):
"""Prepare data for scikit-learn analysis."""
X = []
structure_ids = []
with h5py.File(hdf5_file, 'r') as f:
n_structures = len(f['refcode_list'])
for i in range(n_structures):
# Extract specified properties as feature vector
features = []
for prop in properties:
if prop in f:
features.append(f[prop][i])
if len(features) == len(properties): # Only include complete cases
X.append(features)
structure_ids.append(f['refcode_list'][i].decode())
return np.array(X), structure_ids
Best Practices
Data Access Optimization
Use batch processing for large datasets
Close files promptly to free resources
Cache frequently accessed data in memory
Use appropriate data types to minimize memory usage
Leverage HDF5 chunking for better I/O performance
Error Handling
def robust_data_access(hdf5_file, dataset_name, structure_index):
"""Safely access data with proper error handling."""
try:
with h5py.File(hdf5_file, 'r') as f:
if dataset_name not in f:
raise KeyError(f"Dataset '{dataset_name}' not found")
if structure_index >= len(f['refcode_list']):
raise IndexError(f"Structure index {structure_index} out of range")
return f[dataset_name][structure_index]
except (OSError, IOError) as e:
raise RuntimeError(f"Error accessing HDF5 file: {e}")
Performance Considerations
Profile data access patterns before optimization
Use vectorized operations when possible
Minimize file open/close cycles in loops
Consider memory vs. speed tradeoffs for caching
Monitor memory usage during large analyses
Next Steps
With understanding of CSA’s data model:
Basic Users: Apply this knowledge in Basic Analysis workflows Advanced Users: Explore advanced_features for complex analyses Developers: Review data_reader module for programmatic access
See Also
Basic Analysis : Practical data access examples ../examples/reading_data : Code examples for data access hdf5_utils module : Low-level HDF5 utilities ../technical_details/performance : Performance optimization guide