csd_operations.CSDOperations
- class csd_operations.CSDOperations(data_directory, data_prefix)
High-level interface for querying, validating, clustering, and selecting crystal structures from the Cambridge Structural Database (CSD).
- data_directory
Base directory for reading and writing CSD-related files.
- Type:
Path
- reader
CCDC EntryReader instance connected to the “CSD” database.
- Type:
io.EntryReader
- similarity_engine
Engine for computing pairwise packing similarity.
- Type:
PackingSimilarity
- __init__(data_directory, data_prefix)
Initialize CSDOperations with target directory and filename prefix.
Methods
__init__(data_directory, data_prefix): Initialize CSDOperations with target directory and filename prefix.
cluster_families(filters): Perform packing similarity clustering on each refcode family.
filter_families_by_size(df[, min_size]): Exclude families with fewer than a specified number of members.
get_refcode_families_df(): Query the CSD and group entries by base refcode.
get_unique_structures(filters[, method]): Select one representative per cluster using the vdWFV metric.
save_refcode_families_csv([df, filename]): Write the refcode-families DataFrame to a CSV file.
- get_refcode_families_df()
Query the CSD and group entries by base refcode.
- Returns:
DataFrame with two columns: family_id (str, first six characters of the refcode) and refcode (str, full CSD refcode).
- Return type:
pd.DataFrame
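The grouping rule can be illustrated without a CSD installation. The following is a minimal sketch, assuming the refcodes are already available as a plain list; `build_refcode_families` is a hypothetical helper, not part of CSDOperations, and it does not touch the CCDC reader. It only shows how family_id is derived from the first six characters of each refcode.

```python
import pandas as pd


def build_refcode_families(refcodes):
    """Hypothetical helper: group refcodes by their six-character base."""
    df = pd.DataFrame({"refcode": sorted(refcodes)})
    # family_id is the first six characters of the full refcode
    df["family_id"] = df["refcode"].str[:6]
    return df[["family_id", "refcode"]]


# Sample refcodes chosen for illustration only
sample = ["ACSALA", "ACSALA01", "ACSALA02", "HXACAN", "HXACAN01"]
families = build_refcode_families(sample)
```

Here the five sample refcodes fall into two families, ACSALA and HXACAN.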
- save_refcode_families_csv(df=None, filename=None)
Write the refcode-families DataFrame to a CSV file.
- filter_families_by_size(df, min_size=2)
Exclude families with fewer than a specified number of members.
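As a rough illustration of the size filter, the sketch below keeps only rows whose family has at least min_size members. It assumes a DataFrame with the family_id and refcode columns described above; `filter_by_family_size` is a hypothetical stand-in and may differ from the actual implementation.

```python
import pandas as pd


def filter_by_family_size(df, min_size=2):
    """Hypothetical stand-in: drop families with fewer than min_size members."""
    # Attach each row's family size, then keep rows from large-enough families
    sizes = df.groupby("family_id")["refcode"].transform("size")
    return df[sizes >= min_size].reset_index(drop=True)


df = pd.DataFrame({
    "family_id": ["ACSALA", "ACSALA", "HXACAN"],
    "refcode": ["ACSALA", "ACSALA01", "HXACAN"],
})
kept = filter_by_family_size(df, min_size=2)
```

With min_size=2, the single-member HXACAN family is excluded and both ACSALA rows remain.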
- cluster_families(filters)
Perform packing similarity clustering on each refcode family.
Workflow
1. Load initial refcode families CSV.
2. Group refcodes by 'family_id'.
3. For each group, validate entries and build a similarity graph.
4. Identify connected components as clusters.
5. Save clustered results to CSV.
- Parameters:
filters (Dict[str, Any]): Criteria for structure validation.
- Returns:
DataFrame with columns ['family_id', 'refcode', 'cluster_id'].
- Return type:
pd.DataFrame
- Raises:
FileNotFoundError: If the initial CSV is missing.
RuntimeError: If clustering fails for any family.
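The graph step of the workflow can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's code: `connected_components` is a hypothetical helper, and the `is_similar` callback stands in for the packing-similarity comparison, whose real API is not shown here. An edge is added for every pair judged similar, and each connected component becomes one cluster.

```python
from itertools import combinations


def connected_components(nodes, is_similar):
    """Hypothetical helper: cluster nodes by connected components of a similarity graph."""
    # Build an undirected graph with an edge for every similar pair
    adjacency = {n: set() for n in nodes}
    for a, b in combinations(nodes, 2):
        if is_similar(a, b):
            adjacency[a].add(b)
            adjacency[b].add(a)

    clusters, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        # Depth-first traversal collects everything reachable from start
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        clusters.append(sorted(component))
    return clusters


# Made-up similarity results standing in for the packing-similarity engine
similar_pairs = {
    frozenset(("ACSALA", "ACSALA01")),
    frozenset(("ACSALA01", "ACSALA02")),
}
family = ["ACSALA", "ACSALA01", "ACSALA02", "ACSALA03"]
clusters = connected_components(
    family, lambda a, b: frozenset((a, b)) in similar_pairs
)
```

Note that connected components are transitive: ACSALA and ACSALA02 land in the same cluster through ACSALA01 even though that pair is not directly marked as similar, while ACSALA03 forms a cluster of its own.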
- get_unique_structures(filters, method='vdWFV')
Select one representative per cluster using the vdWFV metric.
Workflow
1. Load clustered families CSV.
2. Group by ['family_id', 'cluster_id'].
3. Compute vdWFV for each refcode and select the refcode with the minimum value in each group.
4. Save unique representatives to CSV.
- Parameters:
filters (Dict[str, Any]): Placeholder for revalidation filters.
method (str, default="vdWFV"): Only 'vdWFV' is supported.
- Returns:
DataFrame with columns ['family_id', 'refcode'].
- Return type:
pd.DataFrame
- Raises:
FileNotFoundError: If clustered CSV is missing.
NotImplementedError: If method is not 'vdWFV'.
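The selection step can be sketched with pandas as follows. This assumes the vdWFV values have already been computed and stored in a column named vdWFV; that column name and the numbers are made up for illustration, and the real method computes the metric from the structures themselves.

```python
import pandas as pd

# Clustered families with made-up vdWFV values (assumed precomputed)
clustered = pd.DataFrame({
    "family_id": ["ACSALA", "ACSALA", "ACSALA", "HXACAN"],
    "refcode": ["ACSALA", "ACSALA01", "ACSALA02", "HXACAN"],
    "cluster_id": [0, 0, 1, 0],
    "vdWFV": [0.31, 0.29, 0.33, 0.30],
})

# For each (family_id, cluster_id) group, find the row with the lowest vdWFV
idx = clustered.groupby(["family_id", "cluster_id"])["vdWFV"].idxmin()
unique = clustered.loc[idx, ["family_id", "refcode"]].reset_index(drop=True)
```

Each cluster contributes exactly one row to the result, so the output has as many rows as there are distinct (family_id, cluster_id) pairs.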