csd_operations.CSDOperations

class csd_operations.CSDOperations(data_directory, data_prefix)

High-level interface for querying, validating, clustering, and selecting crystal structures from the Cambridge Structural Database (CSD).

data_directory

Base directory for reading and writing CSD-related files.

Type:

Path

data_prefix

Prefix used when naming output files.

Type:

str

reader

CCDC EntryReader instance connected to the “CSD” database.

Type:

io.EntryReader

similarity_engine

Engine for computing pairwise packing similarity.

Type:

PackingSimilarity

__init__(data_directory, data_prefix)

Initialize CSDOperations with target directory and filename prefix.

Parameters:
  • data_directory (Union[str, Path]) – Directory under which all CSV outputs will be saved.

  • data_prefix (str) – Prefix for generated CSV filenames (e.g., “<prefix>_refcode_families.csv”).

Methods

__init__(data_directory, data_prefix)

Initialize CSDOperations with target directory and filename prefix.

cluster_families(filters)

Perform packing similarity clustering on each refcode family.

filter_families_by_size(df[, min_size])

Exclude families with fewer than a specified number of members.

get_refcode_families_df()

Query the CSD and group entries by base refcode.

get_unique_structures(filters[, method])

Select one representative per cluster using the vdWFV metric.

save_refcode_families_csv([df, filename])

Write the refcode-families DataFrame to a CSV file.

get_refcode_families_df()

Query the CSD and group entries by base refcode.

Returns:

DataFrame with columns:

  • family_id (str) – First six characters of the refcode.

  • refcode (str) – Full CSD refcode.

Return type:

pd.DataFrame
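
The grouping rule described above (the family identifier is the first six characters of the refcode) can be illustrated without access to the CSD. The following is a minimal sketch in plain Python, not the method's implementation; the helper `group_by_family` and the sample refcode list are hypothetical and are not read from the database.

```python
from collections import defaultdict


def group_by_family(refcodes):
    """Group refcodes by their first six characters (the family identifier)."""
    families = defaultdict(list)
    for refcode in refcodes:
        families[refcode[:6]].append(refcode)
    return dict(families)


# Illustrative refcodes only; nothing here is queried from the CSD.
sample = ["ACSALA", "ACSALA01", "ACSALA02", "HXACAN", "HXACAN01"]
families = group_by_family(sample)
print(families)
```

Each key corresponds to a family_id value and each list entry to one refcode row of the returned DataFrame.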

save_refcode_families_csv(df=None, filename=None)

Write the refcode-families DataFrame to a CSV file.

Parameters:
  • df (pd.DataFrame, optional) – DataFrame to save. If None, the DataFrame returned by get_refcode_families_df() is saved.

  • filename (Union[str, Path], optional) – Full file path for output. If None, defaults to data_directory / f"{data_prefix}_refcode_families.csv".

Raises:

OSError – If writing to disk fails.
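
As a sketch of how the default filename is assembled when filename is None, assuming the expression given in the parameter description is used as written. The function name `default_families_csv` and the example arguments are hypothetical.

```python
from pathlib import Path


def default_families_csv(data_directory, data_prefix):
    """Build the default output path documented for filename=None."""
    # Path() accepts both str and Path, matching Union[str, Path].
    return Path(data_directory) / f"{data_prefix}_refcode_families.csv"


# Hypothetical directory and prefix, for illustration only.
path = default_families_csv("data", "csd")
print(path.name)
```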

filter_families_by_size(df, min_size=2)

Exclude families with fewer than a specified number of members.

Parameters:
  • df (pd.DataFrame) – DataFrame with columns [‘family_id’, ‘refcode’].

  • min_size (int, default=2) – Minimum number of members for a family to be retained.

Returns:

Filtered DataFrame.

Return type:

pd.DataFrame

Raises:

KeyError – If ‘family_id’ column is missing.
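
The size filter can be sketched on plain dictionaries standing in for DataFrame rows. This is an illustration under that assumption, not the method's implementation; `filter_rows_by_family_size` and the sample rows are hypothetical.

```python
from collections import Counter


def filter_rows_by_family_size(rows, min_size=2):
    """Keep rows whose family has at least min_size members."""
    # Indexing row["family_id"] raises KeyError when the key is absent,
    # mirroring the documented behaviour for a missing column.
    counts = Counter(row["family_id"] for row in rows)
    return [row for row in rows if counts[row["family_id"]] >= min_size]


# Made-up rows: one family with two members, one with a single member.
rows = [
    {"family_id": "AAAAAA", "refcode": "AAAAAA"},
    {"family_id": "AAAAAA", "refcode": "AAAAAA01"},
    {"family_id": "BBBBBB", "refcode": "BBBBBB"},
]
kept = filter_rows_by_family_size(rows)
print([row["refcode"] for row in kept])
```

On a real DataFrame, the same effect is commonly obtained with df.groupby("family_id").filter(lambda g: len(g) >= min_size); whether the method does so internally is not stated here.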

cluster_families(filters)

Perform packing similarity clustering on each refcode family.

Workflow

  1. Load initial refcode families CSV.

  2. Group refcodes by ‘family_id’.

  3. For each group, validate entries and build a similarity graph.

  4. Identify connected components as clusters.

  5. Save clustered results to CSV.

Parameters:
  • filters (Dict[str, Any]) – Criteria for structure validation.

Returns:

DataFrame with columns [‘family_id’, ‘refcode’, ‘cluster_id’].

Return type:

pd.DataFrame

Raises:
  • FileNotFoundError – If the initial CSV is missing.

  • RuntimeError – If clustering fails for any family.
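
Steps 3 and 4 of the workflow (build a similarity graph, then take connected components as clusters) can be sketched as follows. This is a minimal illustration, not the method's implementation: the `is_similar` callback stands in for the packing similarity comparison, and the numeric values used to define it are invented for the example.

```python
from itertools import combinations


def cluster_by_similarity(items, is_similar):
    """Label connected components of the graph whose edges are similar pairs."""
    # Build an adjacency list from all pairwise comparisons.
    neighbours = {item: set() for item in items}
    for a, b in combinations(items, 2):
        if is_similar(a, b):
            neighbours[a].add(b)
            neighbours[b].add(a)

    # Depth-first traversal assigns one cluster id per component.
    cluster_of = {}
    next_id = 0
    for start in items:
        if start in cluster_of:
            continue
        stack = [start]
        while stack:
            node = stack.pop()
            if node in cluster_of:
                continue
            cluster_of[node] = next_id
            stack.extend(n for n in neighbours[node] if n not in cluster_of)
        next_id += 1
    return cluster_of


# Hypothetical stand-in for packing similarity: items are "similar"
# when their made-up values differ by at most 1.0.
value = {"A": 1.0, "B": 1.5, "C": 2.2, "D": 9.0}
clusters = cluster_by_similarity(
    list(value), lambda a, b: abs(value[a] - value[b]) <= 1.0
)
print(clusters)
```

Because clusters are connected components, membership is transitive: A and C are not directly similar in this example, yet they share a cluster through B.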

get_unique_structures(filters, method='vdWFV')

Select one representative per cluster using the vdWFV metric.

Workflow

  1. Load clustered families CSV.

  2. Group by [‘family_id’, ‘cluster_id’].

  3. Compute vdWFV for each refcode; select the refcode with the minimum value in each group.

  4. Save unique representatives to CSV.

Parameters:
  • filters (Dict[str, Any]) – Placeholder for revalidation filters.

  • method (str, default='vdWFV') – Only ‘vdWFV’ is supported.

Returns:

DataFrame with columns [‘family_id’, ‘refcode’].

Return type:

pd.DataFrame

Raises:
  • FileNotFoundError – If clustered CSV is missing.

  • NotImplementedError – If method is not ‘vdWFV’.
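
The selection rule (one refcode per [‘family_id’, ‘cluster_id’] group, chosen by minimum metric value) can be sketched on plain dictionaries. This is an illustration only: `select_representatives` is a hypothetical helper, the `score` callback stands in for the vdWFV calculation, and the score values are invented.

```python
def select_representatives(rows, score, method="vdWFV"):
    """Return the lowest-scoring refcode for each (family_id, cluster_id) group."""
    if method != "vdWFV":
        raise NotImplementedError(f"Unsupported method: {method!r}")
    best = {}
    for row in rows:
        key = (row["family_id"], row["cluster_id"])
        # Keep the row with the smallest score seen so far for this group.
        if key not in best or score(row["refcode"]) < score(best[key]["refcode"]):
            best[key] = row
    return [
        {"family_id": row["family_id"], "refcode": row["refcode"]}
        for row in best.values()
    ]


# Made-up clustered rows and scores; not computed from real structures.
rows = [
    {"family_id": "AAAAAA", "refcode": "AAAAAA", "cluster_id": 0},
    {"family_id": "AAAAAA", "refcode": "AAAAAA01", "cluster_id": 0},
    {"family_id": "AAAAAA", "refcode": "AAAAAA02", "cluster_id": 1},
]
scores = {"AAAAAA": 0.31, "AAAAAA01": 0.28, "AAAAAA02": 0.35}
unique = select_representatives(rows, scores.__getitem__)
print(unique)
```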