Understanding GufeTokenizables

Most objects in gufe are subclasses of GufeTokenizable. This base class enforces common behavior and sets requirements necessary to guarantee performance and reproducibility between all downstream packages that use gufe.

For example, when we create a SmallMoleculeComponent representing benzene, that object is also a GufeTokenizable:

# load benzene as a gufe SmallMoleculeComponent
>>> import gufe
>>> benzene = gufe.SmallMoleculeComponent.from_sdf_file("benzene.sdf")
>>> type(benzene)
gufe.components.smallmoleculecomponent.SmallMoleculeComponent

>>> from gufe.tokenization import GufeTokenizable
>>> isinstance(benzene, GufeTokenizable)
True

By definition, a GufeTokenizable must be:

  1. immutable

  2. hashable

  3. serializable

1. Immutability of GufeTokenizables

One important restriction on GufeTokenizable subclasses is that they must be immutable, meaning that their attributes do not change after initialization. In other words, all attributes are fixed when you create an object. If your object is immutable, then it is suitable to be a GufeTokenizable.

For example, once the benzene molecule from above is loaded, its attributes (such as name) are immutable:

# benzene is a GufeTokenizable, and therefore immutable:
>>> benzene.name
'benzene'
>>> benzene.name = 'benzene_1'
AttributeError

While we cannot mutate benzene itself, we can create a new object based on a mutated copy of its contents. Every GufeTokenizable has the copy_with_replacements() method, which is a convenience method around the following approach:

# get the contents of benzene as a dict
>>> dct = benzene.to_dict()

# now we can mutate any property in this dict
>>> dct['molprops']['ofe-name'] = 'benzene1'

# create a new gufe object based on the mutated dict
>>> benzene1 = gufe.SmallMoleculeComponent.from_dict(dct)
>>> benzene1.name
'benzene1'
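
The same "copy with replacements" idea can be illustrated without gufe. The sketch below is only an analogy built on a frozen dataclass from the Python standard library; the Molecule class and its fields are hypothetical and are not part of gufe.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Molecule:
    """Hypothetical immutable record; not a gufe class."""
    name: str
    n_atoms: int


benzene = Molecule(name="benzene", n_atoms=12)

# assigning to an attribute of a frozen dataclass raises
# FrozenInstanceError, which is a subclass of AttributeError
try:
    benzene.name = "benzene_1"
    mutated = True
except FrozenInstanceError:
    mutated = False

# instead, build a new object from the old contents plus replacements
benzene1 = replace(benzene, name="benzene1")
```

The original object is left untouched, and the new object differs only in the replaced field.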

Immutability is critical to gufe’s design, because it means that gufe can generate a deterministic unique identifier (the GufeKey) based on the GufeTokenizable’s properties.

2. Hashing GufeTokenizables: the GufeKey

Because gufe objects are immutable, each object has a unique identifier, which we call its GufeKey. The GufeKey is a string, typically in the format {CLASS_NAME}-{HEXADECIMAL_LABEL}.

For our benzene SmallMoleculeComponent, the key is 'SmallMoleculeComponent-ec3c7a92771f8872dab1a9fc4911c795':

# get the GufeKey of the benzene GufeTokenizable
>>> benzene.key
'SmallMoleculeComponent-ec3c7a92771f8872dab1a9fc4911c795'

For most objects, the hexadecimal label is generated from the contents of the object. In particular, it is based on the contents of the _to_dict() dictionary, filtered to remove anything that matches the _defaults() dictionary.

For our benzene object, that means that its GufeKey is directly determined from all items in its to_dict() representation, except for :version:, since that is a default parameter:

# these defaults are not used to determine the GufeKey
>>> benzene.defaults()
{'name': '', ':version:': 1}

# these contents, except for `:version:` (a default), are used to determine the GufeKey
>>> benzene.to_dict()
{'atoms': [(6, 0, 0, True, 0, 0, {}, 3),
(6, 0, 0, True, 0, 0, {}, 3),
(6, 0, 0, True, 0, 0, {}, 3),
(6, 0, 0, True, 0, 0, {}, 3),
(6, 0, 0, True, 0, 0, {}, 3),
(6, 0, 0, True, 0, 0, {}, 3),
(1, 0, 0, False, 0, 0, {}, 1),
(1, 0, 0, False, 0, 0, {}, 1),
(1, 0, 0, False, 0, 0, {}, 1),
(1, 0, 0, False, 0, 0, {}, 1),
(1, 0, 0, False, 0, 0, {}, 1),
(1, 0, 0, False, 0, 0, {}, 1)],
'bonds': [(0, 1, 12, 0, {}),
(0, 5, 12, 0, {}),
(0, 6, 1, 0, {}),
(1, 2, 12, 0, {}),
(1, 7, 1, 0, {}),
(2, 3, 12, 0, {}),
(2, 8, 1, 0, {}),
(3, 4, 12, 0, {}),
(3, 9, 1, 0, {}),
(4, 5, 12, 0, {}),
(4, 10, 1, 0, {}),
(5, 11, 1, 0, {})],
'conformer': ("\x93NUMPY\x01\x00v\x00{'descr': '<f8', 'fortran_order': False, 'shape': (12, 3), }                                                         \nî|?5^ú9@\x02+\x87\x16ÙN\x15@\x04V\x0e\x1d\x13@\x85ëQ¸\x1ee:@²\x9dï§ÆK\x14@Ë¡E¶óý\x0b@×£p=\nW;@q=\n×£p\x17@\x9eï§ÆK7\x07@\x83ÀÊ¡EÖ;@Év¾\x9f\x1a¯\x1b@Zd;ßO\x8d\x0c@ìQ¸\x1e\x85k;@b\x10X9´È\x1c@\x06\x81\x95C\x8bl\x13@sh\x91í|\x7f:@j¼t\x93\x18\x84\x19@ÇK7\x89\x15\x9e<,Ô:9@<NÑ\x91\\¾\x12@\x97ÿ\x90~ûú\x14@\x0f\x9c3¢´÷9@\x8d\r¾ð\x10\x16HPü\x98\x07@ªñÒMb°;@¼\x05\x12\x14?\x86\x16@Ãdª`TRþ?¦\x9bÄ °\x92<@Ý$\x06\x81\x95C\x1e@Kê\x044\x11\x08@RI\x9d\x80&Ò;@\x02\x9a\x08\x1b\x9e\x1e @zÇ):\x92\x8b\x15@9EGrù/:@}?5^ºI\x1a@]mÅþ²û\x19@",
{}),
'molprops': {'ofe-name': 'benzene'},
'__qualname__': 'SmallMoleculeComponent',
'__module__': 'gufe.components.smallmoleculecomponent',
':version:': 1}
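
To make the idea of "hash the contents, minus the defaults" concrete, here is a minimal sketch using only the standard library. It is not gufe's implementation: the toy_key helper is hypothetical, and the choice of SHA-256 over a sorted-key JSON dump is an assumption made for illustration (gufe's actual normalization and hash function may differ).

```python
import hashlib
import json


def toy_key(class_name, dct, defaults):
    """Hypothetical helper: build '{CLASS_NAME}-{HEX}' from non-default contents."""
    # drop entries whose value equals the declared default
    filtered = {
        k: v for k, v in dct.items()
        if not (k in defaults and defaults[k] == v)
    }
    # sort keys so that the same contents always produce the same bytes
    payload = json.dumps(filtered, sort_keys=True).encode("utf-8")
    return f"{class_name}-{hashlib.sha256(payload).hexdigest()}"


defaults = {"name": "", ":version:": 1}

# same contents in a different insertion order: same key
a = toy_key("Toy", {"molprops": {"ofe-name": "benzene"}, ":version:": 1}, defaults)
b = toy_key("Toy", {":version:": 1, "molprops": {"ofe-name": "benzene"}}, defaults)

# different contents: different key
c = toy_key("Toy", {"molprops": {"ofe-name": "benzene1"}, ":version:": 1}, defaults)
```

Because the default entries are filtered out before hashing, adding a new field with a default value would not change the keys of existing objects in this sketch.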

This gives the GufeKey the following important properties:

  • A GufeKey is based on a cryptographic hash, so it is extremely unlikely that two objects that are functionally different will have the same key.

  • GufeKey creation is deterministic, so that it is preserved for a given Python environment across processes on the same hardware.

These properties, in particular the stability across Python sessions, make the GufeKey a stable identifier for the object. Because of this stability, GufeKeys can be used for store-by-reference, which in turn lets objects be deduplicated to optimize memory and performance.

Note

GufeKeys are not guaranteed to be stable across different Python environments or hardware.

Deduplication of GufeTokenizables

There are two types of deduplication of GufeTokenizables:

  • Objects are deduplicated in memory because gufe keeps a registry of all instantiated GufeTokenizables.

  • Objects can be deduplicated on storage to disk because we store by reference to the gufe key.

Deduplication in memory (flyweight pattern)

Memory deduplication means that only one object with a given GufeKey will exist in any single Python session. We ensure this by maintaining a registry of all GufeTokenizables that gets updated any time a GufeTokenizable is created. The registry is a mapping to weak references, which allows Python’s garbage collection to clean up GufeTokenizables that are no longer needed. This is essentially an implementation of the flyweight pattern.

This memory deduplication is ensured by GufeTokenizable.from_dict, which is typically used in deserialization. It will always use the first object in memory with that GufeKey. In practice, that leads to the following behavior, where Foo() is representative of any GufeTokenizable:

# here Foo is a GufeTokenizable:
>>> a = Foo(0)
>>> b = Foo(0)
>>> a is b
True
# serialize Foo() to a pure dict representation
>>> foo_as_dict = a.to_dict()
# deserialize back into a GufeTokenizable
>>> c = Foo.from_dict(foo_as_dict)
>>> c is a
True
>>> c is b
True
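
The registry of weak references can be sketched with the standard library alone. The class below is a hypothetical stand-in and not gufe code: it deduplicates in __new__ for brevity and derives its key from repr(value), both of which are simplifying assumptions.

```python
import gc
import weakref


class Foo:
    """Hypothetical stand-in for a GufeTokenizable; not gufe code."""

    # maps key -> weak reference, so unused objects can still be collected
    _registry = weakref.WeakValueDictionary()

    def __new__(cls, value):
        key = f"{cls.__name__}-{value!r}"
        existing = cls._registry.get(key)
        if existing is not None:
            return existing  # reuse the object already in memory
        obj = super().__new__(cls)
        obj.value = value
        obj.key = key
        cls._registry[key] = obj
        return obj


a = Foo(0)
b = Foo(0)  # same key: the registry hands back the existing object
c = Foo(1)  # different key: a new object is created

Foo(2)        # created, but no reference is kept
gc.collect()  # make sure the unreferenced object has been collected
still_registered = "Foo-2" in Foo._registry
```

Because the registry holds only weak references, it never keeps an object alive on its own; an entry disappears once the last strong reference to the object is gone.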

Deduplication on disk

Deduplication when writing a GufeTokenizable to disk can be handled by to_keyed_chain(), which serializes and de-duplicates an entire chain of tokenizable objects to disk in a single file.

gufe provides no tooling for deduplicating across chains stored to disk; it is up to the storage system to implement its own registry for handling this. However, you can take advantage of gufe functionality to facilitate system-specific deduplication.

You can use to_keyed_dict() to represent a GufeTokenizable object as a dict that references inner GufeTokenizables by their GufeKeys, and repeat this recursively to store (with the disk storage method of your choosing) all objects by reference. Then, to get a list of all objects, use get_all_gufe_objs() on the outermost objects.
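
A storage-side registry of this kind might look like the following sketch. Everything here is hypothetical: the Toy class stands in for a GufeTokenizable with a precomputed key, the backend is a plain dict standing in for a database or a directory of files, and only directly nested objects (not lists of them) are handled.

```python
class Toy:
    """Hypothetical stand-in for a GufeTokenizable with a precomputed key."""

    def __init__(self, key, **fields):
        self.key = key
        self.fields = fields


def keyed_dict(obj):
    """One level only: nested Toy objects become key references."""
    return {
        name: {":gufe-key:": value.key} if isinstance(value, Toy) else value
        for name, value in obj.fields.items()
    }


def store(obj, backend):
    """Write obj and everything it contains into backend, once per key."""
    if obj.key in backend:
        return  # already stored: nothing to do
    for value in obj.fields.values():
        if isinstance(value, Toy):
            store(value, backend)
    backend[obj.key] = keyed_dict(obj)


solvent = Toy("Solvent-1", smiles="O")
sys_a = Toy("System-a", ligand=Toy("Ligand-a", name="benzene"), solvent=solvent)
sys_b = Toy("System-b", ligand=Toy("Ligand-b", name="toluene"), solvent=solvent)
edge = Toy("Edge-ab", stateA=sys_a, stateB=sys_b)

backend = {}  # stands in for a database or a directory of files
store(edge, backend)
```

The shared solvent is written once even though two systems reference it, and a later call to store() for another object that reuses any of these keys would skip them.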

If you don’t need the level of granularity that the keyed dict representation offers, the keyed chain does this recursive unpacking and handles the correct serialization of all nested objects. Note that functions like obj.to_json() and obj.to_msgpack() use to_keyed_chain() under the hood to make serialization to disk and over the network possible. To load the object back into memory, use the matching from_json() or from_msgpack() method, which handles deserialization back into the correct structure.

3. Serializable Representations of GufeTokenizables

GufeTokenizables are also designed to be easily serializable, allowing them to be reliably passed between processes on the same or different machines, written to disk, stored in databases, etc. There are multiple serialization methods available, and a variety of representations GufeTokenizables can take on, to meet different use cases.

Representations

Each subclass’s implementation of to_dict() defines what information a GufeTokenizable will serialize, and the behavior of all other representations (to_shallow_dict(), to_keyed_dict(), to_keyed_chain()) is determined by this basic to_dict() definition.

a) dictionary

The to_dict() method is the most explicit way to represent a GufeTokenizable. This method recursively unpacks any inner GufeTokenizables that an outer GufeTokenizable contains to their full dict representation. Although this method is the best way to see all information stored in a GufeTokenizable, it is also the least space-efficient.

For example, we can easily comprehend the to_dict() representation of benzene as shown above, but for a larger and deeply nested object, such as an AlchemicalNetwork, the to_dict() representation is neither easily readable by humans nor memory-efficient. GufeTokenizables referenced multiple times among the nested objects are duplicated in this representation.

b) shallow dictionary

The to_shallow_dict() method is similar to to_dict() in that it unpacks a tokenizable into a dict format, but a shallow dict is not recursive and only unpacks the top level of the GufeTokenizable. Any nested GufeTokenizables are left as-is.

# shallow dict representation of an alchemical network
>>> alchemical_network.to_shallow_dict()
{
'nodes': [
    ChemicalSystem(name=benzene-solvent, components={'ligand': SmallMoleculeComponent(name=benzene), 'solvent': SolventComponent(name=O, K+, Cl-)}),
    ChemicalSystem(name=toluene-solvent, components={'ligand': SmallMoleculeComponent(name=toluene), 'solvent': SolventComponent(name=O, K+, Cl-)}),
    ChemicalSystem(name=styrene-solvent, components={'ligand': SmallMoleculeComponent(name=styrene), 'solvent': SolventComponent(name=O, K+, Cl-)}),
    ChemicalSystem(name=phenol-solvent, components={'ligand': SmallMoleculeComponent(name=phenol), 'solvent': SolventComponent(name=O, K+, Cl-)})
    ],
'edges': [
    Transformation(stateA=ChemicalSystem(name=benzene-solvent, components={'ligand': SmallMoleculeComponent(name=benzene), 'solvent': SolventComponent(name=O, K+, Cl-)}), stateB=ChemicalSystem(name=toluene-solvent, components={'ligand': SmallMoleculeComponent(name=toluene), 'solvent': SolventComponent(name=O, K+, Cl-)}), protocol=<Protocol-489fb1395a32c5183bcc1d43fa521960>, name=None),
    Transformation(stateA=ChemicalSystem(name=benzene-solvent, components={'ligand': SmallMoleculeComponent(name=benzene), 'solvent': SolventComponent(name=O, K+, Cl-)}), stateB=ChemicalSystem(name=styrene-solvent, components={'ligand': SmallMoleculeComponent(name=styrene), 'solvent': SolventComponent(name=O, K+, Cl-)}), protocol=<Protocol-489fb1395a32c5183bcc1d43fa521960>, name=None),
    Transformation(stateA=ChemicalSystem(name=benzene-solvent, components={'ligand': SmallMoleculeComponent(name=benzene), 'solvent': SolventComponent(name=O, K+, Cl-)}), stateB=ChemicalSystem(name=phenol-solvent, components={'ligand': SmallMoleculeComponent(name=phenol), 'solvent': SolventComponent(name=O, K+, Cl-)}), protocol=<Protocol-489fb1395a32c5183bcc1d43fa521960>, name=None)
    ],
'name': None,
'__qualname__': 'AlchemicalNetwork',
'__module__': 'gufe.network',
':version:': 1
}

This representation is most useful for iterating through the hierarchy of a GufeTokenizable one layer at a time. Because it leaves nested GufeTokenizables untouched, it is generally unsuitable for serialization.
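
The difference between the full and the shallow dict, and why the latter does not serialize directly, can be sketched as follows. The Toy class and the two helper functions are hypothetical stand-ins, not gufe's implementation.

```python
import json


class Toy:
    """Hypothetical stand-in for a GufeTokenizable."""

    def __init__(self, **fields):
        self.fields = fields


def full_dict(obj):
    """Recursively unpack nested Toy objects into plain dicts."""
    return {
        k: full_dict(v) if isinstance(v, Toy) else v
        for k, v in obj.fields.items()
    }


def shallow_dict(obj):
    """Unpack the top level only; nested Toy objects are left as-is."""
    return dict(obj.fields)


solvent = Toy(smiles="O")
system = Toy(name="benzene-solvent", solvent=solvent)

deep = full_dict(system)
shallow = shallow_dict(system)

# the full dict contains only plain data, so it can be dumped to JSON
deep_json = json.dumps(deep)

# the shallow dict still holds a live object, which json cannot encode
try:
    json.dumps(shallow)
    shallow_is_json_ready = True
except TypeError:
    shallow_is_json_ready = False
```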

c) keyed dictionary

The to_keyed_dict() method is similar to to_shallow_dict() in that it only unpacks the first layer of a GufeTokenizable. However, a keyed dict represents each object in the next layer by its GufeKey, e.g. {':gufe-key:': 'ChemicalSystem-96f686efdc070e01b74888cbb830f720'}.

A keyed dict is the most compact representation of a GufeTokenizable and can be useful for understanding its contents, but it does not contain enough information to reconstruct the object or to send it elsewhere (for this, see the next section, keyed chain).

# keyed dict representation of an alchemical network
>>> alchemical_network.to_keyed_dict()
{
'nodes': [
    {':gufe-key:': 'ChemicalSystem-3c648332ff8dccc03a1e1a3d44bc9755'},
    {':gufe-key:': 'ChemicalSystem-655f4d0008a537fe811b11a2dc4a029e'},
    {':gufe-key:': 'ChemicalSystem-6a13159b10c95cb05f542de64ec91fe7'},
    {':gufe-key:': 'ChemicalSystem-ba83a53f18700b3738680da051ff35f3'}
    ],
'edges': [
    {':gufe-key:': 'Transformation-4d0f802817071c8d14b37efd35187318'},
    {':gufe-key:': 'Transformation-7e7433a86239a41490da52222bf6f78f'},
    {':gufe-key:': 'Transformation-e8d1ccf53116e210d1ccbc3870007271'}
    ],
'name': None,
'__qualname__': 'AlchemicalNetwork',
'__module__': 'gufe.network',
':version:': 1
}

d) keyed chain

The to_keyed_chain() method is a powerful representation of a GufeTokenizable that enables efficient reconstruction of an object without duplication. It uses to_keyed_dict() to unpack a GufeTokenizable from the bottom (innermost) layer up into a flat list of tuples, in the form [(gufe_key, keyed_dict)]. The length of this list is equal to the number of unique GufeTokenizables required to represent the object. This bottom-up deduplication strategy effectively constructs a DAG (directed acyclic graph) where re-used GufeTokenizables are deduplicated.
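
The bottom-up construction can be sketched as a post-order traversal that emits each object only once. This is an illustration under stated assumptions, not gufe's code: the Toy class, its precomputed keys, and the keyed_chain function are hypothetical, and only directly nested objects are followed.

```python
class Toy:
    """Hypothetical stand-in for a GufeTokenizable with a precomputed key."""

    def __init__(self, key, **fields):
        self.key = key
        self.fields = fields


def keyed_chain(obj, chain=None, seen=None):
    """Return [(key, keyed_dict), ...] with children before their parents."""
    chain = [] if chain is None else chain
    seen = set() if seen is None else seen
    if obj.key in seen:
        return chain  # already emitted: deduplicated
    seen.add(obj.key)
    keyed = {}
    for name, value in obj.fields.items():
        if isinstance(value, Toy):
            keyed_chain(value, chain, seen)  # emit children first
            keyed[name] = {":gufe-key:": value.key}
        else:
            keyed[name] = value
    chain.append((obj.key, keyed))  # the parent comes after its children
    return chain


solvent = Toy("Solvent-1", smiles="O")
sys_a = Toy("System-a", ligand=Toy("Ligand-a", name="benzene"), solvent=solvent)
sys_b = Toy("System-b", ligand=Toy("Ligand-b", name="toluene"), solvent=solvent)
edge = Toy("Edge-ab", stateA=sys_a, stateB=sys_b)

chain = keyed_chain(edge)
order = [key for key, _ in chain]
```

In this sketch the chain has six entries, one per unique object, with the shared solvent appearing once and the outermost object last.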

To show the structure of a keyed chain, below we have redacted all information except the GufeKeys from the output:

# keyed chain representation ('...' indicates hidden output)
>>> alchemical_network.to_keyed_chain()
[
('SolventComponent-e0e47f56b43717156128ad4ae2d49897', {...}),
('SmallMoleculeComponent-3b51f5f92521c712049da092ab061930', {...}),
('SmallMoleculeComponent-ec3c7a92771f8872dab1a9fc4911c795', {...}),
('SmallMoleculeComponent-8225dfb11f2e8157a3fcdcd673d3d40e', {...}),
('Protocol-489fb1395a32c5183bcc1d43fa521960', {...}),
('ChemicalSystem-ba83a53f18700b3738680da051ff35f3', {
    'components': {
        'ligand': {':gufe-key:': 'SmallMoleculeComponent-3b51f5f92521c712049da092ab061930'},
        'solvent': {':gufe-key:': 'SolventComponent-e0e47f56b43717156128ad4ae2d49897'}
        },
    ...}),
('ChemicalSystem-3c648332ff8dccc03a1e1a3d44bc9755', {
    'components': {
        'ligand': {':gufe-key:': 'SmallMoleculeComponent-ec3c7a92771f8872dab1a9fc4911c795'},
        'solvent': {':gufe-key:': 'SolventComponent-e0e47f56b43717156128ad4ae2d49897'},
        },
    ...}),
('ChemicalSystem-655f4d0008a537fe811b11a2dc4a029e', {
    'components': {
        'ligand': {':gufe-key:': 'SmallMoleculeComponent-8225dfb11f2e8157a3fcdcd673d3d40e'},
        'solvent': {':gufe-key:': 'SolventComponent-e0e47f56b43717156128ad4ae2d49897'}
        },
    ...}),
('Transformation-e8d1ccf53116e210d1ccbc3870007271', {
    'stateA': {':gufe-key:': 'ChemicalSystem-3c648332ff8dccc03a1e1a3d44bc9755'},
    'stateB': {':gufe-key:': 'ChemicalSystem-ba83a53f18700b3738680da051ff35f3'},
    'protocol': {':gufe-key:': 'Protocol-489fb1395a32c5183bcc1d43fa521960'},
    ...}),
('Transformation-4d0f802817071c8d14b37efd35187318', {
    'stateA': {':gufe-key:': 'ChemicalSystem-3c648332ff8dccc03a1e1a3d44bc9755'},
    'stateB': {':gufe-key:': 'ChemicalSystem-655f4d0008a537fe811b11a2dc4a029e'},
    'protocol': {':gufe-key:': 'Protocol-489fb1395a32c5183bcc1d43fa521960'},
    ...}),
('AlchemicalNetwork-f8bfd63bc848672aa52b081b4d68fadf', {
    'nodes': [
        {':gufe-key:': 'ChemicalSystem-3c648332ff8dccc03a1e1a3d44bc9755'},
        {':gufe-key:': 'ChemicalSystem-655f4d0008a537fe811b11a2dc4a029e'},
        {':gufe-key:': 'ChemicalSystem-ba83a53f18700b3738680da051ff35f3'}
        ],
    'edges': [
        {':gufe-key:': 'Transformation-4d0f802817071c8d14b37efd35187318'},
        {':gufe-key:': 'Transformation-e8d1ccf53116e210d1ccbc3870007271'},
        ],
    ...}),
]

For keyed chains, the order of the elements in this list matters! When deserializing the keyed chain back into a GufeTokenizable, this list is iterated through in order, meaning that each object can only reference GufeKeys that come before it in this list.
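
Why the order matters can be seen in a small decoding sketch. It is a simplification and not gufe's decoder: the resolve and from_chain functions are hypothetical, objects are rebuilt as plain dicts rather than class instances, and the keys in the example chain are made up.

```python
def resolve(value, built):
    """Replace key references with objects that were already rebuilt."""
    if isinstance(value, dict) and set(value) == {":gufe-key:"}:
        return built[value[":gufe-key:"]]  # KeyError if referenced too early
    if isinstance(value, dict):
        return {k: resolve(v, built) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve(v, built) for v in value]
    return value


def from_chain(chain):
    """Rebuild entries in order; the last entry is the outermost object."""
    built = {}
    for key, keyed in chain:
        built[key] = resolve(keyed, built)
    return built[chain[-1][0]]


chain = [
    ("Solvent-1", {"smiles": "O"}),
    ("Ligand-a", {"name": "benzene"}),
    ("System-a", {"components": {"ligand": {":gufe-key:": "Ligand-a"},
                                 "solvent": {":gufe-key:": "Solvent-1"}}}),
    ("System-b", {"components": {"solvent": {":gufe-key:": "Solvent-1"}}}),
    ("Network-1", {"nodes": [{":gufe-key:": "System-a"},
                             {":gufe-key:": "System-b"}]}),
]
network = from_chain(chain)

# reversing the chain puts references before their targets, so decoding fails
try:
    from_chain(list(reversed(chain)))
    reversed_ok = True
except KeyError:
    reversed_ok = False
```

Both systems in the rebuilt network point at the very same solvent object, which is the deduplication carried over from the chain.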

Below is a diagram of how a nested GufeTokenizable (in this case an AlchemicalNetwork) can be represented as a keyed chain, with the first elements in the keyed chain at the bottom of the graph. Note that this graphical representation is a Directed Acyclic Graph (DAG):

Diagram of a keyed chain representation of an alchemical network.

Serialization Methods

All GufeTokenizables can be serialized as either JSON (to_json()) or MessagePack (to_msgpack()). JSON is preferable for human-readability, archival, and interoperability with other tools that do not use gufe. MessagePack is a more efficient format and ideal for passing information between processes, but it is not human-readable and requires gufe for extracting any data.

See Serialization constraints and methods in gufe for the data types supported by the serialization methods in gufe.

Note

See Serializing gufe objects for details on how to implement serialization of your own GufeTokenizables.