Lazy File Loading

Lazy loading was introduced in ZnTrack versions after 0.3.5. It is essential for graphs with many dependencies and large files, because data is only loaded from disk when it is accessed. This tutorial shows the benefits of lazy loading as well as the difficulties that come with it.

By default, config.lazy == True, which enables lazy file loading globally. See the Notes section for cases where this can cause problems. You can disable it by setting zntrack.config.lazy = False.
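The interplay between a global default and a per-call lazy argument can be sketched in plain Python. This is only an illustration of the pattern, not ZnTrack's implementation: the Config class and the resolve_lazy helper are hypothetical names, and the rule that an explicit argument overrides the global setting is an assumption based on the calls used later in this tutorial.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Config:
    """Hypothetical stand-in for a global settings object."""

    lazy: bool = True


config = Config()


def resolve_lazy(lazy: Optional[bool] = None) -> bool:
    """Return the per-call value if one is given, else the global default."""
    return config.lazy if lazy is None else lazy


print(resolve_lazy())           # falls back to the global default
config.lazy = False             # change the default globally
print(resolve_lazy())           # the new default is picked up
print(resolve_lazy(lazy=True))  # an explicit argument takes precedence
```

Using None as the "not specified" value keeps an explicit lazy=False distinguishable from "use whatever the global setting says".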

[1]:
from zntrack import config

# When using ZnTrack we can write our code inside a Jupyter notebook.
# We can make use of this functionality by setting the `nb_name` config as follows:
config.nb_name = "09_lazy.ipynb"
config.lazy = False

Let’s start by creating some example Nodes.

[4]:
import zntrack
import random

We will now create a PrintOption that is identical to zn.outs but prints a message every time the data is read from files.

[5]:
from typing import Any

from zntrack.fields.zn.options import Output


class PrintOption(Output):
    def __init__(self):
        # Without dvc_option, zntrack would try `dvc --PrintOption outs.json`;
        # we must tell it to use `dvc --outs outs.json` instead.
        super().__init__(dvc_option="outs", use_repr=False)

    def _get_value_from_file(self, instance) -> Any:
        # Announce every read from disk so that lazy loading becomes visible.
        print(f"Loading data from files for {instance.name}")
        return super()._get_value_from_file(instance)
[6]:
class RandomNumber(zntrack.Node):
    start = zntrack.zn.params()
    stop = zntrack.zn.params()
    number = PrintOption()  # = zn.outs() + print

    def run(self):
        self.number = random.randrange(self.start, self.stop)
/tmp/ipykernel_3870159/2524714966.py:2: DeprecationWarning: Use 'zntrack.params' instead.
  start = zntrack.zn.params()
/tmp/ipykernel_3870159/2524714966.py:3: DeprecationWarning: Use 'zntrack.params' instead.
  stop = zntrack.zn.params()

In this first example we will not use lazy loading.

[7]:
with zntrack.Project() as project:
    random_number = RandomNumber(start=1, stop=1000)
project.run()
Running DVC command: 'stage add --name RandomNumber --force ...'
Jupyter support is an experimental feature! Please save your notebook before running this command!
Submit issues to https://github.com/zincware/ZnTrack.
Running DVC command: 'repro'
[8]:
random_number.load(lazy=False)

As we can see, the value of RandomNumber is already loaded into memory.

[9]:
random_number.number
[9]:
598

Now let us do the same thing with lazy=True.

[10]:
lazy_random_number = RandomNumber.from_rev(lazy=True)
print(lazy_random_number.__dict__["number"])
<class 'zntrack.utils.LazyOption'>

We can see that the random number is not yet available: the attribute still holds a LazyOption placeholder. As soon as we access the attribute, the value is loaded for us (and stored in memory).

[11]:
lazy_random_number.number
[11]:
598
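The placeholder behaviour shown above can be imitated with a small data descriptor. The following is a minimal sketch, assuming a JSON file as storage; LazyPlaceholder, LazyField and Result are hypothetical names chosen for this illustration and are not part of ZnTrack.

```python
import json
import os
import tempfile


class LazyPlaceholder:
    """Hypothetical marker meaning 'value not loaded yet'."""


class LazyField:
    """Data descriptor that swaps the placeholder for the real value on access."""

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        value = instance.__dict__.get(self.name, LazyPlaceholder)
        if value is LazyPlaceholder:
            # First access: read from disk and keep the result in memory.
            print(f"Loading {self.name} from {instance.path}")
            with open(instance.path) as file:
                value = json.load(file)[self.name]
            instance.__dict__[self.name] = value
        return value

    def __set__(self, instance, value):
        # Defining __set__ makes this a data descriptor, so __get__ is
        # consulted even though the instance __dict__ holds an entry.
        instance.__dict__[self.name] = value


class Result:
    number = LazyField()

    def __init__(self, path):
        self.path = path
        self.number = LazyPlaceholder  # nothing is read at this point


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "outs.json")
    with open(path, "w") as file:
        json.dump({"number": 598}, file)

    result = Result(path)
    before = result.__dict__["number"]  # still the placeholder class
    first = result.number               # triggers the read
    after = result.__dict__["number"]   # now the loaded value
    second = result.number              # served from memory, no second read

print(before is LazyPlaceholder, first, after, second)
```

Storing the loaded value back into the instance `__dict__` is what makes the second access cheap: the file is only read while the placeholder is still in place.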

Let’s build some dependencies to show where lazy loading is especially useful.

[12]:
class AddOne(zntrack.Node):
    deps = zntrack.zn.deps()
    number = PrintOption()

    def run(self):
        self.number = self.deps.number + 1
/tmp/ipykernel_3870159/790841409.py:2: DeprecationWarning: Use 'zntrack.deps' instead.
  deps = zntrack.zn.deps()
[13]:
with zntrack.Project() as project:
    random_number = RandomNumber(start=1, stop=100)

    add_one = AddOne(deps=random_number, name="AddOne_0")
    for index in range(10):
        add_one = AddOne(deps=add_one, name=f"AddOne_{index+1}")

project.run()
Running DVC command: 'stage add --name RandomNumber --force ...'
Running DVC command: 'stage add --name AddOne_0 --force ...'
Running DVC command: 'stage add --name AddOne_1 --force ...'
Running DVC command: 'stage add --name AddOne_2 --force ...'
Running DVC command: 'stage add --name AddOne_3 --force ...'
Running DVC command: 'stage add --name AddOne_4 --force ...'
Running DVC command: 'stage add --name AddOne_5 --force ...'
Running DVC command: 'stage add --name AddOne_6 --force ...'
Running DVC command: 'stage add --name AddOne_7 --force ...'
Running DVC command: 'stage add --name AddOne_8 --force ...'
Running DVC command: 'stage add --name AddOne_9 --force ...'
Running DVC command: 'stage add --name AddOne_10 --force ...'
Running DVC command: 'repro'
[14]:
!dvc dag
+--------------+
| RandomNumber |
+--------------+
        *
        *
        *
  +----------+
  | AddOne_0 |
  +----------+
        *
        *
        *
  +----------+
  | AddOne_1 |
  +----------+
        *
        *
        *
  +----------+
  | AddOne_2 |
  +----------+
        *
        *
        *
  +----------+
  | AddOne_3 |
  +----------+
        *
        *
        *
  +----------+
  | AddOne_4 |
  +----------+
        *
        *
        *
  +----------+
  | AddOne_5 |
  +----------+
        *
        *
        *
  +----------+
  | AddOne_6 |
  +----------+
        *
        *
        *
  +----------+
  | AddOne_7 |
  +----------+
        *
        *
        *
  +----------+
  | AddOne_8 |
  +----------+
        *
        *
        *
  +----------+
  | AddOne_9 |
  +----------+
        *
        *
        *
  +-----------+
  | AddOne_10 |
  +-----------+

If we now load the latest AddOne, we will see that everything is loaded into memory, although we might only be interested in the most recent number.

[15]:
add_one = AddOne.from_rev(name="AddOne_10", lazy=False)

It is rather unlikely that we need all of this data in memory, so we can use lazy=True to avoid loading it.

[16]:
add_one_lazy = AddOne.from_rev(name="AddOne_10", lazy=True)

We can check with an arbitrary depth of dependencies that both instances yield the same value.

[17]:
add_one_lazy.deps.deps.deps.deps.deps.deps.deps.number
[17]:
71
[18]:
add_one.deps.deps.deps.deps.deps.deps.deps.number
[18]:
71
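To make the difference in work more tangible, here is a small plain-Python sketch that counts how many outputs are read for a chain like the one above. It does not use ZnTrack: GRAPH, OUTPUTS, EagerNode and LazyNode are hypothetical stand-ins, and the stored numbers are made up for the illustration.

```python
# Cheap metadata: node name -> name of the upstream node.
GRAPH = {"RandomNumber": None}
# Stand-in for the (potentially large) output files.
OUTPUTS = {"RandomNumber": 60}

previous = "RandomNumber"
for index in range(11):
    name = f"AddOne_{index}"
    GRAPH[name] = previous
    OUTPUTS[name] = OUTPUTS[previous] + 1
    previous = name

reads = []  # names of all nodes whose output was actually read


def read_output(name):
    reads.append(name)
    return OUTPUTS[name]


class EagerNode:
    """Reads its own output and those of all upstream nodes immediately."""

    def __init__(self, name):
        self.number = read_output(name)
        upstream = GRAPH[name]
        self.deps = EagerNode(upstream) if upstream else None


class LazyNode:
    """Reads its output only when `number` is accessed."""

    def __init__(self, name):
        self.name = name
        self._number = None

    @property
    def number(self):
        if self._number is None:
            self._number = read_output(self.name)
        return self._number

    @property
    def deps(self):
        upstream = GRAPH[self.name]
        return LazyNode(upstream) if upstream else None


eager = EagerNode("AddOne_10")
eager_reads = len(reads)

reads.clear()
lazy = LazyNode("AddOne_10")
lazy_value = lazy.deps.deps.deps.number
lazy_reads = list(reads)

print(eager_reads, lazy_reads, lazy_value)
```

In this sketch the eager variant reads every output in the chain up front, while the lazy variant only reads the single output that is requested, and both report the same value for it.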

Notes

When using ZnTrack to compare data from different versions, it is important either not to use lazy=True or to load the data manually before loading another version. In the following example we run dvc repro for three different experiments, load the result with and without lazy=True after each run, and compare the results.

[19]:
with zntrack.Project() as project:
    node = RandomNumber(start=0, stop=5000)
project.run()

random_number_lazy_1 = RandomNumber.from_rev(lazy=True)
random_number_1 = RandomNumber.from_rev(lazy=False)


node.stop = 5001
project.run()

random_number_lazy_2 = RandomNumber.from_rev(lazy=True)
random_number_2 = RandomNumber.from_rev(lazy=False)

node.stop = 5002
project.run()

random_number_lazy_3 = RandomNumber.from_rev(lazy=True)
random_number_3 = RandomNumber.from_rev(lazy=False)
Running DVC command: 'stage add --name RandomNumber --force ...'
Running DVC command: 'repro'
Running DVC command: 'stage add --name RandomNumber --force ...'
Running DVC command: 'repro'
Running DVC command: 'stage add --name RandomNumber --force ...'
Running DVC command: 'repro'
[20]:
# With lazy=True we get the same number for every run, which is not what we expect:
# all three instances read the file only now, after the last run has overwritten it.
print(
    f"{random_number_lazy_1.number} == {random_number_lazy_2.number} =="
    f" {random_number_lazy_3.number}"
)
assert random_number_lazy_1.number == random_number_lazy_2.number
assert random_number_lazy_1.number == random_number_lazy_3.number
3008 == 3008 == 3008
[21]:
# With lazy=False we get the results we expect.
# (Except for some random scenarios, where two random numbers are the same.)
print(f"{random_number_1.number} != {random_number_2.number} != {random_number_3.number}")
assert random_number_1.number != random_number_2.number
assert random_number_1.number != random_number_3.number
1647 != 849 != 3008

You can “lock” a value into place (loading it into memory) by accessing it, e.g. through _ = random_number_lazy_1.number. This way you can pin only the values you want to compare and still keep the benefit of lazy=True for everything else.
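The pitfall and the "locking" workaround can be reproduced without ZnTrack. The sketch below assumes a single JSON file that every run overwrites; LazyResult and write_result are hypothetical helpers, and the two numbers are placeholders.

```python
import json
import os
import tempfile


class LazyResult:
    """Hypothetical lazy handle: reads the file on first access only."""

    def __init__(self, path):
        self.path = path
        self._number = None

    @property
    def number(self):
        if self._number is None:
            with open(self.path) as file:
                self._number = json.load(file)["number"]
        return self._number


def write_result(path, number):
    """Stand-in for a run that overwrites the output file."""
    with open(path, "w") as file:
        json.dump({"number": number}, file)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "outs.json")

    write_result(path, 1647)      # first run
    unlocked = LazyResult(path)   # nothing has been read yet
    locked = LazyResult(path)
    _ = locked.number             # access now to pin the first value

    write_result(path, 3008)      # second run overwrites the file

    locked_value = locked.number      # value from the first run
    unlocked_value = unlocked.number  # reads only now: value from the second run

print(locked_value, unlocked_value)
```

The handle that was accessed before the second run keeps its value, while the untouched handle silently reports the result of the later run.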

[22]:
# temp_dir is assumed to have been created in a setup cell that is not shown on this page.
temp_dir.cleanup()