Wednesday, November 15, 2023

Python: ruamel.yaml lib has a problem handling comments

In our project, we use the ruamel.yaml library to read and write YAML files. We chose it over the basic yaml library for Python because it follows the YAML standard more closely, preserves comments and formatting, and always dumps keys in the same order.

However, since version 0.18.3, we have seen strange behavior when dumping files: some newlines were removed from some files. I opened ticket #492 with the following code, which reproduces the problem:

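import ruamel.yaml

y = ruamel.yaml.YAML()

# Load the same file twice: once as the updated model, once as the save target
with open("organizational_units.yaml", "r") as file:
    ou = y.load(file)

with open("organizational_units.yaml", "r") as file:
    content = y.load(file)

# Replace the top-level key in the save target with the one from the model
content["organizational_units"] = ou["organizational_units"]

# First save: the output is correct
with open("test.yaml", "w") as file:
    y.dump(content, file)
# Second save: this is where the newlines go missing
with open("test.yaml", "w") as file:
    y.dump(content, file)

# Try to read the result back
with open("test.yaml", "r") as file:
    y.load(file)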

This is of course an oversimplified version of what we do in our project. We normally load several YAML files and combine them into one big model. Then, when we need to save changes to one file, we first reload that file into memory to retrieve the original comments at the beginning of the file, before replacing the old content with the new one.

You can also see that we save the file twice. In the real project, the first save goes to an in-memory string stream so that we can log the content (at least in debug mode), and only then do we save to the file. Again, the code here is a simplification just to show the problem.
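For reference, here is a rough sketch of that two-step save. The function name dump_with_debug_log and the logger setup are made up for this post and are not our actual project code. The yaml argument is assumed to be a ruamel.yaml.YAML instance, or anything else with a compatible dump(data, stream) method.

```python
import io
import logging

logger = logging.getLogger(__name__)


def dump_with_debug_log(yaml, data, path):
    """Dump data to an in-memory stream for logging, then to the file at path.

    yaml is assumed to be a ruamel.yaml.YAML instance. The function name and
    the logging setup are illustrative only.
    """
    # First dump: render into a string buffer so the text can be logged.
    buffer = io.StringIO()
    yaml.dump(data, buffer)
    text = buffer.getvalue()
    logger.debug("About to write %s:\n%s", path, text)

    # Second dump: write the same data to the real file.
    with open(path, "w") as file:
        yaml.dump(data, file)
    return text
```

One possible variation would be to write the returned text to the file instead of calling dump a second time, so that the data is only dumped once.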

The problem occurs on the second save. The first works fine. Using this file as an example:

# Organizational Unit Specification

organizational_units:

- Name: root
  Accounts:
  - FirstAccount
  - SecondAccount

After the second save, we have this result:

# Organizational Unit Specification

organizational_units: -
  Name: root
  Accounts:
  - FirstAccount
  - SecondAccount

Notice the missing newlines?

The last lines of the code load the resulting file, just to show that we cannot read it back.

After opening the ticket, I got an answer (on the same day, nice reactivity!) that it is in fact a duplicate of ticket #410. Ticket #410 is a bit different, because it duplicates the complete structure, while we only replace a part of it. That may be why our code kept working until 0.18.3. I think what broke it is this fix: "fix issue with spurious newline on first item after comment + nested block sequence".

As the developer explains, the issue comes from the way the library stores comments internally. It seems that comments are stored in several places that share the same reference, and when they are dumped, some internal bookkeeping avoids writing them more than once. When we replaced the reference to the top key, we broke some of those comment references.

As a workaround, I restore the comments reference when replacing the top key:

comments = content["organizational_units"].ca.comment
content["organizational_units"] = ou["organizational_units"]
content["organizational_units"].ca.comment = comments

Worked for me...
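If you need to do this in more than one place, the three lines of the workaround could be wrapped in a small helper. This is only a sketch: the name replace_keeping_comments is hypothetical, and it assumes both the old and the new value expose a .ca.comment attribute, as the loaded objects do in the snippet above.

```python
def replace_keeping_comments(target, key, new_value):
    """Replace target[key] with new_value, carrying over the old comments.

    Assumes the old and new values both expose a .ca.comment attribute, as in
    the workaround above. The helper name is hypothetical.
    """
    saved = target[key].ca.comment   # comments attached to the value being replaced
    target[key] = new_value
    target[key].ca.comment = saved   # re-attach them to the new value
    return target
```

With the variables from the example, the call would be replace_keeping_comments(content, "organizational_units", ou["organizational_units"]).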
