Some Linux distributions (like Alpine and Arch Linux) ship something called “python bytecode” in their packages. It’s generated during the package build and stored in .pyc files inside __pycache__ folders, which can be seen here:

% tar tvvf /var/cache/pacman/pkg/python-wsproto-1.0.0-1-any.pkg.tar.zst
-rw-r--r-- root/root      5053 2020-12-09 16:24 .BUILDINFO
-rw-r--r-- root/root      2497 2020-12-09 16:24 .MTREE
-rw-r--r-- root/root       436 2020-12-09 16:24 .PKGINFO
drwxr-xr-x root/root         0 2020-12-09 16:24 usr/
drwxr-xr-x root/root         0 2020-12-09 16:24 usr/lib/
drwxr-xr-x root/root         0 2020-12-09 16:24 usr/lib/python3.9/
drwxr-xr-x root/root         0 2020-12-09 16:24 usr/lib/python3.9/site-packages/
drwxr-xr-x root/root         0 2020-12-09 16:24 usr/lib/python3.9/site-packages/wsproto/
drwxr-xr-x root/root         0 2020-12-09 16:24 usr/lib/python3.9/site-packages/wsproto-1.0.0-py3.9.egg-info/
-rw-r--r-- root/root      6997 2020-12-09 16:24 usr/lib/python3.9/site-packages/wsproto-1.0.0-py3.9.egg-info/PKG-INFO
-rw-r--r-- root/root      1105 2020-12-09 16:24 usr/lib/python3.9/site-packages/wsproto-1.0.0-py3.9.egg-info/SOURCES.txt
-rw-r--r-- root/root         1 2020-12-09 16:24 usr/lib/python3.9/site-packages/wsproto-1.0.0-py3.9.egg-info/dependency_links.txt
-rw-r--r-- root/root        53 2020-12-09 16:24 usr/lib/python3.9/site-packages/wsproto-1.0.0-py3.9.egg-info/requires.txt
-rw-r--r-- root/root         8 2020-12-09 16:24 usr/lib/python3.9/site-packages/wsproto-1.0.0-py3.9.egg-info/top_level.txt
-rw-r--r-- root/root      2852 2020-12-09 16:24 usr/lib/python3.9/site-packages/wsproto/__init__.py
drwxr-xr-x root/root         0 2020-12-09 16:24 usr/lib/python3.9/site-packages/wsproto/__pycache__/
-rw-r--r-- root/root      3180 2020-12-09 16:24 usr/lib/python3.9/site-packages/wsproto/__pycache__/__init__.cpython-39.opt-1.pyc
-rw-r--r-- root/root      3180 2020-12-09 16:24 usr/lib/python3.9/site-packages/wsproto/__pycache__/__init__.cpython-39.pyc
-rw-r--r-- root/root      5018 2020-12-09 16:24 usr/lib/python3.9/site-packages/wsproto/__pycache__/connection.cpython-39.opt-1.pyc
-rw-r--r-- root/root      5018 2020-12-09 16:24 usr/lib/python3.9/site-packages/wsproto/__pycache__/connection.cpython-39.pyc
-rw-r--r-- root/root      9411 2020-12-09 16:24 usr/lib/python3.9/site-packages/wsproto/__pycache__/events.cpython-39.opt-1.pyc
-rw-r--r-- root/root      9411 2020-12-09 16:24 usr/lib/python3.9/site-packages/wsproto/__pycache__/events.cpython-39.pyc
-rw-r--r-- root/root      8327 2020-12-09 16:24 usr/lib/python3.9/site-packages/wsproto/__pycache__/extensions.cpython-39.opt-1.pyc
-rw-r--r-- root/root      8327 2020-12-09 16:24 usr/lib/python3.9/site-packages/wsproto/__pycache__/extensions.cpython-39.pyc
-rw-r--r-- root/root     16162 2020-12-09 16:24 usr/lib/python3.9/site-packages/wsproto/__pycache__/frame_protocol.cpython-39.opt-1.pyc
-rw-r--r-- root/root     16162 2020-12-09 16:24 usr/lib/python3.9/site-packages/wsproto/__pycache__/frame_protocol.cpython-39.pyc
-rw-r--r-- root/root     11499 2020-12-09 16:24 usr/lib/python3.9/site-packages/wsproto/__pycache__/handshake.cpython-39.opt-1.pyc
-rw-r--r-- root/root     11499 2020-12-09 16:24 usr/lib/python3.9/site-packages/wsproto/__pycache__/handshake.cpython-39.pyc
-rw-r--r-- root/root       226 2020-12-09 16:24 usr/lib/python3.9/site-packages/wsproto/__pycache__/typing.cpython-39.opt-1.pyc
-rw-r--r-- root/root       226 2020-12-09 16:24 usr/lib/python3.9/site-packages/wsproto/__pycache__/typing.cpython-39.pyc
-rw-r--r-- root/root      2898 2020-12-09 16:24 usr/lib/python3.9/site-packages/wsproto/__pycache__/utilities.cpython-39.opt-1.pyc
-rw-r--r-- root/root      2898 2020-12-09 16:24 usr/lib/python3.9/site-packages/wsproto/__pycache__/utilities.cpython-39.pyc
-rw-r--r-- root/root      6762 2020-12-09 16:24 usr/lib/python3.9/site-packages/wsproto/connection.py
-rw-r--r-- root/root      7983 2020-12-09 16:24 usr/lib/python3.9/site-packages/wsproto/events.py
-rw-r--r-- root/root     11211 2020-12-09 16:24 usr/lib/python3.9/site-packages/wsproto/extensions.py
-rw-r--r-- root/root     23346 2020-12-09 16:24 usr/lib/python3.9/site-packages/wsproto/frame_protocol.py
-rw-r--r-- root/root     17991 2020-12-09 16:24 usr/lib/python3.9/site-packages/wsproto/handshake.py
-rw-r--r-- root/root         7 2020-12-09 16:24 usr/lib/python3.9/site-packages/wsproto/py.typed
-rw-r--r-- root/root        68 2020-12-09 16:24 usr/lib/python3.9/site-packages/wsproto/typing.py
-rw-r--r-- root/root      2742 2020-12-09 16:24 usr/lib/python3.9/site-packages/wsproto/utilities.py
drwxr-xr-x root/root         0 2020-12-09 16:24 usr/share/
drwxr-xr-x root/root         0 2020-12-09 16:24 usr/share/licenses/
drwxr-xr-x root/root         0 2020-12-09 16:24 usr/share/licenses/python-wsproto/
-rw-r--r-- root/root      1093 2020-12-09 16:24 usr/share/licenses/python-wsproto/LICENSE

These files contain the compiled python bytecode, but even though it’s “compiled” it isn’t architecture specific: it can be used on any cpu that has a compatible python installation. The python code is effectively distributed twice: once in a form that’s easy for humans to read, and once in a form that’s efficient for computers to load. With a package like this, the content of the python bytecode files is what’s actually going to run on our computer.
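To make this concrete, here’s a small sketch of what happens during a package build, using the standard py_compile and dis modules. The module name example.py and its content are made up for this demo:

```python
import dis
import pathlib
import py_compile
import tempfile

# A made-up module, standing in for the packaged source files above.
workdir = tempfile.mkdtemp()
source = "def add(a, b):\n    return a + b\n"
module = pathlib.Path(workdir, "example.py")
module.write_text(source)

# py_compile writes the .pyc under __pycache__, exactly where the
# packaged files above live.
pyc_path = py_compile.compile(str(module))
print(pyc_path)  # e.g. .../__pycache__/example.cpython-39.pyc

# The bytecode is not machine code: it targets python's virtual
# machine, which is why the same .pyc works on any cpu architecture.
dis.dis(compile(source, "example.py", "exec"))
```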

“But what if the code in the compiled files is different from the human-readable one?” - Excellent question, thanks! As part of the reproducible-builds effort there are multiple Arch Linux rebuilders (1, 2, 3) attempting to generate the same python bytecode in a similar(-ish) environment and report back if they get something different. Ideally they would all agree on what the “correctly compiled python bytecode” should look like and flag it if they disagree. This is usually referred to as rebuilding.

Python packages very often showed up as unreproducible because of differences in the bytecode .pyc files. A diff of the raw hex dump looks like this (keep the smol z in mind, we’re going to see them again later!):

│ ├── usr/lib/python3.9/site-packages/watchgod/__pycache__/watcher.cpython-39.opt-1.pyc
│ │ @@ -187,17 +187,17 @@
│ │  00000ba0: 0000 1800 0000 7312 0000 0008 0124 0610  ......s......$..
│ │  00000bb0: 0310 0320 0714 0102 fe0c 0b20 0b72 0c00  ... ....... .r..
│ │  00000bc0: 0000 6300 0000 0000 0000 0000 0000 0000  ..c.............
│ │  00000bd0: 0000 0003 0000 0040 0000 0073 2400 0000  .......@...s$...
│ │  00000be0: 6500 5a01 6400 5a02 6800 6401 a301 5a03  e.Z.d.Z.h.d...Z.
│ │  00000bf0: 6402 6504 6403 9c02 6404 6405 8404 5a05  d.e.d...d.d...Z.
│ │  00000c00: 6406 5300 2907 720d 0000 003e 0500 0000  d.S.).r....>....
│ │ -00000c10: da0b 5f5f 7079 6361 6368 655f 5f7a 0d73  ..__pycache__z.s
│ │ -00000c20: 6974 652d 7061 636b 6167 6573 7a05 2e69  ite-packagesz..i
│ │ -00000c30: 6465 617a 042e 6769 745a 0c6e 6f64 655f  deaz..gitZ.node_
│ │ +00000c10: 7a0d 7369 7465 2d70 6163 6b61 6765 737a  z.site-packagesz
│ │ +00000c20: 052e 6964 6561 7a04 2e67 6974 da0b 5f5f  ..ideaz..git..__
│ │ +00000c30: 7079 6361 6368 655f 5f5a 0c6e 6f64 655f  pycache__Z.node_
│ │  00000c40: 6d6f 6475 6c65 7372 2500 0000 7226 0000  modulesr%...r&..
│ │  00000c50: 0063 0200 0000 0000 0000 0000 0000 0200  .c..............
│ │  00000c60: 0000 0200 0000 4300 0000 730c 0000 007c  ......C...s....|
│ │  00000c70: 016a 007c 006a 0176 0153 0072 1f00 0000  .j.|.j.v.S.r....
│ │  00000c80: 2902 da04 6e61 6d65 da0c 6967 6e6f 7265  )...name..ignore
│ │  00000c90: 645f 6469 7273 7229 0000 0072 1a00 0000  d_dirsr)...r....
│ │  00000ca0: 721a 0000 0072 1b00 0000 722a 0000 0056  r....r....r*...V

Most of the file matches, but there are small clusters of strings that appear in a different order. I’m not sure who originally figured it out, but setting PYTHONHASHSEED=0 in the build makes the order deterministic.

Determinism is important for rebuilding, so we get one canonical build output. If multiple rebuilders confirm they’ve successfully reproduced the package (meaning, the package they built is bit-for-bit identical) that’s a good indicator the compiled python bytecode corresponds to the human-readable python code (using the instructions in the PKGBUILD). Even if you suspect they could all be secretly colluding, you’re very welcome to attempt to reproduce the package yourself and let us know in #archlinux-reproducible if you get something different!

PYTHONHASHSEED tweaks an internal part of a mechanism that’s supposed to defend against something called a “hash-table collision denial of service”. Sounds dangerous, let’s have a look!

Hash tables - The fast and the expensive insert

If you do python you most likely know hash tables from the dict type:

% python
Python 3.9.6 (default, Jun 30 2021, 10:22:16)
[GCC 11.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> # create a new dict
>>> foo = dict()
>>>
>>> # insert some values
>>> foo['abc'] = 123
>>> foo['def'] = 456
>>>
>>> # access the dict
>>> foo
{'abc': 123, 'def': 456}
>>> foo['abc']
123
>>> foo['def']
456
>>> foo['xyz']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'xyz'
>>>

It’s a neat data structure for key-value lookups! Every entry in it is identified by a key (in the example above that’s 'abc', 'def' and the non-existent key 'xyz'). Referring to an entry by key is significantly faster than having to search through a list, and lookups stay fast even with a high number of entries. The “hash” in “hash table” comes from its use of “one-way hash functions”. Famous hash functions include sha256 and blake2b, but for performance reasons hash tables usually use much cheaper hash functions. For strings, python uses the keyed SipHash function by default (PEP 456).
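To make the mechanics concrete, here’s a minimal toy hash table using separate chaining. This is not how CPython’s dict is actually implemented (CPython uses open addressing), but it shows the role the hash function plays in picking a bucket:

```python
# A toy hash table -- a sketch for illustration, not CPython's design.
class ToyHashTable:
    def __init__(self, nbuckets=8):
        self.buckets = [[] for _ in range(nbuckets)]

    def _bucket(self, key):
        # hash() maps the key to an integer; modulo picks a bucket.
        return self.buckets[hash(key) % len(self.buckets)]

    def set(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite existing entry
                return
        bucket.append((key, value))

    def get(self, key):
        # Only the (usually tiny) bucket is scanned, not the whole table.
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

foo = ToyHashTable()
foo.set('abc', 123)
foo.set('def', 456)
print(foo.get('abc'))  # 123
```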

As long as every key generates a unique hash, hash tables are extremely efficient. But since many possible keys get mapped to a limited number of buckets, a hash table needs to be able to gracefully handle collisions. This is usually done with a much slower fallback.

In a worst-case scenario, every key hashes to the exact same value, colliding with all other keys, and every insert uses the much more expensive fallback. This would be extremely unlikely with cryptographic hash functions, but for the simpler hash functions traditionally used in hash tables people managed to calculate large sets of colliding keys to intentionally cause high load in web applications for denial-of-service attacks.
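We can simulate that worst case in python by giving every key the same hash. AllSameHash is a made-up class for this sketch; real attacks instead compute colliding inputs for the table’s actual hash function:

```python
# Every instance of this made-up class hashes to the same value, so
# every insert collides and has to fall back to the slow path.
class AllSameHash:
    def __init__(self, name):
        self.name = name

    def __hash__(self):
        return 42  # all instances collide

    def __eq__(self, other):
        return self.name == other.name

d = {}
for i in range(1000):
    d[AllSameHash(str(i))] = i

# The dict still works correctly -- the fallback handles collisions --
# but every insert had to probe past the earlier entries, so building
# the table takes quadratic time overall.
print(d[AllSameHash('500')])  # 500
```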

To avoid this worst case python uses a keyed hash function; the key is a random value generated by the python interpreter at startup. This makes it hard (if not impossible) for an attacker to predict the hashes python is going to use for its hash tables, and therefore very hard to cause collisions on purpose (beyond the few happening naturally with a high number of keys).

This can be demonstrated with the hash() function. We get the same output for the same input, but after restarting python we get a completely different hash:

% python
Python 3.9.6 (default, Jun 30 2021, 10:22:16)
[GCC 11.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> hash('asdf')
3521738452898009356
>>> hash('asdf')
3521738452898009356
>>>
% python
Python 3.9.6 (default, Jun 30 2021, 10:22:16)
[GCC 11.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> hash('asdf')
8679873952076616912
>>> hash('asdf')
8679873952076616912
>>>
%

The first root-cause suspect was the dict type, but starting with python 3.7, the dict type is actually insertion-order preserving. This means enumerating it is going to return the items in deterministic insertion order.

% python -c 'print([x for x in {k: 1 for k in map(str, range(10))}])'
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
% python -c 'print([x for x in {k: 1 for k in map(str, range(10))}])'
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
% python -c 'print([x for x in {k: 1 for k in map(str, range(10))}])'
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
% python -c 'print([x for x in {k: 1 for k in map(str, range(10))}])'
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

This means dict is likely not the problem here.

There are other types in python that behave very similarly though, like set or frozenset. A set is like a dict without values: you’re only able to check if a given key exists. This allows for very fast “does this collection contain X” checks. In my earlier programming days I used a dict for this, but a set is the data type optimized for exactly that.
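As a small sketch, this is the kind of membership check a set is optimized for (the directory names are borrowed from the watchgod diff above):

```python
# Membership checks against a set are hash lookups, so they stay fast
# regardless of the set's size -- unlike scanning a list element by
# element, which is O(n) per lookup.
ignored_dirs = {'__pycache__', 'site-packages', '.idea', '.git', 'node_modules'}

print('.git' in ignored_dirs)  # True
print('src' in ignored_dirs)   # False
```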

If we try again with a set we can see that insert order is not preserved:

% python -c 'print([x for x in set(map(str, range(10)))])'
['9', '2', '8', '0', '3', '7', '4', '6', '1', '5']
% python -c 'print([x for x in set(map(str, range(10)))])'
['1', '0', '5', '3', '4', '9', '2', '6', '7', '8']
% python -c 'print([x for x in set(map(str, range(10)))])'
['5', '8', '9', '7', '3', '2', '4', '6', '1', '0']
% python -c 'print([x for x in set(map(str, range(10)))])'
['8', '1', '7', '4', '5', '9', '3', '6', '0', '2']

The hash randomization is in full effect here. If we invoke the marshal module manually we can see python just enumerates the set, and the enumeration order has a direct impact on the generated bytecode (the first element starts at the 8th byte, right after z\x01. Remember our z buddy from earlier? The z means TYPE_SHORT_ASCII and \x01 is the length):

% python -c 'import marshal;print(marshal.dumps(set(map(str, range(10)))))'
b'<\n\x00\x00\x00z\x012z\x010z\x018z\x014z\x019z\x016z\x011z\x013z\x017z\x015'
% python -c 'import marshal;print(marshal.dumps(set(map(str, range(10)))))'
b'<\n\x00\x00\x00z\x014z\x019z\x010z\x013z\x011z\x015z\x016z\x018z\x012z\x017'
% python -c 'import marshal;print(marshal.dumps(set(map(str, range(10)))))'
b'<\n\x00\x00\x00z\x014z\x015z\x012z\x016z\x017z\x011z\x019z\x018z\x010z\x013'

Deterministic bytecode generation

The set’s enumeration order depends directly on the value of hash(). Within the same python process we’re using the same hash seed and get a stable order for our set:

% python
Python 3.9.6 (default, Jun 30 2021, 10:22:16)
[GCC 11.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> print([x for x in set(map(str, range(10)))])
['5', '6', '9', '1', '3', '2', '8', '4', '0', '7']
>>> print([x for x in set(map(str, range(10)))])
['5', '6', '9', '1', '3', '2', '8', '4', '0', '7']
>>>
% python
Python 3.9.6 (default, Jun 30 2021, 10:22:16)
[GCC 11.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> print([x for x in set(map(str, range(10)))])
['4', '9', '5', '2', '8', '3', '0', '1', '6', '7']
>>> print([x for x in set(map(str, range(10)))])
['4', '9', '5', '2', '8', '3', '0', '1', '6', '7']
>>>

Since the ordering depends directly on this value, let’s try to make it stable. If we provide a static hash seed we get stable hashes across different python starts:

% PYTHONHASHSEED=0 python
Python 3.9.6 (default, Jun 30 2021, 10:22:16)
[GCC 11.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> hash('asdf')
5039985649710418232
>>> hash('asdf')
5039985649710418232
>>>
% PYTHONHASHSEED=0 python
Python 3.9.6 (default, Jun 30 2021, 10:22:16)
[GCC 11.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> hash('asdf')
5039985649710418232
>>> hash('asdf')
5039985649710418232
>>>
%

With this static hash seed we get stable enumeration for our set even across different python processes:

% PYTHONHASHSEED=0 python -c 'print([x for x in set(map(str, range(10)))])'
['5', '3', '2', '7', '0', '4', '1', '8', '9', '6']
% PYTHONHASHSEED=0 python -c 'print([x for x in set(map(str, range(10)))])'
['5', '3', '2', '7', '0', '4', '1', '8', '9', '6']
% PYTHONHASHSEED=0 python -c 'print([x for x in set(map(str, range(10)))])'
['5', '3', '2', '7', '0', '4', '1', '8', '9', '6']
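The same check can be scripted; this is a small sketch using subprocess, with sys.executable standing in for the python binary used in the shell sessions above:

```python
import os
import subprocess
import sys

# Scripted version of the shell experiment above: with PYTHONHASHSEED
# pinned to 0, separate interpreter processes enumerate the set in
# the same order.
def set_order(env_extra):
    env = {**os.environ, **env_extra}
    out = subprocess.run(
        [sys.executable, '-c', 'print([x for x in set(map(str, range(10)))])'],
        env=env, capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

seeded = [set_order({'PYTHONHASHSEED': '0'}) for _ in range(3)]
print(seeded[0])
print(len(set(seeded)))  # 1 -- all three runs agree
```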

With a stable enumeration order the small clusters of randomly ordered items in python bytecode disappear:

% PYTHONHASHSEED=0 python -c 'import marshal;print(marshal.dumps(set(map(str, range(10)))))'
b'<\n\x00\x00\x00z\x015z\x013z\x012z\x017z\x010z\x014z\x011z\x018z\x019z\x016'
% PYTHONHASHSEED=0 python -c 'import marshal;print(marshal.dumps(set(map(str, range(10)))))'
b'<\n\x00\x00\x00z\x015z\x013z\x012z\x017z\x010z\x014z\x011z\x018z\x019z\x016'
% PYTHONHASHSEED=0 python -c 'import marshal;print(marshal.dumps(set(map(str, range(10)))))'
b'<\n\x00\x00\x00z\x015z\x013z\x012z\x017z\x010z\x014z\x011z\x018z\x019z\x016'

Security Implications

Are there any security implications when using a static hash seed at build-time?

That’s a very broad question, but after investigating this with the Arch Linux Security Team it doesn’t seem like there’s a runtime effect of deterministic bytecode.

If we look at the relevant section in marshal.c we can see that an empty PyDict/PySet is created while unmarshaling and then populated with calls to PyDict_SetItem and PySet_Add. Both of them call PyObject_Hash (marked with /* HERE */ below):

int
PyDict_SetItem(PyObject *op, PyObject *key, PyObject *value)
{
    PyDictObject *mp;
    Py_hash_t hash;
    if (!PyDict_Check(op)) {
        PyErr_BadInternalCall();
        return -1;
    }
    assert(key);
    assert(value);
    mp = (PyDictObject *)op;
    if (!PyUnicode_CheckExact(key) ||
        (hash = ((PyASCIIObject *) key)->hash) == -1)
    {
        hash = PyObject_Hash(key); /* HERE */
        if (hash == -1)
            return -1;
    }

    if (mp->ma_keys == Py_EMPTY_KEYS) {
        return insert_to_emptydict(mp, key, hash, value);
    }
    /* insertdict() handles any resizing that might be necessary */
    return insertdict(mp, key, hash, value);
}

/* [...] */

int
PySet_Add(PyObject *anyset, PyObject *key)
{
    if (!PySet_Check(anyset) &&
        (!PyFrozenSet_Check(anyset) || Py_REFCNT(anyset) != 1)) {
        PyErr_BadInternalCall();
        return -1;
    }
    return set_add_key((PySetObject *)anyset, key);
}

static int
set_add_key(PySetObject *so, PyObject *key)
{
    Py_hash_t hash;

    if (!PyUnicode_CheckExact(key) ||
        (hash = ((PyASCIIObject *) key)->hash) == -1) {
        hash = PyObject_Hash(key); /* HERE */
        if (hash == -1)
            return -1;
    }
    return set_add_entry(so, key, hash);
}

PyObject_Hash uses a seed stored in a value called _Py_HashSecret, which is initialized in _Py_HashRandomization_Init. If PYTHONHASHSEED is set in the environment, use_hash_seed is true and hash_seed is set to the value from the environment (0 in our PYTHONHASHSEED=0 case):

PyStatus
_Py_HashRandomization_Init(const PyConfig *config)
{
    void *secret = &_Py_HashSecret;
    Py_ssize_t secret_size = sizeof(_Py_HashSecret_t);

    if (_Py_HashSecret_Initialized) {
        return _PyStatus_OK();
    }
    _Py_HashSecret_Initialized = 1;

    if (config->use_hash_seed) {
        if (config->hash_seed == 0) {
            /* disable the randomized hash */
            memset(secret, 0, secret_size);
        }
        else {
            /* use the specified hash seed */
            lcg_urandom(config->hash_seed, secret, secret_size);
        }
    }
    else {
        /* use a random hash seed */
        int res;

        /* _PyRandom_Init() is called very early in the Python initialization
           and so exceptions cannot be used (use raise=0).
           _PyRandom_Init() must not block Python initialization: call
           pyurandom() is non-blocking mode (blocking=0): see the PEP 524. */
        res = pyurandom(secret, secret_size, 0, 0);
        if (res < 0) {
            return _PyStatus_ERR("failed to get random numbers "
                                 "to initialize Python");
        }
    }
    return _PyStatus_OK();
}

This means that as long as the static hash seed is only set at build time, the hash function at runtime is still keyed with a freshly generated random value, and new hashes are calculated for the keys contained in the marshaled data. Knowing the marshaled python bytecode doesn’t give an attacker any advantage for hash-collision DoS attacks.
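We can sanity-check this with a small sketch: spawning interpreters without PYTHONHASHSEED in the environment should give freshly randomized hashes every time, regardless of what seed the bytecode was built with:

```python
import os
import subprocess
import sys

# Without PYTHONHASHSEED in the environment, every interpreter start
# draws a fresh _Py_HashSecret, so hash('asdf') varies between runs.
def runtime_hash():
    env = {k: v for k, v in os.environ.items() if k != 'PYTHONHASHSEED'}
    out = subprocess.run(
        [sys.executable, '-c', "print(hash('asdf'))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

runs = {runtime_hash() for _ in range(5)}
print(len(runs))  # almost certainly 5 -- 64-bit hashes practically never collide
```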

Conclusion

There are no concerns about hash-table collisions being used in denial-of-service attacks on the Arch Linux build infrastructure. Loading deterministic python bytecode doesn’t open up python to attacks at runtime.

Allan McRae sent a patch for libmakepkg to pacman-dev that I’m very excited about!

I’d like to thank Allan for his contribution to reproducible Arch Linux tooling, Yhg1s from #python for his pointers into the cpython codebase and Levente ‘anthraxx’ Polyak, who has helped with the python bytecode investigation. Thanks! ♥️