Single process, multiple interpreters, no GIL contention - pre-Python3.12

Kyle Franz

May 26, 2025

First off…

What’s up with Basis?

Dunno. I’ve spent the past few months working for money rather than for “free”1. I feel like I’m working more hours for them than I was on Basis, but the distribution is a little bit different. I also had a 35-year-old Marshall Amp land in my lap2, so I’ve been learning guitar. That being said: I still like working with the framework, and I’ll probably keep using it for my own projects. If nobody uses it, oh well. I need to update the website and license to reflect the current state of things, but there’s no hurry. In any case I’m going to use this as a space to write about some of the robotics/C++/tech stuff I deal with. I don’t think Thomas will mind, anyhow. I have some fun stuff from my day job about how running stat on a certain directory can cause all networking (really, anything using hard IRQs) to hang for dozens of milliseconds.

After driving to South Bay and back several times in a week, I let my mind wander a bit on how I would make Python work in Basis. I came up with a pretty good scheme for the serialization (probably can get away with just reserializing once for other languages), but there’s one big problem - how would I make stuffing multiple interpreters into one process work? Using subinterpreters would probably work, but would restrict use to 3.12+. While I think a lot of people will want to use the latest and greatest Python, in practice it’s easy to get stuck on an older version due to dependencies. Particularly, I’d like to import rospy message definitions without having to worry about mucking around with PYTHONPATH to manually add them in. Surely there has to be a way of doing this, right?

A quick rundown on Basis’s architecture

Basis loads everything up into the same process space, unless you opt out.3 My not so secret belief is that “shared memory” is actually a bit of a scam, and that you can go even faster if you put everything into the same process space and share shared_ptr around. True zero copy without needing to manage shared memory segments. In my mind, a robot should probably have around 3 or 4 processes running “robotics” code. One for your perception pipeline, one for your planning pipeline, one for your low level controls/safety, and one for everything else dealing with telemetry, visualization, etc. ROS1 can do this with nodelets, though the API for them is poor. ROS2 does this with “components” - this actually looks like a pretty neat API, good on them. One thing Basis can do that others cannot, as far as I can tell, is run without a Coordinator (ROS Master, if you lean that way) if you have only one process in your launch file.

flowchart TB
    Coordinator-->p1
    Coordinator-->p2
    
    subgraph p1 [process 1]
        direction TB
        realsense_capture
        perception_scene
    end

    subgraph p2 [process 2]
        direction TB
        ll_controls
    end

(side note: I’m noticing now that these diagrams are cut off on Safari - if you have ideas why, lemme know)

This presents a bit of a problem for Python, though. Python (pre 3.12) does not like being run as multiple instances in one process. It just straight out doesn’t work - everything runs on the same runtime, and you can’t call Py_Initialize again to get a second runtime. You can create a separate interpreter using Py_NewInterpreter, but this is mainly for getting a new environment4; all interpreters still share the same GIL. For a large Python robotics codebase, this would mean horrible contention between interpreters.

I was able to “fix” this, on Python3.8. As far as I know, nobody has ever done this before, but I would be very interested in prior art on this. (I found this gist saying it sorta worked, but broke with multithreading). I found another post on HN from girfan/Gohar Chaudhry that mentions doing the same. I’ve reached out to him to see if he was able to get a fully working solution or not.

Disclaimer: I’m not a Python expert, nor a Linux expert5, and I’m especially not a Python internals expert. I recommend not trying to do anything in this post, except for educational purposes.

Disclaimer 2: This is a somewhat meandering post, if you’re interested in the actual way to do this, go to the TLDR

The long road to a solution

Getting it to work, sorta

I started off by trying some simple test code in a Unit using pybind11. It was essentially:

  pybind11::scoped_interpreter py;
  pybind11::print("Hello, World!");

Sure enough, running one copy printed Hello, World! but loading the Unit twice via a launch file

---
# this launch file loads up the same Unit (shared object) twice, with two different names and sets of arguments
units:
  py_a:
    unit: pybind_test
    args:
      pub: True
  py_b:
    unit: pybind_test
    args:
      pub: False

got me this error

[12255.044380416] [launch] [info] Started thread with unit /opt/basis/unit/pybind_test.unit.so
terminate called after throwing an instance of 'std::runtime_error'
  what():  The interpreter is already running

This is entirely expected, and documented in both pybind and Python.

The initial “fix”

Basis Units are loaded via dlopen, essentially as plugins. If you’ve never seen a call to dlopen before, it looks like this.

// For now - need to use RTLD_GLOBAL to allow different inproc transports to communicate
void *handle = dlopen(path.c_str(), RTLD_NOW | RTLD_GLOBAL );

This call takes a path to a shared object and loads it into a handle that one can pull symbols from.

If you’re interested in how dlopen is used and how to load an arbitrary function from a shared object, read on.

Each Unit declares a function that will be used as the interface to the plugin loader - declaring as extern "C" allows for referencing them by name without having to know how C++ types are mangled.

// create_unit.h
extern "C" {
/**
 * Forward declaration of CreateUnit - declared once in each unit library to provide an easy interface to create the
 * contained unit without prior type knowledge. Basically - the entrypoint into a unit "plugin"
 * ...more docs...
 */
basis::Unit *CreateUnit(const std::optional<std::string_view> &unit_name_override,
                        const basis::arguments::CommandLineTypes &command_line,
                        basis::CreateUnitLoggerInterface error_logger);
}
// create_unit.cpp

// For now - need to use RTLD_GLOBAL to allow different inproc transports to communicate
void *handle = dlopen(path.c_str(), RTLD_NOW | RTLD_GLOBAL );

// ...

using CreateUnitCallback = decltype(CreateUnit) *;
auto load_unit = reinterpret_cast<CreateUnitCallback>(dlsym(handle, "CreateUnit"));

// ...

Unit* unit = load_unit(...);

This is pretty straightforward code, for dealing with Linux internals. We take a path to a shared object, load it up, grab a named symbol from it, cast it to the correct type, and call it. Inside, it essentially calls return new MyUnitType(), along with doing some other work to set up context for the unit, such as name, environment, args, etc.
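The same load, look up, cast, call sequence can be tried outside of Basis. This is a minimal sketch, assuming Linux with glibc; `load_unary_math_fn` and `UnaryMathFn` are names made up for the example, and libm stands in for a Unit only because it is always available.

```cpp
#include <dlfcn.h>          // dlopen, dlsym, dlerror
#include <gnu/lib-names.h>  // LIBM_SO (glibc-specific)
#include <stdexcept>
#include <string>

// Hypothetical helper: the function type we expect the symbol to have.
using UnaryMathFn = double (*)(double);

// Loads `library`, looks up `symbol`, and casts it to the expected type.
// Nothing checks that the cast is right; that is on the caller, exactly as
// with CreateUnit above.
UnaryMathFn load_unary_math_fn(const char *library, const char *symbol) {
  void *handle = dlopen(library, RTLD_NOW | RTLD_LOCAL);
  if (!handle) {
    const char *err = dlerror();
    throw std::runtime_error(std::string("dlopen failed: ") +
                             (err ? err : "unknown error"));
  }
  void *address = dlsym(handle, symbol);
  if (!address) {
    const char *err = dlerror();
    throw std::runtime_error(std::string("dlsym failed: ") +
                             (err ? err : "unknown error"));
  }
  return reinterpret_cast<UnaryMathFn>(address);
}
```

Calling `load_unary_math_fn(LIBM_SO, "cos")` hands back a pointer that can be called like the regular `cos`. On older glibc versions this needs `-ldl` at link time.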

Even after trying RTLD_LOCAL and RTLD_DEEPBIND to try and get things a little more self contained, we still got a crash. Turns out that dlopen will give you the same handle if called twice on the same path. Makes sense, somewhat. I could get around this by making a my_unit.so and a my_unit_2.so, and loading them separately with the proper flags to not share symbols…but it’s not enough. They are still using the same underlying python3.8.so. I could do the same trick to the Python lib, but now I have to worry about any underlying libraries, etc, etc. Not tenable, especially given I won’t have control over what pip installed libs will load. Surely there’s a better way, right?

There is, and it’s called dlmopen. I’d never heard of it before reading about it tonight (I’m not sure if I found it on Stack Overflow first or with a friendly AI bot). Some history on why it exists is here. As mentioned in the article, and as we’ll see later, it’s good for isolating libraries, but can be a somewhat leaky abstraction.

The usage is pretty simple for this case - just pass in LM_ID_NEWLM as the first argument (to declare a new symbol namespace), and now this library doesn’t have to share any symbols with the rest of the process.6
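Both behaviours (dlopen handing back the same handle for the same path, and dlmopen with LM_ID_NEWLM creating a separate namespace) can be checked in isolation. This is a rough sketch, assuming Linux with glibc; `NamespaceProbe` and `probe_namespaces` are hypothetical names, and libm is used only as a small library that is always present.

```cpp
#include <dlfcn.h>          // dlopen, dlmopen, dlinfo (GNU extensions)
#include <gnu/lib-names.h>  // LIBM_SO (glibc-specific)

struct NamespaceProbe {
  bool same_handle_from_dlopen = false;  // second dlopen returned the first handle
  bool new_handle_from_dlmopen = false;  // dlmopen returned a distinct handle
  bool different_namespace = false;      // and it lives in another link-map namespace
};

NamespaceProbe probe_namespaces() {
  NamespaceProbe result;
  void *first = dlopen(LIBM_SO, RTLD_NOW | RTLD_LOCAL);
  void *second = dlopen(LIBM_SO, RTLD_NOW | RTLD_LOCAL);
  void *isolated = dlmopen(LM_ID_NEWLM, LIBM_SO, RTLD_NOW | RTLD_LOCAL);
  if (!first || !second || !isolated) {
    return result;  // leave everything false on failure
  }

  // Ask the loader which namespace each handle belongs to.
  Lmid_t base_id = 0;
  Lmid_t isolated_id = 0;
  if (dlinfo(first, RTLD_DI_LMID, &base_id) != 0 ||
      dlinfo(isolated, RTLD_DI_LMID, &isolated_id) != 0) {
    return result;
  }

  result.same_handle_from_dlopen = (first == second);
  result.new_handle_from_dlmopen = (isolated != first);
  result.different_namespace = (isolated_id != base_id);
  return result;
}
```

The handles are deliberately never closed to keep the sketch short. glibc only allows a small, fixed number of namespaces per process, so this is not something to call in a loop.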

Using it like this

void *handle = dlmopen(LM_ID_NEWLM, path.c_str(), RTLD_NOW );

we get a somewhat successful looking log:

loading unit "/opt/basis/unit/pybind_test.unit.so"
[1970-01-01 03:50:51.588] [/py_b] [info] [pybind_test.cpp:10] About to call python interpreter function
Hello, World!
[13094.921738966] [launch] [info] Started thread with unit /opt/basis/unit/pybind_test.unit.so
loading unit "/opt/basis/unit/pybind_test.unit.so"
[1970-01-01 03:50:51.608] [/py_a] [info] [pybind_test.cpp:10] About to call python interpreter function
Hello, World!
[13094.942370882] [launch] [info] Started thread with unit /opt/basis/unit/pybind_test.unit.so

Awesome! It works! We’re done now, right?

No - remember that comment in the original code? “need to use RTLD_GLOBAL to allow different inproc transports to communicate”. While I generally like to avoid it, Basis uses static in one important place - to make inproc type safe communication channels between Units.

class InprocTransport {
  template <typename T> InprocConnectorBase *GetConnector() { return GetConnectorInternal<T>(); }

private:
  template <typename T> InprocConnector<T> *GetConnectorInternal() {
    // TODO: this static somewhat breaks the nice patterns around being explicit about how objects are initialized
    static InprocConnector<T> connector; // this is basically a wrapper around a list of typed subscribers 
    return &connector;
  }
  ...
}

This TODO is coming back to bite me a little bit. I could change the Unit API to find some clever way to inject these into newly created units, but it’s hurting my head trying to keep it somewhat type safe and ODR correct. The templating will make this a real pain, I think it restricts me from making one InprocTransport and injecting it in. I think I’d have to iterate each message type the Unit can use, keep some sort of map at the launcher level, do more magic with dlsym etc, etc. It’s almost certainly possible, but might be a mess. Even if I fixed that, we use static variables in other places, like logging - look at the 1970 date in the log above - by doing this we broke the global log formatter. I’m not willing to give up all uses of static, it’s just too convenient.7
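To make the role of that static concrete, here is a simplified stand-in, not the actual Basis implementation. `InprocConnectorSketch` and `GetConnectorSketch` are hypothetical names. The function-local static gives one connector per message type per loaded copy of the code, which is why loading a Unit into its own namespace leaves it with its own, empty set of subscribers.

```cpp
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Simplified stand-in for a typed in-process channel: a list of subscribers
// that all receive the same shared_ptr, so no message is ever copied.
template <typename T>
class InprocConnectorSketch {
public:
  using Callback = std::function<void(std::shared_ptr<const T>)>;

  void Subscribe(Callback callback) {
    subscribers_.push_back(std::move(callback));
  }

  void Publish(const std::shared_ptr<const T> &message) {
    for (auto &callback : subscribers_) {
      callback(message);
    }
  }

private:
  std::vector<Callback> subscribers_;  // not thread safe, for brevity
};

// One connector per T, created on first use. Every caller that shares this
// copy of the code sees the same object.
template <typename T>
InprocConnectorSketch<T> &GetConnectorSketch() {
  static InprocConnectorSketch<T> connector;
  return connector;
}
```

A publisher and a subscriber that both call `GetConnectorSketch<MyMessage>()` meet on the same object without ever being introduced to each other, which is the convenience being defended here.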

The good news: instead of loading the whole unit in a separate namespace, we can load Python into a new namespace.

// Open the shared object
void* pyhandle = dlmopen(LM_ID_NEWLM, "libpython3.8.so", RTLD_NOW | RTLD_LOCAL | RTLD_DEEPBIND);

// For each symbol we want to use, put a shim in between our call and the python lib
#define SHIM_PY(f) \
  auto f = reinterpret_cast<decltype(::f)*>(dlsym(pyhandle, #f)); \
  if (!f) { \
    std::cerr << "dlerror: " << dlerror() << std::endl; \
  }

// ie
// auto Py_IsInitialized = reinterpret_cast<decltype(::Py_IsInitialized)*>(dlsym(pyhandle, "Py_IsInitialized"));
SHIM_PY(Py_IsInitialized);
SHIM_PY(Py_InitializeEx);
SHIM_PY(Py_Finalize);
SHIM_PY(PyRun_SimpleStringFlags); // beware - PyRun_SimpleString won't work here as it's a define

// Now we can pretend we're calling the global python API but we're actually using our shim
Py_InitializeEx(0);
assert(Py_IsInitialized());
PyRun_SimpleStringFlags("print('Hello from Python!')", NULL);

Doing this we get the expected output:

basis@395dd188e6a2:/basis/build$ basis launch ../unit/pybind_test/launch/two.launch.yaml 
[20410.047686838] [launch] [info] Running process "/" with 2 units
  /py_b: pybind_test --pub False
  /py_a: pybind_test --pub True
Hello from Python!
[20410.060431713] [launch] [info] Started thread with unit /opt/basis/unit/pybind_test.unit.so
Hello from Python!
[20410.070156171] [launch] [info] Started thread with unit /opt/basis/unit/pybind_test.unit.so

At this point, I thought to myself: “Great! it should be smooth sailing from now on - I just have to find a way to wrap this into a nice library. I can either use some macro magic to make pybind compatible with this usage of Python, or I can put pybind and Python in a library, shim it, and then I’ll finish up the blog post.”

This was very wrong, and wasted enough time that I could have just written the rest of the python bindings that I actually needed for this in the time it took. The dangers of not having to worry about if the thing you’re writing is useful, I guess.

This approach broke the moment I tried to run Python code in a different thread, which is super important for use in Basis. It’s fairly complex, but the main issue is that pthread_getspecific and friends (used to dynamically create thread local variables) were returning different values when called directly versus via a Python API call obtained through dlmopen/dlsym. This happened because I ended up with two copies of glibc. I can’t put my finger on exactly why this was bad, but I believe it had something to do with the glibc creating new threads not being the same one responsible for managing the thread locals used internally by Python.8
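For readers who have not used the pthread key API, this is a small sketch of what "dynamically created thread local variables" means when there is only one libc. `keys_are_per_thread` is a name made up for the example. A second copy of glibc in another namespace is exactly the situation this sketch does not cover.

```cpp
#include <pthread.h>
#include <thread>

// Creates one key, stores a different pointer under it from two threads, and
// reports whether each thread read back its own value while the calling
// thread, which never stored anything, still sees nullptr.
bool keys_are_per_thread() {
  pthread_key_t key;
  if (pthread_key_create(&key, nullptr) != 0) {
    return false;
  }

  int value_a = 1;
  int value_b = 2;
  bool ok_a = false;
  bool ok_b = false;

  auto worker = [key](int *value, bool *ok) {
    pthread_setspecific(key, value);
    *ok = (pthread_getspecific(key) == value);
  };

  std::thread thread_a(worker, &value_a, &ok_a);
  std::thread thread_b(worker, &value_b, &ok_b);
  thread_a.join();
  thread_b.join();

  const bool caller_sees_nothing = (pthread_getspecific(key) == nullptr);
  pthread_key_delete(key);
  return ok_a && ok_b && caller_sees_nothing;
}
```

The key itself is just a small integer handed out by the libc that created it, so it is only meaningful to that copy of the library.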

It took me a while to even understand what was broken. Asking an LLM was mainly met with what the LLM might approximate as horror (as much as an LLM can feel anyhow) about what I was doing, rather than actual helpful advice for tracking down a fix. Finally after a bunch of research, I found a fix: we can inject the pthread portions of glibc from the main process space into our namespaced python.

The “real” “fix”

  1. Load shim_python.so, which links against libpython3.8.so [9]
  2. Inject a struct containing the host implementation of the functions I wanted to override
  3. Add implementations of each function forwarding to the host version
// Header
typedef struct {
    // Needed for TLS to work properly
    void* (*pthread_getspecific)(pthread_key_t);
    int (*pthread_setspecific)(pthread_key_t, const void*);
    int (*pthread_key_create)(pthread_key_t*, void (*)(void*));
    int (*pthread_key_delete)(pthread_key_t);
    // Needed for threading.Thread to work properly
    int (*pthread_create) (pthread_t *__restrict __newthread,
			   const pthread_attr_t *__restrict __attr,
			   void *(*__start_routine) (void *),
			   void *__restrict __arg) __THROWNL __nonnull ((1, 3));
    // Note: we should probably also add pthread destruction here as well - YOLO
} pthread_shim_table_t;

pthread_shim_table_t* shim_pthread_table;

// Impl
void *pthread_getspecific(pthread_key_t key) {
  return shim_pthread_table->pthread_getspecific(key);
}

Now when Python uses any thread machinery, it uses the pthread implementation from the main namespace (the host process). It’s even safe to initialize two Python instances on the main thread this way - they each have their own separate thread local storage for their state.
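The injection step is not shown above, so here is one way it could look, collapsed into a single file so it can be compiled on its own. This is a sketch under assumptions: `PthreadShimTableSketch`, `shim_install_table`, `host_table` and the `shim_`-prefixed forwarders are hypothetical names. In the real setup the forwarders carry the actual pthread names and live inside the namespaced library, and the install function would be found by the host via dlsym.

```cpp
#include <pthread.h>

// Table of host-side pthread entry points (a cut-down version of the struct above).
struct PthreadShimTableSketch {
  void *(*getspecific)(pthread_key_t);
  int (*setspecific)(pthread_key_t, const void *);
  int (*key_create)(pthread_key_t *, void (*)(void *));
  int (*key_delete)(pthread_key_t);
};

// Shim side: remembers the table it was handed.
static const PthreadShimTableSketch *g_shim_table = nullptr;

void shim_install_table(const PthreadShimTableSketch *table) {
  g_shim_table = table;
}

// Shim side: forwarders. They must not be called before the table is installed.
void *shim_getspecific(pthread_key_t key) {
  return g_shim_table->getspecific(key);
}

int shim_setspecific(pthread_key_t key, const void *value) {
  return g_shim_table->setspecific(key, value);
}

// Host side: fills the table with the host's own pthread functions.
const PthreadShimTableSketch *host_table() {
  static const PthreadShimTableSketch table = {
      &pthread_getspecific, &pthread_setspecific,
      &pthread_key_create, &pthread_key_delete};
  return &table;
}
```

After `shim_install_table(host_table())`, a value stored through `shim_setspecific` is visible to the host's own `pthread_getspecific`, which is the property the namespaced Python needs.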

Running basis launch ../unit/pybind_test/launch/two.launch.yaml again we get:

[197792.817437253] [launch] [info] Running process "/" with 2 units
  /py_b: pybind_test --pub False
  /py_a: pybind_test --pub True
Running on main thread
Kicking off code on a thread!
[197792.968906169] [launch] [info] Started thread with unit /opt/basis/unit/pybind_test.unit.so
Running on main thread
Kicking off code on a thread!
[197793.060450753] [launch] [info] Started thread with unit /opt/basis/unit/pybind_test.unit.so
This is a callback from a Python thread.
InprocTestTrigger()
This is a callback from a Python thread.
InprocTestTrigger()
Sending a trigger to another unit()
[197794.067803003] [/py_b] [info] Got an inproc trigger
Got a trigger from another unit
[197794.068057087] [/py_b] [info] Current t_count 1
[197794.069997545] [/py_a] [info] Got an inproc trigger
Got a trigger from another unit
[197794.070058420] [/py_a] [info] Current t_count 1
InprocTestTrigger()
This is a callback from a Python thread.
InprocTestTrigger()
Sending a trigger to another unit()
Got a trigger from another unit
[197795.066806504] [/py_a] [info] Got an inproc trigger
This is a callback from a Python thread.
[197795.066806504] [/py_b] [info] Got an inproc trigger
Got a trigger from another unit
[197795.066937087] [/py_a] [info] Current t_count 1
[197795.066944337] [/py_b] [info] Current t_count 2

I won’t go into the details, but this is running Python code in both separate C++ threads as well as threads kicked off from Python, with confirmation that two C++ threads are holding two separate GILs.

Hark, a bug report

As I was writing this post, I came upon this post to Sourceware claiming a bug against glibc. I’m not quite sure of the context, but they run into other issues around multiple copies of glibc, and suggest an even better workaround. Instead of injecting from the main namespace, they reach into it from the side namespace via dlmopen (LM_ID_BASE, LIBC_SO, RTLD_LAZY);. I like it. I’m not going to rewrite what I have to match it, but I like it.
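A rough sketch of that alternative, assuming glibc: `lookup_in_base_libc` and `base_libc_agrees_on_current_thread` are hypothetical names, and pthread_self is used for the check only because it has lived in libc itself for a long time. Which pthread symbols are exported from libc (rather than libpthread) depends on the glibc version, so treat the symbol choice as an assumption.

```cpp
#include <dlfcn.h>          // dlmopen, dlsym, LM_ID_BASE (GNU extensions)
#include <gnu/lib-names.h>  // LIBC_SO (glibc-specific)
#include <pthread.h>

// Looks up `symbol` in the libc of the base namespace, no matter which
// namespace the calling code was loaded into.
void *lookup_in_base_libc(const char *symbol) {
  void *libc = dlmopen(LM_ID_BASE, LIBC_SO, RTLD_LAZY);
  return libc ? dlsym(libc, symbol) : nullptr;
}

// Sanity check: the base libc's pthread_self should describe this thread.
bool base_libc_agrees_on_current_thread() {
  using SelfFn = pthread_t (*)();
  auto self = reinterpret_cast<SelfFn>(lookup_in_base_libc("pthread_self"));
  return self != nullptr && pthread_equal(self(), pthread_self()) != 0;
}
```

Compiled into the main program this is trivially true. The interesting case is the same code built into a library that was loaded with LM_ID_NEWLM.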

Cleaning up loose ends

Importing numpy - locales are “fun”

After getting this working, I tried to import numpy, and got crashes around string parsing. This was very odd, and after some more tinkering, turned out to be locale problems. Did you know that when you call islower in C(++), the results depend on the locale for your thread? That’s right, more thread local shenanigans. It somewhat makes sense in a C fashion that you might want to easily isolate changes to locales, and do it via a per-thread locale. The other choice is globally setting the locale via setlocale. Neither option feels great to me. Notably, musl chose not to really implement them - I wonder how widely they are intentionally used nowadays.
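On glibc, the classification functions boil down to table lookups through per-thread pointers returned by `__ctype_b_loc` and `__ctype_tolower_loc`. The sketch below does that lookup by hand; it is glibc-specific, and `is_lower_via_table` and `to_lower_via_table` are names made up for the example.

```cpp
#include <ctype.h>  // glibc: __ctype_b_loc, __ctype_tolower_loc, _ISlower

// Roughly what islower() does on glibc: index the current thread's
// classification table and test a bit.
bool is_lower_via_table(unsigned char c) {
  const unsigned short *table = *__ctype_b_loc();
  return (table[c] & _ISlower) != 0;
}

// Roughly what tolower() does on glibc for in-range input: index the current
// thread's conversion table.
int to_lower_via_table(unsigned char c) {
  return (*__ctype_tolower_loc())[c];
}
```

If the pointer behind `__ctype_b_loc` is wrong or uninitialized for the calling thread, every one of these lookups goes wrong with it, which fits crashes showing up around string parsing.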

I fixed this by implementing my own non-TLS-using versions of them, shown here if you’re curious. I probably should have just copied from musl.
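const int32_t **__ctype_tolower_loc(void) {
  static int32_t table[384] = {0};
  static const int32_t *ptr = NULL;

  if (!ptr) {
    for (int i = 0; i < 384; ++i) {
        // Only ASCII upper-case letters change; everything else maps to itself
        if(i >= 'A' && i <= 'Z') {
            table[i] = i | 32;
        }
        else {
            table[i] = i;
        }
    }
    // Note: glibc's own table can also be indexed from -128; this one can't
    ptr = table;
  }

  return &ptr;
}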

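const int32_t **__ctype_toupper_loc(void) {
  static int32_t table[384] = {0};
  static const int32_t *ptr = NULL;

  if (!ptr) {
    for (int i = 0; i < 384; ++i) {
        // Only ASCII lower-case letters change; everything else maps to itself
        if(i >= 'a' && i <= 'z') {
            table[i] = i - 32;
        }
        else {
            table[i] = i;
        }
    }
    // Note: glibc's own table can also be indexed from -128; this one can't
    ptr = table;
  }

  return &ptr;
}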


// This was borrowed from stack overflow post, sorry
static const unsigned short b_loc_table[384] = {
    0x0002, //
    0x0002, //
    0x0002, //
    0x0002, //
    0x0002, //
    0x0002, //
    0x0002, //
    0x0002, //
    0x0002, //
    0x2003, //
    0x2002, //
    0x2002, //
    0x2002, //
    0x2002, //
    0x0002, //
    0x0002, //
    0x0002, //
    0x0002, //
    0x0002, //
    0x0002, //
    0x0002, //
    0x0002, //
    0x0002, //
    0x0002, //
    0x0002, 0x0002, 0x0002,
    0x0002, //
    0x0002, //
    0x0002, //
    0x0002, //
    0x0002, //
    0x6001, //
    0xc004, //!
    0xc004, //"
    0xc004, // #
    0xc004, //$
    0xc004, //%
    0xc004, //&
    0xc004, //'
    0xc004, //(
    0xc004, //)
    0xc004, //*
    0xc004, //+
    0xc004, //,
    0xc004, //-
    0xc004, //.
    0xc004, ///
    0xd808, // 0
    0xd808, // 1
    0xd808, // 2
    0xd808, // 3
    0xd808, // 4
    0xd808, // 5
    0xd808, // 6
    0xd808, // 7
    0xd808, // 8
    0xd808, // 9
    0xc004, //:
    0xc004, //;
    0xc004, //<
    0xc004, //=
    0xc004, //>
    0xc004, //?
    0xc004, //@
    0xd508, // A
    0xd508, // B
    0xd508, // C
    0xd508, // D
    0xd508, // E
    0xd508, // F
    0xc508, // G
    0xc508, // H
    0xc508, // I
    0xc508, // J
    0xc508, // K
    0xc508, // L
    0xc508, // M
    0xc508, // N
    0xc508, // O
    0xc508, // P
    0xc508, // Q
    0xc508, // R
    0xc508, // S
    0xc508, // T
    0xc508, // U
    0xc508, // V
    0xc508, // W
    0xc508, // X
    0xc508, // Y
    0xc508, // Z
    0xc004, //[
    0xc004, //
    0xc004, //]
    0xc004, //^
    0xc004, //_
    0xc004, //`
    0xd608, // a
    0xd608, // b
    0xd608, // c
    0xd608, // d
    0xd608, // e
    0xd608, // f
    0xc608, // g
    0xc608, // h
    0xc608, // i
    0xc608, // j
    0xc608, // k
    0xc608, // l
    0xc608, // m
    0xc608, // n
    0xc608, // o
    0xc608, // p
    0xc608, // q
    0xc608, // r
    0xc608, // s
    0xc608, // t
    0xc608, // u
    0xc608, // v
    0xc608, // w
    0xc608, // x
    0xc608, // y
    0xc608, // z
    0xc004, //{
    0xc004, //|
    0xc004, //}
    0xc004, //~
    0x0002, //
    0x0000, // €
    0x0000, //
    0x0000, // ‚
    0x0000, // ƒ
    0x0000, // „
    0x0000, // …
    0x0000, // †
    0x0000, // ‡
    0x0000, // ˆ
    0x0000, // ‰
    0x0000, // Š
    0x0000, // ‹
    0x0000, // Œ
    0x0000, //
    0x0000, // Ž
    0x0000, //
    0x0000, //
    0x0000, // ‘
    0x0000, // ’
    0x0000, // “
    0x0000, // ”
    0x0000, // •
    0x0000, // –
    0x0000, // —
    0x0000, // ˜
    0x0000, // ™
    0x0000, // š
    0x0000, // ›
    0x0000, // œ
    0x0000, //
    0x0000, // ž
    0x0000, // Ÿ
    0x0000, //
    0x0000, // ¡
    0x0000, // ¢
    0x0000, // £
    0x0000, // ¤
    0x0000, // ¥
    0x0000, // ¦
    0x0000, // §
    0x0000, // ¨
    0x0000, // ©
    0x0000, // ª
    0x0000, // «
    0x0000, // ¬
    0x0000, // ­
    0x0000, // ®
    0x0000, // ¯
    0x0000, // °
    0x0000, // ±
    0x0000, // ²
    0x0000, // ³
    0x0000, // ´
    0x0000, // µ
    0x0000, // ¶
    0x0000, // ·
    0x0000, // ¸
    0x0000, // ¹
    0x0000, // º
    0x0000, // »
    0x0000, // ¼
    0x0000, // ½
    0x0000, // ¾
    0x0000, // ¿
    0x0000, // À
    0x0000, // Á
    0x0000, // Â
    0x0000, // Ã
    0x0000, // Ä
    0x0000, // Å
    0x0000, // Æ
    0x0000, // Ç
    0x0000, // È
    0x0000, // É
    0x0000, // Ê
    0x0000, // Ë
    0x0000, // Ì
    0x0000, // Í
    0x0000, // Î
    0x0000, // Ï
    0x0000, // Ð
    0x0000, // Ñ
    0x0000, // Ò
    0x0000, // Ó
    0x0000, // Ô
    0x0000, // Õ
    0x0000, // Ö
    0x0000, // ×
    0x0000, // Ø
    0x0000, // Ù
    0x0000, // Ú
    0x0000, // Û
    0x0000, // Ü
    0x0000, // Ý
    0x0000, // Þ
    0x0000, // ß
    0x0000, // à
    0x0000, // á
    0x0000, // â
    0x0000, // ã
    0x0000, // ä
    0x0000, // å
    0x0000, // æ
    0x0000, // ç
    0x0000, // è
    0x0000, // é
    0x0000, // ê
    0x0000, // ë
    0x0000, // ì
    0x0000, // í
    0x0000, // î
    0x0000, // ï
    0x0000, // ð
    0x0000, // ñ
    0x0000, // ò
    0x0000, // ó
    0x0000, // ô
    0x0000, // õ
    0x0000, // ö
    0x0000, // ÷
    0x0000, // ø
    0x0000, // ù
    0x0000, // ú
    0x0000, // û
    0x0000, // ü
    0x0000, // ý
    0x0000, // þ
    0x0000, // ÿ
    0x0020, //
    0x0000, //
    0x0000, //
    0x0000, //
    0x0000, //
    0x0000, //
    0x0028, //
    0x0000, //
    0x0043, //
    0x0000, //
    0x0029, //
    0x0000, //
    0x0000, //
    0x0000,
    0x0000, //
    0x0000, //
    0x003c, //
    0x0000, //
    0x003c, //
    0x0000, //
    0x0000, //
    0x0000, //
    0x0000, //
    0x0000, 0x002d, 0x0000, 0x0000, 0x0000, 0x0000,
    0x0000, //
    0x0028, //
    0x0000, //
    0x0052, //
    0x0000, //!
    0x0029, //"
    0x0000, // #
    0x0000, //$
    0x0000, //%
    0x0000, //&
    0x0000, //'
    0x0075, //(
    0x0000, //)
    0x0000, //*
    0x0000, //+
    0x0000, //,
    0x0000, //-
    0x002c, //.
    0x0000, ///
    0x0000, // 0
    0x0000, // 1
    0x0000, // 2
    0x0000, // 3
    0x003e, // 4
    0x0000, // 5
    0x003e, // 6
    0x0000, // 7
    0x0000, // 8
    0x0000, // 9
    0x0000, //:
    0x0000, //;
    0x0020, //<
    0x0000, //=
    0x0031, //>
    0x0000, //?
    0x002f, //@
    0x0000, // A
    0x0034, // B
    0x0000, // C
    0x0020, // D
    0x0000, // E
    0x0000, // F
    0x0000, // G
    0x0000, // H
    0x0000, // I
    0x0020, // J
    0x0000, // K
    0x0031, // L
    0x0000, // M
    0x002f, // N
    0x0000, // O
    0x0032, // P
    0x0000, // Q
    0x0020, // R
    0x0000, // S
    0x0000, // T
    0x0000, // U
    0x0000, // V
    0x0000, // W
    0x0020, // X
    0x0000, // Y
    0x0033, // Z
    0x0000, //[
    0x002f,
    0x0000, //]
    0x0034, //^
    0x0000, //_
    0x0020, //`
    0x0000, // a
    0x0000, // b
    0x0000, // c
    0x0000, // d
    0x0000, // e
    0x0041, // f
    0x0000, // g
    0x0045, // h
    0x0000, // i
    0x0000, // j
    0x0000, // k
    0x0000, // l
    0x0000, // m
    0x0078, // n
    0x0000, // o
    0x0000, // p
    0x0000, // q
    0x0000, // r
    0x0000, // s
    0x0073, // t
    0x0000, // u
    0x0073, // v
    0x0000, // w
    0x0000, // x
    0x0000, // y
    0x0000, // z
    0x0000, //{
    0x0061, //|
    0x0000, //}
    0x0065, //~
    0x0000  //
};

const unsigned short **__ctype_b_loc(void) {
  static const unsigned short *ptr = b_loc_table;
  return &ptr;
}

After fixing this, import numpy works, and that’s as far as I took things.
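One detail the replacement tables above skip: glibc's tables have 384 entries because callers may index them from -128 to 255, so the returned pointer sits 128 entries into the array. This is a sketch of an ASCII-only table with that layout. `ascii_tolower_table` is a hypothetical name, and mapping every non-letter (including negative indices) to itself is a simplification, not a copy of glibc's data.

```cpp
#include <array>
#include <cstdint>

// Returns a pointer that may be indexed with any value in [-128, 255].
const std::int32_t *ascii_tolower_table() {
  static const std::array<std::int32_t, 384> table = [] {
    std::array<std::int32_t, 384> t{};
    for (int c = -128; c < 256; ++c) {
      // Slot 0 of the array corresponds to index -128.
      t[c + 128] = (c >= 'A' && c <= 'Z') ? (c | 32) : c;
    }
    return t;
  }();
  return table.data() + 128;
}
```

A `__ctype_tolower_loc` replacement could keep this pointer in a static and return its address, in the same way the versions above do.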

Some notes on debugging modules with dlmopen

By default, lldb can’t read the frames of stack traces from modules loaded via dlmopen. You’ll get a stack trace that looks something like this:

    frame #106: 0x0000ffffeddbb1f4
    frame #107: 0x0000ffffeddbb610
    frame #108: 0x0000ffffeddbb9ec
    frame #109: 0x0000ffffedd7eff0
    frame #110: 0x0000ffffedd7f388
    frame #111: 0x0000ffffedd80568
    frame #112: 0x0000fffff7e666e4 pybind_test.unit.so`pybind_test::pybind_test(unit::pybind_test::Args const&, std::optional<std::basic_string_view<char, std::char_traits<char>>> const&)::$_0::operator()(this=0x0000aaaaaab9a878) const at pybind_test.cpp:86:5
    frame #113: 0x0000fffff7e6667c pybind_test.unit.so`void std::__invoke_impl<void, pybind_test::pybind_test(unit::pybind_test::Args const&, std::optional<std::basic_string_view<char, std::char_traits<char>>> const&)::$_0>((null)=__invoke_other @ 0x0000ffffed9ff82f, __f=0x0000aaaaaab9a878) at invoke.h:61:14

The way to fix this is:

  1. Get at the base address in /proc/<pid>/maps for the module in question. The first number appears to be the start of the address space for that module.
  2. Load the module via target modules add <so_path>
  3. Associate the symbols with the addresses via target modules load --file "<so_path>" --slide 0x<base_address>

Notably, you can only do this after module load, as the base address will be different every time.

After doing this, running bt will get you a slightly more sane stack trace.10

    frame #141: 0x0000ffffeddbb1f4 libpython3.8.so.1.0`_PyEval_EvalCodeWithName(_co=0x0000fffff4bdb920, globals=<unavailable>, locals=<unavailable>, args=0x0000000000000000, argcount=0, kwnames=0x0000000000000000, kwargs=0x0000000000000000, kwcount=0, kwstep=2, defs=0x0000000000000000, defcount=0, kwdefs=0x0000000000000000, closure=0x0000000000000000, name=0x0000000000000000, qualname=0x0000000000000000) at ceval.c:4298:14
    frame #142: 0x0000ffffeddbb610 libpython3.8.so.1.0`PyEval_EvalCodeEx(_co=<unavailable>, globals=<unavailable>, locals=<unavailable>, args=0x0000000000000000, argcount=0, kws=0x0000000000000000, kwcount=0, defs=0x0000000000000000, defcount=0, kwdefs=0x0000000000000000, closure=0x0000000000000000) at ceval.c:4327:12
    frame #143: 0x0000ffffeddbb9ec libpython3.8.so.1.0`PyEval_EvalCode(co=<unavailable>, globals=<unavailable>, locals=<unavailable>) at ceval.c:718:12
    frame #144: 0x0000ffffedd7eff0 libpython3.8.so.1.0`run_mod [inlined] run_eval_code_obj(locals=0x0000fffff54ce680, globals=0x0000fffff54ce680, co=0x0000fffff4bdb920) at pythonrun.c:1166:9
    frame #145: 0x0000ffffedd7efb4 libpython3.8.so.1.0`run_mod(mod=0x0000ffffe80015c0, filename=0x0000fffff5447af0, globals=0x0000fffff54ce680, locals=0x0000fffff54ce680, flags=0x0000000000000000, arena=0x0000fffff5e8c7b0) at pythonrun.c:1188:9
    frame #146: 0x0000ffffedd7f388 libpython3.8.so.1.0`PyRun_StringFlags(str="\nimport threading\nimport time\n\nimport numpy as np\n\nnp.array(50)\n\ndef do():\n  while True:\n      time.sleep(1)\n      print('This is a callback from a Python thread.')\n      \nt = threading.Thread(target = do)\nt.daemon = True\nt.start()\n", start=257, globals=0x0000fffff54ce680, locals=0x0000fffff54ce680, flags=0x0000000000000000) at pythonrun.c:1061:15
    frame #147: 0x0000ffffedd80568 libpython3.8.so.1.0`PyRun_SimpleStringFlags(command="\nimport threading\nimport time\n\nimport numpy as np\n\nnp.array(50)\n\ndef do():\n  while True:\n      time.sleep(1)\n      print('This is a callback from a Python thread.')\n      \nt = threading.Thread(target = do)\nt.daemon = True\nt.start()\n", flags=0x0000000000000000) at pythonrun.c:486:9
    frame #148: 0x0000fffff7e666e4 pybind_test.unit.so`pybind_test::pybind_test(unit::pybind_test::Args const&, std::optional<std::basic_string_view<char, std::char_traits<char>>> const&)::$_0::operator()(this=0x0000aaaaaab9a878) const at pybind_test.cpp:86:5
    frame #149: 0x0000fffff7e6667c pybind_test.unit.so`void std::__invoke_impl<void, pybind_test::pybind_test(unit::pybind_test::Args const&, std::optional<std::basic_string_view<char, std::char_traits<char>>> const&)::$_0>((null)=__invoke_other @ 0x0000ffffed9ff82f, __f=0x0000aaaaaab9a878) at invoke.h:61:14

If you are in the same situation, here’s a script written by ChatGPT to do this. Use via command script import load_all_slid_modules.py and load_all_slid_modules. Notably, the script doesn’t work well across restarts: it tries to avoid reloading modules that it already saw, but this means things don’t work when the modules are reset. Pardon the excess emoji, I left them in so it’s obvious that I didn’t write it.

load_all_slid_modules.py
import lldb
import os
import re

def __lldb_init_module(debugger, internal_dict):
    debugger.HandleCommand('command script add -f load_all_slid_modules.load_all_slid_modules load_all_slid_modules')
    print("✅ LLDB command 'load_all_slid_modules' installed")

def load_all_slid_modules(debugger, command, result, internal_dict=None):
    target = debugger.GetSelectedTarget()
    process = target.GetProcess()
    pid = process.GetProcessID()

    try:
        with open(f"/proc/{pid}/maps", "r") as f:
            maps = f.readlines()
    except Exception as e:
        result.PutCString(f"❌ Failed to read /proc/{pid}/maps: {e}")
        return

    seen = set()
    added = 0

    for line in maps:
        if "r-xp" not in line:
            continue

        m = re.match(r"([0-9a-f]+)-[0-9a-f]+ r-xp .*? (/.+\.so(?:\.\d+)*)(?: |$)", line)
        if not m:
            continue

        base_str, path = m.groups()
        if not os.path.isfile(path):
            continue

        base_addr = int(base_str, 16)
        real_path = os.path.realpath(path)

        key = (real_path, base_addr)
        if key in seen:
            continue
        seen.add(key)

        cmd = f'target modules add "{real_path}"'
        debugger.HandleCommand(cmd)

        cmd = f'target modules load --file "{real_path}" --slide 0x{base_addr:x}'
        debugger.HandleCommand(cmd)
        result.PutCString(f"✅ Added: {real_path} at 0x{base_addr:x}")
        added += 1

    if added == 0:
        result.PutCString("🟢 No new slid modules found to load.")
    else:
        result.PutCString(f"🎯 Loaded {added} new module(s).")

Hopefully either (A) this post gets enough reach that searches for “lldb dlmopen no symbols” point to this script or (B) LLVM adds support for doing this automatically, and others don’t have to be as confused as I was.

I have no idea how to do this in gdb but they apparently silently fixed it a few years ago, in gdb 13.1. For posterity, I’m running lldb 18.1.8 which is fairly modern.

Safely calling python code from other threads

Two steps are needed.

  1. In the main thread, call PyThreadState *main_tstate = PyEval_SaveThread(); once.
  2. Any time after that, wrap your Python API using code via
    PyGILState_STATE g = PyGILState_Ensure();
    PyRun_SimpleString("print('Hello, world.')");
    PyGILState_Release(g);
    

    That’s it. Properly tearing things down, and knowing when PyEval_RestoreThread is needed, is left as an exercise for the reader.
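One way to make the Ensure/Release pairing harder to get wrong is a small scope guard. The following is only a sketch: ScopedGil, fake_ensure, fake_release and with_fake_gil are hypothetical names, and the fake functions stand in for the real API so the example builds without Python headers. With a real interpreter, State would be PyGILState_STATE and the two pointers would be the PyGILState_Ensure and PyGILState_Release pointers pulled off the handle.

```cpp
#include <cassert>

// Generic scope guard: calls ensure() on construction and release(state) on
// destruction. With the real API, State would be PyGILState_STATE and the
// pointers would be the shimmed PyGILState_Ensure / PyGILState_Release.
template <typename State>
class ScopedGil {
public:
  using EnsureFn = State (*)();
  using ReleaseFn = void (*)(State);

  ScopedGil(EnsureFn ensure, ReleaseFn release)
      : release_(release), state_(ensure()) {}
  ~ScopedGil() { release_(state_); }

  // Non-copyable: exactly one release per ensure.
  ScopedGil(const ScopedGil &) = delete;
  ScopedGil &operator=(const ScopedGil &) = delete;

private:
  ReleaseFn release_;
  State state_;
};

// Hypothetical stand-ins (NOT the Python API). They only count how many
// guards are currently alive, and are not thread safe.
static int g_fake_gil_depth = 0;
static int fake_ensure() { return ++g_fake_gil_depth; }
static void fake_release(int) { --g_fake_gil_depth; }

// Runs `work` while the guard is alive; the release happens on scope exit,
// including when `work` throws.
template <typename Work>
auto with_fake_gil(Work &&work) {
  ScopedGil<int> guard(&fake_ensure, &fake_release);
  return work();
}
```

The guard takes plain function pointers rather than calling the API by name, which fits the case where every Python symbol is looked up on a specific handle.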

Automagically shimming symbols from another module

Use X macros. X macros are my favorite dumb C trick whenever I have to do different things on the same list of symbols multiple times (like defining+declaring a bunch of variables, making enum string conversions, etc, etc).

In this case, I first defined a list of symbols like so

#define X_PYTHON_API \
    X_PY(Py_InitializeEx) \
    X_PY(Py_Finalize) \
    X_PY(Py_IsInitialized) \
    X_PY(PyRun_SimpleStringFlags) \
    ...

then in the header, declared a bunch of function pointers

class pybind_test : public unit::pybind_test::Base {
public:
  ...
private:
  // for each symbol, create a function pointer of the same type as in the Python API
  #define X_PY(f) decltype(::f)* f = nullptr;
  X_PYTHON_API
  #undef X_PY
};

and in the constructor, initialized them

pybind_test::pybind_test(const Args &args,
                         const std::optional<std::string_view> &name_override)
    : unit::pybind_test::Base(args, name_override), pub(args.pub) {
  void *pyhandle = dlmopen(LM_ID_NEWLM, "libpython_shim.so",
                           RTLD_LAZY | RTLD_LOCAL | RTLD_DEEPBIND);
  ...
// For each symbol, find it on the shared object, then cast it to the proper type
// and store on the object
#define X_PY(f)                                                                \
  this->f = reinterpret_cast<decltype(this->f)>(dlsym(pyhandle, #f));          \
  if (!this->f) {                                                              \
    const char *error = dlerror();                                             \
    BASIS_LOG_FATAL("error while getting '" #f "': {}", error);                \
  }
  X_PYTHON_API
#undef X_PY
}

Now a simple call to Py_InitializeEx(0) will resolve to the member function pointer, transparently. If you forget to add a symbol to the list, you’ll get a linker error as long as you didn’t link Python. Don’t link Python in this case, we never ever want to confuse things and run the Python API on something other than our handles. If I were to put the shim dll into production I’d do the same thing with the forwarded symbols.
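For anyone who hasn't run into X macros before, here is a minimal standalone sketch of the same trick applied to the enum/string case mentioned above. The names (X_LOG_LEVELS, LogLevel, to_string, kLogLevelCount) are made up for illustration and have nothing to do with Basis's actual logging.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical list: one X(...) entry per item.
#define X_LOG_LEVELS \
  X(Debug) \
  X(Info) \
  X(Warn) \
  X(Error)

// Expansion 1: the enum itself.
enum class LogLevel {
#define X(name) name,
  X_LOG_LEVELS
#undef X
};

// Expansion 2: enum -> string, generated from the same list.
inline const char *to_string(LogLevel level) {
  switch (level) {
#define X(name) \
  case LogLevel::name: \
    return #name;
    X_LOG_LEVELS
#undef X
  }
  return "unknown";
}

// Expansion 3: count the entries (expands to 0 +1 +1 +1 +1).
constexpr int kLogLevelCount = 0
#define X(name) +1
    X_LOG_LEVELS
#undef X
    ;
```

Adding a fifth level means touching only the list; the enum, the string conversion and the count all follow along.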

What to do about pybind.

I wouldn’t use pybind for this, it’s too hard. Sticking more about this under another expandable section as I think most people won’t care.

I figured that I could “just” extract the symbols I needed from pybind, and redirect them to a wrapper that delegated to the proper python handle. I even had some fun assembly code to do it. This didn’t work for three reasons:

  1. I’m unsure if I could properly associate a pybind call with the python handle it needed. I think I could get clever with thread local variables and make it work, but I’m worried about other threads that I don’t create. I don’t want to have to wrap all thread creation to “inherit” the python handle, at least at this point.
  2. Static variables are in use here. The clever redirection I had to delegate wouldn’t work.
  3. I wouldn’t be able to use any of the template helpers. I might still be able to get this to work with careful use of macros, but it would be safer just to fork pybind, which I’m not going to do.

What then instead?

For Basis, the API surface between a Unit and the framework surrounding it is really really small. As best as I can tell I need a few things for generic Python bindings:

  1. To be able to convert a byte buffer into a deserialized python message (needed to accept messages from python)
  2. To be able to convert a python message into a serialized byte buffer
  3. To be able to create an object to hold the inputs and outputs from each callback
  4. To be able to call each callback with all of the above
  5. Logging framework injection

Optional bonuses for later:

  1. Define a base class for python Units to inherit from with the API for your Unit
    • This is not strictly necessary, but will help with tooling
  2. Be able to query python objects directly for members or run small lambdas on them to get at the members for synchronization
    • Mainly a performance win, so that we can avoid extra deserializations. For first pass, we can deserialize incoming messages first as cpp (to be able to synchronize on a member variable or similar), and then as py (to pass to the Unit).

None of this requires any crazy features. Some of the helpers pybind provides are really nice, but in this case it’s not needed.

TLDR

  1. Use dlmopen per python interpreter you want to run
  2. Inject a few pthread and string/locale related functions
  3. It works, at least until you find some other thread local storage related crash

Well, what now?

Overall, I would not recommend doing this yourself. From what I can tell, once you fix thread local variable creation and locale related code, things look fine, but when things go wrong it’s a royal pain to debug. I’m not sure of the extent to which things could break here. Past the initial experiment I mainly did this because I thought it could be done (also, because some of the stuff I learned would be useful to others using dlmopen for the first time). There are no advantages here other than slightly better isolation between modules and the ability to work on older Python versions.

It also looks like glibc only supports 16 namespaces, so you’d be limited to 15 or so Python interpreters with this method, though girfan claims you can patch it for more.

I’m going to clean up the code for this and post it as a Draft PR, for posterity. PR is up here. For the real implementation of this I’ll probably just use Python3.12 subinterpreters, which can each have their own GIL. rospy messages load just fine in newer versions of Python, with a little bit of path trickery, so I really shouldn’t need to use 3.8. Maybe I’ll actually get around to writing the bindings I was trying to work on, rather than getting sidetracked for a week on something nobody asked for.

As I said at the top, I’ll probably put out more blog posts, later. The next one will probably be on the actual Python bindings I’m going to write.11


  1. won’t post who here just as policy of “my viewpoints are my own and not associated with my employer, but go check my LinkedIn if you want to know”12 

  2. no, seriously. the guys at the music store barely believe it. but a “free” $2k amp has turned into a $300 tune up for the amp (it deserved it even if I sold it), a guitar, a speaker to hook it up to, etc. most expensive free thing I’ve ever gotten, beyond my cats. 

  3. somehow everyone I’ve spoken to about Basis is surprised about this - I should have made it more obvious. it’s nowhere in the docs?! 

  4. even this is a little suspect, the py3.8 docs have a bunch of caveats for exactly what state is new. 

  5. okay, full disclosure, other people have called me a Linux expert, what do I know? 

  6. the man docs refer to link-map lists and namespaces, but this is basically the gist of things. you can also query an existing handle for its id to load another shared object into it, but it has weird effects around sharing symbols. 

  7. as I write this, I can actually think of a few ways to solve this, but it’s not worth my time right now. 

  8. you don’t have to understand what’s broken in this case to get a gut feeling that it won’t work right 

  9. i tried to have an isolated shim and load python separately - it didn’t work, I got missing symbol errors later on down the line when python itself used dlopen on a python module. dlmopen explicitly doesn’t support RTLD_GLOBAL, which i feel is a bug - it should instead allow symbol visibility to other modules loaded in the same namespace. 

  10. if you count cpp stack traces that wrap multiple times sane, and if you count 150 frame deep stack traces sane 

  11. also - i discovered how to use footnotes for this post. hopefully i didn’t go too overboard. 

  12. actually they might want the association, who knows, will let them speak up if they do :)