Thinking about client side parallelism: split read and write side? - Re: [PATCH] Add PRIV_TOPREQ for per "top request" / req->top state

Wed Mar 11 12:30:42 CET 2015

On 11/03/15 09:02, Poul-Henning Kamp wrote:
> so I think it is a worthwhile exercise to think about it.

And in fact I did ponder this over night and now I think that a mutex on
vrt_privs was a stupid suggestion. In light of H2 and parallel ESI, there is a
fundamental design decision to make: Do we need multiple threads working on any
single http session (file descriptor)?

My current answer would be: yes and no, so here's a brain dump.

* For anything involving VCL and vmods, I think multiple threads would be a bad
  idea. They'd lead to a nightmare of synchronization issues on shared data
  structures, as with PRIV_TOPREQ and ESI.

* But we do want to have parallelism for reads and writes on the client file
  descriptor for H2. We don't want to delay read processing just because there
  is data to be written.

For H2, phk mentioned the idea of individual threads for anything from reading a
request to delivering the response, and having a lightweight thread watch the
connection in between requests, handling any H2 stuff except for requests.

Over the apple-crumble, I had mentioned the idea of using a single thread on
the client side to handle many requests.

The major problem with this idea is that a single thread looking after the
client file descriptor could block on writes, and for H2 we want to look after
events on the read side ASAP.

Nonblocking IO is another nightmare, but offloading _writes_ to a single
per-session thread would be an option.

I'd see the read thread maintaining a work-list for the write thread, like
- stream 0x1 send headers of obj 0x1
- stream 0x1 start data of obj 0x1 bytes x-y window w
- stream 0x0 send h2 control stuff
- stream 0x1 update window +w bytes

As the read thread maintains the work list, it could push urgent replies (like
H2 SETTINGS or PING ACK) to the front of the list or even reorder it. The read
thread would need to be notified of completion of some work requests so it can
free the resources it owns.
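
A minimal sketch of such a work list, just to make the idea concrete. All names
here (wr_item, wr_list, wr_push_front etc.) are made up for illustration and are
not existing Varnish code. Locking between the read and write thread and the
completion notification are left out on purpose.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical work item kinds, mirroring the example list above */
enum wr_kind { WR_HEADERS, WR_DATA, WR_CONTROL, WR_WINDOW_UPDATE };

struct wr_item {
	enum wr_kind	kind;
	uint32_t	stream;		/* H2 stream id, 0 = connection level */
	struct wr_item	*prev, *next;
};

struct wr_list {
	struct wr_item	*head, *tail;
};

void
wr_list_init(struct wr_list *wl)
{
	wl->head = wl->tail = NULL;
}

/* Read thread, normal work: append at the tail */
void
wr_push_back(struct wr_list *wl, struct wr_item *it)
{
	it->next = NULL;
	it->prev = wl->tail;
	if (wl->tail != NULL)
		wl->tail->next = it;
	else
		wl->head = it;
	wl->tail = it;
}

/* Read thread, urgent work (e.g. PING ACK): insert at the head */
void
wr_push_front(struct wr_list *wl, struct wr_item *it)
{
	it->prev = NULL;
	it->next = wl->head;
	if (wl->head != NULL)
		wl->head->prev = it;
	else
		wl->tail = it;
	wl->head = it;
}

/* Write thread: take the next item, NULL if the list is empty */
struct wr_item *
wr_pop_front(struct wr_list *wl)
{
	struct wr_item *it = wl->head;

	if (it == NULL)
		return (NULL);
	wl->head = it->next;
	if (wl->head != NULL)
		wl->head->prev = NULL;
	else
		wl->tail = NULL;
	it->prev = it->next = NULL;
	return (it);
}
```

A doubly linked list keeps both ends cheap and would also allow reordering in
the middle; in a real implementation the list would need a mutex or a lockless
handover, since two threads touch it.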

The read thread would be designed to avoid blocking on IO other than reads from
the client, but it would do all other processing for many request objects in a
serial manner. The reasoning behind this is that anything from cache lookups to
VCL calls should, within the orders of magnitude of time relevant for request
processing, be fast enough to be done in a single thread.

The read thread would maintain the state for many request objects and react to
events from the client or backend threads.

To ensure that the read thread never blocks for any relevant time frame, VMOD
authors would be advised to do any blocking operations only on the backend side.
For core code, I see the following changes being required:

- not wait for fetch threads at all, just note the state we wait for and
  re-check
- replace the waitinglist logic with VBO logic

At the top of the read thread, we'd need the semantics of "wait for any read
event OR VBO event to happen".

* If we don't have a VBO event to wait for, we can (after a timeout) hand off
  to a waiter as we do now (freeing the thread, but keeping the session state
  and all other linked objects), but this will, for H2, be a less common case.

* For the case that we do wait for a VBO event, I'd see an extension of
  the existing waiter interface: have the backend thread signal an event on
  some interface that can be waited for together with the client fd (pipe /
  kqueue event / PORT_SOURCE_USER etc.).
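
To illustrate the "read event OR VBO event" wait, here is a rough sketch using
the pipe variant with plain poll(2). The function names and the EV_* flags are
invented for this example; a real waiter would more likely use kqueue, epoll or
event ports and not poll.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical result flags for the read thread's main wait */
#define EV_TIMEOUT	0
#define EV_CLIENT	(1 << 0)
#define EV_VBO		(1 << 1)

/* Backend thread: signal one VBO event by writing a byte to the pipe */
int
vbo_notify(int wfd)
{
	char c = 1;

	return (write(wfd, &c, 1) == 1 ? 0 : -1);
}

/*
 * Read thread: block until the client fd is readable, a VBO event was
 * signalled, or the timeout expired.
 * Returns a mask of EV_*, EV_TIMEOUT on timeout, -1 on error.
 */
int
wait_client_or_vbo(int client_fd, int vbo_rfd, int timeout_ms)
{
	struct pollfd pfd[2];
	int r, ev = 0;
	char c;

	pfd[0].fd = client_fd;
	pfd[0].events = POLLIN;
	pfd[0].revents = 0;
	pfd[1].fd = vbo_rfd;
	pfd[1].events = POLLIN;
	pfd[1].revents = 0;

	do {
		r = poll(pfd, 2, timeout_ms);
	} while (r < 0 && errno == EINTR);
	if (r < 0)
		return (-1);
	if (r == 0)
		return (EV_TIMEOUT);

	if (pfd[0].revents & (POLLIN | POLLHUP | POLLERR))
		ev |= EV_CLIENT;
	/* Consume exactly one notification byte per reported VBO event */
	if ((pfd[1].revents & POLLIN) && read(vbo_rfd, &c, 1) == 1)
		ev |= EV_VBO;
	return (ev);
}
```

On EV_TIMEOUT with no VBO event outstanding, the session could be handed off
to a waiter as described in the first bullet.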

Whenever there is work to be done the read thread would
- call into the delivery process for VBO events (including vcl_deliver) and
  finally hand off work requests to the write thread
- do the usual request processing for anything up to a miss/pass, which
  is then handled async as described above

Regarding data structures, I'd see the following:

* move most of the sess FSM into req

* session
  - simple FSM for connection handling
  - protocol-specific setup including calls into VCL
    - H2 setup / param negotiation
    - PROXY setup
  - array of reqs to be handled

So a single thread would operate on a set of struct reqs. Each req would have
its state and the thread would always follow the existing code path until a miss
or pass.
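
As a toy model of what I mean, assuming a heavily simplified per-req FSM: the
step names, the is_hit flag and the fixed-size array are placeholders I made up
for this sketch and do not correspond to the real struct sess / struct req.

```c
#include <assert.h>

/* Hypothetical, heavily simplified request steps */
enum req_step { R_RECV, R_LOOKUP, R_WAIT_VBO, R_DELIVER, R_DONE };

struct req {
	enum req_step	step;
	int		is_hit;	/* stand-in for the cache lookup result */
};

#define SESS_MAXREQ	8

struct sess {
	struct req	reqs[SESS_MAXREQ];	/* reqs to be handled */
	unsigned	nreq;
};

/*
 * Advance one req along its code path until it either completes or
 * has to wait for the backend (miss/pass).
 * Returns 1 if the req is parked waiting for a VBO event, 0 if done.
 */
unsigned
req_run(struct req *r)
{
	for (;;) {
		switch (r->step) {
		case R_RECV:
			r->step = R_LOOKUP;
			break;
		case R_LOOKUP:
			r->step = r->is_hit ? R_DELIVER : R_WAIT_VBO;
			break;
		case R_WAIT_VBO:
			return (1);	/* never block, just note the state */
		case R_DELIVER:
			r->step = R_DONE;
			break;
		case R_DONE:
			return (0);
		}
	}
}

/* Called by the read thread when the backend signalled a VBO event */
void
req_vbo_event(struct req *r)
{
	assert(r->step == R_WAIT_VBO);
	r->step = R_DELIVER;
}

/* One pass of the single thread over all reqs; returns number parked */
unsigned
sess_run(struct sess *sp)
{
	unsigned u, parked = 0;

	assert(sp->nreq <= SESS_MAXREQ);
	for (u = 0; u < sp->nreq; u++)
		parked += req_run(&sp->reqs[u]);
	return (parked);
}
```

The point is only that a miss or pass turns into a state on the req instead of
a blocked thread, so the same thread can carry on with the next req.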

ESI processing: we look at an object in the deliver process, either because we
had a cache hit or a VBO event. We parse the body, create struct reqs and
process each as we would for any other incoming req, but create an ordered list
of sub-requests so we know the linkage. As long as we don't hit miss or pass, we
issue requests to the write thread. For miss or pass, we kick off the backend
thread and continue parsing what we have. Yes, this is complex.
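
One possible reading of the "ordered list of sub-requests", sketched under the
assumption that fragments must reach the client in document order: only the
contiguous prefix of ready fragments is handed to the write thread, anything
behind a pending miss/pass waits. struct esi_frag and esi_flush_ready are
hypothetical names for this example only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ESI fragment / sub-request in document order */
struct esi_frag {
	int		ready;	/* hit, or VBO event already seen */
	int		sent;	/* already handed to the write thread */
	struct esi_frag	*next;
};

/*
 * Hand all fragments to the write thread that are ready AND not
 * preceded by a fragment still waiting for the backend.
 * Returns the number of fragments newly handed off.
 */
unsigned
esi_flush_ready(struct esi_frag *head)
{
	struct esi_frag *f;
	unsigned n = 0;

	for (f = head; f != NULL; f = f->next) {
		if (f->sent)
			continue;
		if (!f->ready)
			break;		/* keep output in order */
		f->sent = 1;		/* would enqueue a work request here */
		n++;
	}
	return (n);
}
```

This would be called again from the read thread whenever a VBO event marks
another fragment as ready.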

How to get ahead:

== http/1.1 pipelining as a PoC ==

To check the validity of the concept, we could start with implementing
pipelining. Seen from the perspective of this concept, pipelining is mainly
lightweight H2 without interleaved writes. But handling a set of req objects and
the event handling could already be implemented.

== parallel ESI ==

Once we can look after several req objects asynchronously, parallel ESI should
be possible. I honestly think this will be the hardest part, even harder than
the rest of H2.

== H2 ==

Once we have parallel ESI, we should have everything in place to do fully
parallel H2 processing with just two client threads - so we'd only have one
thread more than we have now (one client thread + backend threads when
required).

Nils