vlib.md revision 340c15c6

VLIB (Vector Processing Library)
================================

The files associated with vlib are located in the ./src/{vlib,
vlibapi, vlibmemory} folders. These libraries provide vector
processing support including graph-node scheduling, reliable multicast
support, ultra-lightweight cooperative multi-tasking threads, a CLI,
plug-in .DLL support, physical memory, and Linux epoll support. Parts of
this library embody US Patent 7,961,636.

Init function discovery
-----------------------

vlib applications register for various \[initialization\] events by
placing structures and \_\_attribute\_\_((constructor)) functions into
the image. At appropriate times, the vlib framework walks
constructor-generated singly-linked structure lists, performs a
topological sort based on specified constraints, and calls the
indicated functions. Vlib applications create graph nodes, add CLI
functions, start cooperative multi-tasking threads, etc. using
this mechanism.

vlib applications invariably include a number of VLIB\_INIT\_FUNCTION
(my\_init\_function) macros.

Each init / configure / etc. function has the return type clib\_error\_t
\*. Make sure that the function returns 0 if all is well; otherwise the
framework will announce an error and exit.

vlib applications must link against vppinfra, and often link against
other libraries such as VNET. In the latter case, it may be necessary to
explicitly reference symbol(s), otherwise large portions of the library
may be AWOL at runtime.

### Init function construction and constraint specification

It's easy to add an init function:

```
   static clib_error_t *my_init_function (vlib_main_t *vm)
   {
      /* ... initialize things ... */

      return 0; // or return clib_error_return (0, "BROKEN!");
   }
   VLIB_INIT_FUNCTION(my_init_function);
```

As given, my_init_function will be executed "at some point," but with
no ordering guarantees.

Specifying ordering constraints is easy:

```
   VLIB_INIT_FUNCTION(my_init_function) =
   {
      .runs_before = VLIB_INITS("we_run_before_function_1",
                                "we_run_before_function_2"),
      .runs_after = VLIB_INITS("we_run_after_function_1",
                               "we_run_after_function_2"),
    };
```

It's also easy to specify bulk ordering constraints of the form "a
then b then c then d":

```
   VLIB_INIT_FUNCTION(my_init_function) =
   {
      .init_order = VLIB_INITS("a", "b", "c", "d"),
   };
```

It's OK to specify all three sorts of ordering constraints for a
single init function, although it's hard to imagine why it would be
necessary.


Node Graph Initialization
-------------------------

vlib packet-processing applications invariably define a set of graph
nodes to process packets.

One constructs a vlib\_node\_registration\_t, most often via the
VLIB\_REGISTER\_NODE macro. At runtime, the framework processes the set
of such registrations into a directed graph. It is easy enough to add
nodes to the graph at runtime. The framework does not support removing
nodes.

vlib provides several types of vector-processing graph nodes, primarily
to control framework dispatch behaviors. The type member of the
vlib\_node\_registration\_t functions as follows:

-   VLIB\_NODE\_TYPE\_PRE\_INPUT - run before all other node types
-   VLIB\_NODE\_TYPE\_INPUT - run as often as possible, after pre\_input
    nodes
-   VLIB\_NODE\_TYPE\_INTERNAL - only when explicitly made runnable by
    adding pending frames for processing
-   VLIB\_NODE\_TYPE\_PROCESS - only when explicitly made runnable.
    "Process" nodes are actually cooperative multi-tasking threads. They
    **must** explicitly suspend after a reasonably short period of time.

For a precise understanding of the graph node dispatcher, please read
./src/vlib/main.c:vlib\_main\_loop.

Graph node dispatcher
---------------------

Vlib\_main\_loop() dispatches graph nodes. The basic vector processing
algorithm is diabolically simple, but may not be obvious from even a
long stare at the code. Here's how it works: some input node, or set of
input nodes, produce a vector of work to process. The graph node
dispatcher pushes the work vector through the directed graph,
subdividing it as needed, until the original work vector has been
completely processed. At that point, the process recurs.

This scheme yields a stable equilibrium in frame size, by construction.
Here's why: as the frame size increases, the per-frame-element
processing time decreases. There are several related forces at work; the
simplest to describe is the effect of vector processing on the CPU L1
I-cache. The first frame element \[packet\] processed by a given node
warms up the node dispatch function in the L1 I-cache. All subsequent
frame elements profit. As we increase the number of frame elements, the
cost per element goes down.

Under light load, it is a crazy waste of CPU cycles to run the graph
node dispatcher flat-out. So, the graph node dispatcher arranges to wait
for work by sitting in a timed epoll wait if the prevailing frame size
is low. The scheme has a certain amount of hysteresis to avoid
constantly toggling back and forth between interrupt and polling mode.
Although the graph dispatcher supports interrupt and polling modes, our
current default device drivers do not.

The graph node scheduler uses a hierarchical timer wheel to reschedule
process nodes upon timer expiration.

Graph dispatcher internals
--------------------------

This section may be safely skipped. It's not necessary to understand
graph dispatcher internals to create graph nodes.

Vector Data Structure
---------------------

In vpp / vlib, we represent vectors as instances of the vlib_frame_t type:

```c
    typedef struct vlib_frame_t
    {
      /* Frame flags. */
      u16 flags;

      /* Number of scalar bytes in arguments. */
      u8 scalar_size;

      /* Number of bytes per vector argument. */
      u8 vector_size;

      /* Number of vector elements currently in frame. */
      u16 n_vectors;

      /* Scalar and vector arguments to next node. */
      u8 arguments[0];
    } vlib_frame_t;
```

Note that one _could_ construct all kinds of vectors - including
vectors with some associated scalar data - using this structure. In
the vpp application, vectors typically use a 4-byte vector element
size, and zero bytes' worth of associated per-frame scalar data.

Frames are always allocated on CLIB_CACHE_LINE_BYTES boundaries.
Frames have u32 indices which make use of the alignment property, so
the maximum feasible main heap offset of a frame is
CLIB_CACHE_LINE_BYTES * 0xFFFFFFFF: 64 bytes * 4G = 256 Gbytes.

Scheduling Vectors
------------------

As you can see, vectors are not directly associated with graph
nodes. We represent that association in a couple of ways.  The
simplest is the vlib\_pending\_frame\_t:

```c
    /* A frame pending dispatch by main loop. */
    typedef struct
    {
      /* Node and runtime for this frame. */
      u32 node_runtime_index;

      /* Frame index (in the heap). */
      u32 frame_index;

      /* Start of next frames for this node. */
      u32 next_frame_index;

      /* Special value for next_frame_index when there is no next frame. */
    #define VLIB_PENDING_FRAME_NO_NEXT_FRAME ((u32) ~0)
    } vlib_pending_frame_t;
```

Here is the code in .../src/vlib/main.c:vlib_main_or_worker_loop()
which processes frames:

```c
      /*
       * Input nodes may have added work to the pending vector.
       * Process pending vector until there is nothing left.
       * All pending vectors will be processed from input -> output.
       */
      for (i = 0; i < _vec_len (nm->pending_frames); i++)
	cpu_time_now = dispatch_pending_node (vm, i, cpu_time_now);
      /* Reset pending vector for next iteration. */
```

The pending frame node_runtime_index associates the frame with the
node which will process it.

Complications
-------------

Fasten your seatbelt. Here's where the story - and the data structures
\- become quite complicated...

At 100,000 feet: vpp uses a directed graph, not a directed _acyclic_
graph. It's really quite normal for a packet to visit ip\[46\]-lookup
multiple times. The worst-case: a graph node which enqueues packets to
itself.

To deal with this issue, the graph dispatcher must force allocation of
a new frame if the current graph node's dispatch function happens to
enqueue a packet back to itself.

There are no guarantees that a pending frame will be processed
immediately, which means that more packets may be added to the
underlying vlib_frame_t after it has been attached to a
vlib_pending_frame_t. Care must be taken to allocate new
frames and pending frames if a (pending\_frame, frame) pair fills.

Next frames, next frame ownership
---------------------------------

The vlib\_next\_frame\_t is the last key graph dispatcher data structure:

```c
    typedef struct
    {
      /* Frame index. */
      u32 frame_index;

      /* Node runtime for this next. */
      u32 node_runtime_index;

      /* Next frame flags. */
      u32 flags;

      /* Reflects node frame-used flag for this next. */
    #define VLIB_FRAME_NO_FREE_AFTER_DISPATCH \
      VLIB_NODE_FLAG_FRAME_NO_FREE_AFTER_DISPATCH

      /* This next frame owns enqueue to node
         corresponding to node_runtime_index. */
    #define VLIB_FRAME_OWNER (1 << 15)

      /* Set when frame has been allocated for this next. */
    #define VLIB_FRAME_IS_ALLOCATED	VLIB_NODE_FLAG_IS_OUTPUT

      /* Set when frame has been added to pending vector. */
    #define VLIB_FRAME_PENDING VLIB_NODE_FLAG_IS_DROP

      /* Set when frame is to be freed after dispatch. */
    #define VLIB_FRAME_FREE_AFTER_DISPATCH VLIB_NODE_FLAG_IS_PUNT

      /* Set when frame has traced packets. */
    #define VLIB_FRAME_TRACE VLIB_NODE_FLAG_TRACE

      /* Number of vectors enqueue to this next since last overflow. */
      u32 vectors_since_last_overflow;
    } vlib_next_frame_t;
```

Graph node dispatch functions call vlib\_get\_next\_frame (...)  to
set "(u32 \*)to_next" to the right place in the vlib_frame_t
corresponding to the ith arc (aka next0) from the current node to the
indicated next node.

After some scuffling around - two levels of macros - processing
reaches vlib\_get\_next\_frame_internal (...). Get-next-frame-internal
digs up the vlib\_next\_frame\_t corresponding to the desired graph
arc.

The next frame data structure amounts to a graph-arc-centric frame
cache. Once a node finishes adding elements to a frame, it will acquire
a vlib_pending_frame_t and end up on the graph dispatcher's
run-queue. But there's no guarantee that more vector elements won't be
added to the underlying frame from the same (source\_node,
next\_index) arc or from a different (source\_node, next\_index) arc.

Maintaining consistency of the arc-to-frame cache is necessary. The
first step in maintaining consistency is to make sure that only one
graph node at a time thinks it "owns" the target vlib\_frame\_t.

Back to the graph node dispatch function. In the usual case, a certain
number of packets will be added to the vlib\_frame\_t acquired by
calling vlib\_get\_next\_frame (...).

Before a dispatch function returns, it's required to call
vlib\_put\_next\_frame (...) for all of the graph arcs it actually
used. This action adds a vlib\_pending\_frame\_t to the graph
dispatcher's pending frame vector.

Vlib\_put\_next\_frame makes a note in the pending frame of the frame
index, and also of the vlib\_next\_frame\_t index.

dispatch\_pending\_node actions
-------------------------------

The main graph dispatch loop calls dispatch\_pending\_node as shown
above.

Dispatch\_pending\_node recovers the pending frame, and the graph node
runtime / dispatch function. Further, it recovers the next\_frame
currently associated with the vlib\_frame\_t, and detaches the
vlib\_frame\_t from the next\_frame.

In .../src/vlib/main.c:dispatch\_pending\_node(...), note this stanza:

```c
  /* Force allocation of new frame while current frame is being
     dispatched. */
  restore_frame_index = ~0;
  if (nf->frame_index == p->frame_index)
    {
      nf->frame_index = ~0;
      nf->flags &= ~VLIB_FRAME_IS_ALLOCATED;
      if (!(n->flags & VLIB_NODE_FLAG_FRAME_NO_FREE_AFTER_DISPATCH))
	restore_frame_index = p->frame_index;
    }
```

dispatch\_pending\_node is worth a hard stare due to the several
second-order optimizations it implements. Almost as an afterthought,
it calls dispatch_node which actually calls the graph node dispatch
function.

Process / thread model
----------------------

vlib provides an ultra-lightweight cooperative multi-tasking thread
model. The graph node scheduler invokes these processes in much the same
way as traditional vector-processing run-to-completion graph nodes;
plus-or-minus a setjmp/longjmp pair required to switch stacks. Simply
set the vlib\_node\_registration\_t type field to
VLIB\_NODE\_TYPE\_PROCESS. Yes, process is a misnomer. These are
cooperative multi-tasking threads.

As of this writing, the default stack size is 1<<15 = 32kb.
Initialize the node registration's process\_log2\_n\_stack\_bytes member
as needed. The graph node dispatcher makes some effort to detect stack
overrun, e.g. by mapping a no-access page below each thread stack.

Process node dispatch functions are expected to be "while(1) { }" loops
which suspend when not otherwise occupied, and which must not run for
unreasonably long periods of time.

"Unreasonably long" is an application-dependent concept. Over the years,
we have constructed frame-size sensitive control-plane nodes which will
use a much higher fraction of the available CPU bandwidth when the frame
size is low. The classic example: modifying forwarding tables. So long
as the table-builder leaves the forwarding tables in a valid state, one
can suspend the table builder to avoid dropping packets as a result of
control-plane activity.

Process nodes can suspend for fixed amounts of time, or until another
entity signals an event, or both. See the next section for a description
of the vlib process event mechanism.

When running in vlib process context, one must pay strict attention to
loop invariant issues. If one walks a data structure and calls a
function which may suspend, one had best know by construction that it
cannot change. Often, it's best to simply make a snapshot copy of a data
structure, walk the copy at leisure, then free the copy.

Process events
--------------

The vlib process event mechanism API is extremely lightweight and easy
to use. Here is a typical example:

```c
    vlib_main_t *vm = &vlib_global_main;
    uword event_type, * event_data = 0;

    while (1)
    {
       vlib_process_wait_for_event_or_clock (vm, 5.0 /* seconds */);

       event_type = vlib_process_get_events (vm, &event_data);

       switch (event_type) {
       case EVENT1:
           handle_event1s (event_data);
           break;

       case EVENT2:
           handle_event2s (event_data);
           break;

       case ~0: /* 5-second idle/periodic */
           handle_idle ();
           break;

       default: /* bug! */
           ASSERT (0);
       }

       vec_reset_length(event_data);
    }
```

In this example, the VLIB process node waits for an event to occur, or
for 5 seconds to elapse. The code demuxes on the event type, calling
the appropriate handler function. Each call to
vlib\_process\_get\_events returns a vector of per-event-type data
passed to successive vlib\_process\_signal\_event calls; it is a
serious error to process only event\_data\[0\].

Resetting the event\_data vector-length to 0 \[instead of calling
vec\_free\] means that the event scheme doesn't burn cycles continuously
allocating and freeing the event data vector. This is a common vppinfra
/ vlib coding pattern, well worth using when appropriate.

Signaling an event is easy, for example:

```c
    vlib_process_signal_event (vm, process_node_index, EVENT1,
        (uword)arbitrary_event1_data); /* and so forth */
```

One can either know the process node index by construction - dig it out
of the appropriate vlib\_node\_registration\_t - or find the
vlib\_node\_t with vlib\_get\_node\_by\_name(...).

Buffers
-------

vlib buffering solves the usual set of packet-processing problems,
albeit at high performance. Key in terms of performance: one ordinarily
allocates / frees N buffers at a time rather than one at a time. Except
when operating directly on a specific buffer, one deals with buffers by
index, not by pointer.

Packet-processing frames are u32\[\] arrays, not
vlib\_buffer\_t\[\] arrays.

Packets comprise one or more vlib buffers, chained together as required.
Multiple particle sizes are supported; hardware input nodes simply ask
for the required size(s). Coalescing support is available. For obvious
reasons one is discouraged from writing one's own wild and wacky buffer
chain traversal code.

vlib buffer headers are allocated immediately prior to the buffer data
area. In typical packet processing this saves a dependent read wait:
given a buffer's address, one can prefetch the buffer header
\[metadata\] at the same time as the first cache line of buffer data.

Buffer header metadata (vlib\_buffer\_t) includes the usual rewrite
expansion space, a current\_data offset, RX and TX interface indices,
packet trace information, and opaque data areas.

The opaque data is intended to control packet processing in arbitrary
subgraph-dependent ways. The programmer shoulders responsibility for
data lifetime analysis, type-checking, etc.

Buffers have reference-counts in support of e.g. multicast replication.

Shared-memory message API
-------------------------

Local control-plane and application processes interact with the vpp
dataplane via asynchronous message-passing in shared memory over
unidirectional queues. The same application APIs are available via
sockets.

Capturing API traces and replaying them in a simulation environment
requires a disciplined approach to the problem. This seems like a
make-work task, but it is not. When something goes wrong in the
control-plane after 300,000 or 3,000,000 operations, high-speed replay
of the events leading up to the accident is a huge win.

The shared-memory message API message allocator vl\_api\_msg\_alloc uses
a particularly cute trick. Since messages are processed in order, we try
to allocate message buffering from a set of fixed-size, preallocated
rings. Each ring item has a "busy" bit. Freeing one of the preallocated
message buffers merely requires the message consumer to clear the busy
bit. No locking required.

Debug CLI
---------

Adding debug CLI commands to VLIB applications is very simple.

Here is a complete example:

```c
    static clib_error_t *
    show_ip_tuple_match (vlib_main_t * vm,
                         unformat_input_t * input,
                         vlib_cli_command_t * cmd)
    {
        vlib_cli_output (vm, "%U\n", format_ip_tuple_match_tables, &routing_main);
        return 0;
    }

    /* *INDENT-OFF* */
    static VLIB_CLI_COMMAND (show_ip_tuple_command) =
    {
        .path = "show ip tuple match",
        .short_help = "Show ip 5-tuple match-and-broadcast tables",
        .function = show_ip_tuple_match,
    };
    /* *INDENT-ON* */
```

This example implements the "show ip tuple match" debug cli
command. In ordinary usage, the vlib cli is available via the "vppctl"
application, which sends traffic to a named pipe. One can configure
debug CLI telnet access on a configurable port.

The cli implementation has an output redirection facility which makes it
simple to deliver cli output via shared-memory API messaging.

Particularly for debug or "show tech support" type commands, it would be
wasteful to write vlib application code to pack binary data, write more
code elsewhere to unpack the data and finally print the answer. If a
certain cli command has the potential to hurt packet processing
performance by running for too long, do the work incrementally in a
process node. The client can wait.

Handing off buffers between threads
-----------------------------------

Vlib includes an easy-to-use mechanism for handing off buffers between
worker threads. A typical use-case: software ingress flow hashing. At
a high level, one creates a per-worker-thread queue which sends packets
to a specific graph node in the indicated worker thread. With the
queue in hand, enqueue packets to the worker thread of your choice.

### Initialize a handoff queue

Simple enough, call vlib_frame_queue_main_init:

```c
   main_ptr->frame_queue_index
       = vlib_frame_queue_main_init (dest_node.index, frame_queue_size);
```

Frame_queue_size means what it says: the number of frames which may be
queued. Since frames contain 1...256 packets, frame_queue_size should
be a reasonably small number (32...64). If the frame queue producer(s)
are faster than the frame queue consumer(s), congestion will
occur. We suggest letting the enqueue operation deal with queue
congestion, as shown in the enqueue example below.

Under the floorboards, vlib_frame_queue_main_init creates an input queue
for each worker thread.

Please do NOT create frame queues until it's clear that they will be
used. Although the main dispatch loop is reasonably smart about how
often it polls the (entire set of) frame queues, polling unused frame
queues is a waste of clock cycles.

### Hand off packets

The actual handoff mechanics are simple, and integrate nicely with
a typical graph-node dispatch function:

```c
    always_inline uword
    do_handoff_inline (vlib_main_t * vm,
                       vlib_node_runtime_t * node, vlib_frame_t * frame,
                       int is_ip4, int is_trace)
    {
      u32 n_left_from, *from;
      vlib_buffer_t *bufs[VLIB_FRAME_SIZE], **b;
      u16 thread_indices[VLIB_FRAME_SIZE];
      u16 nexts[VLIB_FRAME_SIZE], *next;
      u32 n_enq;
      htest_main_t *hmp = &htest_main;
      int i;

      from = vlib_frame_vector_args (frame);
      n_left_from = frame->n_vectors;

      vlib_get_buffers (vm, from, bufs, n_left_from);
      next = nexts;
      b = bufs;

      /*
       * Typical frame traversal loop, details vary with
       * use case. Make sure to set thread_indices[i] with
       * the desired destination thread index. You may
       * or may not bother to set next[i].
       */

      for (i = 0; i < frame->n_vectors; i++)
        {
          <snip>
          /* Pick a thread to handle this packet */
          thread_indices[i] = f (packet_data_or_whatever);
          <snip>

          b += 1;
          next += 1;
          n_left_from -= 1;
        }

      /* Enqueue buffers to threads */
      n_enq =
        vlib_buffer_enqueue_to_thread (vm, hmp->frame_queue_index,
                                       from, thread_indices, frame->n_vectors,
                                       1 /* drop on congestion */);

      /* Typical counters */
      if (n_enq < frame->n_vectors)
        vlib_node_increment_counter (vm, node->node_index,
                                     XXX_ERROR_CONGESTION_DROP,
                                     frame->n_vectors - n_enq);
      vlib_node_increment_counter (vm, node->node_index,
                                   XXX_ERROR_HANDED_OFF, n_enq);
      return frame->n_vectors;
    }
```

Notes about calling vlib_buffer_enqueue_to_thread(...):

* If you pass "drop on congestion" non-zero, all packets in the
inbound frame will be consumed one way or the other. This is the
recommended setting.

* In the drop-on-congestion case, please don't try to "help" in the
enqueue node by freeing dropped packets, or by pushing them to
"error-drop." Either of those actions would be a severe error.

* It's perfectly OK to enqueue packets to the current thread.
648