..  BSD LICENSE
    Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
    All rights reserved.

    Redistribution and use in source and binary forms, with or without
    modification, are permitted provided that the following conditions
    are met:

    * Redistributions of source code must retain the above copyright
    notice, this list of conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright
    notice, this list of conditions and the following disclaimer in
    the documentation and/or other materials provided with the
    distribution.
    * Neither the name of Intel Corporation nor the names of its
    contributors may be used to endorse or promote products derived
    from this software without specific prior written permission.

    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
    "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
    LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
    A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
    OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
    SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
    LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
    DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
    THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
    (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
    OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

L3 Forwarding with Power Management Sample Application
======================================================

Introduction
------------

The L3 Forwarding with Power Management application is an example of power-aware packet processing using the DPDK.
The application is based on the existing L3 Forwarding sample application,
with power management algorithms added to control the P-states and
C-states of the Intel processor via a power management library.

Overview
--------

The application demonstrates the use of the Power libraries in the DPDK to implement packet forwarding.
The initialization and run-time paths are very similar to those of the :doc:`l3_forward`.
The main difference from the L3 Forwarding sample application is that this application introduces power-aware optimization algorithms
by leveraging the Power library to control the P-state and C-state of the processor based on packet load.

The DPDK includes poll-mode drivers (PMDs) to configure Intel NIC devices and their receive (Rx) and transmit (Tx) queues.
The design principle of a PMD is to access the Rx and Tx descriptors directly, without any interrupts, to quickly receive,
process and deliver packets in user space.

In general, the DPDK executes an endless packet processing loop on dedicated IA cores that includes the following steps:

*   Retrieve input packets by polling the Rx queue through the PMD

*   Process each received packet or provide received packets to other processing cores through software queues

*   Send pending output packets to the Tx queue through the PMD

In this way, the PMD achieves better performance than a traditional interrupt-mode driver,
at the cost of keeping cores active and running at the highest frequency,
hence consuming the maximum power all the time.
However, during periods of light network traffic,
which occur regularly in communication infrastructure systems due to the well-known "tidal effect",
the PMD is still busy waiting for network packets, which wastes a lot of power.

Processor performance states (P-states) are the capability of an Intel processor
to switch between different supported operating frequencies and voltages.
If configured correctly for the system workload, this feature provides power savings.
CPUFreq is the infrastructure provided by the Linux* kernel to control the processor performance state capability.
CPUFreq supports a user space governor that lets a user space application set the frequency by writing to a virtual file device.
The Power library in the DPDK provides a set of APIs that manipulate this virtual file device, allowing a user space application
to set the CPUFreq governor and set the frequency of specific cores.

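The virtual file device mentioned above is exposed by Linux through sysfs.
The following is a minimal sketch of the mechanism, not the Power library's implementation:
it assumes the standard scaling_governor and scaling_setspeed attributes under /sys/devices/system/cpu,
and the helper names cpufreq_attr_path(), write_attr() and set_core_freq_khz() are hypothetical.

.. code-block:: c

    #include <stdio.h>

    /* Build the sysfs path of one CPUFreq attribute for a given CPU.
     * Returns 0 on success, -1 if the buffer is too small. */
    int cpufreq_attr_path(char *buf, size_t len, unsigned cpu, const char *attr)
    {
        int n = snprintf(buf, len, "/sys/devices/system/cpu/cpu%u/cpufreq/%s",
                         cpu, attr);
        return (n < 0 || (size_t)n >= len) ? -1 : 0;
    }

    /* Write a string value to an attribute file. Returns 0 on success. */
    int write_attr(const char *path, const char *value)
    {
        FILE *f = fopen(path, "w");
        int ok;

        if (f == NULL)
            return -1;
        ok = fputs(value, f) >= 0;
        if (fclose(f) != 0)
            ok = 0;
        return ok ? 0 : -1;
    }

    /* Select the userspace governor, then request a frequency in kHz. */
    int set_core_freq_khz(unsigned cpu, unsigned khz)
    {
        char path[128], val[32];

        if (cpufreq_attr_path(path, sizeof(path), cpu, "scaling_governor") != 0 ||
            write_attr(path, "userspace") != 0)
            return -1;
        snprintf(val, sizeof(val), "%u", khz);
        if (cpufreq_attr_path(path, sizeof(path), cpu, "scaling_setspeed") != 0)
            return -1;
        return write_attr(path, val);
    }

Writing these attributes normally requires root privileges, and the requested frequency
should be one of the values the processor actually supports.
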
This application includes a P-state power management algorithm to generate a frequency hint to be sent to CPUFreq.
The algorithm uses the number of received and available Rx packets on recent polls to make a heuristic decision to scale frequency up or down.
Specifically, thresholds are checked to determine whether a core running a DPDK polling thread needs to step its frequency up,
based on how close the polled Rx queues are to full.
The algorithm also steps the frequency down if the number of packets processed per loop is far below the expected threshold
or the thread's sleeping time exceeds a threshold.

C-states are also known as sleep states.
They allow software to put an Intel core into a low power idle state from which it is possible to exit via an event, such as an interrupt.
However, there is a tradeoff between the power consumed in the idle state and the time required to wake up from the idle state (exit latency).
Therefore, as you go into deeper C-states, the power consumed is lower but the exit latency is increased. Each C-state has a target residency.
It is essential that when entering a C-state, the core remains in this C-state for at least as long as the target residency in order
to fully realize the benefits of entering the C-state.
CPUIdle is the infrastructure provided by the Linux kernel to control the processor C-state capability.
Unlike CPUFreq, CPUIdle does not provide a mechanism that allows the application to change C-state.
Instead, it has its own heuristic algorithms in kernel space that select the target C-state, entered by executing privileged instructions such as HLT and MWAIT,
based on the speculative sleep duration of the core.
In this application, we introduce a heuristic algorithm that allows packet processing cores to sleep for a short period
if no Rx packet has been received on recent polls.
In this way, CPUIdle automatically forces the corresponding cores to enter deeper C-states
instead of always running in the C0 state waiting for packets.

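The residency tradeoff can be made concrete with a small selection routine.
The sketch below is a simplified model of the kind of choice an idle governor makes, not the kernel's actual algorithm:
it picks the deepest state whose target residency fits within the predicted idle time and whose exit latency is tolerable.
The idle_state structure, the pick_idle_state() function and any latency or residency figures passed to it are hypothetical.

.. code-block:: c

    #include <stddef.h>

    /* One idle state; all figures used with it are illustrative. */
    struct idle_state {
        const char *name;
        unsigned exit_latency_us;     /* time needed to wake up */
        unsigned target_residency_us; /* minimum stay for a net power saving */
    };

    /* Return the index of the deepest state that fits. States are assumed to
     * be ordered from shallowest to deepest, with index 0 always usable. */
    size_t pick_idle_state(const struct idle_state *states, size_t n,
                           unsigned predicted_idle_us, unsigned latency_limit_us)
    {
        size_t best = 0;

        for (size_t i = 1; i < n; i++) {
            if (states[i].target_residency_us <= predicted_idle_us &&
                states[i].exit_latency_us <= latency_limit_us)
                best = i;
        }
        return best;
    }
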
.. note::

    To fully demonstrate the power saving capability of using C-states,
    it is recommended to enable deeper C3 and C6 states in the BIOS during system boot up.

Compiling the Application
-------------------------

To compile the application:

#.  Go to the sample application directory:

    .. code-block:: console

        export RTE_SDK=/path/to/rte_sdk
        cd ${RTE_SDK}/examples/l3fwd-power

#.  Set the target (a default target is used if not specified). For example:

    .. code-block:: console

        export RTE_TARGET=x86_64-native-linuxapp-gcc

    See the *DPDK Getting Started Guide* for possible RTE_TARGET values.

#.  Build the application:

    .. code-block:: console

        make

Running the Application
-----------------------

The application has a number of command line options:

.. code-block:: console

    ./build/l3fwd_power [EAL options] -- -p PORTMASK [-P] --config (port,queue,lcore)[,(port,queue,lcore)] [--enable-jumbo [--max-pkt-len PKTLEN]] [--no-numa]

where,

*   -p PORTMASK: Hexadecimal bitmask of ports to configure

*   -P: Sets all ports to promiscuous mode so that packets are accepted regardless of the packet's Ethernet MAC destination address.
    Without this option, only packets with the Ethernet MAC destination address set to the Ethernet address of the port are accepted.

*   --config (port,queue,lcore)[,(port,queue,lcore)]: Determines which queues from which ports are mapped to which cores

*   --enable-jumbo: Optional, enables jumbo frames

*   --max-pkt-len: Optional, maximum packet length in decimal (64-9600)

*   --no-numa: Optional, disables NUMA awareness

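To illustrate how a hexadecimal PORTMASK is interpreted, the following sketch parses the mask and tests individual ports.
It is an illustration only and not the sample's own parser; parse_portmask() and port_is_enabled() are hypothetical helper names,
and treating an empty (zero) mask as an error is an assumption.

.. code-block:: c

    #include <errno.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* Parse a hexadecimal port mask such as "3" or "0x3".
     * Returns 0 and stores the mask on success, -1 on malformed input,
     * out-of-range values or a zero mask. */
    int parse_portmask(const char *arg, uint32_t *mask)
    {
        char *end = NULL;
        unsigned long v;

        if (arg == NULL || *arg == '\0')
            return -1;
        errno = 0;
        v = strtoul(arg, &end, 16);
        if (errno != 0 || *end != '\0' || v == 0 || v > UINT32_MAX)
            return -1;
        *mask = (uint32_t)v;
        return 0;
    }

    /* Bit N of the mask enables port N. */
    int port_is_enabled(uint32_t mask, unsigned port)
    {
        return port < 32 && ((mask >> port) & 1u);
    }

With -p 0x3, for example, bits 0 and 1 are set, so ports 0 and 1 are configured and every other port is left alone.
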
See :doc:`l3_forward` for details.
The L3fwd-power example reuses the L3fwd command line options.

Explanation
-----------

The following sections provide some explanation of the sample application code.
As mentioned in the overview section,
the initialization and run-time paths are very similar to those of the L3 forwarding application.
The following sections describe aspects that are specific to the L3 Forwarding with Power Management sample application.

Power Library Initialization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Power library is initialized in the main routine.
It changes the P-state governor to userspace for the specific cores that are under control.
The Timer library is also initialized, and per-core timers are created later on.
These timers are responsible for checking, at run time, whether the frequency needs to be scaled down based on CPU utilization statistics.

.. note::

    Only the power management related initialization is shown.

.. code-block:: c

    int main(int argc, char **argv)
    {
        struct lcore_conf *qconf;
        int ret;
        unsigned nb_ports;
        uint16_t queueid;
        unsigned lcore_id;
        uint64_t hz;
        uint32_t n_tx_queue, nb_lcores;
        uint8_t portid, nb_rx_queue, queue, socketid;

        // ...

        /* init RTE timer library to be used to initialize per-core timers */
        rte_timer_subsystem_init();

        // ...

        /* per-core initialization */
        for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
            if (rte_lcore_is_enabled(lcore_id) == 0)
                continue;

            /* init power management library for a specified core */
            ret = rte_power_init(lcore_id);
            if (ret)
                rte_exit(EXIT_FAILURE, "Power management library "
                    "initialization failed on core%u\n", lcore_id);

            /* init timer structures for each enabled lcore */
            rte_timer_init(&power_timers[lcore_id]);

            hz = rte_get_hpet_hz();

            /* arm a one-shot timer that runs power_timer_cb() on this lcore */
            rte_timer_reset(&power_timers[lcore_id], hz/TIMER_NUMBER_PER_SECOND,
                            SINGLE, lcore_id, power_timer_cb, NULL);

            // ...
        }

        // ...
    }

Monitoring Loads of Rx Queues
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In general, the polling nature of the DPDK prevents the OS power management subsystem from knowing
whether the network load is actually heavy or light.
In this sample, the network load is sampled by monitoring the received and
available descriptors on NIC Rx queues in recent polls.
Based on the number of returned and available Rx descriptors,
this example implements algorithms to generate frequency scaling hints and speculative sleep durations,
and uses them to control the P-state and C-state of processors via the power management library.
Frequency (P-state) control and sleep state (C-state) control work individually for each logical core,
and their combination contributes to a power efficient packet processing solution when serving light network loads.

The rte_eth_rx_burst() function and the newly-added rte_eth_rx_queue_count() function are used in the endless packet processing loop
to return the number of received and available Rx descriptors.
The numbers for each queue are then passed to the P-state and C-state heuristic algorithms
to generate hints based on recent network load trends.

.. note::

    Only power control related code is shown.

.. code-block:: c

    static
    __attribute__((noreturn)) int main_loop(__attribute__((unused)) void *dummy)
    {
        // ...

        while (1) {
            // ...

            /**
             * Read packet from RX queues
             */
            lcore_scaleup_hint = FREQ_CURRENT;
            lcore_rx_idle_count = 0;

            for (i = 0; i < qconf->n_rx_queue; ++i) {
                rx_queue = &(qconf->rx_queue_list[i]);
                rx_queue->idle_hint = 0;
                portid = rx_queue->port_id;
                queueid = rx_queue->queue_id;

                nb_rx = rte_eth_rx_burst(portid, queueid, pkts_burst, MAX_PKT_BURST);
                stats[lcore_id].nb_rx_processed += nb_rx;

                if (unlikely(nb_rx == 0)) {
                    /**
                     * no packet received from rx queue, try to
                     * sleep for a while, forcing the CPU to enter
                     * deeper C states.
                     */
                    rx_queue->zero_rx_packet_count++;

                    if (rx_queue->zero_rx_packet_count <= MIN_ZERO_POLL_COUNT)
                        continue;

                    rx_queue->idle_hint = power_idle_heuristic(rx_queue->zero_rx_packet_count);
                    lcore_rx_idle_count++;
                } else {
                    rx_ring_length = rte_eth_rx_queue_count(portid, queueid);

                    rx_queue->zero_rx_packet_count = 0;

                    /**
                     * do not scale up frequency immediately as
                     * user to kernel space communication is costly
                     * which might impact packet I/O for received
                     * packets.
                     */
                    rx_queue->freq_up_hint = power_freq_scaleup_heuristic(lcore_id, rx_ring_length);
                }

                /* Prefetch and forward packets */

                // ...
            }

            if (likely(lcore_rx_idle_count != qconf->n_rx_queue)) {
                /* at least one queue is active: take the highest scale-up hint */
                for (i = 1, lcore_scaleup_hint = qconf->rx_queue_list[0].freq_up_hint; i < qconf->n_rx_queue; ++i) {
                    rx_queue = &(qconf->rx_queue_list[i]);

                    if (rx_queue->freq_up_hint > lcore_scaleup_hint)
                        lcore_scaleup_hint = rx_queue->freq_up_hint;
                }

                if (lcore_scaleup_hint == FREQ_HIGHEST)
                    rte_power_freq_max(lcore_id);
                else if (lcore_scaleup_hint == FREQ_HIGHER)
                    rte_power_freq_up(lcore_id);
            } else {
                /**
                 * All Rx queues empty in recent consecutive polls,
                 * sleep in a conservative manner, meaning sleep as
                 * little as possible.
                 */
                for (i = 1, lcore_idle_hint = qconf->rx_queue_list[0].idle_hint; i < qconf->n_rx_queue; ++i) {
                    rx_queue = &(qconf->rx_queue_list[i]);
                    if (rx_queue->idle_hint < lcore_idle_hint)
                        lcore_idle_hint = rx_queue->idle_hint;
                }

                if (lcore_idle_hint < SLEEP_GEAR1_THRESHOLD)
                    /**
                     * execute "pause" instruction to avoid context
                     * switch for short sleep.
                     */
                    rte_delay_us(lcore_idle_hint);
                else
                    /* long sleep forces the running thread to suspend */
                    usleep(lcore_idle_hint);

                stats[lcore_id].sleep_time += lcore_idle_hint;
            }
        }
    }

P-State Heuristic Algorithm
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The power_freq_scaleup_heuristic() function is responsible for generating a frequency hint for the specified logical core
according to the number of available descriptors returned from rte_eth_rx_queue_count().
On every poll for new packets, the number of available descriptors on an Rx queue is evaluated,
and the algorithm used for frequency hinting is as follows:

*   If the number of available descriptors exceeds 96, the maximum frequency is hinted.

*   If the number of available descriptors exceeds 64, a trend counter is incremented by 100.

*   If the number of available descriptors exceeds 32, the trend counter is incremented by 1.

*   When the trend counter reaches 10000, the frequency hint is changed to the next higher frequency.

.. note::

    The assumption is that the Rx queue size is 128. The thresholds specified above
    must be adjusted based on the actual hardware Rx queue size,
    which is configured via the rte_eth_rx_queue_setup() function.

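The thresholds listed above can be expressed as a small standalone function.
This is a sketch of the described rules rather than the sample's source:
the hint names FREQ_CURRENT, FREQ_HIGHER and FREQ_HIGHEST appear in the sample, but their numeric order,
the scaleup_hint() name, the macro names, and the choice to reset the trend counter after a hint is issued are assumptions.

.. code-block:: c

    #include <stdint.h>

    /* Hint values; the numeric order is assumed. */
    enum freq_hint { FREQ_CURRENT = 0, FREQ_HIGHER = 1, FREQ_HIGHEST = 2 };

    /* Thresholds from the list above, valid for a 128-entry Rx queue. */
    #define GEAR1_THRESHOLD 32u
    #define GEAR2_THRESHOLD 64u
    #define GEAR3_THRESHOLD 96u
    #define TREND_LIMIT     10000u

    /* Evaluate one poll; *trend is the per-core trend counter. */
    enum freq_hint scaleup_hint(uint32_t rx_ring_length, uint32_t *trend)
    {
        if (rx_ring_length > GEAR3_THRESHOLD) {
            *trend = 0;            /* assumption: restart the trend */
            return FREQ_HIGHEST;
        }
        if (rx_ring_length > GEAR2_THRESHOLD)
            *trend += 100;
        else if (rx_ring_length > GEAR1_THRESHOLD)
            *trend += 1;

        if (*trend >= TREND_LIMIT) {
            *trend = 0;            /* assumption: restart the trend */
            return FREQ_HIGHER;
        }
        return FREQ_CURRENT;
    }

Under these assumptions, a queue that holds more than 64 descriptors on every poll produces a step-up hint after 100 polls,
whereas a queue that holds just over 32 needs 10000 polls to produce the same hint.
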
In general, a thread needs to poll packets from multiple Rx queues.
Most likely, different queues have different loads, so they return different frequency hints.
The algorithm evaluates all the hints and then scales up frequency in an aggressive manner,
scaling up to the highest frequency as soon as any one Rx queue requires it.
In this way, we can minimize any negative performance impact.

On the other hand, frequency scaling down is controlled in the timer callback function.
Specifically, if the sleep times of a logical core indicate that it is sleeping for more than 25% of the sampling period,
or if the average number of packets per iteration is lower than expected, the frequency is decreased by one step.

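The scale-down decision described above can be sketched as follows.
The core_stats layout and the should_scale_down() function are hypothetical,
the expected number of packets per iteration is left as a parameter because its value is not given here,
and treating a period with no loop iterations as idle is an assumption.

.. code-block:: c

    #include <stdint.h>

    /* Per-core counters gathered during one sampling period. */
    struct core_stats {
        uint64_t sleep_time_us;   /* total time spent sleeping */
        uint64_t nb_rx_processed; /* packets received */
        uint64_t nb_iterations;   /* polling loop iterations */
    };

    /* Return 1 if the frequency should drop one step, 0 otherwise. */
    int should_scale_down(const struct core_stats *s, uint64_t period_us,
                          uint64_t expected_pkts_per_iter)
    {
        /* Sleeping for more than 25% of the period: 4 * sleep > period. */
        if (s->sleep_time_us * 4 > period_us)
            return 1;
        /* Assumption: no iterations means no work was done at all. */
        if (s->nb_iterations == 0)
            return 1;
        /* Average packets per iteration below the expected value. */
        return s->nb_rx_processed / s->nb_iterations < expected_pkts_per_iter;
    }
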
C-State Heuristic Algorithm
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Once five consecutive rte_eth_rx_burst() polls have returned zero packets,
an idle counter begins incrementing for each successive zero poll.
At the same time, the function power_idle_heuristic() is called to generate a speculative sleep duration
in order to force the logical core to enter a deeper sleeping C-state.
There is no way to control the C-state directly, but the CPUIdle subsystem in the OS is intelligent enough
to select the C-state to enter based on the actual sleep period of the given logical core.
The algorithm has the following sleeping behavior depending on the idle counter:

*   If the idle count is less than 100, the counter value is used as a microsecond sleep value through rte_delay_us(),
    which executes pause instructions to avoid a costly context switch while still saving power.

*   If the idle count is between 100 and 999, a fixed sleep interval of 100 μs is used.
    A 100 μs sleep interval allows the core to enter the C1 state while keeping a fast response time in case new traffic arrives.

*   If the idle count is greater than 1000, a fixed sleep value of 1 ms is used until the next timer expiration.
    This allows the core to enter the C3/C6 states.

.. note::

    The thresholds specified above need to be adjusted for different Intel processors and traffic profiles.

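The three tiers above map an idle count to a sleep duration.
A minimal sketch of that mapping follows; idle_sleep_us() is a hypothetical name,
and placing an idle count of exactly 1000 in the 1 ms tier is an assumption, since the list above leaves that boundary open.

.. code-block:: c

    #include <stdint.h>

    /* Map the idle counter to a sleep duration in microseconds. */
    uint32_t idle_sleep_us(uint32_t idle_count)
    {
        if (idle_count < 100)
            return idle_count; /* short busy-wait, no context switch */
        if (idle_count < 1000)
            return 100;        /* long enough for C1 */
        return 1000;           /* 1 ms, long enough for C3/C6 */
    }

The sleep duration therefore grows with the length of the idle streak and is capped at 1 ms,
which bounds the extra latency seen by the first packet of a new burst.
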
If a thread polls multiple Rx queues and different queues return different sleep duration values,
the algorithm controls the sleep time in a conservative manner by sleeping for the least possible time
in order to avoid a potential performance impact.
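
The conservative choice amounts to taking the minimum of the per-queue hints and then choosing how to sleep.
The sketch below shows both steps under stated assumptions: min_idle_hint(), pick_sleep_method() and the sleep_method enum
are hypothetical names, and the threshold is passed in rather than fixed.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    enum sleep_method { SLEEP_BUSY_WAIT, SLEEP_SUSPEND };

    /* Shortest sleep hint across all queues polled by one thread.
     * n must be at least 1. */
    uint32_t min_idle_hint(const uint32_t *hints, size_t n)
    {
        uint32_t min = hints[0];

        for (size_t i = 1; i < n; i++)
            if (hints[i] < min)
                min = hints[i];
        return min;
    }

    /* Below the threshold, spin; otherwise give the core back to the OS. */
    enum sleep_method pick_sleep_method(uint32_t sleep_us, uint32_t threshold_us)
    {
        return sleep_us < threshold_us ? SLEEP_BUSY_WAIT : SLEEP_SUSPEND;
    }

For example, with hints of 1000, 100 and 42 μs and a threshold of 100 μs, the thread busy-waits for 42 μs,
so the queue that was active most recently determines how quickly polling resumes.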
411