Reliability Enhancement of Redundancy Management in AFDX Networks
Meng Li, Guchuan Zhu, Yvon Savaria, Michaël Lauer

To cite this version:
Meng Li, Guchuan Zhu, Yvon Savaria, Michaël Lauer. Reliability Enhancement of Redundancy Management in AFDX Networks. IEEE Transactions on Industrial Informatics, Institute of Electrical and Electronics Engineers, 2017, 13 (5), pp.2118-2129. <10.1109/TII.2017.2732345>. <hal-01585141>

HAL Id: hal-01585141
https://hal.laas.fr/hal-01585141
Submitted on 11 Sep 2017

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.
Reliability Enhancement of Redundancy Management in AFDX Networks

Meng Li, Guchuan Zhu, Senior Member, IEEE, Yvon Savaria, Fellow, IEEE, and Michaël Lauer

Abstract—AFDX is a safety critical network in which a redundancy management mechanism is employed to enhance the reliability of the network. However, as stated in the ARINC664-P7 standard, there still exists a potential problem, which may fail redundant transmissions due to sequence inversion in the redundant channels. In this paper, we explore this phenomenon and provide its mathematical analysis. It is revealed that the variable jitter and the transmission latency difference between two successive frames are the two main sources of sequence inversion. Thus, two methods are proposed and investigated to mitigate the effects of jitter pessimism, which can eliminate the potential risk. A case study is carried out and the obtained results confirm the validity and applicability of the developed approaches.

Index Terms—Reliability Enhancement, AFDX, Virtual Link, Fault Tolerance.

NOMENCLATURE

\( (+) \) \quad \max (+, 0) \)
\( L_{\text{max}} \) \quad The maximum frame length.
\( T_i \) \quad The period of VL\(_i\).
\( \sigma_i \) \quad The maximum frame size plus 20 bytes overhead.
\( \tau_i \) \quad A variable delay of VL\(_i\) and \( 0 \leq \tau_i \leq T_i \).
\( O_i \) \quad Time offset of the first frame of VL\(_i\).
\( u_{T_i, \tau_i, \sigma_i} \) \quad The staircase arrival curve for VL\(_i\).
\( J_{\text{2e}} \) \quad The end-to-end jitter upper bound.
\( N_j^i \) \quad The maximum number of VL\(_j\) within \( T_i \).
\( M_j^i(k) \) \quad Residual bytes in the worst case when \( (k+1)\text{th} \) frame arrives taking VL\(_i\) as the reference.
\( D_{i,j} \) \quad The minimum release time difference between adjacent frames of VL\(_i\) and VL\(_j\), where the frame of VL\(_j\) is ahead.
\( D_{i,j}(q) \) \quad The release time difference between the \( q\text{th} \) frame of VL\(_j\) and the reference frame of VL\(_i\).

I. INTRODUCTION

RELIABILITY is one of the main concerns for safety critical systems (See, e.g., [1]–[4]). A typical example of such systems is avionics communication network for which failures may be catastrophic. Therefore, guaranteeing a reliable communication among avionics systems at every flight phase is critical for aircrafts. To ensure that stringent reliability requirements are met, certain standards, e.g., ARINC 429, have been developed and successfully deployed since the late 1970s [5]. However, as the amount of electronic components in an aircraft continues to increase, legacy avionics communication protocols are at their limit in terms of performance and design complexity. Among the available technologies for handling the new challenges in avionics systems design, we can find an Ethernet-based technology, namely the Avionics Full Duplex Switched Ethernet (AFDX) [6], which features high speed, low cost, high flexibility, and reduced weight because of less wiring.

Built on the basis of Ethernet technology, the AFDX not only offers a high available bandwidth and a high communication speed, but also provides deterministic performance, which is the most prominent challenge to using such a technology in avionics. In AFDX networks, determinism is enforced mainly through the concept of Virtual Link (VL), which defines a logical unidirectional connection and a bounded data transmission bandwidth. Besides, the allocated bandwidth is reserved by VL’s maximum frame size (MFS) and the so-called Bandwidth Allocation Gap (BAG), which defines the minimum time interval between the first bits of two successive frames of a VL at the ingress of the networks. Furthermore, the AFDX network is composed of two independent and redundant networks, which provides the required reliability for ensuring its determinism. Consequently, the unavoidable faults on single paths can be tolerated by the redundancy management mechanism.

Nevertheless, although the AFDX was originally developed for safety critical avionic applications, it has not yet been used in critical systems that require the highest level of reliability, e.g., flight control systems [7]. Much efforts are still required to prove that AFDX networks can achieve the highest reliability requirement of critical functions, i.e., a failure probability of \( 10^{-9} \) per flight hour [8]. Specifically, as pointed out in ARINC664-P7 (see Section 3.2.6 in [9]), the redundant transmission mechanism fails if the following two events occur simultaneously: (1) a frame is lost during transmission on one of the redundant networks; (2) the subsequent frame on the network with frame loss arrives earlier to the destination End Systems (ES) than the copy of the lost frame sent through the other network, which is called a sequence inversion in the redundant channels. Obviously, this is a potential risk that could compromise the network reliability. In real avionics communication networks, frame loss, even if observed with a very small probability, is inevitable. Therefore in order to guarantee the reliability of the redundancy management...
mechanism, one must prevent the second condition from occurring. This is a real challenge to system designers, due to the lack of an analytical framework for this problem.

The motivation of this paper is to provide a mathematical analysis of RM failures in the AFDX protocol. It is revealed that the sequence inversion phenomenon, which can result in the invalidity of the redundancy management mechanism, is due to the variable jitter and the transmission latency difference between two successive frames. To tackle this problem, two methods that can contribute to eliminate the sequence inversion problem are proposed. One of these methods is based on local synchronization (LS) [10], [11] and the other exploits the notion of transmission latency difference minimization (TLDM) proposed in this work. This allows enhancing the reliability of RM. We show that these two approaches help mitigating the delay difference between two redundant networks in the worst case. A case study is carried out and the obtained results confirm the validity and the applicability of the developed approaches.

To the best of our knowledge, this is the first work that presents a formal mathematical analysis on the sequence inversion problem. Specifically, the main contributions of this paper are:

- identifying the sources of sequence inversion and providing a mathematical analysis regarding potential failures in RM;
- introducing two approaches that can eliminate potential failures due to frame sequence inversion of the redundant networks.

The aim of the present work focuses mainly on enhancing the determinism and the reliability of AFDX networks to take a step forward towards the application of this promising technology to highly safety-critical avionics systems, such as flight control systems.

The remaining of the paper is organized as follows. Section II introduces the context of AFDX networks. Section III describes some potential failures in redundant AFDX networks and provides a corresponding mathematical analysis. In Section IV two approaches are developed to enhance the reliability of RM. In Section V, a case study is carried out to validate the developed approaches and to evaluate the obtained performance. Finally, some concluding remarks and directions for future research are provided in Section VI.

II. THE CONTEXT OF AFDX NETWORKS AND RELATED WORK

A. Basis of AFDX Networks

An AFDX network is typically composed of three types of elements: ESs, switches and physical links. Each ES is connected to the switches via redundant physical links, denoted by Network A (-A suffix to switches) and Network B (-B suffix to switches) as shown in Fig. 1. Full duplex physical links are adopted to eliminate transmission collisions, which help to ensure deterministic timing performance. In addition, a star topology is applied in switch connections, which makes the network scalable. Usually, it is supposed that the switches have the capability of handling parallel processing. Hence, there is no interference between the packets forwarded to different outputs.

![Fig. 1: An example of AFDX network architecture.](image)

The determinism of AFDX networks is enforced mainly through the concept of VL. Specifically, in AFDX networks, only one ES can be the source of a VL and the routing of VLs is statically defined off-line. Furthermore, a VL can be composed of up to four Sub-VLs to improve the bandwidth utilization efficiency [12].

![Fig. 2: VL flow regulation.](image)

As shown in Fig. 2, input frames, either periodic or aperiodic, are regulated by the BAG, through which the instantaneous frame rate of a VL is limited. Therefore, the maximum bandwidth allocated to a VL is determined by its MFS and BAG [9]. According to the ARINC664 standard, the MFS should be in the range of 64 to 1518 bytes, including a header of 47 bytes. It also needs to take into account an overhead of 20 bytes (Interframe Gap+Preamble+Start Frame Delimiter) during frame transmission. The BAG should be a power of 2 multiplied by 1 ms within the set \{1, 2, 4, 8, 16, 32, 64, 128\}(ms).

![Fig. 3: The jitter of a VL cannot exceed 500 $\mu$s in a source ES [9].](image)

Scheduling in an ES or a switch is performed on a per VL basis, which may introduce jitters due to the congestion of VLs at the outputs. According to the AFDX standard, the jitter of
a VL cannot exceed 500 $\mu$s in a source ES as shown in Fig. 3. Furthermore, traffic policing is applied in switches to protect the network from babbling-idiot failures [13]. In addition, the technology latency in a switch, which is the time required to process frames that should be less than 100 $\mu$s irrespective of traffic load [9]. Usually, the technology latency is assumed to be upper bounded during analysis. The characteristics of VLs and the traffic shaping and policing mechanisms are essential for guaranteeing that the end-to-end delay of each frame can be upper bounded.

### B. Redundancy Management

As shown in Fig. 1, in an AFDX network, the frames in a VL are transmitted through two redundant and independent paths to achieve a high level of communication reliability. This redundancy mechanism assures a reliable communication against the loss of one complete network (Network A or Network B). For each transmitted frame, a sequence number (SN) is added to enable receivers to reconstruct an ordered stream of frames without duplication. In general, SN ranges from 0 to 255, and it is initially set to 0 and increased by 1 for each consecutive transmission of the same VL. The SN wraps around to 1 following the value of 255. Denoting by $i$ the value of a SN, then the wrap-around operation for SNs can be computed as:

$$i \oplus 1 = (i \mod 255) + 1. \quad (1)$$

Furthermore, two redundant frames with identical SNs must be received in an interval less than a predefined SkewMax. Otherwise, the latter reception is considered as a new frame. Hence, SkewMax is the upper bound of transmission delay difference for the redundant frames with identical SN.

### C. Related Work

For safety critical systems, it is essential to guarantee a reliable real-time communication. Thus, the computation of tight and deterministic delay upper bounds is one of the major issues for both communication network design and network certification [14]. Much effort has been dedicated to estimate the upper bounds for data transmission delays in order to guarantee timing behavior of the network based on formal analysis.

Network calculus is a mathematical tool that has been widely applied to performance analysis of communication networks by considering worst case scenarios. It was first introduced by Cruz based on min-plus algebra [15], [16], and then detailed by Le Boudec and Thiran [17]. A principle of “Pay Bursts Only Once” is proposed in [17] based on the property of the convolution of service curves, which contributes to tightening delay bound estimations. Significant improvements in delay upper bound estimation have been achieved by the introduction of a grouping technique, which leads to an approximate gain of up to 40% for a realistic AFDX configuration by considering the effect of serialization stream [18]. A stochastic extension of network calculus has also attracted much interest, and the application of probabilistic bounds in the analysis of AFDX networks can be found in [19]. In the framework of network calculus, traffic inputs are modeled by arrival curves, among which the most popular one is the fluid model. However, staircase models are more accurate for describing the property of packetization, although such models are known to be complex. In [20], a combination of fluid modelling and staircase modelling is introduced to make a trade-off between tighter bounds and computation complexity. It has been reported in [14] that the staircase model can lead to, on average, a gain of 18% for randomly generated configurations. As network calculus is able to deal with both periodic and aperiodic flows, partial synchronization of the source flows can be taken into account during analysis. In [21] periodic flows with known offsets in source ESs are considered to eliminate some pessimistic scenarios, which lead to a reduction of delay upper bounds. In [22], the event-stream model formulated with a staircase model is applied to obtain upper bounds of traffic. Another solution to determine the delay upper bounds is the trajectory approach, which considers the worst-case scenario experienced by a frame along its path [23]–[29]. This technique has been applied in the analysis of AFDX networks in [24], [27] based on the FIFO policy. The grouping technique is also taken into account to mitigate the pessimism in

![Diagram of Redundancy Management in destination ES](image-url)
the computation of upper bounds with the trajectory approach is characterized. Although it has been reported in [30] that the trajectory approach introduces optimism in some corner cases, the problems have been identified and corrected [31]. However, a formal proof for the fix proposed in [31] is still required. Recently, another approach, namely the forward end-to-end delays analysis (FA), has been proposed to obtain the delay upper bounds [32]. Similar to the trajectory approach, this method focuses on one frame and analyzes iteratively the components (ES and Switches) through which the frame passed. In [33], an improvement of the FA approach has been achieved by considering the serialization of frames sharing a common link, i.e., the grouping technique.

Besides the research on end-to-end delay analysis of AFDX networks, some other work focuses on traffic scheduling and redundancy management. In [34], a traffic phase shifting technique is proposed to improve the bandwidth utilization by assigning offsets to periodic traffic flows. However, the improvement is achieved at the expense of increasing end-to-end delay due to the fact that the frames are buffered at switches to wait for their time slots for transmission. A deduplication-aware Deficit Round Robin scheduling scheme is proposed in [35] to offer flexible scheduling and implement fast deduplication. In [36], mixed-criticality traffic scheduling was investigated to enhance resource efficiency. In [13], the frame management in AFDX networks is analyzed and a modified design with a priority queue is proposed. Although it can offer a better data integrity and a higher QoS, the improvement is achieved at the expense of higher latencies. Moreover, it is not a generic solution and its applicability highly depends on the properties of applications. In [37], three redundancy management algorithms (RMA1, RMA3, and RMA13) are analyzed using model checking. It is reported that RMA1 and RMA3 have difficulty to handle the redundant networks when one network fails. It seems that RMA13 is the best choice considering safety properties. However, compared with the FVW policy, RMA13 only accepts the frame from the same network as the last frame, which degrades the availability provided by the redundancy mechanism and can cause higher frame loss rates.

Based on the above-mentioned works, the QoS of AFDX networks can be improved by using different approaches. However, to the best of our knowledge, there is little work on the rigorous analysis of the sequence inversion problem. The motivation of this paper is then to identify the sources of sequence inversion and to develop solutions for eliminating the potential failures.

III. TRANSMISSION FAILURES IN AFDX NETWORKS

A. Frame Loss Resulting from Sequence Inversion

Although AFDX networks provide a highly reliable communication via redundant networks, the RM may fail in some special cases. Such possible failure cases have been identified in the standard ARINC664-P7 (see, Section 3.2.6 in [9]), which may occur when a frame is lost on the faster network.

For example, let us consider two redundant networks, Network A and Network B, that transmit their frames every BAG interval as shown in Fig. 5. Suppose that one frame on the faster network, e.g., A2 on Network A, is lost during transmission, e.g., due to bit errors corrupting the frame contents. To tolerate such failures, redundancy is employed in AFDX networks to increase the network reliability. However, if frame A3 arrives earlier than the frame B2 as shown in this example, a frame loss failure happens in spite of the redundant transmission. This results from the destination ES applying the FVW policy. Essentially, frame loss in the redundant AFDX network is due to frame sequence inversion of Network A and Network B at the destination ES.

B. Mathematical Analysis of the Frame Sequence Inversion

In this section, we provide a detailed mathematical analysis of the frame sequence inversion phenomenon. The analysis is based on three assumptions: (1) the redundant frames are fed to the 2 redundant networks simultaneously at the source ES; (2) Network A and Network B have identical topology and configurations, which include the same set of VLs; (3) the technological latency in both source ESs and switches is upper bounded. Note that these assumptions are used only for the purpose of simplifying the presentation and the relaxation of these assumptions will not introduce any technical difficulty.

Denote by $D_{\text{worst}}$ the worst-case delay upper bound experienced by the frames with maximum size in a VL. Let the transmission latency be the transmission time over the physical links. Thus $D_{\text{best}}$, the minimum frame delay, can be taken as the sum of technology latencies and transmission latency, which is determined by the routing of the corresponding VL and the minimum frame size. The difference between $D_{\text{worst}}$ and $D_{\text{best}}$ is due to the variance of frame size and the jitter caused by the influence of other VLs that share the output ports in source ES or in switches. Based on the assumptions above, the VLs in both networks have the same parameters with respect to $D_{\text{worst}}$ and $D_{\text{best}}$. For example, in the case shown in Fig. 5, the delay of A3 cannot be smaller than $D_{\text{best}}$ and the delay of B2 cannot exceed $D_{\text{worst}}$. Note that data transmission is considered to be completed when the last bit of the frame is received. Then, the reception is accomplished at $t_{A3}$ for A3 and $t_{B2}$ for B2, respectively. Assume that the
The first part on the left-hand side of this inequality represents the maximum jitter introduced by the VL frame with the maximum size and other VLs during transmission, and the second part denotes the maximum transmission latency difference between two successive frames. Therefore in order to meet the condition (6), it is required to mitigate the pessimism in jitter estimation and to reduce transmission latency difference between two successive frames.

In general, less pessimistic upper bounds can be obtained with more realistic models. Therefore, the staircase model, which is more accurate than the affine model, is employed to achieve better estimations. Relevant research can be found in [14], [20]. The experiment reported in [14] shows that the delay upper bounds can be improved up to 18% on average by using the staircase model instead of the affine arrival curve. However, the condition (6) shows that restricting the jitter upper bound within one BAG still cannot guarantee the elimination of sequence inversion. In the next section, two approaches that can contribute to eliminate the resulting potential failures are proposed.

IV. APPROACHES TO ELIMINATE THE POTENTIAL RISK OF FRAME SEQUENCE INVERSION

As shown in (6), to avoid the SN inversion, the sum of jitter upper bound and transmission latency difference has to be constrained within one BAG. This section addresses the possible solutions for further reducing the jitter and the transmission latency difference. Section IV-A reviews the LS mechanism that considers the release time differences of periodic VLs that can reduce the jitter, and a method to analyse the jitter, whereas Section IV-B presents a method to bound the $D_{TLD}$ term of (5).

A. Local Synchronization

The jitter upper bound estimation is based on the worst case, where it is assumed that frames in all VLs arrive simultaneously. However, this situation will not happen when some applications are executed sequentially on a single processor, which is common in practice. For example, the AFDX ESs are often paired with the ARINC653 operating system (OS). Thus frames of certain VLs are produced with a static and periodic manner on distinct pre-defined time slots. Therefore, LS is a possible solution to mitigate the jitter by exploring the periodic VL characteristics in source ES. Relevant research on LS in ESs can be found in [38] and [21]. It is proposed in [38] to reduce the end-to-end delay by taking into account partition scheduling, which helps to eliminate impossible scenarios (all periodic VLs simultaneously send frames to a scheduler) by introducing a correlation between the release of VLs in each ES. In [21], all VLs are assumed to be periodic and the end-to-end delay upper bounds are improved by taking into account offsets between periodic VLs. In this section, we further develop this idea while leveraging the staircase arrival curve to improve the results based on [38] and [21]. First, we consider a scenario with two VLs, in which the minimum release time difference between adjacent frames is analyzed and a condition to avoid the interference between the two VLs
is deduced. Then the analysis is extended to a general case. Finally, an example is given to illustrate the effect of LS.

A periodic VL, e.g., VL\_i, can be characterized by a triplet \( \{T_i, \sigma_i, O_i\} \), where \( T_i \) is the period of the flow and \( T_i = \text{BAG}_i \), \( \sigma_i \) is equal to the MFS plus 20 bytes overhead during transmission on physical links, and \( O_i \) represents a time offset of the first frame. Then the staircase model is applied in the following analysis. In such a model, the jitter of a frame is caused by the residual bytes left for transmission when the frame arrives at the scheduler.

\[ D_{\text{diff}} = \min \left( \{O_i - O_j\} \mod \min\{T_i, T_j\}, \right. \]
\[ \left. \min\{T_i, T_j\} - \{(O_i - O_j) \mod \min\{T_i, T_j\}\} \right\}. \]

\( (7) \)

A sketch of the proof is provided below. Suppose that \( f_i(n_i) = O_i + n_i T_i \) and \( f_j(n_j) = O_j + n_j T_j \) represent frame starts of VL\_i and VL\_j, respectively. \( n_i \) and \( n_j \) are two nonnegative integers. Assume that \( T_i \geq T_j \). According to the periodicity, we have that \( T_i = k T_j \), where \( k = 2^n \) and \( n \in \{0, \ldots, 7\} \). Then

\[ |f_i(n_i) - f_j(n_j)| = |O_i - O_j + n_i T_i - n_j T_j| = |O_i - O_j + (k n_i - n_j) T_j|. \]

In this case, the release time difference should be smaller than \( T_j \). Then the minimum of \( |f_i(n_i) - f_j(n_j)| \) is either \( |O_i - O_j| \mod T_j \) or \( T_j - |(O_i - O_j) \mod T_j| \). Thus, \( (7) \) holds true.

Let \( \frac{\text{max\{\sigma_i, \sigma_j\}}{C} \leq D_{\text{diff}} \) if the condition is satisfied, the two VLs have no influence on each other, although they share the output port of the same source ES. Once the VLs are delivered from the source ES, they are serialized. If the frame dispatched earlier does not experience any congestion in all switches along its path, it will never interfere with a frame released later. Although jitter may be introduced by a frame dispatched earlier, when congestion happens, it is due to the jitter propagation caused by other VLs. Obviously, LS contributes to reduce the jitter, as the number of interfering VLs to take into account is diminished.

In addition, if the start time order of the two VLs is fixed, e.g., VL\_i always starts ahead of VL\_j, then the requirements can be relaxed. In this scenario, VL\_j has no influence on VL\_i if the condition \( \frac{\sigma_j}{C} \leq (\min\{T_i, T_j\} - (|O_i - O_j| \mod \min\{T_i, T_j\})) \) holds and VL\_i has no influence on VL\_j if the condition \( \frac{\sigma_i}{C} \leq (|O_j - O_i| \mod \min\{T_i, T_j\}) \) can be met. This method can be extended when a set of VLs (>2) is considered and the corresponding analysis is given as follows.

\[ N_j = \left\lceil \frac{T_i - D_{i,j}}{T_j} \right\rceil. \]

\( (8) \)

If \( T_i \leq T_j \), \( N_j = 1 \), and it means that there is at most one frame of VL\_j within \( T_i \), which is true due to the VL regulation as shown in Fig. 2. If \( T_i > T_j \), suppose that \( T_i = m T_j \), where \( m = 2^n \) and \( n \in \{0, \ldots, 7\} \). Then the number of VL\_j within \( T_i \) can be obtained according to the rule of VL regulation.
Thus, there are \( m + 1 \) frames of \( VL_j \) within \( T_i \) when \( D_{i,j} = 0 \) and there exist \( m \) frames of \( VL_j \) within \( T_i \) when \( 0 < D_{i,j} < T_j \). Thus, (8) holds true.

For \( VL_j \in I, j \neq i \), let \( VL_j(q), q = 1, \ldots, N_j \), be the \( q \)th frame before a frame of \( VL_i \) in the worst case in \( T_i \). For each pair \( (VL_i, VL_j(q)), q = 1, \ldots, N_j \), the release time difference is computed individually. Denote by \( D_j = \{D_{i,j}(q), \ldots, D_{i,j}(2), D_{i,j}(1)\} \) a set of release time differences for each \( j \neq i \). Specifically, \( D_{i,j}(1) = 0 \), when \( VL_j \) is an aperiodic \( VL \). Define \( D \) as a sorted vector such that \( D = \{D^{(1)}, \ldots, D^{(N)}\} = \{\sum_{VL_i \in I, j \neq i} D_j\} \) and \( l = \sum_{VL_i \in I, j \neq i} N_j \) is the total number of frames between two consecutive frames of \( VL_j \) in the worst case. Then \( D^{(k)} = D_{i,j}(b), 1 \leq k \leq l \), and in this case, \( \sigma(k) = \sigma_j \), where \( \sigma_j \) is the maximum frame size associated with \( VL_j \).

Let \( M_i^{(l)} \) be the residual bytes left for transmission in the worst case when a frame of \( VL_i \) arrives. In other words, \( M_i^{(l)} \) is the total number of bytes that contributes to the jitter of \( VL_i \) in the worst-case scenario. We then have

\[
M_i^{(l)} = \left( M_i^{(l-1)} + \sigma^{(l)} - D^{(l)} C \right)^{+}, l \geq 1.
\] (9)

The proof of (9) is given as follows. The recursive computation of \( M_i^{(l)} \) starts by setting \( M_i^{(0)} = (\sigma_i - (T_i - D^{(1)})C)^{+} \). Then for each recursive step, \( M_i^{(k)}, 1 \leq k < l \), is computed. \( M_i^{(k)} \) involves three parts: the residual bytes when the previous frame arrives \( M_i^{(k-1)} \), the maximum frame size of the previous frame \( \sigma(k) \), and the data size that can be transmitted in \( D^{(k)} - D^{(k+1)} \) with the rate \( C \). As the residual bytes are nonnegative, it leads to

\[
M_i^{(k)} = \left( M_i^{(k-1)} + \sigma^{(k)} - \left( D^{(k)} - D^{(k+1)} \right) C \right)^{+}, \tag{10}
\]

where \( D^{(k)}, D^{(k+1)} \in D \). Finally, the computation stops when \( k = l \) and we have \( M_i^{(l)} = \left( M_i^{(l-1)} + \sigma^{(l)} - D^{(l)} C \right)^{+} \).

Let \( \alpha_T \) denote the arrival curve of the considered \( VL \) aggregate \( I \). Under the staircase model, the arrival curve of a single \( VL \) can be expressed by the following function:

\[
u_{T,\tau,\sigma}(t) = \left[ \frac{t + \tau}{T} \right] \sigma + \sigma; t, \tau \geq 0; T, \sigma > 0, \tag{11}\]

where \( \sigma \) is the burst transmission of the \( VL \), \( \tau \) is a variable delay, \( \sigma = L_{\text{max}} + 20 \), and \( T = \text{BAG} \). Obviously, both periodic and aperiodic \( VL \)s in source ESSs can be upper bounded by \( \nu_{T,0,\sigma}(t) \) in the worst case, due to the BAG regulation. Suppose that the service rate offered to aggregate \( I \) is \( C \) and \( C \geq \sum_{VL_i \in I} \nu_{T_i}^{\tau_i} \). As discussed in Section III-C, the constant rate service, \( \beta(t) = Ct \), can be applied in source ESSs. Based on the above analysis, \( M_i^{(l)} + \sigma_i \) is the worst-case backlog of aggregate \( I \), when a frame of \( VL_i \) arrives. Obviously, \( M_i^{(l)} + \sigma_i \) is upper bounded by the backlog upper bound of aggregate \( I \).

Based on Theorem 1.4.1 of [17], we have:

\[
M_i^{(l)} + \sigma_i \leq \sup_{t \geq 0} \left\{ \sum_{VL_j \in I} u_{T_j,0,\sigma_j}(t) - \beta(t) \right\} = \sup_{t \geq 0} \left\{ \sum_{VL_j \in I} \left[ \frac{t}{T_j} \right] \sigma_j + \sum_{VL_j \in I} \sigma_j - Ct \right\} \leq \sup_{t \geq 0} \left\{ \sum_{VL_j \in I} \left[ \frac{t}{T_j} \right] \sigma_j + \sum_{VL_j \in I} \sigma_j - Ct \right\} = \sum_{VL_j \in I} \sigma_j.
\]

Indeed, taking the LS into account may lead to an aggregate arrival curve less conservative than the one obtained by a direct summation of the arrival curves of individual flows. Specifically, when a periodic \( VL \) is taken as the benchmark, it is assumed that a frame of \( VL_i \) is the first one arrived in an arbitrary interval \([s, t]\). Then, for any \( VL_j \) in the aggregated flow \( I \), the maximum number of frames arrived in this interval is given by:

\[
N_j = \left[ \frac{t-s-\tau_j}{T_j} \right] + 1,
\]

where \( \tau_j = T_j - D_{i,j} \) when \( i \neq j \), and \( \tau_j = 0 \) when \( i = j \). According to [40], let \( R(t) \) be the aggregated flow, then:

\[
R(t) - R(s) \leq \sum_{VL_j \in I} \left( \left[ \frac{t-s-\tau_j}{T_j} \right] + 1 \right) \sigma_j = \sum_{VL_j \in I} u_{T_j,-\tau_j,\sigma_j}(t-s) := \alpha_T^i(t-s).
\]

Note that since \( \tau_j > 0 \) for any \( j \neq i \), it is clear from (11) that \( u_{T_j,-\tau_j,\sigma_j}(t) \) does not define an arrival curve of \( VL_j \). Furthermore, taking into account the worst-case residual bytes in the transmission of \( VL_i \), the arrival curve for the aggregated flow \( I \), taking \( VL_i \) as the benchmark, can be given by:

\[
\alpha_T^i(t) = \alpha_T^i(t)+M_i^{(l)} = \sum_{VL_j \in I} u_{T_j,-\tau_j,\sigma_j}(t)+M_i^{(l)}, t \geq 0, \tag{12}
\]

where \( \tau_j = T_j - D_{i,j} \) for \( i \neq j \) and \( \tau_j = 0 \) for \( i = j \).

Then \( \alpha_T^i(t) \) can be used for the end-to-end delay analysis of \( VL_i \) combined with the approach presented in [14]. If \( M_i^{(l)} \) can be reduced by applying LS, the introduced jitter is mitigated accordingly. Consequently, the end-to-end delay upper bound can be reduced accordingly.

3) An illustration example: To illustrate the effect of LS, we consider an example of 3 VLs with \( \sigma_i = 1500 \) bytes and a BAG of 1 ms. Their offsets are \( O_1 = 0 \), \( O_2 = 100 \mu s \) and \( O_3 = 200 \mu s \), respectively. Then the set of release time difference is given in Table I.

<table>
<thead>
<tr>
<th>TABLE I: Time Interval between Frames</th>
</tr>
</thead>
<tbody>
<tr>
<td>VL Pairs ((i,j))</td>
</tr>
<tr>
<td>---------------------</td>
</tr>
<tr>
<td>( D_{i,j} (\mu s) )</td>
</tr>
</tbody>
</table>
Based on (8), it can be obtained that \( l = 2 \). By considering LS, the residual bytes when the frame of VL1 under analysis arrives can be computed by:

\[
M_1^{(1)} = \left( M_1^{(0)} + \sigma^{(1)} - (D^{(1)} - D^{(2)})C \right)^+, \\
M_1^{(2)} = \left( M_1^{(1)} + \sigma^{(2)} - D^{(2)}C \right)^+ = 0,
\]

where \( \sigma^{(2)} = \sigma^{(1)} = 1500 \text{ bytes}, \ D^{(1)} = 900 \mu \text{s}, \ D^{(2)} = 800 \mu \text{s}, \ M_1^{(0)} = 250 \text{ bytes}, \) and \( C = 100 \text{ Mbps} \). Similarly, we can get \( M_2^{(2)} = 250 \text{ bytes} \) and \( M_3^{(2)} = 500 \text{ bytes} \) for VL2 and VL3, respectively. In contrast, by applying the conventional approaches without LS, the residual bytes for each VL are \( M_1^{(2)} = M_2^{(2)} = M_3^{(2)} = 3\sigma = 4500 \text{ bytes} \) in the worst case. In this case, the residual bytes are significantly reduced with LS.

It is shown in the above analysis that LS contributes to reduce the residual bytes for a periodic VL. Consequently, the jitter for the periodic VL is mitigated, which contributes to avoiding the incidence of frame sequence inversion. The other periodic VLs also benefit from this approach. In fact, the jitters for the other periodic VLs may be further mitigated by properly allocating the offsets of periodic VLs. Moreover, the aperiodic VLs can also benefit from the LS. Unlike the worst case-based analysis where all the periodic and aperiodic VLs are supposed to arrive simultaneously, with the LS mechanism, the number of interfering VLs or the amount of interfering backlog for aperiodic VLs can be reduced. Compared with the approach in [21], our jitter upper bounds are obtained by analyzing the residual number of bytes with respect to LS, instead of using the safe arrival curve of the aggregated flows. Therefore, tighter upper bounds can be achieved. For example, the end-to-end delay upper bound of \( v1 \) from \( e1 \) to \( e6 \) in the case study presented in [21] can be reduced to 96 \( \mu \text{s} \) from 116 \( \mu \text{s} \) due to the fact that the VLs \( v1 \) and \( v2 \) have no influence on each other according to our model. It is worth noting that LS can also help to eliminate certain impossible scenarios in switches to further improve jitter estimation as presented in [21]. This feature is taken into account in the case study presented in Section V.

**B. Transmission Latency Difference Minimization**

It can be seen from (6) that the transmission latency difference between two continuous frames defined in (5) is another factor that may cause sequence inversion. Thus, we consider a scheme aiming at reducing the second term on the left-hand side of the inequality (6) by TLDM.

In traditional delay analysis, much attention has been paid to the MFS, as the minimum length has no effect on the jitter upper bounds. Normally, the default minimum length predetermined by the specification is assigned to each VL. In fact, this makes the transmission latency difference even larger according to (5). In the worst case, the frames with the MFS and the frames with minimum length are delivered alternately as shown in Fig. 8. In this scenario, half the received frames experience the worst-case transmission latency, which increases the occurrence probability of the sequence inversion phenomenon.

Based on (5), for a predefined VL routing scheme, the transmission latency mitigation can be formulated as an optimization problem aimed at minimizing the maximum size difference between two continuous frames:

\[
\min_i \max (L(i) - L(i+1))^+, \tag{13}
\]

where the wrap-around operation \( i \oplus 1 \) is defined in (1). It can be further simplified as the following problem:

\[
\min_i (L_{\text{max}}(i) - L_{\text{min}}(i+1)). \tag{14}
\]

Obviously, the optimal value of (13) and (14) is zero. It can be achieved when every frame in a VL is set to the identical frame size, \( L_{\text{max}} = L_{\text{min}} \). However, the configuration for each VL in practice cannot be simply assigned in such a way due to diverse requirements and data source types. In this case, the transmission latency difference can be mitigated by properly selecting the value of \( L_{\text{min}} \), and both (13) and (14) are upper bounded by \( L_{\text{max}} - L_{\text{min}} \). Even though the optimum of (13) or (14) is not achieved, the TLDM helps control the transmission latency difference by carefully selecting \( L_{\text{min}} \). Therefore, this approach contributes to satisfy the inequality (6) so that the sequence inversion can be avoided.

To illustrate how TLDM contributes to reduce the transmission latency difference between two consecutive frames, we consider a case in which a VL has a MFS of 600 bytes and a default minimum length of 64 bytes. The VL traverses 2 switches to reach its destination. Then the transmission latency difference can be up to \( \frac{(600-64)\times 8}{128} \times 3 = 128.64 \mu \text{s} \), if \( C = 100 \text{ Mbps} \). When \( L_{\text{min}} = 500 \text{ bytes} \), the upper bound of transmission latency difference can be reduced to 24 \( \mu \text{s} \), less than 20% compared with 128.64 \( \mu \text{s} \). The optimal value of transmission latency difference is zero and it can be achieved with \( L_{\text{min}} = 600 \text{ bytes} \). Since the minimum frame length is not used during the worst case delay analysis, enforcing \( L_{\text{max}} = L_{\text{min}} \) does not change the performance of the network in the worst case. This example confirms that the specification of frame size has an impact on the transmission reliability and should be carefully designed.

Design rules allowing improving transmission reliability can be generally given as follows:

- assign identical or similar frame size for all the frames in a VL;
- if the message is too large and needs to be fragmented, assign an equal size to each fragment;
- if Sub-VL aggregation is performed as in [12] to optimize bandwidth utilization, the pre-processing is required first to adjust Sub-VLs with similar frame size into a group. Then Sub-VL aggregation strategy is applied to each group to avoid large transmission latency differences.
C. Discussion of LS and TLDM

The LS approach aims at mitigating the impossible interference between VLs by considering the synchronization mechanism in source ESs. However, it also adds latencies by introducing offsets for periodic VLs in general. In this case, a trade-off should be made by considering the practical constraints and design preferences. Furthermore, in the analysis of the LS approach, release jitters given by specific applications for periodic VLs should be taken into account, which is one of our directions for future work. In this case, the release time difference, which leads to the worst-case residual bytes, should be applied to handle this issue.

The TLDM approach focuses on the size difference between two continuous frames, which can be reduced by properly assigning the minimum frame size, \( L_{\text{min}} \), of a VL. As this approach does not change the maximum frame size, the worst-case performance will not be affected. However, as the data carried by VLs are generated by different functions, padding data is required when the data size is smaller than \( L_{\text{min}} \), which will introduce an overhead for data transmission.

In conclusion, these two methods can be applied separately or in combination to achieve a better performance. It is worth noting that there is no conflict between the grouping technique and LS/TLDM. Indeed, they can be employed jointly as illustrated in the case study in Section V.

V. CASE STUDY

In this section, the proposed approaches are illustrated by a case study with a network shown in Fig. 9, which is adapted from a benchmark configuration reported in [21], [27], [30], [41]–[43] while including more VLs. The VL parameters are specified referring to the realistic cases in the references that are given in Table II and Table III, in which VL1-8 are periodic VLs and each period \( T \) is equal to its BAG. As the VL parameters highly depend on the application requirements in realistic AFDX networks, they must be specified on a case-by-case basis.

In this case study, we assume that the physical link offers a constant rate \( C = 100 \) Mbps. Suppose that VL1 is the data flow of interest. First, the end-to-end jitter upper bound obtained from the staircase model is 1.248 ms, in which the grouping technique is also taken into account. As the LS is not applied at this step, VL1 is assumed to be influenced by other VLs that share the same output ports either in source ES or in switches. In addition, the minimum frame size of VL1, \( L_{\text{min}} \), is assigned to 64 bytes as default. Since VL1 traverses two switches in its communication path, the transmission latency difference in the worst case can be obtained with \( (L_{\text{max}} - L_{\text{min}}) \times 8/C \times n \), where \( n = 3 \). In this scenario, the maximum transmission latency difference is 128.64 \( \mu \)s, which is more that 10% of its BAG. Considering a transmission latency difference of 128.64 \( \mu \)s, the worst-case delay difference can be up to 1.377 ms, which clearly exceeds its BAG, the safe upper bound. In the following analysis, the approaches presented in Section IV are applied step by step to mitigate the delay differences.

The LS focuses on the periodic VL1-8. According to Table II, VL1 is always ahead of VL2-8, then the temporal interval between each pair is calculated based on (7) and listed in Table IV, in which the required transmission time for VL2-8 is also given. We further compute the residual bytes, which may introduce a jitter into VL1. The calculation is based on (9). In this example, \( M_{\text{vl}}^{(i)} = 0 \), where \( l=7 \). In other words, VL2-8 have sufficient time to be delivered before the arrival of VL1 and hence, they have no impact on VL1 in terms of jitter. The jitter upper bound can be further improved by reducing the number of involved VLs. The obtained result is 0.974 ms, which is less than its BAG. In this case, its burst does not introduce jitter in the worst case when the staircase arrival curve model is employed. Therefore, the end-to-end jitter could be reduced by 0.050 ms, and then the upper bound becomes 0.924 ms.

<table>
<thead>
<tr>
<th>TABLE III: Parameters of aperiodic VLs</th>
</tr>
</thead>
<tbody>
<tr>
<td>VL</td>
</tr>
<tr>
<td>BAG (ms)</td>
</tr>
<tr>
<td>( \sigma ) (byte)</td>
</tr>
<tr>
<td>Number of Hops</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>TABLE IV: Temporal Interval between Frames and the Transmission Time Requirement (in ( \mu )s)</th>
</tr>
</thead>
<tbody>
<tr>
<td>VL Pairs ((i, j))</td>
</tr>
<tr>
<td>( D_{i,j} )</td>
</tr>
<tr>
<td>( \sigma_{ji} / C )</td>
</tr>
</tbody>
</table>

Till now, although a large improvement has been achieved, the requirement cannot be met when considering the fixed transmission latency difference of 128.64 \( \mu \)s in the worst case.
The sum of the jitter and the latency difference is 1.053 ms, which is very close to the safe upper bound.

Thereafter, the TLDM is applied. As illustrated in Section IV-B, the fixed transmission latency difference can be improved by more than 80% if $L_{\text{min}}$ is 500 bytes, and then the transmission delay difference is 0.948 ms < 1 ms. The optimal result for transmission latency difference is zero, when the VL guarantees that all the frames have an identical frame size. With either of the two configurations, it can be verified that the transmission delay difference will not exceed the BAG and the sequence inversion will never happen for VL1.

Finally, the delay differences in the worst case for all other VLS are computed using the staircase model and the approach based on LS and TLDM, respectively. The improvement with TLDM is achieved under the condition that the frame size difference is restricted within 100 bytes. As shown in Fig. 10, there is a potential risk of failures for the redundant transmission of VL1 and VL36-VL40, as the delay differences obtained based on the staircase model are larger than their BAGs (1 ms). When the approaches of LS and TLDM are applied, the delay differences for all the VLS meet the condition (6). Ultimately, the reliability of AFDX networks is enhanced, as the failures due to sequence inversion have been eliminated, according to the analysis presented in Section III-C.

It is worth noting that ultimately, one can assign a VL to each application. Therefore, the constraints on frame size difference can always be satisfied by adding VLS. In essence, this amounts to a trade-off between the reliability and the bandwidth utilization efficiency. Furthermore, the work presented in this paper is based on network calculus and hence, it can be easily extended to the analysis of networks with different topologies, size, and configurations. Furthermore, the LS is applied in a source ES and TLDM considers one VL at a time. Thus, both of them can be applied in scalable networks without any difficulty, while providing a consistent performance.

VI. Conclusion

In this paper, sequence inversion, a potential failure source in the redundant transmission management of AFDX networks is addressed, and a quantitative analysis of this phenomenon is carried out. It has been found that the main reasons for the sequence inversion phenomenon in redundant networks are the jitter and frame size differences. In order to eliminate the resulting potential failures, two approaches are developed. They allow tightening jitter estimation by reducing the number of VLS involved and diminishing transmission latency differences. A case study is carried out to illustrate the proposed approaches. The results confirm that the developed approaches are feasible and effective.

It is worth noting that the focus of the present work was put on enhancing the reliability of AFDX networks. In future work, the degree of automation of the proposed methods will be considered through the development and the implementation of suitable tools. Furthermore, more scheduling policies, e.g., fixed priority scheduling, can be considered to tighten the jitter estimation and then to prevent potential failures.

ACKNOWLEDGMENT

The authors thank the Editor-In-Chief, the Associate Editor, and the anonymous reviewers for their constructive comments, which helped to improve the quality and the presentation of this paper. This work was supported in part by CSC, NSERC-CRIAQ CRD project AVIO402, MITACS Acceleration Quebec program, MDEIE-CRIAQ Quebec-China project, and the industrial partners Thales Canada Inc. and Bombardier Aerospace.

REFERENCES


**Meng Li** received the B.E. and M.S. degrees in electronic engineering from Beijing University of Aeronautics and Astronautics, Beijing, China, in 2004 and 2007, respectively. He received the Ph.D. degree in electrical engineering from Polytechnique Montréal, Montréal, QC, Canada, in 2016. Since June 2016, he has been with the Department of Electrical Engineering, Polytechnique Montréal, Montréal, QC, Canada, as a Postdoctoral Fellow. His research interests include fault tolerance, parallel computing, task scheduling, communication networks, and real-time systems.

**Guchuan Zhu** (M’07–SM’12) received the M.S. degree in electrical engineering from Beijing Institute of Aeronautics and Astronautics, Beijing, China, in 1992; the Ph.D. degree in mathematics and control from École des Mines de Paris, Paris, France, in 1992; and the graduate Diploma degree in computer science from Concordia University, Montréal, QC, Canada, in 1999.

Dr. Zhu joined École Polytechnique de Montréal, Montréal, QC, Canada, in 2004, where he is currently a Professor in the Department of Electrical Engineering. His current research interests include control of distributed parameter systems, nonlinear and robust control, and optimization with their applications to microsystems, aerospace systems, communication networks, and smart grid.
Yvon Savaria (S’77, M’86, SM’97, F’08) received the B.Ing. and M.Sc.A. degrees in electrical engineering from École Polytechnique Montreal in 1980 and 1982 respectively. He also received the Ph.D. in electrical engineering in 1985 from McGill University. Since 1985, he has been with Polytechnique Montreal, where he is currently professor in the department of electrical engineering.

He has carried work in several areas related to microelectronic circuits and microsystems such as testing, verification, validation, clocking methods, defect and fault tolerance, effects of radiation on electronics, high-speed interconnects and circuit design techniques, CAD methods, reconfigurable computing and applications of microelectronics to telecommunications, aerospace, image processing, video processing, radar signal processing, and digital signal processing acceleration. He is currently involved in several projects that relate to aircraft embedded systems, green IT, wireless sensor network, virtual network, computational efficiency and application specific architecture design. He holds 16 patents, has published 124 journal papers and 396 conference papers, and he was the thesis advisor of 155 graduate students who completed their studies.

He was program co-chairman of ASAP2006 and the general co-chair of ASAP2007. He has been working as a consultant or was sponsored for carrying research by Bombardier, CNRC, Design Workshop, DREO, Ericsson, Genesis, Gennum, Huawei, Hyperchip, ISR, LTRIM, Miranda, MiroTech, Nortel, Octasic, PMC-Sierra, Technocap, Thales, Tundra and VXP. He is a member of the Regroupement Stratégique en Microélectronique du Québec (RESMIQ), of the Ordre des Ingénieurs du Québec (OIQ), and was a member of CMC Microsystems board since 1999 and chairman of that board from 2008 to 2010. He was awarded in 2001 a Tier 1 Canada Research Chair (www.chairs.gc.ca) on design and architectures of advanced microelectronic systems that he held until June 2015. He also received in 2006 a Synergy Award of the Natural Sciences and Engineering Research Council of Canada.

Michaël Lauer is an associate professor in real-time systems at University of Toulouse. He is conducting research at LAAS-CNRS in the Dependable Computing and Fault Tolerance group (TSF) since 2013. His areas of interest are real-time and resilient systems, adaptive fault-tolerance, and multi/multicores processors. He received the diploma of Engineering from the Institut National Polytechnique de Toulouse (ENSEEIHT) in Telecommunications and Networks. He received the Leopold Escande Award in 2012 for his PhD work on the verification of real-time requirements in critical embedded systems. He conducted postdoctoral research at Polytechnic School of Montreal on the improvement of the criticality level of AFDX networks and the use of time-triggered architecture in modern avionics. He is contributing to several research projects with industrial partners especially in the avionics and automotive domains.