
7.2 Embedded Self-diagnostics

The embedded self-diagnostics aggregates statistics from different protocol layers. As these statistics are typically needed for the node's operation anyway, the self-diagnostics incurs only a small program and data memory overhead. The diagnostics data is passed to the gateway at the application layer using the underlying protocol stack. Thus, the diagnostics does not require changes to the communication protocols.

As not all of the self-diagnostics information may be needed at the same time, the diagnostics data is divided into several independent categories. Each category is associated with a collection period that determines how often its diagnostics data is sent to the gateway. Only the categories of interest are collected. In addition, different nodes may be instructed to collect different sets of diagnostics. Thus, the overhead, energy usage, and the impact on other traffic can be minimized with selective diagnostics collection, making the approach feasible for resource-constrained WSNs. In practice, each category is transmitted to the gateway in a separate packet.

The categories and collected statistics are summarized in Table 14.
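As a rough illustration of how the selective, per-node collection could be configured, the following C sketch defines a per-category configuration table. The enumeration follows the categories of Table 14, while the type and field names and the period unit are assumptions for the example rather than the actual implementation.

#include <stdbool.h>
#include <stdint.h>

/* Diagnostics categories, following Table 14. */
enum diag_category {
    DIAG_NODE_INFO,
    DIAG_NET_NODE_EVENTS,
    DIAG_NETWORK_TOPOLOGY,
    DIAG_CLUSTER_TRAFFIC,
    DIAG_LINK_TRAFFIC,
    DIAG_ACTIVITY,
    DIAG_ROUTE,
    DIAG_ROUTING_LATENCY,
    DIAG_SOFTWARE_ERRORS,
    DIAG_CATEGORY_COUNT
};

/* Per-category collection configuration on a node (illustrative layout). */
struct diag_config {
    bool     enabled;              /* collect this category on this node      */
    uint16_t collection_period_s;  /* how often the category packet is sent   */
};

static struct diag_config diag_config[DIAG_CATEGORY_COUNT];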

7.2.1 Node information

Node information includes generic performance statistics and allows detecting performance problems that manifest as increased queue usage, node reboots, route changes, or network scans. Thus, the node diagnostics can always be active and used to switch on other, more extensive diagnostics when a symptom of misbehavior is detected.

In addition to the performance diagnostics, node information includes a remaining energy estimate used to determine when to replace the batteries. For practical reasons, the prototype implemented this with battery voltage. While the voltage indicates when the battery is about to deplete, it is inaccurate as a lifetime estimator due to the non-linear relation between voltage and remaining energy. Thus, in many cases it would be preferable if a node could estimate its remaining lifetime as a percentage or as a real-time value.

7.2.2 Network and node events

Events assign a reason for specific outcomes, e.g. a network scan event occurred because a next hop link was lost. This information is crucial for detecting and analyzing problems that cannot be expressed as simple counters.


Table 14. Embedded self-diagnostics information grouped by category.

Category                  Statistic         Description
Node information          Voltage           Latest voltage measurement
                          Queue statistics  Average and maximum queue usage and delays
                          Role              Cluster head or member node
                          Boots             Boot counter
                          Network scans     Network scan counter
                          Route changes     Cumulative number of route changes
Network and node events   Event             The descriptor of an occurred event
                          Reason            A reason for the event
Network topology          Neighbor          Neighbor identifier (e.g. unique address)
                          Link quality      Link quality indication
                          Channel           Frequency that the neighbor operates on
                          Sleep schedule    Duty cycle timing relative to the sender
Cluster traffic           Channel usage     Average and maximum channel usage
                          RX/TX counters    The number of attempted and failed operations
Link traffic              Neighbor          Neighbor identifier
                          RX/TX counters    The number of attempted and failed operations
Activity                  MCU activity      Time spent in active and idle states
                          Radio states      Time spent in RX, TX, idle, and sleep modes
Route                     Path              List of forwarding nodes
Routing latency           Latency           End-to-end latency
                          Energy            Consumed energy to forward a packet
                          Hop count         Number of unique hops to the sink
Software errors           Boot reason       Last boot reason: assertion, low voltage, ...
                          Call stack        List of function addresses

7.2.3 MCU and transceiver activity

The activity diagnostic expresses the fraction of time spent in the MCU active (t_{mcu}), radio reception (t_{rx}), and radio transmission (t_{tx}) states. It allows detecting unusually high transceiver or controller activity that might indicate other problems. In addition, the activity diagnostic allows estimating the average power consumption of a sensor node when the static power consumptions of the different operation modes are known. This approach is similar to [41] but is extended here to allow remote diagnostics. The average power consumption P is calculated as

P = t_{mcu} \cdot P_{mcu} + (1 - t_{mcu}) \cdot P_{sleep} + t_{rx} \cdot P_{rx} + t_{tx} \cdot P_{tx} + (1 - t_{rx} - t_{tx}) \cdot P_{off},   (13)

where P_{mcu}, P_{sleep}, P_{rx}, P_{tx}, and P_{off} are the static power consumptions of the MCU active, MCU sleep, radio reception, radio transmission, and radio off states, respectively. These are platform-specific constants and can be stored e.g. in the diagnostics database, thus reducing overhead.
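As a gateway-side illustration of Eq. (13), the following C sketch computes the average power from the reported activity fractions. The constant values and function name are assumptions for the example; the actual platform-specific constants would be taken from the diagnostics database.

#include <stdio.h>

/* Static power consumptions in milliwatts. Placeholder values only; the
 * real constants depend on the node hardware. */
#define P_MCU_MW    6.0     /* MCU active         */
#define P_SLEEP_MW  0.01    /* MCU sleep          */
#define P_RX_MW     60.0    /* radio reception    */
#define P_TX_MW     75.0    /* radio transmission */
#define P_OFF_MW    0.002   /* radio off          */

/* Average power consumption according to Eq. (13), given the fractions of
 * time spent in MCU active, radio RX, and radio TX states as reported by
 * the activity diagnostic. */
static double average_power_mw(double t_mcu, double t_rx, double t_tx)
{
    return t_mcu * P_MCU_MW + (1.0 - t_mcu) * P_SLEEP_MW
         + t_rx * P_RX_MW + t_tx * P_TX_MW
         + (1.0 - t_rx - t_tx) * P_OFF_MW;
}

int main(void)
{
    /* Example: 2 % MCU activity, 1 % RX, and 0.5 % TX duty cycle. */
    printf("P = %.3f mW\n", average_power_mw(0.02, 0.01, 0.005));
    return 0;
}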

7.2.4 Route and routing latency

The route diagnostic describes end-to-end data forwarding. It contains the routing path as a list of node addresses, which each forwarding node updates. The information can be used to detect unusually long routes and to understand how the routes change over time. Routing latency describes the end-to-end latency. Each forwarding node n updates the routing latency t_n as

t_n = t_{n-1} + (t_1 - t_0) + T_{toa},   (14)

where t_{n-1} is the latency in the packet received from the previous hop n-1, t_1 is the forwarding time, t_0 is the reception time, and T_{toa} is the time-on-air that is estimated from the transceiver data rate and packet length. The latency information is a part of the route diagnostic but could also be piggybacked on data packets for continuous latency monitoring.
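A minimal C sketch of the per-hop update in Eq. (14) is given below, assuming microsecond timestamps and an illustrative naming; neither the function names nor the 250 kbit/s example rate are taken from the actual implementation.

#include <stdint.h>

/* Per-hop routing latency update following Eq. (14). All times are in
 * microseconds. */
static uint32_t update_routing_latency(uint32_t t_prev,  /* t_{n-1} carried in the received packet */
                                       uint32_t t_rx,    /* t_0: local reception time              */
                                       uint32_t t_fwd,   /* t_1: local forwarding time             */
                                       uint32_t t_toa)   /* estimated time-on-air                  */
{
    return t_prev + (t_fwd - t_rx) + t_toa;
}

/* Time-on-air estimate from the packet length and transceiver data rate,
 * e.g. 250000 bit/s for an IEEE 802.15.4 radio (assumed value). */
static uint32_t time_on_air_us(uint16_t packet_bytes, uint32_t data_rate_bps)
{
    return ((uint32_t)packet_bytes * 8u * 1000000u) / data_rate_bps;
}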

7.2.5 Cluster and link traffic

The hop-by-hop traffic is described with attempt and success counters for receptions and transmissions, which allows calculating link reliabilities and estimating the used bandwidth. The cluster traffic diagnostic describes the aggregate traffic flowing in and out of a node. The link traffic diagnostic is more descriptive as it maintains separate counters for each neighbor, but it has a higher overhead since the number of links is typically higher than the number of clusters in the network. Depending on the required level of detail, either diagnostic may be switched on.
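For example, the transmit reliability of a link or a cluster could be derived at the gateway from the reported counters roughly as in the following sketch; the structure layout is an assumption, not the actual diagnostics packet format.

#include <stdint.h>

/* RX/TX counters as reported by the cluster or link traffic diagnostic
 * (illustrative layout). */
struct traffic_counters {
    uint32_t tx_attempted;
    uint32_t tx_failed;
    uint32_t rx_attempted;
    uint32_t rx_failed;
};

/* Transmit reliability as a fraction in [0, 1]. */
static double tx_reliability(const struct traffic_counters *c)
{
    if (c->tx_attempted == 0)
        return 1.0;  /* no transmissions during the collection period */
    return 1.0 - (double)c->tx_failed / (double)c->tx_attempted;
}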


7.2.6 Network topology

Network topology describes the structure of the network and is essential when trying to determine how problems in one node can affect the rest of the network.

The neighbor information includes link quality (e.g. Received Signal Strength Indicator (RSSI)), channel, and sleep schedules. The link quality information approximates the relative distance between nodes, thus allowing a more accurate comprehension of the topology. In addition, the information allows detecting when a node has only low quality links, which requires user interaction to add new nodes in the vicinity to ensure reliable data forwarding.

The sleep schedules relate to the low duty cycle operation. Assuming that each node receives data during its own active period and forwards the data during the active period of its next hop neighbor, duty cycling incurs a significant forwarding delay. Considering these delays, it is possible to calculate the optimal routing delay between a node and the gateway. This allows detecting performance problems when the optimal delay is compared against the actual diagnosed delay.
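A simplified sketch of such a calculation is shown below, assuming a common access cycle length and per-hop schedule offsets taken from the sleep schedule diagnostic; the names, units, and the summation itself are an illustration of the idea, not the actual algorithm.

#include <stdint.h>

/* Lower bound for the end-to-end forwarding delay over a duty-cycled route:
 * at each hop the packet waits until the start of the next hop's active
 * period. offset_ms[i] is the start of hop i's active period relative to
 * the previous hop's, and access_cycle_ms is the common cycle length. */
static uint32_t optimal_route_delay_ms(const uint16_t offset_ms[],
                                       uint8_t hops,
                                       uint16_t access_cycle_ms)
{
    uint32_t delay = 0;
    for (uint8_t i = 0; i < hops; i++)
        delay += offset_ms[i] % access_cycle_ms;  /* wait for next active period */
    return delay;
}

A diagnosed end-to-end delay clearly above this optimum would then point to a performance problem such as retransmissions or congestion on the route.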

7.2.7 Software errors

Due to the resource constraints and the tight coupling between software and hardware, embedded WSN devices are typically programmed with C or assembly languages that lack the advanced exception handling and memory overwrite protection of higher level languages. As a result, embedded programming is error prone and some errors surface only in the actual deployments as the environment or network composition differs from the testing phase.

The proposed software diagnostics indicates the reason and the place in the code where the problem occurred. As the information needs to be transmitted only when a problem occurs, bandwidth is typically not required and the approach can be used in actual deployments to catch errors not found during testing. Two types of programming errors are covered. First, the diagnostics provides information on serious errors that prevent the execution of the embedded software, such as memory corruption, stack overflows, hardware failures, and other unexpected or unhandled events. This is realized by placing assertion statements in the code. Second, the self-diagnostics allows detecting infinite code loops with a software watchdog timer. Unlike the typical approach of using a hardware watchdog timer to reboot the device, this approach allows identifying the problematic code segment.


If the boolean condition assigned to an assertion is false or the software watchdog timer triggers, the self-diagnostics reboots the device to ensure a clean state, as presented in Fig. 31. The diagnostics requires persistent memory, such as EEPROM, to maintain information while the node reboots. The persistent memory holds an incremental boot counter, the last boot reason, and the call stack of the executed program. This information is transmitted to the gateway after a boot.
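The following C sketch illustrates how the assertion half of this mechanism could be realized. The platform hooks (eeprom_store_boot_info, capture_call_stack, platform_reboot) are hypothetical names for the EEPROM write, call stack capture, and reset primitives; they stand in for the actual platform functions.

#include <stdint.h>

/* Boot reasons stored in persistent memory (illustrative values). */
#define BOOT_REASON_ASSERT       1u
#define BOOT_REASON_SW_WATCHDOG  2u

/* Hypothetical platform hooks; their implementations are platform specific. */
void eeprom_store_boot_info(uint8_t reason, const void *stack, uint8_t depth);
void capture_call_stack(void *stack, uint8_t *depth);  /* list of function addresses */
void platform_reboot(void);

/* Assertion that records the failure reason and call stack to persistent
 * memory and reboots to a clean state; the stored information is sent to
 * the gateway after the boot. */
#define DIAG_ASSERT(cond)                                             \
    do {                                                              \
        if (!(cond)) {                                                \
            void *stack[8];                                           \
            uint8_t depth = 0;                                        \
            capture_call_stack(stack, &depth);                        \
            eeprom_store_boot_info(BOOT_REASON_ASSERT, stack, depth); \
            platform_reboot();                                        \
        }                                                             \
    } while (0)

The software watchdog case would store BOOT_REASON_SW_WATCHDOG in the same way from the timer's expiry handler before resetting the device.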