[OSPL-Dev] Losing heartbeats and removing from reliability protocol
Francis, Raymond
2011-07-01 12:54:42 UTC
Dear Community,

I was wondering if anyone could explain what causes heartbeats to be lost
from certain nodes, and their subsequent removal from the reliable protocol.
What does "removing node from reliable protocol" really mean? From what I
see, it seems to prevent the removed node from receiving any more data from
the node that removed it.

We have other applications using the same network, and they all work fine.
I suspect it could be issues with switcher, as DDS within a VLAN seems to
hold, but crossing to a VLAN in another switcher fails. I have also added
the Reconnection tag to the General tag for the networking service, but it
seems that once the reliable protocol fails and a node is seemingly lost,
there is no way of recovering it.

Raymond Francis
Hans van't Hag
2011-07-01 13:22:25 UTC

Hi Ray,

In our native-networking reliability protocol, we expect acknowledgements
from participating nodes when they receive packets of reliable data. When a
node "voluntarily" stops participating in that protocol, everything is fine
(as it will let the other nodes know). If a node becomes unresponsive (i.e.
does NOT acknowledge its data "in time"), its "non-participating" status is
determined by means of a heartbeat mechanism.
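
To make the idea concrete, here is a minimal sketch of heartbeat-based
removal. It is an illustration only, not the OpenSplice implementation: the
class name, the `heartbeat_interval` and `max_missed` parameters, and the
node names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ReliablePeerTable:
    """Toy model of the set of nodes taking part in reliable delivery."""

    heartbeat_interval: float = 1.0  # hypothetical: seconds between heartbeats
    max_missed: int = 3              # hypothetical: missed heartbeats tolerated
    last_seen: dict = field(default_factory=dict)  # node id -> last heartbeat time

    def on_heartbeat(self, node: str, now: float) -> None:
        # Any heartbeat (re)registers the node as alive.
        self.last_seen[node] = now

    def leave(self, node: str) -> None:
        # Voluntary departure: the node announced it, so simply drop it.
        self.last_seen.pop(node, None)

    def expire(self, now: float) -> list:
        # Remove every node that has been silent longer than the allowed window.
        window = self.heartbeat_interval * self.max_missed
        dead = [n for n, seen in self.last_seen.items() if now - seen > window]
        for n in dead:
            del self.last_seen[n]
        return sorted(dead)

    def participants(self) -> list:
        return sorted(self.last_seen)


table = ReliablePeerTable()
table.on_heartbeat("node-a", now=0.0)
table.on_heartbeat("node-b", now=0.0)
table.on_heartbeat("node-a", now=2.5)  # node-a keeps sending, node-b goes silent
removed = table.expire(now=3.5)        # node-b has exceeded the 3.0 s window
```

In this model a removed node is no longer waited on for acknowledgements,
which matches the observation that it stops receiving reliable data from the
node that removed it.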



The configurable topology-discovery mechanism of OpenSplice's
native-networking protocol allows for timely/swift discovery of remote-node
presence/absence, whereas the configurable "reactivity-control" mechanisms
(related to retransmission timing for each individual priority band in
OpenSplice) allow for detection of prolonged reactivity issues (like
low-priority reliable network channels not getting enough processing
resources to acknowledge the data they receive in time).



So the issue you observe could be either a topology/connectivity issue or a
channel/reactivity issue. The solution could be as easy as "relaxing" the
discovery timing, or as complex as fixing structural switch issues and/or
reconsidering the configured flow control for each priority band in use, so
that a node always has enough processing resources available, in time, to
handle your reliable priority bands.



Not sure what you mean by "issues with switcher".



The "reconnect" option typically applies to recovering from reactivity
issues rather than communication/topology issues, so based on the
information provided I can't say what the problem and/or the solution is.



We would of course be happy to provide some consultancy on your specific
issues.



Thanks,

Hans







Hans van 't Hag

OpenSplice DDS Product Manager

PrismTech Netherlands

Email: hans.vanthag at prismtech.com

Tel: +31742472572

Fax: +31742472571

Gsm: +31624654078



PrismTech is a global leader in standards-based, performance-critical
middleware. Our products enable our OEM, Systems Integrator, and End User
customers to build and optimize high-performance systems primarily for
Mil/Aero, Communications, Industrial, and Financial Markets.