S2S connection availability & state recovery

Server-to-server (s2s) connections in XMPP are peculiar compared to client-to-server connections.
In the latter case, it is clearly understood that when the connection terminates, the user is considered logged out, but no such behaviour is specified for s2s connections.
In fact, the XMPP spec is silent about these connections, apart from how to establish a new one and the constraints on the stream.

Let us consider a simple example to illustrate what I am getting at:

1) ServerA hosting domainA has userA@domainA.
2) ServerB hosting domainB has userB@domainB.
3) Assume they have 'both' subscriptions to each other, so userA and userB can see each other's presence.
4) Further, assume that both are online as userA@domainA/resA and userB@domainB/resB.

So after step 4 we have:
a) ServerA has an outbound s2s connection to ServerB, over which it pushed userA's presence (and probed for userB).
b) ServerB has an outbound s2s connection to ServerA, over which it pushed userB's presence (and probed for userA).
Note that for s2s, all outbound stanzas go over the sender's own socket connection, so (a) and (b) are two different socket connections with XML streams in opposite directions.
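
To make the state after step 4 concrete, here is a minimal sketch in Python. The `Server` class, its fields and the tuple-shaped stanzas are all hypothetical and exist only for illustration; real servers exchange XML stanzas over TCP. The point it shows is that each side sends only on its own outbound stream.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical toy model of one XMPP server (illustration only)."""
    domain: str
    # outbound[remote_domain] -> stanzas we sent on OUR outbound stream
    outbound: dict = field(default_factory=dict)
    # remote_presence[full_jid] -> last status heard from a remote server
    remote_presence: dict = field(default_factory=dict)

    def send(self, remote: "Server", stanza: tuple) -> None:
        # The outbound stream is created on first use and carries only
        # stanzas travelling from this server to the remote one.
        self.outbound.setdefault(remote.domain, []).append(stanza)
        remote.receive(stanza)

    def receive(self, stanza: tuple) -> None:
        kind, sender, _target = stanza
        if kind == "presence":
            self.remote_presence[sender] = "available"


server_a = Server("domainA")
server_b = Server("domainB")

# (a) ServerA pushes userA's presence and probes for userB.
server_a.send(server_b, ("presence", "userA@domainA/resA", "userB@domainB"))
server_a.send(server_b, ("probe", "userA@domainA", "userB@domainB"))
# (b) ServerB does the same in the opposite direction, on its own stream.
server_b.send(server_a, ("presence", "userB@domainB/resB", "userA@domainA"))
server_b.send(server_a, ("probe", "userB@domainB", "userA@domainA"))
```

After this runs, each server holds exactly one outbound stream and one cached remote presence, which is the state that steps 5 and 6 then put at risk.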

Now comes the interesting part: what happens when the connections break (one or both)?
The spec is silent about this and leaves it as an implementation detail.
What makes it interesting is a follow-up to the scenario above, like this:

5) ServerA and ServerB break connection (assume both for simplicity).
After some time, one of the following happens:
6.a) userA logs out.
6.b) ServerA crashes.
6.c) ServerA is temporarily unavailable (network issues, etc).

6.a will result in ServerA reopening a connection to ServerB and sending the user status update: the happy path.

The other two, 6.b and 6.c, have no easy solution, and each server implementation handles them in its own way (and usually not very appropriately, if I am not wrong).
This leads to the weird situation where a remote user is shown as online while he might not be, or is not reachable.
This question is particularly relevant since most XMPP servers have the 'feature' of terminating outbound/inbound connections after some time (usually when the connection has been idle for a while).
Which means step (5) is going to happen sooner or later.

Hence the simplistic logic used for client sessions (if broken, consider unavailable) definitely cannot be used.
Simplistic workarounds, such as sending a probe whenever the outbound connection is re-established, are also extremely expensive (if 6.b/6.c did not happen, then 6.a already keeps the servers in sync!).
Presence is usually the bulk of XMPP traffic, and this problem pertains directly to it.

Hence: when does a server refresh the status of a remote user (by sending a presence probe)?
How often does it send one?
Is there a better 'solution' to this problem?

These are interesting questions, and today they are answered only by implementation-specific logic, which can potentially be a strain on open federation for admins worried about s2s traffic, load and the like.

Comments:

I frequently run into the "phantom user" problem, since I run a server on my frequently-suspended home laptop behind an ADSL connection. Even just getting people added to my buddy list is a pain...

Posted by Robert Mibus on March 27, 2007 at 01:40 AM IST #

This is an interesting topic. I originally developed the S2S code in our server software (SoapBox) about 3 years ago and have had to explain it to others quite often -- it's really confusing and has a lot of nuances like you mention.


The way we have chosen to handle the "phantom user" problem, as Robert puts it, is to be optimistic, but use fatal delivery stanza errors as a sign that an entire domain is no longer available.


In your scenarios 6.b/c it is reasonable to assume that at some point an entity on ServerB is going to try to contact another entity on ServerA (or maybe, as you propose, a "ping" of some sort is sent on an interval). I think it is also reasonable to assume (without a ridiculous amount of traffic overhead) that the last presence received from ServerA before it crashed is valid, until such time that ServerB can no longer reach ServerA. When this happens ServerB should grab every entity that had sent presence from ServerA to ServerB and send out unavailable presence on behalf of that domain.


When ServerA re-establishes a connection with ServerB (which it most likely will do at some point when it becomes available again if it needs to send out probes), ServerB would probe ServerA for all presences that are prudent (i.e. for all bare JIDs that have broadcast presence and have the appropriate presence subscriptions). That's how it could work today.
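
As a rough illustration of the optimistic approach described above (this is not SoapBox's actual code), the sketch below keeps a per-domain presence cache on ServerB. The class and method names are hypothetical, and the subscription check mentioned above is left out for brevity, so every bare JID that ever broadcast presence gets re-probed.

```python
from collections import defaultdict


class RemotePresenceCache:
    """Hypothetical cache of presence learned over s2s, keyed by domain."""

    def __init__(self):
        # domain -> {full_jid: "available" | "unavailable"}
        self.by_domain = defaultdict(dict)

    def on_presence(self, full_jid, status):
        domain = full_jid.split("@", 1)[1].split("/", 1)[0]
        self.by_domain[domain][full_jid] = status

    def on_delivery_failure(self, domain):
        """A fatal delivery error: treat the whole domain as gone.

        Returns the full JIDs that flipped to unavailable, so the caller
        can send unavailable presence to local subscribers on their behalf.
        """
        changed = []
        for jid, status in self.by_domain[domain].items():
            if status == "available":
                self.by_domain[domain][jid] = "unavailable"
                changed.append(jid)
        return changed

    def on_reconnect(self, domain):
        """The domain is reachable again: list the bare JIDs to probe."""
        return sorted({jid.split("/", 1)[0] for jid in self.by_domain[domain]})


cache = RemotePresenceCache()
cache.on_presence("userA@domainA/resA", "available")
lost = cache.on_delivery_failure("domainA")   # ["userA@domainA/resA"]
to_probe = cache.on_reconnect("domainA")      # ["userA@domainA"]
```

Until a delivery actually fails, the cached presence is trusted as-is, which is exactly the window in which phantom users can appear.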


However, I think in the ideal world there would be a generic "probe all" request that a server can send when ServerB notices ServerA comes back online -- to avoid the double traffic issues that are present today.

Posted by JD Conley on March 27, 2007 at 02:16 AM IST #

@ Robert Mibus
This is exactly the side effect of what is described above.
You end up with users who are seen as online, but are unreachable or go offline as soon as you try to talk to them.


@ JD

Those were interesting thoughts.
The reason I did not pursue the ping idea much was basically because:
a) A variation of 6.b (especially with watchdogs, etc.) is that serverA goes down and comes back up before serverB 'detects' it, so a simple (re)connection attempt will succeed while the presence state is out of sync.
b) It means a server will need to contact the remote server(s) periodically. In the above case, suppose userB@serverB had only a 'to' subscription to userA@serverA; then there is no need for serverB to talk to serverA except for the ping.
But hopefully we can disregard (b) for most practical cases (?) ...

Also, 6.c has interesting side effects, but I guess you are right that we do need a 'ping all' between servers (the server could choose to respond to it judiciously, though).
Maybe we can adapt a scaled-down XEP-198 for this problem: if there has been no status change since the last update for a particular checkpoint (to include 6.a), return a noop response.
Otherwise, trigger a partial or full update (a partial update might be more expensive, since servers would now have to checkpoint presence updates).
If a 'ping all' is received after the server has set all the remote sessions to unavailable, it could use that as a trigger for sending probes and indicating the out-of-sync state.
In either case, a proprietary solution would be suboptimal; it would be great if there were some direction or standard for this problem (and, of course, one that gets reasonably widely implemented :-) )
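
To show what the checkpoint idea might look like, here is a small sketch. It is only loosely inspired by XEP-198 and does not follow any existing protocol; `PresenceLog`, `answer_ping_all` and the reply shapes are all made up for illustration. The sender numbers every presence change it pushes to a given remote domain, and the peer's 'ping all' carries the last number it saw.

```python
class PresenceLog:
    """Hypothetical sender-side log of presence pushed to one remote domain."""

    def __init__(self):
        self.checkpoint = 0   # incremented on every presence change
        self.changes = []     # (checkpoint, full_jid, status) tuples

    def record(self, full_jid, status):
        self.checkpoint += 1
        self.changes.append((self.checkpoint, full_jid, status))

    def answer_ping_all(self, remote_checkpoint):
        """Reply to a 'ping all' carrying the peer's last seen checkpoint."""
        if remote_checkpoint == self.checkpoint:
            # Nothing changed since the peer's last update.
            return ("noop", [])
        if 0 <= remote_checkpoint < self.checkpoint:
            # Peer missed some updates: replay only those.
            missed = [c for c in self.changes if c[0] > remote_checkpoint]
            return ("partial", missed)
        # Peer claims a checkpoint we never issued, e.g. we restarted and
        # lost our log (the variation of 6.b): send the full current state.
        latest = {}
        for _, jid, status in self.changes:
            latest[jid] = status
        return ("full", sorted(latest.items()))


log = PresenceLog()
log.record("userA@domainA/resA", "available")
log.record("userA@domainA/resA", "unavailable")

in_sync = log.answer_ping_all(2)
behind = log.answer_ping_all(1)
restarted = log.answer_ping_all(7)
```

The partial path is where the extra cost shows up: the sender has to retain the change log per remote domain rather than just the latest state.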

Whatever solution emerges for this problem, one thing to keep in mind is that you can have multiple inbound streams from a domain even though you might have only a single outbound stream to it.
What I am getting at: suppose serverA is a server pool with nodes serverA1, serverA2 and serverA3, and serverB is a single-node domain.
Then, from serverB's point of view, there will be multiple inbound streams from domainA.
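
One consequence, sketched below with hypothetical names, is that serverB cannot treat the loss of a single inbound stream as the loss of domainA. A simple way to model this is to track the set of open inbound streams per domain and react only when the last one closes. Whether that is the right trigger at all is an assumption of this sketch.

```python
from collections import defaultdict


class InboundStreams:
    """Hypothetical registry of open inbound s2s streams, per remote domain."""

    def __init__(self):
        self.streams = defaultdict(set)   # domain -> set of stream ids

    def opened(self, domain, stream_id):
        self.streams[domain].add(stream_id)

    def closed(self, domain, stream_id):
        """Return True only when no inbound stream from the domain remains."""
        self.streams[domain].discard(stream_id)
        return not self.streams[domain]


inbound = InboundStreams()
inbound.opened("domainA", "serverA1")
inbound.opened("domainA", "serverA2")

first = inbound.closed("domainA", "serverA1")    # another stream is still open
second = inbound.closed("domainA", "serverA2")   # last stream is gone
```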


Posted by Mridul on March 27, 2007 at 03:14 PM IST #

As XMPP is an open protocol I think 6b is highly likely. People will continue to build new servers, and soon we'll have spammers coming on and DoS attacks to worry about.

Yes, we need something standard in place. Perhaps a "Best practices" XEP.

True, you can definitely have multiple incoming streams and you have to keep that in mind. But I think the "un"-availability of a remote domain is only really relevant to each node in the domain (as they may have NIC failures, be located across the globe, etc) -- though that I'll contend is an implementation detail.

Posted by JD Conley on March 27, 2007 at 03:40 PM IST #

I've heard claims that stream compression helps with the problem of presence traffic. And they were not made on April, 1st.

Posted by Philipp Hancke on April 01, 2007 at 06:10 AM IST #

Hi Philipp,
Compression will reduce the actual number of bytes transferred, but the problem we have here is more systemic: presence traffic is quite high, especially in the s2s case where there is both a probe and a push.
With the problem mentioned above, this becomes worse, since we will need periodic probes over s2s (and will have to handle their responses).
We have been considering ways to minimise both aspects: a) the traffic due to s2s and b) how to handle unreliable s2s availability.
(b) is made worse by the fact that we can solve the problem between two deployments of our servers, but customers expect remote users to see their users' presence properly too (not just one way). That requires a consensus, and preferably a common workable approach to solving this problem that hopefully gets implemented widely.

Regards,
Mridul

Posted by Mridul on April 01, 2007 at 07:41 PM IST #
