I’ve been learning Golang in fits and starts, and it felt like it was time to build something tangible to develop more Go muscle. Searching for a suitable mini-project, I thought it might be a good idea to explain gossip protocols to myself better by actually writing a mini distributed system that coordinates information using gossip. It’s a technology I find fascinating, and I was always keen to see how it works in practice.

I’ve seen gossip protocols in the wild and used systems that rely on them (I’ve worked a lot with Cassandra, where gossip is the key to keeping nodes coordinated), so this post is my quick tour of what gossip is and a small, working implementation in Go.

Gossip (a.k.a. epidemic) protocols spread information through a cluster the way rumors spread through a crowd: each node periodically talks to a few random peers and exchanges state. There’s no central coordinator; instead, updates “diffuse” until (with high probability) everyone converges. Note that gossip does not really guarantee convergence, but more on that later.

The classic starting point is Epidemic Algorithms for Replicated Database Maintenance (Demers et al., PODC ’87), which introduced the core ideas behind modern gossip protocols.

The key mechanism of a gossip protocol is a simple, iterative process where nodes in a network share information with random peers, much like how rumors spread through a community. This decentralized approach ensures that information eventually propagates to all nodes without needing a central coordinator.

The entire process can be broken down into a few core components.

The Gossip Cycle

Each node in the network periodically executes a “gossip round” every few seconds. In each round, a node performs three basic steps:

  1. Select a Peer - The node randomly selects one or more of its peers from a list of known nodes in the network.
  2. Exchange Information - The node initiates a state exchange with the selected peer. They share their information and reconcile any differences.
  3. Update State - Based on the exchange, one or both nodes update their internal state with the newer information.

This cycle repeats continuously, ensuring that updates gradually spread throughout the network in waves.

Looking at it a slightly different way, the process comprises two functions: spreading new information (dissemination) and fixing inconsistencies (anti-entropy).

How Information Spreads: Dissemination

Dissemination is the process of spreading new updates through the network as quickly as possible. Think of it as sharing fresh, “hot” gossip. There are a few common patterns for this:

  • Push - A node actively sends its new information to a random peer. This is fast but can be wasteful if the peer already has the info.
  • Pull - A node asks a random peer for any updates it might be missing. This is more efficient when updates are infrequent.
  • Push-Pull - Nodes first exchange a summary of their data, then pull only the specific pieces they’re missing. This is often the most efficient method.

As we will see, our implementation uses a very simple push model. In each “gossip round,” a node sends its entire dataset to a random peer. This is a straightforward way to ensure new information spreads. To resolve conflicts when merging data, our code uses a “Last-Write-Wins” strategy, where the data with the newest timestamp is kept. More on this later, too.
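
To make the push-pull idea a bit more concrete, here is a minimal sketch of what a digest exchange could look like. It uses the PlayerState struct we’ll define later in this post; the Digest type and helper functions are hypothetical and are not part of the implementation that follows. The idea: the initiator sends only player-to-timestamp pairs, and the peer replies with just the entries it holds that are newer.

package server

import "time"

// Digest summarizes local state as playerID -> last-updated timestamp,
// which is much smaller than shipping the scores themselves.
type Digest map[string]time.Time

// BuildDigest produces the summary we would send to a peer.
func BuildDigest(state map[string]PlayerState) Digest {
	d := make(Digest, len(state))
	for id, p := range state {
		d[id] = p.Timestamp
	}
	return d
}

// Diff returns the local entries that are newer than (or missing from) a
// peer's digest - only these need to travel back over the wire.
func Diff(local map[string]PlayerState, remote Digest) map[string]PlayerState {
	out := make(map[string]PlayerState)
	for id, p := range local {
		if ts, ok := remote[id]; !ok || p.Timestamp.After(ts) {
			out[id] = p
		}
	}
	return out
}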

Keeping Everyone in Sync: Anti-Entropy

Anti-entropy is a background repair process designed to eliminate any differences between nodes that may have occurred over time (e.g., due to network errors or a node being temporarily offline). Its goal is to reduce disorder and ensure all nodes eventually converge on the same state, a concept known as eventual consistency.

Our implementation, as we will see, combines dissemination and anti-entropy into a single action: periodically pushing the full state. This approach is extremely simple to build and understand. It successfully spreads new information (dissemination) and corrects old inconsistencies (anti-entropy) at the same time. However, in practice it’s highly inefficient. Sending the entire dataset when only a tiny piece of information has changed creates a lot of unnecessary network traffic and doesn’t scale well. In a production system, these two functions are typically handled separately. New updates (deltas) are pushed immediately for rapid dissemination, while a more thorough anti-entropy sync (often using efficient methods like Merkle trees to find differences) runs less frequently in the background to guarantee convergence.
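
A full Merkle-tree sync is out of scope here, but even a crude digest check conveys the idea: hash the serialized state, compare hashes with the peer, and skip the full transfer when they match. This is only a hypothetical helper (again using the PlayerState type defined below), not something the code in this post wires in:

package server

import (
	"crypto/sha256"
	"encoding/json"
)

// StateDigest hashes the serialized state so two nodes can cheaply check
// whether they differ at all before exchanging it. encoding/json sorts map
// keys, so identical state yields identical digests. A real anti-entropy
// pass would use something finer-grained (e.g. a Merkle tree) so it can
// also locate which keys differ.
func StateDigest(state map[string]PlayerState) ([32]byte, error) {
	payload, err := json.Marshal(state)
	if err != nil {
		return [32]byte{}, err
	}
	return sha256.Sum256(payload), nil
}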

What We’re Building

To make these ideas concrete, we’ll model a small multiplayer game world with no central server. Each node keeps player scores and gossips periodic updates to random peers so the scores converge across the cluster - prioritizing high availability over perfect, instantaneous consistency.

We’re creating a distributed game-state service where multiple nodes maintain player scores and synchronize them via gossip. Think “decentralized leaderboard”: each server has its own view, and periodic peer exchanges drive the system toward the same final state. In our version of gossip, we will use timestamps as a way to version updates.

Setting Up the Project

Let’s start with our project structure.

mkdir gossiper
cd gossiper
go mod init gmathur.dev/gossiper

mkdir -p cmd/server
mkdir -p internal/server
mkdir -p internal/transport

The Core Data Model

First, let’s define what we’re synchronizing. We need to track player scores with timestamps for conflict resolution. Create internal/server/game_server.go:

package server

import (
    "log"
    "sync"
    "time"
)

// PlayerState represents a player's score with a timestamp for conflict resolution. This is what gets synced across
// nodes.
type PlayerState struct {
    Score     int       `json:"score"`
    Timestamp time.Time `json:"timestamp"`
}

// GameServer manages the distributed game state
type GameServer struct {
    nodeID      string
    address     string
    peers       []string
    playerState map[string]PlayerState
    mu          sync.RWMutex
}

Why timestamps? In distributed systems, we need a way to resolve conflicts when the same player’s score is updated on different nodes simultaneously. We’re using a Last-Write-Wins strategy - the update with the latest timestamp wins. This is simple but effective for gaming scenarios where the most recent action should take precedence.
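
To see what will actually travel between nodes, here is a tiny standalone snippet (the struct is repeated so it runs on its own) showing how a map of PlayerState values marshals to JSON - note that time.Time encodes as an RFC 3339 string:

package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// PlayerState is repeated here so this snippet runs by itself.
type PlayerState struct {
	Score     int       `json:"score"`
	Timestamp time.Time `json:"timestamp"`
}

func main() {
	state := map[string]PlayerState{
		"alice": {Score: 100, Timestamp: time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)},
	}
	payload, _ := json.Marshal(state)
	fmt.Println(string(payload))
	// Output: {"alice":{"score":100,"timestamp":"2024-01-01T12:00:00Z"}}
}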

Building the Game Server

Now let’s implement the core server logic. Add the constructor and state management methods:

func NewGameServer(id, addr string, peers []string) *GameServer {
	return &GameServer{
		nodeID:      id,
		address:     addr,
		peers:       peers,
		playerState: make(map[string]PlayerState),
	}
}

func (gs *GameServer) UpdatePlayerScore(playerId string, score int) {
	gs.mu.Lock()
	defer gs.mu.Unlock()

	gs.playerState[playerId] = PlayerState{
		Score:     score,
		Timestamp: time.Now(),
	}
	log.Printf("updated score for player %s. New score %d", playerId, score)
}

func (gs *GameServer) GetPlayerState() map[string]PlayerState {
	gs.mu.RLock()
	defer gs.mu.RUnlock()

	result := make(map[string]PlayerState, len(gs.playerState))
	for playerId, state := range gs.playerState {
		result[playerId] = state
	}

	return result
}

The key design decision here is using a read-write mutex (sync.RWMutex). This allows multiple goroutines to read the state simultaneously while ensuring exclusive access for writes. The copy in GetPlayerState hands callers a snapshot, so they can’t race with (or mutate) the server’s internal map.
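
If you want to convince yourself the locking holds up, one quick exercise is to hammer the server from a bunch of goroutines under the race detector. This is a hypothetical helper program, not part of the project; it would need to live inside the module (say, cmd/racecheck/main.go), since internal packages can’t be imported from outside it:

package main

import (
	"sync"

	"gmathur.dev/gossiper/internal/server"
)

func main() {
	gs := server.NewGameServer("node-test", "localhost:0", nil)

	var wg sync.WaitGroup
	// Concurrent writers all updating the same player...
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(score int) {
			defer wg.Done()
			gs.UpdatePlayerScore("alice", score)
		}(i)
	}
	// ...while concurrent readers take snapshots.
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = gs.GetPlayerState()
		}()
	}
	wg.Wait()
}

Running it with go run -race ./cmd/racecheck should come back clean; drop the locks and the race detector will complain.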

Implementing the Gossip Protocol

Here’s where the magic happens. Let’s add the gossip mechanism that periodically syncs with peers:

package server

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"math/rand"
	"net/http"
	"sync"
	"time"
)

func (gs *GameServer) Start() {
	go gs.gossipLoop()
}

func (gs *GameServer) gossipLoop() {
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()

	for range ticker.C {
		if len(gs.peers) == 0 {
			continue
		}
		peerAddr := gs.peers[rand.Intn(len(gs.peers))]
		gs.gossipWithPeer(peerAddr)
	}
}

func (gs *GameServer) gossipWithPeer(peerAddr string) {
	gs.mu.RLock()
	payload, err := json.Marshal(gs.playerState)
	gs.mu.RUnlock()

	if err != nil {
		log.Println("failed to marshal state for gossip:", err)
		return
	}

	url := fmt.Sprintf("http://%s/gossip", peerAddr)
	resp, err := http.Post(url, "application/json", bytes.NewBuffer(payload))
	if err != nil {
		// If we add logs here, we'll spam the logs a lot in case a peer is down
		// Todo: We'll add failure metrics here later
		return
	}

	defer resp.Body.Close()
}

Why random peer selection? This ensures that information spreads evenly through the network over time. If we always picked peers in order, we might create hotspots or leave some nodes isolated. The 2-second interval is a balance between quick propagation and avoiding network congestion.

This is a good point to actually visualize how information gets spread across a network of gossiping nodes. The default in the animation sets up a 7-node cluster, with each node randomly gossiping with one other peer each iteration (every 2 seconds), giving us a fan-out of 1.

The fan-out value of 1 is what we have implemented in our Go code. In a practical implementation you might need a higher fan-out. You can try varying the fan-out in the visualization, step through the progress, and see what happens. You will notice that as the fan-out increases, the number of time periods it takes for the system to converge (all nodes go green) drops dramatically.

We could increase the fan-out in a gossip system and reduce the convergence time, but that comes at a cost - it uses a lot more bandwidth, and many of the updates are wasted because the receiving node has often already gotten them from some other peer.

Note: The animation does not stop moving after convergence (all nodes go green). The nodes keep sharing state, but the system has no new updates to apply locally.
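
If you want to experiment with fan-out in the Go code as well, one simple variant of the gossip loop shuffles the peer list each round and contacts the first k peers. This is only a sketch living in the same server package (and using the same imports) as gossipLoop above; it isn’t wired into Start():

// Hypothetical variant of the gossip loop with a configurable fan-out.
func (gs *GameServer) gossipLoopWithFanout(fanout int) {
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()

	for range ticker.C {
		if len(gs.peers) == 0 || fanout <= 0 {
			continue
		}
		// Shuffle a copy of the peer list and contact the first `fanout` peers.
		shuffled := append([]string(nil), gs.peers...)
		rand.Shuffle(len(shuffled), func(i, j int) {
			shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
		})
		k := fanout
		if k > len(shuffled) {
			k = len(shuffled)
		}
		for _, peerAddr := range shuffled[:k] {
			go gs.gossipWithPeer(peerAddr)
		}
	}
}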

Merging States with Conflict Resolution

Now let’s implement the crucial merge logic that handles incoming state from peers:

// MergeState merges incoming state with local state using Last-Write-Wins
func (gs *GameServer) MergeState(incomingState map[string]PlayerState) {
    gs.mu.Lock()
    defer gs.mu.Unlock()

    for playerID, incomingPlayer := range incomingState {
        localPlayer, exists := gs.playerState[playerID]

        // If we don't have this player, or incoming is newer, update
        if !exists || incomingPlayer.Timestamp.After(localPlayer.Timestamp) {
            gs.playerState[playerID] = incomingPlayer
            log.Printf("Updated player %s score to %d (timestamp: %v)",
                playerID, incomingPlayer.Score, incomingPlayer.Timestamp)
        }
    }
}

This is where our Last-Write-Wins strategy comes into play. For each player in the incoming state, we check if our timestamp is older. If it is, we accept the incoming value. This simple rule ensures that all nodes eventually converge to the same state.

Adding the HTTP Transport Layer

Let’s create the HTTP endpoints for both internal gossip and external API calls. Create internal/transport/http.go:

package transport

import (
	"encoding/json"
	"fmt"
	"net/http"

	"gmathur.dev/gossiper/internal/server"
)

// Server dependencies for use by the HTTP server
type Server struct {
	gs *server.GameServer
}

func NewServer(gs *server.GameServer, _ any) *Server {
	return &Server{gs: gs}
}

func (s *Server) RegisterHandlers() {
	// API handlers
	http.HandleFunc("/gossip", s.HandleGossip)
	http.HandleFunc("/update", s.HandleUpdate)
	http.HandleFunc("/state", s.HandleGetState)
}

Let’s implement each handler:

func (s *Server) HandleGossip(w http.ResponseWriter, r *http.Request) {
	var incomingMap map[string]server.PlayerState
	err := json.NewDecoder(r.Body).Decode(&incomingMap)
	if err != nil {
		http.Error(w, "invalid request body", http.StatusBadRequest)
		return
	}

	// Merge incoming state with local state
	s.gs.MergeState(incomingMap)

	w.WriteHeader(http.StatusOK)
}

func (s *Server) HandleUpdate(w http.ResponseWriter, r *http.Request) {
	playerId := r.URL.Query().Get("playerId")
	scoreStr := r.URL.Query().Get("score")
	if playerId == "" || scoreStr == "" {
		http.Error(w, "missing playerId or score", http.StatusBadRequest)
		return
	}

	var score int
	if _, err := fmt.Sscanf(scoreStr, "%d", &score); err != nil {
		http.Error(w, "invalid score", http.StatusBadRequest)
		return
	}

	// Update player score
	s.gs.UpdatePlayerScore(playerId, score)

	w.WriteHeader(http.StatusOK)
}

func (s *Server) HandleGetState(w http.ResponseWriter, r *http.Request) {
	state := s.gs.GetPlayerState()
	w.Header().Set("Content-Type", "application/json")
	w.Header().Set("Access-Control-Allow-Origin", "*")

	if err := json.NewEncoder(w).Encode(state); err != nil {
		http.Error(w, "failed to encode state", http.StatusInternalServerError)
		return
	}
}

Why separate internal and public endpoints? The /gossip endpoint is for node-to-node communication and accepts full state transfers. The public /update and /state endpoints are for game clients. This separation allows us to add authentication or rate limiting to public endpoints without affecting internal communication.
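
As a sketch of how that separation could be enforced, here is a hypothetical token-checking middleware in the transport package, wrapped around only the public handlers; the /gossip route stays untouched. The header name and token scheme are placeholders for whatever authentication or rate limiting you actually need:

// Hypothetical middleware for the public endpoints only.
func requireToken(token string, next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("X-Api-Token") != token {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next(w, r)
	}
}

// RegisterHandlers could then route only the public endpoints through it:
//
//	http.HandleFunc("/gossip", s.HandleGossip)                       // internal, untouched
//	http.HandleFunc("/update", requireToken(token, s.HandleUpdate))  // public
//	http.HandleFunc("/state", requireToken(token, s.HandleGetState)) // public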

Wiring It All Together

Finally, let’s create the main entry point. Create cmd/server/main.go:

package main

import (
	"flag"
	"log"
	"net/http"
	"strings"

	"gmathur.dev/gossiper/internal/server"
	"gmathur.dev/gossiper/internal/transport"
)

func main() {
	id := flag.String("id", "node1", "Node ID")
	httpAddr := flag.String("addr", "localhost:8081", "HTTP address")
	peersStr := flag.String("peers", "", "Comma-separated list of peer addresses")
	flag.Parse()

	peers := []string{}
	if *peersStr != "" {
		peers = strings.Split(*peersStr, ",")
	}

	// 1. Create the core node
	gs := server.NewGameServer(*id, *httpAddr, peers)

	// 2. Create the HTTP server and register handlers
	t := transport.NewServer(gs, nil)
	t.RegisterHandlers()

	// 3. Start the node's background processes
	gs.Start()

	// 4. Start the HTTP server
	log.Printf("[%s] HTTP server listening on %s", *id, *httpAddr)
	log.Fatal(http.ListenAndServe(*httpAddr, nil))
}

Building and Running

Now let’s see our gossip protocol in action! First, build the project:

go build -o gossiper cmd/server/main.go

Start a three-node cluster in different terminals:

# Terminal 1: Node A
./gossiper -id=node-a -addr=localhost:8080 \
  -peers="localhost:8081,localhost:8082"

# Terminal 2: Node B
./gossiper -id=node-b -addr=localhost:8081 \
  -peers="localhost:8080,localhost:8082"

# Terminal 3: Node C
./gossiper -id=node-c -addr=localhost:8082 \
  -peers="localhost:8080,localhost:8081"

Testing the Gossip Protocol

Let’s update a player’s score on one node and watch it propagate:

# Update Alice's score on node A
curl -X POST "http://localhost:8080/update?playerId=alice&score=100"

# Wait a few seconds for gossip to spread
sleep 3

# Check the state on all nodes
echo "Node A state:"
curl -s http://localhost:8080/state | jq

echo "Node B state:"
curl -s http://localhost:8081/state | jq

echo "Node C state:"
curl -s http://localhost:8082/state | jq

You should see Alice’s score on all three nodes! Try updating the same player on different nodes simultaneously:

# Simultaneous updates on different nodes to test conflict resolution
curl -X POST "http://localhost:8080/update?playerId=alice&score=150" &

curl -X POST "http://localhost:8081/update?playerId=alice&score=250" &

wait
sleep 3

# Check which update won (should be the one with later timestamp)
curl -s http://localhost:8080/state | jq '.alice'

Understanding the Design Tradeoffs

  • Pull vs Push Gossip: We implemented a push model where nodes actively send their state to peers. A pull model (where nodes request state from peers) would reduce unnecessary transfers but add request-response complexity.
  • Full State Transfer: We send the entire state in each gossip round. This is simple but inefficient for large datasets. A production system might use incremental updates or Merkle trees to identify differences.
  • Random Peer Selection: This provides good coverage but isn’t optimal. Smart peer selection based on network topology or recent communication patterns could improve efficiency.
  • Last-Write-Wins: Simple but can lose updates if clocks aren’t synchronized. Vector clocks or CRDTs would provide better conflict resolution but add complexity (a sketch of a logical-clock alternative follows this list).
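
To make the clock-skew caveat concrete, here is a hedged sketch of what replacing the wall-clock timestamp with a Lamport-style logical version could look like. The names are hypothetical and this isn’t wired into the code above:

// Hypothetical Lamport-style version to replace the wall-clock timestamp.
// Comparing (Counter, NodeID) removes the dependence on synchronized clocks,
// with NodeID as a deterministic tie-breaker for concurrent updates.
type Version struct {
	Counter uint64 `json:"counter"`
	NodeID  string `json:"nodeId"`
}

// Newer reports whether v should win over other under last-writer-wins.
func (v Version) Newer(other Version) bool {
	if v.Counter != other.Counter {
		return v.Counter > other.Counter
	}
	return v.NodeID > other.NodeID
}

// On a local update the node would bump its counter past everything it has
// seen so far; on merge it would keep the entry whose Version is Newer.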

Performance Characteristics

Let’s analyze how information spreads. With our settings:

  • Gossip interval: 2 seconds
  • Random peer selection
  • Full mesh topology (every node knows every other node)

Expected propagation time for N nodes: O(log N) gossip rounds. For 3 nodes, most updates converge within 2-4 seconds. For 10 nodes, expect 6-8 seconds.
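
As a back-of-the-envelope check (best case only - real convergence has a probabilistic tail), with a fan-out of 1 the number of informed nodes at most doubles each round, so we’d expect roughly ceil(log2 N) rounds times the gossip interval. A throwaway snippet to play with the numbers:

package main

import (
	"fmt"
	"math"
	"time"
)

// Rough estimate only: assumes the set of informed nodes doubles every round,
// ignoring duplicate pushes and message loss.
func estimate(n int, interval time.Duration) time.Duration {
	rounds := math.Ceil(math.Log2(float64(n)))
	return time.Duration(rounds) * interval
}

func main() {
	fmt.Println(estimate(3, 2*time.Second))  // 4s
	fmt.Println(estimate(10, 2*time.Second)) // 8s
}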

Why gossip might not converge

We noted earlier that gossip is probabilistic. Here are some reasons why gossip might not converge that are good to be aware of:

  • Permanent partitions/isolated components - If the graph of nodes isn’t eventually connected (e.g., bad seed lists, firewalls, or a split cluster), updates can’t reach everyone.
  • Too little fan-out or too infrequent rounds - Low infection rate + churn/loss can stall spread.
  • Message loss without repair - Dropped updates may never be retried or reconciled.
  • Conflicting concurrent writes - With last-write-wins (LWW), different replicas can prefer different versions until tie-broken.
  • Clock/timestamp skew - If versioning relies on timestamps, skew can cause “older” values to overwrite newer ones.

It’s interesting to examine how a system like Cassandra deals with these issues using tunable consistency levels and the like, but that’s something we can examine in detail later.

What’s Next?

Now that we have a working gossip-based game server, here are some enhancements to try:

  1. Add Player Sessions: Track active players and remove inactive ones after a timeout.
  2. Implement Delta Gossip: Instead of full state, send only recent changes to reduce bandwidth (see the sketch after this list).
  3. Add Failure Detection: Use heartbeats to detect failed nodes and remove them from peer lists.
  4. Seed Selection: Start with a few seed nodes, and on successfully dialing the seeds, fetch their full membership list. Merge that view into the new node’s peer set. In the example above, the seed set includes the full topology and each node maintains a connection to every other peer, which would be wasteful in a large cluster.
  5. Persistent Storage: Add a database layer so state survives restarts.
  6. Metrics and Monitoring: Track gossip latency, message rates, and convergence time.
  7. Security: Add node authentication and encrypt gossip messages.
  8. Dynamic Membership: Allow nodes to join and leave the cluster dynamically.
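
For the delta-gossip idea specifically, a minimal (hypothetical) starting point in the server package is to remember when we last gossiped and ship only the entries whose timestamps are newer than that; a periodic full sync would still be needed so nodes that missed a delta eventually catch up:

// Hypothetical delta variant: collect only entries updated since the last round.
func (gs *GameServer) deltaSince(since time.Time) map[string]PlayerState {
	gs.mu.RLock()
	defer gs.mu.RUnlock()

	delta := make(map[string]PlayerState)
	for playerId, state := range gs.playerState {
		if state.Timestamp.After(since) {
			delta[playerId] = state
		}
	}
	return delta
}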

The beauty of this approach is its simplicity and resilience. There’s no single point of failure, no complex leader election, and the system naturally handles network partitions. Perfect for distributed systems like game servers where eventual consistency is acceptable and you need high availability.

Try killing a node and restarting it - the system continues working. Add network delays to simulate real-world conditions. Scale to 10 nodes and watch how gossip patterns emerge. That’s the elegance of gossip protocols - simple rules creating robust distributed behavior!