Sharding & Chunk Distribution

This page covers:

  • Erasure coding for availability and reconstruction
  • Chunk assignment to validators
  • Storage responsibilities per shard
  • Data export and import flows

References: Gray Paper sections on erasure coding, data availability, and exports

JAM Erasure Coding and Work Package Bundle Distribution Flow

Overview

JAM (Join-Accumulate Machine) uses a systematic Reed-Solomon erasure code over GF(2^16) to ensure data availability and fault tolerance. This document explains the complete flow from work package creation through encoding and distribution to reconstruction.

Table of Contents

  1. Core Concepts
  2. Erasure Coding Fundamentals
  3. Work Package Structure
  4. Encoding Process
  5. Distribution to Validators
  6. Chunk Collection and Reconstruction
  7. Mathematical Foundation
  8. Implementation Details
  9. Error Handling and Recovery

Core Concepts

Key Parameters

  • Total Validators: 1023
  • Original Shards: 342
  • Recovery Shards: 681 (1023 - 342)
  • Erasure Coding Rate: 342:1023 (any 342 of the 1023 chunks reconstruct the data; roughly 3× expansion)
  • Basic Segment Size: 684 octets
  • Export Segment Size: 4,104 octets
  • Galois Field: GF(2^16)
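
The parameters are tightly coupled; the sketch below (constant names are illustrative, not from any particular implementation) makes the arithmetic explicit:

NUM_VALIDATORS = 1023
ORIGINAL_SHARDS = 342                                  # chunks needed to reconstruct
RECOVERY_SHARDS = NUM_VALIDATORS - ORIGINAL_SHARDS     # 681
BASIC_SEGMENT_SIZE = 684                               # octets
EXPORT_SEGMENT_SIZE = 4_104                            # octets

assert RECOVERY_SHARDS == 681
assert BASIC_SEGMENT_SIZE == 2 * ORIGINAL_SHARDS       # one octet pair per original shard
assert EXPORT_SEGMENT_SIZE == 6 * BASIC_SEGMENT_SIZE   # 4,104 = 6 * 684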

Data Structures

Work Package

@structure
class WorkPackageSpec:
    """
    Work package specification structure.
    """
    # h
    hash: WorkPackageHash
    # l, bundle length
    length: Uint[32]
    # u
    erasure_root: ErasureRoot
    # e
    exports_root: ExportsRoot
    # n
    exports_count: Uint[16]

Work Item

@structure
class WorkItem:
    """
    Work item structure.
    """
    # s
    service: ServiceId
    # c
    code_hash: OpaqueHash
    # g
    refine_gas_limit: Gas
    # a
    accumulate_gas_limit: Gas
    # e
    export_count: Uint[16]
    # y
    payload: Bytes
    # i
    import_segments: ImportSpecs
    # x
    extrinsic: ExtrinsicSpecs

Erasure Coding Fundamentals

Reed-Solomon Encoding in GF(2^16)

JAM uses systematic Reed-Solomon erasure coding with the following properties:

  1. Systematic Nature: Original data is preserved in the first 342 chunks
  2. Recovery Capability: Any 342 out of 1023 chunks can reconstruct the original data
  3. Fault Tolerance: Can handle up to 681 validator failures (≈2/3 of validators)
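
These properties can be exercised end to end. The sketch below uses the hypothetical reed_solomon_leopard encode/decode interface that appears later in this document (not a published API), and assumes decode returns the missing original shards keyed by their index:

# Sketch only: `reed_solomon_leopard` is the hypothetical interface used by the
# code later in this document; decode() is assumed to return the missing
# original shards keyed by index.
import random
import reed_solomon_leopard

ORIGINAL_SHARDS, RECOVERY_SHARDS = 342, 681

# 342 two-octet "message" shards
original = [i.to_bytes(2, "little") for i in range(ORIGINAL_SHARDS)]
recovery = reed_solomon_leopard.encode(original, RECOVERY_SHARDS)
all_shards = original + recovery          # systematic: originals come first

# Keep an arbitrary 342 of the 1023 shards; any such subset suffices.
kept = dict(random.sample(list(enumerate(all_shards)), ORIGINAL_SHARDS))
known_original = {i: s for i, s in kept.items() if i < ORIGINAL_SHARDS}
known_recovery = {i - ORIGINAL_SHARDS: s for i, s in kept.items()
                  if i >= ORIGINAL_SHARDS}

restored = reed_solomon_leopard.decode(
    ORIGINAL_SHARDS, RECOVERY_SHARDS, known_original, known_recovery
)
assert {**known_original, **restored} == dict(enumerate(original))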

Finite Field Representation

The system operates in GF(2^16) with:

  • Irreducible Polynomial: x^16 + x^5 + x^3 + x^2 + 1
  • Cantor Basis: Specialized representation for efficient encoding/decoding
  • 16-bit Words: Each data element is treated as a 16-bit field element
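
For concreteness, here is a minimal sketch of multiplication in GF(2^16) by carry-less multiplication and reduction modulo the irreducible polynomial above. It ignores the Cantor basis, which changes how elements are represented for fast transforms but not the field arithmetic itself:

# GF(2^16) multiply, reducing by x^16 + x^5 + x^3 + x^2 + 1 (bitmask 0x1002D).
IRREDUCIBLE = 0x1002D

def gf2_16_mul(a: int, b: int) -> int:
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition in GF(2^16) is XOR
        b >>= 1
        a <<= 1
        if a & 0x10000:          # degree reached 16: reduce
            a ^= IRREDUCIBLE
    return result

assert gf2_16_mul(1, 0xABCD) == 0xABCD   # 1 is the multiplicative identity
assert gf2_16_mul(0x8000, 2) == 0x002D   # x^15 * x = x^16 ≡ x^5 + x^3 + x^2 + 1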

Work Package Structure

Segments and Manifests

Import Segments

  • Reference data exported by previous work packages
  • Identified by Merkle root and index
  • Size: 4,104 octets each
  • Require justification proofs for correctness

Export Segments

  • Generated during work package execution
  • Deterministic based on Refine logic
  • Committed via segments-root in work report
  • Stored in D³L (Distributed Decentralized Data Lake)

Extrinsic Data

  • External data blobs introduced with work package
  • Committed via hash inclusion
  • Available to Refine logic as arguments
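
As a hedged sketch, the import and extrinsic references described above might be modelled as follows; the field names follow the prose in this section, not any canonical JAM codec:

from dataclasses import dataclass

Hash = bytes  # 32-octet digest

@dataclass
class ImportSpec:
    segments_root: Hash   # Merkle root committing to the exporting package's segments
    index: int            # which 4,104-octet segment under that root is imported

@dataclass
class ExtrinsicSpec:
    blob_hash: Hash       # hash committing to the external data blob
    length: int           # blob length in octets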

Data Availability Systems

  1. Audit DA: Short-term storage for auditable work package bundles
  2. D³L: Long-term storage for exported segments (minimum 28 days)

Encoding Process

Step 1: Data Preparation

# Zero-pad the bundle so its length is a multiple of the basic segment size (684 octets).
def prepare_data(data: bytes) -> bytes:
    length = len(data)
    if length % 684 != 0:
        target_size = ((length // 684) + 1) * 684
        padding_size = target_size - length
        data = data + (b"\x00" * padding_size)
    return data

Step 2: Chunking and Octet Pair Formation

import math

def create_octet_pairs(data: bytes, original_shards: int) -> list:
    bytes_per_chunk = math.ceil(len(data) / (2 * original_shards))
    octet_pairs = []

    # Create octet pairs (16-bit words)
    for i in range(0, len(data), 2):
        octet_pairs.append(data[i:i+2])

    # Pad with zero pairs until the count is a multiple of bytes_per_chunk
    length = len(octet_pairs)
    if length % bytes_per_chunk != 0:
        target_size = ((length // bytes_per_chunk) + 1) * bytes_per_chunk
        padding_size = target_size - length
        octet_pairs.extend([b"\x00\x00"] * padding_size)

    return octet_pairs

Step 3: Unzip Operation

The unzip function distributes data across multiple sequences for parallel processing:

def unzip(data: list, n: int, k: int) -> list:
    """
    Divide the octet-pair array into k sequences d_0, d_1, ..., d_(k-1).
    n = original_shards (342)
    k = bytes_per_chunk
    """
    result = []
    for i in range(k):
        sequence = []
        for j in range(n):
            sequence.append(data[(j * k) + i])
        result.append(sequence)
    return result

Step 4: Reed-Solomon Encoding

Each sequence is encoded independently:

def encode_sequences(sequences: list, recovery_shards: int) -> list:
    for sequence in sequences:
        # Generate recovery data
        recovery = reed_solomon_leopard.encode(sequence, recovery_shards)
        # Append recovery data to the original (systematic code)
        sequence.extend(recovery)
    return sequences

Step 5: Transposition and Chunk Formation

def create_chunks(encoded_sequences: list) -> list:
    # Transpose the matrix: rows become columns
    chunks = []
    for i in range(len(encoded_sequences[0])):      # 1023 iterations
        chunk = b""
        for j in range(len(encoded_sequences)):     # bytes_per_chunk iterations
            chunk += encoded_sequences[j][i]
        chunks.append(chunk)
    return chunks

Distribution to Validators

Availability Specifier Creation

For each work package, an availability specifier is created:

AvailabilitySpecifier {
    package_hash: Hash,
    erasure_root: Hash,   // Merkle root of audit bundle chunks
    segment_root: Hash,   // Merkle root of exported segments + paged proofs
    export_count: u32
}

Chunk Distribution Strategy

  1. Audit DA Chunks: Distributed across all 1023 validators

    • Contains: work package + extrinsic data + import segments + justifications
    • Short-lived: kept until block finality
  2. D³L Chunks: Distributed for exported segments

    • Contains: exported segments + paged proofs
    • Long-lived: minimum 28 days retention
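
The collection and decoding code later on this page keys chunks by validator index, which amounts to assuming that chunk i of the 1023 produced for a bundle (or for a segment) is stored by validator i. A minimal sketch under that assumption (not a normative assignment rule):

def assign_chunks_to_validators(audit_chunks: list, segment_chunks: list,
                                num_validators: int = 1023) -> dict:
    """Pair each validator index with its Audit DA chunk and its D3L chunk."""
    assert len(audit_chunks) == num_validators
    assert len(segment_chunks) == num_validators
    return {
        v: {"audit_da": audit_chunks[v], "d3l": segment_chunks[v]}
        for v in range(num_validators)
    }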

Validator Responsibilities

As Guarantors:

  1. Collect and validate work packages
  2. Fetch required import segments from D³L
  3. Verify segment justifications
  4. Execute work package logic
  5. Encode and distribute audit bundle
  6. Create and distribute export segments

As Data Providers:

  1. Store assigned chunks reliably
  2. Respond to chunk requests promptly
  3. Maintain data for required duration
  4. Participate in reconstruction when needed

Chunk Collection and Reconstruction

Prerequisites for Reconstruction

To reconstruct data, a node needs:

  • Minimum Chunks: 342 out of 1023 chunks
  • Chunk Indices: Knowledge of which validator provided each chunk
  • Chunk Authenticity: Verification that chunks are uncorrupted

Reconstruction Process

Step 1: Chunk Collection

def collect_chunks(required_indices: set, timeout: int) -> dict:
    """
    Collect chunks from validators.
    Returns: dict mapping validator_index to chunk_data
    """
    collected = {}
    for validator_idx in range(1023):
        if len(collected) >= 342:
            break
        try:
            chunk = request_chunk_from_validator(validator_idx, timeout)
            if verify_chunk(chunk, validator_idx):
                collected[validator_idx] = chunk
        except (TimeoutError, ValidationError):
            continue
    return collected

Step 2: Chunk Preparation

def prepare_chunks_for_decoding(chunks: dict) -> list:
    """
    Prepare collected chunks for Reed-Solomon decoding.
    """
    prepared_chunks = []
    for validator_idx, chunk_data in chunks.items():
        # Split the chunk into symbols (octet pairs), tagged with their source
        symbols = []
        for i in range(0, len(chunk_data), 2):
            symbols.append((chunk_data[i:i+2], validator_idx))
        prepared_chunks.append(symbols)
    return prepared_chunks

Step 3: Matrix Transposition

def transpose_chunks(prepared_chunks: list) -> list:
    """
    Transpose the chunk matrix for parallel Reed-Solomon decoding.
    """
    if not prepared_chunks:
        return []

    # Transpose: convert chunks to sequences by symbol position
    transposed = []
    for pos in range(len(prepared_chunks[0])):
        sequence = []
        for chunk in prepared_chunks:
            sequence.append(chunk[pos])
        transposed.append(sequence)
    return transposed

Step 4: Reed-Solomon Decoding

def decode_sequences(transposed_sequences: list,
                     original_shards: int,
                     recovery_shards: int) -> list:
    """
    Decode each transposed sequence independently.
    """
    for sequence in transposed_sequences:
        original_data = {}
        recovery_data = {}

        # Separate original and recovery symbols
        for symbol, validator_idx in sequence:
            if validator_idx < original_shards:
                original_data[validator_idx] = symbol
            else:
                recovery_data[validator_idx - original_shards] = symbol

        # Perform Reed-Solomon decoding
        restored = reed_solomon_leopard.decode(
            original_shards,
            recovery_shards,
            original_data,
            recovery_data,
        )

        # Add restored symbols back to the sequence
        for idx, symbol in restored.items():
            sequence.append((symbol, idx))

    return transposed_sequences

Step 5: Data Reconstruction

def reconstruct_original_data(decoded_sequences: list,
                              original_shards: int) -> bytes:
    """
    Reconstruct the original data from the decoded sequences.
    """
    # Sort each sequence by shard (validator) index
    sorted_sequences = []
    for sequence in decoded_sequences:
        sorted_seq = sorted(sequence, key=lambda x: x[1])
        sorted_sequences.append(sorted_seq)

    # Reassemble the data in the original (pre-unzip) order
    reconstructed_data = b""
    for shard_idx in range(original_shards):
        for seq_idx in range(len(sorted_sequences)):
            symbol, _ = sorted_sequences[seq_idx][shard_idx]
            reconstructed_data += symbol

    return reconstructed_data

Mathematical Foundation

Galois Field Operations

The encoding operates in GF(2^16) using a Cantor basis for efficient computation:

Field Generator Polynomial

f(x) = x^16 + x^5 + x^3 + x^2 + 1

Cantor Basis Elements

The field uses 16 basis elements (v₀ through v₁₅) that provide computational advantages over the standard polynomial basis.

Element Representation

Each 16-bit word m_i = m_(i,15)...m_(i,0) maps to field element:

m̃_i = Σ(j=0 to 15) m_(i,j) * v_j
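
A minimal sketch of that mapping follows; the concrete values of v₀ through v₁₅ are fixed elsewhere (the Gray Paper) and left abstract here:

def to_field_element(word: int, basis: list) -> int:
    """
    Map a 16-bit word to its field element: XOR together basis[j] for every
    set bit j of the word (addition in GF(2^16) is XOR, so the sum above
    becomes an XOR of basis elements).
    """
    element = 0
    for j in range(16):
        if (word >> j) & 1:
            element ^= basis[j]
    return element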

Polynomial Interpolation

For encoding 342 message words into 1023 code words:

  1. Interpolation: Find polynomial p(y) of degree ≤ 341 such that:

    p(0̃) = m̃₀, p(1̃) = m̃₁, ..., p(341̃) = m̃₃₄₁
  2. Evaluation: Compute recovery symbols:

    r̃₃₄₂ = p(342̃), r̃₃₄₃ = p(343̃), ..., r̃₁₀₂₂ = p(1022̃)
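
The same interpolate-then-evaluate structure can be shown at toy scale: 2 message words and 4 code words over the prime field GF(257), instead of 342 and 1023 over GF(2^16). This illustrates the principle only and is not the JAM encoding itself:

# Toy Reed-Solomon-style code over GF(257): interpolate through the message
# points, evaluate at extra points for recovery, and recover from any 2 points.
P = 257

def lagrange_eval(points: list, x: int) -> int:
    """Evaluate at x the unique degree <= len(points)-1 polynomial through `points`."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

message = [(0, 42), (1, 7)]                                     # message words at x = 0, 1
recovery = [(x, lagrange_eval(message, x)) for x in (2, 3)]     # recovery words at x = 2, 3

# Any 2 of the 4 points recover the message, e.g. using only the recovery points:
assert [lagrange_eval(recovery, x) for x in (0, 1)] == [42, 7]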

Error Correction Capability

The code can correct up to:

  • Erasures: 681 missing chunks (known locations)
  • Errors: 340 corrupted chunks (unknown locations)
  • Mixed: any combination satisfying 2 × errors + erasures ≤ 681
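
These limits follow from the minimum distance d = n − k + 1 of a maximum-distance-separable code; a quick numeric check:

n, k = 1023, 342
d = n - k + 1                 # minimum distance: 682
assert d - 1 == 681           # correctable erasures (known positions)
assert (d - 1) // 2 == 340    # correctable errors (unknown positions)
# Mixed case: 2 * errors + erasures <= d - 1 = 681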

Implementation Details

Configuration Parameters

class ChainConfig:
    num_validators = 1023
    erasure_coding_original_shards = 342
    erasure_coding_recovery_shards = 681
    segment_size = 4104
    basic_segment_size = 684
    max_package_imports = 3072
    max_package_exports = 3072
    max_package_extrinsics = 128
    max_bundle_size = 13_791_360  # ≈13.8 MB (13.15 MiB)

Performance Considerations

Encoding Complexity

  • Time: O(n log n) for n data symbols
  • Space: O(n) for intermediate storage
  • Parallelism: Independent encoding of symbol sequences

Decoding Complexity

  • Time: O(k log k) for k received symbols
  • Space: O(k) for reconstruction
  • Early Termination: Stop when 342 valid chunks collected

Memory Management

Streaming Processing

def stream_encode(data_stream):
    """
    Process large data streams without loading the entire dataset into memory.
    """
    buffer = b""
    chunks = []

    for data_chunk in data_stream:
        buffer += data_chunk

        # Process complete 684-octet segments as they become available
        while len(buffer) >= 684:
            segment = buffer[:684]
            buffer = buffer[684:]

            encoded_chunks = encode_segment(segment)
            chunks.extend(encoded_chunks)

    # Handle any remaining data by padding it to a full segment
    if buffer:
        padded_segment = pad_to_684(buffer)
        encoded_chunks = encode_segment(padded_segment)
        chunks.extend(encoded_chunks)

    return chunks

Error Handling and Recovery

Failure Modes

  1. Validator Unavailability

    • Detection: Timeout on chunk requests
    • Recovery: Request from additional validators
    • Mitigation: Over-request beyond minimum 342
  2. Chunk Corruption

    • Detection: Hash verification failure
    • Recovery: Request replacement chunk
    • Mitigation: Verify chunk integrity before use
  3. Network Partitions

    • Detection: Systematic request failures
    • Recovery: Alternative network routes
    • Mitigation: Distributed validator selection

Robust Collection Strategy

import random

def robust_chunk_collection(segment_root, segment_index,
                            min_chunks=342, safety_margin=50):
    """
    Collect chunks with fault tolerance.
    """
    target_chunks = min_chunks + safety_margin
    collected_chunks = {}
    failed_validators = set()

    validator_list = list(range(1023))
    random.shuffle(validator_list)

    for validator_idx in validator_list:
        if len(collected_chunks) >= target_chunks:
            break

        if validator_idx in failed_validators:
            continue

        try:
            chunk = request_chunk_with_retry(
                validator_idx, segment_root, segment_index
            )

            if verify_chunk_integrity(chunk, validator_idx):
                collected_chunks[validator_idx] = chunk
            else:
                failed_validators.add(validator_idx)

        except Exception as e:
            failed_validators.add(validator_idx)
            log_failure(validator_idx, e)

    if len(collected_chunks) < min_chunks:
        raise InsufficientChunksError(
            f"Only collected {len(collected_chunks)} chunks, "
            f"need {min_chunks}"
        )

    return collected_chunks

Data Integrity Verification

def verify_reconstructed_data(reconstructed_data, expected_hash):
    """
    Verify the integrity of reconstructed data.
    """
    computed_hash = blake2b(reconstructed_data)
    if computed_hash != expected_hash:
        raise DataIntegrityError(
            f"Hash mismatch: expected {expected_hash}, "
            f"got {computed_hash}"
        )
    return True


def verify_segment_justification(segment_data, justification_proof,
                                 segment_root, segment_index):
    """
    Verify that segment data matches its commitment.
    """
    # Recompute the Merkle root from the segment and its proof path
    computed_root = compute_merkle_root(
        segment_data, justification_proof, segment_index
    )

    if computed_root != segment_root:
        raise JustificationError(
            f"Segment justification failed for index {segment_index}"
        )

    return True

Workflow Summary

Complete Flow Diagram

1. Work Package Creation
├── Define work items
├── Specify imports/exports
└── Set gas limits

2. Guarantor Processing
├── Validate authorization
├── Collect import segments from D³L
├── Verify segment justifications
├── Execute work package
└── Generate work report

3. Data Encoding
├── Prepare audit bundle
├── Apply erasure coding (342:1023)
├── Create 1023 chunks
└── Generate availability specifier

4. Distribution Phase
├── Send chunks to validators
├── Store in Audit DA (short-term)
├── Store exports in D³L (long-term)
└── Update availability commitments

5. Auditing Phase
├── Collect 342+ chunks
├── Verify chunk integrity
├── Reconstruct audit bundle
├── Re-execute work package
└── Compare results

6. Recovery Operations
├── Handle validator failures
├── Request replacement chunks
├── Verify reconstructed data
└── Update availability status

Key Success Metrics

  • Availability: ≥342 chunks accessible for any work package
  • Latency: Chunk collection within epoch boundaries
  • Integrity: 100% data correctness after reconstruction
  • Resilience: Function with up to 681 validator failures

This system allows JAM to maintain data availability and correctness even in highly adversarial environments while using network and storage resources efficiently.