Sharding & Chunk Distribution
This page covers:
- Erasure coding for availability and reconstruction
- Chunk assignment to validators
- Storage responsibilities per shard
- Data export and import flows
References: Gray Paper sections on erasure coding, data availability, and exports
JAM Erasure Coding and Work Package Bundle Distribution Flow
Overview
JAM (Join-Accumulate Machine) uses a systematic Reed-Solomon erasure coding scheme over GF(2^16) to ensure data availability and fault tolerance. This document explains the complete flow from work package creation through distribution to reconstruction.
Table of Contents
- Core Concepts
- Erasure Coding Fundamentals
- Work Package Structure
- Encoding Process
- Distribution to Validators
- Chunk Collection and Reconstruction
- Mathematical Foundation
- Implementation Details
- Error Handling and Recovery
Core Concepts
Key Parameters
- Total Validators: 1023
- Original Shards: 342
- Recovery Shards: 681 (1023 - 342)
- Erasure Coding Rate: 342:1023 (rate ≈ 1/3, i.e. roughly 3× data expansion)
- Basic Segment Size: 684 octets
- Export Segment Size: 4,104 octets
- Galois Field: GF(2^16)
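These parameters are not independent. A quick sketch in Python shows how they relate; the constant names below are illustrative, not taken from the Gray Paper.
NUM_VALIDATORS = 1023
ORIGINAL_SHARDS = 342
RECOVERY_SHARDS = NUM_VALIDATORS - ORIGINAL_SHARDS   # 681
BASIC_SEGMENT_SIZE = 684       # octets per erasure-coded piece
EXPORT_SEGMENT_SIZE = 4_104    # octets per exported segment

assert RECOVERY_SHARDS == 681
assert BASIC_SEGMENT_SIZE == 2 * ORIGINAL_SHARDS      # 684 octets = 342 octet pairs, one per original shard
assert EXPORT_SEGMENT_SIZE == 6 * BASIC_SEGMENT_SIZE  # 4,104 = 6 × 684
print(f"code rate ≈ {ORIGINAL_SHARDS / NUM_VALIDATORS:.3f}")  # ≈ 0.334, i.e. ~3× expansion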
Data Structures
Work Package
@structure
class WorkPackageSpec:
    """
    Work package specification structure.
    """
    # h: work package hash
    hash: WorkPackageHash
    # l: bundle length
    length: Uint[32]
    # u: erasure root
    erasure_root: ErasureRoot
    # e: exports (segments) root
    exports_root: ExportsRoot
    # n: exports count
    exports_count: Uint[16]
Work Item
@structure
class WorkItem:
    """
    Work item structure.
    """
    # s: service identifier
    service: ServiceId
    # c: code hash
    code_hash: OpaqueHash
    # g: refine gas limit
    refine_gas_limit: Gas
    # a: accumulate gas limit
    accumulate_gas_limit: Gas
    # e: export count
    export_count: Uint[16]
    # y: payload blob
    payload: Bytes
    # i: import segment specifications
    import_segments: ImportSpecs
    # x: extrinsic specifications
    extrinsic: ExtrinsicSpecs
Erasure Coding Fundamentals
Reed-Solomon Encoding in GF(2^16)
JAM uses systematic Reed-Solomon erasure coding with the following properties:
- Systematic Nature: Original data is preserved in the first 342 chunks
- Recovery Capability: Any 342 out of 1023 chunks can reconstruct the original data
- Fault Tolerance: Can handle up to 681 validator failures (≈2/3 of validators)
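As a concrete illustration of the recovery property, the sketch below encodes one sequence of 342 two-octet symbols and restores it from recovery symbols alone. It reuses the same reed_solomon_leopard calls that appear in the encoding and decoding steps later on this page; that API (and the library itself) is an assumption of this page's examples, not something specified by the Gray Paper.
import reed_solomon_leopard  # assumed API, mirroring the encode/decode calls used below

ORIGINAL, RECOVERY = 342, 681
original = [i.to_bytes(2, "little") for i in range(ORIGINAL)]   # 342 two-octet symbols
recovery = reed_solomon_leopard.encode(original, RECOVERY)      # 681 recovery symbols

# Suppose every original chunk is lost: any 342 of the surviving chunks are still
# enough, so here we decode from 342 recovery symbols and no originals at all.
kept = {i: recovery[i] for i in range(342)}
restored = reed_solomon_leopard.decode(ORIGINAL, RECOVERY, {}, kept)
assert all(restored[i] == original[i] for i in range(ORIGINAL))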
Finite Field Representation
The system operates in GF(2^16) with:
- Irreducible Polynomial: x^16 + x^5 + x^3 + x^2 + 1
- Cantor Basis: Specialized representation for efficient encoding/decoding
- 16-bit Words: Each data element is treated as a 16-bit field element
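For intuition, field addition is bitwise XOR and multiplication is carry-less polynomial multiplication reduced by the irreducible polynomial above. The sketch below works in the plain polynomial basis; production encoders use the Cantor basis with transform-based algorithms for speed, which this does not attempt to show.
IRREDUCIBLE = 0x1002D  # x^16 + x^5 + x^3 + x^2 + 1

def gf2_16_mul(a: int, b: int) -> int:
    """Multiply two 16-bit elements of GF(2^16) (schoolbook shift-and-XOR)."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition in characteristic 2 is XOR
        b >>= 1
        a <<= 1
        if a & 0x10000:          # degree 16 reached: reduce by the irreducible polynomial
            a ^= IRREDUCIBLE
    return result

assert gf2_16_mul(1, 0xBEEF) == 0xBEEF   # 1 is the multiplicative identity
assert gf2_16_mul(0x8000, 2) == 0x002D   # x^15 · x = x^16 ≡ x^5 + x^3 + x^2 + 1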
Work Package Structure
Segments and Manifests
Import Segments
- Reference data exported by previous work packages
- Identified by Merkle root and index
- Size: 4,104 octets each
- Require justification proofs for correctness
Export Segments
- Generated during work package execution
- Deterministic based on Refine logic
- Committed via segments-root in work report
- Stored in D³L (Distributed Decentralized Data Lake)
Extrinsic Data
- External data blobs introduced with work package
- Committed via hash inclusion
- Available to Refine logic as arguments
Data Availability Systems
- Audit DA: Short-term storage for auditable work package bundles
- D³L: Long-term storage for exported segments (minimum 28 days)
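To make the import-segment description concrete, here is a minimal sketch of how such a reference might be modeled: a segment is addressed by the exporter's segments-root plus an index, and is always 4,104 octets. The class and helper names are illustrative only, not part of the protocol.
from dataclasses import dataclass

EXPORT_SEGMENT_SIZE = 4_104

@dataclass(frozen=True)
class ImportSegmentRef:
    segments_root: bytes   # Merkle root committing to the exporter's segments
    index: int             # position of the segment under that root

def check_segment(ref: ImportSegmentRef, segment: bytes) -> None:
    # Size check only; the Merkle justification proof is verified separately
    # (see verify_segment_justification later on this page).
    if len(segment) != EXPORT_SEGMENT_SIZE:
        raise ValueError(f"segment {ref.index} has {len(segment)} octets, expected {EXPORT_SEGMENT_SIZE}")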
Encoding Process
Step 1: Data Preparation
# Step 1 zero-pads the bundle so its length is a multiple of the basic segment size (684 octets).
def prepare_data(data: bytes) -> bytes:
    length = len(data)
    # Pad to a multiple of 684 bytes
    if length % 684 != 0:
        target_size = ((length // 684) + 1) * 684
        padding_size = target_size - length
        data = data + (b"\x00" * padding_size)
    return data
Step 2: Chunking and Octet Pair Formation
import math

def create_octet_pairs(data: bytes, original_shards: int) -> list:
    bytes_per_chunk = math.ceil(len(data) / (2 * original_shards))
    octet_pairs = []
    # Split the data into octet pairs (16-bit words)
    for i in range(0, len(data), 2):
        octet_pairs.append(data[i:i+2])
    # Zero-pad the list of octet pairs if necessary
    length = len(octet_pairs)
    if length % bytes_per_chunk != 0:
        target_size = ((length // bytes_per_chunk) + 1) * bytes_per_chunk
        padding_size = target_size - length
        octet_pairs.extend([b"\x00\x00"] * padding_size)
    return octet_pairs
Step 3: Unzip Operation
The unzip function distributes data across multiple sequences for parallel processing:
def unzip(data: list, n: int, k: int) -> list:
    """
    Divide the flat list of octet pairs into k sequences d_0, d_1, ..., d_(k-1).
    n = original_shards (342)
    k = bytes_per_chunk
    """
    result = []
    for i in range(k):
        sequence = []
        for j in range(n):
            sequence.append(data[(j * k) + i])
        result.append(sequence)
    return result
Step 4: Reed-Solomon Encoding
Each sequence is encoded independently:
import reed_solomon_leopard

def encode_sequences(sequences: list, recovery_shards: int) -> list:
    for sequence in sequences:
        # Generate recovery symbols for this sequence
        recovery = reed_solomon_leopard.encode(sequence, recovery_shards)
        # Append recovery symbols after the original symbols (systematic code)
        sequence.extend(recovery)
    return sequences
Step 5: Transposition and Chunk Formation
def create_chunks(encoded_sequences: list) -> list:
    # Transpose the matrix: rows become columns
    chunks = []
    for i in range(len(encoded_sequences[0])):       # 1023 iterations
        chunk = b""
        for j in range(len(encoded_sequences)):      # bytes_per_chunk iterations
            chunk += encoded_sequences[j][i]
        chunks.append(chunk)
    return chunks
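Taken together, Steps 1–5 can be wired into a single helper. The sketch below simply chains the functions defined above with the parameters from the Core Concepts section; it assumes those helpers (and the reed_solomon_leopard module they use) are in scope.
ORIGINAL_SHARDS = 342
RECOVERY_SHARDS = 681

def erasure_encode_bundle(data: bytes) -> list:
    data = prepare_data(data)                                    # Step 1: pad to a multiple of 684
    bytes_per_chunk = len(data) // (2 * ORIGINAL_SHARDS)         # exact division after padding
    pairs = create_octet_pairs(data, ORIGINAL_SHARDS)            # Step 2: 16-bit words
    sequences = unzip(pairs, ORIGINAL_SHARDS, bytes_per_chunk)   # Step 3: k parallel sequences
    encoded = encode_sequences(sequences, RECOVERY_SHARDS)       # Step 4: append recovery symbols
    return create_chunks(encoded)                                # Step 5: one chunk per validator

chunks = erasure_encode_bundle(b"example audit bundle")
assert len(chunks) == 1023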
Distribution to Validators
Availability Specifier Creation
For each work package, an availability specifier is created:
AvailabilitySpecifier {
    package_hash: Hash,
    erasure_root: Hash,    // Merkle root of audit bundle chunks
    segment_root: Hash,    // Merkle root of exported segments + paged proofs
    export_count: u16
}
Chunk Distribution Strategy
- Audit DA chunks: distributed across all 1,023 validators
  - Contains: work package + extrinsic data + import segments + justifications
  - Short-lived: kept until block finality
- D³L chunks: distributed for exported segments
  - Contains: exported segments + paged proofs
  - Long-lived: minimum 28 days retention
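The assignment itself can be sketched as follows, under the assumption that chunk index and validator index coincide (validator i stores chunk i of every bundle); the send_chunk transport call is hypothetical.
def distribute_chunks(chunks: list, validators: list) -> None:
    # One chunk per validator, identified by the validator's own index.
    assert len(chunks) == len(validators) == 1023
    for validator_index, (validator, chunk) in enumerate(zip(validators, chunks)):
        send_chunk(validator, validator_index, chunk)   # hypothetical network call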
Validator Responsibilities
As Guarantors:
- Collect and validate work packages
- Fetch required import segments from D³L
- Verify segment justifications
- Execute work package logic
- Encode and distribute audit bundle
- Create and distribute export segments
As Data Providers:
- Store assigned chunks reliably
- Respond to chunk requests promptly
- Maintain data for required duration
- Participate in reconstruction when needed
Chunk Collection and Reconstruction
Prerequisites for Reconstruction
To reconstruct data, a node needs:
- Minimum Chunks: 342 out of 1023 chunks
- Chunk Indices: Knowledge of which validator provided each chunk
- Chunk Authenticity: Verification that chunks are uncorrupted
Reconstruction Process
Step 1: Chunk Collection
def collect_chunks(required_indices: set, timeout: int) -> dict:
    """
    Collect chunks from validators.
    Returns: dict mapping validator_index to chunk_data
    """
    collected = {}
    for validator_idx in range(1023):
        if len(collected) >= 342:
            break
        try:
            chunk = request_chunk_from_validator(validator_idx, timeout)
            if verify_chunk(chunk, validator_idx):
                collected[validator_idx] = chunk
        except (TimeoutError, ValidationError):
            continue
    return collected
Step 2: Chunk Preparation
def prepare_chunks_for_decoding(chunks: dict) -> list:
    """
    Prepare collected chunks for Reed-Solomon decoding.
    """
    prepared_chunks = []
    for validator_idx, chunk_data in chunks.items():
        # Split the chunk into symbols (octet pairs), tagged with their validator index
        symbols = []
        for i in range(0, len(chunk_data), 2):
            symbols.append((chunk_data[i:i+2], validator_idx))
        prepared_chunks.append(symbols)
    return prepared_chunks
Step 3: Matrix Transposition
def transpose_chunks(prepared_chunks: list) -> list:
    """
    Transpose the chunk matrix for parallel Reed-Solomon decoding.
    """
    if not prepared_chunks:
        return []
    # Transpose: convert chunks to sequences by symbol position
    transposed = []
    for pos in range(len(prepared_chunks[0])):
        sequence = []
        for chunk in prepared_chunks:
            sequence.append(chunk[pos])
        transposed.append(sequence)
    return transposed
Step 4: Reed-Solomon Decoding
def decode_sequences(transposed_sequences: list,
                     original_shards: int,
                     recovery_shards: int) -> list:
    """
    Decode each transposed sequence independently.
    """
    for sequence in transposed_sequences:
        original_data = {}
        recovery_data = {}
        # Separate original and recovery symbols by validator index
        for symbol, validator_idx in sequence:
            if validator_idx < original_shards:
                original_data[validator_idx] = symbol
            else:
                recovery_data[validator_idx - original_shards] = symbol
        # Perform Reed-Solomon decoding to restore the missing original symbols
        restored = reed_solomon_leopard.decode(
            original_shards,
            recovery_shards,
            original_data,
            recovery_data
        )
        # Add the restored symbols back to the sequence
        for idx, symbol in restored.items():
            sequence.append((symbol, idx))
    return transposed_sequences
Step 5: Data Reconstruction
def reconstruct_original_data(decoded_sequences: list,
                              original_shards: int) -> bytes:
    """
    Reconstruct the original data from the decoded sequences.
    """
    # Sort each sequence by validator index
    sorted_sequences = []
    for sequence in decoded_sequences:
        sorted_seq = sorted(sequence, key=lambda x: x[1])
        sorted_sequences.append(sorted_seq)
    # Reassemble the data, undoing the unzip/transpose layout
    reconstructed_data = b""
    for shard_idx in range(original_shards):
        for seq_idx in range(len(sorted_sequences)):
            symbol, _ = sorted_sequences[seq_idx][shard_idx]
            reconstructed_data += symbol
    return reconstructed_data
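The five steps above compose into a single reconstruction helper. The sketch below chains the functions defined in this section with the verify_reconstructed_data check shown later on this page; it assumes those helpers are in scope.
ORIGINAL_SHARDS = 342
RECOVERY_SHARDS = 681

def reconstruct_bundle(expected_hash: bytes, timeout: int = 5) -> bytes:
    chunks = collect_chunks(set(range(1023)), timeout)                        # Step 1
    prepared = prepare_chunks_for_decoding(chunks)                            # Step 2
    transposed = transpose_chunks(prepared)                                   # Step 3
    decoded = decode_sequences(transposed, ORIGINAL_SHARDS, RECOVERY_SHARDS)  # Step 4
    data = reconstruct_original_data(decoded, ORIGINAL_SHARDS)                # Step 5
    # In practice the bundle length from the work package spec is used to strip
    # the zero padding added during encoding before the hash comparison.
    verify_reconstructed_data(data, expected_hash)
    return data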
Mathematical Foundation
Galois Field Operations
The encoding operates in GF(2^16) using a Cantor basis for efficient computation:
Field Generator Polynomial
f(x) = x^16 + x^5 + x^3 + x^2 + 1
Cantor Basis Elements
The field uses 16 basis elements (v₀ through v₁₅) that provide computational advantages over the standard polynomial basis.
Element Representation
Each 16-bit word m_i = m_(i,15)...m_(i,0) maps to field element:
m̃_i = Σ(j=0 to 15) m_(i,j) * v_j
Polynomial Interpolation
For encoding 342 message words into 1023 code words:
- Interpolation: find the polynomial p(y) of degree ≤ 341 such that
  p(0̃) = m̃₀, p(1̃) = m̃₁, ..., p(341̃) = m̃₃₄₁
- Evaluation: compute the recovery symbols
  r̃₃₄₂ = p(342̃), r̃₃₄₃ = p(343̃), ..., r̃₁₀₂₂ = p(1022̃)
Error Correction Capability
The code can correct up to:
- Erasures: up to 681 missing chunks (known locations)
- Errors: up to 340 corrupted chunks (unknown locations)
- Mixed: any combination satisfying 2 × errors + erasures ≤ 681
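The same capability can be expressed as a small check over the code parameters; the function below is illustrative only.
def can_recover(n: int, k: int, erasures: int, errors: int) -> bool:
    """For an (n, k) Reed-Solomon code, require 2·errors + erasures ≤ n − k."""
    return 2 * errors + erasures <= n - k

assert can_recover(1023, 342, erasures=681, errors=0)     # pure erasures: up to 681
assert can_recover(1023, 342, erasures=0, errors=340)     # pure errors: up to 340
assert not can_recover(1023, 342, erasures=0, errors=341)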
Implementation Details
Configuration Parameters
class ChainConfig:
    num_validators = 1023
    erasure_coding_original_shards = 342
    erasure_coding_recovery_shards = 681
    segment_size = 4104
    basic_segment_size = 684
    max_package_imports = 3072
    max_package_exports = 3072
    max_package_extrinsics = 128
    max_bundle_size = 13_791_360  # ≈ 13.8 MB
Performance Considerations
Encoding Complexity
- Time: O(n log n) for n data symbols
- Space: O(n) for intermediate storage
- Parallelism: Independent encoding of symbol sequences
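Because each symbol sequence is encoded independently, Step 4 parallelizes naturally. A minimal sketch using the standard library is shown below; it assumes the same reed_solomon_leopard module used elsewhere on this page.
from concurrent.futures import ProcessPoolExecutor

import reed_solomon_leopard  # assumed, as elsewhere on this page

def encode_one(sequence: list, recovery_shards: int = 681) -> list:
    # Each sequence is independent, so workers need no shared state.
    return sequence + reed_solomon_leopard.encode(sequence, recovery_shards)

def encode_sequences_parallel(sequences: list, max_workers: int = 8) -> list:
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(encode_one, sequences))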
Decoding Complexity
- Time: O(k log k) for k received symbols
- Space: O(k) for reconstruction
- Early Termination: Stop when 342 valid chunks collected
Memory Management
Streaming Processing
def stream_encode(data_stream, chunk_size=1024*1024):
    """
    Process large data streams without loading the entire dataset into memory.
    """
    buffer = b""
    chunks = []
    for data_chunk in data_stream:
        buffer += data_chunk
        # Process complete segments
        while len(buffer) >= 684:
            segment = buffer[:684]
            buffer = buffer[684:]
            encoded_chunks = encode_segment(segment)
            chunks.extend(encoded_chunks)
    # Handle remaining data
    if buffer:
        padded_segment = pad_to_684(buffer)
        encoded_chunks = encode_segment(padded_segment)
        chunks.extend(encoded_chunks)
    return chunks
Error Handling and Recovery
Failure Modes
- Validator unavailability
  - Detection: timeout on chunk requests
  - Recovery: request from additional validators
  - Mitigation: over-request beyond the minimum of 342
- Chunk corruption
  - Detection: hash verification failure
  - Recovery: request a replacement chunk
  - Mitigation: verify chunk integrity before use
- Network partitions
  - Detection: systematic request failures
  - Recovery: alternative network routes
  - Mitigation: distributed validator selection
Robust Collection Strategy
import random

def robust_chunk_collection(segment_root, segment_index,
                            min_chunks=342, safety_margin=50):
    """
    Collect chunks with fault tolerance.
    """
    target_chunks = min_chunks + safety_margin
    collected_chunks = {}
    failed_validators = set()
    validator_list = list(range(1023))
    random.shuffle(validator_list)
    for validator_idx in validator_list:
        if len(collected_chunks) >= target_chunks:
            break
        if validator_idx in failed_validators:
            continue
        try:
            chunk = request_chunk_with_retry(
                validator_idx, segment_root, segment_index
            )
            if verify_chunk_integrity(chunk, validator_idx):
                collected_chunks[validator_idx] = chunk
            else:
                failed_validators.add(validator_idx)
        except Exception as e:
            failed_validators.add(validator_idx)
            log_failure(validator_idx, e)
    if len(collected_chunks) < min_chunks:
        raise InsufficientChunksError(
            f"Only collected {len(collected_chunks)} chunks, "
            f"need {min_chunks}"
        )
    return collected_chunks
Data Integrity Verification
import hashlib

def verify_reconstructed_data(reconstructed_data, expected_hash):
    """
    Verify integrity of reconstructed data.
    """
    # 32-octet Blake2b digest, matching the hash size used elsewhere in the protocol
    computed_hash = hashlib.blake2b(reconstructed_data, digest_size=32).digest()
    if computed_hash != expected_hash:
        raise DataIntegrityError(
            f"Hash mismatch: expected {expected_hash}, "
            f"got {computed_hash}"
        )
    return True

def verify_segment_justification(segment_data, justification_proof,
                                 segment_root, segment_index):
    """
    Verify that segment data matches its commitment.
    """
    # Recompute the Merkle root from the segment and its proof path
    computed_root = compute_merkle_root(
        segment_data, justification_proof, segment_index
    )
    if computed_root != segment_root:
        raise JustificationError(
            f"Segment justification failed for index {segment_index}"
        )
    return True
Workflow Summary
Complete Flow Diagram
1. Work Package Creation
├── Define work items
├── Specify imports/exports
└── Set gas limits
2. Guarantor Processing
├── Validate authorization
├── Collect import segments from D³L
├── Verify segment justifications
├── Execute work package
└── Generate work report
3. Data Encoding
├── Prepare audit bundle
├── Apply erasure coding (342:1023)
├── Create 1023 chunks
└── Generate availability specifier
4. Distribution Phase
├── Send chunks to validators
├── Store in Audit DA (short-term)
├── Store exports in D³L (long-term)
└── Update availability commitments
5. Auditing Phase
├── Collect 342+ chunks
├── Verify chunk integrity
├── Reconstruct audit bundle
├── Re-execute work package
└── Compare results
6. Recovery Operations
├── Handle validator failures
├── Request replacement chunks
├── Verify reconstructed data
└── Update availability status
Key Success Metrics
- Availability: ≥342 chunks accessible for any work package
- Latency: Chunk collection within epoch boundaries
- Integrity: 100% data correctness after reconstruction
- Resilience: Function with up to 681 validator failures
This comprehensive system ensures that JAM can maintain data availability and correctness even in highly adversarial environments while efficiently utilizing network and storage resources.