feat: Add 8 domain papers and RULEBOOK.md
Domain papers distilled from python-numbers-everyone-should-know: - async-overhead: 1,400x sync vs async overhead - collection-membership: 200x set vs list at 1000 items - json-serialization: 8x orjson vs stdlib - exception-flow: 6.5x exception overhead (try/except free) - string-formatting: f-strings > % > .format() - memory-slots: 69% memory reduction with __slots__ - import-optimization: 100ms+ for heavy packages - database-patterns: 98% commit overhead in SQLite RULEBOOK.md: ~200 token distillation for coding subagents 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
109
papers/collection-membership.md
Normal file
109
papers/collection-membership.md
Normal file
@@ -0,0 +1,109 @@
|
||||
# Collection Membership: The 200x Performance Cliff
|
||||
|
||||
**Domain Paper: Python Collection Selection for Membership Testing**
|
||||
**Date:** 2026-01-03
|
||||
**Source:** Python Numbers Everyone Should Know benchmarks
|
||||
|
||||
---
|
||||
|
||||
## Executive Summary
|
||||
|
||||
Membership testing (`x in collection`) is one of the most common operations in Python code. The choice of collection type can result in a **200x performance difference** at just 1,000 items.
|
||||
|
||||
**Key Finding**: At 1,000 items, checking if an item exists in a list takes 3.9 microseconds. The same check in a set takes 19 nanoseconds. That is a 206x difference.
|
||||
|
||||
---
|
||||
|
||||
## The Core Numbers
|
||||
|
||||
### Membership Testing Performance
|
||||
|
||||
| Operation | Time | Throughput |
|
||||
|-----------|------|------------|
|
||||
| `item in set` (existing) | 19.0 ns | 52.7M ops/sec |
|
||||
| `key in dict` (existing) | 20.8 ns | 48.1M ops/sec |
|
||||
| `item in list` (first) | 13.9 ns | 72.0M ops/sec |
|
||||
| `item in list` (middle, 500th) | 1,956 ns | 511k ops/sec |
|
||||
| `item in list` (last, 999th) | 3,852 ns | 260k ops/sec |
|
||||
| `item in list` (missing) | 3,915 ns | 255k ops/sec |
|
||||
|
||||
### The 200x Cliff Explained
|
||||
|
||||
```
|
||||
Set membership (any position): ~19 ns O(1)
|
||||
List membership (worst case): ~3,915 ns O(n)
|
||||
________
|
||||
206x slower
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Crossover Analysis
|
||||
|
||||
**Crossover point**: A list with ~15-20 items will match set performance for a full scan. Below that, lists may actually be faster due to lower overhead.
|
||||
|
||||
### When to Use Each Type
|
||||
|
||||
**Use Set when:**
|
||||
- Collection has more than ~20 items
|
||||
- Checking membership more than once
|
||||
- Order does not matter
|
||||
|
||||
**Use Dict when:**
|
||||
- You need to associate values with keys
|
||||
- Checking membership AND need to retrieve associated data
|
||||
|
||||
**Use List when:**
|
||||
- Collection is very small (< 20 items)
|
||||
- You iterate but rarely check membership
|
||||
- Items might not be hashable
|
||||
|
||||
---
|
||||
|
||||
## Practical Rules for Coding Agents
|
||||
|
||||
### Rule 1: Default to Set for Membership
|
||||
|
||||
```python
|
||||
# PREFER
|
||||
allowed_values = {'a', 'b', 'c'}
|
||||
if value in allowed_values:
|
||||
|
||||
# AVOID
|
||||
allowed_values = ['a', 'b', 'c']
|
||||
if value in allowed_values:
|
||||
```
|
||||
|
||||
### Rule 2: Convert Lists Before Repeated Lookups
|
||||
|
||||
```python
|
||||
def process_items(items: list, valid_ids: list):
|
||||
valid_set = set(valid_ids) # Convert once
|
||||
return [item for item in items if item.id in valid_set]
|
||||
```
|
||||
|
||||
### Rule 3: Prefer `dict.get()` Over Check-then-Access
|
||||
|
||||
```python
|
||||
# AVOID (double lookup)
|
||||
if key in config:
|
||||
value = config[key]
|
||||
|
||||
# PREFER (single lookup)
|
||||
value = config.get(key, default)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Summary Table
|
||||
|
||||
| Scenario | Best Choice | Why |
|
||||
|----------|-------------|-----|
|
||||
| Membership test on 1000+ items | Set | 200x faster than list |
|
||||
| Key-value lookup | Dict | O(1) access with associated data |
|
||||
| Ordered collection, rare membership | List | Lower memory, maintains order |
|
||||
| Very small collection (< 20 items) | List or Set | Negligible difference |
|
||||
|
||||
---
|
||||
|
||||
*Benchmark source: python-numbers-everyone-should-know*
|
||||
Reference in New Issue
Block a user