Skip to content

Defining Tables

Online: https://thekevinscott.github.io/dirsql/guide/tables

Each table in dirsql maps a set of files to rows in an in-memory SQLite table. A table definition has three parts: DDL, a glob pattern, and an extract function.

Table constructor

python
from dirsql import Table

table = Table(
    ddl="CREATE TABLE comments (id TEXT, body TEXT, author TEXT)",
    glob="comments/**/index.jsonl",
    extract=lambda path: [
        {"id": "...", "body": "...", "author": "..."}
    ],
)
rust
use dirsql::{Table, Value};
use std::collections::HashMap;

let table = Table::new(
    "CREATE TABLE comments (id TEXT, body TEXT, author TEXT)",
    "comments/**/index.jsonl",
    |_path| {
        let mut row: HashMap<String, Value> = HashMap::new();
        row.insert("id".into(), Value::Text("...".into()));
        row.insert("body".into(), Value::Text("...".into()));
        row.insert("author".into(), Value::Text("...".into()));
        vec![row]
    },
);
typescript
import type { TableDef } from 'dirsql';

const table: TableDef = {
  ddl: 'CREATE TABLE comments (id TEXT, body TEXT, author TEXT)',
  glob: 'comments/**/index.jsonl',
  extract: (_path) => [
    { id: '...', body: '...', author: '...' },
  ],
};

All three arguments are keyword-only (in Python). In Rust they are positional to Table::new. In TypeScript a table is a plain TableDef object literal — the TS SDK exports the TableDef type (not a class).

ddl

A SQLite CREATE TABLE statement. This defines the schema of the table. dirsql executes this DDL directly against the in-memory database, so any valid SQLite column types and constraints work.

python
# Simple text columns
ddl="CREATE TABLE notes (title TEXT, body TEXT)"

# Typed columns
ddl="CREATE TABLE metrics (name TEXT, value REAL, count INTEGER)"

# With constraints
ddl="CREATE TABLE items (id TEXT PRIMARY KEY, name TEXT NOT NULL)"

The table name is parsed from the DDL. It must be a valid SQLite identifier.

glob

A glob pattern that determines which files feed into this table. Matched relative to the root directory passed to DirSQL.

python
glob="*.json"                  # JSON files in root only
glob="**/*.json"               # JSON files at any depth
glob="comments/**/index.jsonl" # JSONL files in comment subdirectories
glob="data/*.csv"              # CSV files in data/

Glob syntax follows standard Unix globbing rules. ** matches any number of directory levels.

extract

A callable (path: str) -> list[dict] that converts a file into rows.

  • path is the absolute filesystem path of the matched file
  • Return a list of dicts, where each dict maps column names to values
  • Return an empty list to skip a file

dirsql does not read file contents for you. If your extract needs the file body, read it inside the callback using path. Callbacks that derive columns only from the path (or that rely solely on the auto-injected filesystem-fact columns) never touch the file at all.

python
import json

# Single-object JSON files: one row per file
def extract(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(f.read())]

# JSONL files: one row per line
def extract(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

# Derive a value from the file path alone -- no file read
import os
extract = lambda path: [{"id": os.path.basename(os.path.dirname(path))}]

# Conditionally skip files
def extract(path):
    with open(path, encoding="utf-8") as f:
        data = json.loads(f.read())
    if data.get("draft"):
        return []
    return [data]

Multiple tables

Pass multiple Table definitions to index different file types into separate tables:

python
from dirsql import DirSQL, Table
import json

db = DirSQL(
    "./workspace",
    tables=[
        Table(
            ddl="CREATE TABLE posts (title TEXT, author_id TEXT)",
            glob="posts/*.json",
            extract=lambda path: [json.loads(open(path, encoding="utf-8").read())],
        ),
        Table(
            ddl="CREATE TABLE authors (id TEXT, name TEXT)",
            glob="authors/*.json",
            extract=lambda path: [json.loads(open(path, encoding="utf-8").read())],
        ),
    ],
)
rust
use dirsql::{DirSQL, Table, Value};
use std::collections::HashMap;

// See `row_from_json` in getting-started.md for a reusable helper.
fn row_from_json(raw: &str) -> HashMap<String, Value> {
    let v: serde_json::Value = serde_json::from_str(raw).unwrap();
    let serde_json::Value::Object(obj) = v else { return HashMap::new() };
    obj.into_iter()
        .map(|(k, val)| {
            let v = match val {
                serde_json::Value::String(s) => Value::Text(s),
                serde_json::Value::Number(n) => n
                    .as_i64()
                    .map(Value::Integer)
                    .unwrap_or_else(|| Value::Real(n.as_f64().unwrap_or(0.0))),
                serde_json::Value::Bool(b) => Value::Integer(b as i64),
                serde_json::Value::Null => Value::Null,
                other => Value::Text(other.to_string()),
            };
            (k, v)
        })
        .collect()
}

let db = DirSQL::new(
    "./workspace",
    vec![
        Table::new(
            "CREATE TABLE posts (title TEXT, author_id TEXT)",
            "posts/*.json",
            |path| vec![row_from_json(&std::fs::read_to_string(path).unwrap())],
        ),
        Table::new(
            "CREATE TABLE authors (id TEXT, name TEXT)",
            "authors/*.json",
            |path| vec![row_from_json(&std::fs::read_to_string(path).unwrap())],
        ),
    ],
)?;
typescript
import { DirSQL, type TableDef } from 'dirsql';
import { readFileSync } from 'node:fs';

const tables: TableDef[] = [
  {
    ddl: 'CREATE TABLE posts (title TEXT, author_id TEXT)',
    glob: 'posts/*.json',
    extract: (path) => [JSON.parse(readFileSync(path, 'utf8'))],
  },
  {
    ddl: 'CREATE TABLE authors (id TEXT, name TEXT)',
    glob: 'authors/*.json',
    extract: (path) => [JSON.parse(readFileSync(path, 'utf8'))],
  },
];

const db = new DirSQL({ root: './workspace', tables });

Each table has its own glob and extract function. A file can only match one table (the first matching glob wins).

Ignore patterns

Use the ignore parameter to exclude paths from all tables:

python
db = DirSQL(
    "./workspace",
    ignore=["**/node_modules/**", "**/.git/**"],
    tables=[...],
)
rust
let db = DirSQL::with_ignore(
    "./workspace",
    vec![/* tables */],
    vec!["**/node_modules/**", "**/.git/**"],
)?;
typescript
const db = new DirSQL({
  root: './workspace',
  tables: [/* tables */],
  ignore: ['**/node_modules/**', '**/.git/**'],
});

Ignore patterns are applied before glob matching. Any file matching an ignore pattern is skipped regardless of table globs.

Supported value types

The extract function can return these Python types, which map to SQLite types:

Python typeSQLite type
strTEXT
intINTEGER
floatREAL
boolINTEGER (0/1)
bytesBLOB
NoneNULL

Any other type is converted to its string representation via str().

Released under the MIT License.