Defining Tables

Online: https://thekevinscott.github.io/dirsql/guide/tables

Each table in dirsql maps a set of files to rows in an in-memory SQLite table. A table definition has three parts: DDL, a glob pattern, and an extract function.

Table constructor

PythonRustTypeScript

python

from dirsql import Table

table = Table(
    ddl="CREATE TABLE comments (id TEXT, body TEXT, author TEXT)",
    glob="comments/**/index.jsonl",
    extract=lambda path: [
        {"id": "...", "body": "...", "author": "..."}
    ],
)

rust

use dirsql::{Table, Value};
use std::collections::HashMap;

let table = Table::new(
    "CREATE TABLE comments (id TEXT, body TEXT, author TEXT)",
    "comments/**/index.jsonl",
    |_path| {
        let mut row: HashMap<String, Value> = HashMap::new();
        row.insert("id".into(), Value::Text("...".into()));
        row.insert("body".into(), Value::Text("...".into()));
        row.insert("author".into(), Value::Text("...".into()));
        vec![row]
    },
);

typescript

import type { TableDef } from 'dirsql';

const table: TableDef = {
  ddl: 'CREATE TABLE comments (id TEXT, body TEXT, author TEXT)',
  glob: 'comments/**/index.jsonl',
  extract: (_path) => [
    { id: '...', body: '...', author: '...' },
  ],
};

All three arguments are keyword-only (in Python). In Rust they are positional to Table::new. In TypeScript a table is a plain TableDef object literal — the TS SDK exports the TableDef type (not a class).

`ddl`

A SQLite CREATE TABLE statement. This defines the schema of the table. dirsql executes this DDL directly against the in-memory database, so any valid SQLite column types and constraints work.

python

# Simple text columns
ddl="CREATE TABLE notes (title TEXT, body TEXT)"

# Typed columns
ddl="CREATE TABLE metrics (name TEXT, value REAL, count INTEGER)"

# With constraints
ddl="CREATE TABLE items (id TEXT PRIMARY KEY, name TEXT NOT NULL)"

The table name is parsed from the DDL. It must be a valid SQLite identifier.

`glob`

A glob pattern that determines which files feed into this table. Matched relative to the root directory passed to DirSQL.

python

glob="*.json"                  # JSON files in root only
glob="**/*.json"               # JSON files at any depth
glob="comments/**/index.jsonl" # JSONL files in comment subdirectories
glob="data/*.csv"              # CSV files in data/

Glob syntax follows standard Unix globbing rules. ** matches any number of directory levels.

`extract`

A callable (path: str) -> list[dict] that converts a file into rows.

path is the path of the matched file, relative to the scan root (or absolute when the root passed to DirSQL is absolute)
Return a list of dicts, where each dict maps column names to values
Return an empty list to skip a file

dirsql does not read file contents for you. If your extract needs the file body, read it inside the callback using path. Callbacks that derive columns only from the path (or that rely solely on the auto-injected filesystem-fact columns) never touch the file at all.

python

import json

# Single-object JSON files: one row per file
def extract(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(f.read())]

# JSONL files: one row per line
def extract(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

# Derive a value from the file path alone -- no file read
import os
extract = lambda path: [{"id": os.path.basename(os.path.dirname(path))}]

# Conditionally skip files
def extract(path):
    with open(path, encoding="utf-8") as f:
        data = json.loads(f.read())
    if data.get("draft"):
        return []
    return [data]

Multiple tables

Pass multiple Table definitions to index different file types into separate tables:

PythonRustTypeScript

python

from dirsql import DirSQL, Table
import json

db = DirSQL(
    "./workspace",
    tables=[
        Table(
            ddl="CREATE TABLE posts (title TEXT, author_id TEXT)",
            glob="posts/*.json",
            extract=lambda path: [json.loads(open(path, encoding="utf-8").read())],
        ),
        Table(
            ddl="CREATE TABLE authors (id TEXT, name TEXT)",
            glob="authors/*.json",
            extract=lambda path: [json.loads(open(path, encoding="utf-8").read())],
        ),
    ],
)

rust

use dirsql::{DirSQL, Table, Value};
use std::collections::HashMap;

// See `row_from_json` in getting-started.md for a reusable helper.
fn row_from_json(raw: &str) -> HashMap<String, Value> {
    let v: serde_json::Value = serde_json::from_str(raw).unwrap();
    let serde_json::Value::Object(obj) = v else { return HashMap::new() };
    obj.into_iter()
        .map(|(k, val)| {
            let v = match val {
                serde_json::Value::String(s) => Value::Text(s),
                serde_json::Value::Number(n) => n
                    .as_i64()
                    .map(Value::Integer)
                    .unwrap_or_else(|| Value::Real(n.as_f64().unwrap_or(0.0))),
                serde_json::Value::Bool(b) => Value::Integer(b as i64),
                serde_json::Value::Null => Value::Null,
                other => Value::Text(other.to_string()),
            };
            (k, v)
        })
        .collect()
}

let db = DirSQL::new(
    "./workspace",
    vec![
        Table::new(
            "CREATE TABLE posts (title TEXT, author_id TEXT)",
            "posts/*.json",
            |path| vec![row_from_json(&std::fs::read_to_string(path).unwrap())],
        ),
        Table::new(
            "CREATE TABLE authors (id TEXT, name TEXT)",
            "authors/*.json",
            |path| vec![row_from_json(&std::fs::read_to_string(path).unwrap())],
        ),
    ],
)?;

typescript

import { DirSQL, type TableDef } from 'dirsql';
import { readFileSync } from 'node:fs';

const tables: TableDef[] = [
  {
    ddl: 'CREATE TABLE posts (title TEXT, author_id TEXT)',
    glob: 'posts/*.json',
    extract: (path) => [JSON.parse(readFileSync(path, 'utf8'))],
  },
  {
    ddl: 'CREATE TABLE authors (id TEXT, name TEXT)',
    glob: 'authors/*.json',
    extract: (path) => [JSON.parse(readFileSync(path, 'utf8'))],
  },
];

const db = new DirSQL({ root: './workspace', tables });

Each table has its own glob and extract function. A file can only match one table (the first matching glob wins).

Ignore patterns

Use the ignore parameter to exclude paths from all tables:

PythonRustTypeScript

python

db = DirSQL(
    "./workspace",
    ignore=["**/node_modules/**", "**/.git/**"],
    tables=[...],
)

rust

let db = DirSQL::with_ignore(
    "./workspace",
    vec![/* tables */],
    vec!["**/node_modules/**", "**/.git/**"],
)?;

typescript

const db = new DirSQL({
  root: './workspace',
  tables: [/* tables */],
  ignore: ['**/node_modules/**', '**/.git/**'],
});

Ignore patterns are applied before glob matching. Any file matching an ignore pattern is skipped regardless of table globs.

Supported value types

The extract function can return these Python types, which map to SQLite types:

Python type	SQLite type
`str`	TEXT
`int`	INTEGER
`float`	REAL
`bool`	INTEGER (0/1)
`bytes`	BLOB
`None`	NULL

Any other type is converted to its string representation via str().

Defining Tables ​

Table constructor ​

ddl ​

glob ​

extract ​

Multiple tables ​

Ignore patterns ​

Supported value types ​

Defining Tables

Table constructor

`ddl`

`glob`

`extract`

Multiple tables

Ignore patterns

Supported value types