# log_ingest
A Rust CLI tool for loading log files into a SQLite database for analysis.
## Overview
Parses application logs containing signature messages and loads them into SQLite for querying. Designed to handle large log volumes (10GB+ per day) with batched inserts and efficient parsing.
## Features

- Parse `signature:` messages, extracting app info, device details, and feature flags
- Support for both plain `.log` and gzip-compressed `.log.gz` files
- File discovery by date range using a `YYYY/mm/dd` directory structure
- Batched inserts for performance with large files
- Parallel file processing for multi-day ingestion
- Indexed columns (`session_id`, `version`) for efficient queries
- Extensible parser architecture for adding new message types
## Installation

```sh
cargo build --release
```
## Usage

### Process a single file

```sh
log_ingest --file /path/to/logs.log --output output.db
```

### Process a date range

```sh
log_ingest \
  --from 2026/01/20 \
  --to 2026/01/21 \
  --base-dir /var/log/myapp \
  --filename app.log \
  --output output.db
```
The tool looks for `<base-dir>/YYYY/mm/dd/<filename>.gz` or `<base-dir>/YYYY/mm/dd/<filename>` for each day in the range, preferring the compressed file.
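The per-day lookup described above can be sketched as follows. This is a minimal illustration, not the tool's actual code; `candidate_paths` is a hypothetical helper, and the date is assumed to already be formatted as `YYYY/mm/dd`:

```rust
use std::path::PathBuf;

/// Build the candidate log paths for one day, compressed file first.
/// `date` is assumed to already be in `YYYY/mm/dd` form.
fn candidate_paths(base_dir: &str, date: &str, filename: &str) -> Vec<PathBuf> {
    let dir: PathBuf = [base_dir, date].iter().collect();
    vec![
        dir.join(format!("{filename}.gz")), // checked first
        dir.join(filename),                 // plain-text fallback
    ]
}

fn main() {
    for p in candidate_paths("/var/log/myapp", "2026/01/20", "app.log") {
        println!("{}", p.display());
    }
}
```

In the real tool, the first candidate that exists on disk would be opened (through a gzip decoder for the `.gz` case).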
### Parallel processing

When processing multiple files, parsing runs in parallel by default using all available CPU cores. A single writer thread handles database inserts to avoid SQLite contention.

```sh
# Use all CPU cores (default)
log_ingest --from 2026/01/01 --to 2026/01/31 ...

# Limit to 4 threads
log_ingest --threads 4 --from 2026/01/01 --to 2026/01/31 ...

# Sequential processing (disable parallelism)
log_ingest --threads 1 --from 2026/01/01 --to 2026/01/31 ...
```
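The parse-in-parallel, write-on-one-thread pipeline can be sketched with standard-library threads and a channel. This is an illustrative fan-in pattern, not the tool's actual implementation; `ingest` and its one-row-per-file stand-in "parser" are hypothetical:

```rust
use std::sync::mpsc;
use std::thread;

// Each worker thread parses one file and sends its rows over a channel;
// a single consumer drains the channel, so only one thread would ever
// touch the SQLite connection.
fn ingest(files: Vec<&'static str>) -> usize {
    let (tx, rx) = mpsc::channel::<String>();
    let mut workers = Vec::new();
    for file in files {
        let tx = tx.clone();
        workers.push(thread::spawn(move || {
            // Stand-in for real parsing: emit one "row" per file.
            tx.send(format!("row from {file}")).unwrap();
        }));
    }
    drop(tx); // close the channel once all senders are gone

    // Single writer: in the real tool, batched inserts happen here.
    let rows: Vec<String> = rx.iter().collect();
    for w in workers {
        w.join().unwrap();
    }
    rows.len()
}

fn main() {
    let n = ingest(vec!["2026/01/01/app.log.gz", "2026/01/02/app.log.gz"]);
    println!("wrote {n} rows");
}
```

Funneling all writes through one consumer sidesteps SQLite's single-writer locking rather than fighting it with retries.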
## Options

| Option | Description |
|---|---|
| `--file <PATH>` | Single log file to process |
| `--from <DATE>` | Start date (`YYYY/mm/dd`) |
| `--to <DATE>` | End date (`YYYY/mm/dd`) |
| `--base-dir <PATH>` | Base directory containing log files |
| `--filename <NAME>` | Log filename (e.g., `app.log`) |
| `-o, --output <PATH>` | Output SQLite database path |
| `--batch-size <N>` | Batch size for inserts (default: 10000) |
| `--threads <N>` | Number of parallel threads (0 = all cores, 1 = sequential) |
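The `--batch-size` option controls how many rows are buffered before a flush, so each flush can become one SQLite transaction instead of one disk sync per row. A minimal sketch of the buffering logic; `BatchWriter` is a hypothetical name and the database call is elided:

```rust
// Rows accumulate in memory and are flushed in groups of `batch_size`.
struct BatchWriter {
    buf: Vec<String>,
    batch_size: usize,
    flushed: usize, // total rows written so far
}

impl BatchWriter {
    fn push(&mut self, row: String) {
        self.buf.push(row);
        if self.buf.len() >= self.batch_size {
            self.flush();
        }
    }

    fn flush(&mut self) {
        // Real implementation: BEGIN; INSERT ... (one per row); COMMIT;
        self.flushed += self.buf.len();
        self.buf.clear();
    }
}

fn main() {
    let mut w = BatchWriter { buf: Vec::new(), batch_size: 3, flushed: 0 };
    for i in 0..7 {
        w.push(format!("row {i}"));
    }
    w.flush(); // flush the final partial batch
    println!("flushed {} rows", w.flushed);
}
```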
## Database Schema

The schema uses normalized lookup tables to minimize disk usage for large datasets.

```sql
-- Lookup tables for low-cardinality text columns
CREATE TABLE apps (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE versions (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE models (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE devices (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE os_versions (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE app_names (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);

-- Main table with foreign keys and millisecond timestamp
CREATE TABLE signature_entries (
    session_id TEXT NOT NULL,
    timestamp_ms INTEGER NOT NULL, -- Unix epoch milliseconds
    app_id INTEGER NOT NULL REFERENCES apps(id),
    version_id INTEGER NOT NULL REFERENCES versions(id),
    offline_login_usage INTEGER,
    is_password_autofill_enabled INTEGER,
    camera_roll_usage INTEGER,
    os_id INTEGER REFERENCES os_versions(id),
    app_name_id INTEGER REFERENCES app_names(id),
    touch_id INTEGER,
    is_offline_login_enabled INTEGER,
    model_id INTEGER REFERENCES models(id),
    device_id INTEGER REFERENCES devices(id),
    password_autofill_usage INTEGER,
    PRIMARY KEY (session_id, timestamp_ms)
) WITHOUT ROWID;

CREATE INDEX idx_session_id ON signature_entries(session_id);
CREATE INDEX idx_version ON signature_entries(version_id);
```
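During ingestion, each low-cardinality string (app, version, model, ...) must be swapped for its integer id. A minimal sketch of the interning pattern, assuming an in-memory cache keyed by name; in the real tool a cache miss would also run `INSERT OR IGNORE` plus `SELECT id` against the lookup table, and the sample version strings here are hypothetical:

```rust
use std::collections::HashMap;

// Maps a repeated string to a small integer id, handing out new ids
// on first sight and cached ids thereafter.
struct Interner {
    ids: HashMap<String, i64>,
    next_id: i64,
}

impl Interner {
    fn intern(&mut self, name: &str) -> i64 {
        if let Some(&id) = self.ids.get(name) {
            return id; // cache hit: no database round-trip needed
        }
        let id = self.next_id;
        self.next_id += 1;
        self.ids.insert(name.to_string(), id);
        id
    }
}

fn main() {
    let mut versions = Interner { ids: HashMap::new(), next_id: 1 };
    let a = versions.intern("8.10.2");
    let b = versions.intern("8.10.3");
    let c = versions.intern("8.10.2"); // same id as the first call
    println!("{a} {b} {c}");
}
```

Because version, model, and device columns repeat the same few strings millions of times, storing a small integer per row instead of the text is what keeps the `signature_entries` table compact.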
## Example Queries

```sql
-- Percentage of users with password autofill enabled
SELECT
    ROUND(100.0 * SUM(is_password_autofill_enabled) / COUNT(*), 2) AS pct
FROM signature_entries;

-- Count by app version
SELECT v.name AS version, COUNT(*) AS cnt
FROM signature_entries se
JOIN versions v ON se.version_id = v.id
GROUP BY v.name
ORDER BY cnt DESC;

-- Device breakdown
SELECT d.name AS device, COUNT(*) AS cnt
FROM signature_entries se
JOIN devices d ON se.device_id = d.id
GROUP BY d.name;

-- Convert timestamp_ms to a readable datetime
SELECT
    datetime(timestamp_ms / 1000, 'unixepoch') AS timestamp,
    session_id
FROM signature_entries
LIMIT 10;
```
## Development

```sh
# Build
cargo build

# Run tests
cargo test

# Format
cargo fmt

# Lint
cargo clippy
```
## Cross-Compilation

To build a Linux x86_64 binary from macOS:

1. Install cargo-zigbuild and Zig:

   ```sh
   cargo install cargo-zigbuild
   brew install zig
   ```

2. Add the Linux target:

   ```sh
   rustup target add x86_64-unknown-linux-gnu
   ```

3. Build:

   ```sh
   cargo zigbuild --release --target x86_64-unknown-linux-gnu
   ```

The binary will be at `target/x86_64-unknown-linux-gnu/release/log_ingest`.
## License

MIT