ArchiveBox: an open source, self-hosted web archiving tool

What is ArchiveBox?

ArchiveBox is an open source, self-hosted web archiving solution. It helps individuals and organizations save web content for offline browsing and keeps that data accessible over the long term.

The goal is to let users proactively save the web content they care about, so that important information is not lost to dead links, content changes, or services going offline. Content that can be archived includes bookmarks, social media content (such as Facebook photos and YouTube videos), research papers, legal evidence, and more.

Core features and technical characteristics

Various input methods

You can feed the content you want to save into ArchiveBox from multiple sources, including:

Individual URLs
Browser bookmarks or history
RSS feeds
Pocket, Pinboard, and other bookmarking services
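
ArchiveBox parses bookmark exports and feeds on its own, so the following is only a minimal sketch for the case where URLs come from a custom source and you want to merge them into one list first. The function names (collect_urls, to_stdin_payload) and the regular expression are illustrative assumptions, not part of ArchiveBox.

```python
import re
from typing import Iterable, List

# Simplified pattern (an assumption for this sketch): it stops at
# whitespace, quotes, and angle brackets, which is enough for plain
# text, HTML attributes, and simple RSS <link> elements.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def collect_urls(sources: Iterable[str]) -> List[str]:
    """Extract http(s) URLs from several text blobs.

    Duplicates are dropped and first-seen order is preserved.
    """
    seen = set()
    urls = []
    for text in sources:
        for match in URL_PATTERN.findall(text):
            # Trim punctuation that commonly trails a URL in prose.
            url = match.rstrip(".,;)")
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls


def to_stdin_payload(urls: List[str]) -> str:
    """Join URLs one per line, ready to be written to a file or piped."""
    return "\n".join(urls) + "\n"
```

The resulting text can be written to a file such as urls.txt (a hypothetical name) and passed to ArchiveBox on standard input, for example with archivebox add < urls.txt.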

Automatic capture in multiple formats

ArchiveBox generates multiple archive formats for each page, such as:

Web pages: original HTML, SingleFile HTML, PNG screenshots, PDF, WARC, etc.
Social media content: plain text, comments, authors, images, etc.
Media content: MP3/MP4, subtitles, metadata, thumbnails, etc.
Code hosting services (GitHub/GitLab): cloned source code, README, etc.

Multiple access methods

Command-line tool (CLI): full control and integration with automation scripts
Web interface: intuitive operation and preview
Python library / REST API / webhooks: for custom development and integration

Data storage

Content is saved to the ordinary file system, with no proprietary formats required
Archived content is stored in a local folder, which makes long-term use and migration straightforward
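
Because the archive is just files in a folder, it can be inspected without ArchiveBox itself. The sketch below assumes a layout in which each snapshot lives in its own subfolder of archive/ and carries an index.json file with url and title fields. That layout and those field names are assumptions for illustration and may differ between versions. The function name list_snapshots is hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, List


def list_snapshots(data_dir: str) -> List[Dict[str, str]]:
    """Return folder name, URL, and title for each snapshot found.

    Assumed layout (may vary by ArchiveBox version):
        <data_dir>/archive/<snapshot-folder>/index.json
    Folders without a readable index.json are skipped.
    """
    snapshots: List[Dict[str, str]] = []
    archive_dir = Path(data_dir) / "archive"
    if not archive_dir.is_dir():
        return snapshots

    for folder in sorted(archive_dir.iterdir()):
        index_file = folder / "index.json"
        if not index_file.is_file():
            continue
        try:
            meta = json.loads(index_file.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # unreadable or corrupt metadata: skip this folder
        if not isinstance(meta, dict):
            continue
        snapshots.append(
            {
                "folder": folder.name,
                "url": str(meta.get("url") or ""),
                "title": str(meta.get("title") or ""),
            }
        )
    return snapshots
```

Calling list_snapshots("my_archive") on a data folder would return one dictionary per readable snapshot, sorted by folder name.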

Installation and deployment methods

ArchiveBox supports several installation methods. The following are recommended:

Docker / Docker Compose (recommended)
Bundles all dependencies, which makes deployment and upgrades easy.
Command-line installation (Linux, macOS, Debian, etc.)
Run pip install archivebox, then archivebox install, or use the one-line curl | bash script.
Supported platforms: Linux, macOS, and BSD natively; Windows through Docker or WSL2
Resource requirements: at least 500 MB of RAM, with 2 GB or more recommended; file systems that support compression (such as ZFS or BTRFS) store archives more efficiently

Working principle and design concept

ArchiveBox uses a variety of tools (such as wget and headless Chrome) to capture content.
The author sees "decentralization" as the core advantage: instead of relying on a single service (such as archive.org) to archive the whole web, users save content themselves and can share it later.
The backend is built on the Django framework and uses SQLite as the local database. The plugin system is based on Pluggy, and the REST API uses django-ninja and Pydantic.

Quick start example

Initialize a project directory:
    mkdir my_archive && cd my_archive
    archivebox init --setup
Add a URL to archive:
    archivebox add https://example.com
Launch a local web server to preview the archive:
    archivebox server
Import history or bookmarks:
    Imports from Pocket, Pinboard, browser bookmarks, RSS feeds, and similar sources are supported.
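
The quick-start steps can also be driven from a script, which is one way to use the CLI for automation. The following is a minimal sketch that assumes archivebox is already installed and on PATH. The helper names (quickstart_commands, run_quickstart) are hypothetical, and the server step is left out because it runs until interrupted.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def quickstart_commands(url: str) -> List[List[str]]:
    """Build the init and add commands from the quick start as argument lists."""
    return [
        ["archivebox", "init", "--setup"],
        ["archivebox", "add", url],
    ]


def run_quickstart(data_dir: str, url: str) -> None:
    """Create the data folder and run each command inside it.

    Raises RuntimeError if the archivebox executable is not found.
    Note: the init step may ask for interactive input on a first run.
    """
    if shutil.which("archivebox") is None:
        raise RuntimeError("archivebox executable not found on PATH")

    folder = Path(data_dir)
    folder.mkdir(parents=True, exist_ok=True)

    for command in quickstart_commands(url):
        # check=True stops at the first command that exits with an error.
        subprocess.run(command, cwd=folder, check=True)
```

A call such as run_quickstart("my_archive", "https://example.com") mirrors the manual steps. Passing arguments as lists rather than a single shell string avoids quoting problems with URLs that contain characters like & or ?.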

Community feedback and usage scenarios

In a Reddit discussion, developers described ArchiveBox as a complex but feature-rich Django project that can serve as an alternative to archive.org and capture more formats (screenshots, PDF, etc.).

Other users emphasize that it gives them more autonomy over preserving web content and adds a redundant backup.

Summary list

Feature	Description
Type	Open source, self-hosted web archiving tool
Supported inputs	URLs, bookmarks, history, RSS, bookmarking services
Saved formats	HTML, PDF, PNG, WARC, audio and video, text, code, etc.
Access methods	CLI / web interface / API
Recommended installation methods	Docker, or pip plus the install script
Supported platforms	Linux, macOS, and BSD natively; Windows via Docker or WSL2
Technology stack	Python, Django, SQLite, Pluggy, django-ninja
Design concept	Decentralized, user-controlled data, autonomous long-term archiving

GitHub: https://github.com/ArchiveBox/archivebox
