What is ArchiveBox?
ArchiveBox is an open source, self-hosted web archiving solution. It helps individuals and organizations save web content for offline browsing and keeps that data accessible over the long term.
The goal is to let users proactively save the web content they care about, so that important information is not lost to broken links, content changes, or services going offline. Content that can be archived includes bookmarks, social media content (such as Facebook photos and YouTube videos), research papers, legal evidence, and more.
Core functions and technical characteristics
Various input methods
You can feed content into ArchiveBox from multiple sources, including:
- Individual URL
- Browser bookmarks or history
- RSS feed
- Pocket, Pinboard, and other bookmarking services
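To make the RSS input path concrete, here is a minimal sketch that pulls item links out of an RSS 2.0 document and turns each one into an `archivebox add <url>` command line. ArchiveBox has its own importers for feeds, so this is not how the tool does it internally. The helper names `urls_from_rss` and `add_commands` and the sample feed are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, used only for illustration.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Example</title><link>https://example.com/</link>
<item><title>A</title><link>https://example.com/a</link></item>
<item><title>B</title><link>https://example.com/b</link></item>
<item><title>A again</title><link>https://example.com/a</link></item>
</channel></rss>"""


def urls_from_rss(rss_text):
    """Return each item's <link> from an RSS 2.0 document, in order, without duplicates."""
    root = ET.fromstring(rss_text)
    seen = set()
    urls = []
    # Only item links are collected; the channel-level <link> is skipped.
    for link in root.findall("./channel/item/link"):
        url = (link.text or "").strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def add_commands(urls):
    """Build one `archivebox add <url>` argument list per URL (nothing is executed)."""
    return [["archivebox", "add", url] for url in urls]
```

Calling `add_commands(urls_from_rss(SAMPLE_FEED))` yields two argument lists, one per unique item link, which could then be passed to `subprocess.run`.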
Automatically grab and save content in multiple formats
ArchiveBox generates multiple archive formats for each page, such as:
- Original HTML, SingleFile HTML, PNG screenshots, PDF, WARC, etc.
- Social media content: TXT text, comments, authors, pictures, etc.
- Media content: MP3/MP4, subtitles, metadata, thumbnails, etc.
- Code hosting services (GitHub/GitLab): Clone code, README, etc.
Multiple access methods
- Command Line Tool (CLI): Complete control and automated script integration
- Web application interface: Intuitive operation and preview
- Python library / REST API / webhooks: Convenient for custom development and integration
Data storage model
- Content is saved to the ordinary file system, with no proprietary formats required
- Archive content is stored in a local folder for long-term use or migration
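Because the archive is just files in a folder, ordinary tools can inspect it. The sketch below walks a directory and sums file sizes per extension. It assumes nothing about ArchiveBox's internal folder layout, and the function name `bytes_by_extension` is hypothetical.

```python
import os
from collections import Counter


def bytes_by_extension(archive_dir):
    """Walk archive_dir and return total bytes per lowercase file extension."""
    totals = Counter()
    for dirpath, _dirnames, filenames in os.walk(archive_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue  # skip broken symlinks and other non-regular entries
            ext = os.path.splitext(name)[1].lower() or "(none)"
            totals[ext] += os.path.getsize(path)
    return dict(totals)
```

Pointing it at an archive folder gives a rough picture of which output formats take up the most space.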
Installation and deployment methods
ArchiveBox supports multiple installation methods, and the following are recommended:
- Docker / Docker Compose (recommended)
  Contains all dependencies for easy deployment and upgrade.
- Command line installation (for Linux / macOS / Debian, etc.)
  pip install archivebox
  archivebox install
  or use the curl | bash one-click script.
- Supported platforms: Linux, macOS, and BSD natively; Windows through Docker or WSL2
- Resource requirements: minimum 500 MB RAM, ≥2 GB recommended; file systems that support compressed storage (such as ZFS or BTRFS) are more efficient
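The platform guidance above can be written as a small decision helper. This is only a sketch of the recommendations listed here, not an official installer check. The function name `recommend_install` and its return labels are hypothetical, and the `system` argument is assumed to be a value such as those returned by Python's `platform.system()`.

```python
import platform
import shutil

# platform.system() reports "Darwin" on macOS.
NATIVE_SYSTEMS = {"Linux", "Darwin"}


def recommend_install(system, has_docker):
    """Map a platform name and Docker availability to an install route."""
    if has_docker:
        return "docker"  # recommended route: bundles all dependencies
    if system in NATIVE_SYSTEMS or system.endswith("BSD"):
        return "pip"  # native install: pip install archivebox
    if system == "Windows":
        return "docker-or-wsl2"  # no native route is listed for Windows
    return "unknown"


def recommend_for_this_machine():
    """Apply recommend_install to the current host."""
    return recommend_install(platform.system(), shutil.which("docker") is not None)
```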
Working principle and design concept
- ArchiveBox uses a variety of tools (such as wget, headless Chrome) to grab content.
- The author sees the core advantage as decentralization: rather than relying on a single service (such as archive.org) for all web archives, users save content themselves and can share it later.
- The project uses the Django framework to build the backend and uses SQLite as the local database; the plug-in system is based on Pluggy; and the REST API uses django-ninja and Pydantic
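One practical consequence of using SQLite is that the index can be opened with standard tooling. The sketch below lists the tables in a SQLite file using only the Python standard library and opens the file read-only so nothing is modified. The database path and table names are not specified here, so any path you pass is an assumption about your own setup; `list_tables` is a hypothetical helper name.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return the table names in a SQLite file, sorted, without modifying it."""
    # mode=ro opens the database read-only via a file: URI.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```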
Quick start example
- Initialize the project directory
  mkdir my_archive && cd my_archive
  archivebox init --setup
- Add a URL to archive
  archivebox add https://example.com
- Launch a local web server to preview the archive
  archivebox server
- Import history or bookmarks
  Imports from Pocket, Pinboard, browser bookmarks, RSS feeds, etc. are supported.
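For automation, the same steps can be driven from a script. The following is a minimal sketch that wraps the commands shown above with `subprocess`. It assumes `archivebox` is already installed and on `PATH`, and that the setup step may ask for interactive input. The helper names `quickstart_steps` and `run_quickstart` are hypothetical, and `archivebox server` is left out because it keeps running until stopped.

```python
import subprocess
from pathlib import Path


def quickstart_steps(url):
    """Return the CLI invocations from the quick start, as argument lists."""
    return [
        ["archivebox", "init", "--setup"],  # initialize the data folder
        ["archivebox", "add", url],         # archive a single URL
    ]


def run_quickstart(archive_dir, url):
    """Create archive_dir if needed and run each step inside it."""
    workdir = Path(archive_dir)
    workdir.mkdir(parents=True, exist_ok=True)
    for argv in quickstart_steps(url):
        # check=True stops at the first command that exits with an error.
        subprocess.run(argv, cwd=workdir, check=True)
```

A call such as `run_quickstart("my_archive", "https://example.com")` mirrors the manual steps. Building the argument lists separately from running them keeps the command sequence easy to inspect or log.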
Community feedback and usage scenarios
In a Reddit discussion, developers described ArchiveBox as a complex but feature-rich Django project that can serve as an alternative to archive.org and captures more formats (screenshots, PDF, etc.).
Other users emphasize that it gives them more autonomy over web content preservation and adds a redundant backup.
Summary list
| Feature | Description |
|---|---|
| Type | Open source, self-hosted web archiving tool |
| Supported inputs | URLs, bookmarks, browser history, RSS, bookmarking services |
| Saved formats | HTML, PDF, PNG, WARC, audio and video, text, code, etc. |
| Interfaces | CLI / web interface / API |
| Recommended installation | Docker, or pip plus the install script |
| Supported platforms | Native on Linux/macOS/BSD; Windows via Docker or WSL2 |
| Technology stack | Python, Django, SQLite, Pluggy, django-ninja |
| Design concept | Decentralized, user-controlled data, autonomous long-term archiving |
GitHub: https://github.com/ArchiveBox/archivebox