Draft:WebArchiver
Submission declined on 11 May 2026 by Flyingphoenixchips (talk).
Comment: Hey, thanks for the submission and for disclosing your conflict of interest upfront, that's appreciated. :) Unfortunately this draft isn't quite ready for acceptance yet. The main issue is notability: Wikipedia requires significant coverage from independent, reliable sources, but the only reference here is the project's own GitHub repository. We'd need to see articles or write-ups about WebArchiver from journalists, researchers, or other third parties with no connection to the project. Given the conflict of interest and that the software is at v0.1.0, it may simply be too early, and there's nothing wrong with coming back once the project has gained more outside coverage. Feel free to keep improving the draft in the meantime. If you can demonstrate notability as per above feel free to resubmit. Flyingphoenixchips (talk) 04:41, 11 May 2026 (UTC)
Comment: In accordance with Wikipedia's Conflict of interest guideline, I disclose that I have a conflict of interest regarding the subject of this article. MulderCW (talk) 13:40, 9 April 2026 (UTC)
| WebArchiver | |
|---|---|
| Stable release | 0.1.0 |
WebArchiver is a free and open-source web archiving tool that creates, reads, replays, and manages WARC files. It provides a web-based graphical interface and a REST API for capturing websites, viewing archived content directly in the browser, and downloading standard WARC files. The project aims to offer a no-cost alternative to commercial web archiving services, so that users can work with the WARC file format without paid software.[1]
Background
The Web ARChive (WARC) format, standardized as ISO 28500:2017, is the primary file format used for storing web crawl data.[2] It was originally developed by the International Internet Preservation Consortium (IIPC) and is used by major institutions including the Internet Archive and national libraries worldwide.[3]
While the WARC format itself is an open standard, many of the tools for creating and viewing WARC files are commercial products, command-line-only utilities, or components of larger institutional archiving systems. WebArchiver was developed to address this gap by providing a self-hosted, browser-based tool that handles the complete archiving workflow, from crawling to replay, in a single application.
Architecture
WebArchiver uses a client–server model with a Node.js backend and React frontend, packaged as a Docker container for deployment.
WARC engine
The software includes a custom WARC 1.1 engine that implements the full ISO 28500:2017 specification. It supports all eight WARC record types (warcinfo, response, resource, request, revisit, metadata, conversion, and continuation), per-record gzip compression for random access, SHA-256 digest computation for integrity verification, and deduplication through revisit records.[2]
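The record layout and per-record compression described above can be illustrated with a minimal Python sketch. It is not WebArchiver's own code (the project is written in TypeScript); the function names `build_warc_record` and `compress_record` are hypothetical, only a small subset of the WARC named fields is emitted, and the hexadecimal encoding of the digest value is an assumption made for this example.

```python
import gzip
import hashlib
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri: str, payload: bytes,
                      record_type: str = "resource",
                      content_type: str = "application/octet-stream") -> bytes:
    """Serialize one uncompressed WARC/1.1 record (hypothetical helper)."""
    # For a resource record the block is the payload itself, so one digest covers both.
    digest = hashlib.sha256(payload).hexdigest()
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header_lines = [
        "WARC/1.1",
        f"WARC-Type: {record_type}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {now}",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Block-Digest: sha256:{digest}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    head = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    # Every record is terminated by two CRLF pairs after the block.
    return head + payload + b"\r\n\r\n"


def compress_record(record: bytes) -> bytes:
    """Compress a single record as its own gzip member."""
    return gzip.compress(record)
```

Because each record becomes an independent gzip member, members can be concatenated into one file and any single record can later be decompressed without reading the records before it.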
The parser operates as a streaming finite-state machine that processes WARC files incrementally, supporting multi-gigabyte files without loading them entirely into memory. A CDX (Capture Index) maps each archived URL to its byte offset within the WARC file, enabling direct access to individual records.
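As an illustration of how such an index can be built, the following Python sketch scans a per-record-compressed WARC file in fixed-size chunks and records the byte offset and compressed length of each gzip member. It is a simplified stand-in rather than the project's parser: `build_index`, `target_uri`, and `read_record` are hypothetical names, and a real CDX index also stores capture timestamps and other fields.

```python
import gzip
import zlib
from typing import BinaryIO, Dict, Optional, Tuple


def target_uri(head: bytes) -> Optional[str]:
    """Extract WARC-Target-URI from the header block of one record."""
    header_block = head.split(b"\r\n\r\n", 1)[0]
    for line in header_block.split(b"\r\n")[1:]:  # skip the version line
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"warc-target-uri":
            return value.strip().decode("utf-8")
    return None


def build_index(f: BinaryIO, chunk_size: int = 64 * 1024) -> Dict[str, Tuple[int, int]]:
    """Map each target URI to the (offset, length) of its gzip member."""
    index: Dict[str, Tuple[int, int]] = {}
    offset = 0
    while True:
        f.seek(offset)
        d = zlib.decompressobj(wbits=31)  # 31 selects gzip framing
        head = b""
        consumed = 0
        while not d.eof:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            out = d.decompress(chunk)
            if b"\r\n\r\n" not in head:
                head += out  # keep only enough output to see the header block
            consumed += len(chunk) - len(d.unused_data)
        if not d.eof:
            break  # end of file, or a truncated final member
        uri = target_uri(head)
        if uri is not None:
            index[uri] = (offset, consumed)  # a later capture of the same URI overwrites
        offset += consumed
    return index


def read_record(f: BinaryIO, offset: int, length: int) -> bytes:
    """Random access: decompress exactly one record using its index entry."""
    f.seek(offset)
    return gzip.decompress(f.read(length))
```

The scan never holds more than one chunk of compressed input at a time, which is the property that lets this style of parser handle files larger than available memory.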
Web crawler
The crawler uses Playwright to control a headless Chromium browser, capturing not only the initial HTML but all resources loaded dynamically by JavaScript — including ECMAScript module imports, XHR/Fetch responses, web fonts, and CSS. The crawler waits for all network activity to settle before completing a page capture, ensuring that single-page applications and sites using dynamic module loading are fully archived.[1]
Multiple pages are crawled concurrently using a worker pool model. A per-domain rate limiter serializes requests to the same domain while allowing parallel requests to different domains, preventing excessive load on target servers. The crawler respects robots.txt directives by default.
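The combination of a worker pool and a per-domain limiter might look like the following Python `asyncio` sketch. It is only an illustration under stated assumptions: `DomainRateLimiter` and `crawl` are hypothetical names, the `fetch` callback stands in for the browser-driven page capture, and robots.txt handling is left out.

```python
import asyncio
import time
from collections import defaultdict
from urllib.parse import urlsplit


class DomainRateLimiter:
    """Serialize requests per domain and space them by min_interval seconds."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._locks = defaultdict(asyncio.Lock)  # one lock per domain, created lazily
        self._last = {}  # domain -> monotonic time of the last finished request

    async def run(self, url, fetch):
        domain = urlsplit(url).hostname or ""
        async with self._locks[domain]:  # same domain: one request at a time
            last = self._last.get(domain, float("-inf"))
            wait = last + self.min_interval - time.monotonic()
            if wait > 0:
                await asyncio.sleep(wait)
            try:
                return await fetch(url)
            finally:
                self._last[domain] = time.monotonic()


async def crawl(urls, fetch, workers: int = 4, min_interval: float = 1.0):
    """Fetch all URLs with a fixed pool of workers sharing one limiter."""
    limiter = DomainRateLimiter(min_interval)
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results = {}

    async def worker():
        while True:
            try:
                url = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # no work left for this worker
            results[url] = await limiter.run(url, fetch)

    await asyncio.gather(*(worker() for _ in range(workers)))
    return results
```

Workers holding URLs for different domains proceed in parallel, while two workers holding URLs for the same domain queue on that domain's lock.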
During each crawl, a full-page screenshot is captured and stored as a WARC resource record within the archive file.
Replay system
The replay system serves archived content directly from WARC files in the user's browser. It employs a multi-layer approach to ensure archives are fully self-contained:
- Server-side URL rewriting: All URLs in HTML attributes (`href`, `src`, `srcset`, etc.), CSS (`url()`, `@import`), and HTTP redirect `Location` headers are rewritten to route through the replay server
- Client-side interception: An injected JavaScript shim overrides `fetch()` and `XMLHttpRequest` to rewrite dynamically constructed URLs at runtime
- Content Security Policy: A strict CSP header (`connect-src 'self'`) instructs the browser to block any request that escapes the URL rewriting
- Iframe sandboxing: Archived content is rendered in a sandboxed iframe that prevents navigation to external sites
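The first layer, server-side rewriting of HTML attributes, can be sketched in Python as follows. This is a deliberately small illustration, not the project's rewriter: the `/replay/` prefix and the function name `rewrite_html` are assumptions, only quoted `href` and `src` attributes are handled, and a production rewriter would use a real HTML parser and also cover `srcset`, CSS, and `Location` headers.

```python
import re
from urllib.parse import urljoin

REPLAY_PREFIX = "/replay/"  # hypothetical route on the replay server

# Matches quoted href/src attributes, but not names such as data-src.
ATTR_RE = re.compile(
    r"""(?<![\w-])(?P<attr>(?:href|src)\s*=\s*)(?P<quote>["'])(?P<url>.*?)(?P=quote)""",
    re.IGNORECASE | re.DOTALL,
)

# URLs that never leave the page and therefore need no rewriting.
SKIP_PREFIXES = ("#", "data:", "javascript:", "mailto:")


def rewrite_html(html: str, page_url: str, prefix: str = REPLAY_PREFIX) -> str:
    """Route every href/src URL in the markup through the replay prefix."""

    def repl(match):
        url = match.group("url")
        if url.startswith(SKIP_PREFIXES):
            return match.group(0)
        absolute = urljoin(page_url, url)  # resolve relative URLs against the page
        quote = match.group("quote")
        return f"{match.group('attr')}{quote}{prefix}{absolute}{quote}"

    return ATTR_RE.sub(repl, html)
```

Resolving each URL against the original page address before prefixing it means relative and absolute links end up in the same form, so the replay server can look up either kind in the index.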
HTTP redirect chains (such as CDN 302 redirects) are resolved internally within the archive, serving the final destination content transparently.
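Internal redirect resolution can be pictured with the short Python sketch below. It assumes, purely for illustration, that archived responses are available as a mapping from URL to a `(status, headers, body)` tuple; the name `resolve_redirects` and the hop limit are hypothetical.

```python
from urllib.parse import urljoin

REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def resolve_redirects(records, url, max_hops=10):
    """Follow archived redirects and return (final_url, status, body)."""
    seen = set()
    for _ in range(max_hops + 1):
        if url in seen:
            raise ValueError(f"redirect loop at {url}")
        seen.add(url)
        if url not in records:
            raise LookupError(f"{url} is not in the archive")
        status, headers, body = records[url]
        location = headers.get("Location")
        if status in REDIRECT_STATUSES and location:
            url = urljoin(url, location)  # Location may be relative
            continue
        return url, status, body
    raise ValueError("too many redirects")
```

Every hop is looked up in the archive rather than on the live web, so the replayed page never triggers an outbound request even when the original site redirected through a CDN.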
Version control
WebArchiver supports re-crawling archived sites to create new versions. A diff engine compares versions at the URL level using payload digests, identifying resources that were added, removed, or changed between crawls. This enables tracking of how websites evolve over time.
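A URL-level comparison of this kind reduces to set operations over two digest maps, as in the following Python sketch. The function name `diff_versions` and the input shape (a dictionary from URL to payload digest for each crawl) are assumptions for the example, not the project's actual interface.

```python
from typing import Dict, List


def diff_versions(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two crawls given as {url: payload_digest} mappings."""
    old_urls, new_urls = set(old), set(new)
    return {
        "added": sorted(new_urls - old_urls),
        "removed": sorted(old_urls - new_urls),
        # Present in both crawls, but the payload digest differs.
        "changed": sorted(u for u in old_urls & new_urls if old[u] != new[u]),
    }
```

Comparing digests instead of payloads keeps the diff cheap, since no archived content has to be read back from the WARC files.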
Technology
| Component | Technology |
|---|---|
| Backend | Node.js 20, TypeScript, Fastify |
| Frontend | React 19, Vite, Tailwind CSS |
| Database | SQLite with WAL mode |
| Crawler | Playwright with Chromium |
| Container | Docker |
| License | MIT License |
The application uses SQLite as its database, requiring no external database server. All data — the WARC files, CDX indexes, crawl metadata, and audit logs — is stored in a single data directory, simplifying backup and migration.
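For readers unfamiliar with WAL mode, the following Python sketch shows how a file-based SQLite database inside a data directory can be switched to write-ahead logging. The file name `webarchiver.db` and the helper `open_database` are hypothetical; the `PRAGMA journal_mode=WAL` statement is standard SQLite.

```python
import os
import sqlite3


def open_database(data_dir: str):
    """Open (or create) the database file and enable write-ahead logging."""
    path = os.path.join(data_dir, "webarchiver.db")  # hypothetical file name
    conn = sqlite3.connect(path)
    # The pragma returns the journal mode now in effect, "wal" on success.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    return conn, mode
```

In WAL mode readers do not block the writer, which suits a crawler that writes metadata while the web interface reads it. While connections are open, SQLite keeps `-wal` and `-shm` companion files beside the database, so a backup of the data directory should include them.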
Comparison with other tools
| Tool | Type | WARC support | Browser rendering | GUI | Self-hosted | License |
|---|---|---|---|---|---|---|
| Wayback Machine | Service | Yes | No (server-side) | Yes (web) | No | N/A |
| Heritrix | Crawler | Yes | No | No (CLI) | Yes | Apache 2.0 |
| pywb | Replay | Yes | No (server-side) | Minimal | Yes | GPL 3.0 |
| Webrecorder | Service/Tool | Yes (WACZ) | Yes (Service Worker) | Yes | Partial | AGPL 3.0 |
| wget | Crawler | Yes | No | No (CLI) | Yes | GPL 3.0 |
| WebArchiver | Full stack | Yes (1.1) | Yes (Playwright) | Yes (web) | Yes | MIT License |
References
- [1] "WebArchiver GitHub Repository". Retrieved 2026-04-09.
- [2] "ISO 28500:2017 Information and documentation — WARC file format". International Organization for Standardization.
- [3] "WARC Specifications". International Internet Preservation Consortium.
