AlanFnz/ig-archiver

ig-archiver

Scrapes shared reels and posts from Instagram DMs, takes screenshots with Playwright, and generates AI summaries via GPT-4o vision.

ig-archiver/
├── extension/                    # chrome extension (React + TypeScript + Tailwind v4 + Vite 6)
│   ├── src/
│   │   ├── components/           # UI components (app, header, scan-button, progress-bar, status-feed)
│   │   ├── platform/             # platform abstraction (types, chromePlatform, electronPlatform)
│   │   ├── lib/                  # archiveStream, scraper, config, truncate
│   │   ├── main.tsx
│   │   ├── types.ts
│   │   └── index.css
│   ├── public/
│   │   ├── manifest.json
│   │   └── content-script.js
│   ├── package.json
│   └── dist/                     # built output — load this folder in Chrome
└── server/                       # express backend
    ├── server.js
    ├── login.js
    ├── lib/                      # db, config, capture, summarize
    ├── package.json
    ├── .env.example
    ├── screenshots/              # autogenerated — PNG captures
    └── database.json             # autogenerated — archive records

Quick Start

1 — Set up the server

cd server
yarn install

First time only: install the Playwright browser binary:

yarn playwright install chromium

Copy the env template and add your OpenAI key:

cp .env.example .env

Then edit .env:

OPENAI_API_KEY=sk-...your-key-here...
PORT=3000

2 — Log in to Instagram

The server requires an authenticated Instagram session to visit and screenshot each post. Run this once:

yarn run login

A browser window will open. Log in to Instagram, then come back to the terminal and press Enter. Your session is saved to session.json and loaded automatically on every subsequent server start.

If Instagram ever logs you out, just run yarn run login again.

3 — Start the server

yarn start
# or
yarn run dev

You should see:

[ig-archiver] server running on http://localhost:3000
[ig-archiver] Screenshots → .../server/screenshots
[ig-archiver] Database    → .../server/database.json

4 — Build and load the extension

cd extension
yarn install
yarn run build

Then in Chrome:

  1. Navigate to chrome://extensions/
  2. Enable Developer mode (top-right toggle)
  3. Click Load unpacked
  4. Select the extension/dist/ folder

The extension does not require icons to work. Broken icon warnings can be ignored, or add your own 16×16, 48×48, and 128×128 PNGs to extension/icons/.

5 — Archive shared content from Instagram

  1. Open Instagram in Chrome — the extension's content script begins intercepting data immediately
  2. Navigate to the DM conversation you want to archive
  3. Click the IG Archiver toolbar icon
  4. Click Scan & Archive Chat
  5. The extension auto-scrolls up through the conversation to load older messages (5 batches by default), then archives every shared reel and post it found
  6. Watch the real-time progress as each reel/post is visited, screenshotted, and summarised
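
The real-time progress in step 6 is streamed from the server as NDJSON (one JSON object per line), as described under How it works. The sketch below shows one way a client could turn raw text chunks into events. The `ArchiveEvent` shape and the `createNdjsonParser` name are assumptions for illustration, not the extension's actual archiveStream code.

```typescript
// Hypothetical event shape; the real server may send different fields.
interface ArchiveEvent {
  url: string;
  status: string;
}

// Returns a function that accepts raw text chunks and returns every
// complete JSON line received so far. A trailing partial line is
// buffered until the rest of it arrives in a later chunk.
function createNdjsonParser(): (chunk: string) => ArchiveEvent[] {
  let buffer = "";
  return (chunk: string): ArchiveEvent[] => {
    buffer += chunk;
    const lines = buffer.split("\n");
    // The last element is either "" or an incomplete line: keep it for later.
    buffer = lines.pop() ?? "";
    return lines
      .map((line) => line.trim())
      .filter((line) => line.length > 0)
      .map((line) => JSON.parse(line) as ArchiveEvent);
  };
}

// Usage: the second event is split across two chunks.
const parseChunk = createNdjsonParser();
const firstBatch = parseChunk(
  '{"url":"https://www.instagram.com/p/A/","status":"done"}\n{"url":"https://www.inst'
);
const secondBatch = parseChunk('agram.com/reel/B/","status":"error"}\n');
```

Buffering the trailing fragment matters because network chunks do not respect line boundaries, so a single event can arrive in pieces.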

To capture more history, increase SCROLL_LOADS in extension/src/lib/config.ts and rebuild.


How it works

Chrome Extension                        Node.js Server
────────────────────────────────        ────────────────────────────────
content-script.js (MAIN world)          POST /archive  { urls: [...] }
  → patches XHR at document_start            │
  → captures get_slide_thread_nullable        ↓
    and fetch__SlideThread graphql      for each URL (streaming NDJSON):
    responses as you browse                1. Playwright visits URL (authenticated)
  → stores SlideMessageXMAContent              - waitUntil: load (falls back to
    nodes in window.__igSlideThreads            domcontentloaded on timeout)
                                               - 1280×720 viewport
autoScrollOnce() (MAIN world)                  - SHA-1 filename → screenshots/
  → reads pageInfo cursor from           2. extract <title>, meta description,
    window.__igSlideThreads                  and post caption (article h1)
  → replays XHR to fetch older batch    3. GPT-4o vision → screenshot + caption
  → called SCROLL_LOADS times before       → summary, category + keywords
    scraping                            4. upsert entry in database.json
                                        5. stream progress event back
scrapeExternalLinks() (MAIN world)
  → reads window.__igSlideThreads
  → matches current thread via
    thread_key → thread_fbid mapping
  → extracts target_url from each
    XMA node (instagram.com/p/ or
    instagram.com/reel/ only)
  → sends URL list to server
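
The last step of scrapeExternalLinks() keeps only instagram.com/p/ and instagram.com/reel/ links. A minimal sketch of such a filter is shown below. The function name, the shortcode character set, and the normalisation and de-duplication steps are assumptions added for illustration; the real scraper may behave differently.

```typescript
// Matches post and reel URLs and captures the kind and the shortcode.
// The shortcode character set (letters, digits, "_" and "-") is an assumption.
const POST_OR_REEL = /^https?:\/\/(?:www\.)?instagram\.com\/(p|reel)\/([A-Za-z0-9_-]+)/;

// Keeps only post and reel links, strips query strings and fragments,
// and drops duplicates while preserving first-seen order.
function filterArchivableUrls(candidates: string[]): string[] {
  const seen = new Set<string>();
  for (const raw of candidates) {
    const match = POST_OR_REEL.exec(raw);
    if (match === null) continue; // stories, profiles, other sites
    seen.add(`https://www.instagram.com/${match[1]}/${match[2]}/`);
  }
  return Array.from(seen);
}

// Usage with a mix of archivable and ignored links.
const archivable = filterArchivableUrls([
  "https://www.instagram.com/reel/ABC123/?igsh=x",
  "https://www.instagram.com/p/XYZ_9/",
  "https://www.instagram.com/stories/someone/1/",
  "https://example.com/p/abc/",
  "https://www.instagram.com/reel/ABC123/",
]);
```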

database.json schema

{
  "url": "https://www.instagram.com/reel/ABC123/",
  "title": "Example post title",
  "metaDescription": "A brief description from the page.",
  "summary": "A one-sentence AI-generated overview of the post.",
  "category": "Tutorials",
  "keywords": "cooking, recipe, italian",
  "screenshotPath": "screenshots/3a9f12b04c1e.png",
  "archivedAt": "2026-03-03T10:00:00.000Z",
  "createdAt": "2026-03-03T10:00:00.000Z"
}

Categories: References · Memes · Inspiration · Tutorials · News · Ai · Tools · Music production · Movies and shows · Design · Music · Politics (one or two per entry)

Keywords: up to three comma-separated terms per entry, generated by the model
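
The record above can be described as a type, and the "upsert" step from How it works can be sketched as a pure function. This is an illustration under assumptions: that database.json holds an array of such records, that records are keyed by `url`, and that an update keeps the original `createdAt` while taking the new `archivedAt`. None of this is confirmed by the server code.

```typescript
// Field names follow the example record; all values are strings.
interface ArchiveEntry {
  url: string;
  title: string;
  metaDescription: string;
  summary: string;
  category: string;
  keywords: string;
  screenshotPath: string;
  archivedAt: string;
  createdAt: string;
}

// Appends the entry if its URL is new; otherwise replaces the existing
// record but preserves the original createdAt (assumed semantics).
function upsertEntry(entries: ArchiveEntry[], incoming: ArchiveEntry): ArchiveEntry[] {
  const existing = entries.find((entry) => entry.url === incoming.url);
  if (existing === undefined) return [...entries, incoming];
  const merged: ArchiveEntry = { ...incoming, createdAt: existing.createdAt };
  return entries.map((entry) => (entry.url === incoming.url ? merged : entry));
}
```

Returning a new array instead of mutating the input keeps the function easy to test; the caller decides when to write the result back to disk.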


Configuration

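| .env variable     | Default | Description                                                   |
| ----------------- | ------- | ------------------------------------------------------------- |
| OPENAI_API_KEY    |         | Required. Your OpenAI key.                                    |
| PORT              | 3000    | Server listen port.                                           |
| SCREENSHOT_WIDTH  | 1280    | Viewport / screenshot width.                                  |
| SCREENSHOT_HEIGHT | 720     | Viewport / screenshot height.                                 |
| MOCK              |         | Set to true to skip OpenAI calls and return placeholder data. |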
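
As an illustration of how these variables and defaults fit together, here is a small sketch of a config loader. The `ServerConfig` type and the `loadConfig` and `intOrDefault` names are hypothetical, and the rule that the key is only required when MOCK is off is an assumption; server/lib/config.js may be organised differently.

```typescript
interface ServerConfig {
  openaiApiKey: string | undefined;
  port: number;
  screenshotWidth: number;
  screenshotHeight: number;
  mock: boolean;
}

// Parses a base-10 integer, falling back when the value is missing or invalid.
function intOrDefault(value: string | undefined, fallback: number): number {
  const parsed = Number.parseInt(value ?? "", 10);
  return Number.isNaN(parsed) ? fallback : parsed;
}

// Builds the config from an environment-like record, applying the
// documented defaults.
function loadConfig(env: Record<string, string | undefined>): ServerConfig {
  const mock = env["MOCK"] === "true";
  const openaiApiKey = env["OPENAI_API_KEY"];
  // Assumption: the key is only needed when real OpenAI calls are made.
  if (!mock && !openaiApiKey) {
    throw new Error("OPENAI_API_KEY is not set");
  }
  return {
    openaiApiKey,
    port: intOrDefault(env["PORT"], 3000),
    screenshotWidth: intOrDefault(env["SCREENSHOT_WIDTH"], 1280),
    screenshotHeight: intOrDefault(env["SCREENSHOT_HEIGHT"], 720),
    mock,
  };
}

// Usage: in Node this would typically be called with process.env.
const exampleConfig = loadConfig({ OPENAI_API_KEY: "sk-test", PORT: "8080" });
```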

SCROLL_LOADS (extension-side, in extension/src/lib/config.ts) controls how many scroll batches are fetched before scraping. Default is 5.

TIMEOUT_MS (server-side, in server/lib/config.js) sets the per-URL Playwright navigation timeout. Default is 30000 ms.
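
The timeout interacts with the navigation strategy described under How it works: wait for load, then fall back to domcontentloaded on timeout. The sketch below shows that pattern against a minimal `PageLike` interface that mirrors the shape of Playwright's `page.goto(url, { waitUntil, timeout })`. `PageLike` and `navigateWithFallback` are hypothetical names, and reusing the same timeout for the retry is an assumption.

```typescript
type WaitUntil = "load" | "domcontentloaded";

// Structural stand-in for the part of a Playwright Page used here,
// so the sketch has no external dependency.
interface PageLike {
  goto(url: string, options: { waitUntil: WaitUntil; timeout: number }): Promise<unknown>;
}

const TIMEOUT_MS = 30000;

// Tries a full "load" navigation first; if that times out, retries
// waiting only for "domcontentloaded". Returns the strategy that succeeded.
async function navigateWithFallback(
  page: PageLike,
  url: string,
  timeoutMs: number = TIMEOUT_MS
): Promise<WaitUntil> {
  try {
    await page.goto(url, { waitUntil: "load", timeout: timeoutMs });
    return "load";
  } catch (error) {
    // Assumption: timeouts surface as an Error named "TimeoutError",
    // as Playwright's errors.TimeoutError does. Anything else is rethrown.
    if (!(error instanceof Error) || error.name !== "TimeoutError") throw error;
    await page.goto(url, { waitUntil: "domcontentloaded", timeout: timeoutMs });
    return "domcontentloaded";
  }
}
```

If the second attempt also fails, the error propagates to the caller, which matches the documented behaviour of skipping that URL and continuing with the rest.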


Troubleshooting

| Symptom                                  | Fix                                                                                        |
| ---------------------------------------- | ------------------------------------------------------------------------------------------ |
| No session.json found                    | Run yarn run login in /server.                                                             |
| OPENAI_API_KEY is not set                | Copy .env.example to .env and add your key.                                                |
| Navigation failed for a URL              | The site may block headless browsers or be down. The URL is skipped; other URLs continue.  |
| Extension shows "No shared posts found"  | Make sure the page was loaded with the extension active (reload the tab after installing). |
| Error: connect ECONNREFUSED in extension | Make sure the server is running (yarn start in /server).                                   |
| Playwright browser not found             | Run yarn playwright install chromium inside /server.                                       |
| Fewer links than expected                | Increase SCROLL_LOADS in extension/src/lib/config.ts and rebuild. Instagram loads messages in batches. |
