archive pseudo-vcs driver: indexing code in archives (e.g. zip, tar) without extracting files#484

Open
muravjov wants to merge 6 commits into hound-search:main from muravjov:archive

muravjov commented May 21, 2024

What kind of change does this PR introduce? (check at least one)

  • Bugfix
  • Feature
  • Code style update
  • Refactor
  • Build-related changes
  • Other, please describe:

The PR fulfills these requirements:

  • All tests are passing?
  • New/updated tests are included?
  • If any static assets have been updated, has ui/bindata.go been regenerated?
  • Are there doc blocks for functions that I updated/created?

If adding a new feature, the PR's description includes:

  • A convincing reason for adding this feature (to avoid wasting your time, it's best to open a suggestion issue first and wait for approval before working on it)

Description:

This PR adds a new driver, archive, which indexes source code inside archives (e.g. zip, tar; any format supported by https://github.com/mholt/archiver) without extracting the files to disk. While indexing, files are walked through the archive API; while searching, matches are verified and snippets are generated from files extracted on the fly.
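To illustrate the idea of walking an archive without extracting it, here is a minimal sketch. It is not the code from this PR: it uses only the Go standard library package archive/zip instead of mholt/archiver, handles zip only, and the names walkZip, grepZip and buildSampleZip are hypothetical helpers made up for this example.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
	"strings"
)

// walkZip calls visit for every regular file in the archive and streams
// its content through an io.Reader; nothing is written to disk.
// (Hypothetical helper, not part of Hound or of this PR.)
func walkZip(r io.ReaderAt, size int64, visit func(name string, content io.Reader) error) error {
	zr, err := zip.NewReader(r, size)
	if err != nil {
		return err
	}
	for _, f := range zr.File {
		if f.FileInfo().IsDir() {
			continue // directory entries carry no content
		}
		rc, err := f.Open()
		if err != nil {
			return err
		}
		err = visit(f.Name, rc)
		rc.Close()
		if err != nil {
			return err
		}
	}
	return nil
}

// grepZip returns the names of archived files whose content contains needle.
// It stands in for the "check results on the fly" step of a search.
func grepZip(data []byte, needle string) ([]string, error) {
	var hits []string
	err := walkZip(bytes.NewReader(data), int64(len(data)), func(name string, content io.Reader) error {
		b, err := io.ReadAll(content)
		if err != nil {
			return err
		}
		if strings.Contains(string(b), needle) {
			hits = append(hits, name)
		}
		return nil
	})
	return hits, err
}

// buildSampleZip creates a small in-memory archive so the example is self-contained.
func buildSampleZip(files map[string]string) ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	for name, body := range files {
		w, err := zw.Create(name)
		if err != nil {
			return nil, err
		}
		if _, err := io.WriteString(w, body); err != nil {
			return nil, err
		}
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	data, err := buildSampleZip(map[string]string{
		"src/main.go": "package main\n",
		"README.md":   "hello\n",
	})
	if err != nil {
		panic(err)
	}
	hits, err := grepZip(data, "package main")
	if err != nil {
		panic(err)
	}
	fmt.Println(hits)
}
```

A real driver would open the archive from the path given in "url" and feed each file to the indexer rather than searching it directly; the sketch only shows that file contents can be read entry by entry through the archive API.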

A config example:

{
  "dbpath" : "db",
  "vcs-config" : {
    "git": {
      "ref" : "main"
    }
  },
  "repos" : {
    "video" : {
      "url" : "/Volumes/1tb-ext4/twitch/video.zip",
      "vcs" : "archive",
      "vcs-config" : {
        "ignored-files" : [".git"]
      },
      "url-pattern" : {
        "base-url" : "file:///Volumes/1tb-ext4/src/twitch/{path}"
      }
    }
  }
}

Some metrics:

  • Indexing 160 zip files (126 GB in total) produced about 3 GB of indexes.
  • A search request takes about 13 seconds to execute.

muravjov commented Jun 1, 2024

@salemhilal, would you mind reviewing this PR?
