Skip to content

Fix(ingestion): implement retry mechanism + solve Docker cross-platform issue#557

Open
tanmayjoddar wants to merge 2 commits intocertego:developfrom
tanmayjoddar:develop
Open

Fix(ingestion): implement retry mechanism + solve Docker cross-platform issue#557
tanmayjoddar wants to merge 2 commits intocertego:developfrom
tanmayjoddar:develop

Conversation

@tanmayjoddar
Copy link

@tanmayjoddar tanmayjoddar commented Jan 30, 2026

Implemented comprehensive retry logic for all ingestion sources.

During testing, discovered and fixed a critical Docker cross-platform bug that prevented
BuffaLogs from starting in Linux containers on Windows development machines (WSL2). This
blocked all testing until resolved.


Retry Mechanism Implementation:

  • Moved retry config from environment variables to ingestion.json
  • Implemented shared retry configuration in BaseIngestion class
  • All three sources (Elasticsearch, OpenSearch, Splunk) now use identical retry structure
  • Exponential backoff with jitter (1s→2s→4s...max 30s) using Python backoff library
  • Health checks after successful connection establishment
  • Fail-fast behavior when retries exhausted (re-raises exception)
  • Retry logs include full exception messages (critical mentor feedback)

Backward Compatibility:

  • Default retry config merged when not present in ingestion.json
  • Defaults: enabled=true, max_retries=10, initial_backoff=1s, max_backoff=30s, max_elapsed_time=60s, jitter=true
  • Existing configs without retry block continue working

Critical Docker Bug Discovered & Fixed:

While testing the retry implementation, discovered BuffaLogs container failed to start with error:
exec ./run.sh: no such file or directory

Root Cause: Shell scripts (run.sh, run_worker.sh, run_beat.sh) had Windows line endings
(CRLF) which Linux containers cannot execute. This is a cross-platform compatibility issue
that affects anyone developing on Windows with WSL2/Docker Desktop.

Solution Implemented:

  • Converted all shell scripts from CRLF to LF line endings
  • Updated Dockerfile to use explicit /bin/bash invocation
  • Added dos2unix to build process as safety net
  • Now works seamlessly on both Windows and Linux environments

Files Modified:

  • config/buffalogs/ingestion.json - Added retry config blocks
  • buffalogs/impossible_travel/ingestion/base_ingestion.py - Shared _read_retry_config()
  • buffalogs/impossible_travel/utils/connection_retry.py - Generic retry decorator
  • buffalogs/impossible_travel/ingestion/elasticsearch_ingestion.py - Applied retry logic
  • buffalogs/impossible_travel/ingestion/opensearch_ingestion.py - Applied retry logic
  • buffalogs/impossible_travel/ingestion/splunk_ingestion.py - Applied retry logic
  • buffalogs/impossible_travel/tests/ingestion/test_elasticsearch_ingestion.py - Updated tests
  • build/Dockerfile - Cross-platform shell script support + dos2unix
  • buffalogs/run.sh - Fixed line endings (CRLF→LF)
  • buffalogs/run_worker.sh - Fixed line endings (CRLF→LF)
  • buffalogs/run_beat.sh - Fixed line endings (CRLF→LF)

Removed:

  • config/buffalogs/buffalogs.env - Removed BUFFALOGS_ES_MAX_RETRIES and BUFFALOGS_ES_RETRY_MAX_TIME
  • buffalogs/buffalogs/settings/certego.py - Removed retry Django settings
  • buffalogs/impossible_travel/tests/ingestion/test_retry_logic.py - Deleted problematic unit tests

Testing:

  • BuffaLogs container now starts successfully on Windows with WSL2
  • All retry configurations load correctly from ingestion.json
  • Backward compatibility verified with missing retry blocks
  • Docker works on both Windows and Linux (cross-platform verified)

Fixes #545

Elasticsearch retry with exception logging via predicate-based backoff hook

Screenshot 2026-01-30 212254 - Copy
Screenshot 2026-01-30 230540

Docker Cross-Platform Fix: Shell Script Line Endings Resolved

Problem: exec ./run.sh: no such file or directory on fresh docker-compose up

Root Cause: Shell scripts (run.sh, run_worker.sh, run_beat.sh) had Windows CRLF line endings; Linux containers cannot execute them.

Solution:

  • Converted scripts from CRLF → LF
  • Updated Dockerfile to use explicit /bin/bash invocation
  • Added dos2unix as build safety net

Result: All services running healthy on both Windows and Linux environments
image


Screenshot 2026-01-30 230603

@tanmayjoddar
Copy link
Author

tanmayjoddar commented Feb 1, 2026

I have attached an additional screenshot showing successful local testing with Elasticsearch:

-Documents are generated and indexed correctly.

-The index is queryable and returns the expected document count.

This completes the local Ingestion and testing flow.

Screenshot 2026-02-02 022354

@tanmayjoddar
Copy link
Author

Hi @Lorygold ,

Just a gentle follow-up on this PR.
Please let me know if there’s anything you’d like changed or improved — I’m happy to adjust.

Thanks!

@Lorygold
Copy link
Contributor

Lorygold commented Mar 5, 2026

@lucaCigarini please run the CI

@lucaCigarini
Copy link

I'm afraid I cannot run CI manually. The author needs to push another commit (even an empty one like this git commit --allow-empty -m "Your commit message")

@tanmayjoddar
Copy link
Author

Hi @lucaCigarini @Lorygold

I pushed a new commit to trigger the CI pipeline as requested.

It looks like the workflow is currently awaiting maintainer approval.
Could someone please approve it so the checks can run?

Thanks!

@tanmayjoddar
Copy link
Author

Hi @lucaCigarini @Lorygold — sorry for the noise. The only remaining failure is Black formatting, pushing the fix now.

@tanmayjoddar
Copy link
Author

tanmayjoddar commented Mar 9, 2026

Hi @lucaCigarini @Lorygold — all CI issues are now resolved:

  • Fixed Black formatting (line-length 160 per project config)
  • Fixed isort import ordering
  • Fixed test_base_ingestion.py: these tests only validate normalize_fields() and don't require a real connection. Since ElasticsearchIngestion.__init__ now calls _initialize_connection(), the tests were failing in CI due to the Docker hostname elasticsearch not being resolvable outside of Docker. Mocked _initialize_connection to keep the tests focused and environment-independent.

Would really appreciate it if someone could approve the workflow run when you get a chance. Thank you for your patience!

@tanmayjoddar
Copy link
Author

test_process_user_logins_ConnectionError and test_process_user_logins_TimeoutError were failing because process_user_logins now logs and re-raises the exception (behavior introduced in this PR). The previous tests only asserted the log but did not account for the propagated exception, causing the test runner to crash.

Fix: Added assertRaises inside assertLogs for both tests to correctly assert both behaviors — the error is logged and the exception propagates to the caller.

Pushing the fix now — will update here once CI completes.

@tanmayjoddar
Copy link
Author

Hi @lucaCigarini @Lorygold — could someone please approve the CI workflow run? I'm actively working on getting all tests to pass and would appreciate being able to see the full output.

Thanks!

- Add shared _read_retry_config() in BaseIngestion
- Add connection_retry.py with generic backoff decorator (backoff library)
- Elasticsearch, OpenSearch, Splunk: eager _initialize_connection() with
  retry on startup; health-check after successful connect
- process_user_logins catches, logs and returns [] on error (per-user
  loop must not abort remaining users)
- process_users re-raises so the task knows if the user-list fetch failed
- Move retry config from env vars into ingestion.json; defaults applied
  when block is absent (backward compatible)

Closes certego#545
process_user_logins catches exceptions, logs them and returns [] —
it does not re-raise. Tests should only assert the log, not a raised
exception.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add retry mechanism and error handling for Elasticsearch connection failures

3 participants