feature(url-analysis): add domain utility functions and update analysis logic#153
feature(url-analysis): add domain utility functions and update analysis logic#153OrenIntezer wants to merge 1 commit intomasterfrom
Conversation
WalkthroughThis pull request updates the SDK version and enhances URL analysis capabilities. It introduces two private functions for domain extraction and checking, modifies the logic in an existing analysis method, and adds a new test to validate the functionality of the domain checking utility. Changes
✨ Finishing Touches
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
d8555cc to
a5aa056
Compare
There was a problem hiding this comment.
Caution
Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments. If you are seeing this consistently, please check "Code review limits" under "Moderation" settings.
Actionable comments posted: 1
🧹 Nitpick comments (4)
intezer_sdk/analysis.py (3)
341-344: Consider handling edge cases in URL parsing.The function should handle edge cases like invalid URLs or empty strings.
def _get_domain(url: str) -> str: + if not url or not isinstance(url, str): + raise ValueError("URL must be a non-empty string") if not url.startswith(('http://', 'https://')): url = 'http://' + url return urlparse(url).netloc
415-416: Consider documenting the date range change.The date range has been increased from 1 day to 10 days. This should be documented in the function's docstring.
491-492: Enhance the main block example.Consider adding more examples to demonstrate different URL formats and edge cases.
if __name__ == '__main__': - print(_clean_url("https://www.google.com/fdsjklfdjs/fdsjklfdsjklfk/")) + examples = [ + "https://www.google.com/path/", + "http://example.com", + "www.test.com/", + "https://sub.domain.com/path?query=1" + ] + for url in examples: + print(f"Original: {url}") + print(f"Cleaned: {_clean_url(url)}\n")tests/unit/test_url_analysis.py (1)
268-281: LGTM!The test covers various URL formats and domain variations.
Consider adding more test cases:
def test_domain_contains_util(self): # Arrange url1 = 'http://google.com/scans?email=orenk@intezer.com' url2 = 'http://intezer.com/scans?email=someone@example.com' url3 = 'https://www.intezer.com/scans?email=someone@example.com' url4 = 'https://www.analyze.intezer.com/scans?email=someone@example.com' + url5 = 'http://myintezer.com/scans' # Should return False + url6 = '' # Should handle empty string + url7 = 'invalid-url' # Should handle invalid URL url_to_search = 'intezer.com' # Act + Assert self.assertFalse(_domain_contains(url1, url_to_search)) self.assertTrue(_domain_contains(url2, url_to_search)) self.assertTrue(_domain_contains(url3, url_to_search)) self.assertTrue(_domain_contains(url4, url_to_search)) + self.assertFalse(_domain_contains(url5, url_to_search)) + with self.assertRaises(ValueError): + _domain_contains(url6, url_to_search) + with self.assertRaises(ValueError): + _domain_contains(url7, url_to_search)
🛑 Comments failed to post (1)
intezer_sdk/analysis.py (1)
347-348: 🛠️ Refactor suggestion
Improve domain matching logic.
The current implementation might match substrings incorrectly. For example, 'example.com' would match 'myexample.com'.
def _domain_contains(url: str, search_domain: str) -> bool: - return search_domain in _get_domain(url) + domain = _get_domain(url) + return domain == search_domain or domain.endswith('.' + search_domain)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.def _domain_contains(url: str, search_domain: str) -> bool: domain = _get_domain(url) return domain == search_domain or domain.endswith('.' + search_domain)
a5aa056 to
92cdbc4
Compare
92cdbc4 to
a59b82f
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
intezer_sdk/analysis.py (1)
423-427: Consider moving domain filtering to the query level.While the current implementation is correct, filtering analysis IDs after fetching all results could be inefficient for large datasets. Consider adding domain filtering parameters to the
query_url_analyses_historyfunction.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
intezer_sdk/__init__.py(1 hunks)intezer_sdk/analysis.py(3 hunks)tests/unit/test_url_analysis.py(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- intezer_sdk/init.py
- tests/unit/test_url_analysis.py
🔇 Additional comments (1)
intezer_sdk/analysis.py (1)
11-11: LGTM!The
urlparseimport is correctly placed with other standard library imports and is necessary for the new domain utility functions.
| def _get_domain(url: str) -> str: | ||
| if not url.startswith(('http://', 'https://')): | ||
| url = 'http://' + url | ||
| return urlparse(url).netloc |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Add type hints and improve error handling.
The function should include type hints and handle invalid URLs gracefully.
-def _get_domain(url: str) -> str:
+def _get_domain(url: str) -> str:
+ """Extract domain from URL, prefixing with http:// if protocol is missing.
+
+ Args:
+ url: URL string to parse
+
+ Returns:
+ str: Domain extracted from URL
+
+ Raises:
+ ValueError: If URL is invalid
+ """
if not url.startswith(('http://', 'https://')):
url = 'http://' + url
- return urlparse(url).netloc
+ try:
+ return urlparse(url).netloc
+ except Exception as e:
+ raise ValueError(f"Invalid URL format: {url}") from e📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def _get_domain(url: str) -> str: | |
| if not url.startswith(('http://', 'https://')): | |
| url = 'http://' + url | |
| return urlparse(url).netloc | |
| def _get_domain(url: str) -> str: | |
| """Extract domain from URL, prefixing with http:// if protocol is missing. | |
| Args: | |
| url: URL string to parse | |
| Returns: | |
| str: Domain extracted from URL | |
| Raises: | |
| ValueError: If URL is invalid | |
| """ | |
| if not url.startswith(('http://', 'https://')): | |
| url = 'http://' + url | |
| try: | |
| return urlparse(url).netloc | |
| except Exception as e: | |
| raise ValueError(f"Invalid URL format: {url}") from e |
|
|
||
|
|
||
| def _get_domain(url: str) -> str: | ||
| if not url.startswith(('http://', 'https://')): |
|
|
||
| analyses_ids = [ | ||
| report['analysis_id'] for report in analysis_history_url_result.all() | ||
| if _domain_contains(report['submitted_url'], url) |
There was a problem hiding this comment.
Let's talk, not sure i understand why we need the fix here and not in the actual search
Summary by CodeRabbit