web-observatory

web-observatory is a Python package for collecting and analyzing webpages.

See here for extended examples of web-observatory in use.

Modules

`start_project`

Initializes a project directory.

`search_google`

Searches Google for terms. Requires Google Custom Search Engine credentials.

`google_process`

Compiles results from multiple Google searches.

`get_domains`

Extracts domain-level information from the URLs returned by Google searches (e.g. 'google' in www.google.com).
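As a rough illustration of the kind of extraction described above, the sketch below pulls the 'google' part out of www.google.com using only the standard library. The function name `domain_label` is hypothetical and is not part of web-observatory. It assumes the label just before the final suffix is the one of interest, which will be wrong for multi-part suffixes such as .co.uk.

```python
from urllib.parse import urlparse


def domain_label(url):
    """Return the label before the final suffix of a URL's host (hypothetical helper)."""
    # urlparse only recognizes the host when a scheme or a leading '//' is present.
    if "//" not in url:
        url = "//" + url
    host = (urlparse(url).hostname or "").lower()
    labels = [part for part in host.split(".") if part]
    if len(labels) >= 2:
        # Assumption: a single-label suffix such as .com or .org.
        return labels[-2]
    return labels[0] if labels else ""


print(domain_label("https://www.google.com/search?q=x"))
```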

`initialize_crawl`

Initializes a Scrapy crawl on a set of domains. Returns a JSON file of the URLs found through the crawl.

`crawl_process`

Processes the JSON output of a crawl into a pandas DataFrame.

`crawl`

Not yet implemented as a module, but the crawl can be run with a command such as `!scrapy crawl digcon_crawler -O output.json --nolog`.

`search_merge`

Merges Google searches and crawl results.

`get_versions`

~~Gets historical versions of Twitter-searched urls using the Internet Archive's Wayback Machine. Attempts to find the version of the page archived closest in time to when it was tweeted.~~
Uses the requests package to request each URL and retrieve its "full" address rather than a shortened or redirecting one (e.g. bit.ly/12312). This helps in scraping.
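One way to resolve a redirecting address with requests is sketched below. This is not the package's own code: `resolve_url` and its `session` and `timeout` parameters are hypothetical names, and the sketch assumes a HEAD request is enough to follow the redirect chain (some servers only respond properly to GET).

```python
import requests


def resolve_url(url, session=None, timeout=10):
    """Follow redirects and return the final URL, or None on failure (hypothetical helper)."""
    http = session or requests
    try:
        # requests.head does not follow redirects unless asked to.
        response = http.head(url, allow_redirects=True, timeout=timeout)
    except requests.RequestException:
        return None
    # response.url holds the address reached after all redirects.
    return response.url
```

Passing a `requests.Session()` as `session` reuses connections when resolving many URLs in a row.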

`initialize_scrape`

Initializes files to scrape urls for their HTML.

`scrape`

Conducts the scrape of pages' HTML. Stores body text in a PostgreSQL database.

`query`

A set of methods for searching the PostgreSQL database of site text, including filtering empty results and counting specified search terms.
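The two operations mentioned above, filtering empty results and counting search terms, can be sketched on in-memory data as follows. The package itself works against PostgreSQL; the functions `drop_empty` and `count_terms` and the `{"url": ..., "text": ...}` record shape are assumptions made for illustration only, and matching here is case-insensitive on whole words.

```python
import re


def drop_empty(pages):
    """Keep only records whose 'text' field has non-whitespace content."""
    return [page for page in pages if page.get("text") and page["text"].strip()]


def count_terms(text, terms):
    """Count case-insensitive whole-word occurrences of each term in text."""
    counts = {}
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        counts[term] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts


pages = [
    {"url": "a", "text": "Open data and open source. Data!"},
    {"url": "b", "text": "   "},
    {"url": "c", "text": None},
]
for page in drop_empty(pages):
    print(page["url"], count_terms(page["text"], ["open", "data", "open source"]))
```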

`ground_truth`

Produces a sample of pages for verifying counts of terms.
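A reproducible random sample is one simple way to pick pages for manual verification. The sketch below assumes a plain list of URLs and a fixed seed; `sample_pages` is a hypothetical name, and the package may sample differently (for example, stratified by domain).

```python
import random


def sample_pages(urls, k, seed=0):
    """Return up to k distinct URLs, chosen reproducibly for a given seed."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    urls = list(urls)
    return rng.sample(urls, min(k, len(urls)))


urls = [f"https://example.org/page/{i}" for i in range(20)]
print(sample_pages(urls, 5, seed=42))
```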

`analyze_orgs`

Calculates and visualizes averages and frequencies for each search term in the site text and summarizes by organization (domain).

`analyze_currentuse`

Calculates the current average and frequency of terms. Useful when dealing with historical page versions.

`analzye_term_correlations`

Calculates and visualizes covariance metrics for specified search terms in the site text.

`analyze_association`

Measures the association between terms as the percentage of pages they share.
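"Percentage of shared pages" can be defined in more than one way. The sketch below takes one possible reading: pages containing both terms divided by pages containing either term. The function `shared_page_percent` and the `pages_by_term` mapping are hypothetical, and the package may use a different denominator.

```python
def shared_page_percent(pages_by_term, term_a, term_b):
    """Percent of pages using either term that use both (one possible definition)."""
    pages_a = set(pages_by_term.get(term_a, ()))
    pages_b = set(pages_by_term.get(term_b, ()))
    either = pages_a | pages_b
    if not either:
        return 0.0
    return 100.0 * len(pages_a & pages_b) / len(either)


pages_by_term = {
    "open data": {"p1", "p2", "p3"},
    "privacy": {"p2", "p3", "p4"},
}
print(shared_page_percent(pages_by_term, "open data", "privacy"))  # 50.0
```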

`co_occurrence`

Returns the specific pages that use two or more of the specified search terms.
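A minimal sketch of this kind of lookup, assuming page text is available as a mapping from URL to text, is shown below. The name `pages_with_cooccurrence` and its `minimum` parameter are hypothetical and not taken from the package; terms are matched as case-insensitive whole words.

```python
import re


def pages_with_cooccurrence(pages, terms, minimum=2):
    """Return URLs whose text contains at least `minimum` of the given terms."""
    matches = []
    for url, text in pages.items():
        present = [
            term
            for term in terms
            if re.search(r"\b" + re.escape(term) + r"\b", text, flags=re.IGNORECASE)
        ]
        if len(present) >= minimum:
            matches.append(url)
    return matches


pages = {
    "a": "Open data and privacy policy",
    "b": "Privacy only",
    "c": "open data, privacy, and transparency",
}
print(pages_with_cooccurrence(pages, ["open data", "privacy", "transparency"]))
```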

Issues and Development

See: web-observatory project

