ericnost/web-observatory

web-observatory

Download Latest Version from PyPI

web-observatory is a Python package for collecting and analyzing webpages.

See here for extended examples of web-observatory in use.

Modules

start_project

Initializes a project directory.

search_google

Searches Google for terms. Google Custom Search Engine credentials required.

google_process

Compiles results from multiple Google searches.

get_domains

Extracts domain-level information from the URLs returned by Google searches (e.g. 'google' in www.google.com).
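
As a rough illustration of the kind of extraction described above, here is a minimal sketch using only the standard library. The function name `extract_domain_label` is hypothetical and is not part of web-observatory; the rule it applies (take the label just left of the top-level domain) is a naive assumption that gives the wrong answer for multi-part suffixes such as .co.uk.

```python
from urllib.parse import urlparse


def extract_domain_label(url):
    """Return the label just left of the top-level domain (hypothetical helper)."""
    # urlparse only populates hostname when a scheme or a leading // is present.
    if "://" not in url:
        url = "//" + url
    host = urlparse(url).hostname or ""
    labels = [part for part in host.split(".") if part]
    # Naive rule: second-to-last label. Incorrect for suffixes such as .co.uk.
    return labels[-2] if len(labels) >= 2 else host


print(extract_domain_label("www.google.com"))  # google
```

A production version would likely consult the Public Suffix List rather than counting labels.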

initialize_crawl

Initializes a Scrapy crawl on a set of domains. Returns a JSON file of URLs found through the crawl.

crawl_process

Processes the JSON output of a crawl into a pandas DataFrame.

crawl

Not yet implemented as a module. In the meantime, a crawl can be run with a command such as: !scrapy crawl digcon_crawler -O output.json --nolog

search_merge

Merges Google searches and crawl results.

get_versions

Gets historical versions of Twitter-searched URLs using the Internet Archive's Wayback Machine. Attempts to find the version of the page archived closest in time to when it was tweeted.
Uses the requests package to request each URL and resolve it to its full address rather than a redirect or shortened link (e.g. bit.ly/12312), which helps in scraping.
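
One way to look up the snapshot closest to a given time is the Wayback Machine availability endpoint. The sketch below is not the package's own code: the function names are hypothetical, and it uses the standard library instead of requests so that it stays self-contained. It builds the query URL, parses a response, and leaves the network call in a separate function.

```python
import json
from datetime import datetime
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def wayback_query_url(url, when):
    """Build an availability query for the snapshot closest to `when`."""
    # The endpoint expects timestamps in YYYYMMDDhhmmss form.
    params = {"url": url, "timestamp": when.strftime("%Y%m%d%H%M%S")}
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(payload):
    """Return (snapshot_url, timestamp) from a response dict, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["url"], closest["timestamp"]


def fetch_closest(url, when, timeout=30):
    """Query the endpoint over the network and parse the answer."""
    with urlopen(wayback_query_url(url, when), timeout=timeout) as response:
        return closest_snapshot(json.load(response))
```

Calling `fetch_closest("example.com", datetime(2020, 3, 15))` would return the archived URL and its timestamp if a snapshot exists, or `None` otherwise.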

initialize_scrape

Initializes files to scrape urls for their HTML.

scrape

Conducts the scrape of pages' HTML. Stores body text in a PostgreSQL database.

query

A set of methods for searching the PostgreSQL database of site text, including filtering empty results and counting specified search terms.
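
The two operations named above, filtering empty results and counting search terms, can be sketched in memory as follows. This is an illustration only: the package works against a database, whereas this example operates on plain (url, text) tuples, and both function names are hypothetical. It also assumes case-insensitive, whole-word matching, which may differ from how the package counts.

```python
import re


def drop_empty(pages):
    """Keep only (url, text) rows whose text is not blank or missing."""
    return [(url, text) for url, text in pages if text and text.strip()]


def count_terms(text, terms):
    """Count case-insensitive, whole-word matches of each term in text."""
    return {
        term: len(re.findall(r"\b" + re.escape(term) + r"\b", text, flags=re.IGNORECASE))
        for term in terms
    }
```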

ground_truth

Produces a sample of pages for verifying counts of terms.

analyze_orgs

Calculates and visualizes averages and frequencies for each search term in the site text and summarizes by organization (domain).

analyze_currentuse

Calculates the current average and frequency. Useful when dealing with historical page versions.

analyze_term_correlations

Calculates and visualizes covariance metrics for specified search terms in the site text.

analyze_association

Measures the association between terms as the percentage of pages they share.
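
A percentage of shared pages can be computed in more than one way, so the following sketch states its assumption explicitly: for each pair of terms, it divides the number of pages containing both by the number of pages containing either. The package may use a different denominator, and the function name and input shape (a dict mapping each term to a set of page identifiers) are hypothetical.

```python
from itertools import combinations


def association_percentages(pages_by_term):
    """Map each term pair to the percentage of pages the two terms share."""
    result = {}
    for a, b in combinations(sorted(pages_by_term), 2):
        union = pages_by_term[a] | pages_by_term[b]
        shared = pages_by_term[a] & pages_by_term[b]
        # Assumed denominator: pages that use either term.
        result[(a, b)] = 100.0 * len(shared) / len(union) if union else 0.0
    return result
```

For example, if "data" appears on pages p1, p2, p3 and "justice" on p2, p3, p4, the pair shares two of four pages, giving 50.0.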

co_occurrence

Returns specific pages using two or more specified search terms.
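
A minimal sketch of this kind of lookup, assuming page text is available as a dict mapping URL to text and that a page qualifies only when it contains every term passed in. The function name is hypothetical and matching is by case-insensitive substring, which may not match the package's behavior.

```python
def pages_with_all_terms(page_text, terms):
    """Return URLs whose text contains every term (case-insensitive substring match)."""
    wanted = [term.lower() for term in terms]
    return [
        url
        for url, text in page_text.items()
        if all(term in text.lower() for term in wanted)
    ]
```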

Issues and Development

See: web-observatory project
