# Cultural Proximity Analytics

A reproducible workflow to compute distances (Haversine) and analyze correlations between host variables and proximity to cultural hotspots.

Short-term rentals (hosts/listings) -> distances to cultural hotspots -> correlation matrices (Pearson)
## Table of Contents

- Overview
- Scope and Features
- Installation & Requirements
- Quick Start
- Usage Guide
- Data Formats
- Project Architecture
- Methodology
- Dataset Structure
- Performance Benchmarks
- Results & Visualizations
- Research Team
- Industry Partners Supporting Innovation
- Citation & License
- Acknowledgments
- Contact
- FAQ
## Overview

This repository implements an analysis workflow to:
- Compute geographical distances from short-term rental listings to a set of cultural hotspots in a city using the Haversine formula (km).
- Generate an output dataset with all distances appended and ready for analysis.
- Compute Pearson correlations between a target host variable (e.g., superhost, listings count, or a host-level derived count) and distances to cultural hotspots.
- Export results as correlation matrices (PNG/PDF) and print correlation values to the console.
Included in this repository:
- `Correlations.py`: runnable script (full pipeline).
- `Correlations.ipynb`: explanatory notebook using the same workflow.
## Scope and Features

- 📐 Distance computation: Haversine distances (km) from each listing to multiple hotspots.
- 📊 Correlation analysis: Pearson correlation for existing variables and derived count-based variables.
- 🖼️ Publication-ready outputs: heatmap correlation matrices exported to PNG and PDF.
- 💾 Reproducible artifacts: CSV outputs + `.tar.gz` compression for portability.
| Area | Question | Output |
|---|---|---|
| Urban analytics 🏙️ | Are high-volume hosts closer to cultural hotspots? | Correlations across distances |
| Tourism studies 🏛️ | Which hotspots show the strongest proximity effects? | Ranked correlation values |
| Spatial data science 🧭 | How do host attributes relate to cultural accessibility? | Heatmaps + summary statistics |
- Reads a listings/hosts dataset with latitude/longitude.
- Reads a cultural hotspots dataset with name/latitude/longitude.
- Adds one distance column per hotspot using the `d_t_...` prefix (km).
- Filters relevant columns, maps `host_is_superhost` to 0/1, and removes missing rows.
- Exports the final CSV and a compressed `.tar.gz` copy.
- Computes Pearson correlations between:
  - An existing variable (e.g., `host_is_superhost`, `host_total_listings_count`) and distances, or
  - A derived variable based on counts (e.g., `listing_count` by `host_id`) and distances.
- Generates and saves a correlation matrix (PNG and PDF) using Seaborn/Matplotlib.
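The distance step above can be sketched without the `haversine` package. This pure-Python helper (function name and sample coordinates are hypothetical) implements the same great-circle formula the library computes, and builds one `d_t_<place_name>` entry per hotspot as the pipeline's naming convention describes:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    R = 6371.0  # mean Earth radius, km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

# Illustrative records, not the shipped datasets.
hotspots = [{"place_name": "Cathedral", "latitude": 19.7036, "longitude": -101.1949}]
listing = {"latitude": 19.7000, "longitude": -101.2000}

# One d_t_<place_name> value per hotspot, following the pipeline's prefix convention.
row = {
    f"d_t_{h['place_name']}": haversine_km(
        listing["latitude"], listing["longitude"], h["latitude"], h["longitude"]
    )
    for h in hotspots
}
```

In the real pipeline the `haversine` library performs this computation for every listing and hotspot pair.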
## Installation & Requirements

- Python 3.x
- (Optional) Jupyter to run the notebook
```bash
pip install numpy pandas seaborn matplotlib haversine
```

Note: `tarfile` is part of Python’s standard library (you do not install it via pip).
Verify the installation:

```bash
python -c "import numpy, pandas, seaborn, matplotlib, haversine; print('Installation successful!')"
```

## Quick Start

| Step | What to do |
|---|---|
| 1) Prepare data | Extract the CSV files included in `Information/`. |
| 2) Run | `python Correlations.py` |
| 3) Check outputs | Files are generated under `Results/`: `hosts_with_distances_cultural.csv` (and `.tar.gz`) plus the matrices `Correlation_Matrix_*.png/.pdf`. |
## Usage Guide

The default flow in `Correlations.py` performs:

- Distance computation:
  - Inputs: `Information/hosts.csv`, `Information/cultural_places.csv`
  - Output: `Results/hosts_with_distances_cultural.csv` (+ compression)
- Three correlation experiments:
  - Test 1: `host_is_superhost`
  - Test 2: `host_total_listings_count`
  - Test 3: `host_id` (transformed into `listing_count` per host)
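The Test 3 transform (turning `host_id` into a per-row `listing_count`) can be sketched with the standard library; the sample `host_ids` values are illustrative, not taken from the shipped dataset:

```python
from collections import Counter

# Hypothetical host_id column; the real values come from hosts.csv.
host_ids = [101, 101, 102, 103, 103, 103]

# Number of listings per host.
listing_count = Counter(host_ids)

# Per-row derived variable, as correlated against distances in Test 3.
derived = [listing_count[h] for h in host_ids]
print(derived)  # [2, 2, 1, 3, 3, 3]
```

The pipeline presumably does the equivalent with pandas group-by counting; the `Counter` version only illustrates the transform itself.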
```python
from Correlations import Distances, correlations_existing_variable, correlations_new_variable

Distances(
    hosts="Information/hosts.csv",
    places="Information/cultural_places.csv",
    result="Results/hosts_with_distances_cultural.csv",
)

correlations_existing_variable(
    filename="Results/hosts_with_distances_cultural.csv",
    variable="host_is_superhost",
    matrix_path="Results/Correlation_Matrix_1",
)
```

## Data Formats

### hosts.csv

Must include, at minimum:
- `latitude`, `longitude`
- Host variables to analyze, for example: `host_is_superhost`, `host_total_listings_count`, `host_id`
The script also uses (and/or preserves) identification columns such as `id`, `host_url`, `host_name`, etc.
### cultural_places.csv

Must include:

- `place_name`: name used to build the `d_t_<place_name>` columns
- `latitude`, `longitude`
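A quick pre-flight check that input files carry the required columns can look like this (the helper name and the sample rows are hypothetical):

```python
import csv
import io

REQUIRED_HOSTS = {"latitude", "longitude", "host_is_superhost"}
REQUIRED_PLACES = {"place_name", "latitude", "longitude"}

def missing_columns(csv_text, required):
    """Return the required columns absent from the CSV header, sorted."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return sorted(required - set(header))

# Toy inputs: the first satisfies the hosts schema, the second lacks longitude.
hosts_csv = "id,host_is_superhost,latitude,longitude\n1,t,19.70,-101.19\n"
places_csv = "place_name,latitude\nCathedral,19.7036\n"

print(missing_columns(hosts_csv, REQUIRED_HOSTS))    # []
print(missing_columns(places_csv, REQUIRED_PLACES))  # ['longitude']
```

Running such a check before `Distances(...)` turns a mid-pipeline `KeyError` into an immediate, readable message.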
## Project Architecture

```text
Cultural-Proximity-Analytics/
├── Correlations.py          # Pipeline: distances + correlations + exports
├── Correlations.ipynb       # Explanatory notebook
├── Information/             # Input datasets (tar.gz containing CSV)
│   ├── hosts.tar.gz
│   └── cultural_places.tar.gz
├── Results/                 # Outputs (CSV, PNG, PDF)
│   ├── hosts_with_distances_cultural.csv.tar.gz
│   ├── Correlation_Matrix_1.png/.pdf
│   ├── Correlation_Matrix_2.png/.pdf
│   └── Correlation_Matrix_3.png/.pdf
├── docs/                    # Project assets (logo, figures, team)
└── LICENSE
```
## Methodology

For each listing with coordinates $(\varphi_1, \lambda_1)$ and each hotspot $k$ with coordinates $(\varphi_2, \lambda_2)$, the distance $D_k$ (in km) is computed with the Haversine formula (via the `haversine` library):

$$D_k = 2R \arcsin\left(\sqrt{\sin^2\frac{\varphi_2-\varphi_1}{2} + \cos\varphi_1\cos\varphi_2\,\sin^2\frac{\lambda_2-\lambda_1}{2}}\right)$$

where $R \approx 6371$ km is the Earth's mean radius.

Let $X$ be the host variable under analysis. For each hotspot $k$, the pipeline computes $\rho(X, D_k)$ using Pearson correlation.

The full matrix is visualized as a heatmap and exported to PNG/PDF.
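As a worked example of $\rho(X, D_k)$, here is a minimal pure-Python Pearson implementation (not the pipeline's pandas code) on hypothetical data, where $X$ is a 0/1 superhost flag and $D_k$ a distance column:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical sample: X = host_is_superhost (0/1), d_k = distance to one hotspot (km).
x = [1, 1, 0, 0, 1]
d_k = [0.5, 1.0, 4.0, 5.0, 1.5]

print(round(pearson(x, d_k), 3))  # -0.968
```

In this toy sample the strong negative value means superhosts sit closer to the hotspot; the pipeline reports one such coefficient per hotspot column.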
## Dataset Structure

This repository ships example inputs as compressed archives under `Information/`:

```text
Information/
├── hosts.tar.gz            # Contains hosts.csv
└── cultural_places.tar.gz  # Contains cultural_places.csv
```
Outputs are written to `Results/` when running the pipeline:

```text
Results/
├── hosts_with_distances_cultural.csv.tar.gz
├── Correlation_Matrix_1.png/.pdf
├── Correlation_Matrix_2.png/.pdf
└── Correlation_Matrix_3.png/.pdf
```
README assets are stored under `docs/` (logo, figures, team photos).
## Performance Benchmarks

The pipeline is designed for straightforward batch analysis. Runtime depends on the number of listings and hotspots.
| Stage | Core operation | Typical scaling |
|---|---|---|
| Distance computation | Haversine per listing × hotspot | ~ proportional to (N listings × M hotspots) |
| Correlations | Pearson correlations on numeric columns | ~ proportional to number of rows and variables |
| Visualization/export | Heatmap rendering + file writes | ~ proportional to matrix size |
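When the N listings × M hotspots product grows large, the per-pair loop can be replaced with a NumPy broadcast. This sketch (not the repository's code; function name and sample coordinates are illustrative) computes the full distance matrix at once:

```python
import numpy as np

def haversine_matrix_km(listings, hotspots):
    """All pairwise Haversine distances in km: shape (N listings, M hotspots).

    Both inputs are arrays of [latitude, longitude] rows in degrees.
    """
    R = 6371.0  # mean Earth radius, km
    lat1 = np.radians(listings[:, 0])[:, None]   # shape (N, 1)
    lon1 = np.radians(listings[:, 1])[:, None]
    lat2 = np.radians(hotspots[:, 0])[None, :]   # shape (1, M)
    lon2 = np.radians(hotspots[:, 1])[None, :]
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * R * np.arcsin(np.sqrt(a))         # broadcasts to (N, M)

# Illustrative coordinates (Morelia-area values, not the shipped data).
listings = np.array([[19.70, -101.20], [19.71, -101.18]])
hotspots = np.array([[19.7036, -101.1949], [19.7060, -101.1890]])
D = haversine_matrix_km(listings, hotspots)
print(D.shape)  # (2, 2)
```

Broadcasting turns the N × M double loop into a handful of array operations, which typically starts to matter once both counts reach the thousands.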
## Results & Visualizations

Correlation matrices generated by the included example run (stored under `docs/figures/` for README rendering):

- Test 1 (variable: `host_is_superhost`): `Correlation_Matrix_1.png/.pdf`
- Test 2 (variable: `host_total_listings_count`): `Correlation_Matrix_2.png/.pdf`
- Test 3 (derived: `listing_count` per `host_id`): `Correlation_Matrix_3.png/.pdf`
## Research Team

Collaborators:

- Dr. Francisco Javier Domínguez-Mota 🇲🇽 (Applied Mathematics & Scientific Computing)
- Dr. Heriberto Árias-Rojas 🇲🇽 (Engineering Applications)

Students:

- Gabriela Pedraza-Jiménez
- Eli Chagolla-Inzunza
- Jorge L. González-Figueroa
- Christopher N. Magaña-Barocio
- Maria Goretti Fraga-Lopez
## Industry Partners Supporting Innovation

Collaboration between academia and industry to accelerate real-world impact.
## Citation & License

This project is distributed under the MIT License.

```bibtex
@software{tinoco2024citc_airbnb,
  title = {Cultural Proximity Analytics: Distance and correlation analysis (short-term rentals and cultural hotspots)},
  author = {Tinoco-Guerrero, Gerardo and Guzmán-Torres, José Alberto and Tinoco-Guerrero, Narciso Salvador},
  year = {2024},
  institution = {Universidad Michoacana de San Nicolás de Hidalgo},
  note = {Geographical distance computation (Haversine) and correlation analysis (Pearson) between host variables and proximity to cultural hotspots.}
}
```
## Contact

- **Primary Contact**: Dr. Gerardo Tinoco-Guerrero (research coordination), Morelia, Michoacán, Mexico
- **Technical Support**: bug reports, questions, and collaboration requests
- **Collaboration Opportunities**: research and engineering partnerships
- **Student Opportunities**: projects and training in data science and geospatial analytics
- **Institutional Affiliations**
## FAQ

**Do I need Jupyter?**

No. You can run the full pipeline with `python Correlations.py`. Jupyter is only needed for the notebook.

**Why are the example inputs packaged as `.tar.gz` files?**

To keep example datasets compact and easy to distribute. Extract them into `Information/` before running the script.

**How do I add a new cultural hotspot?**

Add a row to `Information/cultural_places.csv` with `place_name`, `latitude`, and `longitude`. The pipeline will generate the matching distance column (`d_t_...`) automatically.










