Python-project

This is a Python project for the course Data Processing in Python (JEM207), created by Tomáš Barhoň and Radim Plško.
We study the impact of socio-economic indicators that might be driving forces behind increased economic criminality in different regions of the Czech Republic.
The main data source is the https://kriminalita.policie.cz/ API, which contains information about all sorts of crimes and their geographical locations.
For the socio-economic data we used data from PAQ Research. The four indicators chosen were "Lidé v exekuci (2021) [%]" (share of people subject to foreclosure), "Podíl lidí bez středního vzdělání (2021) [%]" (share of people without completed secondary education), "Domácnosti čerpající přídavek na živobytí (2020) [%]" (share of households receiving subsistence benefits), and "Propadání (průměr 2015–2021) [%]" (share of pupils who receive the failing grade 5 in any subject at the end of the school year). These data are compiled from various open-data sources listed in the references.
The data on criminality is subset to fulfil the following conditions: the act is illegal and confirmed to have actually happened; the record's relevance is either "Místo následku" (the place where the consequences appeared) or "Místo spáchání" (the exact place where it was committed); and the crime is of an economic nature (thefts, burglary, ...). We analyse data from 2021-2023 (currently through June 2023), which yields about 500,000 criminal records.
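
The filtering described above can be sketched in pandas. The column names and values below are hypothetical stand-ins; the real API export uses its own Czech field names and codes:

```python
import pandas as pd

# Toy records standing in for the API export; the real export uses
# its own field names and coded values.
crimes = pd.DataFrame({
    "relevance": ["Místo spáchání", "Místo následku", "Bydliště pachatele"],
    "confirmed": [True, True, False],
    "crime_type": ["theft", "burglary", "vandalism"],
})

# Keep only confirmed, economically motivated crimes whose location
# is where the crime was committed or where its consequences appeared.
economic_types = {"theft", "burglary", "fraud"}
mask = (
    crimes["relevance"].isin(["Místo spáchání", "Místo následku"])
    & crimes["confirmed"]
    & crimes["crime_type"].isin(economic_types)
)
subset = crimes[mask]
```
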
The main purpose of the project is to analyse the data at the level of the "ORP" (obec s rozšířenou působností), a relatively small administrative unit in the Czech Republic. We take two different approaches: one is visualizing the state of each factor in the regions and visually comparing the maps; the other is a correlational analysis between criminality and each of the indicators.
We analysed how the individual indicators influence the level of economic criminality. Building on that analysis, we created an index of criminality that indicates where each ORP stands on the scale of criminal activity compared to other regions.
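
A criminality index of this kind might look like the following sketch. The formula (crimes per 1,000 inhabitants, standardized to z-scores) and the column names are illustrative, not the project's exact method:

```python
import pandas as pd

# Illustrative data; the real table comes from the DataPipeline.
table = pd.DataFrame({
    "orp": ["Praha", "Brno", "Olomouc"],
    "crime_count": [900, 400, 100],
    "population": [1_300_000, 380_000, 100_000],
})

# Crimes per 1,000 inhabitants, standardized to z-scores so that
# ORPs can be compared on a single scale.
rate = table["crime_count"] / table["population"] * 1000
table["crime_index"] = (rate - rate.mean()) / rate.std()
```
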
References:
https://www.datapaq.cz/
PAQ data endpoints:
Domácnosti čerpající přídavek na živobytí (2020) po ORP -> Agentura pro sociální začleňování, MPSV
Podíl lidí bez středního vzdělání (2021) -> ČSÚ, SLDB 2021
Propadání (2015-2021) -> ČŠI
Lidé v exekuci (2021) -> Exekutorská komora ČR, ČSÚ, Czech Household Panel Study
https://kriminalita.policie.cz/
https://www.czso.cz/csu/xs/obyvatelstvo-xs Czech Statistical Office - population data for each ORP. Because the source was a rather messy Excel file, we transformed it manually; the resulting table is available in our repository (app/počet_obyvatel_ORP.xlsx) and can be reused in other projects of a similar nature, which makes the project easier to share with others.

Project report


If you want to learn about our project and our findings, presented in the form of various visualizations, make sure to read through the project report, which can be found at:
Python-project/Project_Report/project_report.pdf

How to install the project


Getting Started

To get started with this project, you'll need to clone the repository to your local machine. Follow the steps below:

Prerequisites

Before you begin, make sure you have Git installed on your machine. If you don't have Git installed, you can download it from the official website: Git Downloads.

Cloning the Repository

  1. Open your terminal or command prompt.

  2. Change the current working directory to the location where you want to clone the repository. For example, to clone into a directory called your_repository, use the following command:

cd your_repository

Now you're ready to clone the repository with the following command:

git clone https://github.com/Tomas-Barhon/Python-project.git

Press Enter, and Git will clone the repository to your local machine.

Accessing the Repository

After cloning the repository, you can access its contents by navigating to the repository's directory using the cd command. For example:

cd Python-project

Creating a virtual environment

We recommend creating a virtual environment for our project with either Conda or venv. This is only a recommendation; you can also perform all of the following steps in your global (system-wide) Python environment.
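
As a minimal sketch, an environment could be created like this (assuming a Unix-like shell; on Windows the venv activation script is .venv\Scripts\activate, and the environment names are arbitrary):

```shell
# venv (built into Python 3):
python3 -m venv .venv
. .venv/bin/activate

# or Conda (assuming Conda is installed):
# conda create -n python-project python=3.10
# conda activate python-project
```
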

Installing libraries

To install all the necessary libraries, either activate your virtual environment or proceed in the global one.
You can install the libraries either manually or by navigating to the repository of the project and using its requirements.txt file, by typing this line into your command prompt:

pip install -r requirements.txt

Now you should be ready to run the whole project without any trouble.

How to use the modules separately in your own project


Downloader (check "how_to_downloader.ipynb")

Downloader is a class for downloading data from the https://kriminalita.policie.cz/ API. It is designed to download data either for one specific month or for multiple years and return them to the user as a pandas.DataFrame.
To use the Downloader module, first import it with the following command:

from data_API_downloader import Downloader

Now, to obtain the criminal records for one specific month, run the following lines of code:

download = Downloader(year=2012, month=5)
# send the GET request to obtain the zip file
download.get_request()
# unzip the downloaded file and get the .csv file for the month
download.unzip_files_return_dataframe()

You can freely adjust the year and month; both must be integers. The data is available from 2012 onward, so the year must be greater than or equal to 2012 and the month must be in the range 1-12, otherwise an error will be raised.
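
The validation rule described above can be sketched as a standalone check. This is not the module's actual code, just the constraint it enforces:

```python
def validate_period(year: int, month: int) -> None:
    """Check that a (year, month) pair can have crime data."""
    if not (isinstance(year, int) and isinstance(month, int)):
        raise TypeError("year and month must be integers")
    if year < 2012:
        # the API only publishes data from 2012 onward
        raise ValueError("data is only available from 2012 onward")
    if not 1 <= month <= 12:
        raise ValueError("month must be in the range 1-12")
```
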

To use the Downloader to download data for multiple years, run the following line of code:

# How to use Downloader to download data for multiple years
crime_data = download.get_multiple_years(years=[2012, 2013, 2014])

You can adjust the years you want to download data for. Note that years must again be a list of integers of 2012 or higher. You must specify at least one year with available data, otherwise an error will be raised; for example, specifying years = [5000, 5001] will result in a ValueError.

DataPipeline (check "how_to_data_pipeline.ipynb")

DataPipeline is a class created to process the data produced by the Downloader and build a final table in which each ORP has all six parameters, ready for further analysis of the user's choice. It can thus be freely reused in any other project requiring such data at the ORP level.
First, we import the two modules with the following command:

from data_API_downloader import Downloader, DataPipeline

Now we initialize the Downloader object (the year and month arguments are irrelevant when using the get_multiple_years method) and run the already mentioned get_multiple_years method as follows:

download = Downloader(year=2012, month=5)
crime_data = download.get_multiple_years(years=[2012, 2013, 2014])

Then the user initializes the DataPipeline object, passing the crime_data created by the Downloader and specifying create_data = True, meaning that the newly provided data should be used. Setting it to False would instead use our 2021-2023 data, which are part of the repository; this is described in the later sections.

pipeline = DataPipeline(crime_data=crime_data, create_data=True)

To obtain the desired table, run the following four methods in the correct order and inspect the resulting table:

pipeline.match_crime_data_to_polygons()
pipeline.compute_counts_per_polygon()
pipeline.preprocess_paq_data()
table = pipeline.merge_final_table()
table.head(10)

Make sure to use several years in order to obtain enough observations, so that there is at least one observation for each ORP.
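
A quick way to verify this condition might look like the following sketch; the column name and the ORP list are illustrative:

```python
import pandas as pd

# Toy check: every ORP should appear at least once in the matched data.
all_orps = {"Praha", "Brno", "Olomouc"}
matched = pd.DataFrame({"orp": ["Praha", "Brno", "Praha"]})

missing = all_orps - set(matched["orp"])
if missing:
    print(f"ORPs without observations: {sorted(missing)}")
```
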

VisualizerOfCriminalData (check "how_to_visualizer.ipynb")

VisualizerOfCriminalData is a class that visualizes the data from the final table created by the DataPipeline. It is designed to create the three types of visualization we found useful in our geographical analysis: Folium choropleth maps for all the parameters used in our analysis, scatter plots with regression lines for the response variable against all independent variables, and a correlation heatmap comparing the level of correlation among our variables.


In order to use the visualizer, the user has to pass the data table created by the DataPipeline. For our purposes we will not download and create new data; instead we use the create_data = False attribute to demonstrate the usage of the class. Make sure you have the data files that were part of the GitHub repository, then proceed with the following commands:
from data_API_downloader import DataPipeline
from visualizer import VisualizerOfCriminalData
pipeline = DataPipeline(crime_data=None, create_data=False)
pipeline.match_crime_data_to_polygons()
pipeline.compute_counts_per_polygon()
pipeline.preprocess_paq_data()
table = pipeline.merge_final_table()

Then we initialize the visualizer and obtain the Folium maps as a list of six objects.

# Now we initialize the visualizer with the data created by the pipeline
visualizer = VisualizerOfCriminalData(table)
maps = visualizer.get_folium_maps()
labels = visualizer.english_legend_name_buffer

To show these maps with their labels in Jupyter, the user can simply run the following code, exchanging the number 0 for any number up to 5.

print(labels[0])
maps[0]

Finally, the user can also call the last two methods, which show different visualizations of the correlations between the variables.

visualizer.show_scatter_correlations()
visualizer.show_correlation_heatmap()

User guide for the project (check "main.ipynb" or just open it in nbviewer below)

To run the same code as we did, follow the notebook main.ipynb. It is very similar to how_to_visualizer.ipynb, but it shows all the visualizations and all of our outputs. If you followed the previous instructions, the code should be easy to follow. One thing to note is that we run this code in the create_data = False regime: we previously saved the data downloaded by the Downloader for 2021-2023 into a CSV file, which is now part of the repository as data_in_polygons.csv. The main reason is that matching circa 500,000 records to their polygons takes quite a lot of time, so we saved the result for the years we aimed to analyse. You can of course always download the data with the Downloader and pass them into the DataPipeline as explained in the previous tutorials, but we made this choice for people who just want to run the project exactly as we did.
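
The caching described above follows a common pattern. Here is a minimal sketch with a stand-in for the slow matching step; only the file name data_in_polygons.csv is taken from the repository:

```python
import pandas as pd
from pathlib import Path

def slow_matching() -> pd.DataFrame:
    # stand-in for matching ~500,000 records to ORP polygons
    return pd.DataFrame({"orp": ["Praha", "Brno"], "crime_count": [2, 1]})

cache = Path("data_in_polygons.csv")
if cache.exists():
    # subsequent runs reload the saved result instantly
    matched = pd.read_csv(cache)
else:
    # first run pays the cost once and saves the result
    matched = slow_matching()
    matched.to_csv(cache, index=False)
```
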

How to open the notebook with the Folium maps displayed


To open the notebook in your web browser and see all the visualizations, open it in nbviewer.
You can find it at the following link (to open other notebooks, just change the name at the end):
https://nbviewer.org/github/Tomas-Barhon/Python-project/blob/main/app/main.ipynb

Disclaimer & final notes


Although we tried to make our code robust to user input, there are many places where it might fail for other reasons. That is why we provided such a detailed tutorial on how the code works and how it should be used. We do not cover everything, from installing Python to creating a virtual environment and registering it as a Jupyter kernel; we assume the user has some knowledge of these technologies. We welcome any feedback and pull requests to the repository. Our goal was not merely to complete our own project but also to provide modules that can be useful to others doing a similar type of analysis. We would like to thank all of the data sources we used. We had to share some of the data in transformed form as part of our repository for the reasons already mentioned (processing time, pinning the exact data used in our project, and the fact that not every source has a public API). Finally, note that we used pytest for our testing, but users following the guidance above can freely ignore the tests.

About

Study of the main determinants of economic crimes in the Czech Republic
