Code
from pathlib import Path
import numpy as np
import pandas as pd
import openpyxl
import warnings
'ignore', category=UserWarning)
warnings.filterwarnings('ignore', category=FutureWarning) warnings.filterwarnings(
The raw data sets can all be found in the directory data/01_raw
. Each data set is then located in its own folder, which includes a spreadsheet (.xlsx
, .xls
, .csv
or .ods
), as well as the URL to the source for downloading the data.
Number | Dataset Name | Source | Folder in data/01_raw |
---|---|---|---|
R01 | Swiss Real Estate Offer Index | https://www.iazicifi.ch/ |
01_Swiss Real Estate Offer Index |
R02 | Distribution of population by tenure status | https://ec.europa.eu/ |
02_Distribution of population by tenure status |
R03 | Average sales price of new and existing residential property | https://www.statista.com/ |
03_Average price per square meter in european countries |
R04 | General overview “Buildings” by cantons 2021 | https://www.bfs.admin.ch/ |
04_Allgemeine Übersicht Gebäude nach Kantonen 2021 |
R05 | Structure of the permanent resident population by canton, 1999-2021 | https://www.bfs.admin.ch/ |
05_Struktur der ständigen Wohnbevölkerung nach Kanton, 1999-2021 |
R06 | Swiss Construction Price Index - Average unit prices in Switzerland and the major regions | https://www.bfs.admin.ch/ |
06_Schweizerischer Baupreisindex - Durchschnittliche Einheitspreise in der Schweiz und in den Grossregionen |
R07 | Taxes & Public Finances | https://www.statista.com/ |
07_Steuern_&_Staatsfinanzen |
from pathlib import Path
import numpy as np
import pandas as pd
import openpyxl
import warnings
'ignore', category=UserWarning)
warnings.filterwarnings('ignore', category=FutureWarning) warnings.filterwarnings(
= Path('../data')
data_path = Path(data_path, '01_raw')
raw_data = Path(data_path, '02_processed')
prc_data = Path(data_path, '03_exports') exp_data
The source of this dataset consists of an excel file with multiple spreadsheets. Collectivly, it shows the monthly average price of real estate in selected regions of switzerland.
Miet_Indizes
: This sheet contains the monthly averaged offer prices for rental appartments in selected regions of switherland.
t
: date of the index in format dd.mm.yyyy
Eigenthumswohnungen
: prices for condominiums in switzerlandEinfamilienhäuser
: prices for houses in switzerlandPreis_Indizes
: This sheet contains the monthly averaged price offers for condominiumns and houses in switzerland.
t
: date of the index in format dd.mm.yyyy
Eigenthumswohnungen
: prices for condominiums in switzerlandEinfamilienhäuser
: prices for houses in switzerlandThis dataset was created by the company IAZI AG - CIFI SA and ImmoScout24. The Swiss Real Estate Offer Index is a real estate index family. They are the world’s first hedonic indices that are calculated and updated in real time. Thanks to the continuous updating and their methodology, they allow the timely monitoring of the development of offer prices and offer rents on the largest real estate platform in Switzerland.
In the following steps, the data will be extracted from 2 sheets and in the end combined to create the raw dataset R01.
The data is already preprocessed which means, handling NaN
values won’t be necessairy. However since two sheet will be merged, a cut-off from a specific year is nescessairy since the two subdatasets do not have the same time range.
= Path(
rDataSet_10
raw_data, '01_Swiss Real Estate Offer Index',
'Swiss Real Estate Offer Index.xlsx'
)
As before mentioned a new dataset will be created through extracting the specific sheets from the main dataset and combining them into one dataset.
Data extraction and reshaping Miet_Indizes
= pd.read_excel(
dataframe_10
rDataSet_10, ='Miet_Indizes',
sheet_name=6,
skiprows=3,
skipfooter
)
# drop unused columns
= dataframe_10.drop(
dataframe_10 2:],
dataframe_10.columns[=1
axis
)
# rename columns
= dataframe_10.rename(
dataframe_10 ={
columns't': 'date',
'CH': 'rental_index'
}
)
# format column date to datetime
'date'] = pd.to_datetime(
dataframe_10['date'],
dataframe_10[=True
dayfirst
).dt.date
# reset index & sort by date
=True, inplace=True)
dataframe_10.reset_index(drop='date', inplace=True)
dataframe_10.sort_values(by
dataframe_10.head()
date | rental_index | |
---|---|---|
0 | 2014-12-31 | 260.82 |
1 | 2015-01-31 | 259.36 |
2 | 2015-02-28 | 258.83 |
3 | 2015-03-31 | 258.10 |
4 | 2015-04-30 | 261.57 |
Data extraction and reshaping Preis_Indizes
= pd.read_excel(
dataframe_11
rDataSet_10, ='Preis_Indizes',
sheet_name=6,
skiprows=3,
skipfooter
)
# rename columns
= dataframe_11.rename(
dataframe_11 ={
columns't': 'date',
'Eigentumswohnungen': 'condos',
'Einfamilienhäuser': 'houses'
}
)
# format column date to datetime
'date'] = pd.to_datetime(
dataframe_11['date'],
dataframe_11[=True
dayfirst
).dt.date
# reset index & sort by date
=True, inplace=True)
dataframe_11.reset_index(drop='date', inplace=True)
dataframe_11.sort_values(by
dataframe_11.head()
date | condos | houses | |
---|---|---|---|
0 | 2010-12-31 | 5822.53 | 5458.33 |
1 | 2011-01-31 | 5852.88 | 5502.88 |
2 | 2011-02-28 | 5846.56 | 5509.60 |
3 | 2011-03-31 | 5964.04 | 5597.43 |
4 | 2011-04-30 | 5943.01 | 5594.06 |
= pd.merge(
dataframe_10
dataframe_10,
dataframe_11,='date'
on
)
dataframe_10.head()
date | rental_index | condos | houses | |
---|---|---|---|---|
0 | 2014-12-31 | 260.82 | 6898.10 | 5990.61 |
1 | 2015-01-31 | 259.36 | 6916.64 | 5964.14 |
2 | 2015-02-28 | 258.83 | 6894.75 | 5974.35 |
3 | 2015-03-31 | 258.10 | 6879.93 | 5988.68 |
4 | 2015-04-30 | 261.57 | 6901.96 | 5987.68 |
Collectivly, this presents the distribution of population by tenure status (tenure, owner) in selected european countries.
Sheet: Summary
: This sheet gives additional informationa bout the source and the contents of the dataset
Sheet: Structure
: The summary sheet gives additionaly information about the structure of the data.
Sheet: Sheet 1
: This sheet shows the distribution of population by tenure status, type of household and income group with the tenure type owner
2003 - 2022
: year of the dataSheet: Sheet 2
: This sheet shows the distribution of population by tenure status, type of household and income group with the tenure type tenure
Note: The relevant sheet for this task is
Sheet 1
Although there are some NaN entries in the dataset, they wont be replaced or looked at differently. Instead, the visualisation will only cover reference a specific column where there are the least NaN values. Aditionally there are some countries whose names are abnormal. These will be renamed to match the rest of the data.
= Path(
rDataSet_20
raw_data, '02_Distribution of population by tenure status',
'ilc_lvho02__custom_6240818_spreadsheet.xlsx'
)
= pd.read_excel(
dataframe_20
rDataSet_20, ='Sheet 1',
sheet_name=11,
skiprows=3,
skipfooter
)
# keep only relevant columns
= dataframe_20.iloc[:, [0, 16]]
dataframe_20
# rename columns
= dataframe_20.rename(
dataframe_20 ={
columns'GEO (Labels)': 'country',
'Unnamed: 16': '2018'
}
)
# rename row with Germany
= dataframe_20.replace(
dataframe_20 'Germany (until 1990 former territory of the FRG)',
'Germany'
)
# reset index
=True, inplace=True)
dataframe_20.reset_index(drop
dataframe_20.head()
country | 2018 | |
---|---|---|
0 | Belgium | 72.3 |
1 | Bulgaria | 83.6 |
2 | Czechia | 78.7 |
3 | Denmark | 60.5 |
4 | Germany | 51.5 |
This dataset contains the monthly averaged price per square meter for owned real estate in selected european countries.
Sheet: Overview
: This sheet containes the additionaly information about the data and its source
Sheet: Data
: This sheet contains the monthly averaged price offers for condominiumns and houses in switzerland
Country
: respective countryNew Housing
: average sales price for new built housesExisting Housing
: average sales price for existing housesThe data for this dataset was originally published by the company Deloitte in August 2022 in a pdf File. Afterwards the relevant data has been copied into a workable Excel file and uploaded to Statista. In the following steps an extraction of the sheet ‘data’ will be made. Since only one price is needed in the end, the average of the price for existing houses and new houses gets calculated. This also removes the problematic of the NaN Values because in that case only the existing value is kept.
= Path(
rDataSet_30
raw_data, '03_Average price per square meter in european countries',
'statistic_id722905_average-residential-real-estate-square-meter-prices-in-europe-2021-by-country.xlsx'
)
= pd.read_excel(
dataframe_30
rDataSet_30, ='Data',
sheet_name=4,
skiprows
)
# drop unused columns
dataframe_30.drop('Unnamed: 0'],
[=1,
axis=True
inplace
)
# rename columns
= dataframe_30.rename(
dataframe_30 ={
columns'Unnamed: 1': 'country',
'New housing': 'new_housing',
'Existing housing': 'existing_housing'
}
)
# calculate average price
'avg_price'] = np.nanmean(
dataframe_30['new_housing', 'existing_housing']],
dataframe_30[[=1
axis
)
# drop unused columns
dataframe_30.drop('new_housing', 'existing_housing'],
[=1,
axis=True
inplace
)
dataframe_30.head()
country | avg_price | |
---|---|---|
0 | UK | 4235.0 |
1 | Austria | 4205.0 |
2 | France | 4012.0 |
3 | Netherlands | 3599.5 |
4 | Norway | 4286.0 |
This dataset contains detailed information on the types of habitable buildings in switzerland, categorized by swiss cantons. Each sheet in this dataset represents a canton of switzerland, as well as a sheet for the entire country.
Total
: contains different types of housings and buildingsBauperiode
: contains ranges of years when the houses were builtGrossanzahl
: number of storeys of the buildingsEnergiequelle der Heizung
: energy source for heatingEnergiequelle für die Warmwasseraufbereitung
: energy source for warm water treatmentEigentümertyp
: type of ownershipTotal
: cumulative number of habitable buildingsBauperiode
: building periodsMit ... Wohnungen
: number of appartments within a buildingThe files provided by the BFS (Bundesamt für Statistik, Federal Statistics Office) are already cleaned datasets. Therefore, handling of NaN values will mostly not be necessary. The current dataset exhibits some cells without values, but this is due to yearly numbers being present in the rows and the columns. This leads to values along the diagonal and NaN values in the triangles above and below the diagonal. The main objective of processing this dataset is to extract the necessary columns and rows for later analysis and visualizations.
= Path(
rDataSet_40
raw_data, '04_Allgemeine Übersicht Gebäude nach Kantonen 2021',
'je-d-09.02.00-2021.xlsx'
)
= openpyxl.load_workbook(rDataSet_40, read_only=True)
wb_40 = wb_40.sheetnames
sheets_40
def buildings_extraction(path: str, sheet: str, df_out: pd.DataFrame) -> None:
"""
Extracts the total number of buildings and single family homes for
a given sheet (canton)
:param path: path to the excel file
:param sheet: sheet name
:param df_out: dataframe to store the extracted data
:return: None
"""
= pd.read_excel(io=path, sheet_name=sheet, skiprows=4)
df_temp = sheet[-2:]
canton = int(df_temp.iloc[0, 1])
total = int(df_temp.iloc[2, 1])
single_house
= [total, single_house]
df_out.loc[canton]
= pd.DataFrame(
dataframe_40 =['buildings_total', 'single_family_home']
columns
)
for sheet in sheets_40:
buildings_extraction(=rDataSet_40,
path=sheet,
sheet=dataframe_40
df_out
)
= dataframe_40.iloc[1:].copy()
dataframe_40 = 'canton'
dataframe_40.index.name
= dataframe_40.reset_index()
dataframe_40 = dataframe_40.sort_values(by='canton')
dataframe_40
dataframe_40.head()
canton | buildings_total | single_family_home | |
---|---|---|---|
18 | AG | 153894 | 102206 |
15 | AI | 5299 | 2917 |
14 | AR | 16323 | 9251 |
1 | BE | 238111 | 114053 |
12 | BL | 67390 | 46632 |
This dataset contains information about the population size in each canton of switzerland.
Total
: aggregates the data from all other columnsAs previously mentioned, the datasets from the BFS are known for their high data quality, requiring primarily the extraction of relevant information.
= Path(
rDataSet_50
raw_data,'05_Struktur der ständigen Wohnbevölkerung nach Kanton, 1999-2021',
'je-d-01.02.03.04.xlsx'
)
= pd.read_excel(
dataframe_50 =rDataSet_50,
io='2021',
sheet_name=4,
skiprows
)
# selecting and renaming relevant columns
= dataframe_50[['Unnamed: 0', 'Unnamed: 1']].copy()
dataframe_50
dataframe_50.rename(=True,
inplace={
columns'Unnamed: 0': 'canton',
'Unnamed: 1': 'inhabitants'
},
)
# select canton rows (present in dataframe_40)
= dataframe_50[dataframe_50['canton'].isin(dataframe_40['canton'])]
dataframe_50
dataframe_50.head()
canton | inhabitants | |
---|---|---|
2 | VD | 822968.0 |
3 | VS | 353209.0 |
4 | GE | 509448.0 |
6 | BE | 1047473.0 |
7 | FR | 329809.0 |
This dataset contains detailed information about pricing in the building industry across different regions of Switzerland. The columns represent the different regions, while the rows contain various segments of the building process, such as attic, framework, flooring, and more.
As mentioned previously, the datasets from BFS are very clean. Therefore, this part mainly involves extracting the relevant data. Since this dataset contains macros, a copy was created with the suffix ’_raw’ for convenience. Some rows consist of NaN (Not a Number) values, which is a result of the dataset structure. Additionally, some values are missing for the region of Ticino. This information is provided at the end of the Excel sheet as a comment: “Drei Punkte (…) bedeuten, dass der Wert nicht vorhanden, nicht genügend repräsentativ oder unter Datenschutz ist.” (EN: Three dots (…) mean that the value is not present, not sufficiently representative, or under data protection.)
= Path(
rDataSet_60
raw_data, '06_Schweizerischer Baupreisindex - Durchschnittliche Einheitspreise in der Schweiz und in den Grossregionen',
'cc-t-05.05.02.xlsx'
)
= pd.read_excel(
dataframe_60 =rDataSet_60,
io='BAP_PCO',
sheet_name=9,
skiprows
)
# drop unused columns
= dataframe_60.drop(
dataframe_60 =[
columns'Unnamed: 0',
'Unnamed: 1',
'Unnamed: 2',
'Einheit',
'Menge',
'Unnamed: 4',
'Unnamed: 15',
'Unnamed: 16'
]
)
# rename columns
= dataframe_60.rename(
dataframe_60 ={
columns'NPK Position': 'Category',
'Schweiz': 'Switzerland',
'Genferseeregion (VD,VS,GE)': 'Lake Geneva Region',
'Espace Mittelland (BE, FR, SO, NE, JU)': 'Espace Midland',
'Nordwestschweiz (BS, BL, AG)': 'Northwestern Switzerland',
'Zürich (ZH)': 'Zurich',
'Ostschweiz (GL, SH, AR, AI, SG, GR, TG)': 'Eastern Switzerland',
'Zentralschweiz (LU, UR, SZ, OW, NW, ZG)': 'Central Switzerland',
'Tessin (TI)': 'Ticino'
}
)
def assignCategory(data: pd.DataFrame) -> pd.DataFrame:
= []
categories for index, row in data.iterrows():
if type(row['Category']) != float:
= ' '.join(row['Category'].split(' ')[1:])
category
categories.append(category)
= 'none'
current_category for i in range(len(categories)):
if categories[i] != '':
= categories[i]
current_category = current_category
categories[i]
'Category'] = categories
data[return data
= assignCategory(dataframe_60)
dataframe_61 dataframe_61.head()
Category | Switzerland | Lake Geneva Region | Espace Midland | Northwestern Switzerland | Zurich | Eastern Switzerland | Central Switzerland | Ticino | |
---|---|---|---|---|---|---|---|---|---|
0 | Erdarbeiten | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | Erdarbeiten | 6.5806 | 9.649058 | 5.1200 | 5.0588 | 5.8022 | 5.3222 | 6.5600 | 8.9343 |
2 | Erdarbeiten | 6.3088 | 8.066795 | 5.2969 | 5.5250 | 6.2000 | 5.7944 | 5.3686 | 8.2543 |
3 | Erdarbeiten | 52.3058 | 55.199414 | 47.0450 | 46.8138 | 52.0889 | 57.5089 | 58.3843 | 47.9086 |
4 | Erdarbeiten | 12.4308 | 17.165547 | 11.8777 | 10.9088 | 11.7300 | 10.3056 | 9.4986 | 11.9183 |
NaN
values= dataframe_61.replace('...', np.NaN)
dataframe_61 = dataframe_61.dropna()
dataframe_61
print(dataframe_61.isna().sum())
print(dataframe_61.dtypes)
Category 0
Switzerland 0
Lake Geneva Region 0
Espace Midland 0
Northwestern Switzerland 0
Zurich 0
Eastern Switzerland 0
Central Switzerland 0
Ticino 0
dtype: int64
Category object
Switzerland float64
Lake Geneva Region float64
Espace Midland float64
Northwestern Switzerland float64
Zurich float64
Eastern Switzerland float64
Central Switzerland float64
Ticino float64
dtype: object
= dataframe_61.groupby('Category').sum()
dataframe_62
def normalize_rows(data: pd.DataFrame) -> pd.DataFrame:
= list(data.columns)
regions = data.index
categories = pd.DataFrame(columns = regions)
data2
for index, row in data.iterrows():
= [float(x) for x in row]
row_list_floats = float(row['Switzerland'])
ch = [x * 100 / ch - 100 for x in row_list_floats]
normalized_row len(data2.index)] = normalized_row
data2.loc['Category'] = categories
data2[= data2.set_index('Category')
data2 return data2
= normalize_rows(dataframe_62)
dataframe_63 = dataframe_63.drop(columns=['Switzerland'])
dataframe_63
dataframe_63.head()
Lake Geneva Region | Espace Midland | Northwestern Switzerland | Zurich | Eastern Switzerland | Central Switzerland | Ticino | |
---|---|---|---|---|---|---|---|
Category | |||||||
Allgemeine Schreinerarbeiten | 9.782592 | -5.054522 | -5.332594 | -1.278220 | -3.929714 | -2.317067 | 10.482749 |
Aufzüge | -1.834763 | 5.337382 | -1.834763 | -1.834763 | -1.834763 | 1.768710 | -0.301372 |
Baureinigung | 12.377640 | 10.988312 | -0.829673 | -8.518666 | -17.082390 | -12.249670 | 6.612687 |
Bodenbeläge | 6.990752 | -4.967931 | -3.518762 | 2.796378 | -9.766969 | 1.671983 | 12.403766 |
Dichtungsbeläge | 6.729232 | -5.094807 | 1.559295 | -6.350533 | 2.841165 | -5.195145 | 10.569095 |
Sorting columns by the most positive values (above swiss average)
= dataframe_63.T
dataframe_63_T 'POS_COUNT'] = dataframe_63_T.select_dtypes(include='float64').gt(0).sum(axis=1)
dataframe_63_T[
= dataframe_63_T.sort_values(by='POS_COUNT', ascending=False)
dataframe_64 = dataframe_64.drop(columns=['POS_COUNT'])
dataframe_64
dataframe_64.head()
Category | Allgemeine Schreinerarbeiten | Aufzüge | Baureinigung | Bodenbeläge | Dichtungsbeläge | Erdarbeiten | Estriche | Fenster | Gerüste | Gipserarbeiten | Gärtnerarbeiten | Kücheneinrichtungen | Malerarbeiten | Maurer- und Stahlbetonarbeiten | Metallbauarbeiten | Plattenarbeiten | Spenglerarbeiten | Tiefbauarbeiten | Zimmerarbeiten |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Lake Geneva Region | 9.782592 | -1.834763 | 12.377640 | 6.990752 | 6.729232 | 18.853844 | 9.125194 | -2.855986 | 30.495326 | -2.540170 | 17.343409 | 5.541763 | 26.377162 | -0.587649 | 1.882927 | -11.743498 | 34.513863 | -1.217680 | 20.001063 |
Ticino | 10.482749 | -0.301372 | 6.612687 | 12.403766 | 10.569095 | 4.854349 | 14.092483 | 0.948539 | -13.272724 | -10.468717 | 17.892175 | -11.455716 | 12.276155 | -8.746794 | -20.442963 | -4.932700 | -20.601170 | -11.603154 | 1.526116 |
Espace Midland | -5.054522 | 5.337382 | 10.988312 | -4.967931 | -5.094807 | -12.021344 | 1.678070 | -1.247265 | -5.394688 | -9.314736 | -3.830083 | 6.225370 | -0.782745 | 9.283176 | 1.162569 | 6.901373 | -3.566494 | -3.394399 | -0.242437 |
Central Switzerland | -2.317067 | 1.768710 | -12.249670 | 1.671983 | -5.195145 | 1.162903 | -5.680765 | -0.498641 | -7.814889 | 1.363496 | -18.516207 | 6.864781 | -3.704139 | -8.262848 | 8.587475 | 7.402684 | -23.364883 | -2.173099 | -13.500970 |
Northwestern Switzerland | -5.332594 | -1.834763 | -0.829673 | -3.518762 | 1.559295 | -11.036378 | -6.413027 | 3.134207 | -16.413265 | 4.357343 | -2.070326 | -5.279719 | -21.140344 | 2.573407 | -4.320370 | 8.691597 | -0.342758 | 12.987236 | -13.228093 |
= {
translation_dict 'Allgemeine Schreinerarbeiten': 'General carpentry work',
'Aufzüge': 'Elevators',
'Baureinigung': 'Construction cleaning',
'Bodenbeläge': 'Floor coverings',
'Dichtungsbeläge': 'Sealing coatings',
'Erdarbeiten': 'Earthworks',
'Estriche ': 'Attic',
'Fenster': 'Windows',
'Gerüste': 'Scaffolding',
'Gipserarbeiten': 'Plastering work',
'Gärtnerarbeiten': 'Gardening work',
'Kücheneinrichtungen': 'Kitchen fittings',
'Malerarbeiten': 'Painting work',
'Maurer- und Stahlbetonarbeiten': 'Masonry and reinforced concrete work',
'Metallbauarbeiten': 'Metal construction work',
'Plattenarbeiten': 'Tiling work',
'Spenglerarbeiten': 'Sheet metal work',
'Tiefbauarbeiten': 'Civil engineering works',
'Zimmerarbeiten': 'Carpentry work'
}
= dataframe_64.rename(columns=translation_dict)
dataframe_64
'region'] = dataframe_64.index
dataframe_64[=True, inplace=True)
dataframe_64.reset_index(drop dataframe_64.head()
Category | General carpentry work | Elevators | Construction cleaning | Floor coverings | Sealing coatings | Earthworks | Attic | Windows | Scaffolding | Plastering work | Gardening work | Kitchen fittings | Painting work | Masonry and reinforced concrete work | Metal construction work | Tiling work | Sheet metal work | Civil engineering works | Carpentry work | region |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 9.782592 | -1.834763 | 12.377640 | 6.990752 | 6.729232 | 18.853844 | 9.125194 | -2.855986 | 30.495326 | -2.540170 | 17.343409 | 5.541763 | 26.377162 | -0.587649 | 1.882927 | -11.743498 | 34.513863 | -1.217680 | 20.001063 | Lake Geneva Region |
1 | 10.482749 | -0.301372 | 6.612687 | 12.403766 | 10.569095 | 4.854349 | 14.092483 | 0.948539 | -13.272724 | -10.468717 | 17.892175 | -11.455716 | 12.276155 | -8.746794 | -20.442963 | -4.932700 | -20.601170 | -11.603154 | 1.526116 | Ticino |
2 | -5.054522 | 5.337382 | 10.988312 | -4.967931 | -5.094807 | -12.021344 | 1.678070 | -1.247265 | -5.394688 | -9.314736 | -3.830083 | 6.225370 | -0.782745 | 9.283176 | 1.162569 | 6.901373 | -3.566494 | -3.394399 | -0.242437 | Espace Midland |
3 | -2.317067 | 1.768710 | -12.249670 | 1.671983 | -5.195145 | 1.162903 | -5.680765 | -0.498641 | -7.814889 | 1.363496 | -18.516207 | 6.864781 | -3.704139 | -8.262848 | 8.587475 | 7.402684 | -23.364883 | -2.173099 | -13.500970 | Central Switzerland |
4 | -5.332594 | -1.834763 | -0.829673 | -3.518762 | 1.559295 | -11.036378 | -6.413027 | 3.134207 | -16.413265 | 4.357343 | -2.070326 | -5.279719 | -21.140344 | 2.573407 | -4.320370 | 8.691597 | -0.342758 | 12.987236 | -13.228093 | Northwestern Switzerland |
The source of these comprehensive data is Statista, renowned for its quality and reliability. Researchers, analysts, and interested individuals can rely on this data to conduct informed studies and analyses.
The files provided by Statista are preprocessed datasets, typically requiring minimal treatment of missing values or NaNs. The current datasets are cleaned.
The primary objective of processing this dataset is to extract the necessary columns and rows for subsequent analysis and visualization.
Schweiz - Einkommenssteuersätze nach Kantonen 2022_Statista.csv
:
KANTON
: represents the canton names of SwitzerlandPROZENT
: represents the income tax rates in percentageSchweiz - Gewinnsteuersätze nach Kantonen 2022 _ Statista.csv
KANTON
: represents the canton names of SwitzerlandPROZENT
: represents the profit tax rates in percentageSchweiz - Vermögenssteuersätze nach Kantonen 2018 _ Statista.csv
KANTON
: represents the canton names of SwitzerlandPROZENT
: represents the wealth tax rates in promille= Path(
rDataSet_70
raw_data,'07_Steuern_&_Staatsfinanzen',
'Schweiz - Einkommenssteuersätze nach Kantonen 2022 _ Statista.csv'
)= Path(
rDataSet_71
raw_data,'07_Steuern_&_Staatsfinanzen',
'Schweiz - Gewinnsteuersätze nach Kantonen 2022 _ Statista.csv'
)= Path(
rDataSet_72
raw_data,'07_Steuern_&_Staatsfinanzen',
'Schweiz - Vermögenssteuersätze nach Kantonen 2018 _ Statista.csv'
)
= pd.read_csv(
dataframe_70
rDataSet_70,=';',
sep='latin1'
encoding
)
= pd.read_csv(
dataframe_71
rDataSet_71,=';',
sep='latin1'
encoding
)
= pd.read_csv(
dataframe_72
rDataSet_72,=';',
sep='latin1'
encoding
)
# join dataframes
= pd.merge(
dataframe_73
dataframe_70,
dataframe_71,='KANTON'
on
)
= pd.merge(
dataframe_73
dataframe_73,
dataframe_72,='KANTON'
on
)
# rename columns
= dataframe_73.rename(
dataframe_73 ={
columns'KANTON': 'canton',
'PROZENT_x': 'income_tax',
'PROZENT_y': 'profit_tax',
'Promille': 'wealth_tax'
}
)
dataframe_73.head()
canton | income_tax | profit_tax | wealth_tax | |
---|---|---|---|---|
0 | Genf | 44.75 | 14.00 | 10.1 |
1 | Basel-Landschaft | 42.17 | 17.97 | 7.6 |
2 | Waadt | 41.50 | 14.00 | 7.9 |
3 | Bern | 41.04 | 21.04 | 5.8 |
4 | Basel-Stadt | 40.34 | 13.04 | 8.0 |
The processed data sets can all be found in the directory data/02_processed
.
# save R01 to csv
dataframe_10.to_csv('P01_price_indices.csv'),
Path(prc_data, =False
index
)
# save R02 to csv
dataframe_20.to_csv('P02_ratio_homeowners_eu.csv'),
Path(prc_data, =False
index
)
# save R03 to csv
dataframe_30.to_csv('P03_avg_price_smeter.csv'),
Path(prc_data, =False
index
)
# save R04 to csv
dataframe_40.to_csv('P04_buildings_by_canton.csv'),
Path(prc_data, =False
index
)
# save R05 to csv
dataframe_50.to_csv('P05_population_by_canton.csv'),
Path(prc_data, =False
index
)
# save R06 to csv
dataframe_64.to_csv('P06_construction_prices.csv'),
Path(prc_data, =False
index
)
# save R07 to csv
dataframe_73.to_csv('P07_taxes.csv'),
Path(prc_data, =False
index )
Number | Dataset Name | Source |
---|---|---|
P01 | P01_price_indices.csv | R01 |
P02 | P02_ratio_homeowners_eu.csv | R02 |
P03 | P03_avg_price_smeter.csv | R03 |
P04 | P04_buildings_by_canton.csv | R04 |
P05 | P05_population_by_canton.csv | R05 |
P06 | P06_construction_prices.csv | R06 |
P07 | P07_taxes.csv | R07 |
= pd.read_csv(
dataframe_10 'P01_price_indices.csv')
Path(prc_data,
)
display(dataframe_10.describe()) dataframe_10.info()
rental_index | condos | houses | |
---|---|---|---|
count | 101.000000 | 101.000000 | 101.000000 |
mean | 261.415050 | 7392.592772 | 6478.379406 |
std | 3.221622 | 518.387040 | 495.529958 |
min | 256.620000 | 6856.950000 | 5940.580000 |
25% | 259.100000 | 7024.970000 | 6057.130000 |
50% | 260.820000 | 7157.740000 | 6315.930000 |
75% | 263.460000 | 7617.250000 | 6786.410000 |
max | 271.970000 | 8589.250000 | 7479.460000 |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101 entries, 0 to 100
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 101 non-null object
1 rental_index 101 non-null float64
2 condos 101 non-null float64
3 houses 101 non-null float64
dtypes: float64(3), object(1)
memory usage: 3.3+ KB
= pd.read_csv(
dataframe_20 'P02_ratio_homeowners_eu.csv')
Path(prc_data,
)
display(dataframe_20.describe()) dataframe_20.info()
2018 | |
---|---|
count | 36.000000 |
mean | 75.527778 |
std | 12.510170 |
min | 42.500000 |
25% | 69.800000 |
50% | 74.800000 |
75% | 84.100000 |
max | 96.400000 |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36 entries, 0 to 35
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 country 36 non-null object
1 2018 36 non-null float64
dtypes: float64(1), object(1)
memory usage: 704.0+ bytes
= pd.read_csv(
dataframe_30 'P03_avg_price_smeter.csv')
Path(prc_data,
)
display(dataframe_30.describe()) dataframe_30.info()
avg_price | |
---|---|
count | 18.000000 |
mean | 2841.213889 |
std | 1518.824498 |
min | 1246.500000 |
25% | 1687.625000 |
50% | 2377.000000 |
75% | 3908.875000 |
max | 7126.350000 |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 0 to 17
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 country 18 non-null object
1 avg_price 18 non-null float64
dtypes: float64(1), object(1)
memory usage: 416.0+ bytes
= pd.read_csv(
dataframe_40 'P04_buildings_by_canton.csv')
Path(prc_data,
)
display(dataframe_40.describe()) dataframe_40.info()
buildings_total | single_family_home | |
---|---|---|
count | 26.000000 | 26.000000 |
mean | 68236.961538 | 38733.923077 |
std | 64860.525980 | 35848.970768 |
min | 5299.000000 | 2686.000000 |
25% | 17153.000000 | 9472.250000 |
50% | 55635.500000 | 29065.500000 |
75% | 103385.750000 | 62761.500000 |
max | 238111.000000 | 118612.000000 |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26 entries, 0 to 25
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 canton 26 non-null object
1 buildings_total 26 non-null int64
2 single_family_home 26 non-null int64
dtypes: int64(2), object(1)
memory usage: 752.0+ bytes
= pd.read_csv(
dataframe_50 'P05_population_by_canton.csv')
Path(prc_data,
)
display(dataframe_50.describe()) dataframe_50.info()
inhabitants | |
---|---|
count | 2.600000e+01 |
mean | 3.361073e+05 |
std | 3.613990e+05 |
min | 1.636000e+04 |
25% | 7.634725e+04 |
50% | 2.408105e+05 |
75% | 4.035468e+05 |
max | 1.564662e+06 |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26 entries, 0 to 25
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 canton 26 non-null object
1 inhabitants 26 non-null float64
dtypes: float64(1), object(1)
memory usage: 544.0+ bytes
= pd.read_csv(
dataframe_60 'P06_construction_prices.csv')
Path(prc_data,
)
display(dataframe_60.describe()) dataframe_60.info()
General carpentry work | Elevators | Construction cleaning | Floor coverings | Sealing coatings | Earthworks | Attic | Windows | Scaffolding | Plastering work | Gardening work | Kitchen fittings | Painting work | Masonry and reinforced concrete work | Metal construction work | Tiling work | Sheet metal work | Civil engineering works | Carpentry work | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 7.000000 | 7.000000 | 7.000000 | 7.000000 | 7.000000 | 7.000000 | 7.000000 | 7.000000 | 7.000000 | 7.000000 | 7.000000 | 7.000000 | 7.000000 | 7.000000 | 7.000000 | 7.000000 | 7.000000 | 7.000000 | 7.000000 |
mean | 0.336175 | -0.076333 | -1.243108 | 0.801317 | 0.722615 | -0.434642 | 0.486336 | 0.331205 | -3.709846 | -0.392602 | -0.171934 | -1.510563 | -0.836175 | -1.656109 | -1.725674 | 0.385366 | -5.061521 | -0.898923 | -1.994070 |
std | 6.846241 | 2.742609 | 11.704542 | 7.542107 | 6.546152 | 10.466442 | 8.952986 | 2.116255 | 15.607533 | 7.361430 | 13.288935 | 7.577881 | 16.526478 | 7.031241 | 9.314879 | 8.503175 | 19.327410 | 8.449973 | 11.387849 |
min | -5.332594 | -1.834763 | -17.082390 | -9.766969 | -6.350533 | -12.021344 | -11.037287 | -2.855986 | -16.413265 | -10.468717 | -18.516207 | -11.455716 | -21.140344 | -8.746794 | -20.442963 | -11.743498 | -23.364883 | -11.603154 | -13.500970 |
25% | -4.492118 | -1.834763 | -10.384168 | -4.243347 | -5.144976 | -6.850601 | -6.046896 | -0.872953 | -10.543806 | -5.927453 | -6.011251 | -6.697025 | -11.012066 | -8.350783 | -3.688942 | -6.690155 | -16.402130 | -5.672421 | -9.860529 |
50% | -2.317067 | -1.834763 | -0.829673 | 1.671983 | 1.559295 | -2.191048 | 1.639683 | 0.208580 | -7.063719 | 1.363496 | -3.830083 | -4.356089 | -0.782745 | -0.587649 | 1.162569 | 4.825717 | -9.866113 | -2.173099 | -2.021207 |
75% | 4.252186 | 0.733669 | 8.800500 | 4.893565 | 4.785198 | 3.008626 | 5.401632 | 1.788769 | -5.949827 | 4.734577 | 7.636541 | 5.883567 | 5.858418 | 2.580034 | 2.995544 | 7.152028 | -1.954626 | 2.920700 | 0.641840 |
max | 10.482749 | 5.337382 | 12.377640 | 12.403766 | 10.569095 | 18.853844 | 14.092483 | 3.134207 | 30.495326 | 8.742762 | 17.892175 | 6.864781 | 26.377162 | 9.283176 | 8.587475 | 8.691597 | 34.513863 | 12.987236 | 20.001063 |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 General carpentry work 7 non-null float64
1 Elevators 7 non-null float64
2 Construction cleaning 7 non-null float64
3 Floor coverings 7 non-null float64
4 Sealing coatings 7 non-null float64
5 Earthworks 7 non-null float64
6 Attic 7 non-null float64
7 Windows 7 non-null float64
8 Scaffolding 7 non-null float64
9 Plastering work 7 non-null float64
10 Gardening work 7 non-null float64
11 Kitchen fittings 7 non-null float64
12 Painting work 7 non-null float64
13 Masonry and reinforced concrete work 7 non-null float64
14 Metal construction work 7 non-null float64
15 Tiling work 7 non-null float64
16 Sheet metal work 7 non-null float64
17 Civil engineering works 7 non-null float64
18 Carpentry work 7 non-null float64
19 region 7 non-null object
dtypes: float64(19), object(1)
memory usage: 1.2+ KB
= pd.read_csv(
dataframe_70 'P07_taxes.csv')
Path(prc_data,
)
display(dataframe_70.describe()) dataframe_70.info()
income_tax | profit_tax | wealth_tax | |
---|---|---|---|
count | 26.000000 | 26.000000 | 26.000000 |
mean | 33.516154 | 14.683462 | 4.665385 |
std | 6.507109 | 2.548612 | 2.353711 |
min | 22.220000 | 11.850000 | 1.300000 |
25% | 30.015000 | 12.815000 | 2.650000 |
50% | 33.250000 | 13.935000 | 4.400000 |
75% | 39.447500 | 15.822500 | 6.250000 |
max | 44.750000 | 21.040000 | 10.100000 |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26 entries, 0 to 25
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 canton 26 non-null object
1 income_tax 26 non-null float64
2 profit_tax 26 non-null float64
3 wealth_tax 26 non-null float64
dtypes: float64(3), object(1)
memory usage: 960.0+ bytes
To use the data correctly in our visualizations, we exported the specific needed data to .js
constants variables. These files can be found in the directory data/03_exports
.
Chart Name | Used Dataframes | Exported File |
---|---|---|
S1 | P02 | s1Data.js |
S2 | P03 | s2Data.js |
S3 | P01 | s3Data.js |
S4 | P04, P05 | s4Data.js |
S5 | P06 | s5Data.js |
S6 | P07 | s6Data.js |
Export Function
def generate_data_js(df: pd.DataFrame, js_file_name: str, variable_name: str, orient: str) -> None:
"""
Generates a JS file with a variable name and the data from the dataframe
-------------------------------------------------------------------------
:param df: dataframe to export
:param js_file_name: name of the JS file
:param variable_name: name of the variable
:param orient: orientation of the dataframe
:return: None
"""
= df.to_json(orient=orient, force_ascii=False)
json_data with open(Path(exp_data, js_file_name), 'w', encoding='utf-8') as f:
f"export const {variable_name} = {json_data};") f.write(
= [
s1_country_filter 'Switzerland',
'Norway',
'United Kingdom',
'Austria',
'France',
'Netherlands',
'Denmark',
'Czechia',
'Slovenia',
'Belgium',
'Spain',
'Italy',
'Croatia',
'Poland',
'Hungary',
'Portugal',
'Serbia',
'Latvia',
]
= pd.read_csv(
s1_data 'P02_ratio_homeowners_eu.csv')
Path(prc_data,
)
# filter out countries
= s1_data[s1_data['country'].isin(s1_country_filter)]
s1_data
# sort data descending
= s1_data.sort_values(by='2018', ascending=False)
s1_data
# export data
generate_data_js(=s1_data,
df='s1Data.js',
js_file_name='s1Data',
variable_name='records'
orient
)
s1_data.head()
country | 2018 | |
---|---|---|
10 | Croatia | 90.1 |
16 | Hungary | 86.0 |
34 | Serbia | 84.4 |
20 | Poland | 84.0 |
13 | Latvia | 81.6 |
= pd.read_csv(
s2_data 'P03_avg_price_smeter.csv')
Path(prc_data,
)
# sort data descending
= s2_data.sort_values(by='avg_price', ascending=False)
s2_data
# export data
generate_data_js(=s2_data,
df='s2Data.js',
js_file_name='s2Data',
variable_name='records'
orient
)
s2_data.head()
country | avg_price | |
---|---|---|
17 | Switzerland | 7126.35 |
4 | Norway | 4286.00 |
0 | UK | 4235.00 |
1 | Austria | 4205.00 |
2 | France | 4012.00 |
= pd.read_csv(
s3_data 'P01_price_indices.csv')
Path(prc_data,
)
# sort data descending
= s3_data.sort_values(by='date', ascending=True)
s3_data
# filter out data before 2020-01-01
= s3_data[s3_data['date'] >= '2020-01-01']
s3_data
# calculate relative differences
'prc_rental'] = (100/258.26)*s3_data['rental_index']
s3_data['prc_condos'] = (100/7157.74)*s3_data['condos']
s3_data['prc_houses'] = (100/6308.97)*s3_data['houses']
s3_data[
# export data
generate_data_js(=s3_data,
df='s3Data.js',
js_file_name='s3Data',
variable_name='records'
orient
)
s3_data.head()
date | rental_index | condos | houses | prc_rental | prc_condos | prc_houses | |
---|---|---|---|---|---|---|---|
61 | 2020-01-31 | 258.26 | 7157.74 | 6308.97 | 100.000000 | 100.000000 | 100.000000 |
62 | 2020-02-29 | 258.80 | 7245.38 | 6316.64 | 100.209092 | 101.224409 | 100.121573 |
63 | 2020-03-31 | 258.43 | 7366.09 | 6398.48 | 100.065825 | 102.910835 | 101.418774 |
64 | 2020-04-30 | 256.67 | 7383.67 | 6384.32 | 99.384341 | 103.156443 | 101.194331 |
65 | 2020-05-31 | 258.01 | 7342.29 | 6434.78 | 99.903198 | 102.578328 | 101.994145 |
= {
region_mapping_s4 'VD': 'Lake Geneva Region',
'VS': 'Lake Geneva Region',
'GE': 'Lake Geneva Region',
'BE': 'Espace Midland',
'FR': 'Espace Midland',
'SO': 'Espace Midland',
'NE': 'Espace Midland',
'JU': 'Espace Midland',
'BS': 'Northwestern Switzerland',
'BL': 'Northwestern Switzerland',
'AG': 'Northwestern Switzerland',
'ZH': 'Zurich',
'GL': 'Eastern Switzerland',
'SH': 'Eastern Switzerland',
'AR': 'Eastern Switzerland',
'AI': 'Eastern Switzerland',
'SG': 'Eastern Switzerland',
'GR': 'Eastern Switzerland',
'TG': 'Eastern Switzerland',
'LU': 'Central Switzerland',
'UR': 'Central Switzerland',
'SZ': 'Central Switzerland',
'OW': 'Central Switzerland',
'NW': 'Central Switzerland',
'ZG': 'Central Switzerland',
'TI': 'Ticino'
}
= pd.read_csv(
s4_data_1 'P04_buildings_by_canton.csv')
Path(prc_data,
)
= pd.read_csv(
s4_data_2 'P05_population_by_canton.csv')
Path(prc_data,
)
# merge dataframes
= pd.merge(
s4_data
s4_data_1,
s4_data_2,='canton'
on
)
'prc_single_family_home'] = s4_data['single_family_home'] / s4_data['buildings_total'] * 100
s4_data['region'] = s4_data['canton'].map(region_mapping_s4)
s4_data[
# export data
generate_data_js(=s4_data,
df='s4Data.js',
js_file_name='s4Data',
variable_name='records'
orient
)
s4_data.head()
canton | buildings_total | single_family_home | inhabitants | prc_single_family_home | region | |
---|---|---|---|---|---|---|
0 | AG | 153894 | 102206 | 703086.0 | 66.413245 | Northwestern Switzerland |
1 | AI | 5299 | 2917 | 16360.0 | 55.048122 | Eastern Switzerland |
2 | AR | 16323 | 9251 | 55585.0 | 56.674631 | Eastern Switzerland |
3 | BE | 238111 | 114053 | 1047473.0 | 47.899089 | Espace Midland |
4 | BL | 67390 | 46632 | 292817.0 | 69.197210 | Northwestern Switzerland |
= pd.read_csv(
s5_data 'P06_construction_prices.csv')
Path(prc_data,
)
# move last column to first position
= list(s5_data.columns)
cols = [cols[-1]] + cols[:-1]
cols = s5_data[cols]
s5_data
# transform dataframe
= s5_data['region'].to_list()
regions = s5_data.T
s5_data = regions
s5_data.columns
# reset & rename index
=True)
s5_data.reset_index(inplace={'index': 'work_category'}, inplace=True)
s5_data.rename(columns
# drop first row
= s5_data.iloc[1:].copy()
s5_data
# export data
generate_data_js(=s5_data,
df='s5Data.js',
js_file_name='s5Data',
variable_name='records'
orient
)
s5_data.head()
work_category | Lake Geneva Region | Ticino | Espace Midland | Central Switzerland | Northwestern Switzerland | Zurich | Eastern Switzerland | |
---|---|---|---|---|---|---|---|---|
1 | General carpentry work | 9.782592 | 10.482749 | -5.054522 | -2.317067 | -5.332594 | -1.27822 | -3.929714 |
2 | Elevators | -1.834763 | -0.301372 | 5.337382 | 1.76871 | -1.834763 | -1.834763 | -1.834763 |
3 | Construction cleaning | 12.37764 | 6.612687 | 10.988312 | -12.24967 | -0.829673 | -8.518666 | -17.08239 |
4 | Floor coverings | 6.990752 | 12.403766 | -4.967931 | 1.671983 | -3.518762 | 2.796378 | -9.766969 |
5 | Sealing coatings | 6.729232 | 10.569095 | -5.094807 | -5.195145 | 1.559295 | -6.350533 | 2.841165 |
= {
region_mapping_s6 'Genf': 'Lake Geneva Region',
'Waadt': 'Lake Geneva Region',
'Wallis': 'Lake Geneva Region',
'Bern': 'Espace Midland',
'Freiburg': 'Espace Midland',
'Solothurn': 'Espace Midland',
'Neuenburg': 'Espace Midland',
'Jura': 'Espace Midland',
'Basel-Stadt': 'Northwestern Switzerland',
'Basel-Landschaft': 'Northwestern Switzerland',
'Aargau': 'Northwestern Switzerland',
'Zürich': 'Zurich',
'Glarus': 'Eastern Switzerland',
'Schaffhausen': 'Eastern Switzerland',
'Appenzell Ausserrhoden': 'Eastern Switzerland',
'Appenzell Innerrhoden': 'Eastern Switzerland',
'St. Gallen': 'Eastern Switzerland',
'Graubünden': 'Eastern Switzerland',
'Thurgau': 'Eastern Switzerland',
'Luzern': 'Central Switzerland',
'Uri': 'Central Switzerland',
'Schwyz': 'Central Switzerland',
'Obwalden': 'Central Switzerland',
'Nidwalden': 'Central Switzerland',
'Zug': 'Central Switzerland',
'Tessin': 'Ticino'
}
= pd.read_csv(
s6_data 'P07_taxes.csv',
Path(prc_data, ='latin1')
encoding
)
# sort data descending
= s6_data.sort_values(by='canton', ascending=False)
s6_data 'region'] = s6_data['canton'].map(region_mapping_s6)
s6_data[
# group data by region
= s6_data.groupby('region').mean().reset_index()
s6_data
# export data
generate_data_js(=s6_data,
df='s6Data.js',
js_file_name='s6Data',
variable_name='records'
orient
)
s6_data.head()
region | income_tax | profit_tax | wealth_tax | |
---|---|---|---|---|
0 | Central Switzerland | 25.516667 | 12.575000 | 2.066667 |
1 | Eastern Switzerland | 30.308571 | 13.455714 | 3.614286 |
2 | Espace Midland | 37.384000 | 15.954000 | 5.280000 |
3 | Lake Geneva Region | 40.916667 | 15.040000 | 8.100000 |
4 | Northwestern Switzerland | 38.963333 | 16.143333 | 6.666667 |