Section Overview

    • Welcome to our class Applied Data Science in Tourism!

      The course starts on the 31st March 2025 at 9:30am and finishes on the 11th April 2025. Normally, we start at 9:30am, followed by a lunch break from 12:30pm to 1:30pm. After 1:30pm we have our afternoon session until 3pm, except for Wednesdays, which are reserved for language classes etc. Classes typically take place in 01.011 or online if needed (see virtual classroom below).

    • Schedule

      First Week

      Date | Topic
      31st March 2025 | Organization, Exam, and Introduction (9:30 to 11:00); Presentation by Sophia Quint from visitBerlin (11:00 to 12:00); Researching Data (13:30 to 15:00)
      1st April 2025 | First Steps in Metabase (9:30 to 12:30 and 13:30 to 15:00)
      2nd April 2025 | More Steps in Metabase (9:30 to 12:30); Enriching Data
      3rd April 2025 | First Steps in Orange (9:30 to 12:30); Orange for Forecasting (13:30 to 15:00)
      4th April 2025 | More Forecasting in Orange (9:30 to 12:30); Summary of the week (13:30 to 15:00)

      Second Week

      Date | Topic
      7th April 2025 | Predicting Prices on Airbnb Data (9:30 to 12:30 and 13:30 to 15:00)
      8th April 2025 | Clustering and finding patterns in Airbnb Data (9:30 to 12:30 and 13:30 to 15:00)
      9th April 2025 | Word Cloud of Reviews in Airbnb Data (9:30 to 12:30)
      10th April 2025 (in 01.103) | GNTB Knowledge Graph (9:30 to 12:30); Exam Preparation and Questions (13:30 to 15:00)
      11th April 2025 (online in our virtual classroom) | Guest Card Oberstaufen (9:30 to 12:30); Course Summary (13:30 to 15:00); visitBerlin Dashboard/SARIMA in SPSS

    • In case of unforeseen circumstances, important information will be shared here.

    • In the class we will use the following tools. Our goal is to learn how to use these tools and apply them skillfully to our tourism data sets.

      • Metabase
      • Orange (KNIME/RapidMiner)
      • (Jupyter Notebooks)
      • (RStudio)

      There are many more tools. The choice above is primarily driven by the fact that these tools offer open-source versions. This means you can use them freely without needing to pay any money. The first two tools also require no programming skills.

    • During our first week, we will work together on a data project. Our goal is to forecast the Berlin visitors originating from the US. We will search for relevant data, create our first visualizations and try to forecast the development of visitors from the US.

    • Task
      • First we want to figure out who is visiting Berlin. What are the top residences of people visiting Berlin?
      • Try to find officially collected data covering who is visiting Berlin. Which time duration is covered by the data? How can you access this data? What is stored in these files?

      Since some German is helpful, break up into groups of two, making sure at least one person per group speaks German.

    • The data is linked from https://www.statistik-berlin-brandenburg.de/archiv/g-iv-1-m. The linked page lists Excel files for each month starting in 2009.

    • The files in this folder present the data available at https://www.statistischebibliothek.de/mir/receive/BBSerie_mods_00000050, but as a single CSV file of Berlin guests by residence for easy access. The data is also provided as SQLite files.

    • After watching the short introductory video at metabase.com, follow the installation instructions at https://www.metabase.com/docs/latest/installation-and-operation/running-the-metabase-jar-file#local-installation to install Metabase. Basically, you need to install Java and download the metabase.jar file. After starting Metabase, you can access it using your browser at http://localhost:3000.

    • Task
      • Familiarize yourself with the Metabase tool and the sample x-rays. You can find the documentation at https://www.metabase.com/learn/.
      • Add our two SQLite files as separate datasets. You can also upload CSVs to Metabase into the Sample Database.
      • Create a stacked bar chart (a bar for each type of domestic accommodation) over time. You can find documentation at https://www.metabase.com/learn/metabase-basics/querying-and-dashboards/visualization/bar-charts. It's fine to follow their example to learn about the bar chart and then translate your learning into your own bar chart.
        Did you notice the data problem for clinics/centers and clinics?
        Accommodation of Domestic Visitors
      • Now that you are familiar with the basics of Metabase, how about creating a new chart of your own using the data? Remember, you can select, filter, aggregate, and group the data prior to visualizing it. Present your created chart in class.
    • Visitor wheel

    • Task
      • Create a chart showing the top foreign residences of people visiting Berlin.
        It could look something like the chart below
        Top Residences
      • When do the two regularly occurring peaks happen over the year for visitors from the US? Can you make out differences within these two peaks?
    • Task
      • It is possible to predict the number of future visitors from historical data. We will do this starting tomorrow. What are downsides to predicting the number of visitors this way? What are the implicitly made assumptions?
      • What are indicators predictive of visitors? Which could be publicly accessed and downloaded and help us predict US visitors?
    • Task
      • With Google Trends you can get data from Google showing what people search for on Google within certain time spans and geographic locations. Play with Google Trends. Can you find search terms with a similar pattern to our Berlin visitors from the US?
      • What is the value reported by Google to indicate the volume of searches (the y-axis on the Google Trends charts)? Research the definition and draw conclusions when comparing different search terms using Google Trends.
    • "Indexing: Google Trends data is pulled from a random, unbiased sample of Google searches, which means we don’t have exact numbers for any terms or topics. In order to give a value to terms, we index data from 1-100, where 100 is the maximum search interest for the time and location selected."
      "Normalization: When we look at search interest in a topic or query, we are not looking at the total number of searches. Instead, we look at the percentage of searches for that topic, as a proportion of all searches at that time and location."

      from https://newsinitiative.withgoogle.com/resources/trainings/basics-of-google-trends/ and see also https://support.google.com/trends/answer/4365533
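
      To make the two quoted definitions concrete, here is a minimal Python sketch with made-up numbers (not real Google data) showing how normalization and indexing produce the 0-100 values on the y-axis:

          # Hypothetical monthly search counts (made-up numbers for illustration)
          term_searches = [120, 300, 180, 600]           # searches for one term
          total_searches = [10000, 12000, 9000, 15000]   # all searches at that time/place

          # Normalization: share of all searches at each point in time
          shares = [t / total for t, total in zip(term_searches, total_searches)]

          # Indexing: rescale so the maximum share becomes 100
          max_share = max(shares)
          index = [round(100 * s / max_share) for s in shares]

          print(index)  # [30, 62, 50, 100]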

    • Task
      • Building on yesterday's exploration of data from Google Trends, let's create a final chart contrasting the official data of US visitors with searches on Google. For this download the google_trends.sqlite file below.
      • This file contains two tables, which have to be joined by month_year to combine/enrich the data. You can read about joins at https://www.metabase.com/docs/latest/questions/query-builder/join. (A pandas sketch of the same join is shown after this task.)
      • As an inspiration, your final chart could look as follows:
        Google Trends and Actual Visitors from the US (note the two y-axes)
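
      If you want to double-check the Metabase join outside of the tool, a minimal pandas sketch could look like the following. The table and column names other than month_year are assumptions; list the actual table names first.

          import sqlite3
          import pandas as pd

          con = sqlite3.connect("google_trends.sqlite")
          # Find out the actual table names first (the names below are assumptions)
          print(pd.read_sql("SELECT name FROM sqlite_master WHERE type='table'", con))

          trends = pd.read_sql("SELECT * FROM google_trends", con)   # assumed table name
          visitors = pd.read_sql("SELECT * FROM visitors", con)      # assumed table name

          # Inner join on the shared month_year column, as in the Metabase join
          combined = trends.merge(visitors, on="month_year", how="inner")
          print(combined.head())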
    • This file contains Google Trends information and the data about Berlin visitors.

    • Task
      • Orange is a software tool for data mining. It does not require any textual programming. The data analysis is performed by arranging nodes and edges, where nodes represent the operations and edges represent the flow of the data. Install Orange from the official sources at https://orangedatamining.com/download/.
      • Familiarize yourself with Orange by loading the Berlin visitors data below and filter for the US residence. Look at the data using the Data Table.
    • Task
      • Based on the above video, create a forecast for the US visitors for our data. Play with the (hyper-)parameters similarly to how Nathan Humphrey does it. Can you make it work? If not, speculate why it might not work. Research ARIMA and SARIMA models. (A statsmodels sketch of the same steps follows after this task.)
      • In addition, try out the Seasonal Adjustment node and create a line chart with trend, seasonal and residual plots.
        Note the Edit Domain workaround needed for a bug in Orange (see https://github.com/biolab/orange3-timeseries/issues/281) to fix the error "variable month_year is not in domain" in the Seasonal Adjustment widget.
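
      If you want to cross-check Orange's results outside of the GUI (for example in a Jupyter Notebook), a minimal sketch with pandas and statsmodels could look like this. The file and column names are assumptions; adapt them to the actual CSV.

          import pandas as pd
          from statsmodels.tsa.arima.model import ARIMA
          from statsmodels.tsa.seasonal import seasonal_decompose

          # Assumed file and column names
          df = pd.read_csv("berlin_visitors.csv", parse_dates=["month_year"])
          series = df.set_index("month_year")["visitors"].asfreq("MS")

          # Decompose into trend, seasonal, and residual components (12-month season)
          decomposition = seasonal_decompose(series, model="additive", period=12)
          decomposition.plot()

          # Fit a simple ARIMA model and forecast 12 months ahead
          model = ARIMA(series, order=(1, 1, 1)).fit()
          print(model.forecast(steps=12))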
    • If you want to use Orange in a computer pool, you can follow these instructions for setting it up:

      1. Download the portable version of Orange for Windows from https://orangedatamining.com/download/ and extract the zip file into some folder.
        Be patient. It takes some time to extract the file since it contains lots of files and data.
      2. Navigate to the folder and double click on the link called Orange.
        Again, be patient. Starting Orange for the first time usually takes longer, but later starts are quicker.

      This setup was tested in the computer pool room 05.110 using Orange 3.38.1.

      These instructions might also work on other computers.

    • Forecast Pipeline in Orange

    • Trend, Season, and Residuals

    • Task
      • Yesterday, you used the Seasonal Adjustment node. How did you choose the seasonal period? Find a way to let Orange determine the seasonal period, so that the computer figures out a value itself instead of us guessing manually. It also helps to review and critically question our guessed value. (One approach based on autocorrelation is sketched after this task.)
      • Apply an ARIMA model to the trend of US visitors and create a line chart of the trend with forecasting and a line chart for the seasonal pattern.
        ARIMA trend forecast
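
      One way to let the computer estimate the seasonal period is to look at the autocorrelation of the series and pick the lag with the strongest repetition. A minimal statsmodels sketch, reusing the series variable from the sketch above:

          import numpy as np
          from statsmodels.tsa.stattools import acf

          # Autocorrelation for lags 0..24 of the monthly series
          autocorr = acf(series.dropna(), nlags=24)

          # Skip lags 0 and 1 and take the lag with the highest autocorrelation
          candidate_lags = np.arange(2, 25)
          estimated_period = candidate_lags[np.argmax(autocorr[2:])]
          print("Estimated seasonal period:", estimated_period)  # expect 12 for yearly seasonality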
    • Task
      • Orange does not provide a SARIMA model (an ARIMA model that also handles seasonality patterns; see https://datascience.stackexchange.com/questions/120136/seasonal-arima). Check out other models and try to use these for forecasting US visitors, forecasting 5 years ahead.
        Forecast

      • Perform a statistical hypothesis test in Orange to determine whether different time series from Google Trends predict the US visitors. What lag is reported for different search terms (see the google_trends.csv file below)? You can learn the very basics about a suitable hypothesis test at https://www.statology.org/granger-causality-test-in-r/. For this you have to merge the US visitor data with the Google Trends data, similarly to how you did it in Metabase, but now in Orange. (A statsmodels sketch of the test follows below.)
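
      The hypothesis test linked above is the Granger causality test. As a cross-check outside of Orange, a minimal statsmodels sketch could look like this, reusing the combined table from the earlier pandas join sketch; the column names are assumptions.

          from statsmodels.tsa.stattools import grangercausalitytests

          # First column: variable to be predicted; second column: candidate predictor
          data = combined[["visitors", "search_volume"]].dropna()  # assumed column names

          # Test whether the search term Granger-causes the visitor numbers for lags 1..6
          results = grangercausalitytests(data, maxlag=6)
          # Small p-values at a given lag suggest the search term is predictive at that lag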
    • Orange Workflow for US visitor data

    • Right click on the link to save this exemplary solution on your computer where you can open it using Orange.

    • What did we do during the week?

      • Sophia Quint brought us the case of forecasting US visitors visiting Berlin.
      • You researched and identified relevant data.
      • You installed Metabase on your computer and are able to use it from now on.
      • You explored the researched data in the form of SQLite files using Metabase and created different charts.
      • You identified the "peak behavior" of Berlin visitors from the US and visualized it.
      • You learned about Google Trends data and found search terms with similar patterns relating to our US visitor data.
      • You enriched the US visitor data with Google Trends data showing a peak in searches before actual US visitors arrive in Berlin.
      • You installed Orange on your computers and learned about its basic widgets/nodes and edges for modeling the flow of data.
      • You familiarized yourself with Orange and loaded the data in the form of CSV files.
      • You learned about different nodes in Orange, specifically you applied ARIMA (autoregressive (AR) integrated (I) moving average (MA)) and VAR (vector autoregression) models to forecast US visitors.
      • You performed your (first ever?) statistical hypothesis test to determine whether search terms on Google Trends are predictive for forecasting US visitors. You applied the so-called Granger causality test.

      In summary, you did a lot. But actually more importantly, I hope, you learned some meta lessons along the way.

      • Using a powerful but complicated machine like a computer is challenging but can be rewarding. You need to be exact when instructing a computer. It's expected to struggle and to run into problems. When you solve a problem, that's learning. If you run into a problem and solve it yourself, you are much more likely to remember the solution.
      • Things go wrong. That's expected and normal. This way you gain experience. The more you know, the easier it gets. Also, make sure to reflect on what you are doing and form hypotheses internally to verify your understanding. This way you get better and avoid dead-ends.
      • The more you do (in and across tools), reflecting when something does not work and trying to fix it in a principled, non-random manner, the more insights and confidence you will gain. Problems come in similar patterns with similar solutions.
      • With the Internet, it's very likely that another person ran into the very same problem you are experiencing right now (and was kind enough to document it and provide a solution). Use that to your advantage and search for similar problems online to find solutions.
      • It's normal to struggle. You saw me struggle myself (more than once in Metabase and more than once in Orange). The more you use a tool, the more you become an expert in it. If you are not using it regularly, you get the chance to re-learn things you have forgotten.

      The next week will basically be the same as this week with

      • new datasets to explore and
      • new methods to learn and apply

      to gain more experience and confidence. In the end you should be able to solve data science problems with the tools we covered in class yourself and transfer some of your knowledge and skills to other tools in the future.

    • Classification on Airbnb data

      In a classification, we try to predict a categorical variable from features. The model observes the relationship between input features and the output target in the data and tries to learn patterns for predicting the target label on unseen data.

      Task
      • Download the detailed listing data for Berlin (or any other place you find interesting) from https://insideairbnb.com/get-the-data/.
      • What columns/fields does this dataset contain? Familiarize yourself with the different columns. How are categorical columns vs numeric columns displayed in Orange?
      • Select neighbourhood_group_cleansed as the target and predict it using different classification models like kNN, Tree, and Random Forest. You can analyze the model results using a Confusion Matrix.
        You can use video tutorials like the following to learn about classification.
      • What are the most useful but obvious columns for predicting the neighbourhood?
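
      If you later want to reproduce such a pipeline in code (for example in a Jupyter Notebook), a minimal scikit-learn sketch could look like this. The file name and the feature choice are assumptions.

          import pandas as pd
          from sklearn.ensemble import RandomForestClassifier
          from sklearn.metrics import confusion_matrix
          from sklearn.model_selection import train_test_split

          df = pd.read_csv("listings.csv")  # detailed listings file from insideairbnb.com

          # Two numeric features as a starting point (add more as you explore)
          features = df[["latitude", "longitude"]]
          target = df["neighbourhood_group_cleansed"]

          X_train, X_test, y_train, y_test = train_test_split(
              features, target, test_size=0.3, random_state=42)

          model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
          print(confusion_matrix(y_test, model.predict(X_test)))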
    • Exemplary solution to the above task

    • Regression on Airbnb data

      In a regression, we try to predict a numeric variable from features. It's similar to classification except that the predicted value is continuous. 

      Task
      • Use the same data as for the previous task.
      • Predict the prices of offered accommodations on Airbnb. Use different models like Linear Regression, kNN, and Random Forest to predict the prices.
        The following video explains linear regression among other concepts.
      • Which model has the lowest mean absolute error (MAE) for you?
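
      As with classification, the same pipeline can be sketched in scikit-learn, here with the mean absolute error as the metric and the `$` cleanup done in pandas. The feature choice is an assumption.

          import pandas as pd
          from sklearn.linear_model import LinearRegression
          from sklearn.metrics import mean_absolute_error
          from sklearn.model_selection import train_test_split

          df = pd.read_csv("listings.csv")
          # Remove the "$" and the thousands separator, then convert to numbers
          df["price"] = pd.to_numeric(
              df["price"].str.replace("$", "", regex=False).str.replace(",", "", regex=False))
          df = df.dropna(subset=["price", "accommodates", "bedrooms"])

          X = df[["accommodates", "bedrooms"]]  # assumed feature choice
          y = df["price"]
          X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

          model = LinearRegression().fit(X_train, y_train)
          print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))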
    • You can use the following code in a Python Script widget within Orange to remove the `$` from the price column. Afterwards, you can turn this column into a numeric column using Edit Domain to use it as a target for prediction in a regression model.

      # Copy the input table so the original data stays untouched
      out_data = in_data.copy()

      # Strip the leading "$" from each price string, e.g. "$120.00" -> "120.00"
      for row in out_data:
          row['price'] = row['price'].value.replace("$", "")

    • Exemplary solution to the above task

    • In the last section we learned about classification and regression models. These are models that observe the input data (features) along with the output data (labels). Their purpose is to model the relationship between inputs and outputs for prediction. In this section we are looking at methods that focus on uncovering patterns in data. Their primary goal is not to make predictions but to help us get a better understanding of the data and its patterns.

    • Clustering and Dimensionality Reduction

      Task
      • Familiarize yourself with the k-means algorithm by using the Interactive k-Means widget. What are the two basic steps this algorithm performs? Note that the algorithm relies on calculating distances between points. What is the purpose of this algorithm?
      • Build a data pipeline in Orange to apply the k-means clustering algorithm on the Airbnb data. Try to uncover patterns within the data. Use the Data Sampler to reduce the size of the data.
        Try finding clusters using only two features first. This makes it easy to verify the identified clusters.
      • Usually, the data has more than two or three dimensions (columns/features). This makes it challenging to visually show patterns. Make use of dimensionality reduction methods like principal component analysis (PCA) to project the data into two dimensions. (A scikit-learn sketch of clustering and projection follows below.)
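
      For comparison, both steps can be sketched in scikit-learn; the feature choice is an assumption.

          import pandas as pd
          from sklearn.cluster import KMeans
          from sklearn.decomposition import PCA
          from sklearn.preprocessing import StandardScaler

          df = pd.read_csv("listings.csv")
          features = df[["latitude", "longitude", "accommodates", "bedrooms"]].dropna()

          # k-means relies on distances, so scale the features first
          X = StandardScaler().fit_transform(features)

          # k-means alternates two steps: assign points to the nearest center,
          # then recompute the centers, until the assignment stabilizes
          labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

          # Project the four-dimensional data down to two dimensions for plotting
          projected = PCA(n_components=2).fit_transform(X)
          print(projected[:5], labels[:5])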
    • Right click on the link to save this exemplary solution on your computer where you can open it using Orange.

    • Embedding Images from Airbnb

      Task
      • Did you notice that the Airbnb data set contains URLs to images (see column picture_url)? Sample 20% of these images and download them using the Save Images widget from the Image Analytics add-on. (A plain-Python sketch of this step follows after this task.)
      • Use a (convolutional) neural network in Orange to embed the Airbnb accommodation images (Image Embedding). This basically turns any image into a point in a high-dimensional space. Then you can apply clustering methods and dimensionality reduction to gain insights about the images. You can view images using the Image Viewer widget.
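
      The sampling and downloading step could also be sketched in plain Python; picture_url is the column from the data set, everything else is an assumption.

          import os
          import pandas as pd
          import requests

          df = pd.read_csv("listings.csv")
          sample = df["picture_url"].dropna().sample(frac=0.2, random_state=42)

          # Download each image into a local folder; skip entries that fail
          os.makedirs("images", exist_ok=True)
          for i, url in enumerate(sample):
              try:
                  response = requests.get(url, timeout=10)
                  response.raise_for_status()
                  with open(f"images/listing_{i}.jpg", "wb") as f:
                      f.write(response.content)
              except requests.RequestException:
                  pass  # broken or unreachable URL; ignore and continue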
    • Text Analysis or Natural Language Processing (NLP)

      Task
      • The Airbnb data also provides unstructured texts like reviews and description texts. Examine the relevant columns to see what kind of texts these columns contain.
      • To provide widgets for text analysis, you need to install the Text add-on. Similar to the Form TimeSeries widget, you have to turn the data into a Corpus for doing text analysis in Orange. Make sure to only take a small sample; otherwise you might overload your computer. Build a word cloud from the review texts, using Preprocess Text to bring the text into shape where needed. (A wordcloud sketch in Python follows after this task.)
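
      Outside of Orange, the same idea can be sketched with the wordcloud package. Inside Airbnb provides the reviews as a separate file; the file and column names are assumptions.

          import pandas as pd
          import matplotlib.pyplot as plt
          from wordcloud import WordCloud, STOPWORDS

          # Small random sample of reviews to keep the computation light
          reviews = pd.read_csv("reviews.csv").dropna(subset=["comments"]).sample(1000, random_state=42)

          text = " ".join(reviews["comments"].astype(str))
          cloud = WordCloud(stopwords=STOPWORDS, background_color="white").generate(text)

          plt.imshow(cloud, interpolation="bilinear")
          plt.axis("off")
          plt.show()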
    • Word Cloud of Reviews

    • Task
      • Compare the review ratings with the reviews by doing a Sentiment Analysis. Try out different methods for doing sentiment analysis. (A VADER sketch in Python follows after this task.)
      • Try out other widgets from the Text Mining section like Topic Modeling.
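
      One common lexicon-based method for sentiment analysis is VADER; here is a minimal NLTK sketch, reusing the reviews sample from the word cloud sketch above.

          import nltk
          from nltk.sentiment import SentimentIntensityAnalyzer

          nltk.download("vader_lexicon")  # one-time download of the lexicon
          analyzer = SentimentIntensityAnalyzer()

          # The compound score ranges from -1 (very negative) to +1 (very positive)
          for text in reviews["comments"].astype(str).head(5):
              print(analyzer.polarity_scores(text)["compound"], text[:60])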
    • Data Science Workflow

    • For carrying out data projects, there exist guidelines on what to do and in which order. A popular approach is the Cross-industry standard process for data mining (CRISP-DM) shown below.

      CRISP data mining process diagram
      licensed under CC BY-SA 3.0 DEED from https://commons.wikimedia.org/wiki/File:CRISP-DM_Process_Diagram.png

      There exist different approaches highlighting different parts of the process. But usually it involves a definition or framing of the problem/question to be answered, various data-related steps like data collection and data preprocessing, and model-related steps like modeling and model evaluation, followed by the deployment of the model. It is advisable to also include a specific step to verify and update the understanding.

      I want to highlight that these steps are typically performed iteratively: a discovery during modeling may lead to updated data processing. So the steps are not strictly followed but are often revisited as needed.

    • Make sure you are registered for the exam in EMMA if you want to take the exam. You can cancel your registration until 7 days before the exam.

      The exam consists of a presentation which has to be given on the 20th June 2025 in 01.011 from 9:00 to 14:00. Upload your presentation below before the day of the examination. The exact schedule will be added below and finalized 7 days before the exam.

    • Time on 20th June 2025 | Examinee
      09:00 to 09:20 | (to be announced)
    • In your presentation you are expected to show your skills in working with data. The presentation is supposed to take 15 minutes, followed by some questions. In total the exam will take around 20 minutes. You can use the data sets and tools we covered in class. You can also work with other data sets and tools that you are already familiar with or would like to learn. When you are using another data set, make sure to also share that data set with me for verification purposes.
      For my grading of your work, keep in mind

      • I'd like you to show and apply what you have learned in class.
      • Prepare and present your slides/dashboard/documents in English.
      • Use the terminology that we covered in class (filter, select, aggregate, bar chart, scatter plot, bubble chart, categorical variable, metric feature, supervised/unsupervised learning, etc.) during your presentation.
      • Show that you can summarize visually in a concise and non-misleading way.
      • Present a more elaborate analysis to reveal some patterns in your data set.
      • Aim for a coherent presentation and do not jump too much between different data sets. This also allows you to go into some depth.
      • I'm most interested in your visuals and your explanation of those and the insights you got.
      • Don't forget to include definitions for data fields. Sometimes it is not obvious what is being counted or measured.
      • For a very good grade you are expected to do more than just replicate and rearrange the charts and analyses we did in class. Learn a new analysis, apply it and present its results.
      • Have a critical mind and reflect on the obtained results. Make sure that you applied the method correctly and check that the data is plausible. Report hyperparameters if they are important and non-obvious for the analysis.

      Good luck!

      You can find an example from the German Tourism Association (Deutscher Tourismusverband) below. It provides basic visualizations.

    • Hints

      • Some of the data sets we covered come with reports. You can use those reports for inspiration.
      • You are allowed to talk to your fellow students about methods and how to apply them. You can also brainstorm together. But make sure that you don't copy from each other: we want to avoid same/similar charts or analyses on the same/similar data.
    • Upload your presentation here. You can upload either slides as a PDF file or an exported Metabase dashboard that you are going to present. Make sure, though, that your presentation is in a single PDF file. Your submission will be provided on a device already connected for presentation. Don't forget to include any custom data that was not covered in class in a ZIP file. The latest submission date is the day before the exam.