Tools for Analytics (TFA)
Course notes
2024-11-04
Module 1 Introduction to the course
This site contains course notes for the course “Tools for Analytics” held at Aarhus BSS. The notes show the learning path for each week and contain. The course is an introductory course at the Operations and Supply Chain Analytics programme and intended to give knowledge about IT tools for Analytics. Expect the notes to be updated when the course runs. The date listed above is the last time the notes was updated. A set of slides are also available.
Learning path diagram
Click/hover the nodes to follow links and see details.
A detailed description of Business Analytics have been pointed out as an extra supplement in the learning path diagram. You may have a look at it if you like.
1.1 Learning outcomes
By the end of this module, you are expected to:
- Memorize the purpose of the course.
- Describe what the term Business Analytics mean.
- Identify pros and cons of using Excel, VBA and R.
- Describe how a computer works.
- Describe what an algorithm is.
- Know how the course is organized.
The learning outcomes relate to the overall learning goals number 1, 3 and 5 of the course.
1.2 Purpose for the course
Since the amount of available data has increased extensively in many companies, there is a need for analysts with the ability to do tasks within Analytics. For instance, extract relevant data and perform valid quantitative analysis. Clearly, it is also important that the analyst can communicate the results of the analysis to their surroundings. This requires for the analyst to be particularly qualified in handling IT based tools beyond e.g. basic Excel.
Business Analytics (BA) (or just Analytics) refers to the scientific process of transforming data into insight for making better decisions in business. BA can both be seen as the complete decision making process for solving a business problem and the creation of business value by integration of concepts, methods and data. As a process, it can be characterized by descriptive, predictive, and prescriptive model building using data sources. For a full definition see the appendix.
Within a Business Analytics (BA) framework the course focuses on giving you an introduction to programming, handeling data and doing descriptive analytics. Descriptive analytics categorizes, characterizes, consolidates, and classifies data. Examples are standard reporting and dashboards (key performance indicators (KPIs), what happened or is happening now?) and ad-hoc reporting (how many/often?). Descriptive analytics often serves as a first step in the successful application of predictive or prescriptive analytics. Predictive and prescriptive analytics are covered in other courses of the programme.
Analytics may be seen as a data driven process:
For doing data driven analytics you first must import your data. That is, take data from a database, file, web API etc. and transform it into a data frame/table. In general raw data may be messy and need to be structured in a tidy way. Tidying your data means storing it in a structured form suitable for analysis. In brief, when your data is tidy, each column is a variable, and each row is an observation. Tidy data is important because the consistent structure lets you focus your struggle on questions about the data. Once you have tidy data, a common first step is to transform it. Transformation includes narrowing in on observations of interest (e.g. only observations from a specific year or warehouse), creating new variables based on existing variables (e.g. the cost of using the machine that day given idle time). Together, tidying and transforming are called wrangling, because it can be a hard process to get your data in the right form.
The next step is to do a simple exploration of you data such as calculating a set of summary statistics (like counts, means or KPIs). A good way to get an overview over your data is by visualization. A good visualisation will show you things that you did not expect, raise new questions about the data or confirm your hypothesis. A good visualization might also hint that you’re asking the wrong question, or you need to collect different data. Exploration and visualization are descriptive analytics and used to answer questions such as: What happened? How many, how often, where? Where exactly is the problem? What actions are needed? Models are complementary tools to visualization. Once you have made your questions sufficiently precise, you can use a model to answer them. A model is a description of a system using mathematical concepts and a simplification of the real system. That is, the results of a model are based on a set of assumptions. Models for statistical analysis, forecasting, system behavior are predictive analytics and answer questions like: Why is this happening? What if these trends continue? What will happen next? Models for prescriptive analytics use optimization and other decision modeling techniques to suggest decision options with the goal of improving business performance and answer questions like: What is the best that can happen?
Exploration, visualization and modeling may be seen as different steps which can be used for analyzing the data and answer the overall questions. This course will focus on the two first steps.
Given an analysis, communication is an absolutely critical part. It does not matter how well your models and visualization have led you to understand the data unless you can also communicate your results to decision makers.
Note that analytics is not a one-way process, it is common that you several times have to tidy and transform your data, explore and visualize based on the results of a model, rerun the model based on feedback from the decision makers etc. Common connections are visualized using directed arrows in Figure 1.1.
Surrounding the process is programming. Programming is the Swiss army knife you use during parts of the process. An introduction to programming is given using both VBA in Excel and the programming language and free software environment R. Programming focus on writing algorithms. An algorithm is a finite sequence of well-defined instructions to solve a specific problem or to perform a computation. That is, we use a programming language to program an algorithm that solves a specific task, e.g. find the best route, sort words, make a plot, etc.
1.3 R vs Excel/VBA
This course gives you an introduction to programming using both VBA and R. The two programming languages are different and here are some comparisons:
Excel
Pros:
- Initial learning curve is quite minimal.
- Analysis can be done via point-and-click.
- Useful for fast analysis (you can change a cell and see effects on other cells, plots etc.)
- It is not exceedingly hard to make basic graphs and charts.
- Data can be stored inside the sheets.
Cons:
- The mixture of data entries, analysis, and visualization makes it easy to confuse cells that contain raw data from those that are the product of analysis.
- The analysis directly manipulates the only copy of the raw data.
- Using mouse clicks means that a mistaken click or drag action can lead to errors or the overwriting of data.
- Do not handle non-tabular data well.
VBA
VBA is a compiled language implemented using compilers (translators that generate machine code from source code). That is, code need to be compiled first before running it.
Pros:
- Can be used inside MS Office applications e.g. Excel.
- Already contained in Excel, i.e. if you have Excel installed you can start coding.
- The VBA code is stored within the spreadsheet, allowing any user with access to the spreadsheet to easily run the code.
- VBA is easy to learn. Especially if you are already experienced in Excel.
- Good for automating tasks in Excel.
- Still used in many companies.
Cons:
- A programming language, meaning the initial learning curve is steeper.
- It will take you some time to become familiar with the interface and master the various functions.
- Since a compiled language, compiling code may take time.
- Powerful inside Excel but other programming languages are better to learn for general tasks.
- An old programming language (Microsoft stopped investing in VBA in 2008).
R
R is an interpreted language with step-by-step execution of source code (no pre-runtime translation takes place) from the command line or using a script file.
Pros:
- There is a clear division between data entry and analysis. You import the data, create an object that is a copy of the raw data and do manipulations on this copy. That is, the original data are never altered in any way and there is no way to mess up the raw data.
- Manipulating a copy of the data enables you to experiment. A line of code that fails to produce the expected result can be tweaked and rerun.
- All manipulations can be done in code.
- The process of analysis are easily reproduced by the code. That is, the use of code for data analysis enables the creation of more reproducible research.
- With code all analysis is documented instead of being hidden behind mouse clicks.
- Saving analysis in code has the immediate benefit that it can be easily rerun anytime that new data is added or the code can also be applied to a completely new data set.
- Free and with a large community that promotes sharing of libraries for data analysis.
- Can produce complex and advanced data visualizations.
Cons:
- R is a programming language, meaning the initial learning curve is steeper.
- It will take you some time to become familiar with the interface and master the various functions.
1.4 How a computer works
As a prerequisite for this course you need some basic knowledge about what a computer is. Have a look at these slides and this video.
1.5 How the notes are organized
Module 1 (this module) gives a short introduction to the course. The course notes consists of different parts each containing teaching modules about specific topics:
Part I consider tools for analytics using VBA in Excel (mainly programming). Module 2 gives you an introduction to VBA so you can get started programming. In Module 3 loop and conditional statements are introduced and Module 4 focus on how to make procedures. Next, we consider advanced data types and usage in Module 5. Finally, Module 6 considers generation of random numbers in VBA and how they can be used for simulation.
Part II consider tools for analytics using R. Module 7 give advices on how to install R on your laptop. An introduction to R is given in Module 8. Loops and conditionals are covered in Module 9. Module 10 cover how to create functions in R. A introduction to the tidyverse packages are given in Module 11 which also consider how to write reproducible reports. Importing and exporting data are considered in Module 12. In Module 13 ways to transform data is given. Finally, Module 14 look at data visualization in R.
Part III Provides you with some extra topics in R that might be handy when you e.g. write your master thesis. This part is NOT part of syllabus.
The appendix contains different modules that may be helpful for you including hints on how to work in groups, how to get help if you are stuck and how to annotate the course notes.
1.6 Acknowledgements
Some of the materials in these notes are taken from various places
- The bookdown skeleton and some notes are based on the Stat545 course.
- Some parts in Module 1 are inspired by Chapter 1 in H. Wickham (2017).
- The VBA modules are inspired by the book Wøhlk (2010). This also holds for some of the exercises.
- Module 7 is inspired by Chapter 1 in Bryan (2017).
- Module 8 is using some text and images from Chapter 1 in Ismay and Kim (2020) and Chapter 2 in Bryan (2017). A few exercises are inspired by Chapter 2 in Irizarry (2020).
- Notes about git and GitHub in the appendix are based on Bryan, STAT 545 TAs, and Hester (2020).
- Exercise 13.5.1 is a revision of Chapters 6-7 in Bryan (2017).
- Exercise 13.5.2 is a revision of Session 3 in the Welcome to the tidyverse course.
- Exercise 14.6.1 is a revision of Chapter 9 in Irizarry (2020).
- Exercise 14.6.3 is inspired by the COVID19 application exercise at the data science in a box course.
- Exercise 14.6.4 is inspired by the Lego homework exercise at the data science in a box course.
- Exercise 13.5.4 is inspired by the Fisheries application exercise at the data science in a box course.
I would like to thank all for their inspiration. Also thanks to Solveig for proofreading the draft.
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International CC BY-NC-SA 4.0.
1.7 Exercises
1.7.1 Exercise - How to annotate
The online course notes can be annotated using hypothes.is. You can create both private and public annotations. Collaborative annotation helps people connect to each other and what they’re reading, even when they’re keeping their distance. You may also use public notes to help indicate spell errors, unclear content etc. in the notes.
- Sign-up at hypothes.is. If you are using Chrome you may also install the Chrome extension.
- Go back to this page and login in the upper right corner (there should be some icons e.g.
<
). - Select some text and try to annotate it using both a private and public annotation (you may delete it again afterwards).
- Go to the slides for this module and try to annotate the page with a private comment.