gantt
    title KNBS R Curriculum
    dateFormat X
    axisFormat %H
    section Beginner
    Introduction to R :a1, 0, 16h
    Introduction to Data Visualisation :a2, after a1 start, 16h
    Optional Modules :a3, after a2, 4h
    Introduction to GIS in R :a4, after a2, 3h
    R Control Flow Loops and Functions :a5, after a2, 3h
    section Intermediate Level
    Optional Modules :b1, after a5, 25h
    More GIS in R :b3, after a5, 3h
    Introduction to Git :b4, after a5, 4h
    Statistics in R :b5, after a5, 16h
    section Advanced
    Introduction to NLP in R :c1, after b5, 16h
    Introduction to Machine Learning in R :c2, after b5, 24h
    Additional Analysis Modules :c3, after c2, 38h
    Packaging and Documentation :c4, after b5, 4h
    Additional Developer Modules :c5, after c4, 6h
R Curriculum
This curriculum is designed to take KNBS employees from novice R users to advanced practitioners, equipping them with the skills needed to perform sophisticated data analysis and programming tasks. It is structured into three levels (Beginner, Intermediate, and Advanced), each building on the knowledge gained in the previous one.
Curriculum Overview
The journey begins with the foundational “Introduction to R” course, which provides non-R users with the essential skills to perform basic work tasks in R. This is followed by a series of courses that broaden the base knowledge of R and introduce complementary languages and tools such as SQL, Bash, and Git. These courses are designed to give analysts a well-rounded skill set that is crucial for most data analysis tasks in R.
As learners progress, the Intermediate level courses delve into more specific and advanced topics. These courses introduce various analytical methods and best practices in coding for automation. They are designed to enhance the analysts’ capabilities in handling complex data tasks and improving code efficiency and reproducibility.
The Advanced level courses introduce cutting-edge techniques and tools in data science and software development. These courses cover topics such as machine learning, natural language processing, big data processing with Sparklyr, and modern software development practices like continuous integration.
Course Runthrough
Beginner Level
Introduction to R
This redesigned course focuses on applying skills and building confidence in R programming. Participants will learn about data types, importing data, and working with data frames through hands-on exercises. The course is structured with a week between sessions to allow time for practice and for the material to sink in, making it well suited to beginners who want to develop independence and resilience in their R programming journey.
- Data Types
- Importing Data
- DataFrames, Manipulation, and Cleaning
Course Length: 2 Days
Prerequisites: None
Introduction to Data Visualisation
Data visualisation is the art of displaying data clearly and understandably so that it has maximum impact. This course covers both the theoretical and practical aspects of data visualisation in R. Participants will first learn best practices for presenting data clearly and professionally. The course then moves on to practical application, teaching how to produce production-ready visualisations using R. By the end, students will have a solid foundation in creating impactful and consistent data visualisations.
Course Length: 2 Days
Prerequisites: Introduction to R
Introduction to GIS in R
This course introduces the fundamentals of working with geospatial data in R. Participants will learn to load spatial data, manipulate spatial objects, and create both static and interactive maps. The course covers important concepts such as spatial joins and coordinate reference systems. By the end, students will be able to perform basic spatial analysis and create visually appealing maps using R.
Course Length: 2-3 Hours
Prerequisites: Introduction to Data Visualisation
R Control Flow Loops and Functions
This course focuses on more advanced programming concepts in R, specifically loops, control flow, and function writing. Participants will learn how to use these tools to reduce code repetition, improve readability, and follow good coding practices. The course includes practical examples and exercises to reinforce learning, and provides guidance on areas for additional study to further enhance programming skills.
Course Length: 3 Hours
Prerequisites: Introduction to Data Visualisation
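As a flavour of the kind of code this course works towards, here is a minimal sketch in base R. It is not taken from the course materials: the function name `classify_size` and the household sizes are hypothetical, chosen only to show a function, an `if`/`else` chain, and a `for` loop working together.

```r
# Hypothetical helper (not from the course): label a household size band.
classify_size <- function(n) {
  if (n <= 2) {
    "small"
  } else if (n <= 5) {
    "medium"
  } else {
    "large"
  }
}

# Made-up example values.
household_sizes <- c(1, 4, 7, 3)

# A for loop that fills a pre-allocated result vector.
bands <- character(length(household_sizes))
for (i in seq_along(household_sizes)) {
  bands[i] <- classify_size(household_sizes[i])
}

# The same result without an explicit loop, using vapply().
bands_vapply <- vapply(household_sizes, classify_size, character(1))

print(bands)
```

Wrapping the rule in a function means it is written once and reused, which is the reduction in code repetition the course description refers to.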
Optional Modules
Introduction to RAP
This course introduces the concepts, motivation, and techniques for creating a Reproducible Analytical Pipeline (RAP). Participants will learn about the benefits of RAP and how to overcome common barriers in implementation. The course is tailored for government analysts and managers of analysis, providing a foundation for enhancing reproducibility and efficiency in analytical workflows.
Course Length: 1 Hour
Prerequisites: Introduction to Data Visualisation
Introduction to RMarkdown
This course provides a comprehensive introduction to RMarkdown, a powerful tool for creating dynamic, reproducible documents in R. Participants will learn how to set up new documents, create various types of content, and output to different formats including HTML, PDF, and Word. The course covers text formatting, tables, images, and RMarkdown-specific features, equipping students with the skills to create professional, data-driven reports.
Course Length: 1-2 Hours
Prerequisites: Introduction to Data Visualisation
Best Practice in Programming - Clean Code
This workshop emphasizes best practices in programming for improved code readability and collaboration. Participants will learn general principles that enhance coding skills and make code more accessible to others. The course covers techniques for writing clean, maintainable code and encourages the development of good coding habits. By the end, students will have a solid foundation in creating code that is easy to read, understand, and modify.
Course Length: 1 Hour
Prerequisites: Introduction to Data Visualisation
Intermediate Level
Introduction to Git
This course provides a comprehensive understanding of the Git version control system. Participants will gain hands-on experience working both locally and collaboratively with Git. The course covers the fundamentals of Git, its practical applications, and the benefits of using it for individual and team projects. By the end, students will be equipped with the skills to effectively use Git for version control and collaboration in their development workflows.
Course Length: 4 Hours
Prerequisites: R Control Flow Loops and Functions
Statistics in R
This comprehensive course covers both statistical theory and its practical application in R. Participants will learn about exploratory data analysis, various statistical tests, linear regression, model adequacy and selection, and generalized linear models. The course provides a balanced mix of theoretical understanding and hands-on coding experience. By the end, students will be equipped to perform advanced statistical analyses and interpret results using R.
Course Length: 16 Hours
Prerequisites: R Control Flow Loops and Functions
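To illustrate the topics listed above, here is a small sketch using only base R and the built-in `mtcars` dataset. The choice of dataset and variables is an assumption made for illustration, not the course's own example.

```r
# mtcars ships with base R, so no extra packages are needed.

# Linear regression: fuel efficiency as a function of weight.
fit_lm <- lm(mpg ~ wt, data = mtcars)
print(summary(fit_lm))

# Model selection: does adding horsepower improve the fit?
fit_lm2 <- lm(mpg ~ wt + hp, data = mtcars)
print(anova(fit_lm, fit_lm2))   # F-test for the nested models
print(AIC(fit_lm, fit_lm2))     # lower AIC indicates the preferred model

# Generalised linear model: logistic regression for transmission type (0/1).
fit_glm <- glm(am ~ wt, data = mtcars, family = binomial)
print(coef(fit_glm))
```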
More GIS in R
Building on the introductory GIS course, this follow-on session covers more complex geospatial operations and analysis techniques in R. Participants will learn about buffers, intersections, area summary statistics, and network analysis. The course also focuses on troubleshooting common error messages and improving the accuracy of spatial analyses. By the end, students will be able to conduct and present comprehensive spatial analyses using R.
Course Length: 2-3 Hours
Prerequisites: R Control Flow Loops and Functions
Optional Modules
Foundations of SQL
This course provides a comprehensive introduction to SQL, covering syntax applicable to various database systems. Using an online platform with SQLite, participants will learn through hands-on exercises. The course covers basic SQL queries, table manipulation, joining tables, and database alterations. By the end, students will have a solid foundation in SQL, enabling them to work with databases effectively in their data analysis projects.
Course Length: 6 Hours
Prerequisites: R Control Flow Loops and Functions
Command Line Basics
This course introduces the powerful world of command line interfaces for both UNIX and Windows systems. Participants will learn basic commands and understand how to navigate file systems and execute operations using the command line. The course aims to make participants comfortable using this essential tool in their work, enhancing their ability to interact with computers efficiently and perform advanced operations.
Course Length: 2 Hours
Prerequisites: R Control Flow Loops and Functions
Reproducible Reporting with RMarkdown
This course delves into advanced features of RMarkdown for creating reproducible reports. Participants will learn to embed executable code and data into reports, work with YAML headers and theme options, and use parameters for dynamic reporting. The course also covers markdown syntax, code chunks, and chunk options. By the end, students will be able to create professional, reproducible reports that seamlessly integrate code, data, and narrative.
Course Length: 2 Hours
Prerequisites: Introduction to RMarkdown, Introduction to RAP
Modular Programming in R
This course focuses on the principles of modular design in R programming. Participants will learn the importance of well-structured, reproducible code and how to implement modular design principles. The course covers techniques for converting code into functions and modules, improving code organization and reusability. By the end, students will be able to write more efficient, maintainable, and scalable R code.
Course Length: 4-5 Hours
Prerequisites: R Control Flow Loops and Functions
Hypothesis Testing in R
This intermediate course focuses on advanced concepts in hypothesis testing using R. Participants will learn to define and calculate Type I and Type II errors, determine effect sizes, and calculate statistical power. The course also covers sample size calculations for various statistical tests. By the end, students will have a deeper understanding of the nuances of hypothesis testing and be able to design more robust statistical analyses in R.
Course Length: 4-6 Hours
Prerequisites: Statistics in R
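The power and sample size calculations mentioned above can be sketched with `power.t.test()` from base R's `stats` package. The effect size, standard deviation, and group size below are illustrative assumptions, not values from the course.

```r
# Sample size per group for a two-sample t-test, assuming a difference in
# means of 0.5 with standard deviation 1, a 5% significance level
# (Type I error rate), and 80% power (a 20% Type II error rate).
size_calc <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8)
n_per_group <- ceiling(size_calc$n)   # round up to whole participants
print(n_per_group)

# The reverse question: what power does a fixed sample of 30 per group give?
power_calc <- power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = 0.05)
print(power_calc$power)
```

Leaving exactly one of `n`, `delta`, `power`, `sd`, or `sig.level` unspecified tells the function which quantity to solve for.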
Dates and Times in R
This course provides a comprehensive overview of handling date and time data in R. Participants will learn how to create, convert, and manipulate date-time objects, understanding the underlying storage mechanisms. The course covers practical operations such as extracting parts of dates, performing date-time arithmetic, and working with time zones. By the end, students will be proficient in managing and analyzing time-based data in R.
Course Length: 4 Hours
Prerequisites: R Control Flow Loops and Functions
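The operations described above can be sketched in base R as follows. The dates and the `Africa/Nairobi` time zone are illustrative choices, and the course itself may use a different package for this material.

```r
# Create a Date from text; internally it is stored as a day count
# measured from 1970-01-01.
d <- as.Date("2024-03-15")
print(unclass(d))

# Extract parts of a date.
year_part  <- format(d, "%Y")
month_part <- format(d, "%m")

# Date arithmetic: add days, and measure the gap between two dates.
d_plus_30 <- d + 30
gap <- difftime(as.Date("2024-12-31"), d, units = "days")

# Date-times carry a time zone; the same instant can be shown in another zone.
t_utc <- as.POSIXct("2024-03-15 12:00:00", tz = "UTC")
nairobi_time <- format(t_utc, "%H:%M", tz = "Africa/Nairobi")

print(d_plus_30)
print(gap)
print(nairobi_time)
```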
Advanced Level
Introduction to NLP in R
This course provides a comprehensive introduction to Natural Language Processing (NLP) using R. Participants will learn fundamental NLP concepts and techniques, including text cleaning, exploratory analysis, and feature engineering. The course covers practical applications such as sentiment analysis on real datasets. By the end, students will have the skills to preprocess text data, extract meaningful features, and perform basic NLP tasks in R.
Course Length: 2 Days
Prerequisites: Statistics in R
Introduction to Machine Learning in R
This course introduces the fundamentals of machine learning using R’s state-of-the-art “mlr3” package. Participants will learn about classification, regression, and cluster analysis experiments. The course covers the entire machine learning workflow, from data preparation to model evaluation and interpretation. By the end, students will have hands-on experience implementing various machine learning algorithms and understand how to apply them to real-world problems.
Course Length: 3 Days
Prerequisites: Statistics in R
Packaging and Documentation
This course guides participants through the process of building and sharing R packages. Students will learn how to create custom functions, organize them into a package structure, and document their code effectively. The course covers best practices in package development, including version control and collaboration. By the end, participants will be able to create their own R packages, enhancing code reusability and facilitating collaboration with other R users.
Course Length: 4 Hours
Prerequisites: R Control Flow Loops and Functions, Best Practice in Programming - Clean Code, Introduction to Git
Additional Analysis Modules
Quality Assurance of Predictive Modelling
This course explores critical quality issues in statistical modeling and machine learning for prediction. Participants will learn best practices for model design, validation, and usage across various industries. The course covers topics such as model interpretability, bias detection, and robustness testing. By the end, students will be equipped with the knowledge to ensure the reliability and effectiveness of their predictive models.
Course Length: 6-8 Hours
Prerequisites: Statistics in R
Introduction to Sparklyr
This course introduces sparklyr, the R interface to Apache Spark for big data processing. Participants will learn how to handle and analyse datasets that are too large to process comfortably in a single R session. The course covers data manipulation, querying, and processing techniques specific to sparklyr. By the end, students will be able to use distributed computing for their large-scale data analysis projects in R.
Course Length: 2 Days
Prerequisites: Introduction to Machine Learning in R
Introduction to Bayesian Data Analysis
This course provides a foundational understanding of Bayesian Data Analysis and its implementation in R. Participants will explore the principles of Bayesian statistics, including prior and posterior distributions, and Markov Chain Monte Carlo (MCMC) methods. The course includes hands-on examples of Bayesian analysis in R, covering model fitting and interpretation. By the end, students will be able to apply Bayesian techniques to their own data analysis projects.
Course Length: 6 Hours
Prerequisites: Statistics in R
Additional Developer Modules
Introduction to Unit Testing
This course focuses on the principles and implementation of unit testing in R for ensuring code quality. Participants will learn how to design, create, and execute tests for their R code. The course covers both the theoretical aspects of software development and practical implementation in R. By the end, students will be able to implement robust testing strategies in their R projects, improving code reliability and maintainability.
Course Length: 4 Hours
Prerequisites: Packaging and Documentation
Introduction to Continuous Integration
This course introduces modern software development approaches, focusing on Continuous Integration (CI) concepts. Participants will learn about DevOps and CI/CD philosophies and how they contribute to efficient software development. The course covers tools and practices that automate testing and deployment processes. While not hands-on, this course provides essential background knowledge for advanced technical courses on building modern development infrastructures.
Course Length: 2 Hours
Prerequisites: Packaging and Documentation