Teaching

Scraping the unscrapable: advanced approaches to deal with complex sites and evade anti-scraping systems

Scraped data is often the backbone of an investigation, but some websites are more difficult to scrape than others. This session covers best practices for dealing with tricky sites, including coping with captchas, using proxy and other scraping services, plus the tradeoffs and costs of these approaches.
1. 24 May 2025 Dataharvest 2025, Mechelen, Belgium
2. 1 Jun 2024 Dataharvest 2024, Mechelen, Belgium
Finding needles in haystacks with fuzzy matching

Fuzzy matching is a process for linking up names that are similar but not quite the same. It can be an important part of data-led investigations, identifying connections between key people and companies that are relevant to a story. This class covers how it fits into the investigative process, and includes a practical introduction to using the CSV Match tool I developed.
1. 7 Mar 2025 Nicar 2025, Minneapolis, USA
2. 8 Mar 2024 Nicar 2024, Baltimore, USA
3. 3 Mar 2023 Nicar 2023, Nashville, USA
4. 6 Mar 2022 Nicar 2022, Atlanta, USA
5. 4 Mar 2022 Nicar 2022, Atlanta, USA
6. 3 Mar 2021 Nicar 2021, Online
7. 5 Mar 2020 Nicar 2020, New Orleans, USA
8. 10 Mar 2019 Nicar 2019, Newport Beach, USA
9. 9 Mar 2019 Nicar 2019, Newport Beach, USA
10. 28 Jun 2018 CIJ Summer Conference 2018, London, UK
11. 26 May 2018 Dataharvest 2018, Mechelen, Belgium
12. 10 Mar 2018 Nicar 2018, Chicago, USA
Tracking changes with GitHub Actions

Sometimes how data changes can be more interesting than the data itself. For example, Wikipedia lets you see how a page has been edited - adding or cutting out certain bits of information. Using GitHub Actions, we can do something similar for any webpage. This session covers using Actions to regularly run a scraper, analysing the output, and identifying changes over time.
1. 31 May 2024 Dataharvest 2024, Mechelen, Belgium
2. 2 Jun 2023 Dataharvest 2023, Mechelen, Belgium
Time travel for beginners: how to create and use web archives

Ever relied on an online source, only to find it later deleted or changed? This class covers how to get the most out of resources like the Wayback Machine – what they’re good for, and what they’re not. We also cover when and how to build your own private archives of web content.
1. 22 Sep 2023 Global Investigative Journalism Conference 2023, Gothenburg, Sweden
2. 29 Jun 2023 CIJ Summer Conference 2023, London, UK
3. 21 May 2022 Dataharvest 2022, Mechelen, Belgium
Web basics: how the web works, and how to scrape it

Have you ever wondered how exactly your stories reach your readers? Ever wanted to know how to build a simple webpage? Or how to scrape information from the web? This session covers the principles of how web pages get onto your screen, and working with the two key web technologies of HTML and CSS. Dataharvest sessions taught with Rui Barros.
1. 28 Jun 2023 CIJ Summer Conference 2023, London, UK
2. 2 Jun 2023 Dataharvest 2023, Mechelen, Belgium
3. 20 May 2022 Dataharvest 2022, Mechelen, Belgium
An introduction to data for investigations

Where do you start using data in investigations? This training morning covers what data really is, developing a ‘data state of mind’ to spot opportunities, sourcing data (including unstructured data), hands-on practice scraping websites and interviewing datasets to get answers to your questions, and developing rigorous working practices that help you avoid mistakes.
1. 24 Aug 2021 Birn Summer School 2021, Mlini, Croatia
Scraping from scratch

You may have come across acronyms like HTTP and HTML, but what do they mean, and why do they matter? This class explains the concepts that underpin how the web works – which are simpler than you might think – as well as how you can use this knowledge to extract the information you need, and understand exactly how your stories reach your readers.
1. 18 May 2021 Dataharvest Digital 2021, Online
Introduction to code for journalists

Want to take your first steps with code but not sure how to begin? Or want to learn how code is being used in the newsroom and if it can help you and your team? This weekend workshop is an introductory primer to learning to code, showing recent story examples, explaining the fundamental concepts in programming, and demystifying the jargon.
1. 8 Feb 2020 CIJ Courses, London, UK
2. 19 Oct 2019 CIJ Courses, London, UK
3. 27 Jul 2019 CIJ Courses, London, UK
4. 23 Feb 2019 CIJ Courses, London, UK
5. 10 Feb 2018 CIJ Courses, London, UK
Exploring networks with graph databases

Graph databases are incredibly useful for finding connections or patterns within our data. This is a hands-on introduction to the graph database Neo4j, showing examples of its use for investigative stories including the Panama and Paradise Papers, and teaching attendees how to build a graph of noteworthy individuals and match them with corporate data to see the networks involved.
1. 28 Sep 2019 Global Investigative Journalism Conference 2019, Hamburg, Germany
2. 5 Jul 2019 CIJ Summer Conference 2019, London, UK
3. 17 May 2019 Dataharvest 2019, Mechelen, Belgium
4. 29 Jun 2018 CIJ Summer Conference 2018, London, UK
5. 26 May 2018 Dataharvest 2018, Mechelen, Belgium
6. 23 Jun 2017 CIJ Summer Conference 2017, London, UK
7. 21 May 2017 Dataharvest 2017, Mechelen, Belgium
GitHub for journalists

Whether you find yourself collaborating on code, data, or prose, GitHub can work for journalists. This class covered what GitHub is, the benefits of using it, and how it is typically used both by people doing data analysis and by developers. Attendees were shown how to create a first repository and make pull requests.
1. 8 Mar 2018 Nicar 2018, Chicago, USA
Code for journalists

How can code help you or your team with investigations? This two-hour session was a guided, hands-on introduction to programming, showing recent examples of published stories and demystifying the jargon. Attendees were guided through the tools needed, including text editors and an introduction to the command line. This later evolved into a weekend course.
1. 23 Jun 2017 CIJ Summer Conference 2017, London, UK
How to work with web developers (and what we wish you knew)

Good communication between management and techies can make the difference between a website or app that makes money and one that loses customers, but the culture divide can be vast. This evening course covered working methods, jargon, and how to brief to avoid tension between the business parts of an organisation and mysterious, headphone-wearing coders.
1. 28 Apr 2014 Guardian Masterclasses, London, UK
2. 22 Jul 2013 Guardian Masterclasses, London, UK
