Behind the Scenes of Using Web Scraping and AI in Investigative Journalism

While the work of investigative journalists sometimes involves contacting anonymous sources for hidden information or even going undercover, the threads of great stories often lie in open sources accessible to everyone. For this reason, web scraping has become indispensable for journalists over the last couple of decades. More recently, developments in AI have offered another way to upgrade the reporter’s toolkit.

Why does web scraping matter to journalists?

Web scraping is the automated collection of data from the Internet using specialized software tools known as web scrapers. Like any robust data collection method, it can be used for both good and ill. The general public often hears more about the latter, which fuels the belief that web scraping is something shady that should probably be banned altogether. However, when a case whose outcome threatened to make web scraping illegal reached the U.S. Supreme Court, it was journalists who stood up against that prospect. The Markup, a nonprofit investigative newsroom, filed an amicus brief arguing that web scraping is vital to democracy.

This is not an overstatement. In some cases, web data extraction tools are the only way journalists can hold government agencies accountable. By scraping public information, investigators can check whether the data really supports the official position, report on otherwise ignored anomalies, or uncover negligent data management practices at state institutions.

Additionally, tracking disinformation across the web would not be possible without such automated solutions. Artificial intelligence can accelerate this spread by making fake visual and audio content easy to generate. On the bright side, AI-powered web scraping tools can also help monitor and identify such fakes.

Web scraping also enables journalists to uncover stories from the criminal underground. Here, the work of journalistic investigators resembles that of forensic investigators: both can use data scraping to detect human trafficking activities and illegal marketplaces.

How can journalists use the latest tech for high-quality journalism?

Investigative journalism today is closely related to data journalism, which uses data as the primary source for investigating and reporting stories. However, not all journalists are data scientists, analysts, or coders. And even for tech-savvy reporters, the ways to leverage web scraping and AI tools for journalism are not always obvious. A few things can help reporters get started.

Utilize no-code tools

Tools and tutorials are available for those who do not possess coding skills yet believe in the power of data to bring forth relevant stories. Some advocates of scraping in journalism publish guides on using such no-code tools and share tips for leveraging web scraping in investigations and storytelling. For example, one can seek guidance from fellow journalists in the Global Investigative Journalism Network on using free browser extensions like Data Miner to extract data from the web.

Think about the scale

Sometimes, the work of journalists is made harder by an abundance of information rather than a lack of it. This is evident on the Internet, where the truth can be publicly accessible yet drowned in more disinformation than even an army of humans could quickly sift through.

Thus, one way to approach a scraping-based investigation is to think about the threads of a story that would be impossible to follow manually. For example, if you notice some suspicious reporting, you might want to review all the articles written by the same reporter. Searching for them manually, however, would be hard and time-consuming. With web scraping, you can quickly discover whether the sheer quantity of articles itself confirms your suspicions.

This is what happened when data scraping tools helped show that 38,000 articles on the war in Ukraine, published in a single year, were attributed to the same “journalist.” With the proper scraping tools, real journalists can thus expose the fake journalism of non-existent persons.
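
For readers comfortable with a little code, a minimal sketch of this kind of tally is shown below. It assumes a hypothetical news site with a paginated author archive and uses Python’s requests and BeautifulSoup libraries; the URL pattern and CSS selectors are placeholders to be adapted to the site under investigation.

```python
# Hypothetical sketch: count all articles attributed to one byline on a site
# with a paginated author archive. The URL pattern and CSS selectors are
# assumptions; adjust them to the site under investigation.
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example-news-site.com/author/suspicious-byline/page/{page}"
HEADERS = {"User-Agent": "NewsroomResearchBot/1.0 (contact: reporter@example.org)"}

def collect_articles(max_pages=500):
    articles = []
    for page in range(1, max_pages + 1):
        response = requests.get(BASE_URL.format(page=page), headers=HEADERS, timeout=30)
        if response.status_code != 200:
            break  # ran past the last archive page
        soup = BeautifulSoup(response.text, "html.parser")
        links = [a["href"] for a in soup.select("article h2 a[href]")]
        if not links:
            break
        articles.extend(links)
    return articles

if __name__ == "__main__":
    found = collect_articles()
    print(f"Articles attributed to this byline: {len(found)}")
```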

Let AI read and connect the dots

While web scraping can help journalists obtain large data sets, AI tools are well-suited to assist in going through that data. Such tools have been used for years to analyze satellite imagery, a task that would require immense personnel, time, and resources to do manually. Recently, The New York Times used AI in just this way to reinforce its findings on the bombardment of Gaza.

However, journalistic investigations often involve reading documents and piecing together information scattered across vast amounts of text. This was the task when the International Consortium of Investigative Journalists (ICIJ) got hold of the 11.5 million documents comprising the “Panama Papers.” A few years later, ICIJ collaborated with the Stanford AI Lab to explore how emerging machine learning (ML) techniques could be enlisted in such projects and quickly learned the value of such mutually beneficial collaborations.
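
The ICIJ and Stanford pipeline itself is far more sophisticated, but a toy sketch can illustrate the general idea of letting software connect the dots. The example below, which assumes spaCy and its small English model are installed, extracts people and organizations from a pile of documents and counts which names co-occur.

```python
# Hedged sketch, not the ICIJ/Stanford pipeline: extract named entities from a
# collection of documents with spaCy and count which people and organizations
# appear together, a simple first step toward connecting the dots in a leak.
import itertools
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def entity_pairs(documents):
    pair_counts = Counter()
    for text in documents:
        doc = nlp(text)
        names = {ent.text for ent in doc.ents if ent.label_ in {"PERSON", "ORG"}}
        for pair in itertools.combinations(sorted(names), 2):
            pair_counts[pair] += 1
    return pair_counts

if __name__ == "__main__":
    docs = [
        "Acme Holdings wired funds on behalf of Jane Doe.",
        "Jane Doe is listed as a director of Acme Holdings.",
    ]
    for (a, b), count in entity_pairs(docs).most_common(5):
        print(a, "<->", b, count)
```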

In a more recent case, a Filipino journalist used OpenAI’s feature for building custom agents on top of ChatGPT to create one that supports watchdog journalism. The custom agent can read and summarize many pages of audit reports and other official documents to identify potentially newsworthy angles. Without such solutions, journalists have to spend hours on a single report, while governments can publish thousands of them every year.
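
Custom GPT agents are configured in ChatGPT’s interface rather than in code, but a similar workflow can be scripted against the OpenAI API. The sketch below is only an illustration of that idea, not the journalist’s actual setup; the model name, prompt, and file name are assumptions.

```python
# Hedged sketch: summarize an audit report and flag newsworthy angles via the
# OpenAI API. The model name, prompt, and file path are assumptions, not the
# configuration used in the case described above.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def flag_newsworthy_angles(report_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a research assistant for watchdog journalism. "
                    "Summarize the document and list findings that may be newsworthy, "
                    "such as unexplained spending, missing records, or repeated violations."
                ),
            },
            {"role": "user", "content": report_text},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    with open("audit_report.txt", encoding="utf-8") as f:
        print(flag_newsworthy_angles(f.read()))
```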

Ethical data gathering and AI usage

The strict ethical guidelines journalists follow when conducting investigations also apply to the use of data scraping and AI solutions. Journalists are advised to identify their scrapers to the website whenever possible. In some cases, however, doing so would ruin the investigation. When monitoring illegal activity on dark web forums and marketplaces, for example, journalists can only achieve their goals by using proxy IPs: hiding their real online identity is the only way to avoid being blocked or targeted by hackers.
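
In practice, both approaches come down to how the scraper’s HTTP requests are configured. The sketch below, using Python’s requests library with placeholder URLs and a placeholder proxy address, contrasts a transparent, self-identifying request with one routed through a proxy for sensitive monitoring.

```python
# Sketch of two request configurations with the requests library. URLs and the
# proxy address are placeholders, not real endpoints.
import requests

# Transparent scraping: identify the bot and give site owners a way to reach you.
transparent_headers = {
    "User-Agent": "ExampleNewsroomBot/1.0 (+https://example-newsroom.org/bot; research@example-newsroom.org)"
}
public_data = requests.get(
    "https://public-agency.example.gov/open-data",
    headers=transparent_headers,
    timeout=30,
)

# Sensitive monitoring: route traffic through a proxy so the newsroom's own
# IP address is not exposed to the operators of the monitored site.
proxies = {
    "http": "http://username:password@proxy.example.net:8080",
    "https": "http://username:password@proxy.example.net:8080",
}
monitored_page = requests.get(
    "https://suspicious-marketplace.example.com/listings",
    proxies=proxies,
    timeout=60,
)
```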

Additionally, reporters should be careful about the data they gather and store to avoid breaking laws or leaking sensitive information. In this area, specially trained AI can help manage data-gathering activities so that only relevant public data is targeted. However, AI itself should never be trusted with final decisions when reporting a story. Ultimately, human oversight, journalistic integrity, and domain expertise remain the most important investigative tools, and AI does not threaten to replace them.

Conclusion

Data journalism is now a vital part of investigative journalism. Both web scraping and emerging AI technologies boost journalists’ work and help track the elusive threads of fascinating stories buried under mountains of data. In the future, AI tools will likely be used more and more for generating story ideas, catching anomalies, and visualizing findings, among many other tasks. Meanwhile, the power of web scraping to extract value from public data and reveal what is hidden in plain sight may make it the definitive tool of investigative journalism in the 21st century.
