6
Web-Scraping Effects
In Session 5 of the workshop, Yijun (Frank) Wei (National Agricultural Statistics Service [NASS]) described the agency’s preliminary efforts at web scraping to provide early detections of disease incidence. The detection modeling effort described in the previous chapter can detect a shock but with a one-quarter lag, he noted. The hope for web scraping is to obtain signals of a shock that can be discussed during the preparation of the initial quarterly estimates. Katherine Ensor (Rice University) moderated the session.
DESCRIPTION OF NASS WEB SCRAPING
Wei introduced web scraping as part of the next step in NASS modeling efforts. The approach uses a combination of web scraping and natural language processing (NLP) for hog disease outbreak detection. Though disease is just one type of shock, early detection is challenging because the initial incidents may be small and local. The web-scraping and NLP approaches are intended to detect the very early signals of a disease outbreak. It is hoped that web scraping could detect an outbreak and geo-locate it into states or counties. It has the potential to help predict the pattern of the spread and rate of spread of a disease.
There are two stages in this approach. The first stage is to detect a hog disease outbreak by scraping disease report repository websites, such as those of the Swine Disease Global Surveillance Project (SDGSP)1 and the U.S. Department of Agriculture’s Animal and Plant Health Inspection Service (APHIS). The second stage looks for related news, mostly from national, state, and local news feeds; extension service websites; producer organizations’ websites; and blogs. NLP extracts information from the related news using the four steps of information extraction: time normalization, word normalization, keyword identification, and named entity recognition.
___________________
1 SDGSP is a project sponsored by the University of Minnesota Swine Center to monitor hog disease outbreaks on an international scale. It publishes reports every 2 weeks.
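The first stage can be illustrated with a minimal sketch that pulls paragraph text out of a report page and flags paragraphs that mention a disease of interest. This is not the NASS implementation; the sample page, the term list, and the names `ParagraphExtractor` and `flag_disease_paragraphs` are hypothetical, and only the Python standard library is assumed.

```python
from html.parser import HTMLParser

# Hypothetical watch list; a real list would come from subject-matter experts.
DISEASE_TERMS = ("african swine fever", "porcine epidemic diarrhea")


class ParagraphExtractor(HTMLParser):
    """Collect the text of each <p> element in an HTML page."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None means we are outside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            # Collapse runs of whitespace left over from the page layout.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None


def flag_disease_paragraphs(html):
    """Return the paragraphs that mention any term in DISEASE_TERMS."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return [
        p for p in parser.paragraphs
        if any(term in p.lower() for term in DISEASE_TERMS)
    ]


# Made-up page standing in for a scraped report; not real repository markup.
SAMPLE_PAGE = """
<html><body>
<h1>Report</h1>
<p>African swine fever was confirmed on farms in six provinces.</p>
<p>Subscribe to our newsletter.</p>
</body></html>
"""

flagged = flag_disease_paragraphs(SAMPLE_PAGE)
print(flagged)
```

In practice the page would be fetched from the repository site rather than embedded as a string, and the layout of each site would determine which elements carry the report text.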
Wei provided an example of the first stage, identifying an outbreak of African swine fever in Vietnam. The initial report was in SDGSP on March 4, 2019. A summary was extracted from the text noting that the outbreak affected 96 households/farms in six provinces and cities, and the Ministry of Agriculture and Rural Development required culling all those affected, quarantining the outbreak area, and testing all neighboring farms.
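Time normalization, the first of the four extraction steps, can be sketched with the report date from this example. The pattern below handles only dates written as "Month D, YYYY"; the function name `normalize_dates` is hypothetical, and the approach is an assumption for illustration rather than the method Wei used.

```python
import re
from datetime import datetime

# Matches dates such as "March 4, 2019" (English month names only).
DATE_PATTERN = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{1,2}, \d{4}\b"
)


def normalize_dates(text):
    """Return ISO 8601 strings for every 'Month D, YYYY' date in the text."""
    found = []
    for match in DATE_PATTERN.finditer(text):
        try:
            # %B parses full month names; this assumes an English locale.
            parsed = datetime.strptime(match.group(0), "%B %d, %Y")
        except ValueError:
            continue  # skip impossible dates such as "February 30, 2019"
        found.append(parsed.date().isoformat())
    return found


dates = normalize_dates("The initial report was in SDGSP on March 4, 2019.")
print(dates)
```

Converting every date to a single format is what would allow reports from different sources to be placed on a common timeline.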
Wei’s example of the second stage was the tracking of African swine fever in China. The first two outbreaks were documented online on the Pig Site2 in February 2019. The Ministry of Agriculture and Rural Affairs said the first outbreak was on a farm with 5,600 hogs in the Xushui district of Baoding City. It reported the farm had been quarantined and the herd slaughtered. Reuters reported a second outbreak in the remote Greater Khingan Mountains in Inner Mongolia, where 210 of the 222 wild boar raised on a farm died and the rest were slaughtered.
In the next step, NLP was used to extract information from the news reports. Because the text contained no temporal information, time was not normalized. Word normalization changed “raised” to “raise,” “slaughtered” to “slaughter,” and “quarantined” to “quarantine.” The keyword was defined as “outbreak,” but other keywords could be used. The named entity was the Ministry of Agriculture and Rural Affairs, the location was the Xushui district of Baoding City, and the disease noted was swine fever. The summaries of the NLP text processing of these reports are shown in the following two bullets:
- Noun: ‘outbreak.’ Source: ‘The Ministry of Agriculture and Rural Affairs,’ Location: ‘a farm in the Xushui district of Baoding City,’ Stats: ‘has 5,600 hogs.’
- Noun: ‘outbreak.’ Source: ‘Reuters,’ Location: ‘remote Greater Khingan Mountains in Inner Mongolia,’ Stats: ‘210 of the 222 wild boar died.’
___________________
2 The Pig Site is a knowledge-sharing platform with premium news, analysis, and resources for the global pig industry. For more information see https://thepigsite.com.
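The remaining three steps (word normalization, keyword identification, and named entity recognition) can be sketched on the first China report. The lookup tables below are hand-built stand-ins chosen to reproduce the example; a production system would more likely rely on a trained NLP model, and the names `LEMMAS`, `GAZETTEER`, `normalize_words`, and `extract_record` are hypothetical.

```python
import re

# Hand-built lemma table covering only the words changed in the example.
LEMMAS = {
    "raised": "raise",
    "slaughtered": "slaughter",
    "quarantined": "quarantine",
}

# The keyword used in the example; other keywords could be added.
KEYWORDS = {"outbreak"}

# Toy gazetteer standing in for a named entity recognizer.
GAZETTEER = {
    "Ministry of Agriculture and Rural Affairs": "source",
    "Reuters": "source",
    "Xushui district of Baoding City": "location",
    "Greater Khingan Mountains in Inner Mongolia": "location",
}


def normalize_words(text):
    """Lowercase the text, split it into words, and map known forms to lemmas."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [LEMMAS.get(token, token) for token in tokens]


def extract_record(text):
    """Build a small summary record: matched keywords plus gazetteer entities."""
    tokens = normalize_words(text)
    record = {"keywords": sorted(KEYWORDS.intersection(tokens))}
    for phrase, label in GAZETTEER.items():
        if phrase in text:
            record.setdefault(label, []).append(phrase)
    return record


REPORT = (
    "The Ministry of Agriculture and Rural Affairs said the first outbreak "
    "was on a farm with 5,600 hogs in the Xushui district of Baoding City. "
    "It reported the farm had been quarantined and the herd slaughtered."
)

record = extract_record(REPORT)
print(record)
```

A fixed gazetteer can only find entities it already lists, which is one reason the sensitivity of NLP results to training data matters for a real system.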
Wei summarized the potential for this project. It provides information at a fine geographic scale (state or county) that will be potentially useful in spatial disease modeling and mapping, it provides information to understand the time course of the spread, and it provides external documentation confirming disease and response to the outbreak. It could provide information to the pre-board, the Agricultural Statistics Board, or other experts. It could also provide information to incorporate into the modeling system. One advantage of web scraping is that it can be done without the time limitations of the production system.
DISCUSSION
Chris Wikle asked about the need for a text corpus as a training sample for the algorithm to understand grammar. He noted that the results of NLP can be sensitive to the training sample used. Wei replied that he used a Python-based tool trained on Wikipedia-like text data.
Andrew Lawson asked whether Wei had any U.S. examples of web scraping for disease. Wei replied that he did not because no disease outbreak is currently occurring within the United States. An outbreak of porcine epidemic diarrhea virus (PEDv) occurred 6 years ago, and news about it has disappeared. Nell Sedransk added that web scraping began at NASS within the past 6 months and is in preliminary form. Most of the sources that would have carried the news about PEDv have been archived.
Lawson said that his group has a project on ontology based on scraping abstracts from the National Library of Medicine. One element of NLP is understanding what is meant. That can be difficult with superficial scraping, he noted, and there can be interpretational issues in web scraping. For example, there could be very fuzzy statements that say “this might be an epidemic,” when it is not.
Kamina Johnson (APHIS) reported that APHIS had a similar effort 15 years ago but used what would now be considered archaic systems. APHIS developed an algorithm to filter the information that came in, casting a wide net, with a human analyst to review and catalog the information. Web scraping is not a perfect science but a multistep process, she emphasized. She said that Wei might use the Seneca Valley virus to see if his approach would pick up on that disease. She also suggested testing the system by searching for disease outbreaks that do not involve swine, such as virulent Newcastle disease, currently occurring in California, or low pathogenic avian influenza in the fall. These two would test for detection of diseases with lower levels of reporting. Highly pathogenic avian influenza gets a lot of attention when discovered.
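The filter-then-review process Johnson described can be sketched as a wide-net term match that queues every hit for a human analyst instead of acting on it automatically. This is an illustration only, not the APHIS algorithm; the watch terms, the sample headlines, and the names `ReviewItem` and `triage` are hypothetical.

```python
from dataclasses import dataclass

# Deliberately broad watch list (a "wide net"); the terms are illustrative.
WATCH_TERMS = (
    "outbreak",
    "swine fever",
    "seneca valley",
    "newcastle disease",
    "avian influenza",
)


@dataclass
class ReviewItem:
    headline: str
    matched_terms: tuple
    status: str = "pending_review"  # a human analyst makes the final call


def triage(headlines, terms=WATCH_TERMS):
    """Queue every headline matching at least one term, most matches first."""
    queue = []
    for headline in headlines:
        lowered = headline.lower()
        matched = tuple(term for term in terms if term in lowered)
        if matched:
            queue.append(ReviewItem(headline, matched))
    # sort() is stable, so ties keep their original order.
    queue.sort(key=lambda item: len(item.matched_terms), reverse=True)
    return queue


# Made-up headlines for demonstration; not real news items.
SAMPLE_HEADLINES = [
    "Officials discuss avian influenza preparedness",
    "County fair opens this weekend",
    "Virulent Newcastle disease outbreak reported in backyard flocks",
]

queue = triage(SAMPLE_HEADLINES)
for item in queue:
    print(item.status, item.matched_terms, item.headline)
```

Keeping the automated step permissive and leaving the decision to an analyst reflects the point that the filter narrows the stream but does not replace human review.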
She also recommended the inclusion of potentially new sources in web scraping. APHIS uses the reporting from SDGSP and instant email notifications from ProMED. Additionally, the World Organisation for Animal Health (OIE) sends out instant notifications about outbreaks. The OIE and ProMED reports are released in a distinct, structured format that would be easy to use in a web-scraping tool, she noted. The OIE identifies diseases that it tracks, so its reports are disease specific, while ProMED also includes non-OIE-reportable diseases.
Lee Schulz asked about the accuracy of news as a variable when it is always changing, being updated, and occasionally redacted. He wondered whether it could be used to construct a variable accurate enough for possible input to a model, referring to the discussion of the accuracy issues related to trade expectations. Wei responded that the project is still in a preliminary stage, and NASS is exploring what can be done with the information.
Linda Young expressed doubt that any board number would be changed based on web scraping, but it might give an early alert to something happening that would then need to be confirmed to be useful. Dan Kerestes agreed with Young’s comment. The board is looking for more information. If the information can be used, perhaps in conjunction with comments sent in by the regional offices, it might add to the discussion. Analysis of the project has not yet been carried out.
Schulz asked about the current process for experts to become informed and whether web scraping might help fill a gap by speeding up the process. Kerestes replied that its main value will be as a confirmation of other information.
Travis Averill (NASS) observed that the estimation process is focused on the survey and auxiliary data for a reference period, the first of March, June, September, and December. This process also results in comments and other input from respondents and regional offices that are difficult to analyze and use. Web scraping has the potential to make NASS aware of possible confirmatory information that might help to understand the situation in the field and the potential impact of events.
Wikle asked about the potential for others to manipulate this type of information, especially if NASS scrapes blogs and sites where people might report incorrect information once they know how it is being used. He asked about a mechanism for detecting false placement of key indicators. He also questioned using web-scraped data as input to a spatial epidemic model. He cautioned there is a big step between taking the information and using it as input for a model.