Suggested Citation:"6 Web-Scraping Effects." National Academies of Sciences, Engineering, and Medicine. 2019. Using Models to Estimate Hog and Pig Inventories: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/25526.

6

Web-Scraping Effects

In Session 5 of the workshop, Yijun (Frank) Wei (National Agricultural Statistics Service [NASS]) described the agency’s preliminary efforts to use web scraping for early detection of disease outbreaks. The detection modeling effort described in the previous chapter can detect a shock, but only with a one-quarter lag, he noted. The hope is that web scraping will yield signals of a shock in time to be discussed during the preparation of the initial quarterly estimates. Katherine Ensor (Rice University) moderated the session.

DESCRIPTION OF NASS WEB SCRAPING

Wei introduced web scraping as part of the next step in NASS modeling efforts. The approach combines web scraping and natural language processing (NLP) to detect hog disease outbreaks. Though disease is just one type of shock, early detection is challenging because the initial incidents may be small and local. The web-scraping and NLP approaches are intended to detect the very early signals of a disease outbreak. It is hoped that web scraping could detect an outbreak and geo-locate it to a state or county. It also has the potential to help predict the pattern and rate of a disease’s spread.

There are two stages in this approach. The first stage is to detect a hog disease outbreak by scraping disease report repository websites, such as the Swine Disease Global Surveillance Project (SDGSP)1 and the U.S. Department of Agriculture’s Animal and Plant Health Inspection Service (APHIS). The second stage looks for related news, mostly from national, state, and local news feeds; extension service websites; producer organizations’ websites; and blogs. NLP extracts information from the related news using the four steps of information extraction: time normalization, word normalization, keyword identification, and named entity recognition.

___________________

1 SDGSP is a project sponsored by the University of Minnesota Swine Center to monitor hog disease outbreaks on an international scale. It publishes reports every 2 weeks.
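The two-stage flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the NASS implementation: the `DISEASE_TERMS` watch list, the `Document` type, and the sample texts are all hypothetical, and the documents are held in memory rather than fetched from any website.

```python
from dataclasses import dataclass

# Hypothetical watch list; the source does not say which terms NASS uses.
DISEASE_TERMS = ("african swine fever", "porcine epidemic diarrhea")


@dataclass
class Document:
    source: str  # e.g., "SDGSP", "APHIS", or a news outlet
    text: str


def detect_outbreaks(reports):
    """Stage 1: collect watched diseases that repository reports call an outbreak."""
    found = set()
    for doc in reports:
        lowered = doc.text.lower()
        if "outbreak" in lowered:
            found.update(term for term in DISEASE_TERMS if term in lowered)
    return found


def related_news(news, diseases):
    """Stage 2: keep only news items that mention a disease found in stage 1."""
    return [doc for doc in news if any(d in doc.text.lower() for d in diseases)]


# Made-up sample texts, used only to exercise the two functions.
reports = [Document("SDGSP", "Outbreak of African swine fever reported in Vietnam.")]
news = [
    Document("Reuters", "A second African swine fever outbreak was reported in Inner Mongolia."),
    Document("Local blog", "Corn planting is behind schedule this spring."),
]

diseases = detect_outbreaks(reports)
matches = related_news(news, diseases)
print(diseases, [doc.source for doc in matches])
```

Keeping the stages as separate functions mirrors the description: the repository scan decides which diseases to look for, and only then is the broader, noisier news pool searched.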

Wei provided an example of the first stage, identifying an outbreak of African swine fever in Vietnam. The initial report was in SDGSP on March 4, 2019. A summary was extracted from the text noting that the outbreak affected 96 households/farms in six provinces and cities, and the Ministry of Agriculture and Rural Development required culling all those affected, quarantining the outbreak area, and testing all neighboring farms.

Wei’s example of the second stage was the tracking of African swine fever in China. The first two outbreaks were documented online on the Pig Site2 in February 2019. The Ministry of Agriculture and Rural Affairs said the first outbreak was on a farm with 5,600 hogs in the Xushui district of Baoding City. It reported that the farm had been quarantined and the herd slaughtered. Reuters reported a second outbreak in the remote Greater Khingan Mountains in Inner Mongolia, where 210 of the 222 wild boar raised on a farm died and the rest were slaughtered.

In the next step, NLP was used to extract information from the news reports. Because the text contained no temporal information, time was not normalized. Word normalization changed “raised” to “raise,” “slaughtered” to “slaughter,” and “quarantined” to “quarantine.” The keyword was defined as “outbreak,” but other keywords could be used. The named entity was the Ministry of Agriculture and Rural Affairs, the location was the Xushui district of Baoding City, and the disease noted was swine fever. The summaries of the NLP text processing of these reports are shown in the following two bullets:

  • Noun: ‘outbreak.’ Source: ‘The Ministry of Agriculture and Rural Affairs,’ Location: ‘a farm in the Xushui district of Baoding City,’ Stats: ‘has 5,600 hogs.’
  • Noun: ‘outbreak.’ Source: ‘Reuters,’ Location: ‘remote Greater Khingan Mountains in Inner Mongolia,’ Stats: ‘210 of the 222 wild boar died.’

___________________

2 The Pig Site is a knowledge-sharing platform with premium news, analysis, and resources for the global pig industry. For more information see https://thepigsite.com.
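The four extraction steps applied to the first report can be illustrated with a small standard-library sketch. It is a simplification under loud assumptions: the `LEMMAS`, `KEYWORDS`, and `ENTITIES` tables are hypothetical hand-built lookups standing in for the trained NLP models a real system would use, and the date pattern recognizes only ISO-style dates.

```python
import re

# Hypothetical lookup tables standing in for trained NLP models.
LEMMAS = {"raised": "raise", "slaughtered": "slaughter", "quarantined": "quarantine"}
KEYWORDS = {"outbreak"}
ENTITIES = {
    "Ministry of Agriculture and Rural Affairs": "SOURCE",
    "Reuters": "SOURCE",
    "Xushui district of Baoding City": "LOCATION",
    "Inner Mongolia": "LOCATION",
}
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def normalize_time(text):
    """Step 1: return the first ISO-style date in the text, or None if absent."""
    match = ISO_DATE.search(text)
    return match.group(0) if match else None


def normalize_words(text):
    """Step 2: lowercase alphabetic tokens and map inflected forms to base forms."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    return [LEMMAS.get(token, token) for token in tokens]


def find_keywords(tokens):
    """Step 3: report which watched keywords occur among the tokens."""
    return sorted(KEYWORDS.intersection(tokens))


def find_entities(text):
    """Step 4: gazetteer lookup, a crude stand-in for named entity recognition."""
    found = {}
    for name, label in ENTITIES.items():
        if name in text:
            found.setdefault(label, []).append(name)
    return found


def extract(text):
    tokens = normalize_words(text)
    return {
        "time": normalize_time(text),
        "tokens": tokens,
        "keywords": find_keywords(tokens),
        "entities": find_entities(text),
    }


TEXT = (
    "The Ministry of Agriculture and Rural Affairs said the first outbreak was on a "
    "farm with 5,600 hogs in the Xushui district of Baoding City. The farm had been "
    "quarantined and the herd slaughtered."
)
record = extract(TEXT)
print(record["time"], record["keywords"], record["entities"])
```

As in Wei’s example, the text carries no date, so the time step returns nothing and the remaining three steps supply the keyword, source, and location.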

Wei summarized the potential for this project. It provides information at a fine geographic scale (state or county) that will be potentially useful in spatial disease modeling and mapping, it provides information to understand the time course of the spread, and it provides external documentation confirming disease and response to the outbreak. It could provide information to the pre-board, the Agricultural Statistics Board, or other experts. It could also provide information to incorporate into the modeling system. One advantage of web scraping is that it can be done without the time limitations of the production system.

DISCUSSION

Chris Wikle asked about the need for a text corpus as a training sample for the algorithm to understand grammar. He noted that the results of NLP can be sensitive to the training sample used. Wei replied that he used a Python NLP tool trained on Wikipedia-like text data.

Andrew Lawson asked whether Wei had any U.S. examples of web scraping for disease. Wei replied that he did not because no disease outbreak is currently occurring within the United States. The porcine epidemic diarrhea virus (PEDv) outbreak occurred 6 years ago, and news about it has disappeared. Nell Sedransk added that web scraping began at NASS within the past 6 months and is in preliminary form. Most of the sources that would have carried the news about PEDv have been archived.

Lawson said that his group has a project on ontology based on scraping abstracts from the National Library of Medicine. One element of NLP is understanding what is meant. That can be difficult with superficial scraping, he noted, and there can be interpretational issues in web scraping. For example, there could be very fuzzy statements that say “this might be an epidemic,” when it is not.

Kamina Johnson (APHIS) reported that APHIS had a similar effort 15 years ago, though it used systems that would now be considered archaic. APHIS developed an algorithm to filter the incoming information, casting a wide net, with a human analyst to review and catalog the results. Web scraping is not a perfect science but a multistep process, she emphasized. She suggested that Wei test whether his approach would pick up Seneca Valley virus. She also suggested testing the system by searching for disease outbreaks that do not involve swine, such as virulent Newcastle disease, currently occurring in California, or low pathogenic avian influenza in the fall. These two would test for detection of diseases with lower levels of reporting. Highly pathogenic avian influenza gets a lot of attention when discovered.
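The filter-then-review workflow Johnson described can be sketched as a simple triage step. This is a hypothetical illustration, not the APHIS algorithm: the `WIDE_NET_TERMS` list and the sample headlines are invented, and plain substring matching is assumed in place of whatever filtering APHIS actually used.

```python
# Hypothetical broad term list; a wide net favors catching everything relevant
# at the cost of passing false positives on to the human analyst.
WIDE_NET_TERMS = ("outbreak", "virus", "disease", "quarantine", "influenza")


def triage(items, terms=WIDE_NET_TERMS):
    """Split items into a human-review queue and a discard pile.

    Each entry is a pair of (item, matched terms) so the analyst can see
    why an item was flagged.
    """
    review, discard = [], []
    for item in items:
        lowered = item.lower()
        hits = [term for term in terms if term in lowered]
        (review if hits else discard).append((item, hits))
    return review, discard


# Invented headlines, used only to exercise the filter.
headlines = [
    "Virulent Newcastle disease confirmed in California flock",
    "County fair schedule announced",
    "Seneca Valley virus suspected at packing plant",
]
review, discard = triage(headlines)
for item, hits in review:
    print(hits, item)
```

Recording the matched terms alongside each flagged item keeps the automated step transparent, which matters when a person makes the final call on what to catalog.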

She also recommended including additional sources in web scraping. APHIS uses the reporting from SDGSP and instant email notifications from ProMED. Additionally, the World Organisation for Animal Health (OIE) sends out instant notifications about outbreaks. The OIE and ProMED reports are released in a distinct structured format that would be easy to use in a web-scraping tool, she noted. The OIE identifies the diseases that it tracks, so its reports are disease specific, while ProMED also includes non-OIE-reportable diseases.

Lee Schulz asked about the accuracy of news as a variable when it is always changing, being updated, and occasionally redacted. He wondered whether it could be used to construct a variable accurate enough for possible input to a model, referring to the discussion of the accuracy issues related to trade expectations. Wei responded that the project is still in a preliminary stage, and NASS is exploring what can be done with the information.

Linda Young expressed doubt that any board number would be changed based on web scraping, but it might give an early alert to something happening that would then need to be confirmed to be useful. Dan Kerestes agreed with Young’s comment. The board is looking for more information. If the information can be used, perhaps in conjunction with comments sent in by the regional offices, it might add to the discussion. Analysis of the project has not yet been carried out.

Schulz asked about the current process by which experts become informed and whether web scraping might help fill a gap by speeding up that process. Kerestes replied that its main value will be as confirmation of other information.

Travis Averill (NASS) observed that the estimation process is focused on the survey and auxiliary data for a reference period, the first of March, June, September, and December. This process also results in comments and other input from respondents and regional offices that are difficult to analyze and use. Web scraping has the potential to make NASS aware of possible confirmatory information that might help to understand the situation in the field and the potential impact of events.

Wikle asked about the potential for others to manipulate this type of information, especially if NASS scrapes blogs and sites where people might report incorrect information once they know how it is being used. He asked about a mechanism for detecting false placement of key indicators. He also questioned using web-scraped data as input to a spatial epidemic model. He cautioned there is a big step between taking the information and using it as input for a model.


In 2014, the National Agricultural Statistics Service (NASS) engaged the National Academies of Sciences, Engineering, and Medicine to convene a planning committee to organize a public workshop for an expert open discussion of their then-current livestock models. The models had worked well for some time. Unfortunately, beginning in 2013, an epidemic that killed baby pigs broke out in the United States. The epidemic was not fully realized until 2014 and spread to many states. The result was a decline in hog inventories and pork production that was not predicted by the models. NASS delayed the workshop until 2019 while it worked to develop models that could help in times both of equilibrium and shock (disease or disaster), as well as alternative approaches to help detect the onset of a shock. The May 15, 2019, workshop was consistent with NASS’s 2014 intention, but with a focus on a model that can help predict hog inventories over time, including during times of shock. This publication summarizes the presentations and discussions from the workshop.
