The COVID-19 in Niagara Dataset
Traditionally in the archival profession, records arrive in archives when they are no longer an active part of the day-to-day functions or needs of the creator. The materials still have informational, evidentiary, or historic value, so they need to be preserved for long-term access. Most often, such records come to the archives years after their active life. In today’s digital world, records can change quickly, from day to day or even hour by hour. Capturing and storing this information effectively for future reference and study is therefore important.
When the COVID-19 outbreak came to Canada in mid-March 2020, websites became one of the primary sources of information about the virus. The messaging and reactions surrounding COVID-19 seemed to change daily. In April 2020, the Brock University Archives started to capture COVID-19-related webpages of major municipal governments, businesses, and organizations in the Niagara region of Canada using the web archiving tool Archive-It, in an effort to save the evolving record of this area’s response to this historic pandemic.
We identified 56 key local institutions, governments, and organizations that needed to communicate their pandemic-related messages frequently to their clients, constituents, and patrons. Each of these groups created a COVID-19 webpage on its website. These webpages have been crawled and saved weekly using Archive-It since April 2020. Our intention is to continue to do so for the duration of all pandemic measures surrounding COVID-19. As of this writing, over 3.3 million documents have been saved, totaling 301 GB of data. Here are some samples:
To supplement this information, we have also collected local news stories about the pandemic.
Because Archive-It is a paid subscription service, there is a cap on the amount of storage space allotted to Brock every year. To manage our allotment, most of these webpages were captured at either the One Page or One Page+ setting, depending on the amount of data the creator placed on their website each week. With the One Page setting, hyperlinks within the captured page may not lead to another archived webpage. With One Page+, the seed page is archived along with the first page of any URL linked directly from the seed. We also set a data limit for each webpage to ensure that a few content-heavy webpages would not use all the storage space to the detriment of the rest. At the start of the pandemic, the weekly capture of information ranged from 5 to 11 GB. Since October 2020, the average has settled at 2 to 3 GB per week.
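The difference between the two settings described above can be pictured with a small sketch. This is only an illustration under stated assumptions, not Archive-It's implementation: the link graph, the example.org URLs, the scope names "one_page" and "one_page_plus", and the function pages_in_scope are all hypothetical, and embedded content such as images and stylesheets is ignored.

```python
# Toy link graph: each page maps to the pages it links to.
# All URLs are hypothetical placeholders, not pages from the collection.
LINKS = {
    "https://example.org/covid-19": [
        "https://example.org/covid-19/closures",
        "https://example.org/covid-19/faq",
    ],
    "https://example.org/covid-19/closures": [
        "https://example.org/covid-19/closures/parks",
    ],
    "https://example.org/covid-19/faq": [],
}


def pages_in_scope(seed, links, scope):
    """Return the pages a crawl would capture under a simplified scope rule.

    "one_page":      only the seed page itself.
    "one_page_plus": the seed page plus every page it links to directly.
    """
    if scope == "one_page":
        return [seed]
    if scope == "one_page_plus":
        captured = [seed]
        for url in links.get(seed, []):
            if url not in captured:  # skip duplicate links
                captured.append(url)
        return captured
    raise ValueError(f"unknown scope: {scope}")


if __name__ == "__main__":
    seed = "https://example.org/covid-19"
    for scope in ("one_page", "one_page_plus"):
        print(scope, len(pages_in_scope(seed, LINKS, scope)))
```

In this toy graph the parks page is two links away from the seed, so neither setting reaches it. That is the trade-off the settings make: a narrower capture in exchange for a smaller share of the storage allotment.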
After managing this dataset over the past year and a half, I cannot wait to see the discoveries that lie within and the fine research that will be done.