Week 3: Web Scraping

web-scraper-order  web-scraper-start-url       news headline
1697795180-1       https://www.bbc.co.uk/news  Rising homelessness could bankrupt seaside town
1697795180-2       https://www.bbc.co.uk/news  Coaches can't keep up with my questions - Raducanu
1697795180-3       https://www.bbc.co.uk/news  Care crisis: People forced to pay or wait
1697795180-4       https://www.bbc.co.uk/news  Warm weather in September hits autumn coat sales
1697795180-5       https://www.bbc.co.uk/news  Don't dress as Barbie at Halloween, striking Hollywood actors told
1697795180-6       https://www.bbc.co.uk/news  Elon Musk says X to have two new premium tiers
1697795180-7       https://www.bbc.co.uk/news  Megan Thee Stallion settles legal battle with label

Between weeks 2 and 3, we covered aspects of HTML (hence this website's existence) and web scraping. In the week 3 workshop, I opted to use webscraper.io to build a web scraper, the output of which you can see above. Initially we discussed trying OutWit Hub as a scraper, but it was problematic with my laptop's operating system, so webscraper.io - a Chrome extension - was a better fit. I also tried ScrapeHero, which felt very user-friendly and an approachable way to scrape data (although only a limited selection of websites can be scraped with a free account).

I found the process itself fairly straightforward to execute - thanks to the advice given via Minerva and within the workshop - and, to try to make sense of the digital research method, I opted to scrape some headlines from BBC News. In a research setting, patterns might emerge from a larger dataset, allowing for critical analysis and interpretation.

I then gave things a try on Google Colab, scraping the 'Programming Buddies' group on old Reddit. Following the instructions made the process understandable; see below for the code and printed outcome.

[Screenshot: trying out a scrape, looking at titles and upvotes of the subreddit]
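Since the code itself only appears in the screenshot, here is a rough sketch of what a scrape of this kind might look like. It is not the workshop code: it uses only Python's standard library, and it assumes that each old Reddit post sits in a `div` with class `thing`, with the title in an `a` tag with class `title` and the upvotes in the `title` attribute of a `div` with classes `score unvoted`. The sample HTML, the `PostParser` class and the `fetch` helper are all hypothetical names made up for illustration.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

# Hypothetical sample that mimics the markup old Reddit is assumed to use.
SAMPLE_HTML = """
<div class="thing">
  <div class="score unvoted" title="12">12</div>
  <a class="title may-blank" href="/r/example/1">Looking for a Python buddy</a>
</div>
<div class="thing">
  <div class="score unvoted" title="7">7</div>
  <a class="title may-blank" href="/r/example/2">Study group for JavaScript</a>
</div>
"""


class PostParser(HTMLParser):
    """Collects (title, upvotes) pairs from a page of post listings."""

    def __init__(self):
        super().__init__()
        self.posts = []
        self._score = None       # upvotes seen for the current post
        self._in_title = False   # True while inside a title link
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and "score" in classes and "unvoted" in classes:
            self._score = attrs.get("title")
        elif tag == "a" and "title" in classes:
            self._in_title = True
            self._title_parts = []

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_title:
            self._in_title = False
            title = "".join(self._title_parts).strip()
            self.posts.append((title, self._score))
            self._score = None


def parse_posts(html):
    """Return a list of (title, upvotes) tuples found in the HTML."""
    parser = PostParser()
    parser.feed(html)
    return parser.posts


def fetch(url):
    """Download a page as text. Not called here, so the sketch runs offline."""
    request = Request(url, headers={"User-Agent": "week3-scraper-sketch"})
    with urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


posts = parse_posts(SAMPLE_HTML)
print(posts)
```

To run it against a live page, the output of `fetch(...)` would be passed to `parse_posts` in place of `SAMPLE_HTML`. The class names are only an assumption, so the real page source should be checked first.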

I then played around with outputs and managed to get the data written out in the format of post title / upvote number, as seen below...

[Screenshot: the output of a trial scrape]
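As a small illustration of that "title / upvote number" format, the snippet below turns a list of scraped pairs into one line per post. The example posts and the `format_posts` function are made up for this sketch and are not taken from the screenshot.

```python
def format_posts(posts):
    """Turn (title, upvotes) pairs into 'title / upvotes' lines."""
    return [f"{title} / {upvotes}" for title, upvotes in posts]


# Hypothetical scraped data, standing in for real results.
posts = [
    ("Looking for a Python buddy", 12),
    ("Study group for JavaScript", 7),
]

lines = format_posts(posts)
for line in lines:
    print(line)
```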

I plan to trial this with other websites, and/or subreddits, to develop my skills.

From a research perspective, it should be considered that relying solely on web scraping might overlook particular types of data. Data could be missed by the scraper, or a website's source HTML could change and break the scrape. With that in mind, though, web scraping can be useful in cases such as audience/consumer studies (e.g. by scraping social media comments) or, in academia, research to find patterns which might suggest or uncover certain phenomena.
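One way to guard against silently missing data is to check each scrape before analysing it. The sketch below is only one possible approach: it assumes the scraped rows are stored as dictionaries, and the field names `title` and `upvotes` and the `check_scrape` function are hypothetical.

```python
def check_scrape(rows, required=("title", "upvotes")):
    """Return the rows that are missing a required field.

    Raises ValueError if nothing was scraped at all, which often
    suggests that the page layout has changed.
    """
    if not rows:
        raise ValueError("no rows scraped; the page layout may have changed")
    # A field counts as missing if it is absent, None or an empty string.
    return [
        row for row in rows
        if any(row.get(field) in (None, "") for field in required)
    ]


# Hypothetical scrape results: the second row has lost its upvote count.
rows = [
    {"title": "Looking for a Python buddy", "upvotes": "12"},
    {"title": "Study group for JavaScript", "upvotes": None},
]

incomplete = check_scrape(rows)
print(f"{len(incomplete)} of {len(rows)} rows are incomplete")
```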

Personally, I would be interested to try scraping data from a website such as Instagram or X, specifically the comment section of, for example, a footballer's posts. There is often talk of a rise in online abuse towards sportspeople, and scraping comment data might enable statistics to be produced to back up a campaign to stop online abuse.
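As a very simplified sketch of the kind of statistic this could produce, the snippet below works out the share of comments containing at least one word from a list of flagged terms. The comments, the word list and the `abuse_rate` function are all placeholders, and real research would need a far more careful way of identifying abuse than simple word matching.

```python
import re


def abuse_rate(comments, flagged_terms):
    """Return the fraction of comments containing any flagged term."""
    if not comments:
        return 0.0
    terms = {term.lower() for term in flagged_terms}
    flagged = 0
    for comment in comments:
        # Split the comment into lower-case words before comparing.
        words = set(re.findall(r"[a-z']+", comment.lower()))
        if words & terms:
            flagged += 1
    return flagged / len(comments)


# Placeholder data standing in for scraped comments.
comments = [
    "Great goal today",
    "You are useless",
    "Useless and rubbish",
    "Well played",
]
rate = abuse_rate(comments, ["useless", "rubbish"])
print(f"{rate:.0%} of comments contain a flagged term")
```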