This is the second episode of my web scraping tutorial series. In the first episode, I showed you how you can get and clean the data from one single web page. In this one, you’ll learn how to scrape multiple web pages (3,000+ URLs!) automatically, with one 20-line long bash script.
This is going to be fun!
Note: This is a hands-on tutorial. I highly recommend doing the coding part with me! If you haven’t done so yet, please go through these articles first:
Where did we leave off?
Scraping TED.com…
In the previous article, we scraped a TED talk’s transcript from TED.com.
Note: Why TED.com? As I always say, when you run a data science hobby project, you should always pick a topic that you are passionate about. My hobby is public speaking. But if you are excited about something else, after finishing these tutorial articles, feel free to find any project that you fancy!
This was the code that we used:
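(The original code block isn't reproduced here; roughly, it was a single pipeline along these lines. The talk URL and the sed boundaries below are placeholders rather than the exact original ones, so treat this as a sketch.)

```bash
# Sketch only: download one talk's transcript page, strip the html tags,
# then cut the output down to the transcript itself.
curl "https://www.ted.com/talks/sir_ken_robinson_do_schools_kill_creativity/transcript" |
  html2text |
  sed -n '/Details/,/Programs/p'   # illustrative boundaries, not the original ones
```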
And this was the result we got:
Let’s continue from here…
By the end of this article you will have scraped not just one but all 3,000+ TED talk transcripts. They will be downloaded to your server, extracted and cleaned — ready for data analysis.
I’ll guide you through these steps:
- You’ll extract the unique URLs from TED.com’s html code — for each and every TED talk.
- You’ll clean and save these URLs into a list.
- You’ll iterate through this list with a for loop and you’ll scrape each transcript one by one.
- You’ll download, extract and clean this data by reusing the code we have already created in the previous episode of this tutorial.
So in one sentence: you will scale up our little web scraping project!
We will get there soon… But before everything else, you’ll have to learn how for loops work in bash.
Bash For Loops — a 2-minute crash course
Note: if you know how for loops work, just skip this and jump to the next headline.
If you don’t want to iterate through 3,000+ web pages one by one manually, you’ll have to write a script that will do this for you automatically. And since this is a repetitive task, your best shot is to write a loop.
I’ve already introduced bash while loops.
But this time, you will need a for loop.
A for loop works simply. You have to define an iterable (which can be a list or a series of numbers, for instance). And then you’ll use your for loop to go through and execute one or more commands on each element of this iterable.
Here’s the simplest example:
What does this code do?
It iterates through the numbers between 1 and 100 and it prints them to the screen one by one.
And how does it do that? Let’s see that line by line:
for i in {1..100}
This line is called the header of the for loop. It tells bash what you want to iterate through. In this specific case, it will be the numbers between 1 and 100. You'll use i as a variable. In each iteration, you'll store the upcoming element of your list in this i variable. And with that, you'll be able to refer to this element (and execute commands on it) in the "body" of your for loop.

Note: the variable name doesn't have to be i… It can be anything: f, g, my_variable or anything else…

do

This line tells bash that here starts the body of your for loop. In the body of the for loop, you'll add the command(s) that you want to execute on each element of the list.

echo $i

The actual command. In this case, it's the simplest possible example: returning the variable to the screen.

done

This closes the body of the for loop.
Note: if you have worked with Python for loops before, you might recognize notable differences. E.g. indentation is obligatory in Python, while in bash it's optional. (It doesn't make a difference – but we like to use indentation in bash, too, because it makes the script more readable.) On the other hand, in Python you don't need the do and done lines. Well, every language has its own solutions for certain problems (e.g. how to indicate the beginning and the end of a loop's body). Different languages are created by different people… so they use different logic. It's like learning English and German: you have to learn different grammars to speak different languages. It's just how it is…
By the way, here’s a flowchart to visualize the logic of a for loop:
Quite simple.
So for now, I don’t want to go deeper into for loops, you’ll learn the other nuances of them throughout this tutorial series anyway.
Finding the web page(s) we’ll need to scrape
Okay!
Time to get the URLs of each and every TED talk on TED.com.
But where can you find these?
Obviously, these should be somewhere on TED.com… So before you go to write your code in the command line, you should discover the website in a regular browser (e.g. Chrome or Firefox). After like 10 seconds of browsing, you’ll find the web page you need: https://www.ted.com/talks
Well, before you go further, let’s set two filters!
- We want to see only English videos for now. (Since you’ll do text analysis on this data, you don’t want to mix languages.)
- And we want to sort the videos by the number of views (most viewed first).
Using these filters, the full link of the listing page changes a bit… Check the address bar of your browser. Now, it looks like this:
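(The address bar screenshot isn't reproduced here; the filtered URL should be something along these lines — the exact query parameters are whatever your browser shows after setting the filters:)

```
https://www.ted.com/talks?language=en&sort=popular
```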
Cool!
Unfortunately, TED.com doesn't display all 3,300 videos on this page… only 36 at a time:
And to see the next 36 talks, you’ll have to go to page 2. And then to page 3… And so on. And there are 107 pages!
That’s way too many! But that’s where the for loops will come into play: you will use them to iterate through all 107 pages, automatically.
…
But for a start, let’s see whether we can extract the 36 unique URLs from the first listing page.
If we can, we will be able to apply the same process for the remaining 106.
Extracting URLs from a listing page
You have already learned curl from the previous tutorial. And now, you will have to use it again!
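Assuming the filtered listing URL from above, the command is simply:

```bash
curl 'https://www.ted.com/talks?language=en&sort=popular'
```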
Notice a small but important difference compared to what you used in the first episode. There, you typed curl and the full URL. Here, you type curl and the full URL between quotation marks!

Why the quotation marks? Because without them, curl won't be able to handle the special characters (like ?, = and &) in your URL — and your command will fail… or at least it will return improper data. The point is: when using curl, always put your URL between ' quotation marks!

Note: In fact, to stay consistent, I should have used quotation marks in my previous tutorial, too. But there (because there were no special characters) my code worked without them and I was just too lazy… Sorry about that, folks!
Anyway, we returned messy data to our screen again:
It’s all the html code of this listing page…
Let’s do some data cleaning here!
This time, you can’t use html2text
because the data you need is not the text on the page but the transcripts’ URLs. And they are found in the html code itself.
When you build a website in html, you define a link to another web page like this:
So when you scrape an html website, the URLs will be found in the lines that contain the href
keyword.
So let’s filter for href
with a grep
command! (grep
tutorial here!)
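The pipeline is just the previous curl command piped into grep (again, assuming the same listing URL):

```bash
curl 'https://www.ted.com/talks?language=en&sort=popular' | grep "href"
```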
Cool!
If you scroll up, you’ll see URLs pointing to videos. Great, those are the ones that we will need!
But you’ll also see lines with URLs to TED’s social media pages, their privacy policy page, their career page, and so on. You want to exclude these latter ones.
When doing a web scraping project this happens all the time…
There is no way around it, you have to do some classic data discovery. In other words, you’ll have to scroll through the data manually and try to find unique patterns that separate the talks’ URLs from the rest of the links we won’t need.
Lucky for us, it is a pretty clear pattern in this case.
All the lines that contain the /talks/ pattern in their href are links to the actual TED videos (and only to the videos).
Note: It seems that TED.com uses a very logical site structure and the talks are in the /talks/ subdirectory. Many high-quality websites use a similar, well-built hierarchy. For the great pleasure of web scrapers like us. 🙂
Let’s use grep
with this new, extended pattern:
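Something like this (the exact quoting depends on how the hrefs appear in TED's html source):

```bash
curl 'https://www.ted.com/talks?language=en&sort=popular' | grep "href='/talks/"
```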
And there you go:
Only the URLs pointing to the talks: listed!
Cleaning the URLs
Well, you extracted the URLs, that’s true… But they are not in the most useful format. Yet.
Here’s a sample line from the data we got:
How can we scrape the web pages based on this? No way…
We are aiming for proper, full URLs instead… Something like this:
If you take a look at the data, you’ll see that this issue can be fixed quickly. The unneeded red and blue parts are the same in all the lines. And the currently missing yellow and purple parts will be constant in the final URLs, too. So here’s the action plan:
STEP #1:
Keep the green parts!
STEP #2:
Replace this:
with this:
STEP #3:
Replace this:
with this:
There are multiple ways to solve these tasks in bash.
I’ll use sed
, just as in the previous episode. (Read more about sed
here.)
Note: By the way, feel free to add your alternative solutions in the comment section below!
So for STEP #1, you don’t have to do anything. (Easy.)
For STEP #2, you’ll have to apply this command:
And for STEP #3, this:
Note: again, this might seem very complicated to you if you don't know sed. But as I mentioned in episode #1, you can easily find these solutions if you Google for the right search phrases.

So, to bring everything together, you have to pipe these two new commands right after the grep:
Run it and you’ll see this on your screen:
Nice!
There is only one issue. Every URL shows up twice… That's an easy fix though. Just add one more command – the uniq command – to the end:
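To give you the idea of the whole pipeline in one place, here's a sketch. The sed patterns are illustrative (the exact strings you cut off depend on how the html lines look on your screen), and the /transcript?language=en ending is my assumption about the final transcript URL format:

```bash
curl 'https://www.ted.com/talks?language=en&sort=popular' |
  grep "href='/talks/" |
  sed "s/.*href='\/talks\//https:\/\/www.ted.com\/talks\//" |   # STEP #2: glue the domain on
  sed "s/'.*/\/transcript?language=en/" |                       # STEP #3: glue the transcript ending on
  uniq
```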
Awesome!
The classic URL trick for scraping multiple pages
This was the first listing page only.
But we want to scrape all 107!
So go back to your browser (to this page) and go to page 2…
You’ll see that this web page looks very similar to page 1 — but the structure of the URL changes a bit.
Now, it’s:
There is an additional &page=2 parameter there. And it is just perfect for us!
If you change this parameter to 1, it goes back to page 1. And if you change it to 11, it'll go to page 11:
By the way, most websites (not just TED.com’s) are built by following this logic. And it’s perfect for anyone who wants to scrape multiple pages…
Why?
Because then, you just have to write a for loop that changes this page parameter in the URL in every iteration… And with that, you can easily iterate through and scrape all 107 listing pages — in a flash.
Just to make this crystal clear, this is the logic you’ll have to follow:
Scraping multiple pages (URLs) – using a for loop
Let’s see this in practice!
1) The header of the for loop will be very similar to the one that you have learned at the beginning of this article:
for i in {1..107}
A slight tweak: now, we have 107 pages — so (obviously) we'll iterate through the numbers between 1 and 107.
2) Then add the do line.
3) The body of the loop will be easy, as well. Just reuse the commands that you have already written for the first listing page a few minutes ago. But make sure that you apply the little trick with the page parameter in the URL! So it’s not:
but:
This will be the body of the for loop:
4) And then the done closing line, of course.
All together, the code will look like this:
You can test this out in your Terminal… but in its final version, let's save its output into a file called ted_links.txt, too!
Here:
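A sketch, under the same assumptions about the URL format and the sed patterns as above:

```bash
for i in {1..107}
do
  # double quotes (not single) so that bash substitutes $i into the URL
  curl "https://www.ted.com/talks?language=en&page=$i&sort=popular" |
    grep "href='/talks/" |
    sed "s/.*href='\/talks\//https:\/\/www.ted.com\/talks\//" |
    sed "s/'.*/\/transcript?language=en/" |
    uniq
done > ted_links.txt
```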
Now print the ted_links.txt file — and enjoy what you see:
Very nice: 3,000+ unique URLs listed into one big file!
With a few lines of code you scraped multiple web pages (107 URLs) automatically.
This wasn’t an easy bash code to write, I know, but you did it! Congratulations!
Scraping multiple web pages again!
We are pretty close — but not done yet!
In the first episode of this web scraping tutorial series, you have created a script that can scrape, download, extract and clean a single TED talk’s transcript. (That was Sir Ken Robinson’s excellent presentation.)
This was the bash code for it:
And this was the result:
And in this article, you have saved the URLs of all TED talk transcripts to the ted_links.txt file:
All you have to do is to put these two things together.
To go through and scrape 3,000+ web pages, you will have to use a for loop again.
The header of this new for loop will be somewhat different this time:
for i in $(cat ted_links.txt)
Your iterable is the list of the transcript URLs — found in the ted_links.txt file.
The body will be the bash code that we've written in the previous episode. Only the exact URL (that points to Sir Ken Robinson's talk) should be replaced with the $i variable. (As the for loop goes through the lines of the ted_links.txt file, in each iteration the $i value will be the next URL, then the next URL, and so on…)
So this will be the body:
If we put these together, this is our code:
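A sketch, reusing the illustrative cleaning step from the episode #1 sketch above:

```bash
for i in $(cat ted_links.txt)
do
  # body from episode #1; the sed boundaries are illustrative
  curl "$i" | html2text | sed -n '/Details/,/Programs/p'
done
```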
Let’s test this!
After hitting enter, you’ll see the TED talks printed to your screen — scraped, extracted, cleaned… one by one. Beautiful!
But we want to store this data in a file — not print it to our screen… So let's just interrupt this process! (Scraping 3,000+ web pages would take ~1 hour.) To do that, hit CTRL + C on your keyboard! (This hotkey works on Mac, Windows and Linux, too.)
Storing the data
Storing the transcripts into a file (or into more files) is really just one final touch on your web scraping bash script.
There are two ways to do that:
The lazy way and the elegant way.
1) I’ll show you the lazy way first.
It’s as simple as adding > ted_transcripts_all.txt
to the end of the for loop. Like this:
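Something like this (same illustrative body as before):

```bash
for i in $(cat ted_links.txt)
do
  curl "$i" | html2text | sed -n '/Details/,/Programs/p'
done > ted_transcripts_all.txt
```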
Just run it, and after ~1 hour of processing time (again: scraping 3,000+ web pages can take a lot of time) you will get all the transcripts into one big file (ted_transcripts_all.txt).
Great! But that’s the lazy way. It’s fast and it will be a good enough solution for a few simple text analyses.
2) But I prefer the elegant way: saving each talk into a separate file. That's much better in the long term. With that, you will be able to analyze all the transcripts separately if you want to!
To go further with this solution, you’ll have to create a new folder for your new files. (3,000+ files is a lot… you definitely want to put them into a dedicated directory!) Type this:
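For example:

```bash
mkdir ted_transcripts
```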
And then, you’ll have to come up with a naming convention for these files.
For me, talk1.txt, talk2.txt, talk3.txt (…) sounds pretty logical.
To use that logic, you have to add one more variable to your for loop. I'll call it $counter; its value will be 1 in the first iteration — and I'll add 1 to it in every iteration as our for loop goes forward.
Great — the only thing left is to add the file name itself:
talk$counter.txt

(In the first iteration this will be talk1.txt, in the second talk2.txt, in the third talk3.txt and so on — as the $counter part of it changes.)
Okay, so add the > character, your freshly created folder's name (ted_transcripts) and the new file name (talk$counter.txt) to the right place in the for loop's body:
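A sketch of the final loop, again with the illustrative cleaning step standing in for the episode #1 body:

```bash
counter=1
for i in $(cat ted_links.txt)
do
  curl "$i" | html2text | sed -n '/Details/,/Programs/p' > ted_transcripts/talk$counter.txt
  counter=$((counter + 1))   # add 1 to the counter in every iteration
done
```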
Let’s run it!
And in ~1 hour, you will have all the TED transcripts sorted into separate files!
You are done!
You have scraped multiple web pages… twice!
Saving your bash script — and reusing it later
It would be great to save all the code that you have written so far, right?
You know, just so that you’ll be able to reuse it later…
Let’s create a bash script!
Type this to your command line:
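(That is, assuming you use mcedit as in the previous episodes:)

```bash
mcedit ted_scraper.sh
```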
This will open our favorite command line text editor, mcedit, and it will create a new file called ted_scraper.sh.
Note: We use .sh as the file extension for bash scripts.
You can copy-paste to the script all the code that you have written so far… Basically, it will be the two for loops that you fine-tuned throughout this article.
And don’t forget to add the shebang in the first line, either!#!/usr/bin/env bash
For your convenience, I put everything in GitHub… So if you want to, you can copy-paste the whole thing directly from there. Here’s the link.
This is how your script should look in mcedit:
Click the 10-Quit button at the bottom right corner and save the script!
And boom, you can reuse this script anytime in the future.
What's more, you can run this bash script directly from this .sh file.
All you have to do is to give yourself the right privileges to run this file:
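For example, by making the file executable:

```bash
chmod +x ted_scraper.sh
```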
And then with this command…
you can immediately start your script.
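That command, assuming the script sits in your current folder, is simply:

```bash
./ted_scraper.sh
```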
(Note: if you do this, make sure that your folder structure is properly prepared and that you have indeed created the ted_transcripts subfolder in the main folder where your script is located. Since you refer to this ted_transcripts subfolder in the bash script, if it doesn't exist, your script will fail.)
One more comment:
Another text editor that I’ve recently been using quite often — and that I highly recommend to everyone — is Sublime Text 3. It has many great features that will make your coding life as a data scientist very, very efficient… And you can use it with a remote server, too.
In Sublime Text 3, this is how your script looks:
Pretty nice!
Conclusion
Whoaa!
This was probably the longest tutorial on the Data36 Blog so far. Scraping multiple URLs can get complex, right? Well, we have written only ~20 lines of code… but as you can see, even that can take a lot of thinking.
Anyway, if you have done this with me — and you have all 3,800+ TED talks scraped, downloaded, extracted and cleaned on your remote server: be really proud of yourself!
It wasn’t easy but you have done it! Congratulations!
Your next step, in this web scraping tutorial, will be to run text analyses on the data we got.
We will continue from here in the web scraping tutorial episode #3! Stay with me!
- If you want to learn more about how to become a data scientist, take my 50-minute video course: How to Become a Data Scientist. (It’s free!)
- Also check out my 6-week online course: The Junior Data Scientist’s First Month video course.
Cheers,
Tomi Mester
C# is still a popular backend programming language, and you might find yourself in need of it for scraping a web page (or multiple pages). In this article, we will cover scraping with C# using an HTTP request, parsing the results, and then extracting the information that you want to save. This method is common with basic scraping, but you will sometimes come across single-page web applications built in JavaScript such as Node.js, which require a different approach. We’ll also cover scraping these pages using PuppeteerSharp, Selenium WebDriver, and Headless Chrome.
Note: This article assumes that the reader is familiar with C# syntax and HTTP request libraries. The PuppeteerSharp and Selenium WebDriver .NET libraries are available to make integration of Headless Chrome easier for developers. Also, this project is using .NET Core 3.1 framework and the HTML Agility Pack for parsing raw HTML.
Part I: Static Pages
Setup
If you’re using C# as a language, you probably already use Visual Studio. This article uses a simple .NET Core Web Application project using MVC (Model View Controller). After you create a new project, go to the NuGet Package Manager where you can add the necessary libraries used throughout this tutorial.
In NuGet, click the “Browse” tab and then type “HTML Agility Pack” to find the dependency.
Install the package, and then you’re ready to go. This package makes it easy to parse the downloaded HTML and find tags and information that you want to save.
Finally, before you get started with coding the scraper, you need the following libraries added to the codebase:
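The original listing isn't shown here; a reasonable set of using directives for the code that follows (assuming the HTML Agility Pack package from NuGet) would be:

```csharp
using HtmlAgilityPack;               // HTML parsing
using System.Collections.Generic;    // List<string>
using System.Linq;                   // LINQ queries over the parsed nodes
using System.Net;                    // ServicePointManager / SecurityProtocolType
using System.Net.Http;               // HttpClient
using System.Text;                   // StringBuilder for the CSV export
using System.Threading.Tasks;        // async/await
```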
Making an HTTP Request to a Web Page in C#
Imagine that you have a scraping project where you need to scrape Wikipedia for information on famous programmers. Wikipedia has a page with a list of famous programmers with links to each profile page. You can scrape this list and add it to a CSV file (or Excel spreadsheet) to save for future review and use. This is just one simple example of what you can do with web scraping, but the general concept is to find a site that has the information you need, use C# to scrape the content, and store it for later use. In more complex projects, you can crawl pages using the links found on a top category page.
Using .NET HTTP Libraries to Retrieve HTML
.NET Core introduced asynchronous HTTP request libraries to the framework. These libraries are native to .NET, so no additional libraries are needed for basic requests. Before you make the request, you need to build the URL and store it in a variable. Because we already know the page that we want to scrape, a simple URL variable can be added to the HomeController's Index() method. The HomeController's Index() method is the default call when you first open an MVC web application.
Add the following code to the Index() method in the HomeController file:
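A sketch, assuming Wikipedia's "List of programmers" page as the target:

```csharp
// Hard-coded target page for this example
string url = "https://en.wikipedia.org/wiki/List_of_programmers";
```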
Using .NET HTTP libraries, a static asynchronous task is returned from the request, so it’s easier to put the request functionality in its own static method. Add the following method to the HomeController file:
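A sketch of that method, matching the line-by-line breakdown below:

```csharp
private static async Task<string> CallUrl(string fullUrl)
{
    // Native .NET HTTP client
    HttpClient client = new HttpClient();

    // Force a modern TLS version for the HTTPS handshake
    ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls13;

    // Start from a clean set of request headers
    client.DefaultRequestHeaders.Accept.Clear();

    // Request the page and hand the raw HTML back to the caller
    var response = client.GetStringAsync(fullUrl);
    return await response;
}
```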
Let’s break down each line of code in the above CallUrl()
method.
This statement creates an HttpClient variable, which is an object from the native .NET framework.
If you get HTTPS handshake errors, it's likely because you are not using the right cryptographic protocol. The above statement forces the connection to use TLS 1.3 so that an HTTPS handshake can be established. Note that some web servers do not yet support the newest TLS versions, so you may need to allow an older protocol for them. For this basic task, cryptographic strength is not important, but it could be for some other scraping requests involving sensitive data.
This statement clears the headers, should you decide to add your own. For instance, you might scrape content using an API request that requires a Bearer authorization token. In such a scenario, you would then add a header to the request. For example:
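Something like this, where accessToken is a placeholder for your own token:

```csharp
client.DefaultRequestHeaders.Add("Authorization", "Bearer " + accessToken);
```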
The above would pass the authorization token to the web application server to verify that you have access to the data. Next, we have the last two lines:
These two statements retrieve the HTML content, await the response (remember, this is asynchronous) and return it to the HomeController's Index() method where it was called. The following code is what your Index() method should contain (for now):
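One way to write it (the URL is the same assumed Wikipedia page):

```csharp
public async Task<IActionResult> Index()
{
    string url = "https://en.wikipedia.org/wiki/List_of_programmers";
    var response = await CallUrl(url);
    return View();
}
```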
The code to make the HTTP request is done. We still haven't parsed it yet, but now is a good time to run the code to ensure that the Wikipedia HTML is returned instead of any errors. Make sure you set a breakpoint in the Index() method at the following line:
This will ensure that you can use the Visual Studio debugger UI to view the results.
You can test the above code by clicking the “Run” button in the Visual Studio menu:
Visual Studio will stop at the breakpoint, and now you can view the results.
If you click “HTML Visualizer” from the context menu, you can see a raw HTML view of the results, but you can see a quick preview by just hovering your mouse over the variable. You can see that HTML was returned, which means that an error did not occur.
Parsing the HTML
With the HTML retrieved, it’s time to parse it. HTML Agility Pack is a common tool, but you may have your own preference. Even LINQ can be used to query HTML, but for this example and for ease of use, the Agility Pack is preferred and what we will use.
Before you parse the HTML, you need to know a little bit about the structure of the page so that you know what to use as markers for your parsing to extract only what you want and not every link on the page. You can get this information using the Chrome Inspect function. In this example, the page has a table of contents links at the top that we don’t want to include in our list. You can also take note that every link is contained within an <li> element.
From the above inspection, we know that we want the content within the "li" elements but not the ones with the tocsection class attribute. With the Agility Pack, we can eliminate them from the list.

We will parse the document in its own method in the HomeController, so create a new method named ParseHtml() and add the following code to it:
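A sketch of ParseHtml(), using the Agility Pack calls described below (the tocsection filter and the absolute-URL prefix are based on the page structure discussed above):

```csharp
private List<string> ParseHtml(string html)
{
    var htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(html);

    // Keep every <li> that is not part of the table of contents
    var listItems = htmlDoc.DocumentNode.Descendants("li")
        .Where(node => !node.GetAttributeValue("class", "").Contains("tocsection"));

    var wikiLinks = new List<string>();
    foreach (var item in listItems)
    {
        // The first anchor tag inside the <li> holds the relative link to the profile
        var anchor = item.Descendants("a").FirstOrDefault();
        if (anchor != null)
        {
            // Wikipedia hrefs are relative, so build the absolute URL ourselves
            wikiLinks.Add("https://en.wikipedia.org" + anchor.GetAttributeValue("href", ""));
        }
    }

    return wikiLinks;
}
```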
In the above code, a generic list of strings (the links) is created from the parsed HTML, with a list of links to famous programmers on the selected Wikipedia page. We use LINQ to eliminate the table of contents links, so now we just have the HTML content with links to programmer profiles on Wikipedia. We use .NET's native functionality in the foreach loop to parse the first anchor tag that contains the link to the programmer profile. Because Wikipedia uses relative links in the href attribute, we manually create the absolute URL to add convenience when a reader goes into the list to click each link.
Exporting Scraped Data to a File
The code above opens the Wikipedia page and parses the HTML. We now have a generic list of links from the page. Now, we need to export the links to a CSV file. We'll make another method named WriteToCsv() to write data from the generic list to a file. The following code is the full method that writes the extracted links to a file named "links.csv" and stores it on the local disk.
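A sketch of WriteToCsv() using native .NET file I/O:

```csharp
private void WriteToCsv(List<string> links)
{
    // One link per line is enough for a single-column CSV
    var csv = new StringBuilder();
    foreach (var link in links)
    {
        csv.AppendLine(link);
    }
    System.IO.File.WriteAllText("links.csv", csv.ToString());
}
```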
The above code is all it takes to write data to a file on local storage using native .NET framework libraries.
The full HomeController code for this scraping section is below.
Part II: Scraping Dynamic JavaScript Pages
In the previous section, data was easily available to our scraper because the HTML was constructed and returned to the scraper the same way a browser would receive data. Newer JavaScript technologies such as Vue.js render pages using dynamic JavaScript code. When a page uses this type of technology, a basic HTTP request won’t return HTML to parse. Instead, you need to parse data from the JavaScript rendered in the browser.
Dynamic JavaScript isn’t the only issue. Some sites detect if JavaScript is enabled or evaluate the UserAgent value sent by the browser. The UserAgent header is a value that tells the web server the type of browser being used to access pages (e.g. Chrome, FireFox, etc). If you use web scraper code, no UserAgent is sent and many web servers will return different content based on UserAgent values. Some web servers will use JavaScript to detect when a request is not from a human user.
You can overcome this issue using libraries that leverage Headless Chrome to render the page and then parse the results. We’re introducing two libraries freely available from NuGet that can be used in conjunction with Headless Chrome to parse results. PuppeteerSharp is the first solution we use that makes asynchronous calls to a web page. The other solution is Selenium WebDriver, which is a common tool used in automated testing of web applications.
Using PuppeteerSharp with Headless Chrome
For this example, we will add the asynchronous code directly into the HomeController's Index() method. This requires a small change to the default Index() method, shown in the code below.
In addition to the Index() method changes, you must also add the library reference to the top of your HomeController code. Before you can use Puppeteer, you first must install the library from NuGet and then add the following line to your using statements:
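That line is:

```csharp
using PuppeteerSharp;
```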
Now, it’s time to add your HTTP request and parsing code. In this example, we’ll extract all URLs (the <a> tag) from the page. Add the following code to the HomeController to pull the page source in Headless Chrome, making it available for us to extract links (note the change in the Index()
method, which replaces the same method in the previous section example):
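A sketch, still using the Wikipedia example; the chrome.exe path is a placeholder you'll need to adjust to your own machine:

```csharp
public async Task<IActionResult> Index()
{
    string fullUrl = "https://en.wikipedia.org/wiki/List_of_programmers";
    var programmerLinks = new List<string>();

    var options = new LaunchOptions
    {
        Headless = true,
        // Placeholder path: point this at your local Chrome installation
        ExecutablePath = @"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"
    };

    using (var browser = await Puppeteer.LaunchAsync(options))
    using (var page = await browser.NewPageAsync())
    {
        await page.GoToAsync(fullUrl);

        // Grab the rendered page source and parse it with the Agility Pack as before
        var html = await page.GetContentAsync();
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        foreach (var anchor in htmlDoc.DocumentNode.Descendants("a"))
        {
            programmerLinks.Add(anchor.GetAttributeValue("href", ""));
        }
    }

    return View();
}
```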
Similar to the previous example, the links found on the page were extracted and stored in a generic list named programmerLinks. Notice that the path to chrome.exe is added to the options variable. If you don't specify the executable path, Puppeteer will be unable to initialize Headless Chrome.
Using Selenium with Headless Chrome
If you don’t want to use Puppeteer, you can use Selenium WebDriver. Selenium is a common tool used in automation testing on web applications, because in addition to rendering dynamic JavaScript code, it can also be used to emulate human actions such as clicks on a link or button. To use this solution, you need to go to NuGet and install Selenium.WebDriver and (to use Headless Chrome) Selenium.WebDriver.ChromeDriver. Note: Selenium also has drivers for other popular browsers such as FireFox.
Add the following libraries to the using statements:
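Those are:

```csharp
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
```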
Now, you can add the code that will open a page and extract all links from the results. The following code demonstrates how to extract links and add them to a generic list.
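A sketch along the same lines as the Puppeteer example (note that this version is synchronous):

```csharp
public IActionResult Index()
{
    string fullUrl = "https://en.wikipedia.org/wiki/List_of_programmers";
    var programmerLinks = new List<string>();

    var options = new ChromeOptions();
    options.AddArguments("headless");   // run Chrome without a visible window

    using (var driver = new ChromeDriver(options))
    {
        driver.Navigate().GoToUrl(fullUrl);

        // Every anchor element on the rendered page
        var links = driver.FindElements(By.TagName("a"));
        foreach (var link in links)
        {
            var href = link.GetAttribute("href");
            if (!string.IsNullOrEmpty(href))
            {
                programmerLinks.Add(href);
            }
        }
    }

    return View();
}
```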
Notice that the Selenium solution is not asynchronous, so if you have a large pool of links and actions to take on a page, it will freeze your program until the scraping completes. This is the main difference between the previous solution using Puppeteer and Selenium.
Conclusion
Web scraping is a powerful tool for developers who need to obtain large amounts of data from a web application. With pre-packaged dependencies, you can turn a difficult process into only a few lines of code.
One issue we didn’t cover is getting blocked either from remote rate limits or blocks put on bot detection. Your code would be considered a bot by some applications that want to limit the number of bots accessing data. Our web scraping API can overcome this limitation so that developers can focus on parsing HTML and obtaining data rather than determining remote blocks.