Using grep, curl, and tail to scrape data from a Web page
Posted on: Sunday, Feb 04, 2018
Another post on this blog provides a Bash script that automates the installation of the most recent version of Firefox Developer Edition (FFDE). The original version of that script required the manual input of FFDE's latest version number. Looking up that number was a hassle to say the least, and it added lots of friction to a process that should be simple and fast.
Rather than require you to look up the most recent version number and then provide that value as an argument to the Bash script, the script now uses three of the really handy utilities that lurk within Linux. curl, grep, and tail work together to fetch the most recent version number from the FFDE downloads page. This post goes into the details of how that script uses these Linux utilities to get the latest FFDE version number. With these in place, running that script is now quite simple.
You can read more about each of these utilities:
- curl: transfer the contents of a URL
- grep: find lines matching a pattern
- tail: output the last part of files
Scraping data from a Web page
Mozilla provides a 'releases' download page that shows the versions of FFDE available. The most recent version number is the last number in the list. Visit the releases page to see it; there isn't much to it, mostly just a list of version numbers.
Follow along by copying each command below into a terminal session and running it.
In an open terminal session, pull down the FFDE release page's HTML with curl. The script needs that HTML in a text file, so it uses curl's -o flag to save the release page's HTML in the releases.txt file.
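A minimal sketch of that step is below; the URL is an assumption (Mozilla's Developer Edition release listing has lived at ftp.mozilla.org), so point curl at whichever releases page you actually use.

```bash
# Fetch the FFDE releases page and save its HTML to releases.txt
curl -s -o releases.txt https://ftp.mozilla.org/pub/devedition/releases/
```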
With the releases.txt file available, we'll run grep against that file to extract the version numbers from it. To do so, grep uses a simple regular expression that matches an FFDE version number (59.0b6, for example), where [0-9] specifies a single numeric digit, \. looks for a single period (unescaped, the . means any character to regex), and [a-z] specifies a letter from between a and z. Run grep to see the version numbers in a terminal.
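One plausible form of that grep command is below; the exact regular expression in the original script may differ slightly.

```bash
# Print every version-number-shaped string found in releases.txt
grep -o '[0-9][0-9]\.[0-9][a-z][0-9]*' releases.txt
```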
This list is all of the version numbers (with each one repeated twice), but we only need the last number (the most recent version) in the list. To get it, we pipe grep's output into the tail command, which returns the last line.
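Continuing with the sample regex from above:

```bash
# Keep only the last match, i.e. the most recent version number
grep -o '[0-9][0-9]\.[0-9][a-z][0-9]*' releases.txt | tail -n 1
```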
The last bit of this step is to get the most recent version number into a Bash variable for use in the Bash script. This is done with Bash's command substitution operator, $( ), which lets us capture the most recent version number in a Bash variable and show it.
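Under the same assumptions, and with FFDE_VERSION as an illustrative variable name (the script may call it something else):

```bash
# Capture the most recent version number in a Bash variable and show it
FFDE_VERSION=$(grep -o '[0-9][0-9]\.[0-9][a-z][0-9]*' releases.txt | tail -n 1)
echo "$FFDE_VERSION"
```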
While that was a long explanation, it all distills down to three lines (including a line to delete the releases.txt file once we're done).
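Under the same assumptions as above (the releases URL, the sample regex, and the FFDE_VERSION variable name), those three lines might look like this:

```bash
curl -s -o releases.txt https://ftp.mozilla.org/pub/devedition/releases/
FFDE_VERSION=$(grep -o '[0-9][0-9]\.[0-9][a-z][0-9]*' releases.txt | tail -n 1)
rm releases.txt
```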
The general technique here of pulling down a page with curl and then parsing it with grep and tail (and whatever other Linux utilities you need to use) is very handy. Please let me know in the comments what tasks you're using Linux utilities for.
Web scraping or crawling is the process of fetching data from a third-party website by downloading and parsing the HTML code to extract the data you want.
“But you should use an API for this!”
However, not every website offers an API, and APIs don't always expose every piece of information you need. So, it's often the only solution to extract website data.
There are many use cases for web scraping:
- E-commerce price monitoring
- News aggregation
- Lead generation
- SEO (search engine result page monitoring)
- Bank account aggregation (Mint in the US, Bankin’ in Europe)
- Individuals and researchers building datasets otherwise not available.
The main problem is that most websites do not want to be scraped. They only want to serve content to real users using real web browsers (except Google - they all want to be scraped by Google).
So, when you scrape, you do not want to be recognized as a robot. There are two main ways to seem human: use human tools and emulate human behavior.
This post will guide you through all the tools websites use to block you and all the ways you can successfully overcome these obstacles.
Emulate Human Tool: Headless Chrome
Why Use Headless Browsing?
When you open your browser and go to a webpage, it almost always means that you ask an HTTP server for some content. One of the easiest ways to pull content from an HTTP server is to use a classic command-line tool such as cURL.
The thing is, if you just do curl www.google.com, Google has many ways to know that you are not a human (for example, by looking at the headers). Headers are small pieces of information that go with every HTTP request that hits the servers. One of those pieces of information precisely describes the client making the request: this is the infamous “User-Agent” header. Just by looking at the “User-Agent” header, Google knows that you are using cURL. If you want to learn more about headers, the Wikipedia page is great. As an experiment, visit a page that echoes your request headers back to you; it simply displays the header information of your request.
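A quick way to see the difference from the command line; httpbin.org/headers is just a convenient header-echoing endpoint, and the User-Agent string below is only an example (use one matching a real, current browser):

```bash
# Default request: the User-Agent header identifies the client as curl
curl -s https://httpbin.org/headers

# Same request with a browser-like User-Agent header
curl -s -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36' \
  https://httpbin.org/headers
```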
Headless Browsers will behave like a real browser except that you will easily be able to programmatically use them. The most popular is Chrome Headless, a Chrome option that behaves like Chrome without all of the user interface wrapping it.
The easiest way to use Headless Chrome is by calling a driver that wraps all functionality into an easy API. Selenium, Playwright, and Puppeteer are the three most famous solutions.
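As a minimal taste, Chrome itself can run headless straight from the shell (the binary may be called google-chrome, chromium, or chrome depending on your install):

```bash
# Load a page in headless Chrome and dump the rendered DOM to a file
google-chrome --headless --dump-dom https://example.com > page.html
```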
However, it will not be enough as websites now have tools that detect headless browsers. This arms race has been going on for a long time.
Php Scraping Library
While these solutions can be easy to do on your local computer, it can be trickier to make this work at scale.
Managing lots of Chrome Headless instances is one of the many problems we solve at ScrapingBee.
This is why there is an everlasting arms race between web scrapers who want to pass themselves as a real browser and websites who want to distinguish headless from the rest.
However, in this arms race, web scrapers tend to have a big advantage.
Another thing to know is that, while running 20 cURL requests in parallel is trivial and Chrome Headless is relatively easy to use for small use cases, it can be tricky to put to work at scale. Because it uses lots of RAM, managing more than 20 instances of it is a challenge.
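For comparison, parallel cURL really is a one-liner; urls.txt here is just a hypothetical file with one URL per line.

```bash
# Fetch every URL listed in urls.txt, 20 requests at a time
xargs -n 1 -P 20 curl -s -O < urls.txt
```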
If you want to learn more about browser fingerprinting I suggest you take a look at Antoine Vastel's blog, which is entirely dedicated to this subject.
That's about all you need to know about how to pretend like you are using a real browser. Let's now take a look at how to behave like a real human.
TLS Fingerprinting
What is it?
TLS stands for Transport Layer Security and is the successor of SSL which was basically what the “S” of HTTPS stood for.
This protocol ensures privacy and data integrity between two or more communicating computer applications (in our case, a web browser or a script and an HTTP server).
Similar to browser fingerprinting, the goal of TLS fingerprinting is to uniquely identify users based on the way they use TLS.
How this protocol works can be split into two big parts.
First, when the client connects to the server, a TLS handshake happens. During this handshake, many requests are sent between the two to ensure that everyone is actually who they claim to be.
Then, if the handshake has been successful the protocol describes how the client and the server should encrypt and decrypt the data in a secure way. If you want a detailed explanation, check out this great introduction by Cloudflare.
Most of the data points used to build the fingerprint come from the TLS handshake, and if you want to see what a TLS fingerprint looks like, you can go visit this awesome online database.
On this website, you can see that the most used fingerprint last week was used 22.19% of the time (at the time of writing this article).
This share is very big: at least two orders of magnitude higher than that of the most common browser fingerprint. It actually makes sense, as a TLS fingerprint is computed using far fewer parameters than a browser fingerprint.
Those parameters are, amongst others:
- TLS version
- Handshake version
- Cipher suites supported
If you wish to know what your TLS fingerprint is, I suggest you visit this website.
How do I change it?
Ideally, in order to increase your stealth when scraping the web, you should be changing your TLS parameters. However, this is harder than it looks.
Firstly, because there are not that many TLS fingerprints out there, simply randomizing those parameters won't work. Your fingerprint will be so rare that it will be instantly flagged as fake.
Secondly, TLS parameters are low-level stuff that rely heavily on system dependencies, so changing them is not straightforward.
For example, the famous Python requests module doesn't support changing the TLS fingerprint out of the box. Here are a few resources for changing your TLS version and cipher suite in your favorite language:
- Python with HTTPAdapter and requests
- NodeJS with the TLS package
- Ruby with OpenSSL
Keep in mind that most of these libraries rely on your system's SSL/TLS implementation (OpenSSL is the most widely used), and you might need to change its version in order to completely alter your fingerprint.
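cURL itself lets you pin the TLS version and restrict the cipher suites offered during the handshake, which already changes the resulting fingerprint; the cipher name below is just an example of an OpenSSL-style cipher string.

```bash
# Force TLS 1.2 and offer a single cipher suite, printing only the response code
curl -s -o /dev/null -w '%{http_code}\n' \
  --tlsv1.2 --ciphers ECDHE-RSA-AES128-GCM-SHA256 \
  https://example.com
```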
Emulate Human Behaviour: Proxy, Captcha Solving and Request Patterns
A human using a real browser will rarely request 20 pages per second from the same website. So if you want to request a lot of pages from the same website, you have to trick the website into thinking that all those requests come from different places in the world, i.e., different IP addresses. In other words, you need to use proxies.
Proxies are not very expensive: roughly $1 per IP. However, if you need to do more than ~10k requests per day on the same website, costs can go up quickly, with hundreds of addresses needed. One thing to consider is that proxy IPs need to be constantly monitored in order to discard the ones that are no longer working and replace them.
There are several proxy solutions on the market, here are the most used rotating proxy providers: Luminati Network, Blazing SEO and SmartProxy.
There are also a lot of free proxy lists, but I don't recommend using them: the proxies are often slow and unreliable, and the websites offering these lists are not always transparent about where the proxies are located. Free proxy lists are usually public, and therefore their IPs will be automatically banned by most websites. Proxy quality is important: anti-crawling services are known to maintain internal lists of proxy IPs, so any traffic coming from those IPs will also be blocked. Be careful to choose proxies with a good reputation. This is why I recommend using a paid proxy network, or building your own.
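Whichever provider you pick, routing a request through a proxy is straightforward; the host, port, and credentials below are placeholders.

```bash
# Route a request through an authenticated HTTP proxy and check the visible IP
curl -s -x http://username:password@proxy.example.com:8080 https://httpbin.org/ip
```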
Another proxy type that you could look into is mobile 3G and 4G proxies. They are helpful for scraping hard-to-scrape, mobile-first websites, like social media.
To build your own proxy, you could take a look at scrapoxy, a great open-source API that lets you build a proxy API on top of different cloud providers. Scrapoxy creates a proxy pool by spinning up instances on various cloud providers (AWS, OVH, DigitalOcean). You can then configure your client to use the Scrapoxy URL as its main proxy, and Scrapoxy will automatically assign a proxy from inside the pool. Scrapoxy is easily customizable to fit your needs (rate limit, blacklist, etc.), but it can be a little tedious to put in place.
You could also use the TOR network, aka The Onion Router. It is a worldwide computer network designed to route traffic through many different servers to hide its origin. TOR usage makes network surveillance and traffic analysis very difficult. There are a lot of use cases for TOR, such as privacy, freedom of speech, journalists working under dictatorial regimes, and, of course, illegal activities. In the context of web scraping, TOR can hide your IP address and change your bot's IP address every 10 minutes. However, the TOR exit nodes' IP addresses are public, and some websites block TOR traffic using a simple rule: if the server receives a request from one of the public TOR exit nodes, it blocks it. That's why, in many cases, TOR won't help you compared to classic proxies. It's also worth noting that traffic through TOR is inherently much slower because of the multiple routing hops.
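If you do want to experiment with it, a locally running Tor daemon exposes a SOCKS proxy (9050 is the default port; adjust if yours differs):

```bash
# Send a request through the local Tor SOCKS proxy and check the exit node's IP
curl -s --socks5-hostname 127.0.0.1:9050 https://httpbin.org/ip
```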
Sometimes proxies will not be enough. Some websites systematically ask you to confirm that you are a human with so-called CAPTCHAs. Most of the time, CAPTCHAs are only displayed to suspicious IPs, so switching proxies will work in those cases. For the other cases, you'll need to use a CAPTCHA-solving service (2Captcha and DeathByCaptcha come to mind).
While some CAPTCHAs can be automatically solved with optical character recognition (OCR), the most recent ones have to be solved by hand.
If you use the aforementioned services, on the other side of the API call you'll have hundreds of people resolving CAPTCHAs for as low as 20ct an hour.
But then again, even if you solve CAPTCHAs or switch proxies as soon as you see one, websites can still detect your data extraction process.
Another advanced tool used by websites to detect scraping is pattern recognition. So if you plan to scrape every ID from 1 to 10,000 for the URL www.example.com/product/<id>, try not to do it sequentially or at a constant rate.
Some websites also collect statistics on browser fingerprints per endpoint. This means that if you don't change some parameters in your headless browser and target a single endpoint, they might block you.
Websites also tend to monitor the origin of traffic, so if you want to scrape a website in Brazil, try not to do it with proxies in Vietnam.
But from experience, I can tell you that rate is the most important factor in “Request Pattern Recognition”, so the slower you scrape, the less chance you have of being discovered.
Emulate Machine Behaviour: Reverse engineering of API
Sometimes, the server expects the client to be a machine. In these cases, hiding yourself is way easier.
Reverse engineering of API
Basically, this “trick” comes down to two things:
- Analyzing a web page behaviour to find interesting API calls
- Forging those API calls with your code
For example, let's say that I want to get all the comments of a famous social network. I notice that when I click on the “load more comments” button, a new request shows up in my browser's network inspector. Notice that we filter out every request except “XHR” ones to avoid noise. When we look at which request is being made and which response we get back... bingo!
Now if we look at the “Headers” tab, we should have everything we need to replay this request and understand the value of each parameter. This will allow us to make this request from a simple HTTP client.
The hardest part of this process is understanding the role of each parameter in the request. Note that you can right-click any request in the Chrome DevTools network inspector, export it in HAR format, and then import it into your favorite HTTP client (I love Paw and Postman).
This will allow you to have all the parameters of a working request laid out and will make your experimentation much faster and fun.
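Once the parameters make sense, replaying the call is just another HTTP request. A sketch with an entirely hypothetical endpoint, parameters, and headers:

```bash
# Replay the XHR that loads more comments (endpoint, parameters, and cookie are hypothetical)
curl -s 'https://socialnetwork.example.com/api/comments?post_id=12345&cursor=20' \
  -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36' \
  -H 'X-Requested-With: XMLHttpRequest' \
  -H 'Cookie: session=REPLACE_WITH_YOUR_SESSION_COOKIE'
```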
Reverse-Engineering of Mobile Apps
The same principles apply when it comes to reverse engineering mobile apps. You will want to intercept the requests your mobile app makes to the server and replay them with your code.
Doing this is hard for two reasons:
- To intercept requests, you will need a Man-in-the-Middle proxy (Charles Proxy, for example).
- Mobile apps can fingerprint your requests and obfuscate them more easily than a web app.
For example, when Pokemon Go was released a few years ago, tons of people cheated the game after reverse-engineering the requests the mobile app made.
What they did not know was that the mobile app was sending a “secret” parameter that was not sent by the cheating script. It was easy for Niantic to then identify the cheaters, and a few weeks later, a massive number of players were banned for cheating.
Also, here is an interesting example about someone who reverse-engineered the Starbucks API.
Here is a recap of all the anti-bot techniques we saw in this article:
| Anti-bot technique | Countermeasure | Supported by ScrapingBee |
| --- | --- | --- |
| Browser fingerprinting | Headless browsers | ✅ |
| IP-rate limiting | Rotating proxies | ✅ |
| Banning data center IPs | Residential IPs | ✅ |
| TLS fingerprinting | Forge and rotate TLS fingerprints | ✅ |
| Captchas on suspicious activity | All of the above | ✅ |
| Systematic Captchas | Captcha-solving tools and services | ❌ |
I hope that this overview will help you understand web scraping and that you learned a lot reading this article.
We leverage everything I talked about in this post at ScrapingBee. Our web scraping API handles thousands of requests per second without ever being blocked. If you don’t want to lose too much time setting everything up, make sure to try ScrapingBee. The first 1k API calls are on us :).
We recently published a guide about the best web scraping tools on the market, don't hesitate to take a look!