Crawl a web page, extract all domains and resolve them to IP addresses


Today I did a practical bash exercise: crawl a web page, extract all domains and resolve them to IP addresses with bash and common GNU/Linux tools.




domains=$(curl $url -s | grep -E 'https?://[^"]*' | cut -d '/' -f 3 | cut -d '"' -f 1 | uniq)


for domain in $domains
do
    dig +noall +answer $domain | awk '/\sA\s/ {print $5}' >> $filename
done

cat $filename | sort -u


curl $url -s: I used curl rather than wget because we don't need to store the downloaded web page. The -s option (--silent) hides the progress meter and error messages.

grep -E 'https?://[^"]*': -E (--extended-regexp) is required for patterns such as s?, which lets us match http as well as https. For the sake of simplicity and efficiency I don't handle URLs without a protocol, URLs whose protocol is capitalized, etc. [^"]* matches every character except ", so the match stops right before the "> that closes the attribute.
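One detail worth illustrating: without -o, grep prints the whole matching line, so the cut steps do the real extraction and only the first URL of a line can be picked up. Below is a minimal sketch of a variant using -o (--only-matching), run on a made-up HTML snippet. The html variable, its contents and the extract_domains helper are assumptions for illustration, not part of the original script.

```shell
#!/usr/bin/env bash
# Hypothetical sample page; not taken from the original exercise.
html='<p><a href="https://example.com/a">a</a> <a href="http://example.org/b?x=1">b</a></p>
<img src="/static/logo.png"><a href="https://example.net">c</a>'

# -o prints each match on its own line, so two URLs on one line are both kept
# and anything before the URL (such as /static/logo.png) is dropped.
extract_domains() {
  grep -oE 'https?://[^"]*' | cut -d '/' -f 3
}

domains=$(printf '%s\n' "$html" | extract_domains)
printf '%s\n' "$domains"
```

On this sample, the grep -E ... | cut ... | cut ... chain from the script would return logo.png for the second line, because the third /-separated field of that line sits before the URL; the -o variant avoids that at the cost of one extra option.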

cut -d '/' -f 3: splits each line on / and keeps only the third field, i.e. what comes right after http://.

cut -d '"' -f 1: splits on " and keeps the first field, which cleans up leftovers such as "> that [^"]* couldn't prevent.

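uniq: drops repeated lines, but only when they are adjacent. Since the list isn't sorted first, a domain that shows up again further down the page is kept and simply gets resolved twice; the final sort -u removes the resulting duplicate IP addresses anyway.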
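To make the behaviour of uniq on unsorted input concrete, here is a small sketch comparing it with sort -u. The list of domains is made up for the example.

```shell
#!/usr/bin/env bash
# Made-up list with one adjacent and one non-adjacent duplicate.
list='example.com
example.org
example.com
example.com'

with_uniq=$(printf '%s\n' "$list" | uniq)      # collapses only the adjacent pair
with_sort=$(printf '%s\n' "$list" | sort -u)   # removes every duplicate

printf 'uniq:\n%s\n\nsort -u:\n%s\n' "$with_uniq" "$with_sort"
```

uniq still leaves example.com twice, while sort -u keeps a single copy of each domain, which would save a few dig lookups on pages that repeat the same links.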

filename: I stored the output of dig in a file because it was far easier to process. Inside the loop, concatenating the output into a string didn't keep the line feeds for me, and since one domain can resolve to several IP addresses, even storing the results in a bash array gives unexpected results when dig returns more than one address. Of course I could have simply printed the results, but then I couldn't have sorted and filtered them. Note that >> appends, so the file has to be emptied or removed between runs.

dig +noall +answer $domain: I was more or less forced to use dig rather than the cleaner drill that ships with Arch Linux. Here we only want the IP addresses, and parsing the drill output to keep just the answer lines would have been difficult: it would need a complex regexp, because multi-line matching combined with group capture is hard. The easier way is to rely on options that dig has and drill doesn't. Since Arch Linux migrated from dnsutils to ldns (for good reasons), I had to install dig with pacman -S bind-tools. With +noall +answer, dig displays only the answer section of the DNS response.

awk '/\sA\s/ {print $5}': we only want the A records (not CNAME, for example), so I used /\sA\s/ to match those lines. At first I did this filtering with grep -E '\sA\s', but since I needed awk for the next step anyway, I put the regexp directly in awk: awk '/\sA\s/ {print $5}' is cleaner than grep -E '\sA\s' | awk '{print $5}'. I say I needed awk rather than cut (which would have been easier) because cut's delimiter mechanism only works when the number of delimiters is constant. Depending on the length of the domain, dig outputs sometimes one tab and sometimes several, so the position of the column we want varies, and cut has no way to drop empty fields or to treat consecutive delimiters as one. So I used awk, which handles this well with {print $5}.
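The difference between awk and cut on tab-padded output can be sketched offline with canned text that imitates dig +noall +answer. The names and addresses below are made up (documentation-range IPs). The example also swaps the regexp for a field comparison, $4 == "A", since \s is a GNU awk extension and may not work in other awk implementations; that swap is a suggestion, not what the script uses.

```shell
#!/usr/bin/env bash
# Canned text imitating `dig +noall +answer` output (made-up names and
# documentation-range addresses), so the example runs without network access.
sample=$(printf '%s\n' \
  $'www.example.com.\t300\tIN\tCNAME\texample.com.' \
  $'example.com.\t\t300\tIN\tA\t192.0.2.10' \
  $'example.com.\t\t300\tIN\tA\t192.0.2.11')

# awk splits on runs of blanks, so the doubled tab does not shift the fields.
ips=$(printf '%s\n' "$sample" | awk '$4 == "A" {print $5}')

# cut treats every single tab as a separator, so on the lines with a doubled
# tab, field 5 is the record type instead of the address.
cut_out=$(printf '%s\n' "$sample" | cut -f 5)

printf 'awk:\n%s\n\ncut -f 5:\n%s\n' "$ips" "$cut_out"
```

Here awk returns the two addresses, while cut -f 5 returns example.com. for the CNAME line and A for the two A lines.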

sort -u: obviously sort will sort results and I used -u (--unique) to keep only unique results without having to pipe sort to uniq.
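Putting the pieces together, here is a minimal sketch of one possible variant of the whole script that pipes the loop straight into sort -u instead of going through a file. Taking the URL as the first argument, the function names resolve_all and main, sorting the domains with sort -u, and the $4 == "A" field test (used in place of /\sA\s/) are all assumptions made for illustration; the original script relies on the $url and $filename variables.

```shell
#!/usr/bin/env bash
# Sketch only: resolve_all and main are hypothetical helper names.

# Read domain names on stdin, print the A-record addresses dig returns.
resolve_all() {
  while IFS= read -r domain; do
    [ -n "$domain" ] || continue          # skip empty lines
    # $4 == "A" is a swapped-in equivalent of the /\sA\s/ filter
    dig +noall +answer "$domain" | awk '$4 == "A" {print $5}'
  done
}

main() {
  local url=$1
  curl -s "$url" \
    | grep -E 'https?://[^"]*' \
    | cut -d '/' -f 3 \
    | cut -d '"' -f 1 \
    | sort -u \
    | resolve_all \
    | sort -u
}

# Run only when a URL is given, e.g.: ./resolve.sh https://example.com
if [ "$#" -gt 0 ]; then
  main "$1"
fi
```

Because the loop writes to a pipe, every line dig prints reaches sort -u unchanged, so a domain with several addresses is handled without a temporary file and nothing accumulates between runs.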



The script is quite short but very dense. In general, bash scripts are all about piping commands, filtering results and finding the right options.