
Recursively extracting nested links from a webpage using Racket

Luca

☕ 6 min. read

This is my first attempt at writing Racket or LISP code. It might be ugly … For now, let's build a web crawler; next, I shall write myself a language. That's where Racket really shines.

The code is here. Thanks to Mike for reviewing it.

Why

I want to translate a website, including recursively reached pages, to a PDF to read on my e-reader. This program does the first step: going from a URL to a list of URLs by recursively navigating the links on the page.

The PS1Script directory contains scripts to perform the other steps (a rough sketch of both follows the list):

  1. Translate all the links to PDF pages (using Chrome headless).
  2. Combine all the PDFs into a single document (using cpdf).
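
As a rough sketch of those two steps driven from Racket rather than PowerShell (this is not the actual PS1Script code; the Chrome binary name, its flags, and the cpdf invocation are assumptions):

; Print a single page to PDF with headless Chrome (binary name and flags assumed).
(define (url->pdf url out)
  (system (format "chrome --headless --disable-gpu --print-to-pdf=~a ~a" out url)))

; Concatenate a list of PDF files into one document with cpdf (invocation assumed).
(define (merge-pdfs pdfs out)
  (system (format "cpdf ~a -o ~a" (string-join pdfs " ") out)))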

Prelude

First you declare which language you are using. Racket is a language to build languages and the language itself is just one of many (try rereading this phrase).

#lang racket

We then expose the main functions of the library: recursively getting all links from a webpage down to a certain nesting level, both as a Racket list and as a newline-delimited string.

; Provides functions to extract links from web pages recursively

(provide
 ; uriString nestingLevels -> list of uriStrings
 ; Doesn't follow links to non-HTML resources or links pointing to a different domain than uriString.
 uri->nestedLinks
 ; uriString nestingLevels -> string of newline-separated URI strings
 uri->nestedLinksNl)

Implementation

This is similar, but not identical, to the C# using statement: you declare which packages are needed.

(require (planet neil/html-parsing:2:0)
         net/url
         xml
         html
         sxml/sxpath
         threading)

The program crawls just the links to HTML files in the same domain as the first URI given to it. It could be extended by having the program take a regular expression (or wildcard expression) to identify which files to leave out of the crawl (a sketch of that extension follows the code below).

(define invalid-suffixes
  '("./" ".xml" ".jpg" ".jpeg" ".png" ".gif" ".tiff" ".psd"
    ".eps" ".ai" ".indd" ".raw" ".svg"))

(define invalid-prefixes '("#" "mailto:" "javascript:"))

(define (different-domain? baseUrl l)
  (define url (string->url l))
  (and (url-host url) (not (equal? (url-host baseUrl) (url-host url)))))

(define (good-link? baseUrl l)
  (not (or (different-domain? baseUrl l)
           (ormap (curry string-suffix? l) invalid-suffixes)
           (ormap (curry string-prefix? l) invalid-prefixes))))
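
The regular-expression extension mentioned above could look roughly like this (good-link?/rx and its exclude-rx argument are hypothetical, not part of the actual module):

; Hypothetical variant of good-link? that also rejects links matching a
; caller-supplied exclusion regexp, e.g. #rx"\\.pdf$" or #rx"/archive/".
(define (good-link?/rx baseUrl l exclude-rx)
  (and (good-link? baseUrl l)
       (not (regexp-match? exclude-rx l))))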

Next we have to parse the HTML. We use XPath for that. Racket is particularly good for XML parsing, as XML maps naturally to s-expressions, which are the bread and butter of the language. I don't enjoy the nesting of expressions that LISP-like languages force onto you (as in the expression below), but see later for a partial solution.

(define (xexp->links xexp) (flatten (map cdr ((sxpath "//a/@href") xexp))))
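
To see what that does, here is a small REPL-style check on an invented snippet (the HTML is made up purely for illustration):

(xexp->links
 (html->xexp "<p><a href=\"/about.html\">About</a> <a href=\"#top\">Top</a></p>"))
; => '("/about.html" "#top")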

The strange ~>> operator in the code below comes from the threading package. This set of macros lets you build pipelines of computations, similarly to the F# |> operator. The initial value comes first and then the functions are applied in series, each one to the result of the previous. By doing that you 'flatten' the nested expressions, making them more readable (at least to my eyes).
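
A tiny illustration, unrelated to the crawler: the nested form (filter even? (map add1 '(1 2 3 4))) can be rewritten as a pipeline, with the value threaded in as the last argument of each step.

(~>> '(1 2 3 4)
     (map add1)       ; '(2 3 4 5)
     (filter even?))  ; => '(2 4)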

This capability of changing the core behavior of the language is something very peculiar to the LISP family, and the reason why I am attracted to Racket in the first place.

This function extracts all the 'good' links from a URL. BTW: I love that you can use -> to name symbols.

(define (url->links url)
  (~>> (call/input-url url get-pure-port html->xexp)
       xexp->links
       (filter (curry good-link? url))))

λ~> is the function composition operator in the threading library. You've got to love embedding lambdas in the code.

(define uri->links
  (λ~> string->url url->links))
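
To make the composition explicit: the definition above behaves roughly like the plain lambda below (the starred name is used here only to avoid clashing with the real definition).

(define uri->links*
  (λ (s) (url->links (string->url s))))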

This is the main recursive workhorse of the program. It works something like this (numbers marked in the code):

  1. Treat links to subparts of a web page as if they were links to the webpage itself.
  2. If it is not a good link, return the links already visited (visited).
  3. Same thing if the link is already in visited.
  4. If we have reached the nesting level specified, add the link to visited and return it.
  5. Otherwise add the link to visited and call yourself on all sublinks on the page.

The function is not tail recursive, but that is not a huge deal in Racket as the stack is very large. It doesn't blow up as easily as in most other languages.

(define (uri->nestedLinks-rec baseUrl uri visited levels)

  (define abs-url (if (string-contains? uri "#")                                 ; <1>
                      (~> uri (string-split "#") first (combine-url/relative baseUrl _))
                      (~>> uri (combine-url/relative baseUrl))))

  (log-info "~a, ~a, ~a:~a~n" (url->string baseUrl)
            levels uri (url->string abs-url))

  (cond [(not (good-link? baseUrl uri)) visited]                                 ; <2>
        [(member abs-url visited) visited]                                       ; <3>
        [(zero? levels) (cons abs-url visited)]                                  ; <4>
        [else (for/fold ([acc (cons abs-url visited)])                           ; <5>
                        ([l (in-list (url->links abs-url))])
                (uri->nestedLinks-rec abs-url l acc (sub1 levels)))]))

Finally we can trivially define the two main functions in the module.

(define (uri->nestedLinks uri levels)
  (reverse (uri->nestedLinks-rec (string->url uri) "" '() levels)))

(define (uri->nestedLinksNl uri levels)
  (define links (uri->nestedLinks uri levels))
  (string-join (map url->string links) "\n" #:after-last "\n"))
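
Used from the REPL, the pair behaves roughly like this (illustrative only; the actual output depends on the site crawled):

; > (uri->nestedLinks "https://example.com" 1)
; a list of url structs, starting with the root page
; > (uri->nestedLinksNl "https://example.com" 1)
; "https://example.com\n..."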

Test

To my great pleasure, Racket allows (encourages?) you to have tests in the same file as the code. They just go into submodules, which can be built up piecewise with the module+ form.

You could add the tests beside each function, but I decided to have a separate section in the file instead. To run them you call raco test FILENAME.

(define (uri->path test-uri)
  (build-path "./data" (~> test-uri first uri->file string->path)))

(define uri->file (λ~> string->url url-host))

(define test-uris '(("https://www.lucabol.com" 3)
                    ("https://beautifulracket.com/" 3)
                    ("https://en.wikipedia.org/wiki/Typeface" 1)
                    ("https://brieferhistoryoftime.com" 3)
                    ("https://mobydick.wales/" 3)
                    ("https://resilientwebdesign.com" 3)
                    ("https://www.c82.net/euclid/" 3)))

(module+ test
  (require rackunit)

I got a bit sloppy not naming my lambdas here … But doesn't the lambda symbol look cool?

  (for-each (λ (test-uri)
              (with-input-from-file
                (uri->path test-uri)
                (λ () (begin
                        (define saved-result (port->string))
                        (define calc-result
                          (uri->nestedLinksNl (first test-uri) (second test-uri)))
                        (check-equal? calc-result saved-result test-uri)))
                #:mode 'text))
            test-uris))

This is used to regenerate the test data. You can then inspect it manually before running the tests.

(define (refresh-test-data)
  (for-each (λ (test-uri)
              (with-output-to-file
                (uri->path test-uri)
                (λ () (display (uri->nestedLinksNl (first test-uri) (second test-uri))))
                #:exists 'replace))
            test-uris))

Main

Main goes into its own submodule as well. Racket is not as pure as Haskell, so you can naturally manage side effects like user input and such. You've got to appreciate the conciseness of the command-line parsing library.

The code below looks a bit odd to me. It could probably be refactored so that the parser expression returns the values instead of filling out parameters.

(module+ main

  (define levels (make-parameter "3"))
  (define uri (make-parameter #f))

  (define parser
    (command-line
     #:program "website-links"
     #:usage-help "Extracts links from a webpage recursively to a specified level."
     #:once-each
     [("-l" "--levels") LEVELS "How many nested levels to process (default 3)." (levels LEVELS)]
     #:args (URI) (uri URI)))

  (display (uri->nestedLinksNl (uri) (string->number (levels)))))
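
Assuming the module is saved as website-links.rkt (the file name here is an assumption), it can then be run from the shell like this:

racket website-links.rkt -l 2 https://example.com

which prints the newline-separated list of links, two levels deep, to standard output.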

Conclusion

I liked Racket very much. It takes a little while to get used to the expression syntax, which is very different from the C-like one most of us are used to. It also takes a while to get used to the style of the documentation, which is written very precisely for the careful reader. We are more used to the 'here is an example, copy it' kind of documentation. For the distracted programmer …
