
URL search: CDX server API

VascoRatoFCCN edited this page Dec 5, 2023 · 51 revisions

Arquivo.pt supports the CDX Server API.

The CDX server API provides programmatic access to list, sort, and filter the preserved pages (captures) of a given URL.

url

The only required parameter of the CDX server API is url. For example, a query with ?url=publico.pt will return a list of captures for 'publico.pt'.
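A minimal sketch of building such a query in Python. The endpoint is the one used in the example URLs on this page; the helper name build_cdx_url is hypothetical and not part of any Arquivo.pt client library.

```python
from urllib.parse import urlencode

# Endpoint as it appears in the example URLs on this page.
CDX_ENDPOINT = "https://arquivo.pt/wayback/cdx"


def build_cdx_url(url, **params):
    """Build a CDX query URL (hypothetical helper); only `url` is required."""
    query = {"url": url}
    # Skip options left as None so that only explicit parameters are sent.
    query.update({k: v for k, v in params.items() if v is not None})
    # doseq=True repeats a key for list values; `safe` keeps URLs readable.
    return CDX_ENDPOINT + "?" + urlencode(query, doseq=True, safe=":/*")


print(build_cdx_url("publico.pt"))
# prints https://arquivo.pt/wayback/cdx?url=publico.pt
```

The resulting string can be passed to any HTTP client; the request itself is left out so the sketch runs offline.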

from / to

Setting from= or to= will restrict the results to the given date/time range (inclusive).

Timestamps may have up to 14 digits (YYYYMMDDhhmmss). Shorter values are padded to the lower bound for from and to the upper bound for to. For example, ?url=sapo.pt&from=2014&to=2014 will return results of sapo.pt that have a timestamp between 20140101000000 and 20141231235959.
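The padding rule can be illustrated with a small sketch. It assumes that a short timestamp is simply completed with the lowest or highest value for each missing component; the helper pad_timestamp and both templates are illustrative and not taken from the server's code.

```python
# Templates supplying the missing digits (assumed, for illustration only).
LOWER_TEMPLATE = "00000101000000"  # January 1st, 00:00:00
UPPER_TEMPLATE = "99991231235959"  # December 31st, 23:59:59


def pad_timestamp(ts, upper=False):
    """Pad a partial timestamp (whole components, e.g. YYYY or YYYYMM) to 14 digits."""
    if not ts.isdigit() or len(ts) > 14:
        raise ValueError("timestamp must be 1 to 14 digits")
    template = UPPER_TEMPLATE if upper else LOWER_TEMPLATE
    # Keep the given digits and take the remainder from the template.
    return ts + template[len(ts):]


print(pad_timestamp("2014"))              # prints 20140101000000
print(pad_timestamp("2014", upper=True))  # prints 20141231235959
```

The upper bound always uses day 31, so a value such as 201402 pads to a string that is not a real calendar date but still works as an inclusive upper limit when timestamps are compared as strings.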

matchType

The CDX server supports the following matchType values:

  • exact -- default setting, will return captures that match the url exactly

  • prefix -- return captures that begin with a specified path, eg: http://sapo.pt/noticias/*

  • host -- return captures that belong to the given host (the path segment is ignored if specified)

  • domain -- return captures for the current host and all subdomains, eg. *.example.com

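Instead of specifying a separate matchType parameter, wildcards may be used in the url: a trailing * selects prefix matching (as in http://sapo.pt/noticias/*), and a leading *. selects domain matching (as in *.example.com).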
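As a sketch of how the wildcard forms relate to matchType, the following hypothetical helper maps a url value to an explicit (url, matchType) pair. It assumes the two wildcard shapes shown in the bullets above (a trailing * for prefix, a leading *. for domain); the server's actual parsing may differ.

```python
def split_wildcard(url):
    """Translate a wildcard url into (url, matchType); hypothetical helper."""
    if url.startswith("*."):
        # Leading "*." means the host plus all of its subdomains.
        return url[2:], "domain"
    if url.endswith("*"):
        # Trailing "*" means every capture whose URL starts with this path.
        return url[:-1], "prefix"
    # No wildcard: the default exact match.
    return url, "exact"


print(split_wildcard("http://sapo.pt/noticias/*"))
print(split_wildcard("*.example.com"))
```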

limit

Setting limit caps the number of index lines returned. The value must be a positive integer no greater than 100,000. If no limit is provided, up to 100,000 matching lines are returned, which may be slow.

Example: https://arquivo.pt/wayback/cdx?url=http://www.sapo.pt/noticias/&matchType=prefix&limit=1500 will show the first 1500 results.
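A client may want to validate the value before sending a request. The check below is a sketch based only on the range stated above; check_limit is a hypothetical name, and the server may react to out-of-range values differently.

```python
MAX_LIMIT = 100_000  # upper bound stated in this section


def check_limit(limit):
    """Return `limit` unchanged if it is an integer in 1..100,000."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(limit, bool) or not isinstance(limit, int):
        raise TypeError("limit must be an integer")
    if not 1 <= limit <= MAX_LIMIT:
        raise ValueError("limit must be between 1 and 100,000")
    return limit


print(check_limit(1500))  # prints 1500
```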

sort

The sort param can be set as follows:

  • reverse: sorts the matching captures in reverse order. It is only recommended for exact queries, as reversing a large result set may be very slow.

  • closest: this option also requires setting closest=<timestamp>, where <timestamp> is the specific timestamp to sort by. It only works correctly for exact queries and is useful for sorting captures by their time distance from a given timestamp.
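To make "time distance" concrete, the sketch below orders a list of 14-digit timestamps by their absolute distance from a target, on the client side. It illustrates the idea only and is not the server's implementation; the timestamps are made up.

```python
from datetime import datetime


def parse_ts(ts):
    """Convert a 14-digit timestamp (YYYYMMDDhhmmss) to a datetime."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def sort_by_closest(timestamps, target):
    """Order timestamps by absolute time distance from `target`."""
    target_dt = parse_ts(target)
    return sorted(timestamps, key=lambda ts: abs(parse_ts(ts) - target_dt))


captures = ["20100101000000", "20140601000000", "20160301000000"]
print(sort_by_closest(captures, "20150101000000"))
# prints ['20140601000000', '20160301000000', '20100101000000']
```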

output (JSON output)

Setting output=json returns each line as a JSON dictionary. The default format is text, which returns the native format of the underlying CDX index and may not be consistent. Using output=json is recommended for extensive analysis.

Example: https://arquivo.pt/wayback/cdx?url=publico.pt&output=json
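A sketch of reading such a response in Python, assuming the body contains one JSON object per line as described above. The sample lines are invented for illustration and are not real Arquivo.pt data; parse_cdx_json is a hypothetical helper.

```python
import json


def parse_cdx_json(body):
    """Parse a response body that holds one JSON object per line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


# Made-up sample lines for illustration, not real Arquivo.pt data.
sample_body = "\n".join([
    '{"url": "http://example.com/", "timestamp": "20140101000000", "status": "200"}',
    '{"url": "http://example.com/a", "timestamp": "20140102000000", "status": "404"}',
])

for capture in parse_cdx_json(sample_body):
    print(capture["timestamp"], capture["status"], capture["url"])
```

Parsing line by line keeps memory use low for large result sets, since each capture can be handled as soon as its line is read.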

filter

The filter param can be specified multiple times to filter by specific fields in the cdx index. Field names correspond to the fields returned in the JSON output. Filters can be specified as follows:

?url=publico.pt/*&filter==mime:text/html&filter=!=status:200 returns captures from publico.pt/* where mime is text/html and the HTTP status is not 200.

The ! modifier before =status indicates negation. The = and ~ modifiers are optional and specify exact and regular-expression matches, respectively. The default (no modifier) tests whether the expression is contained in the field value. Negation and the exact/regex modifiers may be combined, e.g. filter=!~text/.*

The formal syntax, consistent with the examples on this page, is: filter=[!][=|~]<fieldname>:<expression> with the following modifiers:

| modifier(s) | example | description |
| --- | --- | --- |
| (no modifier) | filter=mime:html | field "mime" contains string "html" |
| = | filter==mime:text/html | exact match: field "mime" is "text/html" |
| ~ | filter=~mime:.*/html$ | regex match: expression matches beginning of field "mime" (cf. re.match) |
| ! | filter=!mime:html | field "mime" does not contain string "html" |
| != | filter=!=mime:text/html | field "mime" is not "text/html" |
| !~ | filter=!~mime:.*/html | expression does not match beginning of field "mime" |
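The modifier semantics in the table can be reproduced locally, for instance to post-filter captures that were already downloaded. This is a sketch of the described behaviour, not the server's code; parse_filter and matches are hypothetical names, and the filter string is the value that follows filter= in the query.

```python
import re


def parse_filter(spec):
    """Split a filter value such as '!=status:200' into its parts."""
    negate = spec.startswith("!")
    if negate:
        spec = spec[1:]
    if spec.startswith("="):
        mode, spec = "exact", spec[1:]
    elif spec.startswith("~"):
        mode, spec = "regex", spec[1:]
    else:
        mode = "contains"
    field, _, expression = spec.partition(":")
    return negate, mode, field, expression


def matches(record, spec):
    """Return True if `record` (a dict of CDX fields) passes the filter."""
    negate, mode, field, expression = parse_filter(spec)
    value = record.get(field, "")
    if mode == "exact":
        hit = value == expression
    elif mode == "regex":
        # re.match anchors at the beginning of the field, as in the table.
        hit = re.match(expression, value) is not None
    else:
        hit = expression in value
    return hit != negate


record = {"mime": "text/html", "status": "404"}
print(matches(record, "=mime:text/html"), matches(record, "!=status:200"))
# prints True True
```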

fields

The fields param can be used to specify which fields to include in the output. The standard available fields are: urlkey, timestamp, url, mime, status, digest, length, offset, filename

Multiple fields are comma-delimited. For example, ?url=publico.pt&fields=url,timestamp,status will only include the url, timestamp, and status fields in the output.
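A short sketch of assembling the fields value on the client side, checking names against the standard fields listed above. fields_param is a hypothetical helper; the server may accept additional field names that this check would reject.

```python
# Standard field names as listed in this section.
STANDARD_FIELDS = (
    "urlkey", "timestamp", "url", "mime", "status",
    "digest", "length", "offset", "filename",
)


def fields_param(fields):
    """Join field names into the comma-delimited value for `fields=`."""
    unknown = [name for name in fields if name not in STANDARD_FIELDS]
    if unknown:
        raise ValueError("unknown field(s): " + ", ".join(unknown))
    return ",".join(fields)


print(fields_param(["url", "timestamp", "status"]))
# prints url,timestamp,status
```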