[NTLK] Scraping the waybackmachine for Newton material - seeking advice

Alex Santos santoscork at me.com
Tue Feb 4 16:54:51 EST 2020


I am using a batch script and the wayback_machine_downloader to download old sites that might have once carried the following types of files:
pdf, hqx, sit, dd, pkg, abs, bin, sea, cpt, dmg, txt
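
For anyone who wants to try the same filtering outside of wayback_machine_downloader, here is a minimal Python sketch of the idea. It is not the batch script described above. It assumes the Wayback Machine's public CDX endpoint as the way to list captures, and the function names (`cdx_query_url`, `has_wanted_extension`, `snapshot_url`) are made up for illustration. It only builds URLs and filters by extension; fetching is left out.

```python
from urllib.parse import urlencode, urlsplit

# File extensions listed in the post.
EXTENSIONS = ("pdf", "hqx", "sit", "dd", "pkg", "abs", "bin", "sea", "cpt", "dmg", "txt")

# Assumption: the public Wayback Machine CDX endpoint is used to list captures.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(site: str) -> str:
    """Build a CDX query that lists successful captures under a site."""
    params = {
        "url": site,
        "matchType": "domain",       # include subdomains of the site
        "output": "json",
        "fl": "timestamp,original",  # only the fields needed to rebuild a link
        "filter": "statuscode:200",  # skip redirects and errors
        "collapse": "urlkey",        # one row per unique URL
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def has_wanted_extension(url: str, extensions=EXTENSIONS) -> bool:
    """True if the URL path ends in one of the wanted extensions (case-insensitive)."""
    path = urlsplit(url).path.lower()
    return any(path.endswith("." + ext) for ext in extensions)


def snapshot_url(timestamp: str, original: str) -> str:
    """Link to the raw capture; the id_ suffix asks for the file without the toolbar."""
    return f"https://web.archive.org/web/{timestamp}id_/{original}"
```

Each row returned by the query would be passed through `has_wanted_extension` before building a download link with `snapshot_url`. Adding another filetype is then a one-word change to `EXTENSIONS`.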

A lot of material comes down. The reason behind this project is to expose files that may be obscured by the navigation required to view a site’s history. Trying to find files manually would be an impossible challenge. Eventually my findings will go online for the public to consume. Ironically, I might very well return them to the Internet Archive as consumable downloads, one per site, and I may target the Macintosh Garden and/or approach UNNA to see whether they want to review the material and put it online. If the Wayback Machine captured it, it’s downloadable.

At the moment I have roughly 1000 URLs to process. Some of those will surely share the same top-level domain (TLD) and differ only by subdomain, so there is a lot to get through, but it’s easy for me to set up a list, batch process these, and just let it run for days capturing files.
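
Since a list of that size will contain repeats, a small cleanup pass before the batch run can save days of redundant downloading. The following is a rough Python sketch under a few assumptions: the list is one entry per line, entries may or may not include a scheme, and a leading `www.` is treated as the same site. The helper names `host_of` and `unique_hosts` are hypothetical. Subdomains are deliberately kept as separate entries, since they may hold different files.

```python
from urllib.parse import urlsplit


def host_of(entry: str) -> str:
    """Return the lowercased host of a list entry, without a leading 'www.'."""
    entry = entry.strip()
    if "://" not in entry:
        # Assumption: bare entries such as "example.com/files" are plain HTTP sites.
        entry = "http://" + entry
    host = urlsplit(entry).hostname or ""
    return host[4:] if host.startswith("www.") else host


def unique_hosts(entries):
    """Keep the first occurrence of each host, preserving the original order."""
    seen = {}
    for entry in entries:
        host = host_of(entry)
        if host:  # skip blank or unparseable lines
            seen.setdefault(host, None)
    return list(seen)
```

For example, `unique_hosts(["http://www.Example.com/a", "example.com/b", "https://docs.example.com"])` yields `["example.com", "docs.example.com"]`, and that list is what would be fed to the downloader one host at a time.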

The question that I have to ask before I go through this is whether there are other filetypes that I should capture. Were Newton packages distributed primarily as pkg files (though these would cross over to Mac OS X as well)? Simply put, should I capture any other filetypes beyond what I noted at the top?

Also, does anyone want me to upload to their FTP server? I do have an FTPES (encrypted with a cert) server running, so if you are with UNNA or otherwise and would like access to these files, I could create an account on my FTP server so that you can download them.

Ah, before I forget, are there any old URLs or companies from back in the day that I should prioritize or be sure to include? I already downloaded the mo site (Motorola), and did so up to the 2003 mark, in the hope of finding a PDF of the Motorola Marco user manual, but that wasn’t to be found. So if you know of any material that once existed on some site, I can certainly check whether it is on the Wayback Machine and download it to boot.

Hope this interests folks. The main purpose of this is to expose any and all files that are thought to be lost but which might otherwise already exist.

