Wikipedia:Link rot/URL change requests/Archives/2020/October

Racing Post (cont..)

Continuation of the above: the next phase covers www.racingpost.com URLs (vs. bloodstock.racingpost.com).

Conversion types found:

Soft404 ("S404") examples to watch for:

Approach:

  • Check all URLs (~40,000).
  • If a redirect exists and the redirect URL is not a 404 or a known S404, change the URL.
  • If there is no redirect, the status is 200 and the page is not an S404, leave as-is.
  • If there is no redirect and the URL is a 404/S404, attempt to create a new URL using the formulas above, then verify whether it works.
  • If the URL is dead, add an archive URL as a last resort.
  • Use |publisher= for any of the "_id" URLs or anything with /horse|jockey|owner|trainer|results/. Use |work= for everything else.
  • Convert existing |publisher= and |work= to uniform values.
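
The decision steps above could be sketched roughly as follows. This is only an illustration, not the bot's actual code: the Probe record, the probe and rebuild callables, and the action names are all hypothetical. Two points are assumptions rather than anything stated in the list: a redirect whose target is dead falls through to the rebuild step, and the slashes in /horse|jockey|owner|trainer|results/ are read as regex delimiters.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Probe:
    """Result of checking one URL (hypothetical shape, not the bot's)."""
    status: int                         # HTTP status code
    redirect_url: Optional[str] = None  # where the URL redirects, if anywhere
    soft404: bool = False               # page loads but is a known S404


def is_live(p: Probe) -> bool:
    return p.status == 200 and not p.soft404


def decide(url: str,
           probe: Callable[[str], Probe],
           rebuild: Callable[[str], Optional[str]]) -> Tuple[str, Optional[str]]:
    """Return (action, new_url) for one URL, following the steps above."""
    result = probe(url)
    if result.redirect_url:
        # A redirect exists: accept it only if its target is really live.
        if is_live(probe(result.redirect_url)):
            return ("change", result.redirect_url)
    elif is_live(result):
        return ("keep", None)
    # Dead, S404, or (assumption) a redirect to a dead page: try the formulas.
    candidate = rebuild(url)
    if candidate and is_live(probe(candidate)):
        return ("change", candidate)
    return ("archive", None)  # last resort


def citation_param(url: str) -> str:
    """Pick |publisher= or |work= for a Racing Post URL."""
    if "_id" in url or re.search(r"horse|jockey|owner|trainer|results", url):
        return "publisher"
    return "work"
```

Passing probe and rebuild in as callables keeps the flow testable without touching the network; a real run would plug in an HTTP checker and the conversion formulas.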

-- GreenC 03:54, 27 September 2020 (UTC)

Thanks for this. There are also some URLs of the form http://bloodstock.racingpost.com/dam/dam_home.sd?horse_id=439081#damTabs=dam_progeny_sales which are now located at https://www.racingpost.com/profile/horse/439081/thimblerigger/progeny-sales. In those cases, the horse name is required in the URL (with modifications - lowercasing, removal of special characters, conversion of embedded spaces to hyphens), and the horse referred to is not necessarily the one in the article title, so the work probably can't readily be done by bot. But a list of such URLs would be useful so they can be fixed by hand. Colonies Chris (talk) 20:38, 27 September 2020 (UTC)
Redirects exist: try https://www.racingpost.com/profile/horse/439081 .. the bot checks for redirects and uses the redirect URL which includes the horse name. -- GreenC 20:43, 27 September 2020 (UTC)
Yes, but the redirect doesn't handle the tab parameter (#damTabs=dam_progeny_sales); for that, the full URL including the horse name is required, and the dam_progeny_sales parameter has to become a subfolder named progeny-sales. Does the bot handle that? Colonies Chris (talk) 22:33, 27 September 2020 (UTC)
https://www.racingpost.com/profile/horse/439081/thimblerigger/progeny-sales is a dead link (soft404). https://www.racingpost.com/profile/horse/439081 redirects to https://www.racingpost.com/profile/horse/439081/thimblerigger/, which works and is what the bot uses. -- GreenC 23:32, 27 September 2020 (UTC)
Ah, I see: the progeny-sales tab is premium content, so you have to be logged in/subscribed to access it. When not logged in, it returns an S404: https://www.racingpost.com/?authme -- GreenC 23:37, 27 September 2020 (UTC)
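
For anyone fixing these by hand, the name transformation described above (lowercasing, removal of special characters, spaces to hyphens) might look something like this. It is a sketch only: horse_slug and profile_url are hypothetical helper names, and reducing accented letters to plain ASCII is an assumption about how the site builds its URLs.

```python
import re
import unicodedata


def horse_slug(name: str) -> str:
    """Lowercase, drop special characters, hyphenate embedded spaces."""
    # Assumption: accented letters are reduced to plain ASCII first.
    ascii_name = (unicodedata.normalize("NFKD", name)
                  .encode("ascii", "ignore").decode("ascii"))
    cleaned = re.sub(r"[^a-z0-9\s-]", "", ascii_name.lower())
    return re.sub(r"[\s-]+", "-", cleaned).strip("-")


def profile_url(horse_id: int, name: str, tab: str = "") -> str:
    """Build a www.racingpost.com horse profile URL in the form quoted above."""
    url = f"https://www.racingpost.com/profile/horse/{horse_id}/{horse_slug(name)}"
    return f"{url}/{tab}" if tab else url
```

With these, profile_url(439081, "Thimblerigger", "progeny-sales") reproduces the URL from the start of this thread, though that particular tab is subscription-only as noted.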

There were 42 (!) URLs of this type that were converted:

Extended content

-- GreenC 23:53, 27 September 2020 (UTC)

Thanks. I'll sort those out by hand and flag them as subscription required. Colonies Chris (talk) 15:25, 28 September 2020 (UTC)

I've worked out that for URLs of the form http://www.racingpost.com/horses/result_home.sd?race_id=267296&r_date=1999-06-04&popup=yes#results_top_tabs=re_&results_bottom_tabs=ANALYSIS, the correct destination is https://www.racingpost.com/results/10/catterick/1999-06-04/267296. The id and date can be derived from the source URL, but the course (here, "catterick") has to come from context and the digits after /results are a bit of a mystery - I suspect they may correspond to the position in an alphabetical list of courses. I derived the destination URL by entering the date and course in the advanced search facility at https://www.racingpost.com/results/, but I can't see any way this could be automated, unfortunately. Colonies Chris (talk) 22:06, 28 September 2020 (UTC)
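
The mechanical part of that conversion (everything except finding the course) could be sketched like this. The function name is hypothetical, and course_id and course have to be supplied by whoever calls it, since they cannot be read from the old URL.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def convert_result_url(old_url: str, course_id: int, course: str) -> Optional[str]:
    """Map an old result_home.sd URL to the new /results/ form.

    Returns None when race_id or r_date is absent from the query string.
    """
    # urlsplit separates the #fragment, so only the real query is parsed.
    query = parse_qs(urlsplit(old_url).query)
    race_id = query.get("race_id", [None])[0]
    r_date = query.get("r_date", [None])[0]
    if not race_id or not r_date:
        return None
    return (f"https://www.racingpost.com/results/"
            f"{course_id}/{course}/{r_date}/{race_id}")
```

Called with the example URL above, 10 and "catterick", it returns https://www.racingpost.com/results/10/catterick/1999-06-04/267296.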

Ah, interesting. These are also the majority of links (in the old "http://" set), so worth pursuing. The HTML of the archive URL contains <title>Results from the 2.35 race at CATTERICK - 4 June 1999 | Racing Post</title>. By extracting the number/course pairs from the working links, I can determine there are 107 known combinations. For example:
  • 1016 riyadh
  • 104 yarmouth
  • 1079 kempton-aw
  • 107 york
  • 1083 chelmsford-aw
  • 10 catterick
  • 1138 dundalk-aw
  • 1190 pakenham
  • 11 cheltenham
  • 1212 ffos-las
  • 1231 meydan

The numbers and names appear to be consistent, so we now have a map. The final step is extracting the course name from the archive HTML title and finding it on the map. It probably won't always be an exact match, so a fuzzy match might be required. I'll work on it, hopefully this week. -- GreenC 03:06, 29 September 2020 (UTC)
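
A rough sketch of that final step, using only the sample pairs listed above. The helper names, the title regex, and the 0.8 cutoff are illustrative assumptions, and difflib stands in for whatever fuzzy matcher ends up being used.

```python
import difflib
import re
from typing import Dict, Optional, Tuple

# Sample name -> number pairs from the list above; the full map is larger.
COURSES: Dict[str, int] = {
    "riyadh": 1016, "yarmouth": 104, "kempton-aw": 1079, "york": 107,
    "chelmsford-aw": 1083, "catterick": 10, "dundalk-aw": 1138,
    "pakenham": 1190, "cheltenham": 11, "ffos-las": 1212, "meydan": 1231,
}


def course_from_title(html: str) -> Optional[str]:
    """Pull the course out of '<title>... race at COURSE - date | ...'."""
    title = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    if not title:
        return None
    course = re.search(r"race at (.+?) - ", title.group(1), re.IGNORECASE)
    if not course:
        return None
    # Normalise to the map's style: lowercase, spaces to hyphens.
    return re.sub(r"\s+", "-", course.group(1).strip().lower())


def lookup_course(name: str,
                  courses: Dict[str, int] = COURSES,
                  cutoff: float = 0.8) -> Optional[Tuple[str, int]]:
    """Exact match first, then the closest name above the cutoff, else None."""
    if name in courses:
        return name, courses[name]
    close = difflib.get_close_matches(name, list(courses), n=1, cutoff=cutoff)
    if close:
        return close[0], courses[close[0]]
    return None
```

A fairly high cutoff seems safer here, since a wrong course would produce a plausible-looking but wrong URL; unmatched names can simply be logged for hand-fixing.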

That would be great. I've come across yet another format: http://www.racingpost.com/horses/horse_home.sd?horse_id=642105#topHorseTabs=horse_race_record&bottomHorseTabs=horse_form; this one seems to be fairly fixable - it should go to https://www.racingpost.com/profile/horse/642105/sixties-icon/form. Colonies Chris (talk) 09:18, 29 September 2020 (UTC)
  • Hi @Colonies Chris: I uploaded about 100 diffs using the new code with fuzzy matching and course mapping etc. Can you take a look (example). I plan on processing the remaining 2,500 or so articles tomorrow. Thank you! -- GreenC 04:32, 1 October 2020 (UTC)
I've checked a random selection, and found just one problem, in Daylami; it looks like the original URLs were invalid (e.g. url=http://www.racingpost.com/horses/result_home.sd?race_id=243391&r_date=7 September 1997&popup=yes#results_top_tabs=re_&results_bottom_tabs=ANALYSIS), so the bot has misinterpreted them. Otherwise, looking fine. I'm intrigued that the bot has rescued some citations from archives because they're not actually dead? (e.g. in Istabraq). Colonies Chris (talk) 13:32, 1 October 2020 (UTC)
Yes, Daylami was trouble: there were a dozen or two where the URL used dmy dates instead of ISO. I fixed them, reprocessed and posted the diff manually; hopefully that is the only one, as the bot is not checking for them. The Istabraq change is a standard bot function, because editors sometimes misunderstand the purpose of |archive-url=: it's for web archive URLs like web.archive.org or archive.today, but sometimes they put the original URL there, thinking it automatically turns into an archive. By taking the spot, it actually prevents bots from adding an archive URL (when the link dies). -- GreenC 13:40, 1 October 2020 (UTC)
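
A check for the dmy case could be as small as the sketch below. The function name is hypothetical, only the two formats mentioned here are handled, and English month names are assumed (strptime's %B depends on the locale).

```python
from datetime import datetime
from typing import Optional


def to_iso_date(value: str) -> Optional[str]:
    """Normalise an r_date value to YYYY-MM-DD, or return None if unrecognised."""
    value = value.strip()
    for fmt in ("%Y-%m-%d", "%d %B %Y"):  # ISO first, then dmy as in Daylami
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

Returning None rather than guessing lets unrecognised dates be logged and skipped instead of being turned into a bad URL.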
Just one little quirk I've noticed; where the bot adds work=[[Racing Post]], it seems to consistently add it between the |last= and |first= parameters of the author name. Not fundamentally a problem, of course, but a little strange from the point of view of some later editor. Could it be placed elsewhere? Colonies Chris (talk) 16:28, 1 October 2020 (UTC)
Well, it's not intentional or consistent; like in this case, where it first removed |newspaper= then added |work=. It should be programmed to add it following the |url=, but there are some conditions (like when |url= is the last argument) where it will instead add it following the first argument, and if the first argument is |last=, that's probably what happened. I'd have to see an example to know for sure. -- GreenC 16:52, 1 October 2020 (UTC)
Here are a couple of examples: Critérium International (horse race), Triumph Hurdle. There are quite a few like those. Colonies Chris (talk) 17:38, 1 October 2020 (UTC)
I know this seems like an easy fix but it gets into the core functions of a library. I'd rather not go into the code and want to finish today assuming no serious problems arise. Also half are already processed and waiting to upload diffs and there is no way to determine where this happened without reprocessing and logging the entire set which would be a lot of lost work. It's a side effect of the first and last being used often in those positions and the way the software library works. For the second half I'll try telling it to follow the title instead of url maybe that will make a difference. -- GreenC 18:51, 1 October 2020 (UTC)
Was able to change the placement to second-to-last instead of second-from-first. -- GreenC 01:45, 2 October 2020 (UTC)
Missing courses in the map
  • The map above is incomplete because that is all we had on Wikipedia to test against. During the run it found 374 URLs that have no map entry but for which it was able to determine the course name. They comprise 75 courses:
Extended content

  • aqueduct 255
  • arlington-park 276
  • auteuil 205
  • baden-baden 207
  • ballingarry ?
  • bangor-on-dee 4
  • brighton 7
  • cagnes-sur-mer 216
  • camden ?
  • cartmel 9
  • caulfield 469
  • chester 13
  • churchill-downs 308
  • clonmel 177
  • cologne 226
  • compiegne 291
  • delaware-park 248
  • del-mar 444
  • delta-downs ?
  • doha 1196
  • doomben 467
  • downpatrick 179
  • down-royal 180
  • dusseldorf 240
  • ellis-park 638
  • evry ?
  • exeter 14
  • fair-grounds 742
  • fair-hill ?
  • fakenham 18
  • flemington 297
  • folkestone 19
  • frankfurt 231
  • hawthorne 604
  • hollywood-park ?
  • hoppegarten 440
  • huntingdon 26
  • kenilworth 508
  • kranji 794
  • la-zarzuela 449
  • le-lion-d'angers 313
  • lone-star-park 674
  • los-alamitos 1307
  • ludlow 34
  • lyon-parilly 541
  • market-rasen 35
  • monmouth-park 253
  • moonee-valley 299
  • musselburgh 16
  • nad-al-sheba 483
  • nancy 559
  • newton-abbot 39
  • parx 578
  • pimlico 221
  • pisa 284
  • plumpton 44
  • prairie-meadows 808
  • quakerstown ?
  • randwick 471
  • rosehill 311
  • saint-brieuc 713
  • santa-anita 257
  • saratoga 445
  • sedgefield 57
  • southwell 61
  • stratford 67
  • taby 271
  • tampa-bay-downs 724
  • taunton 73
  • towcester 83
  • turin ?
  • uttoxeter 84
  • wincanton 90
  • wissembourg ?
  • worcester 101

If an ID number were found for each, the bot could convert those URLs from archived to live. The IDs could probably be found by searching for the course on the website. -- GreenC 01:45, 2 October 2020 (UTC)

Results
  • Edits to 2,202 articles
  • 8,231 changes to metadata
  • Switch 5,664 URLs from old to new site
  • Add 1,434 archive URLs
  • Add 179 {{dead link}}
  • Convert 118 bare to square links
  • Removed 4 archive URLs

-- GreenC 01:45, 2 October 2020 (UTC)

I've edited the list of racecourses above to add the reference numbers, where I could determine them (? indicates the ones I couldn't). Colonies Chris (talk) 10:56, 5 October 2020 (UTC)
@Colonies Chris: Wonderful! Example. Much better. Looks like 338 URLs in 107 articles converted from archives to the new live form. There are still around 200 race_id URLs unconverted but they are probably among the "?" or because the track is not identified in the archive URL title field. There were a few mistakes this run for mysterious reasons, I reverted them but probably related to site timeouts and how convoluted their headers are. Open the hood or take down the walls and find the craftwork underneath and often it's strung together in strange ways, much like the Internet. -- GreenC 17:57, 5 October 2020 (UTC)
@GreenC: That's great. I've fixed a couple of those by hand, and I'll take a look at the race_id ones to see if I can find any way of tracking down the intended pages. The search-style ones can be located by using the horse name in the search facility at https://www.racingpost.com/results/, but it's often necessary to pick from several results, so that will have to be a manual process. Fortunately there aren't many of those. I suspect the 'story=' ones are gone completely and will just have to rely on archived versions. Colonies Chris (talk) 20:01, 5 October 2020 (UTC)
@GreenC: I've managed to track down the remaining course ids - Wissembourg 750; Turin 295; Quakerstown 1256; Hollywood Park 259; Fair Hill 307; Evry 208; Delta Downs 923; Camden 230; Ballingarry 372. Colonies Chris (talk) 08:30, 6 October 2020 (UTC)
Thanks, and done. It only fixed 22 URLs with the new map additions, adding some for delta-downs, evry, hollywood-park, quakerstown, turin and wissembourg, but none for 'camden', for example. In Lonesome Glory there is an old URL which the archive.org header identifies as 'Camden (USA)', and the URL https://www.racingpost.com/results/207/camden/1994-11-13/60294 is 404, so I'm assuming any it couldn't match are legitimate 404s. -- GreenC 20:15, 6 October 2020 (UTC)
I've tried searching for some of the race_id ones, and found them all so far: e.g. http://www.racingpost.com/horses/result_home.sd?race_id=602790 --> https://www.racingpost.com/results/231/frankfurt/2014-05-11/602790, but that conversion only works if the racecourse name is available to the bot, so I suppose it'll just have to be hand-fixing for those. Colonies Chris (talk) 09:46, 7 October 2020 (UTC)
The bot has the frankfurt mapping in the collapsed box above, but the source URL is missing the date r_date= -- GreenC 13:51, 7 October 2020 (UTC)

When TV by the Numbers became defunct this past January, all of its ratings URLs became dead URLs. The main URL, https://tvbythenumbers.zap2it.com, now just redirects to https://tvlistings.zap2it.com/?aid=gapzap (just the TV Listings). For example, http://tvbythenumbers.zap2it.com/2016/09/22/wednesday-final-ratings-sept-21-2016 redirects to that same listings page. Is it possible for a bot to fix this problem? The dead TV by the Numbers URLs affect a lot of American television series articles. — YoungForever(talk) 21:54, 25 October 2020 (UTC)
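
Since every old link now lands on the same listings page, a redirect to that page can be treated as a dead link rather than a live one. A minimal sketch of such a check follows; the function name is hypothetical, and the caller is assumed to have already followed the redirect and to pass in the final URL.

```python
from urllib.parse import urlsplit

# Landing page that old TV by the Numbers links now redirect to (see above).
SOFT404_TARGETS = {"https://tvlistings.zap2it.com/?aid=gapzap"}


def is_soft404_redirect(original_url: str, final_url: str) -> bool:
    """True if a tvbythenumbers link ended up on the generic listings page."""
    host = urlsplit(original_url).hostname or ""
    return host == "tvbythenumbers.zap2it.com" and final_url in SOFT404_TARGETS
```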

@YoungForever: What would the fix be: a different URL at tvlistings.zap2it.com, or an archive URL? -- GreenC 23:16, 25 October 2020 (UTC)
Or you just said dead site so I guess archives. -- GreenC 23:17, 25 October 2020 (UTC)
@GreenC: Archive urls of the dead urls of TV by the Numbers. — YoungForever(talk) 23:37, 25 October 2020 (UTC)
Starting to process over 8,000 articles.. this is when I go watch TV :) -- GreenC 00:22, 26 October 2020 (UTC)
@GreenC: Lol. I want to point out that they also affect the List of episodes and/or season articles as well. — YoungForever(talk) 02:12, 26 October 2020 (UTC)

Results

Completed:

  • Checked 8,107 articles containing tvbythenumbers.zap2it.com
  • Edited 7,014 articles (the difference is articles whose links were already archived)
  • Added archives for 43,635 URLs (cites, bare and square links)
  • Unable to find archives for 211 links, added {{dead link}} (list available on request)
  • Converted 13 instances of {{TV by the Numbers}} to square links with archives.

@YoungForever: If you see anything it missed let me know. Good find, 43k is a lot. -- GreenC 23:29, 26 October 2020 (UTC)

@GreenC: Thank you very much! I will let you know if I see anything that your bot missed. — YoungForever(talk) 23:54, 26 October 2020 (UTC)