Wikipedia:Bots/Requests for approval/InceptionBot
- The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.
Operator: Bamyers99 (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)
Time filed: 00:34, Saturday November 2, 2013 (UTC)
Automatic, Supervised, or Manual: Automatic
Programming language(s): PHP
Source code available: Github
Function overview: Creates WikiProject new article lists.
Links to relevant discussions (where appropriate):
Edit period(s): Daily
Estimated number of pages affected: 2 per WikiProject × 430 currently configured WikiProjects = 860 userspace pages
Exclusion compliant (Yes/No): No
Already has a bot flag (Yes/No):
Function details: For each configured WikiProject, read a rule page containing regular expressions. Match the regexes against pages created in the last day. For each page that matches, update one page with the matches and another page with the article scoring. Both pages are in user space.
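In outline, the matching step amounts to something like the following sketch. The rule patterns, weights, and variable names here are illustrative assumptions, not taken from the bot's source; the real code is on GitHub.
<syntaxhighlight lang="php">
<?php
// Illustrative sketch of the per-project matching step. Rules come from the
// project's rule page; each rule pairs a weight with a regex. Hypothetical data.
$rules = [
    ['weight' => 10, 'pattern' => '/\bforestry?\b/i'],
    ['weight' => 5,  'pattern' => '/\blogging\b/i'],
];
$newPages = [
    'Example article' => 'Lead paragraph about a forest and logging practices...',
];

$matches = [];
foreach ($newPages as $title => $wikitext) { // pages created since the last run
    $score = 0;
    foreach ($rules as $rule) {
        if (preg_match($rule['pattern'], $wikitext)) {
            $score += $rule['weight'];
        }
    }
    if ($score > 0) {
        $matches[$title] = $score; // written to the result and scoring pages
    }
}
</syntaxhighlight>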
Discussion
This is a replacement for TedderBot task 6 (TedderBot 6). Tedder has indicated that he would like to retire the bot here.
- So unless I've misread something, it's only going to edit in userspace. You don't need a BRFA for that. →Σσς. (Sigma) 19:58, 2 November 2013 (UTC)[reply]
- Correct, it only edits predefined userspace pages. Can I get the bot flag set, as it needs to read the contents of >1000 new articles per day? --Bamyers99 (talk) 22:10, 2 November 2013 (UTC)[reply]
- I'm pretty sure 800+ pages per run does not fall under "no approval needed for bot's or botop's userspace". That clause is for small-scale testing, run reports/logs or small one-off tasks. — HELLKNOWZ ▎TALK 22:14, 2 November 2013 (UTC)[reply]
- Actually, bots are allowed pretty much free rein in their userspace: any bot or automated editing process that affects only the operator's or their own userspace (user page, user talk page, and subpages thereof), and which is not otherwise disruptive, may be run without prior approval. But now, the question is whether maintaining a static "directory tree" that contains 800 pages is disruptive. →Σσς. (Sigma) 05:54, 3 November 2013 (UTC)[reply]
- We've never (that I know of) explicitly told a botop that this many pages was okay. We always reply that it's okay for "small-scale testing, running reports/logs or small one-off tasks". If anything, BOTPOL needs an update to reflect the actual practice. — HELLKNOWZ ▎TALK 11:16, 3 November 2013 (UTC)[reply]
I am currently running the bot on my laptop for 18 projects with output to files, then copy/pasting the results to the wiki pages. --Bamyers99 (talk) 01:30, 5 November 2013 (UTC)[reply]
- I take it you're not running Tedder's existing code. Why not? Josh Parris 10:24, 5 November 2013 (UTC)[reply]
- I rewrote it in PHP to see if the regular expression searching would be faster than JIT-compiled Java. It wasn't, but since I had to implement most of the existing functionality for the test, I decided to keep using the new code. I have implemented several enhancements and bug fixes. The PHP code fixes broken lead detection. It only regex-scans pages created since the last run instead of all new pages from the past 7 days. It requests new page content 50 pages at a time instead of one page at a time. I have also created a web page that validates regex rules for updated/new project rules, so instead of having to wait for the next bot run to find invalid regexes, feedback is immediate. That was easy to do in PHP using the same codebase, but not so easy to publish as a web page from the Java code of a standalone program. --Bamyers99 (talk) 02:19, 6 November 2013 (UTC)[reply]
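For illustration, a batched content request against the standard MediaWiki action=query API might look roughly like this. It is a sketch under assumed names (fetchBatch, $allNewTitles), not the bot's actual code.
<syntaxhighlight lang="php">
<?php
// Sketch: fetch wikitext for up to 50 titles per request instead of one
// request per page, using standard MediaWiki action=query parameters.
// Function and variable names are illustrative.
function fetchBatch(array $titles) {
    $url = 'https://en.wikipedia.org/w/api.php?' . http_build_query([
        'action' => 'query',
        'prop'   => 'revisions',
        'rvprop' => 'content',
        'format' => 'json',
        'titles' => implode('|', $titles), // the API accepts up to 50 titles
    ]);
    $data = json_decode(file_get_contents($url), true);
    $pages = [];
    foreach ($data['query']['pages'] as $page) {
        if (isset($page['revisions'][0]['*'])) {
            $pages[$page['title']] = $page['revisions'][0]['*'];
        }
    }
    return $pages;
}

$allNewTitles = ['Example one', 'Example two']; // from the new-page feed
$wikitextByTitle = [];
foreach (array_chunk($allNewTitles, 50) as $chunk) {
    $wikitextByTitle += fetchBatch($chunk);
}
</syntaxhighlight>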
- So, bug fixes and performance improvements. If you're offered a trial, will it be 18 pages? Josh Parris 05:48, 6 November 2013 (UTC)[reply]
- 36 pages, 2 per project. --Bamyers99 (talk) 19:57, 6 November 2013 (UTC)[reply]
Approved for trial (36 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Josh Parris 08:59, 7 November 2013 (UTC)[reply]
- Trial complete. The results look good to me. There are 3 extra edits for the HipHop project because I was testing the wiki update functionality. Here are the issues I had to overcome, for future bot writers' reference. 1) "Unknown error: wikimedia-globalblocking-ipblocked": the Linode VPS's IPv6 range is globally blocked, so I forced curl to use IPv4 with curl_setopt($ch, CURLOPT_IPRESOLVE, CURL_IPRESOLVE_V4). 2) A captcha challenge on an edit containing {{la}}, because the template creates external links and the bot was not yet autoconfirmed. --Bamyers99 (talk) 03:20, 9 November 2013 (UTC)[reply]
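For reference, the IPv4 workaround is a single cURL option. CURLOPT_IPRESOLVE and CURL_IPRESOLVE_V4 are real PHP cURL constants; the surrounding request setup below is only a sketch.
<syntaxhighlight lang="php">
<?php
// Force cURL to resolve and connect over IPv4 so requests avoid the
// globally blocked IPv6 range. The endpoint and other options around the
// CURLOPT_IPRESOLVE line are illustrative.
$ch = curl_init('https://en.wikipedia.org/w/api.php');
curl_setopt($ch, CURLOPT_IPRESOLVE, CURL_IPRESOLVE_V4);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
curl_close($ch);
</syntaxhighlight>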
- It seems you ought to ask for an IP block exemption for yourself and the bot.
- Is there a benefit to using v6 vs v4? My VPS has a static address of each type. The v4 address is not blocked. Don't know why I would need an exemption for myself. My home connection is unrelated to the VPS.
- Can I suggest the edit summaries not be boilerplate? "most recent errors and scoring" when there are no errors is... unhelpful. What comprises an error for this report, as none were reported in this run?
- Summary changed. The only error reported is for a malformed regex.
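A common PHP idiom for detecting a malformed rule regex is that preg_match() returns false when a pattern fails to compile. This is a sketch of that check; the bot's actual validation code may differ.
<syntaxhighlight lang="php">
<?php
// Sketch: preg_match() returns false (with a warning, suppressed here by @)
// when the pattern is malformed, so rules can be validated before the run.
function isValidRegex($pattern) {
    return @preg_match($pattern, '') !== false;
}

var_dump(isValidRegex('/\bforest\b/i')); // bool(true)
var_dump(isValidRegex('/[unclosed/'));   // bool(false): unterminated class
</syntaxhighlight>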
- User:InceptionBot/NewPageSearch/Forestry/errors has some metrics that will be easier to read with a separator other than a space - and could be renamed, because it's more a log than a list of errors, isn't it?
- Separator changed to —. Page renamed to log. Tested on the Forestry log. The log scoring section is absent because I ran the bot twice with the same start date and there were no new pages to add to the results.
- [1] adds two entries, and removes some. The documentation at User:AlexNewArtBot doesn't suggest this would happen. In fact, that documentation is kind of sparse at a meta-level.
- Added Process overview section.
- User:AlexNewArtBot/ForestrySearchResult doesn't give any contextual information when I stumble across it. I suggest a <noinclude> block explaining what the heck it is, including a note to see WhatLinksHere.
- Both the search result and log pages now have a <noinclude> block. Tested on the Forestry results.
- Are negative weightings permitted? They're not documented.
- Now documented.
- Are PHP and Perl regexes completely compatible? Josh Parris 05:19, 9 November 2013 (UTC)[reply]
- PHP regexes are Perl-compatible (PHP's preg functions use the PCRE library).
- Isn't the expensive part downloading the wikitext? For efficiency, shouldn't you be downloading one article and then testing it against every project? Josh Parris 05:22, 9 November 2013 (UTC)[reply]
- All the new article wikitext is downloaded and file-cached at the start of the bot run. The processing-time metric on the log page mostly reflects the regex matching.
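As a rough illustration of that download-once, match-many pattern (hypothetical function name and cache path; not the bot's actual code):
<syntaxhighlight lang="php">
<?php
// Sketch: fetch each new page's wikitext once at the start of the run and
// cache it on disk, so every project's regex pass reads the local copy.
// The cache path and names are hypothetical.
function cachedWikitext($title, $fetcher, $cacheDir = '/tmp/newpage-cache') {
    if (!is_dir($cacheDir)) {
        mkdir($cacheDir, 0777, true);
    }
    $file = $cacheDir . '/' . md5($title) . '.txt';
    if (!file_exists($file)) {
        file_put_contents($file, $fetcher($title)); // download only on a miss
    }
    return file_get_contents($file);
}
</syntaxhighlight>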
- Thanks for taking the time to thoroughly analyze the bot output and documentation. --Bamyers99 (talk) 02:36, 10 November 2013 (UTC)[reply]
Bot operator appears to be in good standing; the task is straightforward, in userspace, and developed competently. Approved. Josh Parris 20:54, 11 November 2013 (UTC)
- The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.