User talk:Chire/Archive 1

This is an archive of past discussions with User:Chire. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.

Welcome

Welcome!

Hello, Chire2, and welcome to Wikipedia! Thank you for your contributions. I hope you like the place and decide to stay. Here are some pages that you might find helpful:

The five pillars of Wikipedia
Tutorial
How to edit a page and How to develop articles
How to create your first article (using the Article Wizard if you wish)
Manual of Style

I hope you enjoy editing here and being a Wikipedian! Please sign your messages on discussion pages using four tildes (~~~~); this will automatically insert your username and the date. If you need help, check out Wikipedia:Questions, ask me on my talk page, or ask your question on this page and then place {{helpme}} before the question. Again, welcome! Dawnseeker2000 19:31, 11 May 2010 (UTC)

Proposed deletion of Environment for DeveLoping KDD-Applications Supported by Index-Structures

The article Environment for DeveLoping KDD-Applications Supported by Index-Structures has been proposed for deletion because of the following concern:

WP:Notability: No significant coverage in reliable sources that are independent of the subject. Current references were all written by the developers.

While all contributions to Wikipedia are appreciated, content or articles may be deleted for any of several reasons.

You may prevent the proposed deletion by removing the {{dated prod}} notice, but please explain why in your edit summary or on the article's talk page.

Please consider improving the article to address the issues raised. Removing {{dated prod}} will stop the proposed deletion process, but other deletion processes exist. The speedy deletion process can result in deletion without discussion, and articles for deletion allows discussion to reach consensus for deletion. Qwfp (talk) 13:13, 29 May 2010 (UTC)

{{helpme}} What is the appropriate way to stop this overzealous deletion request?

This is not a commercial product, but research software. Both the institution Ludwig-Maximilians-Universität München and the leading Professor Hans-Peter Kriegel definitely qualify as notable. The software is mature and useful, but on an expert level. I have used it to produce visualizations for R-Tree, de:OPTICS (haven't worked on the English version of this page yet) and Local Outlier Factor. It's not a point-and-click data mining software for commercial use.

It has been published in highly respected conferences, and as such has been peer-reviewed by experts in the fields of data mining, databases and time series.

And the "roundtrip time" in scientific research is a few years, so it just takes some time to be able to prove notability via external references. So maybe it is just a bit early to really judge relevance based on external citations alone? --Chire (talk) 14:47, 29 May 2010 (UTC)

Anyone can cancel a proposed deletion - as you clearly disagree, I removed it for you. See WP:PROD. Chzz ► 15:17, 29 May 2010 (UTC)

I never said it was a commercial product. If it's too early to have significant coverage in independent reliable sources it's too early to have an article here. Nothing to stop you re-creating it in a few years. Qwfp (talk) 17:11, 29 May 2010 (UTC)

Articles for deletion nomination of Environment for DeveLoping KDD-Applications Supported by Index-Structures

I have nominated Environment for DeveLoping KDD-Applications Supported by Index-Structures, an article that you created, for deletion. I do not think that this article satisfies Wikipedia's criteria for inclusion, and have explained why at Wikipedia:Articles for deletion/Environment for DeveLoping KDD-Applications Supported by Index-Structures. Your opinions on the matter are welcome at that same discussion page; also, you are welcome to edit the article to address these concerns. Thank you for your time.

Please contact me if you're unsure why you received this message. Qwfp (talk) 17:30, 29 May 2010 (UTC)

Data mining

You undid my adding of an external link about the Panton Principles. Wondering if you might explain further? Data mining of scientific data runs into ethical issues -- can this data be re-used? How can it be re-used? What's generally been happening is that in many instances, scientists are unsure whether they can re-use data, and are afraid of copyright issues or legal stuff if they use it; the whole idea behind the Panton Principles here (Panton Principles for Open Data in Science at Citizendium) is to have a well-understood tag attached to scientific data which alerts future users of data that it can be re-used. Many feel it's an advancement for science in general since it encourages the sharing of information. Please consider restoring the link to the data mining article.--Tomwsulcer (talk) 13:37, 15 June 2010 (UTC)

The article is way too long, crowded with references and links. This is why the whole links area was stripped to a bare minimum, and the bar should be really high for links - ACM SIGKDD is the top conference here, and ODP is a suitable catch-all. There is an ethics section with 6 or so references, which should cover the relevant ethical discussion well enough without the external link to this third-party article on the Panton Principles, which are only marginally related. They might be referenced by the given references; that ought to be enough. In any case, a scientific source on the Panton Principles should be preferred to any Wiki. --Chire (talk) 16:19, 15 June 2010 (UTC)

OK, sounds fair enough, thanks for sharing your thinking.--Tomwsulcer (talk) 17:43, 15 June 2010 (UTC)

Yes it need

Yes sir! M-Tree article needs a lot of work. I have been busy by now, soon it will be improved. Nice picture. —Preceding unsigned comment added by Diego diaz espinoza (talk • contribs) 03:53, 21 June 2010 (UTC)

Article rescue: Rexer survey

Hello. I am new to Wikipedia and would like to dispute the notability tag added by User:Melcombe to the Rexer's Annual Data Miner Survey article. Can you help? Thanks. --Luke145 (talk) 20:30, 22 April 2011 (UTC)

Categories for discussion nomination of Category:Data mining software

Category:Data mining software, which you created, has been nominated for discussion. If you would like to participate in the discussion, you are invited to add your comments at the category's entry on the Categories for discussion page. Thank you. I've proposed to rename this to Category:Data mining and machine learning software --93.104.79.59 (talk) 08:04, 16 October 2011 (UTC)

Knowledge Grid vs Data mining

Yes, you have a point that it might need to be better organized. Thanks for the info in Wikipedia:Articles for deletion/Knowledge Grid; as you can see, it is not clear what to do. Probably worth a discussion on Talk:Data mining? W Nowicki (talk) 17:44, 18 October 2011 (UTC)

machine learning

Data Mining is an application area which is a branch of machine learning. http://aaai.org/AITopics/MachineLearning

Please also check IEEE, ACM and other Ph.D. opinions.

For example, I can use a Neural Network to "learn for classification" or to "understand a classification". This can be accomplished by visualizing the hidden layer, controlling the inputs and outputs, etc...

For absolute clarity - the order of the areas is: Computer Science (discipline) > Artificial Intelligence (theory) > Machine Learning (technique) > Data Mining (application area)

DM is a branch of ML which is a branch of AI which is a branch of CS. — Preceding unsigned comment added by Aiwing (talk • contribs) 10:26, 27 October 2011 (UTC)

pami vs sig kdd

I will find more references but please note the direction: a) SIG KDD is not the authority - it is only one source b) the TOP A.I. journal is IEEE PAMI - it has the highest impact factor and lowest acceptance rate

Let me clarify the point on why it appears that my statement is mixed: DM: application area - the process of finding something (analysis, comparison and investigation of the data) ML: technique - the way to find something (the algos, model, etc...)

I chose neural networks not because of my interest but for the following reasons: a) neural networks have both supervised and unsupervised modes. b) neural networks (also genetic algos and agent theory) were inspired by biology and therefore differentiate from pure math. c) neural networks is considered the most basic pure A.I. fuzzy machine learning (ignoring decision trees which are not really fuzzy) d) neural networks have been extensively trusted.

I only have one paper on NN and so it is not the area I am interested in. It is clear for the public. — Preceding unsigned comment added by Aiwing (talk • contribs) 01:17, 28 October 2011 (UTC)

"Orphaned"

WP:ORPHS is not a very good reason to delete an article, so I suggest you stop using it in PRODs. Greetings, Ian (212.87.13.73 (talk) 05:15, 28 October 2011 (UTC))

Which article are you referring to? The ones I found in my edit history -- I did a lot of cleanup in ML yesterday -- were usually mentioning orphan just as one of multiple reasons, usually including "WP:Notability". I tend to do "orphan prod" for single-line unused dictionary terms only. --Chire (talk) 06:58, 28 October 2011 (UTC)

by analogy

The reason I disagree with databases is because the general population will think that database work is related to data mining. There is clearly a difference between the storage aspect and the mining aspect. The key emphasis is the "process of mining" and not the "process of storage for mining". The techniques of indexing a database have nothing to do with pattern analysis and this is why the AI community rejects databases. Including databases as the definition of data mining is analogous to "IT = computer science" or "computer science = programming". It is the confusion of a support tool with the actual effort. C++ is a tool used by many but it is not data mining specific. However, using a mathematical model to find a pattern IS data mining specific. Hence, machine learning is IN data mining and databases and programming languages are NOT IN data mining. — Preceding unsigned comment added by Aiwing (talk • contribs) 08:42, 1 November 2011 (UTC)

Your recent destructive edits on Comparison of revision control software

I am sorry, but destructive edits like the ones you make from time to time are not helpful. If you would like to make things better, you are of course welcome, but you did no more than remove information. --Schily (talk) 15:31, 8 November 2011 (UTC)

Indeed, my edits were very destructive, not. Fact check time: [1] [2]. --Chire (talk) 15:44, 8 November 2011 (UTC)

Pune pilot analysis plan

Hi! As you were very active in discussions about the India Education Program's Pune pilot, I wanted to draw your attention to Wikipedia:India_Education_Program/Analysis, a page that documents our analysis plan for the next few months. I encourage you to join the discussion if you have any thoughts. -- LiAnna Davis (WMF) (talk) 23:10, 1 December 2011 (UTC)

DBScan seems to have an error

Chire,

It appears to me that the article on DBScan is incorrect, specifically the section labeled algorithm. It says that if a point is a part of the cluster, everything in its epsilon neighborhood is added too. That is true if the neighborhood is dense, but I believe it is not true if the neighborhood is not dense. For example, in one dimension, suppose we have points that line up like this

PA-PB-PC-PD-P1-P2-P3

minPts = 4

PA, PB, PC, PD & P1 are all close, P2 is 0.9*epsilon away from P1. We start a cluster for P1 and P2 is in it. P3 is also about 0.9*epsilon from P2. The nhd for P2 has P1 and P3 - not dense, but P2 is in the nhd for P1. If we added the nhd for P2, we would add P3, but P3 is not densely reachable from P1.

You can check what I say in the book [1].

If you agree that the current statement is wrong, you can fix it yourself or, if you would like, I will. Just let me know. Do you agree it is incorrect? Who would you like to fix it?

Thanks, Mu.ting (talk) 01:17, 21 March 2012 (UTC)

In general, I prefer primary sources, such as the DBSCAN publications themselves (it actually is written DBSCAN, btw. - it is an abbreviation, not a "scan"). But I agree with you, the abstract algorithm doesn't take the core point requirement into account properly. I'll see if I can elaborate that accordingly. Thank you! --Chire (talk) 06:55, 21 March 2012 (UTC)
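
The one-dimensional example above can be checked with a minimal sketch of DBSCAN. The coordinates and eps = 1.0 below are assumptions chosen to match the description (PA to PD within eps of P1 but farther than eps from P2, and 0.9*eps between P1, P2 and P3); the function names are hypothetical, and this is an illustration rather than the article's pseudocode or the published algorithm verbatim. The range query includes the query point itself, and only core points expand the cluster.

```python
NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if abs(points[i] - q) <= eps]


def dbscan(points, eps, min_pts):
    """Minimal 1-D DBSCAN sketch; returns one cluster label per point."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # only core points expand the cluster
    return labels


# Assumed coordinates: PA, PB, PC, PD, P1, P2, P3
points = [-0.45, -0.35, -0.25, -0.15, 0.0, 0.9, 1.8]
labels = dbscan(points, eps=1.0, min_pts=4)
print(labels)
```

Under these assumptions P2 joins the cluster as a border point of P1, but its neighborhood (P1, P2, P3) has only three points, so it is not expanded and P3 ends up as noise.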

thank you, and request for coaching

Chire -- Thank you for the ideas about the Data mining page that you left for me at User talk:Krexer. At User talk:Krexer I responded. I'd love to get your perspective and coaching on ways I can improve the SIGKDD page, and which organization or conference page would be a good template when I follow your advice and create new pages for the conferences / organizations listed on Data mining. Karl (talk) 15:14, 14 November 2012 (UTC)

DBSCAN

Hi,

I've made a change in the code, subtracting one from MinPts. You undid my change, saying that regionQuery also returns P. However, expandCluster() starts by adding P to the cluster:

expandCluster(P, NeighborPts, C, eps, MinPts)

  add P to cluster C

Please consider your position again. — Preceding unsigned comment added by 87.68.213.25 (talk) 18:06, 29 April 2013 (UTC)

A) yes, I'm absolutely sure that the DBSCAN code is correct, and your interpretation is incorrect.

B) even if it were incorrect: Wikipedia does not "fix" algorithms. It reports on what has been published, and this is how DBSCAN was published. So if you believe this is incorrect, you should publish a fixed DBSCAN in a major journal (so it is peer reviewed by the scientific community) and then Wikipedia may consider including your version. Until you show that the majority of the science community sees the need for a "-1" correction, I'm not willing to accept your change.

Because in the end, Wikipedia is an Encyclopedia, not a programming book. We report on Science, not on programming details. But trust me. I've implemented DBSCAN several times, and this pseudocode is correct. The problem is that most likely, your "regionQuery" is incorrect. It is a database query. It will return all objects in the desired range, including the query point. It is a core point if and only if the set - including the query point - is at least minPts objects total. So the published DBSCAN code is correct, and so is the copy in Wikipedia.

In fact, I've seen this misinterpretation of "regionQuery" a number of times. The problem is most likely that you are not actually using a database. If you were using an actual database, it should be obvious that *any* region query (forget about the query point - think of putting a *box* on your data set) is "dense" if it has at least minPts objects.

--Chire (talk) 20:02, 29 April 2013 (UTC)
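
The point about regionQuery can be sketched as follows. This is a hedged illustration, not the published code: the function names, the sample points, and the values eps = 1.0 and min_pts = 3 are all assumptions. It shows a range query that returns the query point along with its neighbors, and that comparing this count to min_pts gives the same answer as excluding the query point and comparing to min_pts - 1.

```python
import math


def region_query(points, i, eps):
    """Range query: indices of all points within eps of points[i].

    Like a database range query, the result includes the query point itself.
    """
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def is_core(points, i, eps, min_pts):
    """Core point test as described above: count includes the query point."""
    return len(region_query(points, i, eps)) >= min_pts


def is_core_excluding_self(points, i, eps, min_pts):
    """Equivalent variant: drop the query point, compare to min_pts - 1."""
    others = [j for j in region_query(points, i, eps) if j != i]
    return len(others) >= min_pts - 1


# Assumed sample data for illustration only.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0)]
eps, min_pts = 1.0, 3

core_flags = [is_core(points, i, eps, min_pts) for i in range(len(points))]
alt_flags = [is_core_excluding_self(points, i, eps, min_pts) for i in range(len(points))]
print(core_flags)
print(core_flags == alt_flags)
```

Subtracting one from MinPts is therefore only needed when the range query does not return the query point; applying the "-1" on top of a query that already includes it would lower the density threshold by one.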

but if it is so why the expandCluster(P, NeighborPts, C, eps, MinPts) adds P to the cluster? — Preceding unsigned comment added by 94.188.248.67 (talk) 08:10, 30 April 2013

First of all, please learn wiki syntax for threading and signatures. I had to add the missing signature, which you could have more easily added yourself. There is a big fat sign in the editing help that says Sign your posts on talk pages.

Maybe that line is redundant. But again, we don't want to speculate on the algorithm, we want to use sources. E.g. in the original sources, the query point is explicitly removed from the regionQuery result, and it does compare to minPts, not minPts - 1 as in your proposal. We try to stick to official sources, and I believe the current version is based on the book by two of the authors of DBSCAN. So let's just stick with this, unless you can find a scientific, reviewed source that shows otherwise. See Wikipedia:Reliable Sources for details on this Wikipedia Policy. Thank you. --Chire (talk) 01:04, 1 May 2013 (UTC)

Deletions of 'Border Pairs Method'

Hi!

My contribution 'Border Pairs Method' was deleted two years ago because of missing references. There is a new reference. Could you now restore my contribution, please? Bojan PLOJ (talk) 09:09, 3 July 2013 (UTC)

No, I cannot, technically. I'm not a Wikipedia administrator or anything like that; I can neither delete nor restore, I can only "nominate for deletion" (as can anybody). But given that this new reference is not an independent, third-party reference, it doesn't really help much IMHO. Wikipedia clearly asks for third-party references, but you are one of the authors of this article, too. Local Outlier Factor, for example (a method that I'm very interested in, although written long before I was in science), has 1574 citations on Google Scholar. DBSCAN (another algorithm that I have a lot of interest in, invented when I was still in school) has 4867 citations in Scholar. Random forests, a method I'd like to investigate more, has 9321 citations. Bootstrap aggregating, which I have used in a variation, has 9872 citations. To my understanding, Wikipedia is not about listing everything that has ever been done, but about focusing on landmark methods such as these. Anything that gets its own article on Wikipedia should have a reasonable number of independent citations, should be discussed in books, hundreds of (independent) web sites etc.; and I don't think your method has received similar external verification and validation yet. To cite from the deletion discussion: Wikipedia:Articles_for_deletion/Border_pairs_method

"Google scholar finds only one research paper containing this phrase, with zero citations."

This is probably the key reason there was a consensus to delete the article. To put it simply: you need more independent citations. --Chire (talk) 09:37, 3 July 2013 (UTC)

August 2013

Hello, I'm BracketBot. I have automatically detected that your edit to Generalized logistic distribution may have broken the syntax by modifying 1 "<>"s. If you have, don't worry, just edit the page again to fix it. If I misunderstood what happened, or if you have any questions, you can leave a message on my operator's talk page.

Thanks, BracketBot (talk) 16:53, 26 August 2013 (UTC)

Kudos on an old edit

Hi there! It's nice to meet you. I wanted to say kudos on this edit even though it was made two months ago. What a great improvement. Keep being awesome! EmilyREditor (talk) 05:10, 2 August 2014 (UTC)

unsupervised learning

I also added a link on unsupervised machine learning, which can be used to visualize unknown patterns.

I will find more references but please note the direction: a) SIG KDD is not the authority - it is only one source b) the TOP A.I. journal is IEEE PAMI - it has the highest impact factor and lowest acceptance rate

Let me clarify the point on why it appears that my statement is mixed: DM: application area - the process of finding something (analysis, comparison and investigation of the data) ML: technique - the way to find something (the algos, model, etc...)

I chose neural networks not because of my interest but for the following reasons: a) neural networks have both supervised and unsupervised modes. b) neural networks (also genetic algos and agent theory) were inspired by biology and therefore differentiate from pure math. c) neural networks is considered the most basic pure A.I. fuzzy machine learning (ignoring decision trees which are not really fuzzy) d) neural networks have been extensively trusted.

I only have one paper on NN and so it is not the area I am interested in. It is clear for the public. — Preceding unsigned comment added by Aiwing (talk • contribs) 01:17, 28 October 2011

this can be archived. --Chire (talk) 16:54, 5 September 2014 (UTC)

MapReduce

Hello, I noticed your edit at MapReduce and request that you provide some sort of cite; on my reading, at least the first sentence about single-threaded makes no sense to me. Thanks. WilliamKF (talk) 21:56, 3 October 2014 (UTC)

@WilliamKF: the "single-threaded" was not a typo. There are some really slow "map-reduce" implementations out there that use only a single thread, and offer none of the parallelism and fault tolerance benefits discussed in the original map reduce paper. There are plenty of complaints on the performance of MongoDB, for example:

https://stackoverflow.com/questions/3947889/mongodb-terrible-mapreduce-performance

Things apparently have improved slightly since they moved to multithreaded "map reduce" with the V8 JavaScript engine.

The point of that sentence is that just allowing "map" and "reduce" functions doesn't make up a proper MapReduce system. --Chire (talk) 12:51, 4 October 2014 (UTC)
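
A minimal sketch of that point, with hypothetical function names and sample input: the word count below accepts "map" and "reduce" functions, yet it runs everything in one thread on one machine, so it offers none of the parallelism or fault tolerance that the original paper describes. It is an illustration only and does not reflect how any particular product implements its map-reduce feature.

```python
from collections import defaultdict


def map_fn(document):
    """Map step: emit a (word, 1) pair for every word in the document."""
    for word in document.split():
        yield word, 1


def reduce_fn(key, values):
    """Reduce step: sum the counts collected for one word."""
    return key, sum(values)


def run_single_threaded(documents, mapper, reducer):
    """Run map, group-by-key and reduce sequentially in a single thread.

    There is no partitioning across workers, no parallel execution and
    no recovery from failures; only the programming interface resembles
    MapReduce.
    """
    groups = defaultdict(list)
    for document in documents:
        for key, value in mapper(document):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


counts = run_single_threaded(["a b a", "b c"], map_fn, reduce_fn)
print(counts)
```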

Thanks, I've updated the article to try and make the meaning clearer and added a cite, please review. WilliamKF (talk) 17:37, 4 October 2014 (UTC)

@WilliamKF: Thanks. The only thing is that I don't think such a web site is a reliable source. :-( I'd prefer to have something that discusses how to design a performant MapReduce implementation. But I haven't found any such source yet. --Chire (talk) 17:00, 5 October 2014 (UTC)

I believe any cite is better than no cite, plus I feel stackoverflow is pretty reliable due to the voting etc. WilliamKF (talk) 05:33, 6 October 2014 (UTC)

OPTICS algorithms / dashes

Hi Chire,

I changed dashes to camel case, purely for readability. core-dist in a function context like max(core-dist) can quickly be interpreted as core minus dist.

Best regards,

user — Preceding unsigned comment added by 141.63.174.160 (talk) 14:30, 31 August 2016 (UTC)

There is little risk of misinterpreting these terms, as "core" is not defined, only "core-dist". By typography, in the math equations the shorter dash is visually different from a minus, and in the pseudocode, a minus should be surrounded by whitespace. Chire (talk) 09:22, 1 September 2016 (UTC)

ArbCom 2017 election voter message

Hello, Chire. Voting in the 2017 Arbitration Committee elections is now open until 23.59 on Sunday, 10 December. All users who registered an account before Saturday, 28 October 2017, made at least 150 mainspace edits before Wednesday, 1 November 2017 and are not currently blocked are eligible to vote. Users with alternate accounts may only vote once.

The Arbitration Committee is the panel of editors responsible for conducting the Wikipedia arbitration process. It has the authority to impose binding solutions to disputes between editors, primarily for serious conduct disputes the community has been unable to resolve. This includes the authority to impose site bans, topic bans, editing restrictions, and other measures needed to maintain our editing environment. The arbitration policy describes the Committee's roles and responsibilities in greater detail.

If you wish to participate in the 2017 election, please review the candidates and submit your choices on the voting page. MediaWiki message delivery (talk) 18:42, 3 December 2017 (UTC)

Speedy deletion nomination of Category:Pages with ACM-DL identifiers

A tag has been placed on Category:Pages with ACM-DL identifiers indicating that it is currently empty, and is not a disambiguation category, a category redirect, a featured topics category, under discussion at Categories for discussion, or a project category that by its nature may become empty on occasion. If it remains empty for seven days or more, it may be deleted under section C1 of the criteria for speedy deletion.

If you think this page should not be deleted for this reason you may contest the nomination by visiting the page and clicking the button labelled "Contest this speedy deletion". This will give you the opportunity to explain why you believe the page should not be deleted. Please do not remove the speedy deletion tag from the page yourself. Liz ^{Read! Talk!} 05:54, 10 March 2022 (UTC)

[1] Introduction to Data Mining by Tan, Steinbach and Kumar