开发者

Essential skills of a Data Scientist [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.

This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.

Closed 8 years ago.

Improve this question

What are the relevant skills in the arsenal of a Data Scientist? With new technologies coming in every day, how does one pick and choose the essentials?

A few ideas germane to this discussion:

  • Knowing SQL and the use of a DB such as MySQL, PostgreSQL was great till the advent of NoSql and non-relational databases. MongoDB, CouchDB etc. are becoming popular to work with web-scale data.
  • Knowing a stats tool like R is enough for analysis, but to create applications one may need to add Java, Python, and such others to the list.
  • Data now comes in the f开发者_如何学编程orm of text, urls, multi-media to name a few, and there are different paradigms associated with their manipulation.
  • What about cluster computing, parallel computing, the cloud, Amazon EC2, Hadoop ?
  • OLS Regression now has Artificial Neural Networks, Random Forests and other relatively exotic machine learning/data mining algos. for company

Thoughts?


To quote from the intro to Hadley's phd thesis:

First, you get the data in a form that you can work with ... Second, you plot the data to get a feel for what is going on ... Third, you iterate between graphics and models to build a succinct quantitative summary of the data ... Finally, you look back at what you have done, and contemplate what tools you need to do better in the future

Step 1 almost certainly involves data munging, and may involve database accessing or web scraping. Knowing people who create data is also useful. (I'm filing that under 'networking'.)

Step 2 means visualisation/ plotting skills.

Step 3 means stats or modelling skills. Since that is a stupidly broad category, the ability to delegate to a modeller is also a useful skill.

The final step is mostly about soft skills like introspection and management-type skills.

Software skills were also mentioned in the question, and I agree that they come in very handy. Software Carpentry has a good list of all the basic software skills you should have.


Just to throw in some ideas for others to expound upon:

At some ridiculously high level of abstraction all data work involves the following steps:

  • Data Collection
  • Data Storage/Retrieval
  • Data Manipulation/Synthesis/Modeling
  • Result Reporting
  • Story Telling

At a minimum a data scientist should have at least some skills in each of these areas. But depending on specialty one might spend a lot more time in a limited range.


JD's are great, and for a bit more depth on these ideas read Michael Driscoll's excellent post The Three Sexy Skills of Data Geeks:

  1. Skill #1: Statistics (Studying)
  2. Skill #2: Data Munging (Suffering)
  3. Skill #3: Visualization (Story telling)


At dataist the question is addressed in a general way with a nice Venn diagram:

Essential skills of a Data Scientist [closed]


JD hit it on the head: Storytelling. Although he did forget the OTHER important story: the story of why you used <insert fancy technique here>. Being able to answer that question is far and away the most important skill you can develop.

The rest is just hammers. Don't get me wrong, stuff like R is great. R is a whole bag of hammers, but the important bit is knowing how to use your hammers and whatnot to make something useful.


I think it's important to have command of a commerial database or two. In the finance world that I consult in, I often see DB/2 and Oracle on large iron and SQL Server on the distributed servers. This basically means being able to read and write SQL code. You need to be able to get data out of storage and into your analytic tool.

In terms of analytical tools, I believe R is increasingly important. I also think it's very advantageous to know how to use at least one other stat package as well. That could be SAS or SPSS... it really depends on the company or client that you are working for and what they expect.

Finally, you can have an incredible grasp of all these packages and still not be very valuable. It's extremely important to have a fair amount of subject matter expertise in a specific field and be able to communicate to relevant users and managers what the issues are surrounding your analysis as well as your findings.


Matrix algebra is my top pick


  • The ability to collaborate.

Great science, in almost any discipline, is rarely done by individuals these days.


There are several computer science topics that are useful for data scientists, many of them have been mentioned: distributed computing, operating systems, and databases.

Analysis of algorithms, that is understanding the time and space requirements of a computation, is the single most-important computer science topic for data scientists. It's useful for implementing efficient code, from statistical learning methods to data collection; and determining your computational needs, such as how much RAM or how many Hadoop nodes.


Patience - both for getting results out in a reasonable fashion and then to be able to go back and change it for what was 'actually' required.


Study Linear Algebra on MIT Open course ware 18.06 and substitute your study with the book "Introduction to Linear Algebra". Linear Algebra is one of the essential skill sets in data analytic in addition to skills mentioned above.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜