View on GitHub

Which Programming Languages should I Learn to Land a Data Scientist Job?

Analysis of Online Job Ads - Indeed & Stack Overflow

May 2016

Overview         Analysis         Takeaways

Project Overview

Motivation

As someone who is thinking about going into the field of Data Science and Machine Learning, I have looked through online job ads for data scientists and got overwhelmed by the amount of skills required. It would take a very long time to learn the whole repertoire of data science tools.

To help decide which tools to learn first, I decided to look at which data science skills are currently most in-demand by employers in major cities in Canada and U.S.

After doing some research on this topic, I came across this awesome blog by Jesse Steinweg-Woods published in March 2015. He scraped data off indeed.com and looked at trends for Seattle, Chicago, San Francisco, New York and U.S. Nationwide.

His findings from nationwide search suggested that Python is much more in demand than R. This really intrigued me to see if the trend is the same among different data science jobs.

Objectives

In this analysis, I'll be trying to answer the following questions:

Approach

Data was collected from three sources:

Analysis

Ranking of American Cities based on Data Science Job Ads

test

You can see that almost all 53 cities had the most number of job ads for Data analysts followed by those for Data scientists, except for San Francisco, San Jose and Seattle.

There were similar number of job ads for data engineers and data architects, except for New York, Newark, Jersey City, San Jose, San Francisco, Seattle, Oakland and Fremont where there was a significantly higher number of job ads for data engineers than data architects.

arrdept

In Newark, Jersey City, New York, Oakland, Washington, Fremont and San Francisco, there were a total of over 1000 job ads for Data Analysts, Data Scientists, Data Architects & Data Engineers combined together. San Jose had over 800 whereas Boston & Seattle had over 500.

Comparing Demand of Data Science Tools: U.S. vs. Canada

arrdept arrdept

For data analysts, Excel seems to be more popular in U.S than Canada by 15%. However, SQL is much more popular in Canada by 15.7%. Most U.S. companies do not seem to be looking for R and Python skills in data analysts, with 12.4% of job ads asking for R and only 2.9% asking for Python! More Canadian companies are looking for data analysts with R and Python skills than U.S., with 16.1% of job ads asking for R and 12.3% for Python.

For data scientists, Python is winning over R in U.S by 1.1% but it is still a tie in Canada. When Jesse did the analysis last year, Python was ahead by about 8% so it seems R has caught up to it again. R language may have benefitted from Microsoft acquisition last year.

For data architects, SQL, Hadoop and Java are leading the pack with 45.2% of U.S. job ads asking for SQL & 18% for Hadoop, followed by 17.2% for Java and 32.7% of Canadian job ads asking for SQL, followed by 13.9% for Hadoop & Java.

For data engineers, SQL, Python and Hadoop are most requested in both U.S. and Canadian indeed job ads. Below, you can see, the list of top 15 data science tools that appeared in the indeed job ads for data analysts, data scientists, data architects and data engineers in U.S. and Canada.

arrdept

arrdept

Comparing Demand of Data Science Tools between U.S. Cities

Now, let’s see if the national trends are different from individual cities’. I will be looking at top 12 cities with the most number of jobs: Newark, Jersey City, New York, Oakland, Washington, Fremont, San Francisco, San Jose, Boston, Seattle, Chicago & Los Angeles.

arrdept arrdept arrdept

Looking at the plots above, you can see the following trends:

Comparing Demand of Different Tools for Data Scientists Around the World (Stack Overflow Data)

The stackoverflow data had 362 job ads with about 60% of them from U.S. and 40% from countries around the world.

Let’s see if the trends that we observed above for U.S. are similar “internationally”.

arrdept

For stackover flow job data, R is winning over Python by 3%. Hadoop and SQL were equally in demand for data scientists. It's intesting to see that about 44% of stack overflow job ads for data scientists mentioned machine learning! It seems that it’s becoming important for data scientists to be familiar with machine learning principles and algorithms. Surprisingly, only 24% mentioned Big Data.

Takeaways

So here are my takeaways from this analysis!

When deciding on which programming languages to learn first, you should make the decision based on which career in the “data science universe” you want to pursue.

Although Python seems to be gaining momentum over R, they have different strengths, with R being a language developed for statisticians (easier to apply statistical concepts without much experience in programming) vs. Python being a multi-purpose language. So you may want to use both for different purposes.

Overall, companies seem to prefer open source languages such as R and Python over commercial one such as Matlab and SAS. Very good news!

It seems that machine learning will become more and more important for data scientists to be familiar with.

As for the ranking of the cities, east coast cities currently have more data science jobs than west coast. So hooray for the east coasters!

However, please note that Silicon Valley has more jobs for data scientists than for data analysts and more for data engineers than data architects. In contrast, there are more jobs for data analysts than data scientists in east coast. East coast cities also have more jobs for data engineers than data architects but the gap isn’t as wide as in Silicon Valley.