I just finished the “Data Scientist with R” Career Track on DataCamp. Here’s my review
TL;DR. I think DataCamp is awesome. Scroll down to learn more about what it offers, why I chose it, and a proposed plan of study to start learning Data Science with R (including things to do outside DataCamp).
Learning “how to data science” has been on my bucket list for the past few years. While I am comfortable with traditional data manipulation and analysis software (e.g., Excel, Alteryx), and some data visualization tools (e.g., Tableau, Power BI), I have only written proper code on a few occasions, for specific projects (more like, I Googled my way into writing code — Thanks Stack Overflow). A month ago, while confined at home and having recently left my job, I decided to give it a proper try: dedicating a good portion of my day to an online data science course.
The review below could be useful to you as you consider this career track on DataCamp, but I would like to caveat it with the following:
- I started this course with a lot of free time on my hands. With a busy work schedule, I doubt I would have gone through the material this quickly.
- I have a good background in math and statistics. In particular, I have studied all kinds of regressions extensively during my undergrad statistics & econometrics courses. The regression stuff was thus easy to breeze through.
- Beyond regression, I had no experience with the more advanced machine learning techniques (I didn’t even know the difference between supervised and unsupervised learning before this course). So I was definitely tested by the later stages of the course. More on that below.
What is DataCamp and why did I choose it?
DataCamp is a website that offers resources to learn data science: how to load, clean, wrangle and visualize data using code, in order to derive useful insights from it. The website has been around for years, and has developed a good reputation as a do-it-yourself online resource.
1. Why did I choose DataCamp?
a. The courses are broken down into small doses: each career track/skill is broken down into 3–5 hour courses, which are broken down into 4–5 chapters, the latter then broken down into (max) 4 min videos and a series of exercises. I have the attention span of a typical millennial (a fact further amplified by the current pandemic…), so the fact that I could consume DataCamp in small bites was crucial. Specifically, I wanted to avoid the rigidity of hour-long videos in Coursera’s courses (I can only tolerate that for Oum Kalthoum’s songs).
b. The courses are interactive, meaning that you have to actively practice as you go through the material. In fact, you spend the majority of your time writing and running code as a learning mechanism. While a lot of the code comes pre-written, you still learn a ton by experimenting with different formulas, and interpreting the results of your code.
c. The whole website is gamified, and thus addictive. The more lessons, exercises, and practice you go through, the more points you accumulate. So if you are into games, or if you are competitive, you will keep logging back into the website to earn more. Of course, one has to be careful when the gamification supersedes the actual act of learning (I fell into that trap a few times, and when that happened, I knew that that was my cue to stop and take a break!)
d. The website is a great one-stop shop for beginners. It covers the two main languages for data science, R and Python, and includes an extensive library of courses for both. More courses (and features) are continuously being added.
e. DataCamp is affordable, given the quality of its content. Without a discount, the price tag is around USD 400 / year. That would have given me pause before purchasing it, but I landed on an amazing discount that offered a one-year membership at USD 100 / year (gotta love the tech world’s obsession with customer acquisition!). If you are not earning in US dollars, that might seem like a steep price tag, but I like to describe it as around USD 9 / month — that’s like buying a Coca Cola every other day every month (In my country, Morocco, a small Coca Cola costs around USD 0.60). I know Udemy also has affordable prices, but with Udemy, you’re paying per class. So if you aggregate the price for all the classes you would take on Udemy, DataCamp would turn out cheaper.
f. It came heavily recommended. I knew DataCamp was a solid choice because my previous employer had purchased subscriptions for all its employees (note: the membership I bought is separate from the one that was offered to me through my employer). Similarly, when I asked a lot of friends (some beginners, others advanced in the field) what they would recommend to get started, they said DataCamp.
2. What does DataCamp contain?
The website is divided into many components that comprehensively make up the DataCamp learning environment. Here are the main components:
a. (Short) Assessments: These short tests of 15 multiple-choice questions let you gauge your level in the main skills a beginner should have: (1) Importing and Cleaning Data, (2) Data Manipulation, (3) Data Interpreting and (4) Machine Learning. The questions are simple “fill in the missing code” exercises that draw on the courses. They are timed and adaptive: if you answer the first few questions correctly, the questions get harder, and if you get stuck in the middle, they get easier. At the end of each assessment (which should take about 15 min), you are scored on a curve, that is, on how you did compared to others who took the assessment. You also get a summary of your skill gaps along with course recommendations. If you are not an absolute beginner, Assessments are a great place to start. I took one data prep assessment and scored so badly that I decided to do the most extensive set of beginner courses. As I completed more courses, I went back to check my progress by retaking the assessments, and I was happy to see my score improve over time. Eventually the questions start repeating themselves, so try to space out how often you take the assessments (one at the beginning and one at the end of a track is a good cadence).
b. Courses. Courses are stand-alone collections of lessons that take about 3–5 hours to complete. They are made up of chapters, which are in turn broken down into a mix of videos, interactive exercises and (sometimes) multiple-choice questions to check your understanding. Each course is taught by 1 or 2 instructors, usually industry veterans or academics. One hidden treasure of the courses is that the slides used in the videos (which introduce and visualize the concepts, show the functions, etc.) can be downloaded as PDFs. As things got more advanced and harder to keep track of, I started downloading, labeling and annotating those PDFs, and I referred to them later while doing my own projects.
Not sure where to get started? Well, the courses are either stand-alone or organized into Skill Tracks or Career Tracks, both of which are collections of courses grouped together to develop your skills. Career Tracks are more extensive and take much longer to complete than Skill Tracks. For example, Data Scientist with R is a 76-hour Career Track that teaches you the whole data science process, while Machine Learning with R is a 16-hour Skill Track that assumes prior knowledge and goes deeper into machine learning techniques. I decided to try the Career Track first, and now that I have finished it, my next step is to try a Skill Track before jumping into another extensive Career Track.
c. Practice. This is your chance to practice what you’ve learnt in the courses described above, again, in small bites. Each collection of similar courses has one of these, usually 5, untimed, short questions that review the concepts, formulas and functions learnt. Pro tip: Download the app on your phone and play with these practice questions at the beginning or end of your day, after you have had the chance to absorb the concepts from the courses.
d. Projects. These are real-life exercises that allow you to put into practice a large chunk of what you learnt in the courses. Unlike the Assessments and the Practice portions, these do not test your knowledge in a school-exam way, but rather in a real-life problem that data can help you solve. An example of a project is to visualize the progression of COVID-19 cases in different countries.
I found Projects useful, but not all that inspiring. I can summarize their shortcomings as: (i) they come with way too much guidance, (ii) it is difficult to choose which one to do after a collection of courses and (iii) they are concentrated in the first part of the track (and a bit lacking for the machine learning part).
Data Scientist with R Track: Why?
1. Why R and not Python?
I did not land on R instead of Python through some analytically backed process; I followed the simple logic “I already know a bit of R, so why don’t I keep going with it”. The data scientist I worked with at my last job used R, so I interacted a bit with the language, and the graduate program I am attending in the fall (in Economic Development) will also require some knowledge of R.
If you are wondering which of the two to choose, the general trade-off is this: R is geared toward specialized work in statistics / machine learning, and is easier to learn if you have no programming experience, while Python is a full-fledged programming language and thus more readily integrated with non-data science work (such as web applications). DataCamp has a detailed article on this.
2. General thoughts on the track
The “Data Scientist with R” Track is a 76-hour collection of courses. I dedicated around 3 hours / day over 20 days to coursework, plus 1 hour / day to Practice / Projects. So the coursework itself took me around 60 hours, and with the additional daily hour of practice, around 80 hours is an accurate description of how long it took me to complete the track.
In general, the courses were well-organized and included a lot of content. I would divide this track into two major themes: First, getting the data into a format that allows for analysis. Second, actually analyzing the data you just cleaned.
As you will see below, the first chunk is heavy on the tedious job of cleaning data. At times, it was difficult to keep going because a lot of it was just “this is a formula that helps you do this task quickly”.
The courses in the later part of the track were more analytical, so they required more brain power and were more interesting. In these courses, the instructors do a great job introducing the intuition behind the concepts and explaining a process to follow to reproduce the results presented. However, a more technically oriented person who wants to understand the “why”, including how the math works under the hood, will not be satisfied by DataCamp. If, on the other hand, you want to intuitively understand how things work and quickly apply them, you will find DataCamp easy to understand and enjoyable.
3. A proposed course of study for the track
The track is organized into 19 courses. Looking back, I would not change much about how it is organized, but I would insert Projects between groups of courses for practice. Below is a proposed course of study:
PART ONE: Data cleaning
(a) Basic R: the first 2 courses. The first course, Introduction to R, introduces data structures in R (what is a vector, a matrix, a data frame, a list?). The second course, Intermediate R, teaches basic programming concepts (conditionals, loops, functions, etc.).
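To give a flavor of what these two courses cover, here is a minimal sketch in base R. The data and the `label_scores` function are made up for illustration and are not taken from the courses.

```r
# Hypothetical toy data, invented for illustration
scores <- c(math = 90, stats = 85, econ = 70)            # named numeric vector
grid   <- matrix(1:6, nrow = 2)                          # 2 x 3 matrix
people <- data.frame(name = c("Ana", "Badr"), age = c(31, 27))
bundle <- list(scores = scores, grid = grid, people = people)  # a list can hold anything

# A small function that combines a loop and a conditional
label_scores <- function(x, cutoff = 80) {
  labels <- character(length(x))
  for (i in seq_along(x)) {
    if (x[i] >= cutoff) {
      labels[i] <- "high"
    } else {
      labels[i] <- "low"
    }
  }
  names(labels) <- names(x)
  labels
}

result <- label_scores(scores)
print(result)
```

In everyday R you would more likely write `ifelse(scores >= 80, "high", "low")`, but the explicit loop shows the concepts the second course introduces.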
(b) Data cleaning: the next 3 courses. Introduction to the Tidyverse introduces easy ways to subset, filter, group, and summarize data (with one chapter dedicated to visualizations). The next course, Data Manipulation with dplyr, introduces a few other functions but is largely redundant if you’ve just finished the previous course, although you could argue that it reinforces what you just learnt. The next course, Joining Data with dplyr, teaches you how to merge multiple datasets using all kinds of joins (if you’ve only been using Excel thus far, you can say goodbye to VLOOKUP and/or INDEX MATCH forever…).
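As a rough illustration of filtering, summarizing by group and joining, here is a sketch that uses base R as a stand-in so it runs without installing anything. The courses themselves teach the dplyr versions, which I note in the comments. The two tables are hypothetical.

```r
# Hypothetical toy tables, invented for illustration
orders <- data.frame(
  order_id = 1:5,
  country  = c("MA", "MX", "MA", "US", "MX"),
  amount   = c(10, 25, 40, 15, 5),
  stringsAsFactors = FALSE
)
countries <- data.frame(
  country = c("MA", "MX"),
  name    = c("Morocco", "Mexico"),
  stringsAsFactors = FALSE
)

# Keep only some rows (dplyr: filter(orders, amount >= 10))
big_orders <- subset(orders, amount >= 10)

# Total per group (dplyr: group_by(country) followed by summarise(amount = sum(amount)))
totals <- aggregate(amount ~ country, data = big_orders, FUN = sum)

# Left join, the VLOOKUP replacement (dplyr: left_join(totals, countries, by = "country"))
joined <- merge(totals, countries, by = "country", all.x = TRUE)
print(joined)
```

The "US" row has no match in `countries`, so its `name` comes back as `NA`, which is exactly what a left join is supposed to do.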
(c) Building visualizations: the next 2 courses both deal with the ggplot2 package (which helps you build beautiful visualizations such as density plots, histograms, box plots, etc.). Taken back to back, the two courses were a bit redundant and also repeated some concepts learned earlier in the Intro to Tidyverse course.
BONUS: At this point you are ready to do at least one project. I would recommend Visualizing COVID-19, which is a good, simple one to start with.
(d) Loading data: includes 2 courses. Introduction to Importing Data in R introduces R packages that help you import data from flat files and Excel files (.csv, .xlsx and the like). The following course, Intermediate Importing Data in R, covers more advanced data loading from sources such as databases, the web, or statistical packages. Personally, I found this section a bit dreadful to go through, because it was just a bunch of formulas you needed to know and remember to be able to import data. (As my friends in Mexico would say: nada del otro mundo!)
(e) Data cleaning — covers 2 courses that help you clean your data so it is ready for analysis. The first course, Data Cleaning in R, helps you deal with common data problems such as cleaning strings, finding and eliminating duplicates, missing data, categorical data with too many levels, and some creative techniques to merge data sets. The next course, Working with Dates and Times in R, covers all sorts of techniques to work with dates, such as parsing dates, computing things like durations and periods, converting time zones, etc.
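For the dates part, here is a small sketch of parsing, computing elapsed time and converting time zones. It uses base R functions as a stand-in (I am not reproducing the course's own code), and the dates are arbitrary examples.

```r
# Parse text into Date objects (format codes follow strptime conventions)
start <- as.Date("2020-04-01")
end   <- as.Date("21/04/2020", format = "%d/%m/%Y")

# Elapsed time between the two dates
elapsed <- difftime(end, start, units = "days")
days    <- as.numeric(elapsed)

# The same instant displayed in another time zone
noon_utc  <- as.POSIXct("2020-04-01 12:00:00", tz = "UTC")
in_mexico <- format(noon_utc, tz = "America/Mexico_City", usetz = TRUE)

print(days)        # 20
print(in_mexico)
```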
BONUS: Time for another Project. I would recommend Exploring the Kaggle Data Science Survey to practice variable encoding & data summarizing, and Who Is Drunk and When in Ames, Iowa for some date wrangling.
(f) Writing functions in R: this is a standalone section with a standalone course that I found extremely useful. This course teaches you how to avoid redundancy by writing a piece of code once and reusing it multiple times. For example, let’s say you want to merge one dataset with 6 other datasets using the same variable. Writing a function will help you do that in a few lines of code instead of repeating your process 6 times.
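The merge example above could be sketched like this in base R. The helper name `merge_all` and the toy tables are my own inventions for illustration, not something from the course.

```r
# Hypothetical helper: left-join every table in `others` onto `base` using one key
merge_all <- function(base, others, key) {
  Reduce(
    function(left, right) merge(left, right, by = key, all.x = TRUE),
    others,
    base   # starting value: the table everything gets merged onto
  )
}

# Toy data: one base table and 6 other tables sharing the "id" column
base <- data.frame(id = 1:3, name = c("a", "b", "c"))
others <- lapply(1:6, function(i) {
  d <- data.frame(id = 1:3, value = (1:3) * i)
  names(d)[2] <- paste0("v", i)   # give each table its own column name
  d
})

combined <- merge_all(base, others, key = "id")
print(combined)
```

One call replaces six nearly identical `merge()` lines, and if the key column changes, there is only one place to update.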
PART TWO: Data Analysis
(a) Intro to data analysis contains two courses. Exploratory Data Analysis in R introduces some basic statistics concepts such as measures of center (e.g., mean, median) and of spread (e.g., variance, standard deviation), as well as how to deal with outliers. The following course, Case Study, uses an extensive UN General Assembly dataset on historical voting and aims to put together everything you’ve learnt so far in a real-life example.
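Here is a minimal sketch of those measures in base R, on a made-up sample with one extreme value. The 1.5 × IQR rule for flagging outliers is a common convention that I am assuming here; it is not necessarily the approach the course uses.

```r
# Hypothetical sample with one extreme value
x <- c(12, 15, 14, 13, 16, 15, 14, 95)

# Measures of center: the mean gets pulled up by the extreme value, the median does not
center <- c(mean = mean(x), median = median(x))

# Measures of spread
spread <- c(variance = var(x), sd = sd(x))

# Flag outliers with the 1.5 * IQR rule
q        <- quantile(x, c(0.25, 0.75))
fence    <- 1.5 * IQR(x)
outliers <- x[x < q[1] - fence | x > q[2] + fence]

print(center)
print(spread)
print(outliers)   # 95
```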
BONUS: Now is the time to get out of DataCamp for a day and do your own real-life project! For me, I explored COVID-19 data and built visualizations to compare the progress of the disease in North Africa. Then I wrote a brief blog post about it. I needed to get out of DataCamp at this point because I thought the courses had way too much guidance, and if I did not get my hands dirty with some real life data, I was not going to learn much. This escapade took me ~3 hours, and I was pleasantly surprised by how much I had learnt through DataCamp.
(b) Linear regression includes a dedicated course on linear regression models: what they mean / do, how to interpret them and how to evaluate their efficacy. If you are unfamiliar with regression, this might require more focus and brain power than the previous sections.
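For readers who have never fitted a regression in R, this is roughly what it looks like. I am using the `mtcars` dataset that ships with R purely as an assumption for the example; it is not the data used in the course.

```r
# Fit a simple linear regression: fuel efficiency explained by car weight
fit <- lm(mpg ~ wt, data = mtcars)

# Interpret: the slope is the expected change in mpg per extra 1000 lbs of weight
coefs <- coef(fit)

# Evaluate: share of variance explained, and typical size of the errors
r2   <- summary(fit)$r.squared
rmse <- sqrt(mean(residuals(fit)^2))

print(coefs)
print(r2)
print(rmse)
```

Both evaluation numbers are computed on the same data the model was fitted on, so they are optimistic compared to what you would get on new data.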
(c) Intro to supervised learning contains 2 courses. The first one, Supervised Learning: Classification, introduces 4 techniques to classify a variable: k-Nearest Neighbors, logistic regression, Naive Bayes methods, and classification trees. For me, this was when the really challenging stuff started, and I can say that while I have a LOT to learn about why these techniques work, at least I have built an intuition about how they work. I felt super accomplished and happy at the end of these two courses :)
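Of the four techniques, logistic regression is the easiest to try without extra packages, so here is a minimal sketch of it in base R. The choice of the built-in `mtcars` data and the 0.5 cutoff are my assumptions for the example, not the course's setup.

```r
# Classify cars as manual (am = 1) or automatic (am = 0) from their weight
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Predicted probability of being manual, then a hard class with a 0.5 cutoff
prob <- predict(fit, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)

# Share of cars classified correctly (on the training data, so optimistic)
accuracy <- mean(pred == mtcars$am)

print(coef(fit))
print(accuracy)
```

A more honest evaluation would hold out part of the data for testing, which is the kind of process the courses walk you through.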
BONUS: Time for some time off DataCamp to practice these awesome new skills. I asked around, and my friends recommended I join Kaggle, an online data science community which has competitions for all levels. A famous competition to get started is the Titanic challenge, using supervised learning to predict who will survive the famous accident, and who will not. Once you get over the morbid nature of the task, it’s actually a super fun exercise. Again, I was surprised by how much I learnt through DataCamp (and I was pleased that my code produced a model with high accuracy!).
(d) Intro to unsupervised learning also includes two courses. The first one, Unsupervised Learning in R, introduces some techniques to cluster and find patterns in data that we do not understand yet, and includes k-means clustering, hierarchical clustering, and dimensionality reduction through PCA. The second course, Cluster Analysis in R, revisits these concepts in more detail and greater technicality. The difficulty was upped a notch from the supervised learning courses, but with some concentration and some googling, I was able to get through it!
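All three techniques are available in base R, so here is a short sketch of each. The built-in `iris` measurements, the choice of 3 clusters and the seed are assumptions I picked for the example.

```r
# Use the four numeric columns of iris, scaled so no variable dominates
features <- scale(iris[, 1:4])

# k-means clustering (random starts, so fix the seed for reproducibility)
set.seed(42)
km <- kmeans(features, centers = 3, nstart = 20)

# Hierarchical clustering: build the tree, then cut it into 3 groups
hc        <- hclust(dist(features), method = "complete")
hc_groups <- cutree(hc, k = 3)

# PCA: how much of the total variance each component captures
pca           <- prcomp(features)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

print(table(km$cluster))
print(table(hc_groups))
print(round(var_explained, 2))
```

Nothing here tells you whether 3 is the right number of clusters; choosing it is part of what makes this section harder than the supervised one.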
Now that I finished this introductory track, I am craving more learning!
However, I will take a break from DataCamp for the next 2–3 weeks to do two things. First, I want to absorb what I just learnt by creating summaries for the latter part of the course. I need to sit back, create a framework to understand these concepts so I can remember them. I might post some here, so stay tuned.
Second, I want to get my hands dirty with some real projects, where I am completely on my own! I will try to do as many competitions on Kaggle as is possible, in addition to some self-guided exploratory work. I think it’s important to break out of the hyper-gamified DataCamp environment to do some real-life experimentation.
After my experimentation is over, I hope to go back and start the Machine Learning Skill Track on DataCamp! I will report back when I am done :)
If you have read all the way to here: (1) Thank you and (2) make yourself known in the comment section so I can properly thank you for taking the time.