Hi,

This is a self-note tracker of learning content for acquiring data science skills.

This content will be updated as I find more useful resources along the journey.

The course structure below is taken from a 52-week curriculum to become a data scientist in 2021.

Course Structure

  1. Statistics and Probability (Week 1 to Week 6)
  2. Mathematics (Week 7 to Week 12)
  3. SQL (Week 13 to Week 21)
  4. Python and Programming (Week 22 to Week 28)
  5. Pandas (Week 29 to Week 33)
  6. Visualizing Data (Week 34 to Week 35)
  7. Data Exploration and Preparation (Week 36 to Week 39)
  8. Machine Learning (Week 40 to Week 51)
  9. Data Science Project (Week 52)

Statistics & Probability

Why Statistics and Probability?

Data science and machine learning are essentially a modern version of statistics. By learning statistics first, you’ll have a much easier time when it comes to learning machine learning concepts and algorithms! Even though it may seem like you’re not getting anything tangible out of the first few weeks, it will be worth it in the later weeks.

Week 1: Descriptive Statistics

Week 2: Probability

Week 3: Combinations and Permutations

Week 4: Normal Distribution and Sampling Distributions

Week 5: Confidence Intervals

Week 6: Hypothesis Testing
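As a taste of what hypothesis testing looks like in code, here's a minimal one-sample t-test sketch using only Python's standard library (the measurements and the null value of 5.0 are made up for illustration):

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """t statistic for H0: the population mean equals mu0."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)        # sample standard deviation (n - 1)
    return (mean - mu0) / (sd / math.sqrt(n))

# Eight made-up measurements whose mean is exactly 5.0, so t should be ~0
data = [4.9, 5.1, 5.0, 4.8, 5.2, 5.0, 4.95, 5.05]
t = one_sample_t(data, 5.0)
print(round(t, 6))  # ~0: no evidence against H0

# For df = 7 at the two-sided 5% level, the critical value is about 2.365,
# so |t| < 2.365 means we fail to reject H0.
```

The same test is one line with `scipy.stats.ttest_1samp`, but writing the formula out once makes the standard error explicit.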

Mathematics

Why Mathematics?

Like statistics, many data science concepts build on fundamental mathematical concepts.

In order to understand cost functions, you need to know differential calculus. In order to understand hypothesis testing, you need to understand integration. And to give one more example, linear algebra is essential for learning deep learning concepts, recommendation systems, and principal component analysis.

Week 7: Vectors and Spaces

Week 8: Dot Product and Matrix Transformations pt. 1

Week 9: Matrix Transformations pt. 2

Week 10: Eigenvalues and Eigenvectors
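To see the defining property of eigenvalues and eigenvectors in code, here's a small sketch assuming NumPy is installed (the matrix is an arbitrary example):

```python
import numpy as np

# Eigen-decomposition of a small symmetric matrix.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)

# Defining property: A @ v equals lambda * v for each eigenpair.
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ v, lam * v)

print(sorted(int(round(x)) for x in eigenvalues.real))  # → [1, 3]
```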

Week 11: Integrals

Week 12: Integrals pt. 2
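Once you're comfortable with integrals on paper, it helps to see one computed numerically. Here's a minimal sketch of the trapezoidal rule approximating the integral of x² on [0, 1], whose exact value is 1/3:

```python
# Trapezoidal rule: approximate the area under f on [a, b]
# by summing n thin trapezoids of width h.
def trapezoid(f, a, b, n=1000):
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return total * h

print(round(trapezoid(lambda x: x * x, 0.0, 1.0), 4))  # → 0.3333
```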

SQL

Why SQL?

SQL is arguably the most important skill to learn for any data-related profession, whether you're a data scientist, data engineer, data analyst, or business analyst; the list goes on.

At its core, SQL is used to extract (or query) specific data from a database, so that you can do things like analyze the data, visualize the data, model the data, etc. Therefore, developing strong SQL skills will allow you to take your analyses, visualizations, and modeling to the next level because you will be able to extract and manipulate the data in advanced ways.

I came across Mode’s curriculum a while back and it is fantastic! So I would first get familiar with using SQL in Mode and then you’ll be able to go through the topics below!

Week 13: Basic SQL

Week 14: LOGICAL and COMPARISON Operators

Week 15: AGGREGATES

Week 16: DISTINCT, CASE WHEN

Week 17: JOINS and UNIONS

Week 18: Subqueries and Common Table Expressions

Week 19: String Manipulations

Week 20: Date-time manipulation

Week 21: Window Functions

  • Window Functions (ROW_NUMBER(), RANK(), DENSE_RANK(), LAG(), LEAD(), SUM(), COUNT(), AVG())
  • See here for advanced window functions.
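To experiment with window functions without setting up a database, you can use Python's built-in sqlite3 module (this assumes your SQLite is version 3.25 or newer, which added window-function support; the table and values are invented for illustration):

```python
import sqlite3

# In-memory database with a tiny made-up sales table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (rep TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('ann', 100), ('ann', 300), ('bob', 200), ('bob', 200);
""")

# ROW_NUMBER() ranks rows within each rep; SUM() OVER totals per rep
# without collapsing the rows the way GROUP BY would.
rows = conn.execute("""
    SELECT rep,
           amount,
           ROW_NUMBER() OVER (PARTITION BY rep ORDER BY amount DESC) AS rn,
           SUM(amount)  OVER (PARTITION BY rep)                      AS rep_total
    FROM sales
    ORDER BY rep, rn
""").fetchall()

for row in rows:
    print(row)
```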

Python and Programming

Why Python?

I started with Python, and I’ll probably stick with Python for the rest of my life. It’s so far ahead in terms of open source contributions, and it’s straightforward to learn. Feel free to go with R if you want, but I have no opinions or advice to provide regarding R.

Week 22: Introduction to Python

Week 23: List, Tuples, Functions, Conditional Statements, Comparisons

Week 24: Dictionaries, Loops, Comments

Week 25: Try/Except, Reading & Writing files, Classes and Objects

Week 26: Recursion
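A minimal sketch of recursion, the classic factorial: a base case plus a self-call on a smaller input:

```python
def factorial(n: int) -> int:
    """n! computed recursively."""
    if n <= 1:          # base case stops the recursion
        return 1
    return n * factorial(n - 1)

print(factorial(5))  # → 120
```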

Week 27: Binary Trees
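Binary trees pair naturally with recursion. Here's a small sketch of an in-order traversal (the class name and values are made up for illustration):

```python
class Node:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def in_order(node):
    """Recursive in-order traversal: left subtree, node, right subtree."""
    if node is None:
        return []
    return in_order(node.left) + [node.value] + in_order(node.right)

#       2
#      / \
#     1   3
root = Node(2, Node(1), Node(3))
print(in_order(root))  # → [1, 2, 3]
```

In-order traversal of a binary search tree visits the values in sorted order, which is why it's a favorite interview topic.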

Week 28: APIs and Anaconda

Pandas

Why Pandas?

Arguably the most important library to know in Python is Pandas, which is specifically meant for data manipulation and analysis.

Week 29: Getting and Knowing your data

Week 30: Filtering and Sorting

Week 31: Grouping

Week 32: Apply

Week 33: Merge
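Putting the Pandas weeks together, here's a small sketch of a merge followed by a groupby, assuming pandas is installed (the column names and values are invented):

```python
import pandas as pd

# Two toy tables joined on a shared key.
users = pd.DataFrame({"user_id": [1, 2, 3], "name": ["ann", "bob", "cat"]})
orders = pd.DataFrame({"user_id": [1, 1, 3], "total": [20, 35, 15]})

# Left join keeps every order, then we total spend per user name.
merged = orders.merge(users, on="user_id", how="left")
spend = merged.groupby("name")["total"].sum()

print(spend.to_dict())  # → {'ann': 55, 'cat': 15}
```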

Visualizing Data

Why Data Visualizations?

The ability to visualize data and insights is so important because it's the easiest way to communicate intricate information, and a lot of it, at once. As a data scientist, you're always selling yourself and your ideas, whether you're pitching a new project or convincing others why your model should be productionized, and data visualizations are a great tool to help you with that.

There are dozens of data visualization libraries out there, but I’m going to focus on two: Matplotlib and Plotly.

Week 34: Data Visualizations with Matplotlib
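As a starting point for the Matplotlib week, here's a minimal sketch that draws one labelled line chart and saves it to disk (assuming Matplotlib is installed; the data and filename are arbitrary):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, runs without a display
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

fig, ax = plt.subplots()
ax.plot(x, y, marker="o", label="y = x^2")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A first Matplotlib chart")
ax.legend()
fig.savefig("chart.png")  # writes the figure to disk
```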

Week 35: Data Visualizations with Plotly

Data Exploration and Preparation

Why Data Exploration and Preparation?

“Garbage in, garbage out”

The models you create can only be as good as the data you feed into them. To understand the state of your data, i.e., whether it's "good" or not, you have to explore and prepare it. Therefore, for the next four weeks, I'm going to provide several amazing resources for you to go through to get a better understanding of what data exploration and preparation entail.

Week 36: Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) can be difficult because there’s no one set way of doing it — but that’s also what keeps it exciting. Generally, you want to…

  • Derive descriptive statistics (e.g. central tendency)
  • Perform univariate analysis (distributions and spread)
  • Perform multivariate analysis (scatterplots, correlation matrix, predictive power score, etc.)
  • Check for missing data and check for outliers

Check out a beginner’s guide to EDA here.
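A first EDA pass along those lines might look like this minimal pandas sketch (the toy columns and values are invented for illustration):

```python
import pandas as pd

# A tiny DataFrame with one deliberately missing value.
df = pd.DataFrame({
    "age":    [23, 35, 41, None, 29],
    "income": [40, 62, 75, 55, 48],
})

print(df.describe())                  # central tendency and spread
print(df.isna().sum())                # missing values per column
print(df["age"].corr(df["income"]))   # pairwise correlation (NaNs dropped)
```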

Week 37: Data Preparation: Feature Imputation and Normalization

Week 38: Feature Engineering and Feature Selection

Week 39: Imbalanced Datasets

Machine Learning

Why Machine Learning?

Everything that you’ve learned has led up to this point! Not only is machine learning interesting and exciting, but it is also a skill that all data scientists have. It’s true that modeling makes up a small portion of a data scientist’s time, but it doesn’t take away from its importance.

You might notice that I left out some machine learning algorithms, like K-Nearest Neighbors, Gradient Boosting, and CatBoost. This is completely intentional: if you can understand the following machine learning concepts, you'll have the skills to learn any other machine learning algorithm in the future.

Week 40: Introduction to Machine Learning

Week 41: Linear Regression
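Before reaching for a library, it's worth seeing simple linear regression fall out of the closed-form least-squares formulas. A pure-Python sketch with invented data:

```python
# Fit y = slope * x + intercept by least squares:
# slope = cov(x, y) / var(x); intercept = mean_y - slope * mean_x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]   # exactly y = 2x, so the fit should be perfect

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

print(slope, intercept)  # → 2.0 0.0
```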

Week 42: Logistic Regression

Week 43: Regularization

Week 44: Decision Trees

Week 45: Naïve Bayes

Week 46: Support Vector Machines

Week 47: Clustering

Week 48: Principal Component Analysis

Week 49: Bootstrap Sampling, Bagging, and Boosting

Week 50: Random Forests and Other Boosted Trees

Week 51: Model Evaluation Metrics
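To make the common evaluation metrics concrete, here's a pure-Python sketch that computes accuracy, precision, and recall from confusion-matrix counts (the toy labels are made up):

```python
# Binary classification results: true labels vs. model predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives

accuracy  = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # of predicted positives, how many were right
recall    = tp / (tp + fn)   # of actual positives, how many were found

print(accuracy, precision, recall)  # → 0.75 0.75 0.75
```

`sklearn.metrics` provides the same numbers, but writing the formulas once makes the precision/recall trade-off easier to reason about.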

Week 52: Data Science Project

If you feel comfortable with the materials above, you’re definitely ready to start your own data science project! Just in case, I’ve provided three ideas that you can use as inspiration to get started, but feel free to do whatever you like.

Idea 1: SQL Case Study

Link to the case.

The objective of this case is to determine the cause of a drop in user engagement for a social network called Yammer. Before diving into the data, you should read the overview of what Yammer does here. There are four tables to work with.

The link to the case above will provide you with much more detail pertaining to the problem, the data, and the questions that should be answered.

Check out how I approached this case study here if you’d like guidance.

Skills You’ll Develop

  • SQL
  • Data Analysis
  • Data Visualization if you choose to visualize your insights.

Idea 2: Trustpilot Webscraper

Web scraping is simple to learn and extremely useful, especially when it comes to collecting data for personal projects. Scraping a customer review website like Trustpilot is valuable for a company because it can track review trends (getting better or worse) and see what customers are saying via NLP.

First, I would get familiar with how Trustpilot is organized and decide which kinds of businesses to analyze. Then I would take a look at this walkthrough of how to scrape Trustpilot reviews.

Skills You’ll Develop

  • Writing Python Scripts
  • Data Wrangling
  • BeautifulSoup/Selenium (webscraping libraries)
  • Data Analysis
  • Take it further and apply NLP to extract insights from reviews.

Idea 3: Titanic Machine Learning Competition

In my opinion, there's no better way of showing that you're ready for a data science job than to showcase your code through competitions. Kaggle hosts a variety of competitions that involve building a model to optimize a certain metric, one of them being the Titanic Machine Learning Competition.

If you want to get some inspiration and guidance, check out this step-by-step walkthrough of one of the solutions.

Skills You’ll Develop

  • Data Exploration and Cleaning with Pandas
  • Feature Engineering
  • Machine Learning Modelling

With regards,
Subham (@webobite)