
What does a data scientist need to know?

  • Writer: tesfaygidey21
  • Apr 13, 2023
  • 4 min read

Updated: Jun 7, 2023

Due to the ongoing growth of data sources and the significance of big data in business decision-making, the Harvard Business Review [1] has dubbed data scientist "the sexiest job of the 21st century." An estimated 2.5 quintillion bytes of data are produced on the internet every day [2], but without data scientists, much of this data would remain meaningless.


A data scientist is a knowledgeable specialist who uses data analysis to understand and explain events and to help businesses make decisions; this is why data scientists frequently build predictive models for hypothesis testing and forecasting. One of their most difficult tasks is working with high-dimensional datasets: some of the information in the data may be redundant or irrelevant, or may cause multicollinearity or model overfitting, making the data unsuitable for exploratory data analysis (EDA) and predictive modeling.


One needs to be familiar with the fundamentals of data science in order to comprehend what skills a data scientist needs. In general, data science combines a variety of disciplines with specialized subject knowledge to uncover valuable insights hidden in a company's data and inform decision-making and strategy planning. These disciplines include probability and statistics, mathematics (algebra and calculus), advanced programming, data mining, artificial intelligence (AI), machine learning (ML), and deep learning (DL).


In summary, a data scientist might do the following tasks on a daily basis:

  • Understand the business domain

  • Find patterns and trends in datasets to uncover insights

  • Create algorithms and data models to forecast outcomes

  • Write programs that automate data processing and calculations

  • Use machine learning techniques to improve the quality of data or product offerings

  • Explain how the results can be used to solve business problems

  • Communicate recommendations to decision-makers and stakeholders at every level of technical understanding

  • Use data tools such as Python, R, SAS, or SQL in data analysis

  • Extract insights from big data using predictive analytics and artificial intelligence (AI), including machine learning models, natural language processing, and deep learning

  • Collaborate with other data science team members, such as data and business analysts, IT architects, data engineers, and application developers

  • Keep up to date in the data science field

The data science lifecycle, which comprises a range of roles, tools, and processes, can provide analysts with useful insights. A data science project often goes through the following phases:

1. Understanding the business domain: this phase covers requirements engineering, identifying priorities, and the project's cost-benefit analysis and budget.

2. Data collection: all forms of data, typically classified as structured or unstructured, can be collected from various relevant sources using a variety of methods.

  • Methods of data collection include, but are not limited to, manual entry, web scraping, and real-time streaming from systems and devices.

  • Sources of data: i) structured data, for instance customer records and patient records; ii) unstructured data, such as web documents, text, log files, video, audio, images, Internet of Things (IoT) readings, and social media content.
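A common collection task implied above is turning unstructured text, such as log files, into structured records. The sketch below parses hypothetical web-server log lines (the lines and field names are invented for illustration) into dictionaries ready for analysis:

```python
import re

# Hypothetical web-server log lines (unstructured text).
log_lines = [
    '192.168.0.1 - [13/Apr/2023:10:01:22] "GET /home HTTP/1.1" 200',
    '10.0.0.7 - [13/Apr/2023:10:01:23] "POST /login HTTP/1.1" 401',
]

# Named groups give each record a structured schema: ip, timestamp, method, path, status.
pattern = re.compile(
    r'(?P<ip>\S+) - \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) [^"]+" (?P<status>\d+)'
)

records = [pattern.match(line).groupdict() for line in log_lines]
print(records[0]["path"])  # /home
```

The same pattern-matching idea scales from a toy list to streaming ingestion, where each arriving line is parsed and routed to storage.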

3. Data storage and data preprocessing:

  • Data engineers are responsible for setting data storage standards appropriate to the type of data, which is critical for specifying storage that supports predictive-modeling workflows.

  • The main components of data preprocessing, which is essential for ensuring data quality before data is loaded into a data warehouse, data lake, or other repository, are data cleaning, handling high dimensionality, and transforming and combining data with ETL (extract, transform, load) jobs or other data integration technologies.
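The ETL steps above can be sketched end to end in a few lines. This toy example (the table name, columns, and values are assumptions for illustration) extracts rows from a CSV string, cleans out records with missing values, and loads the result into an in-memory SQLite database standing in for a warehouse:

```python
import csv
import io
import sqlite3

# Extract: read raw CSV data (here inlined; in practice, a file or API response).
raw = """name,age
Alice,34
Bob,
Carol,29
"""
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: drop rows with a missing age and cast types.
clean = [(r["name"], int(r["age"])) for r in rows if r["age"]]

# Load: insert the cleaned rows into a repository (an in-memory SQLite table here).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (name TEXT, age INTEGER)")
con.executemany("INSERT INTO customers VALUES (?, ?)", clean)
count = con.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(count)  # 2 rows survive cleaning
```

Real pipelines swap in a proper warehouse or data lake and a scheduler, but the extract-transform-load shape stays the same.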

4. Data analysis and modeling: data scientists perform both EDA and predictive analysis to address the business problem.

a) EDA is applied to:

  • Examine the dataset's distribution in relation to the response variable to identify possible patterns and extract useful hidden features from the data

  • Create and test plausible hypotheses

b) Predictive modeling is performed to:

  • Make inferences about the business domain for a given question, depending on the outcome variable, such as customer satisfaction, treatment effect, or system performance

5. Model deployment and communication:

  • This is the final phase of the data science lifecycle.

  • After a rigorous evaluation process, the model is finally prepared to be deployed in the desired format and preferred channel.

  • The model deployment stage entails developing the delivery mechanism required to get the model out to users or into another system.
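A very small illustration of that delivery mechanism: the trained model's parameters are serialized so a separate serving system can load them and make predictions without retraining. The model structure and parameter values here are invented for the sketch:

```python
import pickle

# Hypothetical trained model, reduced to its fitted parameters.
model = {"slope": 1.99, "intercept": 0.09}

# Serialize: what would be written to disk or pushed to a model registry.
blob = pickle.dumps(model)

# Deserialize: what the serving system would load at startup.
restored = pickle.loads(blob)

def predict(x, m=restored):
    """Serve a prediction from the restored parameters."""
    return m["slope"] * x + m["intercept"]

print(round(predict(10.0), 2))
```

Real deployments wrap this in an API endpoint or batch job, with monitoring and versioning around it, but serialize-ship-load is the core of getting a model to users.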

Now, one can group data scientist skills into two categories: technical and non-technical.

1. Technical skills:

  • Statistics and probability

  • Calculus and algebra

  • Programming, packages, and software skills to conduct EDA and advanced statistical modeling. Programming tools a data scientist needs to know include R, Python, and Stata; other common enterprise tools for statistical analysis include SAS, IBM SPSS, and S-PLUS.

  • Data wrangling

  • Database management: examples of popular DBMSs are MySQL, SQL Server, Oracle, PostgreSQL, and NoSQL systems (MongoDB, CouchDB, etc.).

  • Data visualization: popular tools are Tableau, Google Analytics, MS Excel, Plotly, SAS, R, and S-PLUS.

  • Machine learning and deep learning algorithms

  • Cloud computing: given the size of modern datasets and the range of available tools and platforms, understanding the cloud and cloud computing is a critical skill for a data scientist. Platforms such as AWS, Azure, and Google Cloud give data scientists access to databases, frameworks, programming languages, and operational tools.

2. Non-technical skills:

  • Communication

  • Analytical mindset

  • Out-of-the-box thinking

  • Decision making

  • Collaboration

  • Storytelling

  • Attention to detail

  • Continuous learning

References

