Data science
Data science is the application of statistics, computer science, and the scientific method to the practice of data analysis to convert data into information, with an emphasis on making accurate predictions.
Data science skills
Data science practices may overlap significantly with business intelligence, data engineering, and data analysis practices. However, the primary focus of data science is to apply machine learning and statistical methods to data.
Broadly speaking, a data scientist has at least foundational knowledge of science, statistics, data analysis, and programming. Depending on the individual, the skill set may be weighted heavily toward one or two of these skills rather than evenly balanced among all four.
The basic activities of data science fall into three groups: collecting, cleaning, and transforming data to create descriptive statistics and visualizations that help understand and communicate the data and its overall quality; building statistical models to support statistical inference, hypothesis testing, and predictions or projections; and using machine learning to automate decision making and predictions.
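The first group of activities (collect, clean, transform, then describe) can be illustrated with a minimal sketch that uses only the Python standard library. The records, field names, and helper functions here are hypothetical and exist only for illustration; real projects would more commonly use a data library for this work.

```python
import statistics

# Hypothetical raw records, as they might arrive from a CSV export.
raw_rows = [
    {"city": " Boston ", "temp_f": "68.0"},
    {"city": "boston", "temp_f": "71.6"},
    {"city": "Denver", "temp_f": ""},      # missing reading
    {"city": "Denver", "temp_f": "59.0"},
    {"city": "DENVER", "temp_f": "n/a"},   # unparseable reading
]


def clean(rows):
    """Drop rows without a numeric reading and normalize city names."""
    cleaned = []
    for row in rows:
        try:
            temp_f = float(row["temp_f"])
        except ValueError:
            continue  # skip readings that cannot be parsed as numbers
        cleaned.append({"city": row["city"].strip().title(), "temp_f": temp_f})
    return cleaned


def transform(rows):
    """Convert Fahrenheit readings to Celsius."""
    return [
        {"city": r["city"], "temp_c": (r["temp_f"] - 32) * 5 / 9}
        for r in rows
    ]


def describe(values):
    """Return a few basic descriptive statistics."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


rows = transform(clean(raw_rows))
summary = describe([r["temp_c"] for r in rows])

# A simple data quality measure: how many raw rows survived cleaning.
quality = {"rows_in": len(raw_rows), "rows_kept": len(rows)}

print(summary)
print(quality)
```

Reporting how many rows were dropped alongside the statistics is one simple way to communicate data quality together with the data itself.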
According to the O'Reilly 2021 Data/AI Salary Survey, the most popular programming languages for data science are Python (61% of surveyed data scientists), SQL (54%), and JavaScript (32%). The most popular machine learning packages are scikit-learn (27% of surveyed data scientists), TensorFlow (20%), and PyTorch (19%), all of which are Python libraries.
Data science process
The data science process starts with a question, which can take the form of a hypothesis to be tested, a decision to be made, or a prediction to be made. Data is then collected or, in some cases, created through experimentation. The collected data is then prepared through data wrangling. Next, a data model is prepared; this can be a numerical, statistical, or machine learning model that helps to analyze evidence to validate or invalidate a hypothesis, support a decision, or predict an outcome. The model is then evaluated for accuracy and, once validated, deployed and put into formal use.
Generally, these steps are followed iteratively and non-sequentially, with each step being repeated as needed to fully develop the model before it goes into production. Even after an initial production deployment, models are usually still iterated upon, improved, and redeployed.
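The process described above can be sketched end to end in a few lines of Python. This is a minimal illustration under stated assumptions, not a recommended workflow: the question, the synthetic data, the choice of a least-squares line as the model, and the acceptance threshold are all hypothetical.

```python
import random

# Question (hypothetical): can advertising spend predict weekly sales?
# The data is synthetic, generated here only so the sketch is self-contained.
random.seed(0)
spend = [float(x) for x in range(1, 41)]
sales = [3.0 * x + 10.0 + random.gauss(0, 2.0) for x in spend]

# Prepare: shuffle and split into a training set and a held-out test set.
pairs = list(zip(spend, sales))
random.shuffle(pairs)
train, test = pairs[:30], pairs[30:]


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_absolute_error(model, points):
    """Average absolute difference between predictions and observed values."""
    slope, intercept = model
    return sum(abs(y - (slope * x + intercept)) for x, y in points) / len(points)


# Model: fit on the training data only.
model = fit_line(train)

# Evaluate: measure error on data the model has not seen.
mae = mean_absolute_error(model, test)

# Validate against a hypothetical acceptance threshold before deployment.
MAX_ACCEPTABLE_MAE = 5.0
ready_to_deploy = mae <= MAX_ACCEPTABLE_MAE

print(f"slope={model[0]:.2f} intercept={model[1]:.2f} mae={mae:.2f}")
print("ready to deploy:", ready_to_deploy)
```

In practice, a failed validation check would send the work back to an earlier step, such as collecting more data or trying a different model, which is the iteration the text describes.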
Deeper Knowledge on Data Science
Machine Learning (ML)
Machine learning terms, processes, and methods
List of Public and Open Datasets
A list of freely available datasets for use in analytics and machine learning
Python Open-Source Data Libraries
Python libraries commonly used in data science and analysis
Data Wrangling
Transforming "raw" data into a more easily analyzed form through normalization and format standardization
Data Teams
The makeup and measures of effective data teams
Data Products
Ways of making data available
Broader Topics Related to Data Science
Business Intelligence
Methods to bridge the gap between data and business
Data Analysis
The transformation of data to information
Statistics
The analysis of numerical data
Data
Facts, statistics, and references to information
Computer Science
The study of algorithms, data structures, information, and computation