Data Science is an interdisciplinary field that uses statistics, computer science, programming, and domain knowledge to collect, process, and analyze data in order to acquire knowledge or solve problems. Data science also includes sharing acquired knowledge through storytelling, visualization, and other means of communication. Data science often employs methods such as machine learning, AI, natural language processing, algorithms, and other analytic tools to process and understand data.
Big data refers to datasets that are too large to store, process, or manage on a personal computer. Compared to traditional, smaller datasets, big data are much larger, are created or added to more quickly, are more varied in their structures, and are typically stored on large, cloud-based storage systems.
Researchers working with big data use specialized software tools, supercomputers, and high performance computing clusters designed to handle the volume and complexity of the datasets. Creators of artificial intelligence often train their programs with big data, and researchers may use machine learning to better understand or describe large datasets.
Artificial Intelligence (AI) refers to machine behavior that mimics human intelligence, and to the field of study focused on this type of intelligence. AI consists of computer programs that are typically built to adaptively update and enhance their own performance over time. They are used to process, analyze, and recognize patterns in large datasets, and they use those patterns to get better at completing tasks or solving problems. AI programs are used for a variety of purposes, including recommending new television shows based on viewers’ preferences, guiding self-driving cars through cities, and learning how to defeat players in games like chess. Machine learning is a subset of AI.
Machine Learning involves sophisticated algorithms that can be trained to sort information, identify patterns, and make predictions within large sets of data. Researchers use machine learning algorithms to build models based on sample or training data in order to make predictions or decisions without being explicitly programmed to do so. Learning can be “supervised” or “unsupervised,” depending on whether the input data are labeled. Supervised machine learning includes a level of human intervention and adjustment to the algorithm.
(From the Data Glossary, National Center for Data Services, National Library of Medicine)
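The supervised/unsupervised distinction above can be sketched in a few lines of pure Python. This is a toy illustration only: the one-dimensional data, the "low"/"high" labels, and the two-group assumption are all invented for the example. The supervised learner copies the label of the nearest labeled training point, while the unsupervised learner discovers the two groups from unlabeled numbers alone.

```python
# Toy 1-D data with two natural groups of values (illustrative only)
points = [1.0, 1.2, 0.8, 8.0, 8.4, 7.9]

# --- Supervised: training data come with human-provided labels ---
labeled = [(1.0, "low"), (1.2, "low"), (8.0, "high"), (8.4, "high")]

def predict(x):
    # 1-nearest-neighbor: return the label of the closest labeled point
    return min(labeled, key=lambda pair: abs(pair[0] - x))[1]

# --- Unsupervised: no labels; 2-means clustering discovers the groups ---
def two_means(data, iters=10):
    centers = [min(data), max(data)]  # crude but serviceable starting centers
    for _ in range(iters):
        groups = ([], [])
        for x in data:
            # Assign each point to its nearest center
            idx = 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
            groups[idx].append(x)
        # Move each center to the mean of its assigned points
        # (with this initialization and spread-out data, neither group is empty)
        centers = [sum(g) / len(g) for g in groups]
    return centers, groups
```

The supervised model depends on the labels a human supplied; `two_means` recovers the same two groups from the raw numbers with no labels at all, which is the essence of unsupervised learning.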
The Data Science landscape at Berkeley is rich and complex! Here are some resources to navigate it...
From the Information School: The Data Science Life Cycle
From Computing, Data Science, and Society (CDSS): their home page, information about their Berkeley Institute of Data Science (BIDS), and their Data Science Discovery Program
From Berkeley Research, the Data Science page with selected Data Science programs on campus (and a link to a full listing)
A data scientist must know how to write code, with skills ranging from basic programming to advanced analytical platforms. Commonly used tools include Apache Spark, C/C++, Java, Python, R, and SQL; each has a specific use.
(From Bootcamp.Berkeley.Edu)
Data science differs from data analytics in that it uses computer science skills (e.g., Python programming) and focuses on large and complex data repositories, where “complex” may refer to the modality of the data (images, time series, and text, as well as traditional tabular data) or to other facets of the data in question: data can be complex because they are geographically distributed, “unclean” or “unstructured,” or characterized by extensive missing or inaccurate values.
Although occasionally used as an umbrella term for all aspects of data analysis (including data science), in a practical sense, traditional data analytics tends to focus on “simpler” and more straightforward data processes. For example, a data analytics team may work on the extraction of structured data from a database or repository, cleaning it, analyzing it with Excel, and visualizing it in reports using Tableau, Power BI, Google Data Studio, or other reporting tools.
(From Rice University)
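The “cleaning” of unclean data mentioned above can be sketched in pure Python. The records and the mean-imputation strategy here are hypothetical, invented for illustration; real analytics pipelines typically handle missing values with tools like pandas or Excel rather than hand-written code.

```python
# Hypothetical "unclean" records: one age value is missing
rows = [
    {"name": "a", "age": "34"},
    {"name": "b", "age": ""},    # missing value
    {"name": "c", "age": "29"},
]

def clean_ages(records):
    # Parse the valid ages, then impute each missing entry
    # with the mean of the valid values (one simple strategy of many)
    valid = [int(r["age"]) for r in records if r["age"]]
    mean_age = sum(valid) / len(valid)
    return [
        {**r, "age": int(r["age"]) if r["age"] else mean_age}
        for r in records
    ]
```

Mean imputation is only one choice; depending on the analysis, a team might instead drop incomplete records or flag them for manual review.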
In 2021, three UC Berkeley Library/Research IT researchers explored the big data landscape at UC Berkeley. "Based on interviews with big data researchers at UC Berkeley as part of an Ithaka S+R project, [their] local report provides insights on researcher practices and challenges in six thematic areas: data collection & processing; analysis: methods, tools, infrastructure; research outputs; collaboration; training; and balancing domain vs data science expertise."
Local UC Berkeley report, "Supporting Big Data Research at the University of California, Berkeley"
Local report Executive Summary (blog post)
National Report (compilation of local reports from 22 institutions), "Big Data Infrastructure at the Crossroads: Support Needs and Challenges for Universities"