
Research Methods--Quantitative, Qualitative, and More: Data Science Methods (Machine Learning, AI, Big Data)

About Data Science Methods and Big Data

Data Science is an interdisciplinary field which uses statistics, computer science, programming, and domain knowledge to collect, process, and analyze data for the purpose of acquiring knowledge or solving a problem. Data science also includes sharing acquired knowledge through storytelling, visualization, and other means of communication. Data science often employs methods such as machine learning, AI, natural language processing, algorithms, and other analytic tools to process and understand data.

Big data refers to datasets that are too large to process on a personal computer. Compared to traditional, smaller datasets that can be stored, analyzed, and easily managed on a personal computer, big data refers to datasets that are much larger, are created or added to more quickly, are more varied in their structures, and are stored on large, cloud-based storage systems.

Researchers working with big data use specialized software tools, supercomputers, and high performance computing clusters designed to handle the volume and complexity of the datasets. Creators of artificial intelligence often train their programs with big data, and researchers may use machine learning to better understand or describe large datasets.

Artificial Intelligence (AI) refers both to machine behavior that mimics human intelligence and to the field of study focused on this type of intelligence. AI consists of computer programs that are typically built to adaptively update and enhance their own performance over time. They are used to process, analyze, and recognize patterns in large datasets, and they use those patterns to get better at completing tasks or solving problems. AI programs are used for a variety of purposes, including recommending new television shows based on viewers’ preferences, guiding self-driving cars through cities, and learning how to defeat players in games like chess. Machine learning is a subset of AI.

Machine Learning involves sophisticated algorithms which can be trained to sort information, identify patterns, and make predictions within large sets of data. Researchers use machine learning algorithms to build models based on sample or training data in order to make predictions or decisions without being explicitly programmed to do so. The approach can be “supervised” or “unsupervised,” depending on whether the input data are labeled. Supervised machine learning includes a level of human intervention and adjustment to the algorithm.

(From the Data Glossary, National Center for Data Services, National Library of Medicine)
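To make the supervised/unsupervised distinction concrete, here is a minimal Python sketch. It is not taken from the glossary: the toy measurements, the labels "small" and "large", and the function names are all hypothetical, and the two techniques shown (a nearest-mean classifier and a one-dimensional two-group k-means) are just simple stand-ins for the many algorithms used in practice.

```python
# Hypothetical toy data: (measurement, human-provided label).
labeled = [(1.0, "small"), (1.2, "small"), (0.8, "small"),
           (5.0, "large"), (5.3, "large"), (4.7, "large")]


def train_centroids(examples):
    """Supervised step: use the labels to compute one mean value per label."""
    sums, counts = {}, {}
    for value, label in examples:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(centroids, value):
    """Predict the label whose mean is closest to a new, unlabeled value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))


def two_means(values, iterations=10):
    """Unsupervised step: split unlabeled values into two groups.

    A 1-D k-means with k=2. Assumes at least two distinct values.
    """
    low, high = min(values), max(values)
    for _ in range(iterations):
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(group_low) / len(group_low)
        high = sum(group_high) / len(group_high)
    return sorted(group_low), sorted(group_high)


# Supervised: learn from labeled examples, then label a new measurement.
centroids = train_centroids(labeled)
prediction = predict(centroids, 4.5)

# Unsupervised: discard the labels and let the algorithm find the groups.
unlabeled = [value for value, _ in labeled]
clusters = two_means(unlabeled)

print(prediction)
print(clusters)
```

In the supervised half, a person supplied the labels that the model learns from. In the unsupervised half, the algorithm only sees the numbers and finds the grouping on its own.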

Methods Texts

Data Science at Berkeley

The Data Science landscape at Berkeley is rich and complex!  Here are some resources to navigate it...

From the Information School: The Data Science Life Cycle

From Computing, Data Science, and Society (CDSS): their home page, information about their Berkeley Institute of Data Science (BIDS), and their Data Science Discovery Program

From Berkeley Research, the Data Science page with selected Data Science programs on campus (and a link to a full listing)

UC Berkeley Open Online Data Science Courses

The foundational undergraduate Data Science course at Berkeley is the legendary Data 8, which is available online:

Data 8

With Data 8 under your belt, check out Data 100, "Principles and Techniques of Data Science", which also has an open website:

Data 100

 

Data Science Tools

A data scientist must know how to write code, with skills ranging from basic coding to advanced analytical platforms. Commonly used tools include Apache Spark, C/C++, Hadoop, Java, Python, R, and SQL. Each tool has a specific use.

  • Apache Spark is often preferred over other tools for data analysis because it can keep intermediate computations in memory. This lets it run complicated algorithms more quickly, which matters when dealing with large data sets. Caching data in memory also avoids repeatedly rereading or recomputing the same data.
  • Hadoop is often used when data volumes exceed available memory. The platform distributes data across multiple servers. Hadoop is also well suited to data exploration, filtration, sampling, and summarization.
  • Python is an increasingly popular programming language. Its versatility lets data scientists accomplish many different tasks, such as creating data sets or importing SQL tables.
  • SQL is often required knowledge for data scientists, who use it to add, delete, or extract information from databases. SQL can also perform analytical functions, and its precise commands let users run queries quickly.

(From Bootcamp.Berkeley.Edu)
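As a rough illustration of how Python and SQL work together, the sketch below uses Python's built-in sqlite3 module to add, delete, extract, and summarize rows. It is not from the source above: the surveys table, its columns, and the sample values are hypothetical, and an in-memory SQLite database stands in for whatever database a real project would use.

```python
import sqlite3

# An in-memory database, so nothing is written to disk (assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE surveys (id INTEGER PRIMARY KEY, campus TEXT, score REAL)"
)

# Adding: insert several rows, one of them with a missing score.
conn.executemany(
    "INSERT INTO surveys (campus, score) VALUES (?, ?)",
    [("north", 3.5), ("north", 4.5), ("south", 2.0), ("south", None)],
)

# Deleting: drop the rows that have no score.
conn.execute("DELETE FROM surveys WHERE score IS NULL")

# Extracting: pull only the rows that match a condition.
rows = conn.execute(
    "SELECT campus, score FROM surveys WHERE score > 3 ORDER BY score"
).fetchall()

# Analytical functions: count and average per campus.
summary = conn.execute(
    "SELECT campus, COUNT(*), AVG(score) FROM surveys "
    "GROUP BY campus ORDER BY campus"
).fetchall()

conn.close()

print(rows)
print(summary)
```

The same SQL statements would look much the same against a larger database system; only the connection step would change.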

Data Science vs. Data Analytics

Data science differs from data analytics in that it uses computer science skills (e.g., Python programming) and focuses on large and complex data repositories. Here “complex” may refer to the modality of the data (images, time series, and text, as well as traditional tabular data) or to other facets of the data in question: data can be complex because they are geographically distributed, "unclean" or "unstructured", or characterized by extensive missing or inaccurate values.

Although "data analytics" is occasionally used as an umbrella term for all aspects of data analysis (including data science), in a practical sense traditional data analytics tends to focus on "simpler" and more straightforward data processes. For example, a data analytics team may extract structured data from a database or repository, clean it, analyze it with Excel, and visualize it in reports using Tableau, Power BI, Google Data Studio, or other reporting tools.

(From Rice University)
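The extract, clean, and analyze steps described above can be sketched in a few lines of Python using only the standard library. This is an assumption-laden illustration, not a method from the source: the region and sales columns, the sample values, and the choice to skip (rather than fill in) missing or non-numeric entries are all hypothetical.

```python
import csv
import io
from statistics import mean

# Hypothetical extract from a structured source, with some "unclean" values.
RAW = """region,sales
west,100
east,
west,140
east,n/a
east,90
"""


def load_clean(text):
    """Parse CSV text and keep only rows whose sales value is numeric."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        try:
            sales = float(record["sales"])
        except ValueError:
            continue  # skip missing or non-numeric entries
        rows.append((record["region"], sales))
    return rows


def mean_by_region(rows):
    """Summarize the cleaned rows as an average sales figure per region."""
    groups = {}
    for region, sales in rows:
        groups.setdefault(region, []).append(sales)
    return {region: mean(values) for region, values in groups.items()}


clean = load_clean(RAW)
report = mean_by_region(clean)

print(len(clean))
print(report)
```

A reporting tool would then chart a summary like this one. Whether to drop or estimate missing values is a judgment call that depends on the data and the question being asked.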

Research About Big Data at Berkeley

In 2021 three UC Berkeley Library/Research IT researchers explored the big data landscape at UC Berkeley. "Based on interviews with big data researchers at UC Berkeley as part of an Ithaka S+R project, [their] local report provides insights on researcher practices and challenges in six thematic areas: data collection & processing; analysis: methods, tools, infrastructure; research outputs; collaboration; training; and balancing domain vs data science expertise."

Local UC Berkeley report, "Supporting Big Data Research at the University of California, Berkeley"

Local report Executive Summary (blog post)

National Report (compilation of local reports from 22 institutions), "Big Data Infrastructure at the Crossroads: Support Needs and Challenges for Universities"