In this project, we utilize the open-source R programming language, alongside Python, to evaluate the consistency of large language models (LLMs) through intra- and inter-rater reliability methods. For R, we are using version 4.4.2 (2024-10-31 ucrt), and we manage package dependencies with renv. The renv package provides isolated project environments, ensuring that the same versions of R packages are used consistently. If you're new to renv, it lets you create "snapshots" of package versions, ensuring reproducibility across different systems. You can find all package versions used in this project in the renv lockfile (renv.lock) in our repository at this link.
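For readers new to renv, the sketch below shows the typical workflow; it is a minimal illustration rather than the exact commands recorded in our project history.
# hedged sketch of the renv workflow used for this project
renv::init()      # initialize an isolated project library (run once per project)
renv::snapshot()  # record the exact package versions in renv.lock
renv::restore()   # reinstall the recorded versions on another machine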
This document is written in RMarkdown, which seamlessly integrates both R and Python code chunks. Readers can tell the R and Python chunks apart by downloading the raw RMarkdown file from this repository. In addition, the document features a floating table of contents (TOC) on the side that follows the reader as they scroll, making it easy to jump between sections.
We have also implemented foldable code chunks, so you can expand and collapse code as needed and focus on the explanations or results. This improves readability by letting you hide the code while reading the main content. If you would like to view or modify the raw code, click the "Code" button at the top of the document to download the full RMarkdown file, which contains all the code chunks used in the analysis.
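For illustration, the floating TOC and foldable chunks are driven by RMarkdown YAML header options along these lines; the exact header of our file may differ slightly.
---
output:
  html_document:
    toc: true
    toc_float: true
    code_folding: hide
    code_download: true
---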
For the Python sections, we use a conda environment with Python version 3.12.9. The required Python packages can be installed via the requirements.txt file, available here. The RMarkdown document links to the conda environment, allowing Python chunks to execute alongside the R code in the same workflow.
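As a rough sketch, a setup chunk can bind the Python chunks to the conda environment via reticulate; the environment name below is a hypothetical placeholder, not necessarily the one we used.
# hedged sketch: point reticulate at the conda environment used for the Python chunks
reticulate::use_condaenv("llm-consistency", required = TRUE)  # hypothetical environment name
reticulate::py_config()  # confirm which Python interpreter is active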
To securely store API secrets for the large language models (LLMs) accessed in the Python chunks, we use a .env file that holds these keys in a safe, environment-specific manner. Similarly, for the stock news API accessed in the R sections, we store the API secret in a project-level .Renviron file. Both files keep sensitive credentials secure and allow us to avoid hardcoding API secrets directly in the code. For obvious security reasons, we did not push these .env and .Renviron files to our GitHub repository.
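As a hedged illustration of the R side of this setup (the stock_token key name matches what we use later in this document; the value shown is obviously a placeholder):
# open the project-level .Renviron for editing
usethis::edit_r_environ(scope = "project")
# the file then contains a single line such as:
#   stock_token=YOUR_API_TOKEN_HERE
# after restarting R, the key is available without hardcoding it in the code
stock_api_token <- Sys.getenv("stock_token")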
All input and output files for this project can be accessed and downloaded from our GitHub repository.
We computed the minimum sample size for each of our experiments for three reliability metrics: simple percent agreement, Gwet's AC1 coefficient, and the Brennan-Prediger coefficient. The minimum sample sizes were obtained from the tables in the Handbook of Inter-Rater Reliability, 5th Edition, Volume 1: Analysis of Categorical Ratings, using a margin of error of 0.05, a confidence level of 0.90, and five replicates.
# create a data frame of the obtained sample size results
binary_data = data.frame(
metric = c("Percent Agreement", "Gwet’s AC1 Coefficient", "Brennan-Prediger Coefficient"),
`sample size` = c(216, 1317, 847) |> scales::comma()
)
# generate the table in HTML format
knitr::kable(
binary_data, format = "html", table.attr = "style='width:50%;'",
align = c('l','r')
) |>
kableExtra::kable_styling(full_width = FALSE) |>
kableExtra::column_spec(1, bold = TRUE)
| metric | sample.size |
|---|---|
| Percent Agreement | 216 |
| Gwet’s AC1 Coefficient | 1,317 |
| Brennan-Prediger Coefficient | 847 |
Therefore, using the highest sample size among the three metrics, we need at least 1,317 samples for the binary classification experiment.
In our experiments, we selected the following LLMs:

Closed-source proprietary foundation models:

- claude-3-7-sonnet-20250219, which is Anthropic's current best model as of March 12, 2025;
- claude-3-5-haiku-20241022, which represents the smallest model provided by Anthropic as of March 12, 2025;
- gpt-4o-2024-11-20, which is OpenAI's latest snapshot of the 4o series as of March 12, 2025. Note that we do not examine the more recent and capable GPT-4.5 Preview model, as it is very expensive ($150 per million output tokens versus $10 per million output tokens for the GPT-4o series);
- gpt-4o-mini-2024-07-18, which is OpenAI's latest stable mini model as of March 12, 2025. These models are fast and cost-optimized for low-latency applications.

Open LLMs:

The open LLMs were run locally using the Ollama interface (v0.6.0), with the exception of command-r-plus-08-2024, which we ran using the Cohere API because its 104B parameters would be too large to run on our GPU (NVIDIA RTX 5000 Ada Generation, with 32GB of graphics memory). The models are:

- command-r-plus-08-2024, which is Cohere's latest large language model as of March 12, 2025;
- command-r7b, which is Cohere's latest small language model as of March 12, 2025. Note that we run this 7B-parameter model locally via the Ollama interface;
- deepseek-r1:7B, which is a popular model from DeepSeek and is the default when pulling the deepseek-r1 models from Ollama;
- deepseek-r1:1.5B, which is the smallest model from DeepSeek;
- gemma3:27B, which is the largest open-sourced Gemma3 model from Google, released on March 12, 2025;
- gemma3:1B, which is the smallest open-sourced Gemma3 model from Google, released on March 12, 2025;
- llama3.2:3B, the larger of the two models Meta released as part of its small, open-sourced Llama3.2 series;
- llama3.2:1B, the smaller of the two models Meta released as part of its small, open-sourced Llama3.2 series;
- phi4:latest, a 14B-parameter, state-of-the-art open model from Microsoft;
- phi4-mini, which is the smallest open-sourced Phi model from Microsoft, with 3.8B parameters.
# load the necessary libraries
import os
from dotenv import load_dotenv, find_dotenv
from langchain.prompts.chat import ChatPromptTemplate
# Find the .env path
dotenv_path = find_dotenv()
# load the environment variables
load_dotenv(dotenv_path, override=True)
# the API keys for the different LLMs
openai_api_key = os.getenv('OPENAI_API_KEY')
anthropic_api_key = os.getenv('ANTHROPIC_API_KEY')
cohere_api_key = os.getenv('COHERE_API_KEY')
# the LLM models to be used for labeling
# we ran this over two different run times since we
# had to abort the "exaone-deep" models as they were quite slow
# other commented out models were successfully run in the first run
# so we did not need to run them again
models = [
# 'claude-3-7-sonnet-20250219',
# 'claude-3-5-haiku-20241022',
'gpt-4o-2024-11-20',
'gpt-4o-mini-2024-07-18',
# 'command-r-plus-08-2024',
# 'command-r7b',
# 'deepseek-r1:7B',
# 'deepseek-r1:1.5B',
# 'exaone-deep:7.8b',
# 'exaone-deep:2.4b',
'gemma3:27B',
'gemma3:1B',
'llama3.2:3B',
'llama3.2:1B',
'phi4:latest',
'phi4-mini'
]
We created a custom function, generalized_chat_completion, to facilitate the interaction with the LLMs and generate chat completions for each news article in the dataset. The function takes the following parameters:

- csv_path: the path to the CSV file containing the news articles;
- columns_to_keep: the columns to retain in the final CSV output;
- models: a list of chat models to be used;
- chat_prompt_template: the prompt template for generating chat messages;
- columns_for_chat_prompt: the columns to be used as the user input in the chat prompt template;
- num_replicates: the number of replicates per model;
- temp: the temperature for the chat model;
- max_num_tokens: the maximum number of tokens for chat completions;
- save_to_csv: whether to save results to CSV;
- output_file: the path to the output CSV file; and
- retry_attempts: the number of retry attempts for API errors.

The function reads the CSV file, replicates the data based on the number of models and replicates, sorts the data frame, and iterates through each row to generate chat completions. It uses the specified chat model to generate chat responses, with error handling for API errors. The function saves the results to a CSV file if required and returns the last chat response content if save_to_csv is False.
import os
import time
import pandas as pd
import datetime as dt
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_cohere import ChatCohere
from langchain_ollama import ChatOllama
def generalized_chat_completion(
csv_path,
columns_to_keep,
models,
chat_prompt_template,
columns_for_chat_prompt,
num_replicates,
temp=0,
max_num_tokens=3000,
save_to_csv=True,
output_file='../results/generalized_classification.csv',
retry_attempts=3
):
"""
Generalized function to generate chat completions from a CSV file, with flexible parameters
for various chat models, sorting order, and error handling.
Parameters:
csv_path (str): Path to the CSV file to read data from.
columns_to_keep (list): Columns to retain in the final CSV output.
models (list): List of chat models to be used.
chat_prompt_template (str): The prompt template for generating chat messages.
columns_for_chat_prompt (list): Columns to be used as the user input in the chat prompt template.
num_replicates (int): Number of replicates per model.
temp (float): Temperature for the chat model.
max_num_tokens (int): Maximum number of tokens for chat completions.
save_to_csv (bool): Whether to save results to CSV.
output_file (str): Path to the output CSV file.
retry_attempts (int): Number of retry attempts for API errors.
Returns:
None or the last chat response content if save_to_csv is False.
"""
# read the CSV file
df = pd.read_csv(csv_path)
num_rows = df.shape[0]
total_repeats = len(models) * num_replicates
# add an index column as 'article_num'
df['article_num'] = df.index
# replicate the data frame based on the total repeats
expanded_df = pd.concat([df] * total_repeats, ignore_index=True)
# generate model and replicate columns
model_column = [model for model in models for _ in range(num_replicates * num_rows)]
replicate_column = [i + 1 for _ in range(len(models)) for i in range(num_replicates) for _ in range(num_rows)]
# add the model and replicate columns to the data frame
expanded_df['replicate'] = replicate_column
expanded_df['chat_model'] = model_column
# sort the data frame by 'chat_model', 'article_num' and 'replicate'
expanded_df = expanded_df.sort_values(by=['chat_model','article_num', 'replicate']).reset_index(drop=True)
# iterate through each row and generate chat completions
for index in range(expanded_df.shape[0]):
prompt_data = {col: expanded_df.loc[index, col] for col in columns_for_chat_prompt}
messages = chat_prompt_template.format_messages(**prompt_data)
# extract model name and assign the correct chat model
model = expanded_df.loc[index, 'chat_model']
chat_model = None
if model == 'gpt-4o-2024-11-20':
chat_model = ChatOpenAI(model="gpt-4o-2024-11-20", temperature=temp, max_tokens=max_num_tokens)
elif model == 'gpt-4o-mini-2024-07-18':
chat_model = ChatOpenAI(model="gpt-4o-mini-2024-07-18", temperature=temp, max_tokens=max_num_tokens)
elif model == "claude-3-7-sonnet-20250219":
chat_model = ChatAnthropic(model="claude-3-7-sonnet-20250219", temperature=temp, max_tokens=max_num_tokens)
elif model == "claude-3-5-haiku-20241022":
chat_model = ChatAnthropic(model="claude-3-5-haiku-20241022", temperature=temp, max_tokens=max_num_tokens)
elif model == "command-r-plus-08-2024":
chat_model = ChatCohere(model="command-r-plus-08-2024", temperature=temp, max_tokens=max_num_tokens)
elif model == "command-r7b":
chat_model = ChatOllama(model="command-r7b", temperature=temp, max_tokens=max_num_tokens)
elif model == 'deepseek-r1:7B':
chat_model = ChatOllama(model="deepseek-r1:7B", temperature=temp, max_tokens=max_num_tokens)
elif model == 'deepseek-r1:1.5B':
chat_model = ChatOllama(model="deepseek-r1:1.5B", temperature=temp, max_tokens=max_num_tokens)
elif model == 'exaone-deep:7.8b':
chat_model = ChatOllama(model="exaone-deep:7.8b", temperature=temp, max_tokens=max_num_tokens)
elif model == 'exaone-deep:2.4b':
chat_model = ChatOllama(model="exaone-deep:2.4b", temperature=temp, max_tokens=max_num_tokens)
elif model == 'gemma3:27B':
chat_model = ChatOllama(model="gemma3:27B", temperature=temp, max_tokens=max_num_tokens)
elif model == 'gemma3:1B':
chat_model = ChatOllama(model="gemma3:1B", temperature=temp, max_tokens=max_num_tokens)
elif model == 'llama3.2:3B':
chat_model = ChatOllama(model="llama3.2:3B", temperature=temp, max_tokens=max_num_tokens)
elif model == 'llama3.2:1B':
chat_model = ChatOllama(model="llama3.2:1B", temperature=temp, max_tokens=max_num_tokens)
elif model == 'phi4:latest':
chat_model = ChatOllama(model="phi4:latest", temperature=temp, max_tokens=max_num_tokens)
elif model == 'phi4-mini':
chat_model = ChatOllama(model="phi4-mini", temperature=temp, max_tokens=max_num_tokens)
else:
print(f"Model {model} is not supported. Skipping this row.")
continue
# attempt to generate chat response with retries for error handling
chat_response_content = "--"
chat_response_id = None
for attempt in range(retry_attempts):
try:
chat_response = chat_model.invoke(messages)
chat_response_content = chat_response.content
chat_response_id = chat_response.id
break
except Exception as e:
error_message = str(e)
if attempt == retry_attempts - 1:
print(f"Failed after {retry_attempts} attempts for model {model} on index {index}. Error: {error_message}")
else:
print(f"Attempt {attempt + 1} failed for model {model} on index {index}. Retrying in 30 seconds...")
time.sleep(30)
# create the row with the desired columns
data_row = pd.DataFrame({
**{col: [expanded_df.loc[index, col]] for col in columns_to_keep},
'chat_model': [model],
'chat_date': [dt.datetime.now().strftime('%Y-%m-%d %H:%M:%S')],
'chat_replicate': [expanded_df.loc[index, 'replicate']],
'chat_response': [chat_response_content],
'chat_response_id': [chat_response_id]
})
# save to CSV if required
if not save_to_csv:
continue
elif not os.path.exists(output_file):
data_row.to_csv(output_file, index=False)
else:
existing_df = pd.read_csv(output_file)
updated_df = pd.concat([existing_df, data_row], ignore_index=True)
updated_df.to_csv(output_file, index=False)
if not save_to_csv:
return chat_response_content
We created a custom function, extract_classification, to extract the classification labels from the chat completions generated by the LLMs. The function takes the chat response content and a list of valid words as input, extracts the classification label from the chat response based on the valid words provided, and returns the extracted label if it matches any of the valid words; otherwise, it returns NA.
extract_classification <- function(text, valid_words) {
# ----------------------------
# Helper: drop rows with unwanted words
# ----------------------------
drop_row = function(row) {
all(is.na(row) | row == "classification.\n\ntemp")
}
# ----------------------------
# Helper: Pre-clean text1
# ----------------------------
clean_text1 <- function(text) {
patterns <- c(
"here ", "",
"classification;", "",
"definitive classification,", "",
"The classification is based on", "",
"classification based on", "",
"classification based ", "",
"classification requires", "it requires",
"classification without", "",
"it is impossible", "classification: Impossible",
"challenging classification", "",
"the classification challenging", "it challenging",
"\n---\n", " ",
"\\*", "",
'\\"', "",
"cannot definitively classify this news as either Positive or Negative", "",
"either Positive or Negative", "",
"as:", "as",
"\n</analysis>: \n<classification>:", " classification:",
"making it challenging to", "",
"classification.</classification>", "",
"\n\nClassification:\n-", "classification:",
"classification.\n\n<classification>:", "classification:",
"article", "news",
"</analysis>", "<analysis>",
"classification\\. ", "",
"classification because|classification for", "",
"positively", "positive",
"negatively", "negative",
"leaning slightly towards", "classification:",
"the impact is likely to be", "classification:"
)
for (i in seq(1, length(patterns), by = 2)) {
text <- gsub(patterns[i], patterns[i + 1], text)
}
return(text)
}
# ----------------------------
# Helper: Normalize sentiment pattern1 to standard format
# ----------------------------
normalize_sentiment_phrases1 <- function(text) {
sentiment_patterns <- c(
"\\n\\s*<analysis>:",
"<classification>:(.*?)is likely to be",
"likely be",
"is likely to be",
"is expected to have a",
"Classification:(.*?)is expected to",
"classification leans towards",
"is likely",
"is likely to have a",
"leading to a",
"likely leads to a",
"suggest a potential",
"Overall, the immediate"
)
sentiment_types <- "(positive|negative|neutral|mixed)"
for (pattern in sentiment_patterns) {
full_pattern <- regex(paste0(pattern, "\\s*", sentiment_types, "\\s*\\n?"), ignore_case = TRUE)
text <- str_replace_all(
text,
full_pattern,
function(m) {
sentiment <- str_extract(m, "(?i)positive|negative|neutral|mixed")
paste0("<classification>: ", str_to_title(sentiment), " ")
}
)
}
return(text)
}
# ----------------------------
# Pre-clean text2
# ----------------------------
clean_text2 <- function(text) {
patterns <- c(
"\n{1,2}(<classification>)", "\\1",
"classification:\n\n<analysis>", "",
"the classification becomes challenging", "it becomes challenging",
"strict classification guidelines\\.", "",
"classification guidelines:|classification guidelines", "classification:",
"\n\nClassification:\nThe news has a", " classification:",
"classification due to", "",
"categorize this news as having either", "to",
"categorize this news as|classify this news as|classify this as|the classification should be|still classified it as|I would categorize this as|classified as|is categorized as|making it|making it a", "classification:",
"\n</classification>", "",
"I classify the impact of this news", "I would classify the impact of this news",
"I would classify the news", "I would classify the impact of this news"
)
for (i in seq(1, length(patterns), by = 2)) {
text <- gsub(patterns[i], patterns[i + 1], text)
}
return(text)
}
# ----------------------------
# Helper: Match a sentence that starts with the classification phrase and ends with a label
# ----------------------------
detect_sentiment_from_classification_sentence <- function(text) {
match <- str_match(
text,
regex(
"I would classify the impact.*?as\\s+[\"']?(positive|negative|mixed)[\"']?[.?!]?",
ignore_case = TRUE
)
)
if (!is.na(match[2])) {
output <- paste0("<classification>: ", str_to_title(match[2]), " ")
return(output) # Capitalize
} else {
return(text)
}
}
# Helper: Normalize sentiment pattern2 to standard format
# ----------------------------
normalize_sentiment_phrases2 <- function(text) {
sentiment_patterns <- c(
"I would classify this news news as",
"the classification is",
"the tone of the news is",
"the news presents a"
)
sentiment_types <- "(positive|negative|neutral|mixed)"
for (pattern in sentiment_patterns) {
full_pattern <- regex(paste0(pattern, "\\s*", sentiment_types, "\\s*\\n?"), ignore_case = TRUE)
text <- str_replace_all(
text,
full_pattern,
function(m) {
sentiment <- str_extract(m, "(?i)positive|negative|neutral|mixed")
paste0("<classification>: ", str_to_title(sentiment), " ")
}
)
}
return(text)
}
# ----------------------------
# Helper: Detect "It is likely that..." sentences
# ----------------------------
detect_likely_sentiment <- function(text){
# Extract all sentences that start with "It is likely that"
matches <- str_match_all(
text,
regex("it is likely that[^.?!]*\\b(positive|negative)\\b[^.?!]*[.?!]", ignore_case = TRUE)
)[[1]]
if (nrow(matches) > 0) {
output <- paste0("<classification>: ", str_to_title(matches[1, 2]), " ") # "Positive" or "Negative"
return(output)
} else {
return(text)
}
}
# ----------------------------
# Step 1: Clean and Normalize
# ----------------------------
text <- clean_text1(text)
text <- normalize_sentiment_phrases1(text)
text <- gsub("</classification>:", "<classification>:", text)
text_temp <- str_extract(text, "(?i)<classification>:\\s*(positive|negative|neutral|mixed)\\s*$")
if (!is.na(text_temp)) text <- text_temp
text <- clean_text2(text)
text <- detect_sentiment_from_classification_sentence(text)
text <- detect_likely_sentiment(text)
text <- normalize_sentiment_phrases2(text)
text = stringr::str_replace(text, "Mixed to ", "")
text = stringr::str_replace_all(text, c(
"Slightly" = "", # Remove the word "slightly"
"clearly" = "", # Remove the word "clearly"
'\\"' = "", # Remove escaped quotes (\\")
'"' = "" # Remove regular quotes (")
))
if (str_detect(text, "<classification>: Positive")) text = "<classification>: Positive"
if (str_detect(text, "<classification>: Negative")) text = "<classification>: Negative"
if (str_detect(text, "<classification>: Neutral")) text = "<classification>: Neutral"
# ----------------------------
# Step 2: Extract final classification
# ----------------------------
# define a regex pattern to match:
# 1. "\n\nclassification" with or without special characters or colons after it
# 2. "\n\n<classification>:" or "\n\n<classification>word</classification>"
matches = stringr::str_match_all(text, "(?i)(classification[^a-zA-Z]*:?\\s*\\w+|<classification>(\\w+)</classification>)")
# get the last match if it exists
if (length(matches[[1]]) > 0) {
if (nrow(matches[[1]]) > 1) {
# Apply the function to filter out unwanted rows
matches[[1]] = matches[[1]][!apply(matches[[1]], 1, drop_row), , drop = FALSE]
}
first_match = head(matches[[1]], 1)
last_match = tail(matches[[1]], 1)
word = NA
word1 = NA
word2 = NA
# Handle the case for "<classification>word</classification>"
if (!is.na(first_match[, 3])) {
word = first_match[, 3] # Extract word between <classification> and </classification>
}else if (!is.na(last_match[, 3])) {
word = last_match[, 3]
}else {
word1 = stringr::str_extract(first_match[, 1], "\\w+$") # Otherwise, extract word after "classification"
word2 = stringr::str_extract(last_match[, 1], "\\w+$")
}
# Check if the extracted word is in the valid_words list
if (tolower(word) %in% tolower(valid_words)) {
return(word)
} else if (tolower(word1) %in% tolower(valid_words)){
return(word1)
} else if (tolower(word2) %in% tolower(valid_words)){
return(word2)
} else {
return(NA) # Return NA if no match is found
}
} else {
return(NA) # Return NA if no match is found
}
}
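As a quick, hedged illustration of the intended behavior (the response text below is a made-up example, and stringr must be attached as in the analysis chunks later in this document), the call would return "Positive":
# hedged usage sketch for extract_classification
library(stringr)
extract_classification(
  "<analysis>: The earnings beat should lift the stock.\n<classification>: Positive",
  valid_words = c("Positive", "Negative")
)
#> [1] "Positive"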
# Function to get summary statistics for mean percent agreement
pa_summary = function(df, percent_variable, digits=4){
llm_models = unique(df$chat_model)
df_summary = data.frame(model = llm_models, mPa = rep(NA, length(llm_models)))
for (i in 1:length(llm_models)){
m1 = df |>
  dplyr::filter(chat_model == llm_models[i]) |>
  dplyr::select(dplyr::all_of(percent_variable))
colnames(m1) <- "v1"
df_summary[i,2] = paste0(
round(mean(m1$v1/100), digits), " (",
round(sd(m1$v1/100), digits), ")"
)
}
return(df_summary)
}
# function to compute the other reliability metrics
# should contain one variable called chat_model,
# and vars indicates variables for raters
reliability_coefs = function(df, vars){
no_models <- n_distinct(df$chat_model)
df_coefs = data.frame(matrix(NA, ncol=9, nrow=5*no_models))
colnames(df_coefs) = c("model", "coeff.name", "pa", "pe", "coeff.val", "coeff.se", "conf.int", "p.value", "w.name")
llm_models = unique(df$chat_model)
df_coefs$model = rep(llm_models, each=5)
if (is.integer(vars)) vars = colnames(df)[vars]
df_rates = df |>
dplyr::select(dplyr::all_of(c("chat_model", vars)))
for (i in seq(1, 5*no_models, by=5)){
m1 = df_rates |>
dplyr::filter(chat_model == llm_models[ceiling(i/5)]) |>
dplyr::select(-chat_model)
df_coefs[i,-1] = irrCAC::conger.kappa.raw(m1)$est
df_coefs[(i+1),-1] = irrCAC::fleiss.kappa.raw(m1)$est
df_coefs[(i+2),-1] = irrCAC::gwet.ac1.raw(m1)$est
df_coefs[(i+3),-1] = irrCAC::bp.coeff.raw(m1)$est
df_coefs[(i+4),-1] = irrCAC::krippen.alpha.raw(m1)$est
}
return(df_coefs)
}
In the chunk below, we present two functions:

- calculate_agreement: calculates the agreement between the replicates and the benchmark. It takes the replicates and the benchmark as input and returns the agreement percentage. The optional parameter na controls how missing values are handled; with the default na = 0, missing replicates are not counted as matches but still count toward the denominator.
- ensemble_reps: ensembles the replicates by selecting the most common value. It takes the replicates as input and returns the most common value. The optional parameter na controls how missing values are handled; with the default na = 0, missing values are excluded from the frequency table. In case of ties, the function randomly selects one of the most common values.
calculate_agreement = function(reps, ground_truth, na = 0) {
na_rm = dplyr::if_else(na == 0, T, F, missing = F)
matches = sum(reps == ground_truth, na.rm = na_rm)
total_reps = length(reps)
agreement = matches / total_reps
return(agreement)
}
# ensemble function to ensemble the replicates
ensemble_reps = function(reps, na = 0){
# handling of NAs
na_rm = dplyr::if_else(na == 0, "no", "ifany", missing = "ifany")
# get the most common value
freq_table = table(reps, useNA = na_rm)
max_freq = max(freq_table)
most_common_values = names(freq_table)[freq_table == max_freq]
# randomly select one of the most common values
most_common = sample(most_common_values, 1)
return(most_common)
}
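A small, hedged example of how these two helpers behave on five hypothetical replicates with one invalid (NA) response:
# hypothetical replicate labels for a single article
reps <- c("Positive", "Positive", "Negative", NA, "Positive")
calculate_agreement(reps, ground_truth = "Positive")  # 3/5 = 0.6 (the NA counts against agreement)
ensemble_reps(reps)                                   # "Positive" (the most common label)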
We use the stocknewsapi to extract articles related to the stock market. We crafted our request to get all tickers from January 05, 2025 to March 13, 2025. The data was pulled on March 14, 2025 at approximately 12:30 EDT, and the resulting data frame was saved as an RDS file. To run the code below, you need to set the stock_token environment variable to your API token. In our case, we saved the token as an environment variable using the usethis::edit_r_environ(scope = 'project') function.
# pull the stock_api_token from the environment variable
stock_api_token = Sys.getenv("stock_token")
# crafting the request for the API
request = paste0(
"https://stocknewsapi.com/api/v1/category?", # The base URL for the API
"section=alltickers", # We are interested in all tickers
"&sentiment=positive,negative", # We want non neutral sentiments
"&type=article", # We are interested in articles
"&items=100", # We want 100 items per page
"&date=01052025-03132025", # The date range
"&exchange=NYSE,NASDAQ", # We are interested in the NYSE and NASDAQ
"&country=USA", # We are interested in the USA
"&token=", stock_api_token, # The API token (ours is saved as an ENV Variable)
"&page=") # the pages that we will iterate over
# crafting all requests
all_requests = paste0(request, 1:100)
# pulling the data from the API and cleaning the data prior to saving it
stock_news_df = purrr::map_df(all_requests, ~jsonlite::fromJSON(.x)$data) |>
# convert the topics and tickers cols into chr (they are lists of strings)
dplyr::mutate(tickers = purrr::map_chr(tickers, ~ paste(.x, collapse = ", "))) |>
dplyr::mutate(topics = purrr::map_chr(topics, ~ paste(.x, collapse = ", "))) |>
# convert the date column from character to datetime
dplyr::mutate(date = lubridate::dmy_hms(date, tz = "America/New_York"))
# writing the data frame as RDS and CSV files
readr::write_rds(stock_news_df, "../data/stock_news_data.rds")
readr::write_csv(stock_news_df, "../data/stock_news_data.csv")
The CSV file containing the stock news data can be accessed here. The stock_news_df data frame contains 10,000 rows and 10 columns: news_url, image_url, title, text, source_name, date, topics, sentiment, type, and tickers. The sentiment of each article is stored in the sentiment column and is divided into two categories, positive and negative, with the following distribution:

| Negative | Positive |
|---|---|
| 2912 | 7088 |
In this subsection, we filtered out rows with multiple tickers in the tickers column, keeping only rows with a single ticker (to ensure data quality). Additionally, we ensured the uniqueness of tickers during downsampling by selecting distinct tickers.
set.seed(2025) # set the seed for reproducibility
binary_data = stock_news_df |>
dplyr::filter(!stringr::str_detect(tickers, ",")) |> # keep rows without commas (i.e., single tickers)
dplyr::group_by(sentiment) |> # group by sentiment
dplyr::distinct(tickers, .keep_all = TRUE) |> # keep unique tickers
# randomly select <=675 samples per sentiment class (will select all if n < 675)
dplyr::slice_sample(n = 675) |>
dplyr::ungroup() # ungroup the data frame
# save the downsampled dataset as RDS and CSV files
readr::write_rds(binary_data, "../data/binary_classification_data.rds")
readr::write_csv(binary_data, "../data/binary_classification_data.csv")
The resulting binary classification dataset contains 1,350 samples, which exceeds the 1,317 samples required by our sample size calculation. The distribution of the sentiment classes is as follows:

| Negative | Positive |
|---|---|
| 675 | 675 |
For our labeling task, we defined a system prompt that combines chain-of-thought prompting with few-shot learning. The system prompt instructs the LLMs to categorize the impact of a news article on a stock's next-day return as either "Positive" or "Negative". It also includes simulated examples (involving Bitcoin, which is not in our stock news dataset) to guide the LLMs in their classification task. We then defined a user prompt that presents the news article's title, full text, and stock ticker symbol to the LLMs (these are obtained from the CSV file). Both prompts are combined into a chat prompt template that is used to interact with the LLMs.
# define the system prompt for the categorization task
bin_system_prompt = """
Task: You will be provided with a news article's title, full text, and a stock ticker symbol. Categorize the impact of the news article on a stock's next-day return as either "Positive" or "Negative".
Instructions:
1. Read the title and full text of the article.
2. Analyze how the information might affect the company associated with the given ticker.
3. Identify key factors such as financial performance, market trends, announcements, and industry developments that may influence investor sentiment.
4. Assess the overall tone and its potential impact on the stock price.
Classification Guidelines:
- "Positive": News likely to increase the stock price.
- "Negative": News likely to decrease the stock price.
- Focus on the immediate impact (next day return).
- Weigh the importance of positive vs. negative factors if the article is mixed.
Output Format:
- <analysis>: [Detailed analysis of the article’s impact]
- <classification>: [Final classification: "Positive" or "Negative"]
Note: Your classification must strictly be "Positive" or "Negative" based on the immediate expected impact.
Examples:
Example 1:
Title: Bitcoin Surges as Major Financial Institution Announces BTC Adoption
Text: In a groundbreaking move, a major financial institution announced that it would start offering Bitcoin as part of its investment portfolios. This decision is expected to significantly increase institutional demand for BTC, boosting investor confidence.
Ticker: BTC
<analysis>: The article highlights a major financial institution adopting Bitcoin, which is likely to enhance institutional investment and demand. This news positively affects investor sentiment and suggests an immediate positive impact on BTC's price.
<classification>: Positive
Example 2:
Title: Bitcoin Faces Increased Regulatory Scrutiny Amid Fraud Concerns
Text: Reports have emerged that several governments are planning to implement stricter regulations on cryptocurrency trading, citing concerns about fraud and market manipulation. The regulatory discussions have sparked debate among investors regarding the future of Bitcoin in heavily regulated markets.
Ticker: BTC
<analysis>: The article discusses potential regulatory actions that could negatively influence market sentiment by raising fears of restricted trading and heightened scrutiny. This news suggests a likely negative impact on BTC's price in the immediate term.
<classification>: Negative
"""
# define the user input string template
bin_user_prompt = """
Here is the news article:
<title>
{title}
</title>
<text>
{text}
</text>
The stock ticker symbol you need to consider is: {tickers}
"""
# create the chat prompt template
bin_chat_prompt = ChatPromptTemplate.from_messages([
("system", bin_system_prompt),
("human", bin_user_prompt),
])
In this section, we will use the LLMs to label the sentiment of the news articles in our binary classification data set. We will use our generalized_chat_completion function to interact with the LLMs and generate chat completions for each news article. The chat completions will include the LLM's analysis and classification of the news article's impact on the stock's next-day return. We will run the labeling process for each LLM model and replicate it five times to assess the consistency of the labeling results.
res = generalized_chat_completion(
csv_path = '../data/binary_classification_data.csv',
columns_to_keep = ['date', 'title', 'text', 'tickers'],
models = models,
chat_prompt_template = bin_chat_prompt,
columns_for_chat_prompt = ['title', 'text', 'tickers'],
num_replicates = 5,
output_file = '../results/binary_classification_results.csv'
)
The labeling process has been completed for the binary classification data set. The results have been saved to a CSV file, which can be accessed here.
We present a detailed summary of inference latency performance across models below.
# * Summary Table ---------------------------------------------------------
results = readr::read_csv("../results/binary_classification_results.csv")
results |>
dplyr::filter(chat_model != 'exaone-deep:2.4b') |>
dplyr::mutate(
chat_model = forcats::fct_relevel(
chat_model,
"claude-3-7-sonnet-20250219",
"claude-3-5-haiku-20241022",
"gpt-4o-2024-11-20",
"gpt-4o-mini-2024-07-18",
"command-r-plus-08-2024",
"command-r7b",
"deepseek-r1:7B",
"deepseek-r1:1.5B",
"gemma3:27B",
"gemma3:1B",
"llama3.2:3B",
"llama3.2:1B",
'phi4:latest',
'phi4-mini'
)
) |>
dplyr::group_by(chat_model) |>
dplyr::mutate(
time_diff = chat_date - dplyr::lag(chat_date)
) |> dplyr::select(chat_model, chat_date, time_diff) |>
dplyr::summarise(
n = dplyr::n(),
min_t = min(time_diff, na.rm = TRUE),
q25_t = quantile(time_diff, 0.25, na.rm = TRUE),
mean_t = mean(time_diff, na.rm = TRUE),
med_t = median(time_diff, na.rm = TRUE),
q75_t = quantile(time_diff, 0.75, na.rm = TRUE),
p95_t = quantile(time_diff, 0.95, na.rm = TRUE),
max_t = max(time_diff, na.rm = TRUE),
) |>
dplyr::mutate(
color = c(rep(c('exp', 'cheap'), 7)),
# company order follows the releveled factor levels used in the summarise above
company = c('Anthropic', 'Anthropic', 'OpenAI', 'OpenAI', 'Cohere', 'Cohere', 'DeepSeek', 'DeepSeek',
'Google', 'Google', 'Meta', 'Meta', 'Microsoft', 'Microsoft')
) -> summary_table
# * Plotting the box plot -------------------------------------------------
# create "fake" spacer levels for spacing the boxplot:
model_levels = c( "spacer-0",
"claude-3-7-sonnet", "claude-3-5-haiku", "spacer-2",
"gpt-4o", "gpt-4o-mini", "spacer-10a", "spacer-10b",
"command-r-plus", "command-r7b", "spacer-4",
"deepseek-r1:7B", "deepseek-r1:1.5B", "spacer-6",
"gemma3:27B", "gemma3:1B", "spacer-8",
"llama3.2:3B", "llama3.2:1B", "spacer-12",
"phi4:latest", "phi4-mini", "spacer-14"
)
# boxplot data:
results |>
dplyr::left_join(summary_table, by = "chat_model") |>
dplyr::mutate(
time_diff = (chat_date - dplyr::lag(chat_date)) |>
stringr::str_remove_all(" secs") |> as.numeric(),
chat_model = forcats::fct_recode(
chat_model,
`claude-3-7-sonnet` = "claude-3-7-sonnet-20250219",
`claude-3-5-haiku` = "claude-3-5-haiku-20241022",
`command-r-plus` = "command-r-plus-08-2024",
`gpt-4o` = "gpt-4o-2024-11-20",
`gpt-4o-mini` = "gpt-4o-mini-2024-07-18"
) |>
factor(levels = model_levels)
) |>
na.omit() -> boxplot_df
# annotation df:
label_df = boxplot_df |>
dplyr::group_by(chat_model) |>
dplyr::summarise(
min = round(quantile(time_diff, 0)),
med = round(quantile(time_diff, 0.5)),
max = round(quantile(time_diff, 1)),
color = dplyr::first(color),
.groups = "drop"
) |>
tidyr::pivot_longer(
cols = c(min, med, max),
names_to = "stat",
values_to = "label"
) |>
dplyr::arrange(factor(chat_model, levels = model_levels)) |>
dplyr::mutate(
group_index = rep(seq_len(length(unique(chat_model))), each = 3),
base_nudge = ifelse(group_index %% 2 == 1, -1, 1),
# Apply tighter nudge for min/max, looser for median
y_nudge = dplyr::case_when(
stat == "med" ~ base_nudge * 0.725,
stat %in% c("min", "max") ~ base_nudge * 0.5
)
) |>
dplyr::filter( !(chat_model == "llama3.2:1B" & stat == 'min') )
# data for the rectangle boxes:
proprietary_models = c("claude-3-7-sonnet", "claude-3-5-haiku", "gpt-4o", "gpt-4o-mini")
y_indices = which(model_levels %in% proprietary_models)
y_min = min(y_indices) - 0.5
y_max = max(y_indices) + 0.5
open_api = c("command-r-plus")
y_indices2 = which(model_levels %in% open_api)
y_min2 = min(y_indices2) - 0.5
y_max2 = max(y_indices2) + 0.5
ollama_models = c(
'command-r7b', 'deepseek-r1:7B', 'deepseek-r1:1.5B', 'gemma3:27B', 'gemma3:1B',
'llama3.2:3B', 'llama3.2:1B', 'phi4:latest', 'phi4-mini'
)
y_indices3 = which(model_levels %in% ollama_models)
y_min3 = min(y_indices3) - 0.5
y_max3 = max(y_indices3) + 0.5
# colors for ylabels
color_lookup = boxplot_df |>
dplyr::distinct(chat_model, color) |>
dplyr::filter(!grepl("^spacer", chat_model)) |>
dplyr::mutate(
label_colored = dplyr::case_when(
color == "exp" ~ paste0("<span style='color:#C3142D'>", chat_model, "</span>"),
color == "cheap" ~ paste0("<span style='color:#808080'>", chat_model, "</span>"),
TRUE ~ chat_model
)
)
model_labels = setNames(color_lookup$label_colored, color_lookup$chat_model)
# Add empty labels for spacers
spacer_labels = rep("", sum(grepl("^spacer", model_levels)))
names(spacer_labels) = model_levels[grepl("^spacer", model_levels)]
# Merge
full_labels = c(model_labels, spacer_labels)
full_labels = full_labels[model_levels] # Reorder to match axis
# plot:
boxplot_df |>
dplyr::mutate(chat_model = factor(chat_model, levels = model_levels)) |>
ggplot2::ggplot(ggplot2::aes(y = chat_model, x = time_diff)) +
ggplot2::geom_boxplot(ggplot2::aes(color = color), size =0.5) +
# adding blanks between pairs of models
ggplot2::geom_blank(data = data.frame(chat_model = model_levels, time_diff = NA)) +
# use ggplot2::stat_summary to annotate the max for each model
ggplot2::geom_text(
data = label_df,
ggplot2::aes(x = label, y = chat_model, label = label, color = color),
position = ggplot2::position_nudge(y = label_df$y_nudge),
size = 2.75,
) +
ggplot2::scale_x_log10() +
ggplot2::scale_y_discrete(
limits = model_levels,
labels = full_labels
) +
ggplot2::scale_color_manual(values = c("exp" = "#C3142D", "cheap" = "#808080")) +
ggplot2::annotate("rect",
xmin = 0.9, xmax = 1200,
ymin = y_min-0.8, ymax = y_max+0.6,
alpha = 0.02,
fill = NA,
color = "black",
linetype = 3
) +
ggplot2::annotate("text",
x = 310,
y = y_max + .35, # just inside the box
label = "Proprietary LLMs\n (via API)",
hjust = 0.5,
vjust = 1,
fontface = "bold",
size = 3
) +
ggplot2::annotate("rect",
xmin = 0.9, xmax = 1200,
ymin = y_min3-1.6, ymax = y_max3+0.6,
alpha = 0.02,
color = "black",
fill = NA,
linetype = 2
) +
ggplot2::annotate("text",
x = 120,
y = y_max3+0.35 , # just inside the box
label = "Open Weight LLMs via Ollama\n (command-r-plus via API)",
hjust = 0.5,
vjust = 1,
fontface = "bold",
size = 3
) +
ggplot2::labs(
title = "Task Completion Time by Model",
subtitle = "<span style='color:#808080'>Cheaper</span> vs. <span style='color:#C3142D'>more expensive (time, cost) </span> LLMs by company",
y = "",
x = "Task Completion Time in Seconds (Log Scale)",
caption = 'Annotated with min, median, and max times based on n=6,750 annotations per model.'
) +
ggplot2::theme_classic() +
ggplot2::theme(
plot.title = ggtext::element_markdown(hjust = 0.5, face = "bold", size = 11),
plot.subtitle = ggtext::element_markdown(hjust = 1, face = "bold", size = 10),
axis.title.x = ggtext::element_markdown(hjust = 0.5, face = "bold", size = 9),
axis.title.y = ggtext::element_markdown(hjust = 0.5, face = "bold", size = 9),
axis.text.x = ggtext::element_markdown(hjust = 0.5, size = 9, face = 'bold'),
axis.text.y = ggtext::element_markdown(hjust = 1, size = 9, face = 'bold'),
panel.grid.major = ggplot2::element_blank(),
panel.grid.minor = ggplot2::element_blank(),
legend.position = "none",
plot.caption = ggtext::element_markdown(hjust = 0.95, size = 8),
axis.ticks.y = ggplot2::element_blank()
)
In this section, we will extract the LLM labels from the binary classification results and create a data frame for further analysis. We will parse the label from each chat response, pivot the data to wide format with one column per replicate, and calculate the frequency of the most common label across replicates to assess the consistency of each LLM.
pacman::p_load(irrCAC, stringr, tidyverse)
# helper function to calculate the frequency of the most common value
mode_frequency = function(x, drop_na=F) {
# return 0 if all values are NA
if (all(is.na(x))) {
return(0)
}
# calculate the frequency table
freq_table = table(x)
if (drop_na==F){
# return the frequency of the most common value
return(max(freq_table) / length(x))
}else{
return(max(freq_table)/sum(freq_table))
}
}
# read the binary classification results from the CSV file
binary_classification_results = readr::read_csv("../results/binary_classification_results.csv")
binary_classification_results = binary_classification_results |>
filter(!(chat_model %in% c("exaone-deep:2.4b")))
# allowable words for the classification
valid_words <- c("Negative", "Positive", "Neutral", "Mixed", "Inconclusive", "Insufficient", "Impossible", "Ambiguous", "Indeterminate")
# extract the LLM labels from the chat responses
binary_classification_results = binary_classification_results |>
mutate(llm_label = sapply(chat_response, extract_classification, valid_words = valid_words),
llm_label = unname(llm_label),
llm_label = str_to_title(llm_label),
llm_label = ifelse(llm_label %in% c("Neutral", "Mixed", "Inconclusive", "Insufficient", "Impossible", "Ambiguous", "Indeterminate"), NA, llm_label),
llm_label = as.factor(llm_label)
) |>
# select relevant columns
dplyr::select(date, title, text, tickers, chat_model, chat_replicate, llm_label)
# print the frequency of llm_label
print(table(binary_classification_results$llm_label, useNA = "always"))
# pivot the data to wide format for easier comparison (and append the replicate number in name)
binary_analysis_df <- binary_classification_results |>
# pivot the data to wide format for easier comparison (and append the replicate number in name)
tidyr::pivot_wider(names_from = chat_replicate, values_from = llm_label, names_prefix = "rep_") |>
# compute the frequency of the most frequent value across replicates
dplyr::rowwise() |>
dplyr::mutate(
percent_agreement = mode_frequency(
dplyr::c_across(tidyselect::starts_with("rep_"))
) * 100,
percent_agreement_drop_na = mode_frequency(
dplyr::c_across(tidyselect::starts_with("rep_")), drop_na=T
) * 100
) |>
dplyr::ungroup() # ungroup after rowwise operation
# save the analysis dataframe as RDS and CSV files
readr::write_rds(binary_analysis_df, "../results/binary_analysis_df.rds")
readr::write_csv(binary_analysis_df, "../results/binary_analysis_df.csv")
The analysis dataframe for the binary classification experiment has been created. The dataframe contains the LLM labels for each model and replicate, along with the frequency of the most common label across replicates. The results can be accessed in the binary_analysis_df CSV file.
In this section, we study the distribution of LLM labels for binary classification. In the figure below, the "Invalid" label captures the inconsistency between our prompt (which explicitly instructed the LLM to restrict its classification to either "Positive" or "Negative") and the generated classification, which sometimes included terms such as "neutral" or "unsure".
# library(tidyverse)
# * Getting the Data ------------------------------------------------------
miamired = '#C3142D'
binary_classification_results =
readr::read_csv("../results/binary_analysis_df.csv") |>
dplyr::select(
date, title, tickers:rep_5
) |>
tidyr::pivot_longer(
cols = rep_1:rep_5,
names_to = "rep",
values_to = "llm_label"
) |>
dplyr::filter(chat_model != 'exaone-deep:2.4b')
# * Stuff Needed for Coloring ---------------------------------------------
# * Chat Models -----------------------------------------------------------
# Assign alternating colors to chat_model labels
chat_models = c(
"phi4-mini",
"phi4:latest",
"llama3.2:1B",
"llama3.2:3B",
"gemma3:1B",
"gemma3:27B",
"deepseek-r1:1.5B",
"deepseek-r1:7B",
"command-r7b",
"command-r-plus-08-2024",
"gpt-4o-mini-2024-07-18",
"gpt-4o-2024-11-20",
"claude-3-5-haiku-20241022",
"claude-3-7-sonnet-20250219"
)
model_abbrev = c(
"phi4-mini" = "phi4-mini",
"phi4:latest" = "phi4:latest",
"llama3.2:1B" = "llama3.2:1B",
"llama3.2:3B" = "llama3.2:3B",
"gemma3:1B" = "gemma3:1B",
"gemma3:27B" = "gemma3:27B",
"deepseek-r1:1.5B" = "deepseek-r1:1.5B",
"deepseek-r1:7B" = "deepseek-r1:7B",
"command-r7b" = "command-r7b",
"command-r-plus-08-2024" = "command-r-plus",
"gpt-4o-mini-2024-07-18" = "gpt-4o-mini",
"gpt-4o-2024-11-20" = "gpt-4o",
"claude-3-5-haiku-20241022" = "claude-3-5-haiku",
"claude-3-7-sonnet-20250219"= "claude-3-7-sonnet"
)
# Apply alternating color span tags
label_colors = c("Negative" = "#1F78B4", "Invalid" = "#B0B0B0", "Positive" = "#A6CEE3")
# * Plotting ---------------------------------------------------------------
plots = purrr::imap(chat_models, function(model, idx) {
color = if (idx %% 2 == 1) "#808080" else miamired
model_short = model_abbrev[[model]]
binary_classification_results |>
dplyr::filter(chat_model == model) |>
dplyr::mutate(
llm_label = forcats::fct_expand(llm_label, "Invalid"),
llm_label = forcats::fct_na_value_to_level(llm_label, "Invalid"),
llm_label = forcats::fct_relevel(llm_label, "Positive", "Invalid", "Negative")
) |>
dplyr::group_by(llm_label) |>
dplyr::summarize(count = dplyr::n()) |>
dplyr::mutate(percent = count / sum(count) * 100) |>
ggplot2::ggplot( ggplot2::aes(x = llm_label, y = percent, fill = llm_label)) +
ggplot2::geom_bar(stat = "identity", position = "dodge") +
ggplot2::geom_text( ggplot2::aes(label = sprintf("%.2f%%", percent)),
vjust = -0.5, size = 3, fontface = "bold") +
ggplot2::scale_fill_manual(values = label_colors) +
ggplot2::scale_y_continuous(
limits = c(0, 110),
breaks = seq(0, 100, 50),
labels = scales::label_percent(scale = 1)
) +
ggplot2::ggtitle(model_short) +
ggplot2::theme_classic() +
ggplot2::theme(
plot.title = ggplot2::element_text(color = color, hjust = 0.5, face = "bold", size = 8),
axis.title.x = ggplot2::element_blank(),
axis.title.y = ggplot2::element_blank(),
axis.text = ggplot2::element_text(size = 8, face = 'bold'),
legend.position = "none"
)
})
# Combine plots using patchwork
patchwork::wrap_plots(plots, ncol = 2, axis_titles ='collect') +
patchwork::plot_annotation(
title = "Distribution of LLM Labels for Binary Classification",
subtitle = "<span style='color:#808080'>Cheaper</span> vs. <span style='color:#C3142D'>more expensive (time, cost) </span> LLMs by company",
theme = ggplot2::theme(
plot.title = ggtext::element_markdown(hjust = 0.5, face = "bold", size = 11),
plot.subtitle = ggtext::element_markdown(hjust = 0.5, face = "bold", size = 10),
axis.title.x = ggtext::element_markdown(hjust = 0.5, face = "bold", size = 9),
axis.title.y = ggtext::element_markdown(hjust = 0.5, face = "bold", size = 9),
axis.text.x = ggtext::element_markdown(hjust = 0.5, size = 7),
axis.text.y = ggtext::element_markdown(hjust = 1, size = 7),
strip.text = ggtext::element_markdown(face = "bold", size = 8),
panel.grid.major = ggplot2::element_blank(),
panel.grid.minor = ggplot2::element_blank(),
legend.position = "none",
)
)
In this section, we conduct the reliability assessment.
The figure below shows the NA-penalized agreement distributions for each model, highlighting the percentage of articles where all five replicates generated identical classifications. The NA penalty reflects how we count agreement in the presence of invalid responses: NAs are ignored in the numerator but kept in the denominator.
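As a brief, hedged illustration of the distinction, using the mode_frequency helper defined earlier on a made-up set of replicates:
# five replicates where two responses were invalid (NA)
x <- c("Positive", "Positive", NA, NA, "Positive")
mode_frequency(x)                   # 3/5 = 0.6  (NA-penalized agreement)
mode_frequency(x, drop_na = TRUE)   # 3/3 = 1.0  (NAs ignored entirely)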
# library(tidyverse)
miamired = '#C3142D'
chat_model_colors = rep(c('#808080', '#C3142D'), 7)
mode_frequency = function(x, drop_na=F) {
# return 0 if all values are NA
if (all(is.na(x))) {
return(0)
}
# calculate the frequency table
freq_table = table(x)
if (drop_na==F){
# return the frequency of the most common value
return(max(freq_table) / length(x))
}else{
return(max(freq_table)/sum(freq_table))
}
}
# * Getting the Data ------------------------------------------------------
binary_classification_results =
readr::read_csv("../results/binary_analysis_df.csv") |>
dplyr::select(
date, title, tickers:rep_5
) |>
tidyr::pivot_longer(
cols = rep_1:rep_5,
names_to = "rep",
values_to = "llm_label"
) |>
dplyr::filter(chat_model != 'exaone-deep:2.4b')
# * Stuff Needed for Coloring ---------------------------------------------
# * Chat Models -----------------------------------------------------------
# Assign alternating colors to chat_model labels
chat_models = c(
"phi4-mini",
"phi4:latest",
"llama3.2:1B",
"llama3.2:3B",
"gemma3:1B",
"gemma3:27B",
"deepseek-r1:1.5B",
"deepseek-r1:7B",
"command-r7b",
"command-r-plus-08-2024",
"gpt-4o-mini-2024-07-18",
"gpt-4o-2024-11-20",
"claude-3-5-haiku-20241022",
"claude-3-7-sonnet-20250219"
)
model_abbrev = c(
"phi4-mini" = "phi4-mini",
"phi4:latest" = "phi4:latest",
"llama3.2:1B" = "llama3.2:1B",
"llama3.2:3B" = "llama3.2:3B",
"gemma3:1B" = "gemma3:1B",
"gemma3:27B" = "gemma3:27B",
"deepseek-r1:1.5B" = "deepseek-r1:1.5B",
"deepseek-r1:7B" = "deepseek-r1:7B",
"command-r7b" = "command-r7b",
"command-r-plus-08-2024" = "command-r-plus",
"gpt-4o-mini-2024-07-18" = "gpt-4o-mini",
"gpt-4o-2024-11-20" = "gpt-4o",
"claude-3-5-haiku-20241022" = "claude-3-5-haiku",
"claude-3-7-sonnet-20250219"= "claude-3-7-sonnet"
)
# * Per Task and Model Labels ---------------------------------------------
binary_classification_results |>
tidyr::pivot_wider(
names_from = rep,
values_from = llm_label
) |>
dplyr::select(-date) -> binary_classification_results_wide
binary_analysis_df =
binary_classification_results_wide |>
dplyr::rowwise() |>
dplyr::mutate(
percent_agreement = 100* mode_frequency(dplyr::c_across(starts_with("rep_")))
) |>
dplyr::ungroup()
agreement =
binary_analysis_df |>
dplyr::group_by(chat_model, percent_agreement) |>
dplyr::summarize(count = dplyr::n()) |>
dplyr::ungroup() |>
dplyr::group_by(chat_model) |>
dplyr::mutate(
percent = 100 * (count / sum(count)),
chat_model = model_abbrev[chat_model] |> as.factor(),
chat_model = forcats::fct_relevel(
chat_model,
"phi4-mini", "phi4:latest", "llama3.2:1B", "llama3.2:3B",
"gemma3:1B", "gemma3:27B", "deepseek-r1:1.5B", "deepseek-r1:7B",
"command-r7b", "command-r-plus", "gpt-4o-mini", "gpt-4o",
"claude-3-5-haiku", "claude-3-7-sonnet"
)
) |>
dplyr::ungroup()
# * Plot List -------------------------------------------------------------
chat_model_colors = setNames(chat_model_colors, unique(agreement$chat_model) )
# Split data by chat_model and get the names for titles
chat_model_groups = agreement |>
dplyr::group_by(chat_model) |>
dplyr::group_split()
chat_model_names = agreement |>
dplyr::group_by(chat_model) |>
dplyr::group_keys() |>
dplyr::pull(chat_model)
# Create a list of plots with the chat_model as the title
plot_list = purrr::map2(chat_model_groups, chat_model_names, ~{
title_color = chat_model_colors[[.y]]
title_html = glue::glue("<span style='color:{title_color}'>{.y}</span>")
ggplot2::ggplot(.x, ggplot2::aes(x = percent_agreement, y = percent, fill = chat_model)) +
ggplot2::geom_bar(stat = 'identity', color = 'black') +
ggplot2::scale_fill_manual(values = chat_model_colors) +
ggplot2::labs(
title = title_html,
x = "Percent Agreement within a LLM (NA Penalty)",
#y = 'Percentage of Articles',
y = '',
fill = "LLM Model"
) +
ggplot2::scale_x_continuous(breaks = seq(0, 100, 20)) +
ggplot2::scale_y_continuous(
labels = scales::percent_format(scale = 1),
limits = c(0, 160),
breaks = seq(0, 100, 50)
) +
ggplot2::geom_text(
ggplot2::aes(label = sprintf("%.1f%%", percent)),
vjust = -0.5,
size = 2.5,
fontface = "bold"
) +
ggplot2::theme_classic() +
ggplot2::theme(
plot.title = ggtext::element_markdown(hjust = 0.5, face = "bold", size = 9),
axis.title.x = ggtext::element_markdown(hjust = 0.5, face = "bold", size = 9),
axis.title.y = ggtext::element_markdown(hjust = 0.5, face = "bold", size = 9),
axis.text.x = ggtext::element_markdown(hjust = 0.5, size = 7, face = "bold"),
axis.text.y = ggtext::element_markdown(hjust = 1, size = 7, face = "bold"),
strip.text = ggtext::element_markdown(face = "bold", size = 8),
panel.grid.major = ggplot2::element_blank(),
panel.grid.minor = ggplot2::element_blank(),
legend.position = "none"
)
})
patchwork::wrap_plots(plot_list, ncol = 2, axis_titles ='collect', axes = 'keep') +
patchwork::plot_annotation(
title = "Distribution of NA-Penalized Agreement Across Articles by LLM",
subtitle = "<span style='color:#808080'>Cheaper</span> vs. <span style='color:#C3142D'>more expensive (time, cost) </span> LLMs by company",
theme = ggplot2::theme(
plot.title = ggtext::element_markdown(hjust = 0.5, face = "bold", size = 9),
plot.subtitle = ggtext::element_markdown(hjust = 0.5, face = "bold", size = 9),
axis.title.x = ggtext::element_markdown(hjust = 0.5, face = "bold", size = 9),
axis.title.y = ggtext::element_markdown(hjust = 0.5, face = "bold", size = 9),
axis.text.x = ggtext::element_markdown(hjust = 0.5, size = 7),
axis.text.y = ggtext::element_markdown(hjust = 1, size = 7),
strip.text = ggtext::element_markdown(face = "bold", size = 8),
panel.grid.major = ggplot2::element_blank(),
panel.grid.minor = ggplot2::element_blank(),
legend.position = "none",
)
) -> combined_plot
cowplot::ggdraw() +
cowplot::draw_label(
"Percentage of Articles",
x = 0.02, y = 0.5, angle = 90, vjust = 0.5,
fontface = "bold", size = 9
) +
cowplot::draw_plot(combined_plot, x = 0.05, y = 0, width = 0.95, height = 1)
pacman::p_load(irrCAC, stringr, tidyverse)
binary_analysis_df <- readr::read_csv("../results/binary_analysis_df.csv")
# binary classification reliability metrics
df = reliability_coefs(binary_analysis_df, 6:10)
df = df |>
mutate(across(3:6, ~round(.x,3)))
cat("The reliability metrics for the binary classification experiment are as follows:\n")
The reliability metrics for the binary classification experiment are as follows:
DT::datatable(df, rownames = FALSE, options = list(pageLength = 10, scrollX = TRUE))
cat("The mean percent agreement with NA penalty for each model is as follows:\n")
The mean percent agreement with NA penalty for each model is as follows:
pa_table = pa_summary(binary_analysis_df, "percent_agreement", digits = 3) |>
tidyr::separate(
mPa, into = c("mean_percent_agreement", "sd_percent_agreement"),
sep = " \\("
) |>
dplyr::mutate(
mean_percent_agreement = as.numeric(mean_percent_agreement),
sd_percent_agreement =
as.numeric(stringr::str_remove(sd_percent_agreement, "\\)"))
)
DT::datatable(pa_table, rownames = FALSE,
options = list(pageLength = 14, scrollX = TRUE)
)
cat(
paste(
'From the table above, we can see that the standard deviation of the percent agreement is quite low for',
paste(pa_table$model[which(pa_table$sd_percent_agreement < 0.1)], collapse = ', '),
'indicating a high level of agreement among those LLMs in the binary classification experiment. However, the other models have higher standard deviations, especially the',
paste(pa_table$model[which(pa_table$sd_percent_agreement > 0.15)], collapse = ', '),
'models. \n')
)
From the table above, we can see that the standard deviation of the percent agreement is quite low for claude-3-7-sonnet-20250219, command-r-plus-08-2024, gemma3:27B, gpt-4o-2024-11-20, gpt-4o-mini-2024-07-18, llama3.2:1B, phi4:latest indicating a high level of agreement among those LLMs in the binary classification experiment. However, the other models have higher standard deviations, especially the deepseek-r1:1.5B, deepseek-r1:7B, llama3.2:3B, phi4-mini models.
pa_table_drop_na = pa_summary(binary_analysis_df, "percent_agreement_drop_na", digits = 3) |>
tidyr::separate(
mPa, into = c("mean_percent_agreement_drop_na", "sd_percent_agreement"),
sep = " \\("
) |>
dplyr::mutate(
mean_percent_agreement_drop_na = as.numeric(mean_percent_agreement_drop_na),
sd_percent_agreement =
as.numeric(stringr::str_remove(sd_percent_agreement, "\\)"))
)
df_wide = df |>
dplyr::select(model, coeff.name, coeff.val) |>
tidyr::pivot_wider(names_from = coeff.name, values_from = coeff.val)
df_wide = dplyr::left_join(
x = pa_table |> dplyr::select(-sd_percent_agreement), y = df_wide,
by = "model"
)
df_wide = dplyr::left_join(
x = pa_table_drop_na |> dplyr::select(-sd_percent_agreement), y = df_wide,
by = "model"
) |>
dplyr::rename(
`Percent Agreement (with NA Penalty)` = mean_percent_agreement,
`Percent Agreement (Ignoring NA)` = mean_percent_agreement_drop_na
) |>
dplyr::select(
model,
`Percent Agreement (with NA Penalty)`, `Percent Agreement (Ignoring NA)`,
dplyr::everything()
)
# get the models with high standard deviation
high_pa_sd_models_index = which(pa_table$sd_percent_agreement > 0.15)
# a summarized table of the metrics
df_wide = df_wide |>
# rounding numeric columns to 3 decimal places
dplyr::mutate(
dplyr::across(dplyr::where(is.numeric), ~ round(., 3)),
high_pa_sd = ifelse(
model %in% pa_table$model[pa_table$sd_percent_agreement > 0.15],
TRUE, FALSE
),
`API Costs / 1M Input Tokens` = c(0.8, 3, 2.5, 0, 0, 0, 0, 0, 2.5, 0.15, 0, 0, 0, 0),
`API Costs / 1M Output Tokens`= c(4, 15, 10, 0, 0 ,0, 0, 0, 10, 0.6, 0, 0, 0, 0)
)
# save the reliability metrics dataframe as RDS and CSV files
readr::write_rds(df_wide, "../results/binary_reliability_metrics_means.rds")
readr::write_csv(df_wide, "../results/binary_reliability_metrics_means.csv")
readr::write_rds(pa_table, "../results/binary_pa_summary.rds")
readr::write_csv(pa_table, "../results/binary_pa_summary.csv")
readr::write_rds(df, "../results/binary_reliability_metrics.rds")
readr::write_csv(df, "../results/binary_reliability_metrics.csv")
The CSV files containing the reliability metrics for the binary classification experiment can be accessed at binary_reliability_metrics_means.csv, binary_pa_summary.csv, and binary_reliability_metrics.csv.
In this section, we study the chance-adjusted reliability coefficient estimates (dots) and their Šidák-adjusted confidence intervals (whiskers), constructed to maintain 90% family-wise confidence within each cost group.
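As a quick sanity check on the adjustment (the object names below are ours, but the computation mirrors the `alpha` used in the plotting chunk that follows): with seven models per cost group, each per-comparison confidence level is raised so that the seven intervals jointly retain 90% coverage.
# Šidák adjustment: solve (1 - alpha)^7 = 0.90 for the per-comparison alpha,
# then use the corresponding normal quantile for the CI half-width
alpha_sidak <- 1 - (1 - 0.10)^(1 / 7)            # ~0.0149 per comparison
z_sidak     <- stats::qnorm(1 - alpha_sidak / 2) # ~2.43, instead of the usual 1.96
c(alpha = alpha_sidak, z = z_sidak)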
# library(tidyverse)
library(grid)
library(gridExtra)
# ---- Setup ----
chat_models <- c(
"phi4-mini",
"llama3.2:1B",
"gemma3:1B",
"deepseek-r1:1.5B",
"command-r7b",
"gpt-4o-mini-2024-07-18",
"claude-3-5-haiku-20241022",
"phi4:latest",
"llama3.2:3B",
"gemma3:27B",
"deepseek-r1:7B",
"command-r-plus-08-2024",
"gpt-4o-2024-11-20",
"claude-3-7-sonnet-20250219"
)
model_abbrev = c(
"phi4-mini" = "phi4-mini",
"llama3.2:1B" = "llama3.2:1B",
"gemma3:1B" = "gemma3:1B",
"deepseek-r1:1.5B" = "deepseek-r1:1.5B",
"command-r7b" = "command-r7b",
"gpt-4o-mini-2024-07-18" = "gpt-4o-mini",
"claude-3-5-haiku-20241022"= "claude-3-5-haiku",
"phi4:latest" = "phi4:latest",
"llama3.2:3B" = "llama3.2:3B",
"gemma3:27B" = "gemma3:27B",
"deepseek-r1:7B" = "deepseek-r1:7B",
"command-r-plus-08-2024" = "command-r-plus",
"gpt-4o-2024-11-20" = "gpt-4o",
"claude-3-7-sonnet-20250219" = "claude-3-7-sonnet"
)
chat_model_colors = rep(c('#808080', '#C3142D'), each = 7)
chat_model_colors = stats::setNames(chat_model_colors, model_abbrev[chat_models])
model_labels <- purrr::map_chr( model_abbrev[chat_models], ~{
if (grepl("^spacer", .x)) return("")
glue::glue("<span style='color:{chat_model_colors[.x]}'><b>{.x}</b></span>")
})
names(model_labels) <- model_abbrev[chat_models]
alpha = (1-(1-.1)^(1/7))
# ---- Load Data ----
intra_df = readr::read_csv("../results/binary_reliability_metrics.csv") |>
dplyr::mutate(
lower_ci = coeff.val - stats::qnorm(1-alpha/2) * coeff.se,
upper_ci = coeff.val + stats::qnorm(1-alpha/2)* coeff.se,
abbrev = factor(model_abbrev[model]),
abbrev = forcats::fct_relevel(abbrev, rev(model_abbrev[chat_models]) )
)
intra_wide = intra_df |>
dplyr::select(abbrev, coeff.name, coeff.val) |>
tidyr::pivot_wider(names_from = coeff.name, values_from = coeff.val) |>
dplyr::arrange(dplyr::desc(abbrev))
colnames(intra_wide) <- c("LLM", "C", "F", 'AC', "BP", "K")
intra_wide = intra_wide |>
dplyr::select(LLM, AC, BP, C, F, K) |>
dplyr::mutate(
dplyr::across(where(is.numeric), ~ ifelse(is.na(.x), "--", sprintf("%.2f", .x)))
)
# *Plot the Data ----------------------------------------------------------
intra_df |>
ggplot2::ggplot(ggplot2::aes(x = abbrev, y = coeff.val)) +
ggplot2::geom_blank() +
ggplot2::geom_point(ggplot2::aes(color = abbrev), size = 1.5, na.rm = TRUE) +
ggplot2::geom_errorbar(
ggplot2::aes(ymin = lower_ci, ymax = upper_ci, color = abbrev),
width = 0.5,
na.rm = TRUE
) +
# Annotate lower CI values
ggplot2::geom_text(
data = intra_df,
ggplot2::aes(
y = lower_ci,
label = sprintf("%.3f", lower_ci),
color = abbrev
),
size = 2.25,
hjust = 1.2,
na.rm = TRUE
) +
# Annotate upper CI values
ggplot2::geom_text(
data = intra_df,
ggplot2::aes(
y = upper_ci,
label = sprintf("%.3f", upper_ci),
color = abbrev
),
size = 2.25,
hjust = -0.2,
na.rm = TRUE
) +
ggplot2::scale_color_manual(values = chat_model_colors, na.translate = FALSE) +
ggplot2::scale_x_discrete(labels = model_labels) +
ggplot2::scale_y_continuous(limits = c(0.785, 1.053), breaks = seq(0.8, 1.0, 0.05)) +
ggplot2::coord_flip() +
ggplot2::facet_wrap(~ coeff.name, ncol = 2, scales = "fixed", axes = 'all') +
ggplot2::labs(
title = "Intra-LLM Reliability",
subtitle = "<span style='color:#808080'>Cheaper</span> vs. <span style='color:#C3142D'>more expensive (time, cost) </span> LLMs by company",
x = NULL,
y = NULL
) +
ggplot2::theme_minimal() +
ggplot2::theme(
plot.title = ggtext::element_markdown(hjust = 0.5, face = "bold", size = 9),
plot.subtitle = ggtext::element_markdown(hjust = 0.5, face = "bold", size = 9),
axis.title.x = ggtext::element_markdown(hjust = 0.5, face = "bold", size = 9),
axis.title.y = ggtext::element_markdown(hjust = 0.5, face = "bold", size = 9),
axis.text.x = ggtext::element_markdown(hjust = 0.5, size = 7, face = 'bold'),
axis.text.y = ggtext::element_markdown(hjust = 1, size = 7, face = 'bold'),
strip.text = ggtext::element_markdown(face = "bold", size = 8),
legend.position = "none",
panel.grid = ggplot2::element_blank(),
axis.line = ggplot2::element_line(color = "black"),
axis.ticks = ggplot2::element_line(color = 'black')
) -> original_plot
# * Adding a table in the empty space at the bottom right -----------------
# Custom theme
booktabs_theme <- ttheme_minimal(
core = list(
fg_params = list(fontface = "plain", cex = 0.6, hjust = 0, x = 0.05)
),
colhead = list(
fg_params = list(fontface = "bold", cex = 0.65, hjust = 0, x = 0.05),
bg_params = list(fill = NA) # No shading, like booktabs
),
padding = unit(c(1.7, 1.7), "mm") # Tighter padding
)
# Base table
table_grob <- gridExtra::tableGrob(intra_wide, rows = NULL, theme = booktabs_theme)
table_grob$heights <- ggplot2::unit(rep(1, nrow(table_grob)), "lines") * 0.5
# Add title row
title_grob <- grid::textGrob(
"Intra-LLM Coefficient Estimates",
gp = grid::gpar(fontface = "bold", fontsize = 9),
x = 0.5, hjust = 0.5
)
# Add title row above the table
table_grob <- gtable::gtable_add_rows(table_grob, heights = unit(1.5, "lines"), pos = 0)
table_grob <- gtable::gtable_add_grob(
table_grob,
grobs = title_grob,
t = 1, l = 1, r = ncol(table_grob)
)
table_width <- sum(table_grob$widths)
table_height <- sum(table_grob$heights)
padding <- ggplot2::unit(.5, "mm")
# Color code the table
core_llm_indices <- which(
table_grob$layout$name == "core-fg" &
table_grob$layout$l == 1 # column 1
)
model_colors <- rep(c("#808080", "#C3142D"), each=7) # matches intra_wide
for (i in seq_along(core_llm_indices)) {
grob_index <- core_llm_indices[i]
original_gp <- table_grob$grobs[[grob_index]]$gp
table_grob$grobs[[grob_index]]$gp <- modifyList(original_gp, gpar(col = model_colors[i]))
}
# Add border around entire table using grobTree
bordered_table <- grid::grobTree(
grid::rectGrob(
width = table_width+ padding,
height = table_height + padding,
gp = grid::gpar(fill = NA, lwd = 0.7, col ='black')
),
table_grob
)
final_plot <- cowplot::ggdraw() +
cowplot::draw_plot(original_plot, 0, 0, 1, 1) + # main plot takes full area
cowplot::draw_grob(bordered_table, x = 0.742, y = 0.16, width = 0.01, height = 0.01)
final_plot
After confirming strong internal consistency within individual models, we examine agreement between different LLMs. Instead of analyzing all possible combinations within each group, we focus on top-N combinations ranked by Krippendorff’s Alpha, with N ranging from 2 to 7.
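To put the savings in perspective, here is a quick illustrative count (not part of the original analysis; the object names are ours): each cost group contains seven models, so exhaustively evaluating every subset of size 2 through 7 would require 120 combinations per group, whereas the top-N approach evaluates only six.
# number of possible model subsets per cost group vs. the top-N subsets we analyze
n_models    <- 7
all_subsets <- sum(choose(n_models, 2:7))  # 21 + 35 + 35 + 21 + 7 + 1 = 120
top_subsets <- length(2:7)                 # Top 2, Top 3, ..., Top 7 = 6
c(all = all_subsets, top_n = top_subsets)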
# ---- Setup ----
chat_models = c(
"Top 2 Cheaper",
"Top 2 Expensive",
"Top 3 Cheaper",
"Top 3 Expensive",
"Top 4 Cheaper",
"Top 4 Expensive",
"Top 5 Cheaper",
"Top 5 Expensive",
"Top 6 Cheaper",
"Top 6 Expensive",
"Top 7 Cheaper",
"Top 7 Expensive"
)
model_abbrev = c(
"Top 2 Cheaper" = "Top 2 Cheaper",
"Top 2 Expensive" = "Top 2 Expensive",
"Top 3 Cheaper" = "Top 3 Cheaper",
"Top 3 Expensive" = "Top 3 Expensive",
"Top 4 Cheaper" = "Top 4 Cheaper",
"Top 4 Expensive" = "Top 4 Expensive",
"Top 5 Cheaper" = "Top 5 Cheaper",
"Top 5 Expensive" = "Top 5 Expensive",
"Top 6 Cheaper" = "Top 6 Cheaper",
"Top 6 Expensive" = "Top 6 Expensive",
"Top 7 Cheaper" = "Top 7 Cheaper",
"Top 7 Expensive" = "Top 7 Expensive"
)
chat_model_abbrev = c(
"phi4-mini" = "phi4-mini",
"phi4:latest" = "phi4:latest",
"llama3.2:1B" = "llama3.2:1B",
"llama3.2:3B" = "llama3.2:3B",
"gemma3:1B" = "gemma3:1B",
"gemma3:27B" = "gemma3:27B",
"deepseek-r1:1.5B" = "deepseek-r1:1.5B",
"deepseek-r1:7B" = "deepseek-r1:7B",
"command-r7b" = "command-r7b",
"command-r-plus-08-2024" = "command-r-plus",
"gpt-4o-mini-2024-07-18" = "gpt-4o-mini",
"gpt-4o-2024-11-20" = "gpt-4o",
"claude-3-5-haiku-20241022"= "claude-3-5-haiku",
"claude-3-7-sonnet-20250219" = "claude-3-7-sonnet"
)
chat_model_colors = rep(c('#808080', '#C3142D'), length.out = length(chat_models))
chat_model_colors = stats::setNames(chat_model_colors, model_abbrev[chat_models])
model_labels <- purrr::map_chr(model_abbrev[chat_models], ~{
if (grepl("^spacer", .x)) return("")
glue::glue("<span style='color:{chat_model_colors[.x]}'><b>{.x}</b></span>")
})
names(model_labels) <- model_abbrev[chat_models]
# ---- Load Data ----
intra_df = read_csv("../results/binary_reliability_metrics.csv") %>%
filter(coeff.name == "Krippendorff's Alpha") %>%
select(model, coeff.val, coeff.se)
intra_df$model <- factor(intra_df$model, levels = c("phi4-mini",
"phi4:latest", "llama3.2:1B", "llama3.2:3B", "gemma3:1B",
"gemma3:27B", "deepseek-r1:1.5B", "deepseek-r1:7B",
"command-r7b", "command-r-plus-08-2024",
"gpt-4o-mini-2024-07-18", "gpt-4o-2024-11-20",
"claude-3-5-haiku-20241022", "claude-3-7-sonnet-20250219")
)
intra_df <- intra_df %>%
arrange(model) %>%
mutate(cost = rep(c("Cheaper", "Expensive"), 7)) %>%
select(cost, everything()) %>%
arrange(cost, desc(coeff.val))
binary_analysis_df <- read_csv("../results/binary_analysis_df.csv")
df_long <- binary_analysis_df %>%
pivot_longer(cols = starts_with("rep"), names_to = "rep", values_to = "value")
inter_temp <- df_long %>%
select(chat_model, value) %>%
group_by(chat_model) %>%
mutate(id = row_number()) %>%
pivot_wider(names_from = chat_model, values_from = value) %>%
select(-id)
find_inter_metrics <- function(k, cost){
models <- as.character(intra_df$model[intra_df$cost==cost])[1:k]
temp <- inter_temp[models]
temp$chat_model = paste("Top", k, cost)
reliability_coefs(temp, 1:k) %>%
mutate(lower_ci = coeff.val - stats::qnorm(0.975) * coeff.se,
upper_ci = coeff.val + stats::qnorm(0.975) * coeff.se) %>%
select(model, coeff.name, coeff.val, lower_ci, upper_ci)
}
# Define values to loop over
k_vals <- 2:7
cost_levels <- c("Cheaper", "Expensive")
# Create all combinations of k and cost
combinations <- expand.grid(k = k_vals, cost = cost_levels, KEEP.OUT.ATTRS = FALSE)
# Apply the function for each combination
inter_df <- pmap_dfr(combinations, function(k, cost) {
find_inter_metrics(k, cost) %>%
mutate(k = k, cost = cost)
})
inter_df$abbrev <- factor(model_abbrev[inter_df$model])
intra_wide <- intra_df %>%
mutate(Rank = rep(paste("Top", 1:7), 2),
LLM = chat_model_abbrev[model]
) %>%
select(Rank, LLM)
inter_wide <- data.frame(Rank = 1:7,
Cheaper = intra_wide$LLM[1:7],
Expensive = intra_wide$LLM[8:14])
row.names(inter_wide) = NULL
# *Plot the Data ----------------------------------------------------------
inter_df |>
ggplot2::ggplot(ggplot2::aes(x = abbrev, y = coeff.val)) +
ggplot2::geom_blank() +
ggplot2::geom_point(ggplot2::aes(color = abbrev), size = 1.5, na.rm = TRUE) +
ggplot2::geom_errorbar(
ggplot2::aes(ymin = lower_ci, ymax = upper_ci, color = abbrev),
width = 0.5,
na.rm = TRUE
) +
# Annotate lower CI values
ggplot2::geom_text(
data = inter_df,
ggplot2::aes(
y = lower_ci,
label = sprintf("%.3f", lower_ci),
color = abbrev
),
size = 2.25,
hjust = 1.2,
na.rm = TRUE
) +
# Annotate upper CI values
ggplot2::geom_text(
data = inter_df,
ggplot2::aes(
y = upper_ci,
label = sprintf("%.3f", upper_ci),
color = abbrev
),
size = 2.25,
hjust = -0.2,
na.rm = TRUE
) +
ggplot2::scale_color_manual(values = chat_model_colors, na.translate = FALSE) +
ggplot2::scale_x_discrete(labels = model_labels) +
ggplot2::scale_y_continuous(limits = c(0.615, 0.953), breaks = seq(0.65, 0.9, 0.05)) +
ggplot2::coord_flip() +
ggplot2::facet_wrap(~ coeff.name, ncol = 2, scales = "fixed", axes = 'all') +
ggplot2::labs(
title = "Inter-LLM Reliability",
subtitle = "<span style='color:#808080'>Cheaper</span> vs. <span style='color:#C3142D'>more expensive (time, cost) </span> LLMs by company",
x = NULL,
y = NULL
) +
ggplot2::theme_minimal() +
ggplot2::theme(
plot.title = ggtext::element_markdown(hjust = 0.5, face = "bold", size = 9),
plot.subtitle = ggtext::element_markdown(hjust = 0.5, face = "bold", size = 9),
axis.title.x = ggtext::element_markdown(hjust = 0.5, face = "bold", size = 9),
axis.title.y = ggtext::element_markdown(hjust = 0.5, face = "bold", size = 9),
axis.text.x = ggtext::element_markdown(hjust = 0.5, size = 7, face = 'bold'),
axis.text.y = ggtext::element_markdown(hjust = 1, size = 7, face = 'bold'),
strip.text = ggtext::element_markdown(face = "bold", size = 8),
legend.position = "none",
panel.grid = ggplot2::element_blank(),
axis.line = ggplot2::element_line(color = "black"),
axis.ticks = ggplot2::element_line(color = 'black')
) -> original_plot
# * Adding a table in the empty space at the bottom right -----------------
# Custom theme
booktabs_theme <- ttheme_minimal(
core = list(
fg_params = list(fontface = "plain", cex = 0.6, hjust = 0, x = 0.05)
),
colhead = list(
fg_params = list(fontface = "bold", cex = 0.65, hjust = 0, x = 0.05),
bg_params = list(fill = NA) # No shading, like booktabs
),
  padding = unit(c(4.5, 4.5), "mm") # Wider padding for the rank table
)
# Base table
table_grob <- gridExtra::tableGrob(inter_wide, rows = NULL, theme = booktabs_theme)
table_grob$heights <- ggplot2::unit(rep(1, nrow(table_grob)), "lines")*0.7
# Add title row
title_grob <- grid::textGrob(
"LLM Rank based on Kripendorff's Alpha",
gp = grid::gpar(fontface = "bold", fontsize = 9),
x = 0.5, hjust = 0.5
)
# Add title row above the table
table_grob <- gtable::gtable_add_rows(table_grob, heights = unit(1.5, "lines"), pos = 0)
table_grob <- gtable::gtable_add_grob(
table_grob,
grobs = title_grob,
t = 1, l = 1, r = ncol(table_grob)
)
table_width <- sum(table_grob$widths)
table_height <- sum(table_grob$heights)
padding <- ggplot2::unit(.5, "mm")
# Color code the table
core_llm_indices <- which(
table_grob$layout$name == "core-fg" &
table_grob$layout$l == 1 # column 1
)
# Find all core (body cell) grobs
core_indices <- which(table_grob$layout$name == "core-fg")
# Loop over each core cell and color by column and center the first column
for (i in core_indices) {
col_num <- table_grob$layout[i, "l"]
original_gp <- table_grob$grobs[[i]]$gp
grob <- table_grob$grobs[[i]] # current cell grob
if (col_num == 1) {
# Center-align column 1 (Rank)
grob$hjust <- 0.5
grob$x <- unit(0.5, "npc")
} else if (col_num == 2) {
# Cheaper LLM column — gray
grob$gp <- modifyList(original_gp, gpar(col = "#808080"))
} else if (col_num == 3) {
# Expensive LLM column — red
grob$gp <- modifyList(original_gp, gpar(col = "#C3142D"))
}
table_grob$grobs[[i]] <- grob
}
# Add border around entire table using grobTree
bordered_table <- grid::grobTree(
grid::rectGrob(
width = table_width + padding,
height = table_height + padding,
gp = grid::gpar(fill = NA, lwd = 0.7, col ='black')
),
table_grob
)
final_plot <- cowplot::ggdraw() +
cowplot::draw_plot(original_plot, 0, 0, 1, 1) + # main plot takes full area
cowplot::draw_grob(bordered_table, x = 0.742, y = 0.16, width = 0.01, height = 0.01)
final_plot
In this section, we compare the LLM labels with the labels provided by the StockNewsAPI (our benchmark). We calculate the agreement between the LLM labels and these benchmark labels using our agreement metrics, and use that agreement to assess the validity of the LLM labels.
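For intuition, the per-row agreement computed below is the share of the five replicates that match the benchmark label, either penalizing NA replicates as disagreements or dropping them. The helper below is a minimal illustrative sketch under that assumption (its name and exact NA handling are ours); the project's `calculate_agreement()` used in the next chunk may differ in its details.
# illustrative sketch only -- NOT the project's calculate_agreement()
percent_agreement_with_benchmark <- function(reps, ground_truth, na_penalty = TRUE) {
  if (na_penalty) {
    # NA replicates count as disagreements with the benchmark label
    return(mean(!is.na(reps) & reps == ground_truth))
  }
  reps <- reps[!is.na(reps)]
  if (length(reps) == 0) return(NA_real_)
  mean(reps == ground_truth)
}
# example: three of five replicates match the benchmark "Positive" label
percent_agreement_with_benchmark(c("Positive", "Positive", "Negative", NA, "Positive"), "Positive")
# 0.6 with the NA penalty; 0.75 if NA replicates are dropped instead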
stock_sentiment_df = readr::read_csv("../data/binary_classification_data.csv") |>
dplyr::select(date, title, text, tickers, sentiment)
validity_df =
readr::read_csv("../results/binary_analysis_df.csv") |>
dplyr::select(-percent_agreement) |>
dplyr::left_join(stock_sentiment_df, by = c("date", "title", "text", "tickers")) |>
# percent of rep_1, rep_2, .., agreement with the benchmark
dplyr::rowwise() |>
dplyr::mutate(
agreement_na_penalty = calculate_agreement(
reps = dplyr::c_across(dplyr::starts_with("rep_")),
ground_truth = dplyr::c_across(dplyr::contains("sentiment"))
),
agreement_na_as_na = calculate_agreement(
reps = dplyr::c_across(dplyr::starts_with("rep_")),
ground_truth = dplyr::c_across(dplyr::contains("sentiment")),
na = NA
)
)
# save the validity dataframe as RDS and CSV files
readr::write_rds(validity_df, "../results/binary_sens_validity_df.rds")
readr::write_csv(validity_df, "../results/binary_sens_validity_df.csv")
We report accuracy, true positive rate (TPR), true negative rate (TNR), positive predictive value (PPV), and F1 score to provide a comprehensive view of performance. The figure below shows the LLM classification performance versus the benchmark labels (those obtained from the StockNewsAPI). Each dot represents a metric's mean value (across its five intra-rater replicates), and the whisker length reflects the standard error.
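For reference, these metrics are the standard confusion-matrix quantities, where $TP$, $TN$, $FP$, and $FN$ denote the counts of true positives, true negatives, false positives, and false negatives relative to the benchmark labels:

$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad
\text{TPR} = \frac{TP}{TP + FN}, \quad
\text{TNR} = \frac{TN}{TN + FP},
$$

$$
\text{PPV} = \frac{TP}{TP + FP}, \quad
F_1 = \frac{2 \cdot \text{PPV} \cdot \text{TPR}}{\text{PPV} + \text{TPR}}.
$$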
# ---- Setup ----
chat_models = c(
"phi4-mini",
"phi4:latest",
"llama3.2:1B",
"llama3.2:3B",
"gemma3:1B",
"gemma3:27B",
"deepseek-r1:1.5B",
"deepseek-r1:7B",
"command-r7b",
"command-r-plus-08-2024",
"gpt-4o-mini-2024-07-18",
"gpt-4o-2024-11-20",
"claude-3-5-haiku-20241022",
"claude-3-7-sonnet-20250219"
)
model_abbrev = c(
"phi4-mini" = "phi4-mini",
"phi4:latest" = "phi4:latest",
"llama3.2:1B" = "llama3.2:1B",
"llama3.2:3B" = "llama3.2:3B",
"gemma3:1B" = "gemma3:1B",
"gemma3:27B" = "gemma3:27B",
"deepseek-r1:1.5B" = "deepseek-r1:1.5B",
"deepseek-r1:7B" = "deepseek-r1:7B",
"command-r7b" = "command-r7b",
"command-r-plus-08-2024" = "command-r-plus",
"gpt-4o-mini-2024-07-18" = "gpt-4o-mini",
"gpt-4o-2024-11-20" = "gpt-4o",
"claude-3-5-haiku-20241022"= "claude-3-5-haiku",
"claude-3-7-sonnet-20250219" = "claude-3-7-sonnet"
)
chat_model_colors = rep(c('#808080', '#C3142D'), length.out = length(chat_models))
chat_model_colors = stats::setNames(chat_model_colors, model_abbrev[chat_models])
model_labels <- purrr::map_chr( model_abbrev[chat_models], ~{
if (grepl("^spacer", .x)) return("")
glue::glue("<span style='color:{chat_model_colors[.x]}'><b>{.x}</b></span>")
})
names(model_labels) <- model_abbrev[chat_models]
# ---- Load Data ----
inter_df =
tibble::tribble(
~Model, ~Accuracy, ~Acc_SE, ~TPR, ~TPR_SE, ~TNR, ~TNR_SE, ~PPV, ~PPV_SE, ~F1, ~F1_SE,
"phi4-mini", 0.830, 0.001, 0.817, 0.001, 0.843, 0.003, 0.839, 0.003, 0.828, 0.001,
"phi4:latest", 0.822, 0.001, 0.842, 0.003, 0.802, 0.001, 0.809, 0.000, 0.825, 0.001,
"llama3.2:1B", 0.864, 0.005, 0.881, 0.003, 0.847, 0.006, 0.852, 0.005, 0.867, 0.004,
"llama3.2:3B", 0.763, 0.003, 0.797, 0.000, 0.728, 0.005, 0.746, 0.004, 0.770, 0.002,
"gemma3:1B", 0.876, 0.002, 0.818, 0.008, 0.933, 0.005, 0.924, 0.004, 0.868, 0.003,
"gemma3:27B", 0.823, 0.001, 0.819, 0.003, 0.828, 0.001, 0.826, 0.001, 0.823, 0.001,
"deepseek-r1:1.5B", 0.775, 0.008, 0.725, 0.015, 0.825, 0.002, 0.806, 0.005, 0.764, 0.010,
"deepseek-r1:7B", 0.832, 0.002, 0.835, 0.005, 0.829, 0.001, 0.830, 0.001, 0.832, 0.003,
"command-r-plus", 0.774, 0.002, 0.715, 0.002, 0.832, 0.006, 0.810, 0.005, 0.760, 0.002,
"command-r7b", 0.847, 0.003, 0.883, 0.004, 0.810, 0.003, 0.823, 0.003, 0.852, 0.003,
"gpt-4o-mini", 0.820, 0.003, 0.800, 0.005, 0.840, 0.002, 0.834, 0.003, 0.817, 0.004,
"gpt-4o", 0.787, 0.007, 0.743, 0.011, 0.832, 0.003, 0.815, 0.004, 0.777, 0.008,
"claude-3-5-haiku", 0.848, 0.002, 0.882, 0.004, 0.815, 0.003, 0.826, 0.002, 0.853, 0.002,
"claude-3-7-sonnet",0.831, 0.003, 0.867, 0.004, 0.795, 0.003, 0.809, 0.002, 0.837, 0.003
)
inter_df = inter_df |>
dplyr::rename(Accuracy_mean = Accuracy,
TPR_mean = TPR,
TNR_mean = TNR,
PPV_mean = PPV,
F1_mean = F1) |>
dplyr::rename(Accuracy_se = Acc_SE,
TPR_se = TPR_SE,
TNR_se = TNR_SE,
PPV_se = PPV_SE,
F1_se = F1_SE)
inter_long = inter_df |>
tidyr::pivot_longer(
cols = -Model,
names_to = c("Metric", ".value"),
names_sep = "_"
) |>
dplyr::mutate(
lower = mean - se/sqrt(5),
upper = mean + se/sqrt(5),
Model = factor(Model, levels = rev(model_abbrev[chat_models]))
)
table_df <- inter_df |>
dplyr::select(Model, Accuracy_mean, TPR_mean, TNR_mean, PPV_mean, F1_mean) %>%
dplyr::rename(
LLM = Model,
ACC = Accuracy_mean,
TPR = TPR_mean,
TNR = TNR_mean,
PPV = PPV_mean,
F1 = F1_mean) |>
dplyr::mutate(dplyr::across(dplyr::where(is.numeric), ~ sprintf("%.2f", .))
)
# *Plot the Data ----------------------------------------------------------
inter_long |>
ggplot2::ggplot(ggplot2::aes(x = Model, y = mean, color = Model)) +
ggplot2::geom_blank() +
ggplot2::geom_point(size = 0.9, na.rm = TRUE) +
ggplot2::geom_errorbar(
ggplot2::aes(ymin = lower, ymax = upper),
width = 0.4,
na.rm = TRUE
) +
ggplot2::geom_text(ggplot2::aes(label = sprintf("%.3f", lower), y = lower), size = 2.25, hjust = 1.2, na.rm = TRUE) +
ggplot2::geom_text(ggplot2::aes(label = sprintf("%.3f", upper), y = upper), size = 2.25, hjust = -0.2, na.rm = TRUE) +
ggplot2::scale_color_manual(values = chat_model_colors, na.translate = FALSE) +
ggplot2::scale_x_discrete(labels = model_labels) +
ggplot2::scale_y_continuous(limits = c(0.65, 1.02), breaks = seq(0.7, 1.0, 0.1)) +
ggplot2::coord_flip() +
ggplot2::facet_wrap(~ Metric, ncol = 2, scales = "fixed", axes = 'all') +
ggplot2::labs(
title = "Classification Performance vs. StockNewsAPI Labels",
subtitle = "<span style='color:#808080'>Cheaper</span> vs. <span style='color:#C3142D'>more expensive (time, cost) </span> LLMs by company",
x = NULL,
y = NULL
) +
ggplot2::theme_minimal() +
ggplot2::theme(
plot.title = ggtext::element_markdown(hjust = 0.5, face = "bold", size = 9),
plot.subtitle = ggtext::element_markdown(hjust = 0.5, face = "bold", size = 8),
axis.title.x = ggtext::element_markdown(hjust = 0.5, face = "bold", size = 8),
axis.title.y = ggtext::element_markdown(hjust = 0.5, face = "bold", size = 8),
axis.text.x = ggtext::element_markdown(hjust = 0.5, size = 7, face = 'bold'),
axis.text.y = ggtext::element_markdown(hjust = 1, size = 7, face = 'bold'),
strip.text = ggtext::element_markdown(face = "bold", size = 8),
legend.position = "none",
panel.grid = ggplot2::element_blank(),
axis.line = ggplot2::element_line(color = "black"),
axis.ticks = ggplot2::element_line(color = 'black')
) -> original_plot
# * Adding a table in the empty space at the bottom right -----------------
# Custom theme
booktabs_theme <- ttheme_minimal(
core = list(
fg_params = list(fontface = "plain", cex = 0.55, hjust = 0, x = 0.05)
),
colhead = list(
fg_params = list(fontface = "bold", cex = 0.6, hjust = 0, x = 0.05),
bg_params = list(fill = NA) # No shading, like booktabs
),
padding = unit(c(1.7, 1.7), "mm") # Tighter padding
)
# Base table
table_grob <- gridExtra::tableGrob(table_df, rows = NULL, theme = booktabs_theme)
table_grob$heights <- ggplot2::unit(rep(1, nrow(table_grob)), "lines") * 0.5
# Add title row
title_grob <- grid::textGrob(
"Validity with Benchmark Model",
gp = grid::gpar(fontface = "bold", fontsize = 9),
x = 0.5, hjust = 0.5
)
# Add title row above the table
table_grob <- gtable::gtable_add_rows(table_grob, heights = unit(1.5, "lines"), pos = 0)
table_grob <- gtable::gtable_add_grob(
table_grob,
grobs = title_grob,
t = 1, l = 1, r = ncol(table_grob)
)
table_width <- sum(table_grob$widths)
table_height <- sum(table_grob$heights)
padding <- ggplot2::unit(.5, "mm")
# Color code the table
core_llm_indices <- which(
table_grob$layout$name == "core-fg" &
table_grob$layout$l == 1 # column 1
)
model_colors <- rep(c("#808080", "#C3142D"), length.out = 14) # matches inter_wide
for (i in seq_along(core_llm_indices)) {
grob_index <- core_llm_indices[i]
original_gp <- table_grob$grobs[[grob_index]]$gp
table_grob$grobs[[grob_index]]$gp <- modifyList(original_gp, gpar(col = model_colors[i]))
}
# Add border around entire table using grobTree
bordered_table <- grid::grobTree(
grid::rectGrob(
width = table_width+ padding,
height = table_height + padding,
gp = grid::gpar(fill = NA, lwd = 0.7, col ='black')
),
table_grob
)
final_plot <- cowplot::ggdraw() +
cowplot::draw_plot(original_plot, 0, 0, 1, 1) + # main plot takes full area
cowplot::draw_grob(bordered_table, x = 0.742, y = 0.16, width = 0.01, height = 0.01)
final_plot
In this section, we repeat the analysis from the previous section, but this time we compare the LLM labels with an “External Criterion”: labels derived from what actually happened to the stock prices. We calculate the agreement between the LLM labels and these “External Criterion” labels using our agreement metrics.
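Concretely, the external criterion for an article about ticker $i$ published on day $t$ is the sign of the next trading day's excess (percent) return over the S&P 500, as computed in the chunks that follow:

$$
\text{excess}_{i,t} = r_{i,t} - r_{\text{S\&P500},t}, \qquad
\text{criterion}_{i,t} =
\begin{cases}
\text{Positive}, & \text{excess}_{i,\,t+1} > 0,\\
\text{Negative}, & \text{excess}_{i,\,t+1} < 0,
\end{cases}
$$

where $t+1$ denotes the next available trading day (article dates falling on non-trading days are shifted forward, as implemented below).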
In the code chunk below, we use R and the tidyquant
package to extract the stock prices for the tickers associated with the
binary classification labels. We then save the raw stock prices to a CSV
file for further analysis.
tickers_df = readr::read_csv(
file = "../data/binary_classification_data.csv"
)
tickers_list = tickers_df$tickers |> unique()
min_date = min(tickers_df$date)
max_date = max(tickers_df$date)
stock_prices = tidyquant::tq_get(
x = tickers_list,
from = min_date,
to = max_date,
get = "stock.prices"
)
stock_prices$symbol |> unique() |> length()
other_stocks = c("MGC", "^GSPC")
#AHT, CRWV, MTTR (after 2025-03-03 not available)
other_prices = tidyquant::tq_get(
x = other_stocks,
from = min_date,
to = max_date,
get = "stock.prices"
)
stock_prices |>
  # bind before selecting/renaming so the benchmark tickers keep matching column names
  dplyr::bind_rows(other_prices) |>
  dplyr::select(symbol, date, adjusted) |>
  dplyr::rename(ticker = symbol) |>
  dplyr::arrange(ticker, date) |>
  readr::write_csv(
    "../results/stock_prices_raw.csv"
  )
# read the stock prices and compute the returns
returns_df =
readr::read_csv("../results/stock_prices_raw.csv") |>
dplyr::group_by(ticker) |>
dplyr::mutate(
return = (adjusted - dplyr::lag(adjusted)) / dplyr::lag(adjusted),
perc_return = return * 100
) |>
dplyr::ungroup()
# calculate the baseline s&p returns
sp500_returns = returns_df |>
dplyr::filter(ticker == '^GSPC') |>
dplyr::select(date, sp500_perc_return = perc_return)
# combine that with the stock returns
returns_df = returns_df |>
dplyr::left_join(sp500_returns, by = "date") |>
dplyr::filter(ticker != '^GSPC') |>
dplyr::mutate(
excess_return = perc_return - sp500_perc_return,
excess_return_sign = dplyr::case_when(
excess_return > 0 ~ "Positive",
excess_return < 0 ~ "Negative",
TRUE ~ NA
),
leading_news_effect = dplyr::lead(excess_return_sign)
)
# save the returns dataframe as RDS and CSV files
readr::write_rds(returns_df, "../results/stock_returns_df.rds")
readr::write_csv(returns_df, "../results/stock_returns_df.csv")
# show the results for the first 5 tickers of interest (excluding S&P 500)
DT::datatable(
returns_df |>
dplyr::filter( ticker %in% unique(returns_df$ticker)[1:5] ),
rownames = FALSE,
extensions = c("FixedColumns"),
options = list(
pageLength = 10,
scrollX = TRUE,
fixedColumns = list(leftColumns = 1)
)
) |>
DT::formatRound(columns = 3:7, digits = 2) |>
DT::formatStyle(
    columns = 'leading_news_effect', # style the leading_news_effect column
    valueColumns = 'leading_news_effect',
    color = DT::JS("value == 'Positive' ? '#2C7BB6' : '#D7191C'"),
fontWeight = 'bold' # Bold the text
)
News Effect with Our Binary Classification Labels

In the code chunk below, we merge the binary classification labels with the leading news effect. We also calculate the distribution of the leading news effect and identify the tickers with missing values.
# read the stock returns dataframe
adjusted_news_df =
readr::read_csv("../results/binary_analysis_df.csv") |>
dplyr::mutate(date = lubridate::as_date(date)) |>
dplyr::select(-percent_agreement)
## Rows: 18900 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): title, text, tickers, chat_model, rep_1, rep_2, rep_3, rep_4, rep_5
## dbl (2): percent_agreement, percent_agreement_drop_na
## dttm (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# NYSE holidays within the date range
min_date = min(adjusted_news_df$date) |> as.Date()
max_date = max(adjusted_news_df$date) |> as.Date()
# nyse holidays within our range dates
nyse_holidays =
timeDate::holidayNYSE(year = lubridate::year(min_date):lubridate::year(max_date)) |>
as.data.frame() |>
dplyr::rename(off_dates = 1) |>
dplyr::mutate(off_dates = as.Date(off_dates)) |>
dplyr::filter(off_dates >= min_date & off_dates <= max_date) |>
dplyr::pull(off_dates)
# weekends within the date range
weekends =
data.frame(off_dates = seq.Date(min_date, max_date, by = "day")) |>
dplyr::mutate(
weekday = lubridate::wday(off_dates, label = TRUE, abbr = FALSE)
) |>
dplyr::filter(weekday %in% c('Saturday', 'Sunday') ) |>
dplyr::pull(off_dates)
# combine the holidays and weekends
# off_dates = c(nyse_holidays, weekends) |> as.Date()
off_dates = weekends
# trading days
trading_days = seq.Date(min_date, max_date, by = "day")
trading_days_index = !trading_days %in% off_dates
trading_days = trading_days[trading_days_index]
# helper function to find the next trading day after the current date
next_trading_day = function(current_date, trading_days) {
next_day = trading_days[trading_days > current_date]
if (length(next_day) > 0) {
return(min(next_day)) # return the next available trading day
} else {
return(NA) # if no future trading day is available (unlikely in this case)
}
}
# create a binary gold dataframe with the leading news effect
binary_gold_df =
adjusted_news_df |>
dplyr::mutate(
day_of_week = lubridate::wday(date, label = TRUE, abbr = FALSE),
impact_date = date, # set impact_date as original date first
# if it's a holiday or weekend, pick the next available trading day
impact_date = dplyr::if_else(
date %in% off_dates,
purrr::map_dbl(date, ~ next_trading_day(.x, trading_days)) |> as.Date(),
impact_date # Keep original date if not a weekend/holiday
)
) |>
# left join with returns_df using the adjusted impact_date
dplyr::left_join(
returns_df |> dplyr::select(date, ticker, leading_news_effect),
by = c("impact_date" = "date", "tickers" = "ticker")
)
# distribution of the leading news effect
news_effect_table =
table(binary_gold_df$leading_news_effect, useNA = "ifany") / 14
# explanation with cat
cat(
  "The counts below show the average distribution of the leading news effect, normalized by the number of LLM models (14):",
  "\n\nNegative news effect count per model:", news_effect_table["Negative"],
  "\nPositive news effect count per model:", news_effect_table["Positive"],
  "\nMissing or NA values per model:", news_effect_table[3],
  "\n\nThe missing or NA values most likely arise because we could not retrieve stock prices for certain tickers, possibly because they represent privately held companies."
)
The counts below show the average distribution of the leading news effect, normalized by the number of LLM models (14):

Negative news effect count per model: 671
Positive news effect count per model: 672
Missing or NA values per model: 7

The missing or NA values most likely arise because we could not retrieve stock prices for certain tickers, possibly because they represent privately held companies.
The comparison against actual market behavior, depicted in the figure below, contrasts with the benchmark results above. Each dot represents a metric's mean value (across its five intra-rater replicates), and the whisker length reflects the standard error.
# library(tidyverse)
# library(grid)
# library(gridExtra)
# ---- Setup ----
chat_models = c(
"phi4-mini",
"phi4:latest",
"llama3.2:1B",
"llama3.2:3B",
"gemma3:1B",
"gemma3:27B",
"deepseek-r1:1.5B",
"deepseek-r1:7B",
"command-r7b",
"command-r-plus-08-2024",
"gpt-4o-mini-2024-07-18",
"gpt-4o-2024-11-20",
"claude-3-5-haiku-20241022",
"claude-3-7-sonnet-20250219"
)
model_abbrev = c(
"phi4-mini" = "phi4-mini",
"phi4:latest" = "phi4:latest",
"llama3.2:1B" = "llama3.2:1B",
"llama3.2:3B" = "llama3.2:3B",
"gemma3:1B" = "gemma3:1B",
"gemma3:27B" = "gemma3:27B",
"deepseek-r1:1.5B" = "deepseek-r1:1.5B",
"deepseek-r1:7B" = "deepseek-r1:7B",
"command-r7b" = "command-r7b",
"command-r-plus-08-2024" = "command-r-plus",
"gpt-4o-mini-2024-07-18" = "gpt-4o-mini",
"gpt-4o-2024-11-20" = "gpt-4o",
"claude-3-5-haiku-20241022"= "claude-3-5-haiku",
"claude-3-7-sonnet-20250219" = "claude-3-7-sonnet"
)
chat_model_colors = rep(c('#808080', '#C3142D'), length.out = length(chat_models))
chat_model_colors = stats::setNames(chat_model_colors, model_abbrev[chat_models])
model_labels <- purrr::map_chr( model_abbrev[chat_models], ~{
if (grepl("^spacer", .x)) return("")
glue::glue("<span style='color:{chat_model_colors[.x]}'><b>{.x}</b></span>")
})
names(model_labels) <- model_abbrev[chat_models]
# ---- Load Data ----
inter_df =
tibble::tribble(
~Model, ~Accuracy, ~Acc_SE, ~TPR, ~TPR_SE, ~TNR, ~TNR_SE, ~PPV, ~PPV_SE, ~F1, ~F1_SE,
"phi4-mini", 0.487, 0.002, 0.495, 0.002, 0.479, 0.006, 0.485, 0.002, 0.490, 0.000,
"phi4:latest", 0.513, 0.003, 0.526, 0.001, 0.500, 0.004, 0.511, 0.003, 0.518, 0.002,
"llama3.2:1B", 0.512, 0.000, 0.527, 0.001, 0.497, 0.001, 0.510, 0.000, 0.518, 0.000,
"llama3.2:3B", 0.478, 0.000, 0.525, 0.003, 0.432, 0.002, 0.478, 0.000, 0.501, 0.002,
"gemma3:1B", 0.489, 0.002, 0.442, 0.005, 0.535, 0.001, 0.486, 0.002, 0.463, 0.004,
"gemma3:27B", 0.509, 0.002, 0.505, 0.003, 0.512, 0.001, 0.507, 0.002, 0.506, 0.002,
"deepseek-r1:1.5B", 0.493, 0.003, 0.447, 0.010, 0.539, 0.004, 0.490, 0.003, 0.468, 0.007,
"deepseek-r1:7B", 0.496, 0.002, 0.509, 0.002, 0.483, 0.001, 0.494, 0.002, 0.501, 0.002,
"command-r-plus", 0.512, 0.004, 0.450, 0.002, 0.574, 0.007, 0.511, 0.005, 0.479, 0.003,
"command-r7b", 0.507, 0.004, 0.553, 0.002, 0.462, 0.005, 0.505, 0.003, 0.527, 0.003,
"gpt-4o-mini", 0.502, 0.002, 0.482, 0.004, 0.522, 0.003, 0.500, 0.003, 0.491, 0.003,
"gpt-4o", 0.509, 0.002, 0.461, 0.007, 0.558, 0.005, 0.508, 0.002, 0.483, 0.004,
"claude-3-5-haiku", 0.512, 0.003, 0.549, 0.006, 0.476, 0.002, 0.509, 0.003, 0.528, 0.004,
"claude-3-7-sonnet",0.514, 0.003, 0.548, 0.005, 0.480, 0.002, 0.511, 0.002, 0.529, 0.004
)
inter_df = inter_df |>
dplyr::rename(Accuracy_mean = Accuracy,
TPR_mean = TPR,
TNR_mean = TNR,
PPV_mean = PPV,
F1_mean = F1) |>
dplyr::rename(Accuracy_se = Acc_SE,
TPR_se = TPR_SE,
TNR_se = TNR_SE,
PPV_se = PPV_SE,
F1_se = F1_SE)
inter_long = inter_df |>
tidyr::pivot_longer(
cols = -Model,
names_to = c("Metric", ".value"),
names_sep = "_"
) |>
dplyr::mutate(
lower = mean - (se/sqrt(5)),
upper = mean + (se/sqrt(5)),
Model = factor(Model, levels = rev(model_abbrev[chat_models]))
)
table_df <- inter_df |>
dplyr::select(Model, Accuracy_mean, TPR_mean, TNR_mean, PPV_mean, F1_mean) %>%
dplyr::rename(
LLM = Model,
ACC = Accuracy_mean,
TPR = TPR_mean,
TNR = TNR_mean,
PPV = PPV_mean,
F1 = F1_mean) |>
dplyr::mutate(dplyr::across(dplyr::where(is.numeric), ~ sprintf("%.2f", .))
)
# *Plot the Data ----------------------------------------------------------
inter_long |>
ggplot2::ggplot(ggplot2::aes(x = Model, y = mean, color = Model)) +
ggplot2::geom_blank() +
ggplot2::geom_point(size = 0.9, na.rm = TRUE) +
ggplot2::geom_errorbar(
ggplot2::aes(ymin = lower, ymax = upper),
width = 0.4,
na.rm = TRUE
) +
ggplot2::geom_text(ggplot2::aes(label = sprintf("%.3f", lower), y = lower), size = 2.25, hjust = 1.2, na.rm = TRUE) +
ggplot2::geom_text(ggplot2::aes(label = sprintf("%.3f", upper), y = upper), size = 2.25, hjust = -0.2, na.rm = TRUE) +
ggplot2::scale_color_manual(values = chat_model_colors, na.translate = FALSE) +
ggplot2::scale_x_discrete(labels = model_labels) +
ggplot2::scale_y_continuous(limits = c(0.375, 0.65), breaks = seq(0.4, .61, 0.1)) +
ggplot2::coord_flip() +
ggplot2::facet_wrap(~ Metric, ncol = 2, scales = "fixed", axes = 'all') +
ggplot2::labs(
title = "Classification Performance vs. Actual Market Behavior",
subtitle = "<span style='color:#808080'>Cheaper</span> vs. <span style='color:#C3142D'>more expensive (time, cost) </span> LLMs by company",
x = NULL,
y = NULL
) +
ggplot2::theme_minimal() +
ggplot2::theme(
plot.title = ggtext::element_markdown(hjust = 0.5, face = "bold", size = 9),
plot.subtitle = ggtext::element_markdown(hjust = 0.5, face = "bold", size = 8),
axis.title.x = ggtext::element_markdown(hjust = 0.5, face = "bold", size = 8),
axis.title.y = ggtext::element_markdown(hjust = 0.5, face = "bold", size = 8),
axis.text.x = ggtext::element_markdown(hjust = 0.5, size = 7, face = 'bold'),
axis.text.y = ggtext::element_markdown(hjust = 1, size = 7, face = 'bold'),
strip.text = ggtext::element_markdown(face = "bold", size = 8),
legend.position = "none",
panel.grid = ggplot2::element_blank(),
axis.line = ggplot2::element_line(color = "black"),
axis.ticks = ggplot2::element_line(color = 'black')
) -> original_plot
# * Adding a table in the empty space at the bottom right -----------------
# Custom theme
booktabs_theme <- ttheme_minimal(
core = list(
fg_params = list(fontface = "plain", cex = 0.55, hjust = 0, x = 0.05)
),
colhead = list(
fg_params = list(fontface = "bold", cex = 0.6, hjust = 0, x = 0.05),
bg_params = list(fill = NA) # No shading, like booktabs
),
padding = unit(c(1.7, 1.7), "mm") # Tighter padding
)
# Base table
table_grob <- gridExtra::tableGrob(table_df, rows = NULL, theme = booktabs_theme)
table_grob$heights <- ggplot2::unit(rep(1, nrow(table_grob)), "lines") * 0.5
# Add title row
title_grob <- grid::textGrob(
"Validity with External Criterion",
gp = grid::gpar(fontface = "bold", fontsize = 9),
x = 0.5, hjust = 0.5
)
# Add title row above the table
table_grob <- gtable::gtable_add_rows(table_grob, heights = unit(1.5, "lines"), pos = 0)
table_grob <- gtable::gtable_add_grob(
table_grob,
grobs = title_grob,
t = 1, l = 1, r = ncol(table_grob)
)
table_width <- sum(table_grob$widths)
table_height <- sum(table_grob$heights)
padding <- ggplot2::unit(.5, "mm")
# Color code the table
core_llm_indices <- which(
table_grob$layout$name == "core-fg" &
table_grob$layout$l == 1 # column 1
)
model_colors <- rep(c("#808080", "#C3142D"), length.out = 14) # matches inter_wide
for (i in seq_along(core_llm_indices)) {
grob_index <- core_llm_indices[i]
original_gp <- table_grob$grobs[[grob_index]]$gp
table_grob$grobs[[grob_index]]$gp <- modifyList(original_gp, gpar(col = model_colors[i]))
}
# Add border around entire table using grobTree
bordered_table <- grid::grobTree(
grid::rectGrob(
width = table_width+ padding,
height = table_height + padding,
gp = grid::gpar(fill = NA, lwd = 0.7, col ='black')
),
table_grob
)
final_plot <- cowplot::ggdraw() +
cowplot::draw_plot(original_plot, 0, 0, 1, 1) + # main plot takes full area
cowplot::draw_grob(bordered_table, x = 0.742, y = 0.16, width = 0.01, height = 0.01)
final_plot
Email: fmegahed@miamioh.edu | Phone: +1-513-529-4185 | Website: Miami University Official↩︎
Email: ychen4@udayton.edu | Phone: +1-937-229-2405 | Website: University of Dayton Official↩︎
Email: farmerl2@miamioh.edu | Phone: +1-513-529-4823 | Website: Miami University Official↩︎
Email: leeyh2@miamioh.edu | Phone: +1-513-529-2164 | Website: Miami University Official↩︎
Email: wangj249@miamioh.edu | Phone: +1-513-529-1577 | Website: Miami University Official↩︎
Email: i.m.zwetsloot@uva.nl | Website: University of Amsterdam Official↩︎