User guide

This guide presents the general workflow for using the package. Below:

  • We first present how to run a simple query using some embedding model.

  • We then show how to run multiple queries on multiple embeddings.

  • After that, we show how to compare, through rankings, the results of running multiple sets of queries on multiple embeddings with different metrics.

  • Finally, we show how to calculate the correlations between the rankings obtained.


To accurately study the biases contained in word embeddings, queries may contain words that could be offensive to certain groups or individuals. The relationships studied between these words DO NOT represent the ideas, thoughts or beliefs of the authors of this library. This applies to this and all pages of the documentation.


If you are not familiar with the concepts of query, target and attribute set, please visit the framework section on the library’s about page. These concepts will be widely used in the following sections.

A Jupyter notebook with this code is available at the following link: WEFE User Guide.

Run a Query

The following code explains how to run a gender query using Glove embeddings and the Word Embedding Association Test (WEAT) as the fairness metric.

Below we show the three usual steps for performing a query in WEFE:

# Load the package
from wefe.query import Query
from wefe.word_embedding_model import WordEmbeddingModel
from wefe.metrics.WEAT import WEAT
from wefe.datasets.datasets import load_weat
import gensim.downloader as api

Load a word embeddings model as a WordEmbeddingModel object.

Here, we load the pretrained word embedding model using the gensim library and then create a WordEmbeddingModel instance. For this example, we will use a 25-dimensional Glove embedding model trained on a Twitter dataset.

twitter_25 = api.load('glove-twitter-25')
model = WordEmbeddingModel(twitter_25, 'glove twitter dim=25')

Create the query using a Query object

Define the target and attribute words sets and create a Query object that contains them. Some well-known word sets are already provided by the package and can be easily loaded by the user. Users can also set their own custom-made sets.

For this example, we will create a query with gender terms with respect to family and career. The words we will use will be taken from the set of words used in the WEAT paper (included in the package).

# load the weat word sets
word_sets = load_weat()

gender_query = Query([word_sets['male_terms'], word_sets['female_terms']],
              [word_sets['career'], word_sets['family']],
              ['Male terms', 'Female terms'], ['Career', 'Family'])

Run the Query

Instantiate the metric that you will use and then execute run_query with the parameters created in the previous steps.

The bias measurement process consists of three stages:

  1. Check the measurement parameters.

  2. Transform the word sets into word embeddings.

  3. Calculate the metric.

In this case we are going to use the WEAT metric.

weat = WEAT()
result = weat.run_query(gender_query, model)
{'query_name': 'Male terms and Female terms wrt Career and Family',
 'result': 0.3165841,
 'weat': 0.3165841,
 'effect_size': 0.677944,
 'p-value': None}

By default, the result is a dict containing the query name (in the query_name key) and the calculated value of the metric (in the result key). It also contains a key named after the metric that holds the same calculated value (duplicated in the result key).

Depending on the metric class used, the result dict can also include more metrics, detailed word-by-word values or other statistics. Some metrics also allow you to change which value is placed in the result key, which matters later in this guide.

In this case, WEAT returns the values of weat and effect_size, with weat as the default in the result key.
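To make the two values more concrete, here is a minimal sketch of how a WEAT test statistic and effect size are usually defined in the WEAT paper. It is not WEFE's implementation: the function names, the toy 2-dimensional vectors and the choice of the sample standard deviation are assumptions made for illustration, and the p-value is left out.

```python
import math
from statistics import mean, stdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(w, attr_a, attr_b):
    """s(w, A, B): mean similarity of w to A minus mean similarity of w to B."""
    return (mean(cosine(w, a) for a in attr_a)
            - mean(cosine(w, b) for b in attr_b))


def weat_sketch(target_x, target_y, attr_a, attr_b):
    """Return (test statistic, effect size) for lists of embedding vectors."""
    s_x = [association(x, attr_a, attr_b) for x in target_x]
    s_y = [association(y, attr_a, attr_b) for y in target_y]
    statistic = sum(s_x) - sum(s_y)
    # Assumption: sample standard deviation over all target words.
    effect_size = (mean(s_x) - mean(s_y)) / stdev(s_x + s_y)
    return statistic, effect_size


# Toy 2-d vectors, invented for illustration only.
X = [[1.0, 0.1], [0.9, 0.2]]  # first target set
Y = [[0.1, 1.0], [0.2, 0.9]]  # second target set
A = [[1.0, 0.0]]              # first attribute set
B = [[0.0, 1.0]]              # second attribute set

statistic, effect_size = weat_sketch(X, Y, A, B)
print(statistic > 0, effect_size > 0)
```

In this toy setup the first target set lies close to the first attribute set, so both values come out positive; swapping the two target sets flips the sign of the statistic.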

Metric Params

Each metric lets you vary the behavior of run_query through different parameters. For example, there are parameters to change the preprocessing of the words, to warn about errors, or to modify what the method returns by default.

The parameters of each metric are detailed in the API documentation.

In this case, if we want run_query to return effect_size instead of weat in the result key, we can pass the parameter return_effect_size=True when calling run_query. Note that this parameter is specific to the WEAT class.

weat = WEAT()
result = weat.run_query(gender_query, model, return_effect_size = True)
{'query_name': 'Male terms and Female terms wrt Career and Family',
 'result': 0.677944,
 'weat': 0.3165841,
 'effect_size': 0.677944,
 'p-value': None}
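
As a rough illustration of how the result key relates to the other keys, the following sketch rebuilds the two dicts shown above. The helper build_weat_result is hypothetical and not part of WEFE; it only mimics the documented effect of return_effect_size, using the values printed in this guide.

```python
def build_weat_result(query_name, weat, effect_size, return_effect_size=False):
    """Hypothetical helper: mimic the layout of the dict returned by run_query."""
    return {
        "query_name": query_name,
        # 'result' mirrors either the WEAT score or the effect size.
        "result": effect_size if return_effect_size else weat,
        "weat": weat,
        "effect_size": effect_size,
        "p-value": None,
    }


name = "Male terms and Female terms wrt Career and Family"
default = build_weat_result(name, 0.3165841, 0.677944)
with_effect_size = build_weat_result(name, 0.3165841, 0.677944,
                                     return_effect_size=True)

print(default["result"])           # 0.3165841
print(with_effect_size["result"])  # 0.677944
```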

Word preprocessors

Some word embedding models have vocabularies that are uncased or that lack accents. In Glove, for example, all the words in the vocabulary are lowercase. However, many words in WEAT’s ethnicity word sets are cased.

['Adam', 'Harry', 'Josh', 'Roger', 'Alan', 'Frank', 'Justin', 'Ryan', 'Andrew', 'Jack', 'Matthew', 'Stephen', 'Brad', 'Greg', 'Paul', 'Jonathan', 'Peter', 'Amanda', 'Courtney', 'Heather', 'Melanie', 'Sara', 'Amber', 'Katie', 'Betsy', 'Kristin', 'Nancy', 'Stephanie', 'Ellen', 'Lauren', 'Colleen', 'Emily', 'Megan', 'Rachel']

If we carelessly execute the following query, we could lose many words, or even entire sets, when transforming the word sets into embeddings.

You can make run_query log the words that were lost in the transformation to vectors by using the parameter warn_not_found_words=True.

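ethnicity_query = Query(
    [word_sets['european_american_names_5'],
     word_sets['african_american_names_5']],
    [word_sets['pleasant_5'], word_sets['unpleasant_5']],
    ['European american names(5)', 'African american names(5)'],
    ['Pleasant(5)', 'Unpleasant(5)'])

result = weat.run_query(ethnicity_query, model,
                        warn_not_found_words=True)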
WARNING:root:The following words from set 'European american names(5)' do not exist within the vocabulary of glove twitter dim=25: ['Adam', 'Harry', 'Josh', 'Roger', 'Alan', 'Frank', 'Justin', 'Ryan', 'Andrew', 'Jack', 'Matthew', 'Stephen', 'Brad', 'Greg', 'Paul', 'Jonathan', 'Peter', 'Amanda', 'Courtney', 'Heather', 'Melanie', 'Sara', 'Amber', 'Katie', 'Betsy', 'Kristin', 'Nancy', 'Stephanie', 'Ellen', 'Lauren', 'Colleen', 'Emily', 'Megan', 'Rachel']
WARNING:root:The transformation of 'European american names(5)' into glove twitter dim=25 embeddings lost proportionally more words than specified in 'lost_words_threshold': 1.0 lost with respect to 0.2 maximum loss allowed.
WARNING:root:The following words from set 'African american names(5)' do not exist within the vocabulary of glove twitter dim=25: ['Alonzo', 'Jamel', 'Theo', 'Alphonse', 'Jerome', 'Leroy', 'Torrance', 'Darnell', 'Lamar', 'Lionel', 'Tyree', 'Deion', 'Lamont', 'Malik', 'Terrence', 'Tyrone', 'Lavon', 'Marcellus', 'Wardell', 'Nichelle', 'Shereen', 'Ebony', 'Latisha', 'Shaniqua', 'Jasmine', 'Tanisha', 'Tia', 'Lakisha', 'Latoya', 'Yolanda', 'Malika', 'Yvette']
WARNING:root:The transformation of 'African american names(5)' into glove twitter dim=25 embeddings lost proportionally more words than specified in 'lost_words_threshold': 1.0 lost with respect to 0.2 maximum loss allowed.
ERROR:root:At least one set of 'European american names(5) and African american names(5) wrt Pleasant(5) and Unpleasant(5)' query has proportionally fewer embeddings than allowed by the lost_vocabulary_threshold parameter (0.2). This query will return np.nan.
{'query_name': 'European american names(5) and African american names(5) wrt Pleasant(5) and Unpleasant(5)', 'result': nan, 'weat': nan, 'effect_size': nan}


In order to give more robustness to the results, if more than 20% (by default) of the words from any of the word sets of the query are not included in the word embedding model, the result of the metric will be np.nan. This behavior can be changed using a float number parameter called lost_vocabulary_threshold.
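
The threshold rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: check_word_set is a hypothetical helper, the vocabulary is a toy lowercase set, and the real check in WEFE may differ in its details.

```python
import math


def check_word_set(words, vocabulary, lost_vocabulary_threshold=0.2):
    """Return the lost words, their proportion, and whether it exceeds the threshold."""
    lost = [word for word in words if word not in vocabulary]
    lost_ratio = len(lost) / len(words)
    return lost, lost_ratio, lost_ratio > lost_vocabulary_threshold


# Toy lowercase-only vocabulary, invented for illustration.
vocabulary = {"adam", "harry", "josh", "roger"}
cased_names = ["Adam", "Harry", "Josh", "Roger"]

lost, lost_ratio, exceeded = check_word_set(cased_names, vocabulary)
# 0.0 is a placeholder for whatever the metric would have computed.
result = math.nan if exceeded else 0.0
print(lost_ratio, math.isnan(result))
```

Because every cased name is missing from the lowercase vocabulary, the whole set is lost and the placeholder result becomes nan, which matches the behavior seen in the log above.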

One of the parameters of run_query, preprocessor_args, allows you to run a preprocessor on each word of every set before looking up its vector. This preprocessor can lowercase the words, remove accents, or apply any other custom preprocessing given by the user.

The possible options for preprocessor_args are:

  • lowercase: bool. Indicates if the words are transformed to lowercase.

  • strip_accents: bool, {'ascii', 'unicode'}: Specifies if the accents of the words are eliminated. The stripping type can be specified. True uses 'unicode' by default.

  • preprocessor: Callable. It receives a function that operates on each word. In the case of specifying a function, it overrides the default preprocessor (i.e., the previous options stop working).

weat = WEAT()
result = weat.run_query(ethnicity_query,
                        model,
                        preprocessor_args={
                            'lowercase': True,
                            'strip_accents': True
                        })
{'query_name': 'European american names(5) and African american names(5) wrt Pleasant(5) and Unpleasant(5)', 'result': 3.7529151, 'weat': 3.7529151, 'effect_size': 1.2746819, 'p-value': None}
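
The three preprocessing options listed above can be approximated with the standard library. The sketch below is an assumption about how such a preprocessor could behave, not WEFE's actual code; preprocess_word and strip_accents_unicode are hypothetical names.

```python
import unicodedata


def strip_accents_unicode(word):
    """Drop combining marks after NFKD decomposition."""
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_accents_ascii(word):
    """Decompose the word and discard every non-ASCII character."""
    decomposed = unicodedata.normalize("NFKD", word)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def preprocess_word(word, lowercase=False, strip_accents=False, preprocessor=None):
    """Hypothetical helper mirroring the three options described above."""
    if preprocessor is not None:
        # A custom callable replaces the built-in options entirely.
        return preprocessor(word)
    if lowercase:
        word = word.lower()
    if strip_accents in (True, "unicode"):
        word = strip_accents_unicode(word)
    elif strip_accents == "ascii":
        word = strip_accents_ascii(word)
    return word


print(preprocess_word("José", lowercase=True, strip_accents=True))
print(preprocess_word("Adam", lowercase=True, preprocessor=str.upper))
```

The second call shows the override: once a callable is given, the lowercase flag no longer has any effect.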

Sometimes you may first want to look up the vector of a word with its original casing (since this vector may contain more information than that of the lowercased word) and, only if it does not exist in the model, look up its lowercase representation. This behavior can be obtained by specifying preprocessing options in secondary_preprocessor_args and leaving the primary preprocessor at its default (i.e., without providing preprocessor_args).

In general, vectors are searched for first using the preprocessor specified in preprocessor_args and then using the one specified in secondary_preprocessor_args, if it was provided. Therefore, any combination of the two is also supported.

weat = WEAT()
result = weat.run_query(ethnicity_query,
                        model,
                        secondary_preprocessor_args={
                            'lowercase': True,
                            'strip_accents': True
                        })
{'query_name': 'European american names(5) and African american names(5) wrt Pleasant(5) and Unpleasant(5)',
 'result': 3.7529151,
 'weat': 3.7529151,
 'effect_size': 1.2746819,
 'p-value': None}
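
The two-step lookup described above can be pictured as a simple fallback. The following sketch assumes a plain dict as a stand-in for the embedding model; lookup_vector is a hypothetical function, not a WEFE API.

```python
def lookup_vector(word, vectors, primary=None, secondary=None):
    """Try the primary preprocessing first, then fall back to the secondary one."""
    primary = primary or (lambda w: w)  # default: use the word as written
    candidate = primary(word)
    if candidate in vectors:
        return candidate, vectors[candidate]
    if secondary is not None:
        candidate = secondary(word)
        if candidate in vectors:
            return candidate, vectors[candidate]
    return None, None  # the word is lost


# Toy vocabulary: one cased entry and one lowercase entry.
vectors = {"Paris": [0.3, 0.7], "adam": [0.1, 0.9]}

print(lookup_vector("Paris", vectors, secondary=str.lower))  # found as written
print(lookup_vector("Adam", vectors, secondary=str.lower))   # found via fallback
print(lookup_vector("Zed", vectors, secondary=str.lower))    # not found
```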

Running multiple Queries

We usually want to test several queries that study some criterion of bias: gender, ethnicity, religion, politics, socioeconomic status, among others. Let’s suppose you’ve created 20 queries that study gender bias on different embedding models. Calling run_query on each embedding-query pair is cumbersome and requires extra work to implement.

This is why the library also implements a function to test multiple queries on various word embedding models in a single call: the run_queries util.

The following code shows how to run various gender queries on Glove embedding models with different dimensions trained on the Twitter dataset. The queries will be executed using the WEAT metric.

from wefe.query import Query
from wefe.word_embedding_model import WordEmbeddingModel
from wefe.metrics import WEAT, RNSB

from wefe.datasets import load_weat
from wefe.utils import run_queries

import gensim.downloader as api

Load the models:

Load three different Glove Twitter embedding models. These models were trained using the same dataset varying the number of embedding dimensions.

model_1 = WordEmbeddingModel(api.load('glove-twitter-25'),
                             'glove twitter dim=25')
model_2 = WordEmbeddingModel(api.load('glove-twitter-50'),
                             'glove twitter dim=50')
model_3 = WordEmbeddingModel(api.load('glove-twitter-100'),
                             'glove twitter dim=100')

models = [model_1, model_2, model_3]

Load the word sets and create the queries

Now, we will load the WEAT word set and create three queries. The three queries are intended to measure gender bias.

# Load the WEAT word sets
word_sets = load_weat()

# Create gender queries
gender_query_1 = Query([word_sets['male_terms'], word_sets['female_terms']],
                       [word_sets['career'], word_sets['family']],
                       ['Male terms', 'Female terms'], ['Career', 'Family'])

gender_query_2 = Query([word_sets['male_terms'], word_sets['female_terms']],
                       [word_sets['science'], word_sets['arts']],
                       ['Male terms', 'Female terms'], ['Science', 'Arts'])

gender_query_3 = Query([word_sets['male_terms'], word_sets['female_terms']],
                       [word_sets['math'], word_sets['arts_2']],
                       ['Male terms', 'Female terms'], ['Math', 'Arts'])

gender_queries = [gender_query_1, gender_query_2, gender_query_3]

Run the queries on all Word Embeddings using WEAT.

Now, to run our list of queries on our list of models, we call run_queries using the objects defined in the previous steps. The function has three mandatory parameters:

  • a metric,

  • a list of queries, and,

  • a list of embedding models.

It is also possible to provide a name for the criterion studied in this set of queries through the parameter queries_set_name.
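
Conceptually, such a call amounts to evaluating every query on every model and arranging the scores with one row per model and one column per query. The sketch below illustrates that idea with plain dicts; run_queries_sketch and the coverage stand-in metric are hypothetical and are not WEFE functions.

```python
def coverage(query_words, vocabulary):
    """Stand-in 'metric': fraction of query words found in the vocabulary."""
    return sum(word in vocabulary for word in query_words) / len(query_words)


def run_queries_sketch(metric, queries, models):
    """Evaluate every query on every model; rows are models, columns are queries."""
    return {
        model_name: {
            query_name: metric(words, vocabulary)
            for query_name, words in queries.items()
        }
        for model_name, vocabulary in models.items()
    }


# Toy models (vocabularies) and queries, invented for illustration.
toy_models = {"toy model A": {"he", "she", "career"}, "toy model B": {"he"}}
toy_queries = {"query 1": ["he", "she"], "query 2": ["career", "family"]}

table = run_queries_sketch(coverage, toy_queries, toy_models)
for model_name, row in table.items():
    print(model_name, row)
```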

# Run the queries
WEAT_gender_results = run_queries(WEAT,
                                  gender_queries,
                                  models,
                                  queries_set_name='Gender Queries')
WARNING:root:The transformation of 'Science' into glove twitter dim=25 embeddings lost proportionally more words than specified in 'lost_words_threshold': 0.25 lost with respect to 0.2 maximum loss allowed.
ERROR:root:At least one set of 'Male terms and Female terms wrt Science and Arts' query has proportionally fewer embeddings than allowed by the lost_vocabulary_threshold parameter (0.2). This query will return np.nan.
WARNING:root:The transformation of 'Science' into glove twitter dim=50 embeddings lost proportionally more words than specified in 'lost_words_threshold': 0.25 lost with respect to 0.2 maximum loss allowed.
ERROR:root:At least one set of 'Male terms and Female terms wrt Science and Arts' query has proportionally fewer embeddings than allowed by the lost_vocabulary_threshold parameter (0.2). This query will return np.nan.
WARNING:root:The transformation of 'Science' into glove twitter dim=100 embeddings lost proportionally more words than specified in 'lost_words_threshold': 0.25 lost with respect to 0.2 maximum loss allowed.
ERROR:root:At least one set of 'Male terms and Female terms wrt Science and Arts' query has proportionally fewer embeddings than allowed by the lost_vocabulary_threshold parameter (0.2). This query will return np.nan.


Results table: one row per model (glove twitter dim=25, glove twitter dim=50, glove twitter dim=100) and one column per query (Male terms and Female terms wrt Career and Family; Male terms and Female terms wrt Science and Arts; Male terms and Female terms wrt Math and Arts).

If more than 20% (by default) of the words from any of the word sets of the query are not included in the word embedding model, the metric will return np.nan. This behavior can be changed using a float number parameter called lost_vocabulary_threshold.

Setting metric params

As you can see from the results above, there is a whole column that has no results. As the warnings point out, when transforming the words of the sets into embeddings, the proportion of lost words is greater than that allowed by the parameter lost_vocabulary_threshold. Therefore, all those queries return np.nan. In this case, it would be very useful to use the word preprocessors seen above.

When we use run_queries, we can also provide specific parameters for each metric. We can do this by passing a dict with the metric params to the metric_params parameter. In this case, we will use preprocessor_args to lowercase the words.

WEAT_gender_results = run_queries(
    WEAT,
    gender_queries,
    models,
    metric_params={'preprocessor_args': {
        'lowercase': True
    }},
    queries_set_name='Gender Queries')



Results table: one row per model (glove twitter dim=25, glove twitter dim=50, glove twitter dim=100) and one column per query (Male terms and Female terms wrt Career and Family; Male terms and Female terms wrt Science and Arts; Male terms and Female terms wrt Math and Arts).

As you can see from the results table, no query was lost now.

Plot the results in a barplot

The library also provides an easy way to plot the results obtained from a run_queries execution into a plotly barplot.

from wefe.utils import run_queries, plot_queries_results

# Plot the results
plot_queries_results(WEAT_gender_results).show()
WEAT gender results

Aggregating Results

The execution of run_queries in the previous step gave us many results evaluating the gender bias in the tested embeddings. However, these do not tell us much about the overall fairness of the embedding models with respect to the criteria evaluated. Therefore, we would like to have some mechanism that allows us to aggregate the results obtained from run_queries so that we can evaluate the bias as a whole.

A simple way to aggregate the results is to average their absolute values. To do this, set the aggregate_results parameter to True when calling run_queries. By default, this aggregates the results by averaging their absolute values and places the aggregate in the last column.

This aggregation function can be changed through the aggregation_function parameter. It accepts either a string naming one of the aggregation types already implemented or a function that operates on the results dataframe.

The aggregation functions available are:

  • Average: avg.

  • Average of the absolute values: abs_avg.

  • Sum: sum.

  • Sum of the absolute values: abs_sum.


Notice that some functions are more appropriate for certain metrics. For metrics returning only positive numbers, all the previous aggregation functions would be OK. In contrast, for metrics whose values can be negative (e.g., WEAT, RND), aggregation functions such as sum would make outputs of opposite sign cancel each other.
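
The cancellation effect is easy to see on a small example. The sketch below reimplements the four aggregation types over a plain list of toy scores; the AGGREGATIONS dict is a hypothetical stand-in and not how WEFE stores them.

```python
from statistics import mean

# Hypothetical stand-ins for the four aggregation types listed above.
AGGREGATIONS = {
    "avg": lambda values: mean(values),
    "abs_avg": lambda values: mean(abs(v) for v in values),
    "sum": lambda values: sum(values),
    "abs_sum": lambda values: sum(abs(v) for v in values),
}

# Toy scores with mixed signs, invented for illustration.
scores = [0.5, -0.5, 0.25]

aggregated = {name: fn(scores) for name, fn in AGGREGATIONS.items()}
for name, value in aggregated.items():
    print(name, round(value, 4))
```

Here the two scores of opposite sign cancel out in avg and sum, while abs_avg and abs_sum still reflect their magnitude.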

Let’s aggregate the results from previous example by the average of the absolute values:

WEAT_gender_results_agg = run_queries(
    WEAT,
    gender_queries,
    models,
    metric_params={'preprocessor_args': {'lowercase': True}},
    aggregate_results=True,
    aggregation_function='abs_avg',
    queries_set_name='Gender Queries')


Results table: one row per model (glove twitter dim=25, glove twitter dim=50, glove twitter dim=100), one column per query (Male terms and Female terms wrt Career and Family; Male terms and Female terms wrt Science and Arts; Male terms and Female terms wrt Math and Arts) and a final aggregated column (WEAT: Gender Queries average of abs values score).

WEAT aggregated gender results

Finally, we can ask the function to return only the aggregated values (through return_only_aggregation parameter) and then plot them.

WEAT_gender_results_only_agg = run_queries(
    WEAT,
    gender_queries,
    models,
    metric_params={'preprocessor_args': {'lowercase': True}},
    aggregate_results=True,
    return_only_aggregation=True,
    queries_set_name='Gender Queries')


Results table: one row per model (glove twitter dim=25, glove twitter dim=50, glove twitter dim=100) and a single column (WEAT: Gender Queries average of abs values score).

WEAT only aggregated gender results

Calculate Rankings

When we want to measure various criteria of bias in different embedding models, two major problems arise:

  1. One type of bias can dominate the other because of significant differences in magnitude.

  2. Different metrics can operate on different scales, which makes them difficult to compare.

To show that, suppose we have two sets of queries: one that explores gender biases and another that explores ethnicity biases, and we want to test these sets of queries on 3 Twitter Glove models of 25, 50 and 100 dimensions each, using both WEAT and Relative Negative Sentiment Bias (RNSB) as bias metrics.

  1. Let’s show the first problem: the bias scores obtained from one set of queries are much higher than those from the other set, even when the same metric is used.

We executed the gender and ethnicity queries using WEAT and the 3 models mentioned above. The results obtained are:


Results table: one row per model (glove twitter dim=25, glove twitter dim=50, glove twitter dim=100) and two columns (WEAT: Gender Queries average of abs values score; WEAT: Ethnicity Queries average of abs values score).

As can be seen, the results of ethnicity bias are much greater than those of gender.

  2. For the second problem: metrics deliver their results on different scales.

We executed the gender queries using WEAT and RNSB metrics and the 3 models mentioned above. The results obtained are:


Results table: one row per model (glove twitter dim=25, glove twitter dim=50, glove twitter dim=100) and two columns (WEAT: Gender Queries average of abs values score; RNSB: Gender Queries average of abs values score).

We can see that the results of the two metrics differ by an order of magnitude.

To address these two problems, we propose to create rankings. Rankings allow us to focus on the relative differences reported by the metrics (for different models) instead of focusing on the absolute values.
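
The idea behind rankings can be sketched with plain dicts. In the example below, rank_scores is a hypothetical helper, the scores are toy values, and we assume that the lowest score receives rank 1 and that there are no ties; the library's own ranking function may treat ordering and ties differently.

```python
def rank_scores(scores):
    """Map {model: score} to {model: rank}; rank 1 goes to the lowest score."""
    ordered = sorted(scores, key=scores.get)
    return {model: position for position, model in enumerate(ordered, start=1)}


# Toy aggregated scores on two very different scales (invented values).
metric_a = {"model 1": 0.9, "model 2": 0.4, "model 3": 0.6}
metric_b = {"model 1": 0.09, "model 2": 0.02, "model 3": 0.05}

ranking_a = rank_scores(metric_a)
ranking_b = rank_scores(metric_b)
print(ranking_a)
print(ranking_b)
```

Although the two toy metrics differ by an order of magnitude, they produce the same ranking, which is what makes rankings comparable across metrics and criteria.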

Now, let’s create rankings using the data used above. The following code will load the models and create the queries:

from wefe.query import Query
from wefe.datasets.datasets import load_weat
from wefe.word_embedding_model import WordEmbeddingModel
from wefe.metrics import WEAT, RNSB
from wefe.utils import run_queries, create_ranking, plot_ranking, plot_ranking_correlations

import gensim.downloader as api

# Load the models
model_1 = WordEmbeddingModel(api.load('glove-twitter-25'),
                             'glove twitter dim=25')
model_2 = WordEmbeddingModel(api.load('glove-twitter-50'),
                             'glove twitter dim=50')
model_3 = WordEmbeddingModel(api.load('glove-twitter-100'),
                             'glove twitter dim=100')

models = [model_1, model_2, model_3]

# Load the WEAT word sets
word_sets = load_weat()

# Create gender queries
gender_query_1 = Query([word_sets['male_terms'], word_sets['female_terms']],
                       [word_sets['career'], word_sets['family']],
                       ['Male terms', 'Female terms'], ['Career', 'Family'])
gender_query_2 = Query([word_sets['male_terms'], word_sets['female_terms']],
                       [word_sets['science'], word_sets['arts']],
                       ['Male terms', 'Female terms'], ['Science', 'Arts'])
gender_query_3 = Query([word_sets['male_terms'], word_sets['female_terms']],
                       [word_sets['math'], word_sets['arts_2']],
                       ['Male terms', 'Female terms'], ['Math', 'Arts'])

# Create ethnicity queries
ethnicity_query_1 = Query([word_sets['european_american_names_5'],
                           word_sets['african_american_names_5']],
                          [word_sets['pleasant_5'], word_sets['unpleasant_5']],
                          ['European Names', 'African Names'],
                          ['Pleasant', 'Unpleasant'])

ethnicity_query_2 = Query([word_sets['european_american_names_7'],
                           word_sets['african_american_names_7']],
                          [word_sets['pleasant_9'], word_sets['unpleasant_9']],
                          ['European Names', 'African Names'],
                          ['Pleasant 2', 'Unpleasant 2'])

gender_queries = [gender_query_1, gender_query_2, gender_query_3]
ethnicity_queries = [ethnicity_query_1, ethnicity_query_2]

Now, we will run the queries with WEAT, WEAT Effect Size and RNSB:

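# Run the queries with WEAT
WEAT_gender_results = run_queries(
    WEAT, gender_queries, models,
    metric_params={'preprocessor_args': {'lowercase': True}},
    aggregate_results=True,
    return_only_aggregation=True,
    queries_set_name='Gender Queries')

WEAT_ethnicity_results = run_queries(
    WEAT, ethnicity_queries, models,
    metric_params={'preprocessor_args': {'lowercase': True}},
    aggregate_results=True,
    return_only_aggregation=True,
    queries_set_name='Ethnicity Queries')

# Run the queries with WEAT Effect Size
WEAT_EZ_gender_results = run_queries(
    WEAT, gender_queries, models,
    metric_params={'preprocessor_args': {'lowercase': True},
                   'return_effect_size': True},
    aggregate_results=True,
    return_only_aggregation=True,
    queries_set_name='Gender Queries')

WEAT_EZ_ethnicity_results = run_queries(
    WEAT, ethnicity_queries, models,
    metric_params={'preprocessor_args': {'lowercase': True},
                   'return_effect_size': True},
    aggregate_results=True,
    return_only_aggregation=True,
    queries_set_name='Ethnicity Queries')

# Run the queries using RNSB
RNSB_gender_results = run_queries(
    RNSB, gender_queries, models,
    metric_params={'preprocessor_args': {'lowercase': True}},
    aggregate_results=True,
    return_only_aggregation=True,
    queries_set_name='Gender Queries')

RNSB_ethnicity_results = run_queries(
    RNSB, ethnicity_queries, models,
    metric_params={'preprocessor_args': {'lowercase': True}},
    aggregate_results=True,
    return_only_aggregation=True,
    queries_set_name='Ethnicity Queries')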

To create the ranking we’ll use the create_ranking function. This function takes all the DataFrames containing the calculated scores and uses the last column to create the ranking. It assumes that there is an aggregation in this column.

from wefe.utils import run_queries, create_ranking, plot_ranking, plot_ranking_correlations

gender_ranking = create_ranking([
    WEAT_gender_results, WEAT_EZ_gender_results, RNSB_gender_results
])



Ranking table: one row per model (glove twitter dim=25, glove twitter dim=50, glove twitter dim=100) and three columns (WEAT: Gender Queries average of abs values score (1); WEAT: Gender Queries average of abs values score (2); RNSB: Gender Queries average of abs values score).

ethnicity_ranking = create_ranking([
    WEAT_ethnicity_results, WEAT_EZ_ethnicity_results, RNSB_ethnicity_results
])

Ranking table: one row per model (glove twitter dim=25, glove twitter dim=50, glove twitter dim=100) and one column per ranked result set (WEAT, WEAT effect size and RNSB scores for the ethnicity queries).

Plotting the rankings

Finally, we can plot the rankings in barplots using the plot_ranking function. The function can be used in two ways:

With facet by Metric and Criteria:

This image shows the rankings separated by each bias criterion and metric (i.e., by each column). Each bar represents the position of the embedding in the corresponding criterion-metric ranking.

plot_ranking(gender_ranking, use_metric_as_facet=True)
Gender ranking with facet
plot_ranking(ethnicity_ranking, use_metric_as_facet=True)
Ethnicity ranking with facet

Without facet

This image shows the accumulated rankings for each embedding model. Each bar represents the sum of the rankings obtained by each embedding. Each color within a bar represents a different criterion-metric ranking.

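plot_ranking(gender_ranking, use_metric_as_facet=False)
Gender ranking without facet
plot_ranking(ethnicity_ranking, use_metric_as_facet=False)
Ethnicity ranking without facet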
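
The accumulated ranking shown in the plot without facets is simply a per-model sum of the ranks from each criterion-metric column. A minimal sketch of that sum, using toy rankings and a hypothetical accumulate_rankings helper, could look like this:

```python
def accumulate_rankings(rankings):
    """Sum, for each model, the ranks it obtained in every ranking column."""
    totals = {}
    for column in rankings.values():
        for model, rank in column.items():
            totals[model] = totals.get(model, 0) + rank
    return totals


# Toy rankings (one dict per criterion-metric column), invented values.
toy_rankings = {
    "metric A": {"model 1": 1, "model 2": 2, "model 3": 3},
    "metric B": {"model 1": 2, "model 2": 1, "model 3": 3},
    "metric C": {"model 1": 1, "model 2": 3, "model 3": 2},
}

totals = accumulate_rankings(toy_rankings)
print(totals)  # {'model 1': 4, 'model 2': 6, 'model 3': 8}
```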

Ranking Correlations

We can see how the rankings obtained in the previous section relate to each other by using a correlation matrix. To do this we provide a function called calculate_ranking_correlations. This function takes the rankings as input and calculates the Spearman correlation between them.

from wefe.utils import calculate_ranking_correlations, plot_ranking_correlations
correlations = calculate_ranking_correlations(gender_ranking)


Correlation matrix: rows and columns are the three rankings (WEAT: Gender Queries average of abs values score (1); WEAT: Gender Queries average of abs values score (2); RNSB: Gender Queries average of abs values score).

This function uses the corr() method of the ranking dataframe. This allows you to change the correlation calculation method to: ‘pearson’, ‘spearman’, ‘kendall’.
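
For intuition about what two of these methods measure, the sketch below computes the Spearman and Kendall correlations of two small rankings by hand. It assumes rankings without ties; the function names are hypothetical, and in practice the dataframe's corr() method does this work.

```python
from itertools import combinations


def spearman_no_ties(rank_a, rank_b):
    """Spearman correlation for two rankings without tied positions."""
    n = len(rank_a)
    squared_diffs = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * squared_diffs / (n * (n * n - 1))


def kendall_no_ties(rank_a, rank_b):
    """Kendall tau: (concordant - discordant) pairs over all pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        agreement = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Toy rankings of three models under two metrics (invented values).
ranking_a = [1, 2, 3]
ranking_b = [1, 3, 2]

print(spearman_no_ties(ranking_a, ranking_b))  # 0.5
print(round(kendall_no_ties(ranking_a, ranking_b), 4))  # 0.3333
```

Both coefficients equal 1 for identical rankings and -1 for fully reversed ones; they differ in how strongly they penalize partial disagreements such as the swap above.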

In the following example we use the Kendall correlation.

calculate_ranking_correlations(gender_ranking, method='kendall')


Correlation matrix: rows and columns are the three rankings (WEAT: Gender Queries average of abs values score (1); WEAT: Gender Queries average of abs values score (2); RNSB: Gender Queries average of abs values score).

Finally, we also provide a function to graph the correlations. This function enables us to visually analyze how the rankings relate to each other.

correlation_fig = plot_ranking_correlations(correlations)
Ranking correlations