Specific Language in Tweets Predicts HIV Prevalence

Future-oriented behaviors, including planning for the future and seeking long-term rewards, are associated with higher levels of health and well-being. Such behaviors are also associated with less impulsivity, addictive behavior, and risky sexual behavior, as well as more proactive safe sex practices.

One measure of future-oriented behavior is use of future-oriented language. Researchers are now harnessing the combined power of big data and social networking to examine language on a large scale. Specifically, researchers are analyzing the language of tweets and determining whether specific language in tweets predicts HIV prevalence.

Previous research has suggested that tweeted references to sex and drug use positively correlate with counties’ HIV rates. Building on these findings and the knowledge that future-oriented behavior is correlated with better health, another team of researchers hypothesized that HIV risk may be attenuated in future-focused communities, even where risky behavior is common. This team recently published its results in the journal Health Psychology in the paper “Future-Oriented Tweets Predict Lower County-Level HIV Prevalence in the United States.”

To test their hypothesis, the researchers analyzed tweets, as the natural language in tweets provides a simple means of assessing the degree to which individuals think about and engage in risky behaviors. Furthermore, tweets provide a large pool of data from a wide variety of people.

The team’s predictions were that:

  • Future orientation would be associated with fewer HIV cases, and
  • Future orientation would buffer HIV risk in more vulnerable counties, as indicated by higher rates of sexually transmitted infections (STIs) and more frequent Twitter references to risky behavior.

Gathering the Data and Setting Up the Analysis of Language in Tweets

The researchers started with a pool of 706 million tweets sent between June 2009 and March 2010. Given that the researchers needed to know the geographic locations of the tweets to appropriately analyze the data, they used geolocation coordinates, the free-response location field accompanying a tweet, and an established set of rules to map tweets to locations and locations to counties. Data were aggregated at the county level. The final set of data included 151 million messages.

The team used two sets of analyses, one theory-driven and one data-driven, to examine words used in tweets. In the analyses, the researchers controlled for three strong structural correlates of county-level HIV prevalence: percentage of Black residents, percentage of foreign-born residents, and population density.

The first analysis used a modified version of the Linguistic Inquiry and Word Count (LIWC), which compares all words in a given text against defined word lists. The researchers used three word lists: future tense, which included words like “could” and “gonna”; risky leisure activities, such as “bong” and “stoned”; and safe leisure activities, including “scrapbook” and “beach.” The analysis ultimately compared the total number of words in a tweet with the number of those words that appeared on each of the three defined lists.
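As a rough illustration of this kind of word-list scoring, the Python sketch below computes, for each list, the share of words in a text that appear on that list. It is not the researchers' modified LIWC. The tokenizer, the function names, and the two-word lists (which contain only the example words quoted above) are assumptions made for illustration.

```python
import re

# Hypothetical, heavily abbreviated word lists: only the example words
# quoted in the article are included, not the study's full lists.
WORD_LISTS = {
    "future": {"could", "gonna"},
    "risky_leisure": {"bong", "stoned"},
    "safe_leisure": {"scrapbook", "beach"},
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def category_rates(text, word_lists=WORD_LISTS):
    """Return, for each list, the fraction of tokens in `text` found on that list."""
    tokens = tokenize(text)
    total = len(tokens)
    if total == 0:
        # No words at all: every category rate is zero.
        return {name: 0.0 for name in word_lists}
    return {
        name: sum(1 for token in tokens if token in words) / total
        for name, words in word_lists.items()
    }


# Made-up example text, not a tweet from the study.
rates = category_rates("Gonna hit the beach tomorrow")
print(rates)
```

In a county-level analysis, rates like these would presumably be computed over all of a county's tweets rather than a single message.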

The second analysis was a data-driven approach, termed differential language analysis (DLA), which allowed for the possibility of finding unpredicted patterns in the data. Words and phrases were extracted from tweets, and from these, topics were identified. Words and phrases that correlated, either strongly positively or strongly negatively, with HIV prevalence were represented as word clouds.
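The core idea of correlating language features with an outcome can be sketched in a few lines of Python. This is a simplified illustration, not the DLA procedure itself: it uses a plain Pearson correlation with no control variables or topic modeling, and the words, frequencies, and prevalence values are invented toy numbers rather than data from the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_words(word_freqs, outcome):
    """Sort words from most negatively to most positively correlated with the outcome.

    word_freqs maps each word to its relative frequency in each county;
    outcome lists the same counties' prevalence values, in the same order.
    """
    scored = [(word, pearson(freqs, outcome)) for word, freqs in word_freqs.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Invented toy values for four hypothetical counties (not study data).
toy_prevalence = [1.0, 2.0, 3.0, 4.0]
toy_word_freqs = {
    "tomorrow": [0.4, 0.3, 0.2, 0.1],  # hypothetical future-oriented word
    "club": [0.1, 0.2, 0.3, 0.4],      # hypothetical nightlife word
}

ranked = rank_words(toy_word_freqs, toy_prevalence)
print(ranked)
```

The two ends of such a ranking correspond to the negatively and positively correlated word clouds described above.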

Future-Oriented Language in Tweets Predicts HIV Prevalence

Overall, the analysis showed that future-oriented language correlated with lower HIV prevalence.

The LIWC analysis showed that HIV prevalence is lower in counties with high rates of future tense use than in counties with medium or low future tense use. Furthermore, HIV rates were generally higher both in counties with more frequent references to risky activities and in counties with higher STI prevalence. However, future-oriented language also appeared to act as a buffer against these risk factors: in counties with high rates of future tense use in tweets, the researchers found no correlation between HIV rates and either risky language or STI rates.
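One simple way to picture a buffering effect is to estimate the relationship between a risk factor and HIV rates separately within each level of future tense use. The Python sketch below does this with a least-squares slope per group. It is an assumed, simplified stand-in for the paper's statistical models, and the group labels, field names, and numbers are invented toy values, not study results.

```python
def slope(xs, ys):
    """Least-squares slope of ys regressed on xs (0.0 if xs has no variance)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        return 0.0
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cov / var_x


def slopes_by_group(counties):
    """Return the risk-to-HIV slope within each future-tense group.

    Each county is a dict with the hypothetical keys
    'future_group', 'risk_rate', and 'hiv_rate'.
    """
    groups = {}
    for county in counties:
        groups.setdefault(county["future_group"], []).append(county)
    return {
        name: slope(
            [c["risk_rate"] for c in rows],
            [c["hiv_rate"] for c in rows],
        )
        for name, rows in groups.items()
    }


# Invented toy counties: risk language tracks HIV rates in the "low" group
# but is unrelated to them in the "high" group.
toy_counties = [
    {"future_group": "low", "risk_rate": 1.0, "hiv_rate": 10.0},
    {"future_group": "low", "risk_rate": 2.0, "hiv_rate": 20.0},
    {"future_group": "low", "risk_rate": 3.0, "hiv_rate": 30.0},
    {"future_group": "high", "risk_rate": 1.0, "hiv_rate": 12.0},
    {"future_group": "high", "risk_rate": 2.0, "hiv_rate": 12.0},
    {"future_group": "high", "risk_rate": 3.0, "hiv_rate": 12.0},
]

group_slopes = slopes_by_group(toy_counties)
print(group_slopes)
```

A slope near zero in the high group alongside a clearly positive slope in the low group is the pattern a buffering effect would produce.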

Meanwhile, the DLA produced word clouds depicting the words that most strongly correlated with HIV prevalence. Interestingly, the words that negatively correlated with HIV prevalence were ones that refer to the future tense, preparation for the future, and thinking about alternate possible futures. Words that positively correlated with HIV prevalence were associated with urban nightlife, consumerism, and slang.

The authors believe that their work can be used to develop future interventions that leverage the power of social contagion. Because behaviors can spread through social networks relatively easily, it is possible that at-risk individuals will pick up future-oriented behavior patterns. However, it is not known whether future orientation itself has a causal influence on HIV risk, so further work is needed to elucidate this relationship. The authors also hope their model can be used to predict and more effectively combat future outbreaks of HIV.

Read the Paper

Future-Oriented Tweets Predict Lower County-Level HIV Prevalence in the United States

Related Reads

Twitter as a Remedy for Public Health Woes? Identifying Natural Helping Networks in Social Media

