
1/ I am trying to extract a part of the script using Beautiful Soup, but it prints nothing. What's wrong?

import urllib2
from bs4 import BeautifulSoup

URL = "http://www.reuters.com/video/2014/08/30/woman-who-drank-restaurants-tainted-tea?videoId=341712453"
oururl = urllib2.urlopen(URL).read()
soup = BeautifulSoup(oururl)
for script in soup("script"):
    script.extract()
list_of_scripts = soup.findAll("script")
print list_of_scripts

2/ The goal is to extract the value of the "transcript" key:

<script type="application/ld+json">
{
    "@context": "http://schema.org",
    "@type": "VideoObject",
    "video": {
        "@type": "VideoObject",
        "headline": "Woman who drank restaurant&#039;s tainted tea hopes for industry...",
        "caption": "Woman who drank restaurant&#039;s tainted tea hopes for industry...",  
        "transcript": "Jan Harding is speaking out for the first time about the ordeal that changed her life.               SOUNDBITE: JAN HARDING, DRANK TAINTED TEA, SAYING:               \"Immediately my whole mouth was on fire.\"               The Utah woman was critically burned in her mouth and esophagus after taking a sip of sweet tea laced with a toxic cleaning solution at Dickey's BBQ.               SOUNDBITE: JAN HARDING, DRANK TAINTED TEA, SAYING:               \"It was like a fire beyond anything you can imagine. I mean, it was not like drinking hot coffee.\"               Authorities say an employee mistakenly mixed the industrial cleaning solution containing lye into the tea thinking it was sugar.               The Hardings hope the incident will bring changes in the restaurant industry to avoid such dangerous mixups.               SOUNDBITE: JIM HARDING, HUSBAND, SAYING:               \"Bottom line, so no one ever has to go through this again.\"               The district attorney's office is expected to decide in the coming week whether criminal charges will be filed.",

From the documentation:

As of Beautiful Soup version 4.9.0, when lxml or html.parser are in use, the contents of <script>, <style>, and <template> tags are not considered to be ‘text’, since those tags are not part of the human-visible content of the page.

So basically the accepted answer from falsetru above is all good, but use .string instead of .text with newer versions of Beautiful Soup, or you'll be puzzled, as I was, by .text always coming back empty for <script> tags.
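To see the difference in isolation, here is a minimal sketch (assuming bs4 >= 4.9.0 is installed; the inline HTML string is made up for illustration, no network access needed):

```python
# Demonstrates the documented behaviour: with Beautiful Soup >= 4.9.0 and
# html.parser, <script> contents are no longer considered 'text', so .text
# comes back empty while .string still returns the raw script contents.
from bs4 import BeautifulSoup

html = '<body><script type="application/ld+json">{"a": 1}</script></body>'
soup = BeautifulSoup(html, 'html.parser')
tag = soup.find('script')

print(repr(tag.string))  # the raw script contents: '{"a": 1}'
print(repr(tag.text))    # empty with Beautiful Soup >= 4.9.0
```

With older Beautiful Soup versions the second line prints the contents too, which is exactly why code written against the old behaviour silently breaks after an upgrade.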

You commented, not answered. Btw, urllib2 is also deprecated. Edit the answer above or give a new answer. – flywire Apr 1, 2021 at 22:27

Click edit on the original answer and just change .text to .string. Add your comments to the question. – flywire Apr 2, 2021 at 22:20

@flywire: Tried to update the answer from @falsetru, but the update got rejected (there was more to it than that, since that answer is in Python 2, so other tweaks were necessary, including re. urllib2 as you noted). I think having my answer separately as it is works fine, though; I don't plan on retrying that update. Key takeaway: you need .string for <script> and co. – Andrew Richards May 11, 2021 at 16:32

extract removes the tag from the DOM. That's why you get an empty list.

Find the script with the type="application/ld+json" attribute and decode its contents using json.loads. Then you can access the data like a Python data structure (a dict for the given data).

for Python 2.x:

import json
import urllib2
from bs4 import BeautifulSoup

URL = ("http://www.reuters.com/video/2014/08/30/"
       "woman-who-drank-restaurants-tainted-tea?videoId=341712453")
oururl = urllib2.urlopen(URL).read()
soup = BeautifulSoup(oururl)
data = json.loads(soup.find('script', type='application/ld+json').text)
print data['video']['transcript']

UPDATE: for Python 3.x:

import json
import urllib.request
from bs4 import BeautifulSoup

URL = ("http://www.reuters.com/video/2014/08/30/"
       "woman-who-drank-restaurants-tainted-tea?videoId=341712453")
oururl = urllib.request.urlopen(URL).read()
soup = BeautifulSoup(oururl, 'html.parser')
# With Beautiful Soup >= 4.9.0, use .string rather than .text for <script> tags
data = json.loads(soup.find('script', type='application/ld+json').string)
print(data['video']['transcript'])
When I try with this link: link, using this code: data = soup.findAll('span', id='articleText') I get empty content again, even if I don't use extract: <span id="articleText"> <span id="midArticle_start"></span> <span class="focusParagraph"></span></span> – sqya Oct 4, 2014 at 16:35

@laihob, that's a different question, isn't it? Anyway, try: print ''.join(soup.find('span', id='articleText').strings) – falsetru Oct 4, 2014 at 16:47

Yeah, the previous question worked well. This time I want to extract the article in that link, which is located inside the <span id="articleText">. I will try what you said now, thx. – sqya Oct 4, 2014 at 16:51

Just tried it. Well, actually, soup.find('span', id='articleText').strings gives None as a result. – sqya Oct 4, 2014 at 16:55

Had to do some modifications for my specific case, but this answer did help me get there most of the way. – Thom Ives Apr 1, 2020 at 21:52

Thanks for the inspiration. I'd been trying for hours to do this. But let me tell you that since Python 3 doesn't ship urllib2 anymore, we can use the requests library instead of urllib2. I'll just drop the updated version here. Enjoy ;)

import json
import requests
from bs4 import BeautifulSoup

url = input('Enter url:')
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
# Use .string (not .text) for <script> tags with Beautiful Soup >= 4.9.0
data = json.loads(soup.find('script', type='application/ld+json').string)
print(data['articleBody'])
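One caveat worth noting: real pages often carry several <script type="application/ld+json"> blocks, and json.loads raises ValueError on a malformed one. A slightly more defensive loop can help; this is a sketch using an inline HTML string (made up for illustration) rather than a live URL:

```python
# Collect every JSON-LD block on a page, skipping any that fail to decode.
# .string can be None for an empty tag (TypeError from json.loads), and
# broken JSON raises ValueError, so both are caught.
import json
from bs4 import BeautifulSoup

html = '''
<script type="application/ld+json">{"@type": "VideoObject", "transcript": "..."}</script>
<script type="application/ld+json">not valid json</script>
'''
soup = BeautifulSoup(html, 'html.parser')

blocks = []
for tag in soup.find_all('script', type='application/ld+json'):
    try:
        blocks.append(json.loads(tag.string))
    except (TypeError, ValueError):
        continue

print(len(blocks))  # only the one valid block is decoded
```

From there you can pick out the block whose "@type" matches what you need instead of assuming the first script tag is the right one.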
        
