The app crashes with the exception below.
okhttp version: 3.12.13
The crash happens only on Android 13 devices, while making a network call.

Any pointers on why this is happening only on specific devices?
Please let me know if any additional details are required to debug this further.

pid: 0, tid: 3268 >>> com.example.app <<<

backtrace:
#00 pc 0x0000000000038600 /apex/com.android.conscrypt/lib64/libssl.so (bssl::ssl_cert_dup(bssl::CERT*)+68)
#1 pc 0x000000000003f984 /apex/com.android.conscrypt/lib64/libssl.so (SSL_new+484)
#2 pc 0x000000000002212c /apex/com.android.conscrypt/lib64/libjavacrypto.so (NativeCrypto_SSL_new(_JNIEnv*, _jclass*, long, _jobject*)+24)
#3 pc 0x0000000000461554 /apex/com.android.art/lib64/libart.so (art_quick_generic_jni_trampoline+148)
#4 pc 0x0000000000209a9c /apex/com.android.art/lib64/libart.so (nterp_helper+1948)
#5 pc 0x0000000000024644 /apex/com.android.conscrypt/javalib/conscrypt.jar (com.android.org.conscrypt.NativeSsl.newInstance+12)
#6 pc 0x0000000000209334 /apex/com.android.art/lib64/libart.so (nterp_helper+52)
#7 pc 0x000000000001983c /apex/com.android.conscrypt/javalib/conscrypt.jar (com.android.org.conscrypt.ConscryptEngine.newSsl)
#8 pc 0x0000000000209334 /apex/com.android.art/lib64/libart.so (nterp_helper+52)
#9 pc 0x000000000001b0e6 /apex/com.android.conscrypt/javalib/conscrypt.jar (com.android.org.conscrypt.ConscryptEngine.<init>+94)
#10 pc 0x000000000020a254 /apex/com.android.art/lib64/libart.so (nterp_helper+3924)
#11 pc 0x0000000000018822 /apex/com.android.conscrypt/javalib/conscrypt.jar (com.android.org.conscrypt.ConscryptEngineSocket.newEngine+54)
#12 pc 0x0000000000209334 /apex/com.android.art/lib64/libart.so (nterp_helper+52)
#13 pc 0x0000000000018d68 /apex/com.android.conscrypt/javalib/conscrypt.jar (com.android.org.conscrypt.ConscryptEngineSocket.<init>+52)
#14 pc 0x000000000020a958 /apex/com.android.art/lib64/libart.so (nterp_helper+5720)
#15 pc 0x0000000000021814 /apex/com.android.conscrypt/javalib/conscrypt.jar (com.android.org.conscrypt.Java8EngineSocket.<init>)
#16 pc 0x000000000020a958 /apex/com.android.art/lib64/libart.so (nterp_helper+5720)
#17 pc 0x00000000000360ec /apex/com.android.conscrypt/javalib/conscrypt.jar (com.android.org.conscrypt.Platform.createEngineSocket+16)
#18 pc 0x0000000000209334 /apex/com.android.art/lib64/libart.so (nterp_helper+52)
#19 pc 0x0000000000031c8c /apex/com.android.conscrypt/javalib/conscrypt.jar (com.android.org.conscrypt.OpenSSLSocketFactoryImpl.createSocket+84)
#20 pc 0x00000000026cc1c4 /data/app/~~x5omOPFUGuwkf_7eG32MIw==/com.example.app-_KOMTFChZ1oamat1_QhM_g==/oat/arm64/base.odex (okhttp3.internal.connection.RealConnection.connectTls+164)
#21 pc 0x00000000026cd3f8 /data/app/~~x5omOPFUGuwkf_7eG32MIw==/com.example.app-_KOMTFChZ1oamat1_QhM_g==/oat/arm64/base.odex (okhttp3.internal.connection.RealConnection.establishProtocol+440)
#22 pc 0x00000000026cdedc /data/app/~~x5omOPFUGuwkf_7eG32MIw==/com.example.app-_KOMTFChZ1oamat1_QhM_g==/oat/arm64/base.odex (okhttp3.internal.connection.RealConnection.connect+1884)
#23 pc 0x00000000025bae44 /data/app/~~x5omOPFUGuwkf_7eG32MIw==/com.example.app-_KOMTFChZ1oamat1_QhM_g==/oat/arm64/base.odex (okhttp3.internal.connection.StreamAllocation.findConnection+1812)
#24 pc 0x00000000025bb44c /data/app/~~x5omOPFUGuwkf_7eG32MIw==/com.example.app-_KOMTFChZ1oamat1_QhM_g==/oat/arm64/base.odex (okhttp3.internal.connection.StreamAllocation.findHealthyConnection+92)
#25 pc 0x00000000025bbbf8 /data/app/~~x5omOPFUGuwkf_7eG32MIw==/com.example.app-_KOMTFChZ1oamat1_QhM_g==/oat/arm64/base.odex (okhttp3.internal.connection.StreamAllocation.newStream+280)
#26 pc 0x00000000026cb940 /data/app/~~x5omOPFUGuwkf_7eG32MIw==/com.example.app-_KOMTFChZ1oamat1_QhM_g==/oat/arm64/base.odex (okhttp3.internal.connection.ConnectInterceptor.intercept+224)
#27 pc 0x00000000026d2c48 /data/app/~~x5omOPFUGuwkf_7eG32MIw==/com.example.app-_KOMTFChZ1oamat1_QhM_g==/oat/arm64/base.odex (okhttp3.internal.http.RealInterceptorChain.proceed+1544)
#28 pc 0x00000000026d2618 /data/app/~~x5omOPFUGuwkf_7eG32MIw==/com.example.app-_KOMTFChZ1oamat1_QhM_g==/oat/arm64/base.odex (okhttp3.internal.http.RealInterceptorChain.proceed+104)
#29 pc 0x00000000026cb03c /data/app/~~x5omOPFUGuwkf_7eG32MIw==/com.example.app-_KOMTFChZ1oamat1_QhM_g==/oat/arm64/base.odex (okhttp3.internal.cache.CacheInterceptor.intercept+1468)
#30 pc 0x00000000026d2c48 /data/app/~~x5omOPFUGuwkf_7eG32MIw==/com.example.app-_KOMTFChZ1oamat1_QhM_g==/oat/arm64/base.odex (okhttp3.internal.http.RealInterceptorChain.proceed+1544)
#31 pc 0x00000000026d2618 /data/app/~~x5omOPFUGuwkf_7eG32MIw==/com.example.app-_KOMTFChZ1oamat1_QhM_g==/oat/arm64/base.odex (okhttp3.internal.http.RealInterceptorChain.proceed+104)
#32 pc 0x00000000026d0820 /data/app/~~x5omOPFUGuwkf_7eG32MIw==/com.example.app-_KOMTFChZ1oamat1_QhM_g==/oat/arm64/base.odex (okhttp3.internal.http.BridgeInterceptor.intercept+4288)
#33 pc 0x00000000026d2c48 /data/app/~~x5omOPFUGuwkf_7eG32MIw==/com.example.app-_KOMTFChZ1oamat1_QhM_g==/oat/arm64/base.odex (okhttp3.internal.http.RealInterceptorChain.proceed+1544)
#34 pc 0x00000000026d4e5c /data/app/~~x5omOPFUGuwkf_7eG32MIw==/com.example.app-_KOMTFChZ1oamat1_QhM_g==/oat/arm64/base.odex (okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept+700)
#35 pc 0x00000000026d2c48 /data/app/~~x5omOPFUGuwkf_7eG32MIw==/com.example.app-_KOMTFChZ1oamat1_QhM_g==/oat/arm64/base.odex (okhttp3.internal.http.RealInterceptorChain.proceed+1544)
#36 pc 0x00000000026c97b8 /data/app/~~x5omOPFUGuwkf_7eG32MIw==/com.example.app-_KOMTFChZ1oamat1_QhM_g==/oat/arm64/base.odex (okhttp3.RealCall.getResponseWithInterceptorChain+3528)
#37 pc 0x00000000026c6ea0 /data/app/~~x5omOPFUGuwkf_7eG32MIw==/com.example.app-_KOMTFChZ1oamat1_QhM_g==/oat/arm64/base.odex (okhttp3.RealCall$AsyncCall.execute+128)
#38 pc 0x00000000025ad87c /data/app/~~x5omOPFUGuwkf_7eG32MIw==/com.example.app-_KOMTFChZ1oamat1_QhM_g==/oat/arm64/base.odex (okhttp3.internal.NamedRunnable.run+124)
#39 pc 0x0000000000588960 /data/misc/apexdata/com.android.art/dalvik-cache/arm64/boot.oat (java.util.concurrent.ThreadPoolExecutor.runWorker+976)
#40 pc 0x0000000000585b48 /data/misc/apexdata/com.android.art/dalvik-cache/arm64/boot.oat (java.util.concurrent.ThreadPoolExecutor$Worker.run+72)
#41 pc 0x00000000003fe840 /data/misc/apexdata/com.android.art/dalvik-cache/arm64/boot.oat (java.lang.Thread.run+80)
#42 pc 0x0000000000457b6c /apex/com.android.art/lib64/libart.so (art_quick_invoke_stub+556)
#43 pc 0x0000000000484e54 /apex/com.android.art/lib64/libart.so (art::ArtMethod::Invoke(art::Thread*, unsigned int*, unsigned int, art::JValue*, char const*)+156)
#44 pc 0x0000000000484b20 /apex/com.android.art/lib64/libart.so (art::JValue art::InvokeVirtualOrInterfaceWithJValues<art::ArtMethod*>(art::ScopedObjectAccessAlreadyRunnable const&, _jobject*, art::ArtMethod*, jvalue const*)+400)
#45 pc 0x00000000005ce334 /apex/com.android.art/lib64/libart.so (art::Thread::CreateCallback(void*)+1684)
#46 pc 0x00000000000b6668 /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+208)
#47 pc 0x00000000000532cc /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64)

The most likely cause, based on where and how it's crashing, is some kind of native heap corruption, either in the app, in okhttp, or in the version of Conscrypt shipping as a Mainline module.

That module is identical across Android 11 through 14, so if you're only seeing crashes on 13 then again it points to heap corruption as the native allocator changed between 12 and 13.

If you can consistently reproduce this, and are willing to share the code, then the best way forward is to open an Android bug at https://issuetracker.google.com/issues/new?component=190923&template=841312 and then we can try and debug it.

This issue is happening on both Android 12 and Android 13 devices, so it does not seem to be related to changes in the native heap allocator.
Since this is a segmentation fault, there are three possibilities. The crash could happen while

  • accessing an out-of-bounds memory location
  • accessing invalid memory
  • writing to read-only memory.

    Do you have any idea if the library is getting into some bad state during a network call?

    Where can we see the conscrypt releases? This issue started happening from March 2023. Can we debug this better?

    I am not able to get a repro locally though.

    @prbprbprb

    Since this is a segmentation fault, there can be one of the 3 possibilities : The crash could happen while
    accessing an out-of-bound memory location
    accessing invalid memory

    Right, but the root cause of that could be any kind of heap corruption. bssl::ssl_cert_dup() is following the native pointers in various objects, so if they get corrupted then it can easily be trying to access invalid memory locations.

    Typically that happens if there is a concurrency bug between threads, or a native pointer gets re-used after its memory has been freed.

    The same platform version of Conscrypt runs on Android 11 through 14, so the fact you're only seeing crashes on 12 and 13 is unexpected.

    The fact that the issue started in March makes me think some other component is corrupting the heap, as we didn't ship any Conscrypt changes in February or March.

    However, the fact that it's consistently crashing in SSL_new() makes me think it may be a Conscrypt bug after all.

    Can we debug this better?

    Without any kind of repro steps it's going to be very difficult.

    Thanks for responding.
    Is there any way to change the conscrypt version or upgrade to the latest and ship it?

    Is there any other way we could debug this? Any thoughts?

    There is one change that is specific to Android 13. There was a change in the garbage collection algorithm. Could this be causing it?

    But we also observed huge volumes of this crash in Android 12 as well.

    @prbprbprb
    One new observation is that the total garbage collection time (art.gc.gc-time, which we read via Debug.getRuntimeStat()) is very high just before the crash happens.

    Could this crash be a manifestation of OOMs in the native heap?

    Thanks for the update! I believe the userfaultfd GC is active on (at least some) Android 12 devices now, so what you're seeing does suggest it's related to low memory conditions and aggressive GCing.

    What's interesting is that ssl_cert_dup() is consistently failing when copying a certificate from an SSLSessionContext object which might be shared across many TLS connections. And what I missed until just now is that AbstractSessionContext manages its own native pointer. The pattern we use everywhere else in Conscrypt is to wrap the pointer in a NativeRef subclass, and then when calling JNI code we pass in the NativeRef object as an Object (as well as the pointer) in order to prevent premature finalization while running in native code. In this case, it's actually the AbstractSessionContext that gets passed in, which ought to be sufficient to keep the context object alive and prevent its finalizer from running. Maybe I'm missing something there though... I shall ask some ART folk.
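The pattern described above can be illustrated with a small pure-Java sketch. It is not Conscrypt's code: `FakeNative`, `NativeRefSketch` and `SslFactorySketch` are hypothetical names, the native heap is simulated with a map so the example runs on any JVM, and `Reference.reachabilityFence` (Java 9+) stands in for the effect of passing the holder object through JNI alongside the raw pointer.

```java
import java.lang.ref.Reference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for native code: addresses map to "allocated" structs.
final class FakeNative {
    private static final Map<Long, String> HEAP = new ConcurrentHashMap<>();
    private static final AtomicLong NEXT = new AtomicLong(1);

    static long alloc(String payload) {
        long address = NEXT.getAndIncrement();
        HEAP.put(address, payload);
        return address;
    }

    static void free(long address) {
        HEAP.remove(address);
    }

    // Real native code would dereference the pointer here; a stale address
    // would then read freed memory instead of throwing.
    static String sslNew(long address) {
        String ctx = HEAP.get(address);
        if (ctx == null) {
            throw new IllegalStateException("use after free: " + address);
        }
        return "ssl(" + ctx + ")";
    }
}

// Hypothetical holder in the spirit of NativeRef: owns exactly one address
// and frees it when the holder itself is finalized.
class NativeRefSketch {
    final long address;

    NativeRefSketch(long address) {
        this.address = address;
    }

    @Override
    @SuppressWarnings({"deprecation", "removal"})
    protected void finalize() throws Throwable {
        try {
            FakeNative.free(address);
        } finally {
            super.finalize();
        }
    }
}

final class SslFactorySketch {
    static String newSsl(NativeRefSketch ref) {
        try {
            return FakeNative.sslNew(ref.address);
        } finally {
            // Keeps `ref` strongly reachable until the call has returned, so
            // its finalizer cannot free the address while it is still in use.
            Reference.reachabilityFence(ref);
        }
    }
}
```

The point of the sketch is only the ownership rule: whoever holds the raw address must also keep its owner reachable for as long as the address is in use.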

    Another possibility is a good old-fashioned concurrency bug exacerbated by the device being slowed down by GCs. When a TLS connection is established the certificate is set up in a convoluted way... The native TLS code calls back into Java to select a certificate, and that callback calls further JNI code to set the certificate on the native SSL object with not much (if any) locking. However, as far as I can see this code path never modifies the certificate data on the SSL_CTX object, so I can't really see a scenario where one thread is updating SSL_CTX->cert while another is copying it. @davidben ?

    We don't have a definitive root cause for #1131 but it seems like either use-after-free (e.g. finalizer ordering) or concurrency issue, so:
    1. Make the native pointer private and move all accesses into AbstractSessionContext
    2. Zero it out on finalisation
    3. Add locking. Note we only need a read lock for the sslNew() path as this is thread safe and doesn't modify the native SSL_CTX aside from atomic refcounts.
    The above change is broadly equivalent to turning the native pointer into a NativeRef, which would mean its finalizer shouldn't run until after the AbstractSessionContext object is unreachable, but (currently) NativeRefs don't zero out the native address on finalization.
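The three steps in the commit message can be sketched roughly as follows. This is an illustrative stand-in rather than the actual Conscrypt change: `SessionContextSketch`, `fakeSslNew` and `fakeCtxFree` are hypothetical names, and the native calls are simulated so the example runs on a plain JVM. The exception message is borrowed from the Java stack trace reported later in this thread.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import javax.net.ssl.SSLException;

final class SessionContextSketch {
    // Simulated native allocator (assumption): hands out fake addresses.
    private static final AtomicLong NEXT_ADDRESS = new AtomicLong(0x1000);

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // Step 1: the native pointer is private and never leaves this class.
    private long sslCtxNativePointer = NEXT_ADDRESS.getAndIncrement();

    // Step 3: creating an SSL only reads the pointer, so a read lock is
    // enough and many threads may create connections concurrently.
    long newSsl() throws SSLException {
        lock.readLock().lock();
        try {
            if (sslCtxNativePointer == 0) {
                throw new SSLException("Invalid session context");
            }
            return fakeSslNew(sslCtxNativePointer);
        } finally {
            lock.readLock().unlock();
        }
    }

    // Step 2: freeing takes the write lock and zeroes the pointer, so a late
    // caller sees 0 and gets an exception instead of touching freed memory.
    void free() {
        lock.writeLock().lock();
        try {
            if (sslCtxNativePointer != 0) {
                fakeCtxFree(sslCtxNativePointer);
                sslCtxNativePointer = 0;
            }
        } finally {
            lock.writeLock().unlock();
        }
    }

    @Override
    @SuppressWarnings({"deprecation", "removal"})
    protected void finalize() throws Throwable {
        try {
            free();
        } finally {
            super.finalize();
        }
    }

    // Hypothetical placeholders for the JNI calls.
    private static long fakeSslNew(long ctx) {
        return ctx + 1;
    }

    private static void fakeCtxFree(long ctx) {
        // Nothing to release in the simulation.
    }
}
```

With this shape, a use-after-free turns into a catchable Java exception, which is the trade-off the commit message describes.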

    Thanks for merging a possible fix for the above issue. @prbprbprb
    When can we expect this to be reflected in OTA updates, so that I can track whether the change has fixed the crash?

    Assuming that the crash is caused by the concurrency issue in AbstractSessionContext, I am curious to know why it would be specific to Android 12 and 13 devices. Could it be related to the userfaultfd GC algorithm in any way, or to high memory usage?

    Oh, sorry, I missed the previous comment!

    #1154 (and also #1157 and #1164) are planned to go out in the November Mainline build; that is, they'll start reaching devices at the start of November and should be fully rolled out by the end of that month. Non-Mainline devices (e.g. Android Go) won't get the fix then, only the next time their vendor sends an OTA. The fixes apart from #1164 are already in AOSP for them, though, and #1164 should land in AOSP today.

    For the second part of your question (root cause), I'm frankly not sure because we haven't managed to reproduce the issues. However, the finalizers in question all had latent bugs and so I'm moderately confident that the fixes will help. At the very least they should prevent native crashes, although it's possible that if there are other concurrency issues we missed then you may still see NullPointerExceptions.

    I suspect these bugs have been causing crashes forever, just at a frequency low enough that nobody noticed and then recent ART changes (e.g. GC patterns) meant we started seeing them more often.

    The long-term fix here is to free native resources with Cleaners, which are less error-prone than finalizers, but that isn't simple so long as we still support OpenJDK 8 and Android API levels < 33.
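For readers unfamiliar with the Cleaner approach, here is a minimal sketch using `java.lang.ref.Cleaner` (Java 9+). It is not taken from Conscrypt: `NativeResourceSketch`, `State` and the `FREED` counter are hypothetical, and the native free call is replaced by a counter increment so the example runs anywhere.

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicInteger;

final class NativeResourceSketch implements AutoCloseable {
    private static final Cleaner CLEANER = Cleaner.create();

    // Counts simulated frees; exists only so the behaviour can be observed.
    static final AtomicInteger FREED = new AtomicInteger();

    // The cleanup state is a static nested class so it holds no reference to
    // the owning object. If it did, the owner could never become unreachable
    // and the cleaning action would never run.
    private static final class State implements Runnable {
        private final long address;

        State(long address) {
            this.address = address;
        }

        @Override
        public void run() {
            // A real implementation would pass `address` to the native free
            // function here.
            FREED.incrementAndGet();
        }
    }

    private final Cleaner.Cleanable cleanable;

    NativeResourceSketch(long address) {
        this.cleanable = CLEANER.register(this, new State(address));
    }

    // Explicit release; if the caller forgets, the Cleaner runs the same
    // action after the object becomes unreachable.
    @Override
    public void close() {
        cleanable.clean(); // the action runs at most once, however often this is called
    }
}
```

Unlike a finalizer, the cleaning action cannot see the owning object at all, which rules out resurrection and most ordering problems between an object and its native state.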

    I'm not sure there's a public source of that information but I'll try and find out. Very approximately though, the first few weeks are taken up with "canary" rollouts to detect issues, then there's a progressive rollout to 50%, 99% and the last percent only get updated towards the very end of the month.

    Hi @prbprbprb ,

    Can we assume that the Mainline build with the fix has been released and is available on at least the Android 13 phones? The crash does not seem to be showing a downward trend in the Play Console, at least.

    In the below article, they have mentioned only the update of 2 Mainline components -

    https://source.android.com/docs/security/bulletin/2023-11-01

    Does that mean the Conscrypt update was not included? It would be great if you could share any resource on the Mainline build release timeline or release notes.

    Thanks

    Ah, that note is a bit confusing. You're linking the release notes for the November Security Bulletin, which goes out as an OTA update because it needs to be able to update components anywhere in the Android platform. But what the release notes are saying is that the fixes for those two CVEs are going out as part of a Mainline update, rather than with the security bulletin OTA[1]. There are no security fixes for Conscrypt in the November bulletin, so it isn't mentioned.

    Meanwhile, it appears that the November Mainline train is still in its canary phase due to the US Thanksgiving holidays, which means it is on less than 2% of Mainline devices (maybe even less than that), which I wasn't expecting... It looks to me like it's supposed to ramp up to 99% by the end of this week, so if you don't hear any more from me by Friday then please ping the issue again.

    [1] It's not really feasible for OTAs to update Mainline modules, or for Mainline updates to update non-Mainline components.

    Hi @prbprbprb ,

    Thanks a lot for the detailed explanation to my queries.

    Could you please confirm whether the Mainline build rollout is at 100% now? Is there any way we could check if devices have received this update? (Any document?)

    One update: the native crash is now surfacing as a Java crash (which we could catch with a try-catch). Please let me know if you are aware of a fix for this, or of any possible cause.

    Exception java.lang.RuntimeException: javax.net.ssl.SSLException: Invalid session context
      at com.android.org.conscrypt.ConscryptEngine.newSsl (ConscryptEngine.java:208)
      at com.android.org.conscrypt.ConscryptEngine.<init> (ConscryptEngine.java:199)
      at com.android.org.conscrypt.ConscryptEngineSocket.newEngine (ConscryptEngineSocket.java:117)
      at com.android.org.conscrypt.ConscryptEngineSocket.<init> (ConscryptEngineSocket.java:104)
      at com.android.org.conscrypt.Java8EngineSocket.<init> (Java8EngineSocket.java:62)
      at com.android.org.conscrypt.Platform.createEngineSocket (Platform.java:334)
      at com.android.org.conscrypt.OpenSSLSocketFactoryImpl.createSocket (OpenSSLSocketFactoryImpl.java:163)
      at okhttp3.internal.connection.RealConnection.connectTls (RealConnection.kt)
      at okhttp3.internal.connection.RealConnection.establishProtocol (RealConnection.kt)
      at okhttp3.internal.connection.RealConnection.connect (RealConnection.kt)
      at okhttp3.internal.connection.ExchangeFinder.findConnection (ExchangeFinder.kt)
      at okhttp3.internal.connection.ExchangeFinder.findHealthyConnection (ExchangeFinder.kt)
      at okhttp3.internal.connection.ExchangeFinder.find (ExchangeFinder.kt)
      at okhttp3.internal.connection.RealCall.initExchange$okhttp (RealCall.kt)
      at okhttp3.internal.connection.ConnectInterceptor.intercept (ConnectInterceptor.kt)
      at okhttp3.internal.http.RealInterceptorChain.proceed (RealInterceptorChain.kt)
      at okhttp3.internal.cache.CacheInterceptor.intercept (CacheInterceptor.kt)
      at okhttp3.internal.http.RealInterceptorChain.proceed (RealInterceptorChain.kt)
      at okhttp3.internal.http.BridgeInterceptor.intercept (BridgeInterceptor.kt)
      at okhttp3.internal.http.RealInterceptorChain.proceed (RealInterceptorChain.kt)
      at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept (RetryAndFollowUpInterceptor.kt)
      at okhttp3.internal.http.RealInterceptorChain.proceed (RealInterceptorChain.kt)
      at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp (RealCall.kt)
      at okhttp3.internal.connection.RealCall$AsyncCall.run (RealCall.kt)
      at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1145)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:644)
      at java.lang.Thread.run (Thread.java:1012)
    Caused by javax.net.ssl.SSLException: Invalid session context
      at com.android.org.conscrypt.AbstractSessionContext.newSsl (AbstractSessionContext.java:216)
      at com.android.org.conscrypt.NativeSsl.newInstance (NativeSsl.java:80)
      at com.android.org.conscrypt.ConscryptEngine.newSsl (ConscryptEngine.java:206)
    

    Thanks
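The try-catch mentioned above could look something like the following sketch. It is a stopgap idea rather than a fix, and everything in it is an assumption for illustration: `SslFailureGuard` and `IoCall` are hypothetical names, and the helper simply converts a RuntimeException whose cause is an SSLException into an IOException, which networking code normally already handles.

```java
import java.io.IOException;
import javax.net.ssl.SSLException;

final class SslFailureGuard {
    private SslFailureGuard() {}

    // Hypothetical callback type for "the code that opens the connection".
    interface IoCall<T> {
        T run() throws IOException;
    }

    // Runs `connect`. A RuntimeException caused by an SSLException is
    // rethrown as an IOException; any other exception propagates unchanged.
    static <T> T call(IoCall<T> connect) throws IOException {
        try {
            return connect.run();
        } catch (RuntimeException e) {
            if (e.getCause() instanceof SSLException) {
                throw new IOException("TLS setup failed", e);
            }
            throw e;
        }
    }
}
```

A caller would wrap the connection attempt, for example `SslFailureGuard.call(() -> openConnection())` with a hypothetical `openConnection()`, and treat the resulting IOException like any other failed request.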

    Sorry, got lost in the Christmas backlog!

    On the plus side, we did indeed fix the code path causing the native crashes, and now we have a Java stack trace to work with.

    On the minus side, this situation shouldn't be possible. The root cause exception occurs because a socket factory is trying to create a new SSL session for a new socket, but the native pointer to its SSL session context is 0.

    Every SSLContext has a reference to a ClientSessionContext object which has a pointer to a native SSL_CTX struct. This is created from the SSLContext constructor and the native struct is created from the ClientSessionContext constructor, and there is no code path which allows this to be zero without throwing an exception.

    => There is no way to create an SSLContext with a native pointer of 0

    (if the code creating the SSL_CTX throws then you can have a ClientSessionContext with a 0 pointer, which will eventually get finalised but this will just be a no-op)

    Since #1154 the native pointer is never shared outside the class and all accesses are synchronized.

    => There is no way for it to become 0 due to concurrency bugs

    => The only way for the native pointer to become 0 is through finalisation.

    The ClientSessionContext is widely shared. It is created by the SSLContext which passes it to every SSLSocketFactory it creates and thence it gets passed to every SSLSocket inside the socket's SSLParameters.

    => As the crash happens during socket creation there should be no way the ClientSessionContext can have been finalised because the SSLContext and SSLSocketFactory still exist and have references to it

    There's probably a flaw in my reasoning but I'm failing to see it. :/

    The crash is too consistent for an ART bug, and so far this is the only report of it that I'm aware of... Is it possible your app is doing anything unusual with reflection around SSLContext or SSLParameters? Or catching and ignoring OOM exceptions?